[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-19123:
--
Flags:   (was: Important)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: newbie
>
> I created a Spark job that is intended to read a set of JSON files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset<Row> txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
>

[jira] [Commented] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808663#comment-15808663
 ] 

Shuai Lin commented on SPARK-19123:
---

IIUC {{KeyProviderException}} means the storage account key is not configured 
properly. Are you sure the way you specify the key is correct? Have you checked 
the Azure developer docs for it?

BTW I don't think this is a "critical issue", so I changed it to "minor".
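
For reference, here is a minimal Scala sketch of how the account key is usually 
wired up, with the storage account name spelled out in the property name (the 
account, container and key below are hypothetical placeholders, not values taken 
from this report):

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical storage account "myaccount" and container "mycontainer".
val spark = SparkSession.builder()
  .appName("Pipeline")
  .master("yarn")
  .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
  // The property name must contain the full storage account host name.
  .config("fs.azure.account.key.myaccount.blob.core.windows.net", "<base64-account-key>")
  .getOrCreate()

// wasb URIs name both the container and the account explicitly.
val txs = spark.read.json("wasb://mycontainer@myaccount.blob.core.windows.net/path/to/json")
{code}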

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: newbie
>
> I created a Spark job that is intended to read a set of JSON files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset<Row> txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAcce

[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-19123:
--
Labels: newbie  (was: features newbie test)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: newbie
>
> I created a Spark job that is intended to read a set of JSON files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset<Row> txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:

[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-19123:
--
Priority: Minor  (was: Critical)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: features, newbie, test
>
> I created a Spark job that is intended to read a set of JSON files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset<Row> txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.

[jira] [Updated] (SPARK-18941) "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18941:

Assignee: Dongjoon Hyun

> "Drop Table" command doesn't delete the directory of the managed Hive table 
> when users specifying locations
> ---
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: luat
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> On the Spark Thrift server (Spark 2.0.2), the "drop table" command doesn't 
> delete the directory associated with a managed (non-EXTERNAL) Hive table from 
> the HDFS file system.
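
A minimal sketch of the reported scenario, assuming a Hive-enabled {{SparkSession}} 
named {{spark}} and a hypothetical HDFS path:

{code}
// Managed (non-EXTERNAL) Hive table created with an explicit location.
spark.sql("CREATE TABLE t_managed (a STRING) LOCATION '/tmp/t_managed'")
spark.sql("DROP TABLE t_managed")
// Expected for a managed table: /tmp/t_managed is removed along with the metadata.
// Reported behaviour on 2.0.2: the directory is left behind on HDFS.
{code}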



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18941) "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18941:

Component/s: (was: Java API)
 SQL

> "Drop Table" command doesn't delete the directory of the managed Hive table 
> when users specifying locations
> ---
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: luat
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> On the Spark Thrift server (Spark 2.0.2), the "drop table" command doesn't 
> delete the directory associated with a managed (non-EXTERNAL) Hive table from 
> the HDFS file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18941) "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18941.
-
   Resolution: Resolved
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> "Drop Table" command doesn't delete the directory of the managed Hive table 
> when users specifying locations
> ---
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: luat
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> On the Spark Thrift server (Spark 2.0.2), the "drop table" command doesn't 
> delete the directory associated with a managed (non-EXTERNAL) Hive table from 
> the HDFS file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-01-07 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-19122:

Description: 
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if the join predicates are specified in the query in the *same* order as the 
bucketing and sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with the join predicates in a *different* order from the bucketing 
and sort order leads to an extra shuffle and sort being introduced.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k#101, j#100, 200)
  +- *Project [i#99, j#100, k#101]
 +- *Filter (isnotnull(j#100) && isnotnull(k#101))
+- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}

  was:
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if the join predicates are specified in the query in the *same* order as the 
bucketing and sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1],
 PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2],
 PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with the join predicates in a *different* order from the bucketing 
and sort order leads to an extra shuffle and sort being introduced.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], 
PartitionFilters: [], PushedFilters: [IsNotNull(k), Is

[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2017-01-07 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808507#comment-15808507
 ] 

Tejas Patil commented on SPARK-18067:
-

[~hvanhovell] :

* You are right about the data distribution. Both approaches are prone to OOMs, 
but my approach is more likely to OOM depending on the data. If the tables are 
bucketed + sorted on a single column, both approaches will distribute the data 
in the same manner. When I checked Hive's behavior, it does the same thing that 
I mentioned, and it works across all our workloads.
* Adding a shuffle makes the job less performant in general, irrespective of the 
data distribution. I assume that "significant performance problems" will only 
happen if the data is heavily skewed on the join key?
* I have seen OOMs happen with the current Spark approach (which hashes 
everything). I personally feel that the row buffering should be backed by disk 
spill to be more reliable. It will run a bit slower, but at least it will be 
reliable and won't page people in the middle of the night due to OOM.

I have seen another scenario in SPARK-19122, but the proposed solution there 
will help this issue as well.

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter, our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed 
> into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-01-07 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808485#comment-15808485
 ] 

Tejas Patil edited comment on SPARK-19122 at 1/8/17 1:44 AM:
-

[~hvanhovell] : At a broader level, I see that when the nodes for physical 
operators are created, the `requiredChildDistribution` and 
`requiredChildOrdering` are pretty much fixed at that point. Once the planning 
is done and the child nodes are inspected, there is no way to change them.

To fix this, my take would be:
* Allow `requiredChildDistribution` and `requiredChildOrdering` to be decided 
based on the operator's children, i.e. change `requiredChildOrdering()` to 
`requiredChildOrdering(childOrderings)`. The default behavior would be to ignore 
the input `childOrderings`.
* For operators like `SortMergeJoinExec`, where the distribution and ordering 
requirements are much stricter, the operator itself could decide which ordering 
to use.

||child output ordering||ordering of join keys in query||shuffle+sort needed||
| a, b | a, b | No |
| a, b | b, a | No |
| a, b, c, d | a, b | No |
| a, b, c, d | b, c | Yes |
| a, b | a, b, c, d | Yes |
| b, c | a, b, c, d | Yes |

Even SPARK-18067 will benefit from this change. Let me know if you have any 
opinion about this approach OR have something better in mind.



was (Author: tejasp):
[~hvanhovell] : At a broader level, I see that when the nodes for physical 
operators are created, the `requiredChildDistribution` and 
`requiredChildOrdering` are pretty much fixed at that point. Once the planning 
is done and the child nodes are inspected, there is no way to change them.

To fix this, my take would be:
* Allow `requiredChildDistribution` and `requiredChildOrdering` to be decided 
based on the operator's children, i.e. change `requiredChildOrdering()` to 
`requiredChildOrdering(childOrderings)`. The default behavior would be to ignore 
the input `childOrderings`.
* For operators like `SortMergeJoinExec`, where the distribution and ordering 
requirements are much stricter, the operator itself could decide which ordering 
to use.

||child output ordering||ordering of join keys in query||shuffle+sort needed||
| a, b | a, b | No |
| a, b | b, a | No |
| a, b, c, d | a, b | No |
| a, b, c, d | b, c | Yes |
| a, b | a, b, c, d | Yes |
| b, c | a, b, c, d | Yes |

Even SPARK-17271 will benefit from this change. Let me know if you have any 
opinion about this approach OR have something better in mind.


> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: 
> InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], 
> ReadSchema: struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2],
>  PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], 
> ReadSchema: struct
> {code}
> The same query with the join predicates in a *different* order from the 
> bucketing and sort order leads to an extra shuffle and sort being introduced.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], fa

[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-01-07 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808485#comment-15808485
 ] 

Tejas Patil commented on SPARK-19122:
-

[~hvanhovell] : At a broader level, I see that when the nodes for physical 
operators are created, the `requiredChildDistribution` and 
`requiredChildOrdering` are pretty much fixed at that point. Once the planning 
is done and the child nodes are inspected, there is no way to change them.

To fix this, my take would be:
* Allow `requiredChildDistribution` and `requiredChildOrdering` to be decided 
based on the operator's children, i.e. change `requiredChildOrdering()` to 
`requiredChildOrdering(childOrderings)`. The default behavior would be to ignore 
the input `childOrderings`.
* For operators like `SortMergeJoinExec`, where the distribution and ordering 
requirements are much stricter, the operator itself could decide which ordering 
to use.

||child output ordering||ordering of join keys in query||shuffle+sort needed||
| a, b | a, b | No |
| a, b | b, a | No |
| a, b, c, d | a, b | No |
| a, b, c, d | b, c | Yes |
| a, b | a, b, c, d | Yes |
| b, c | a, b, c, d | Yes |

Even SPARK-17271 will benefit from this change. Let me know if you have any 
opinion about this approach OR have something better in mind.
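
A small standalone Scala sketch of the decision rule implied by the table above 
(a simplified model for illustration, not the actual planner code):

{code}
// No shuffle+sort is needed iff the join keys form, as a set, a prefix of the
// child's output ordering; the operator can then adopt that prefix as its ordering.
def needsShuffleSort(childOrdering: Seq[String], joinKeys: Seq[String]): Boolean = {
  val prefix = childOrdering.take(joinKeys.length)
  !(joinKeys.length <= childOrdering.length && prefix.toSet == joinKeys.toSet)
}

// Reproduces the table row by row.
assert(!needsShuffleSort(Seq("a", "b"), Seq("a", "b")))
assert(!needsShuffleSort(Seq("a", "b"), Seq("b", "a")))
assert(!needsShuffleSort(Seq("a", "b", "c", "d"), Seq("a", "b")))
assert(needsShuffleSort(Seq("a", "b", "c", "d"), Seq("b", "c")))
assert(needsShuffleSort(Seq("a", "b"), Seq("a", "b", "c", "d")))
assert(needsShuffleSort(Seq("b", "c"), Seq("a", "b", "c", "d")))
{code}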


> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: 
> InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], 
> ReadSchema: struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2],
>  PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], 
> ReadSchema: struct
> {code}
> The same query with the join predicates in a *different* order from the 
> bucketing and sort order leads to an extra shuffle and sort being introduced.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], 
> ReadSchema: struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], 
> ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-01-07 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808479#comment-15808479
 ] 

Tejas Patil commented on SPARK-19122:
-

When a `SortMergeJoinExec` node is created, the join keys in both the `left` and 
`right` relations are extracted in their order of appearance in the query (see 0). 
Later, the list of keys is used as-is to define the required distribution (see 1) 
and ordering (see 2) for the sort merge join node. Since the ordering matters here 
(i.e. `ClusteredDistribution(a,b) != ClusteredDistribution(b,a)`), this mismatches 
the distribution and ordering of the children, and thus `EnsureRequirements` ends 
up adding a shuffle + sort.

0 : 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L103
1 : 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L80
2 : 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L85
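
Stated as a tiny standalone sketch (a simplified model of the order-sensitive 
comparison, not the actual planner classes):

{code}
// The children were bucketed/sorted on (j, k), but the join keys were extracted
// in query order as (k, j). An order-sensitive comparison sees a mismatch even
// though the key set is identical, so a shuffle + sort gets planned.
val childClustering    = Seq("j", "k")  // from bucketBy(8, "j", "k").sortBy("j", "k")
val requiredClustering = Seq("k", "j")  // join keys in the order they appear in the query

assert(childClustering != requiredClustering)              // order-sensitive: mismatch
assert(childClustering.toSet == requiredClustering.toSet)  // same columns, different order
{code}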



> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: 
> InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], 
> ReadSchema: struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2],
>  PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], 
> ReadSchema: struct
> {code}
> The same query with the join predicates in a *different* order from the 
> bucketing and sort order leads to an extra shuffle and sort being introduced.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], 
> ReadSchema: struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], 
> ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Saulo Ricci (JIRA)
Saulo Ricci created SPARK-19123:
---

 Summary: KeyProviderException when reading Azure Blobs from Apache 
Spark
 Key: SPARK-19123
 URL: https://issues.apache.org/jira/browse/SPARK-19123
 Project: Spark
  Issue Type: Question
  Components: Input/Output, Java API
Affects Versions: 2.0.0
 Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
version 3.5 with Hadoop version 2.7.3
Reporter: Saulo Ricci
Priority: Critical


I created a Spark job that is intended to read a set of JSON files from an 
Azure Blob container. I set the key and reference to my storage and I'm reading 
the files as shown in the snippet below:

{code:java}
SparkSession
sparkSession =
SparkSession.builder().appName("Pipeline")
.master("yarn")
.config("fs.azure", 
"org.apache.hadoop.fs.azure.NativeAzureFileSystem")

.config("fs.azure.account.key..blob.core.windows.net","")
.getOrCreate();

Dataset<Row> txs = sparkSession.read().json("wasb://path_to_files");
{code}
The point is that I'm unfortunately getting an 
`org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
the Azure storage. According to the trace shown below, it seems the header is too 
long, but I'm still trying to figure out what exactly that means:

{code:java}
17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
org.apache.hadoop.fs.azure.AzureException: 
org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: 
Error reading S/MIME message
140473279682200:error:0D07207B:asn1 encoding 
routines:ASN1_get_object:header too long:asn1_lib.c:157:
140473279682200:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode 
error:asn_mime.c:192:
140473279682200:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 
parse error:asn_mime.c:517:

org.apache.hadoop.fs.azure.AzureException: 
org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: 
Error reading S/MIME message
140473279682200:error:0D07207B:asn1 encoding 
routines:ASN1_get_object:header too long:asn1_lib.c:157:
140473279682200:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode 
error:asn_mime.c:192:
140473279682200:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 
parse error:asn_mime.c:517:

at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
at 
org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
at 
taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: org.apache.hadoop.fs.azure.KeyProviderException: 
ExitCodeException exitCode=2: Error reading S/MIME message
{code}
I'm using an Apache Spark 2.0.0 cluster set up on top of an HDInsight Azure 
cluster and I'd like to find a solution to this 

[jira] [Created] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-01-07 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-19122:
---

 Summary: Unnecessary shuffle+sort added if join predicates 
ordering differ from bucketing and sorting order
 Key: SPARK-19122
 URL: https://issues.apache.org/jira/browse/SPARK-19122
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2
Reporter: Tejas Patil


`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
respective order)

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
"k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", 
"k").sortBy("j", "k").saveAsTable("table2")
{code}

Now, if the join predicates are specified in the query in the *same* order as the 
bucketing and sort order, there is no shuffle and sort.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
a.k=b.k").explain(true)

== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
: +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1],
 PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
  +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: 
ORC, Location: 
InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2],
 PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}


The same query with the join predicates in a *different* order from the bucketing 
and sort order leads to an extra shuffle and sort being introduced.

{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true)

== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
: +- *Project [i#60, j#61, k#62]
:+- *Filter (isnotnull(k#62) && isnotnull(j#61))
:   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], 
PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
struct
+- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k#101, j#100, 200)
  +- *Project [i#99, j#100, k#101]
 +- *Filter (isnotnull(j#100) && isnotnull(k#101))
+- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table2], 
PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
struct
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19120) Returned an Empty Result after Loading a Hive Table

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19120:


Assignee: Xiao Li  (was: Apache Spark)

> Returned an Empty Result after Loading a Hive Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test (a STRING)
> |STORED AS PARQUET
>   """.stripMargin)
>spark.table("test").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test
>""".stripMargin)
> spark.table("test").show()
> {noformat}
> The returned result is empty after the table is loaded. We should refresh the 
> metadata cache after loading data into the table.
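
Until a fix lands, a hedged sketch of the kind of manual refresh the description 
calls for (table name taken from the snippet above):

{code}
// Explicitly invalidate the cached metadata for the table after LOAD DATA,
// then re-read it; this is the refresh the description says should happen automatically.
spark.catalog.refreshTable("test")
spark.table("test").show()
{code}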



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19121:


Assignee: Apache Spark  (was: Xiao Li)

> No need to refresh metadata cache for non-partitioned Hive tables
> -
>
> Key: SPARK-19121
> URL: https://issues.apache.org/jira/browse/SPARK-19121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Only partitioned Hive tables need a metadata refresh after insertion.
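
A minimal sketch of the intent, assuming a hypothetical helper built on the public 
catalog API (the real change belongs inside the insert command itself):

{code}
import org.apache.spark.sql.SparkSession

// Refresh the metadata cache after an insert only when the table is partitioned.
def refreshAfterInsert(spark: SparkSession, table: String): Unit = {
  val isPartitioned = spark.catalog.listColumns(table).collect().exists(_.isPartition)
  if (isPartitioned) spark.catalog.refreshTable(table)
}
{code}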



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19120) Returned an Empty Result after Loading a Hive Table

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808406#comment-15808406
 ] 

Apache Spark commented on SPARK-19120:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16500

> Returned an Empty Result after Loading a Hive Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test (a STRING)
> |STORED AS PARQUET
>   """.stripMargin)
>spark.table("test").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test
>""".stripMargin)
> spark.table("test").show()
> {noformat}
> The returned result is empty after table loading. We should refresh the 
> metadata cache after loading the data to the table. 






[jira] [Assigned] (SPARK-19120) Returned an Empty Result after Loading a Hive Table

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19120:


Assignee: Apache Spark  (was: Xiao Li)

> Returned an Empty Result after Loading a Hive Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test (a STRING)
> |STORED AS PARQUET
>   """.stripMargin)
>spark.table("test").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test
>""".stripMargin)
> spark.table("test").show()
> {noformat}
> The returned result is empty after table loading. We should refresh the 
> metadata cache after loading the data to the table. 






[jira] [Commented] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808407#comment-15808407
 ] 

Apache Spark commented on SPARK-19121:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16500

> No need to refresh metadata cache for non-partitioned Hive tables
> -
>
> Key: SPARK-19121
> URL: https://issues.apache.org/jira/browse/SPARK-19121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Only partitioned Hive tables need a metadata refresh after insertion.






[jira] [Assigned] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19121:


Assignee: Xiao Li  (was: Apache Spark)

> No need to refresh metadata cache for non-partitioned Hive tables
> -
>
> Key: SPARK-19121
> URL: https://issues.apache.org/jira/browse/SPARK-19121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Only partitioned Hive tables need a metadata refresh after insertion.






[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Table

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19120:

Description: 
{noformat}
sql(
  """
|CREATE TABLE test (a STRING)
|STORED AS PARQUET
  """.stripMargin)

   spark.table("test").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test
   """.stripMargin)

spark.table("test").show()
{noformat}

The returned result is empty after table loading. We should refresh the 
metadata cache after loading the data to the table. 

  was:
{noformat}
sql(
  """
|CREATE TABLE test_added_partitions (a STRING)
|PARTITIONED BY (b INT)
|STORED AS PARQUET
  """.stripMargin)

// Create partition without data files and check whether it can be read
sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
spark.table("test_added_partitions").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test_added_partitions PARTITION(b='1')
   """.stripMargin)

spark.table("test_added_partitions").show()
{noformat}

The returned result is empty after table loading. We should refresh the 
metadata cache after loading the data to the table. 

This only happens if users manually add the partition by {{ALTER TABLE ADD 
PARTITION}}



> Returned an Empty Result after Loading a Hive Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test (a STRING)
> |STORED AS PARQUET
>   """.stripMargin)
>spark.table("test").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test
>""".stripMargin)
> spark.table("test").show()
> {noformat}
> The returned result is empty after table loading. We should refresh the 
> metadata cache after loading the data to the table. 






[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Table

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19120:

Summary: Returned an Empty Result after Loading a Hive Table  (was: 
Returned an Empty Result after Loading a Hive Partitioned Table)

> Returned an Empty Result after Loading a Hive Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test_added_partitions (a STRING)
> |PARTITIONED BY (b INT)
> |STORED AS PARQUET
>   """.stripMargin)
> // Create partition without data files and check whether it can be 
> read
> sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
> spark.table("test_added_partitions").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test_added_partitions PARTITION(b='1')
>""".stripMargin)
> spark.table("test_added_partitions").show()
> {noformat}
> The returned result is empty after table loading. We should refresh the 
> metadata cache after loading the data to the table. 
> This only happens if users manually add the partition by {{ALTER TABLE ADD 
> PARTITION}}






[jira] [Created] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables

2017-01-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19121:
---

 Summary: No need to refresh metadata cache for non-partitioned 
Hive tables
 Key: SPARK-19121
 URL: https://issues.apache.org/jira/browse/SPARK-19121
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


Only partitioned Hive tables need a metadata refresh after insertion.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2017-01-07 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808216#comment-15808216
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] the build and tests have passed in Jenkins, what do you think? Should we 
commit a little at a time to keep this moving forward, or should I make the next 
change? I prefer to create a new pull request for each unit test area that I fix, 
so my preference would be to commit this small change for ContextCleaner.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
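
A minimal sketch of the kind of change being proposed (the app name is a placeholder): align the Scala-side test master with the {{local[4]}} used by the Python tests so partition-dependent results match.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Use the same parallelism as the PySpark tests (local[4]) so that anything
// that depends on partition IDs, e.g. per-partition random number generators,
// produces the same results in Scala/Java and Python.
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("unit-test")
val sc = new SparkContext(conf)

try {
  // 4 partitions by default, matching the Python side
  val rdd = sc.parallelize(1 to 100)
  assert(rdd.getNumPartitions == 4)
} finally {
  sc.stop()
}
{code}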






[jira] [Assigned] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17204:


Assignee: Apache Spark

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Apache Spark
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.

[jira] [Assigned] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17204:


Assignee: (was: Apache Spark)

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808197#comment-15808197
 ] 

Apache Spark commented on SPARK-17204:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/16499
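
For reference, a minimal sketch of the persistence setup described in this issue (the RDD contents are placeholders and {{sc}} is an existing SparkContext): off-heap storage with replication factor 2 has no predefined constant, so it has to be built through the {{StorageLevel}} factory.

{code}
import org.apache.spark.storage.StorageLevel

// StorageLevel.OFF_HEAP is defined with replication = 1; build an off-heap
// level with replication = 2 explicitly to match the reported setup.
// Off-heap storage also needs spark.memory.offHeap.enabled=true and a
// spark.memory.offHeap.size to be configured.
val offHeap2 = StorageLevel(
  useDisk = true, useMemory = true, useOffHeap = true,
  deserialized = false, replication = 2)

val rdd = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
rdd.persist(offHeap2)
rdd.count()   // with the reported bug, reading the replicated blocks can fail as above
{code}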

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStage

[jira] [Commented] (SPARK-19118) Percentile support for frequency distribution table

2017-01-07 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808173#comment-15808173
 ] 

gagan taneja commented on SPARK-19118:
--

I filed the original JIRA to provide this functionality for both Percentile and 
Approx Percentile.
After reviewing the code I realized that the two implementations are very 
different, so it would be best to divide the work into two sub-tasks.
I have created one sub-task for adding support to the Percentile function.
For the Approx Percentile functionality I will need to work with a developer to 
find the right approach and make the code changes; this is tracked in the 
sub-task https://issues.apache.org/jira/browse/SPARK-19119
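
Since the issue description is empty, a sketch of the underlying use case may help (the table {{freq_tbl}} and its columns are hypothetical): computing a percentile directly from a (value, count) frequency distribution table, which currently requires a manual cumulative-sum query rather than a single {{percentile}} call.

{code}
// freq_tbl(value, cnt) is a hypothetical frequency distribution table.
// Exact median without expanding the counts into individual rows:
// take the first value whose cumulative count reaches half of the total.
val median = spark.sql("""
  SELECT MIN(value) AS median
  FROM (
    SELECT value,
           SUM(cnt) OVER (ORDER BY value) AS cum_cnt,
           SUM(cnt) OVER ()               AS total_cnt
    FROM freq_tbl
  ) t
  WHERE cum_cnt >= 0.5 * total_cnt
""")
median.show()
{code}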

> Percentile support for frequency distribution table
> ---
>
> Key: SPARK-19118
> URL: https://issues.apache.org/jira/browse/SPARK-19118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: gagan taneja
>







[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19120:

Description: 
{noformat}
sql(
  """
|CREATE TABLE test_added_partitions (a STRING)
|PARTITIONED BY (b INT)
|STORED AS PARQUET
  """.stripMargin)

// Create partition without data files and check whether it can be read
sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
spark.table("test_added_partitions").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test_added_partitions PARTITION(b='1')
   """.stripMargin)

spark.table("test_added_partitions").show()
{noformat}

The returned result is empty after table loading. We should refresh the 
metadata cache after loading the data to the table. 

This only happens if users manually add the partition by {{ALTER TABLE ADD 
PARTITION}}


  was:
{noformat}
sql(
  """
|CREATE TABLE test_added_partitions (a STRING)
|PARTITIONED BY (b INT)
|STORED AS PARQUET
  """.stripMargin)

// Create partition without data files and check whether it can be read
sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
spark.table("test_added_partitions").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test_added_partitions PARTITION(b='1')
   """.stripMargin)

spark.table("test_added_partitions").show()
{noformat}

The returned result is empty after table loading. We should do metadata refresh 
after loading the data to the table. 

This only happens if users manually add the partition by {{ALTER TABLE ADD 
PARTITION}}



> Returned an Empty Result after Loading a Hive Partitioned Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test_added_partitions (a STRING)
> |PARTITIONED BY (b INT)
> |STORED AS PARQUET
>   """.stripMargin)
> // Create partition without data files and check whether it can be 
> read
> sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
> spark.table("test_added_partitions").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test_added_partitions PARTITION(b='1')
>""".stripMargin)
> spark.table("test_added_partitions").show()
> {noformat}
> The returned result is empty after table loading. We should refresh the 
> metadata cache after loading the data to the table. 
> This only happens if users manually add the partition by {{ALTER TABLE ADD 
> PARTITION}}






[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19120:

Labels: correctness  (was: )

> Returned an Empty Result after Loading a Hive Partitioned Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test_added_partitions (a STRING)
> |PARTITIONED BY (b INT)
> |STORED AS PARQUET
>   """.stripMargin)
> // Create partition without data files and check whether it can be 
> read
> sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
> spark.table("test_added_partitions").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test_added_partitions PARTITION(b='1')
>""".stripMargin)
> spark.table("test_added_partitions").show()
> {noformat}
> The returned result is empty after table loading. We should do metadata 
> refresh after loading the data to the table. 
> This only happens if users manually add the partition by {{ALTER TABLE ADD 
> PARTITION}}






[jira] [Commented] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table

2017-01-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808100#comment-15808100
 ] 

Xiao Li commented on SPARK-19120:
-

Will submit a fix soon.

> Returned an Empty Result after Loading a Hive Partitioned Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test_added_partitions (a STRING)
> |PARTITIONED BY (b INT)
> |STORED AS PARQUET
>   """.stripMargin)
> // Create partition without data files and check whether it can be 
> read
> sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
> spark.table("test_added_partitions").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test_added_partitions PARTITION(b='1')
>""".stripMargin)
> spark.table("test_added_partitions").show()
> {noformat}
> The returned result is empty after table loading. We should do metadata 
> refresh after loading the data to the table. 
> This only happens if users manually add the partition by {{ALTER TABLE ADD 
> PARTITION}}






[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table

2017-01-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19120:

Description: 
{noformat}
sql(
  """
|CREATE TABLE test_added_partitions (a STRING)
|PARTITIONED BY (b INT)
|STORED AS PARQUET
  """.stripMargin)

// Create partition without data files and check whether it can be read
sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
spark.table("test_added_partitions").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test_added_partitions PARTITION(b='1')
   """.stripMargin)

spark.table("test_added_partitions").show()
{noformat}

The returned result is empty after table loading. We should do metadata refresh 
after loading the data to the table. 

This only happens if users manually add the partition by {{ALTER TABLE ADD 
PARTITION}}


  was:
{noformat}
sql(
  """
|CREATE TABLE test_added_partitions (a STRING)
|PARTITIONED BY (b INT)
|STORED AS PARQUET
  """.stripMargin)

// Create partition without data files and check whether it can be read
sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
spark.table("test_added_partitions").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test_added_partitions PARTITION(b='1')
   """.stripMargin)

spark.table("test_added_partitions").show()
{noformat}

The returned result is empty after table loading. We should do metadata refresh 
after loading the data to the table. 



> Returned an Empty Result after Loading a Hive Partitioned Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test_added_partitions (a STRING)
> |PARTITIONED BY (b INT)
> |STORED AS PARQUET
>   """.stripMargin)
> // Create partition without data files and check whether it can be 
> read
> sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
> spark.table("test_added_partitions").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test_added_partitions PARTITION(b='1')
>""".stripMargin)
> spark.table("test_added_partitions").show()
> {noformat}
> The returned result is empty after table loading. We should do metadata 
> refresh after loading the data to the table. 
> This only happens if users manually add the partition by {{ALTER TABLE ADD 
> PARTITION}}






[jira] [Created] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table

2017-01-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19120:
---

 Summary: Returned an Empty Result after Loading a Hive Partitioned 
Table
 Key: SPARK-19120
 URL: https://issues.apache.org/jira/browse/SPARK-19120
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li
Priority: Critical


{noformat}
sql(
  """
|CREATE TABLE test_added_partitions (a STRING)
|PARTITIONED BY (b INT)
|STORED AS PARQUET
  """.stripMargin)

// Create partition without data files and check whether it can be read
sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
spark.table("test_added_partitions").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE test_added_partitions PARTITION(b='1')
   """.stripMargin)

spark.table("test_added_partitions").show()
{noformat}

The returned result is empty after table loading. We should do metadata refresh 
after loading the data to the table. 







[jira] [Commented] (SPARK-19115) SparkSQL unsupported the command " create external table if not exist new_tbl like old_tbl"

2017-01-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808091#comment-15808091
 ] 

Xiao Li commented on SPARK-19115:
-

When the location of the table is not provided by the user, it is created as a 
managed table. This is by design, not a bug.
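
One way to confirm which kind of table was created (the table name is taken from the report) is to look at the detailed table information recorded in the catalog, for example:

{code}
// The detailed table information includes the table type
// (MANAGED vs EXTERNAL) stored in the metastore.
spark.sql("DESCRIBE EXTENDED new_tbl").show(100, false)
{code}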

> SparkSQL unsupported the command " create external table if not exist  
> new_tbl like old_tbl"
> 
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> spark 2.0.1 does not support the command "create external table if not exists 
> new_tbl like old_tbl".
> We tried to modify the sqlbase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After 
> that, we found we could run the command "create external table if not exists 
> new_tbl like old_tbl" successfully; unfortunately, the generated table's type 
> is MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.






[jira] [Resolved] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19110.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.3
   2.1.1

Issue resolved by pull request 16491
[https://github.com/apache/spark/pull/16491]

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> While adding DistributedLDAModel training summary for SparkR, I found that 
> the logPrior for original and loaded model is different.
> For example, in the test("read/write DistributedLDAModel"), I add the test:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
>   val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
>   assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573
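
A self-contained sketch of the failing check (the dataset and path are placeholders; the real test lives in the LDA test suite):

{code}
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.ml.linalg.Vectors

// Tiny corpus: each row is a term-frequency vector.
val dataset = spark.createDataFrame(Seq(
  (0L, Vectors.dense(1.0, 0.0, 2.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0)),
  (2L, Vectors.dense(2.0, 1.0, 0.0))
)).toDF("id", "features")

// The EM optimizer produces a DistributedLDAModel.
val lda = new LDA().setK(2).setMaxIter(5).setOptimizer("em")
val model = lda.fit(dataset).asInstanceOf[DistributedLDAModel]

val path = "/tmp/lda-model"   // placeholder path
model.write.overwrite().save(path)
val loaded = DistributedLDAModel.load(path)

// With this bug, the two values differ even though the models should be identical.
println(model.logPrior)
println(loaded.logPrior)
{code}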






[jira] [Commented] (SPARK-18948) Add Mean Percentile Rank metric for ranking algorithms

2017-01-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808030#comment-15808030
 ] 

Joseph K. Bradley commented on SPARK-18948:
---

Oh I agree we'd need to add the initial RankingEvaluator first.  It looks like 
there is a PR, though I haven't had a chance to look at it yet.

> Add Mean Percentile Rank metric for ranking algorithms
> --
>
> Key: SPARK-18948
> URL: https://issues.apache.org/jira/browse/SPARK-18948
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Danilo Ascione
>
> Add the Mean Percentile Rank (MPR) metric for ranking algorithms, as 
> described in the paper :
> Hu, Y., Y. Koren, and C. Volinsky. “Collaborative Filtering for Implicit 
> Feedback Datasets.” In 2008 Eighth IEEE International Conference on Data 
> Mining, 263–72, 2008. doi:10.1109/ICDM.2008.22. 
> (http://yifanhu.net/PUB/cf.pdf) (NB: MPR is called "Expected percentile rank" 
> in the paper)
> The ALS algorithm for implicit feedback in Spark ML is based on the same 
> paper. 
> Spark ML lacks an implementation of an appropriate metric for implicit 
> feedback, so the MPR metric can fulfill this use case.
> This implementation adds the metric to the RankingMetrics class under 
> org.apache.spark.mllib.evaluation (SPARK-3568), and it uses the same input 
> (prediction and label pairs).
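
As a rough sketch of the metric itself (not the proposed RankingMetrics API): for each user, every relevant item gets the percentile position of its rank in the predicted list (0.0 = recommended first, 1.0 = last or missing), and MPR is the mean of those positions over all relevant items; lower is better.

{code}
// Hypothetical helper illustrating the metric; not the proposed API.
// `predicted` is the ranked recommendation list for one user, `relevant`
// the set of items the user actually interacted with.
def percentileRanks(predicted: Seq[String], relevant: Set[String]): Seq[Double] = {
  val n = math.max(predicted.length - 1, 1)
  relevant.toSeq.map { item =>
    val idx = predicted.indexOf(item)
    if (idx >= 0) idx.toDouble / n else 1.0 // missing items are treated as ranked last
  }
}

def meanPercentileRank(users: Seq[(Seq[String], Set[String])]): Double = {
  val ranks = users.flatMap { case (predicted, relevant) => percentileRanks(predicted, relevant) }
  if (ranks.isEmpty) 0.0 else ranks.sum / ranks.size
}
{code}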






[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19110:
--
Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 1.6.4, 2.0.3, 2.1.1, 2.2.0)

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Miao Wang
>
> While adding DistributedLDAModel training summary for SparkR, I found that 
> the logPrior for original and loaded model is different.
> For example, in the test("read/write DistributedLDAModel"), I add the test:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
>   val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
>   assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573






[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19110:
--
Assignee: Miao Wang

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Miao Wang
>Assignee: Miao Wang
>
> While adding DistributedLDAModel training summary for SparkR, I found that 
> the logPrior for original and loaded model is different.
> For example, in the test("read/write DistributedLDAModel"), I add the test:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
>   val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
>   assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573






[jira] [Assigned] (SPARK-19106) Styling for the configuration docs is broken

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19106:
-

Assignee: Sean Owen

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.1, 2.2.0
>
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html






[jira] [Resolved] (SPARK-19106) Styling for the configuration docs is broken

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19106.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16490
[https://github.com/apache/spark/pull/16490]

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Trivial
> Fix For: 2.1.1, 2.2.0
>
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html






[jira] [Commented] (SPARK-19118) Percentile support for frequency distribution table

2017-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807993#comment-15807993
 ] 

Sean Owen commented on SPARK-19118:
---

Why are there two child JIRAs? They don't sound that separate, and I'm not sure 
why they need a parent.

> Percentile support for frequency distribution table
> ---
>
> Key: SPARK-19118
> URL: https://issues.apache.org/jira/browse/SPARK-19118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: gagan taneja
>







[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807969#comment-15807969
 ] 

Apache Spark commented on SPARK-9487:
-

User 'skanjila' has created a pull request for this issue:
https://github.com/apache/spark/pull/16498

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Assigned] (SPARK-19118) Percentile support for frequency distribution table

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19118:


Assignee: (was: Apache Spark)

> Percentile support for frequency distribution table
> ---
>
> Key: SPARK-19118
> URL: https://issues.apache.org/jira/browse/SPARK-19118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: gagan taneja
>







[jira] [Assigned] (SPARK-19118) Percentile support for frequency distribution table

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19118:


Assignee: Apache Spark

> Percentile support for frequency distribution table
> ---
>
> Key: SPARK-19118
> URL: https://issues.apache.org/jira/browse/SPARK-19118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: gagan taneja
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-19118) Percentile support for frequency distribution table

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807959#comment-15807959
 ] 

Apache Spark commented on SPARK-19118:
--

User 'tanejagagan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16497

> Percentile support for frequency distribution table
> ---
>
> Key: SPARK-19118
> URL: https://issues.apache.org/jira/browse/SPARK-19118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: gagan taneja
>







[jira] [Created] (SPARK-19119) Approximate percentile support for frequency distribution table

2017-01-07 Thread gagan taneja (JIRA)
gagan taneja created SPARK-19119:


 Summary: Approximate percentile support for frequency distribution 
table
 Key: SPARK-19119
 URL: https://issues.apache.org/jira/browse/SPARK-19119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.2
Reporter: gagan taneja









[jira] [Created] (SPARK-19118) Percentile support for frequency distribution table

2017-01-07 Thread gagan taneja (JIRA)
gagan taneja created SPARK-19118:


 Summary: Percentile support for frequency distribution table
 Key: SPARK-19118
 URL: https://issues.apache.org/jira/browse/SPARK-19118
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.2
Reporter: gagan taneja









[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807849#comment-15807849
 ] 

Sean Owen commented on SPARK-19117:
---

I think you could skip these tests on Windows. The 'assume' method may be good 
for this, or we've done the same elsewhere.
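
A minimal sketch of that approach, assuming a ScalaTest-style suite (the suite name and the guard condition are just one possibility):

{code}
import org.scalatest.FunSuite

class ScriptTransformationSuite extends FunSuite {
  // Hypothetical guard: skip the test (reported as canceled, not failed)
  // when /bin/bash is not available, e.g. on Windows.
  private def assumeBashAvailable(): Unit =
    assume(new java.io.File("/bin/bash").exists(), "/bin/bash is not available; skipping")

  test("script transform") {
    assumeBashAvailable()
    // ... run the script transformation query here ...
  }
}
{code}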

> script transformation does not work on Windows due to fixed bash executable 
> location
> 
>
> Key: SPARK-19117
> URL: https://issues.apache.org/jira/browse/SPARK-19117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> There are some tests failed on Windows via AppVeyor as below due to this 
> problem :
> {code}
>  - script *** FAILED *** (553 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stdout *** FAILED *** (2 seconds, 813 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stderr *** FAILED *** (2 seconds, 407 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform data type *** FAILED *** (171 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - transform *** FAILED *** (359 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>   
>  - schema-less transform *** FAILED *** (344 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter *** FAILED *** (296 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - t

[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807801#comment-15807801
 ] 

Hyukjin Kwon commented on SPARK-19117:
--

Ah, [~srowen], I just noticed that some of the latest PRs dedicated to this 
functionality were reviewed by you. Could I ask what you think about this?

> script transformation does not work on Windows due to fixed bash executable 
> location
> 
>
> Key: SPARK-19117
> URL: https://issues.apache.org/jira/browse/SPARK-19117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> There are some tests failed on Windows via AppVeyor as below due to this 
> problem :
> {code}
>  - script *** FAILED *** (553 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stdout *** FAILED *** (2 seconds, 813 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stderr *** FAILED *** (2 seconds, 407 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform data type *** FAILED *** (171 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - transform *** FAILED *** (359 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>   
>  - schema-less transform *** FAILED *** (344 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter *** FAILED *** (296 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the 

[jira] [Comment Edited] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807780#comment-15807780
 ] 

Hyukjin Kwon edited comment on SPARK-19117 at 1/7/17 4:45 PM:
--

We can skip all the tests above if we are not going to support bash-based script 
transformation on other systems. If we want to support it, we could introduce a 
hidden option for the bash location.
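For illustration, a minimal sketch of how such a hidden option could be read, with 
a platform-aware default. The config key {{spark.sql.scriptTransformation.shell}} is 
hypothetical and used only to illustrate the idea, not an existing Spark option:

{code}
import org.apache.spark.SparkConf

// Hypothetical helper: pick the shell used for script transformation from a hidden
// option, falling back to /bin/bash on Unix-like systems and cmd on Windows.
// The config key below is an assumption for illustration only.
def shellCommand(conf: SparkConf): Seq[String] = {
  val isWindows = System.getProperty("os.name").toLowerCase.contains("windows")
  val default = if (isWindows) "cmd" else "/bin/bash"
  val shell = conf.get("spark.sql.scriptTransformation.shell", default)
  if (isWindows) Seq(shell, "/c") else Seq(shell, "-c")
}
{code}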


was (Author: hyukjin.kwon):
We can skip all the tests above if we are not going to support bash-based script 
transformation on other systems. If we want to support it, we could introduce a 
hidden option for this.

> script transformation does not work on Windows due to fixed bash executable 
> location
> 
>
> Key: SPARK-19117
> URL: https://issues.apache.org/jira/browse/SPARK-19117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> There are some tests failed on Windows via AppVeyor as below due to this 
> problem :
> {code}
>  - script *** FAILED *** (553 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stdout *** FAILED *** (2 seconds, 813 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stderr *** FAILED *** (2 seconds, 407 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform data type *** FAILED *** (171 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - transform *** FAILED *** (359 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>   
>  - schema-less transform *** FAILED *** (344 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter *** FAILED *** (296 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
>Failed to execute query using catalyst:
>Erro

[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807784#comment-15807784
 ] 

Hyukjin Kwon commented on SPARK-19117:
--

I will submit a PR as soon as any committer decides on this, please.

> script transformation does not work on Windows due to fixed bash executable 
> location
> 
>
> Key: SPARK-19117
> URL: https://issues.apache.org/jira/browse/SPARK-19117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> There are some tests failed on Windows via AppVeyor as below due to this 
> problem :
> {code}
>  - script *** FAILED *** (553 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stdout *** FAILED *** (2 seconds, 813 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stderr *** FAILED *** (2 seconds, 407 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform data type *** FAILED *** (171 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - transform *** FAILED *** (359 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>   
>  - schema-less transform *** FAILED *** (344 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter *** FAILED *** (296 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with SerDe2 *** FAILED *** (437 milliseconds)
>o

[jira] [Resolved] (SPARK-19085) cleanup OutputWriterFactory and OutputWriter

2017-01-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19085.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> cleanup OutputWriterFactory and OutputWriter
> 
>
> Key: SPARK-19085
> URL: https://issues.apache.org/jira/browse/SPARK-19085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807780#comment-15807780
 ] 

Hyukjin Kwon commented on SPARK-19117:
--

We can skip all the tests above if we are not going to support bash-based script 
transformation on other systems. If we want to support it, we could introduce a 
hidden option for this.

> script transformation does not work on Windows due to fixed bash executable 
> location
> 
>
> Key: SPARK-19117
> URL: https://issues.apache.org/jira/browse/SPARK-19117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> There are some tests failed on Windows via AppVeyor as below due to this 
> problem :
> {code}
>  - script *** FAILED *** (553 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stdout *** FAILED *** (2 seconds, 813 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stderr *** FAILED *** (2 seconds, 407 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform data type *** FAILED *** (171 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - transform *** FAILED *** (359 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>   
>  - schema-less transform *** FAILED *** (344 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter *** FAILED *** (296 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The sy

[jira] [Created] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-19117:


 Summary: script transformation does not work on Windows due to 
fixed bash executable location
 Key: SPARK-19117
 URL: https://issues.apache.org/jira/browse/SPARK-19117
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Hyukjin Kwon


There are some tests failed on Windows via AppVeyor as below due to this 
problem :

{code}
 - script *** FAILED *** (553 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 
(TID 54, localhost, executor driver): java.io.IOException: Cannot run program 
"/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 389.0 
(TID 725, localhost, executor driver): java.io.IOException: Cannot run program 
"/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - test script transform for stdout *** FAILED *** (2 seconds, 813 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 391.0 
(TID 726, localhost, executor driver): java.io.IOException: Cannot run program 
"/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - test script transform for stderr *** FAILED *** (2 seconds, 407 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 393.0 
(TID 727, localhost, executor driver): java.io.IOException: Cannot run program 
"/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - test script transform data type *** FAILED *** (171 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 395.0 
(TID 728, localhost, executor driver): java.io.IOException: Cannot run program 
"/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - transform *** FAILED *** (359 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, localhost, 
executor driver): java.io.IOException: Cannot run program "/bin/bash": 
CreateProcess error=2, The system cannot find the file specified
  
 - schema-less transform *** FAILED *** (344 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, localhost, 
executor driver): java.io.IOException: Cannot run program "/bin/bash": 
CreateProcess error=2, The system cannot find the file specified

 - transform with custom field delimiter *** FAILED *** (296 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, localhost, 
executor driver): java.io.IOException: Cannot run program "/bin/bash": 
CreateProcess error=2, The system cannot find the file specified

 - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, localhost, 
executor driver): java.io.IOException: Cannot run program "/bin/bash": 
CreateProcess error=2, The system cannot find the file specified

 - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
   Failed to execute query using catalyst:
   Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 
times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, localhost, 
executor driver): java.io.IOException: Cannot run program "/bin/bash": 
CreateProcess error=2, The system cannot find the file specified

 - transform with SerDe2 *** FAILED *** (437 milliseconds)
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1355.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1355.0 
(TID 2403, localhost, executor driver): java.io.IOException: Cannot run program 
"/bin/bash": CreateProcess error=2, The system cannot find the file specified

 - script transformation - schemaless *** FAILED *** (78 milliseconds)
   ...
   Cause: org.apache.spark.SparkException: Job aborted due to sta

[jira] [Commented] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807615#comment-15807615
 ] 

Apache Spark commented on SPARK-16101:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16496

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source has quite a different structure from the JSON 
> data source, although they could be very similar.
> It would be great if they had a similar structure so that some common issues 
> could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-07 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807603#comment-15807603
 ] 

Charles Allen commented on SPARK-19111:
---

I have not been able to finish the root-cause analysis, but I know the upload 
works for all jobs except our largest Spark job, and it consistently fails for 
that one.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running Spark on Mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yields an upload is as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now b

[jira] [Created] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file

2017-01-07 Thread Shea Parkes (JIRA)
Shea Parkes created SPARK-19116:
---

 Summary: LogicalPlan.statistics.sizeInBytes wrong for trivial 
parquet file
 Key: SPARK-19116
 URL: https://issues.apache.org/jira/browse/SPARK-19116
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.2, 2.0.1
 Environment: Python 3.5.x
Windows 10
Reporter: Shea Parkes


We're having some modestly severe issues with broadcast join inference, and 
I've been chasing them through the join heuristics in the catalyst engine.  
I've made it as far as I can, and I've hit upon something that does not make 
any sense to me.

I thought that loading from parquet would be a RelationPlan, which would just 
use the sum of default sizeInBytes for each column times the number of rows.  
But this trivial example shows that I am not correct:

{code}
import pyspark.sql.functions as F

df_range = session.range(100).select(F.col('id').cast('integer'))
df_range.write.parquet('c:/scratch/hundred_integers.parquet')

df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
df_parquet.explain(True)

# Expected sizeInBytes
integer_default_sizeinbytes = 4
print(df_parquet.count() * integer_default_sizeinbytes)  # = 400

# Inferred sizeInBytes
print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318

# For posterity (Didn't really expect this to match anything above)
print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
{code}

And here's the results of explain(True) on df_parquet:
{code}
In [456]: == Parsed Logical Plan ==
Relation[id#794] parquet

== Analyzed Logical Plan ==
id: int
Relation[id#794] parquet

== Optimized Logical Plan ==
Relation[id#794] parquet

== Physical Plan ==
*BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: 
file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], PushedFilters: 
> [], ReadSchema: struct<id:int>
{code}

So basically, I don't understand well how the size of the parquet file is being 
estimated.  I don't expect it to be extremely accurate, but empirically it's so 
inaccurate that we're having to mess with autoBroadcastJoinThreshold way too 
much.  (It's not always too high like the example above; it's often way too low.)

Without deeper understanding, I'm considering a result of 2318 instead of 400 
to be a bug.  My apologies if I'm missing something obvious.
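As a workaround while the estimate is unreliable, the broadcast decision can be 
pinned down explicitly rather than inferred from sizeInBytes. A minimal Scala 
sketch, where {{spark}} is the SparkSession and {{largeDf}}/{{smallDf}} are 
placeholder DataFrames:

{code}
import org.apache.spark.sql.functions.broadcast

// Raise the automatic broadcast threshold (in bytes); "-1" disables it entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

// Or force a broadcast for a specific join regardless of the size estimate.
// largeDf and smallDf are placeholder DataFrames for illustration.
val joined = largeDf.join(broadcast(smallDf), Seq("id"))
{code}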



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18930.
---
Resolution: Not A Problem

> Inserting in partitioned table - partitioned field should be last in select 
> statement. 
> ---
>
> Key: SPARK-18930
> URL: https://issues.apache.org/jira/browse/SPARK-18930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, 
> emp.db/test_partitioning_3/day=69094345
> As you can imagine, these numbers are the counts of records. But when I do 
> select * from temp.test_partitioning_4, the data is correct.
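For reference, the issue title's point restated as code: with dynamic partitioning 
the partition column must come last in the SELECT list. A sketch using the tables 
from the description above, written via {{spark.sql}} and assuming a Hive-enabled 
SparkSession:

{code}
// Partition column (day) comes last in the SELECT list; the aggregate keeps the
// alias num, so the counts land in the num column instead of the partition path.
spark.sql("""
  INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
  SELECT count(*) AS num, day
  FROM hss.session
  WHERE year = 2016 AND month = 4
  GROUP BY day
""")
{code}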



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15984) WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid" when starting application on YARN

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15984.
---
Resolution: Not A Problem

> WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max 
> attempts: 0 for application: 8 is invalid" when starting application on YARN
> --
>
> Key: SPARK-15984
> URL: https://issues.apache.org/jira/browse/SPARK-15984
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When executing {{spark-shell}} on Spark on YARN 2.7.2 on Mac OS as follows:
> {code}
> YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c 
> spark.shuffle.service.enabled=true --deploy-mode client -c 
> spark.scheduler.mode=FAIR
> {code}
> it ends up with the following WARN in the logs:
> {code}
> 2016-06-16 08:33:05,308 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new 
> applicationId: 8
> 2016-06-16 08:33:07,305 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific 
> max attempts: 0 for application: 8 is invalid, because it is out of the range 
> [1, 2]. Use the global max attempts instead.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807484#comment-15807484
 ] 

Yanbo Liang commented on SPARK-10078:
-

+1 [~sethah]

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11968.
---
  Resolution: Not A Problem
Target Version/s:   (was: )

In Progress vs. Open really doesn't matter. It's OK to change it back, but really 
this one looks like a Not A Problem.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13748) Document behavior of createDataFrame and rows with omitted fields

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13748:
--
Assignee: Hyukjin Kwon
Priority: Trivial  (was: Minor)
 Summary: Document behavior of createDataFrame and rows with omitted fields 
 (was: createDataFrame and rows with omitted fields)

> Document behavior of createDataFrame and rows with omitted fields
> -
>
> Key: SPARK-13748
> URL: https://issues.apache.org/jira/browse/SPARK-13748
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Ethan Aubin
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.2.0
>
>
> I found it confusing that a Row with an omitted field is different from a row 
> with the field present but its value missing. This was originally problematic 
> for JSON files with varying fields, but it comes down to something like:
> def test(rows):
>     ds = sc.parallelize(rows)
>     df = sqlContext.createDataFrame(ds, None, 1)
>     print df[['y']].collect()
> test([Row(x=1, y=None), Row(x=2, y='asdf')])   # Works
> test([Row(x=1), Row(x=2, y='asdf')])  # Fails with an 
> ArrayIndexOutOfBoundsException.
> Maybe more could be said in the documentation for createDataFrame or Row 
> about what's expected. Validation or correction would be helpful, as would a 
> function creating a well-formed row from a StructType and dictionary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13748) createDataFrame and rows with omitted fields

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13748.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 13771
[https://github.com/apache/spark/pull/13771]

> createDataFrame and rows with omitted fields
> 
>
> Key: SPARK-13748
> URL: https://issues.apache.org/jira/browse/SPARK-13748
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Ethan Aubin
>Priority: Minor
> Fix For: 2.2.0
>
>
> I found it confusing that a Row with an omitted field is different from a row 
> with the field present but its value missing. This was originally problematic 
> for JSON files with varying fields, but it comes down to something like:
> def test(rows):
>     ds = sc.parallelize(rows)
>     df = sqlContext.createDataFrame(ds, None, 1)
>     print df[['y']].collect()
> test([Row(x=1, y=None), Row(x=2, y='asdf')])   # Works
> test([Row(x=1), Row(x=2, y='asdf')])  # Fails with an 
> ArrayIndexOutOfBoundsException.
> Maybe more could be said in the documentation for createDataFrame or Row 
> about what's expected. Validation or correction would be helpful, as would a 
> function creating a well-formed row from a StructType and dictionary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19114) Backpressure could support non-integral rates (< 1)

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19114:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)
   Summary: Backpressure could support non-integral rates (< 1)  (was: 
Backpressure rate is cast from double to long to double)

I think the assumption that it's a long (therefore, at least 1 event per 
second) is embedded throughout this code; it's not just one instance of casting 
to a long.

It does sound like quite a corner case though, using Spark Streaming to process 
well under 1 event per second. Is it really a streaming use case?

That's not to say it couldn't be fixed but would take some surgery.

> Backpressure could support non-integral rates (< 1)
> ---
>
> Key: SPARK-19114
> URL: https://issues.apache.org/jira/browse/SPARK-19114
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tony Novak
>Priority: Minor
>
> We have a Spark streaming job where each record takes well over a second to 
> execute, so the stable rate is under 1 element/second. We set 
> spark.streaming.backpressure.enabled=true and 
> spark.streaming.backpressure.pid.minRate=0.1, but backpressure did not appear 
> to be effective, even though the TRACE level logs from PIDRateEstimator 
> showed that the new rate was 0.1.
> As it turns out, even though the minRate parameter is a Double, and the rate 
> estimate generated by PIDRateEstimator is a Double as well, RateController 
> casts the new rate to a Long. As a result, if the computed rate is less than 
> 1, it's truncated to 0, which ends up being interpreted as "no limit".
> What's particularly confusing is that the Guava RateLimiter class takes a 
> rate limit as a double, so the long value ends up being cast back to a double.
> Is there any reason not to keep the rate limit as a double all the way 
> through? I'm happy to create a pull request if this makes sense.
> We encountered the bug on Spark 1.6.2, but it looks like the code in the 
> master branch is still affected.
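A minimal Scala sketch of the truncation described above (a standalone 
illustration with simplified names, not the actual RateController code):

{code}
val estimatedRate: Double = 0.1               // e.g. what PIDRateEstimator computes
val storedRate: Long = estimatedRate.toLong   // truncated to 0 when stored as a Long
val limiterRate: Double = storedRate.toDouble // 0.0 reaches the limiter: "no limit"
println(s"estimated=$estimatedRate stored=$storedRate limiter=$limiterRate")
{code}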



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19099) Wrong time display on Spark History Server web UI

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19099.
---
Resolution: Duplicate

It's not wrong; it's in GMT.

> Wrong time display on Spark History Server web UI
> -
>
> Key: SPARK-19099
> URL: https://issues.apache.org/jira/browse/SPARK-19099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: JohnsonZhang
>Priority: Trivial
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> While using the Spark History Server, I got wrong job start and end times. I 
> tracked down the reason and found it's because of the hard-coded TimeZone 
> rawOffset. 
> I've changed it to acquire the offset value from the system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18298) HistoryServer use GMT time all time

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18298.
---
Resolution: Duplicate

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>
> When I started the HistoryServer to read event logs, the timestamps read are 
> parsed using the local timezone, e.g. "CST" (confirmed via debugging).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in the 
> History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happens in the transfer between the server and the 
> browser? I have no idea where to go from this point.
> I hope you can offer some help, or just fix it if it's easy. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19103) In web ui,URL's host name should be a specific IP address.

2017-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807395#comment-15807395
 ] 

Sean Owen commented on SPARK-19103:
---

Are you maybe looking for SPARK_LOCAL_IP? There is no attachment 3.

> In web ui,URL's host name should be a specific IP address.
> --
>
> Key: SPARK-19103
> URL: https://issues.apache.org/jira/browse/SPARK-19103
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
> Environment: spark 2.0.2
>Reporter: guoxiaolong
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> In the web UI, the URL's host name should be a specific IP address, because 
> opening the URL requires resolving the host name. The host name cannot be 
> resolved, so the URL cannot be opened. Please see the attachment. Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807377#comment-15807377
 ] 

Sean Owen commented on SPARK-19111:
---

Is there any more detail on why? Are you sure it's a size issue, and that it isn't 
just a problem with the network, etc.? 
I'm not sure it's actionable like this.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running Spark on Mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yields an upload is as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload

[jira] [Updated] (SPARK-19112) add codec for ZStandard

2017-01-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19112:
--
Priority: Minor  (was: Major)

Is this something that Spark would just get for free from HDFS APIs -- is there 
a change here other than using a later version of Hadoop?

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2017-01-07 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807248#comment-15807248
 ] 

Liang-Chi Hsieh commented on SPARK-7768:


Hi Randall,

With {{UDTRegistration}}, added in 2.0, we can already make a UDT work with a 
class from an unmodified third-party library.

You can define a user-defined type ({{UserDefinedType}}) for the class from the 
third-party library, then register the defined type together with the 
third-party class via {{UDTRegistration}}:

{code}

class ThirdPartyClassUDT extends UserDefinedType[ThirdPartyClass] {
  ...
}


UDTRegistration.register(classOf[ThirdPartyClass].getName, 
classOf[ThirdPartyClassUDT].getName)

{code}

Unfortunately, {{UDTRegistration}} is still private. I hope we can make it 
public soon as part of this change.



> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807077#comment-15807077
 ] 

chie hayashida edited comment on SPARK-17154 at 1/7/17 8:18 AM:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2| 3|
|  1| 2| 3|  1| 1| 1|
|  1| 4| 5|  1| 4| 5|
|  1| 4| 5|  1| 2| 3|
|  1| 4| 5|  1| 1| 1|
|  2| 2| 4|  2| 8| 8|
|  2| 2| 4|  2| 5| 7|
|  2| 2| 4|  2| 2| 4|
|  2| 5| 7|  2| 8| 8|
|  2| 5| 7|  2| 5| 7|
|  2| 5| 7|  2| 2| 4|
|  2| 8| 8|  2| 8| 8|
|  2| 8| 8|  2| 5| 7|
|  2| 8| 8|  2| 2| 4|
+---+--+--+---+--+--+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
{code}

h2. Example 2
{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+---+---+
| id|value1|value2|id2|value11|value22|
+---+--+--+---+---+---+
|  1| 1| 1|  1|  4|  5|
|  1| 1| 1|  1|  2|  3|
|  1| 1| 1|  1|  1|  1|
|  1| 2| 3|  1|  4|  5|
|  1| 2| 3|  1|  2|  3|
|  1| 4| 5|  1|  4|  5|
|  2| 2| 4|  2|  8|  8|
|  2| 2| 4|  2|  5|  7|
|  2| 2| 4|  2|  2|  4|
|  2| 5| 7|  2|  8|  8|
|  2| 5| 7|  2|  5|  7|
|  2| 8| 8|  2|  8|  8|
+---+--+--+---+---+---+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]
{code}

The contents of df3 differ between Example 1 and Example 2.

I think the reason is the same as the one described in this issue (SPARK-17154).

In the above case, I understand that the result of Example 1 is incorrect and 
that of Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of 
buggy code.
I think a permanent fix should be made for this issue.
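
As a side note, the warning in Example 1 points at the usual workaround: aliasing 
both sides of the self-join so the column references get distinct qualifiers. A 
sketch using the {{df}} from Example 1 (this works around the symptom; it is not 
a fix for the underlying resolution issue):

{code}
import org.apache.spark.sql.functions.col

// Aliasing gives each side its own qualifier, so df("value2") on the left and
// right no longer resolve to the same attribute.
val a = df.as("a")
val b = df.as("b")
val joined = a.join(b,
  col("a.id") === col("b.id") && col("a.value2") <= col("b.value2"))
{code}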


was (Author: hayashidac):
[~nsyca], [~cloud_fan], [~sarutak]
I have an example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 

[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807078#comment-15807078
 ] 

Apache Spark commented on SPARK-18890:
--

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/15505

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]
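
To illustrate the parallelization idea mentioned above (different tasks serialized 
by different threads), here is a generic sketch using plain JDK serialization and 
a fixed thread pool; it is not the actual scheduler code, and the names are made 
up:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Stand-in for Spark's closure serializer.
def serialize(task: AnyRef): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(task)
  out.close()
  bytes.toByteArray
}

// Serialize a batch of tasks on a dedicated pool rather than a single
// scheduler thread.
val pool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

def serializeAll(tasks: Seq[AnyRef]): Seq[Array[Byte]] = {
  val futures = tasks.map(t => Future(serialize(t))(pool))
  futures.map(f => Await.result(f, Duration.Inf))
}
{code}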



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18890:


Assignee: (was: Apache Spark)

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18890:


Assignee: Apache Spark

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807077#comment-15807077
 ] 

chie hayashida commented on SPARK-17154:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

# Example 1

``` scala
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2| 3|
|  1| 2| 3|  1| 1| 1|
|  1| 4| 5|  1| 4| 5|
|  1| 4| 5|  1| 2| 3|
|  1| 4| 5|  1| 1| 1|
|  2| 2| 4|  2| 8| 8|
|  2| 2| 4|  2| 5| 7|
|  2| 2| 4|  2| 2| 4|
|  2| 5| 7|  2| 8| 8|
|  2| 5| 7|  2| 5| 7|
|  2| 5| 7|  2| 2| 4|
|  2| 8| 8|  2| 8| 8|
|  2| 8| 8|  2| 5| 7|
|  2| 8| 8|  2| 2| 4|
+---+--+--+---+--+--+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
```

# Example 2
```scala
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+---+---+
| id|value1|value2|id2|value11|value22|
+---+--+--+---+---+---+
|  1| 1| 1|  1|  4|  5|
|  1| 1| 1|  1|  2|  3|
|  1| 1| 1|  1|  1|  1|
|  1| 2| 3|  1|  4|  5|
|  1| 2| 3|  1|  2|  3|
|  1| 4| 5|  1|  4|  5|
|  2| 2| 4|  2|  8|  8|
|  2| 2| 4|  2|  5|  7|
|  2| 2| 4|  2|  2|  4|
|  2| 5| 7|  2|  8|  8|
|  2| 5| 7|  2|  5|  7|
|  2| 8| 8|  2|  8|  8|
+---+--+--+---+---+---+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]

```

The contents of df3 differ between Example 1 and Example 2.

I think the reason is the same as the one described in this issue (SPARK-17154).

In the above case, I understand that the result of Example 1 is incorrect and 
that of Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of 
buggy code.
I think a permanent fix should be made for this issue.

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
> Attachments: Name-conflicts-2.pdf, Solution_Proposal_SPARK-17154.pdf
>
>
> When we join two DataFrames which are originated from a same DataFrame, 
> operations to the joined DataFrame can fail.
> One reproducible  example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1