[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuai Lin updated SPARK-19123:
------------------------------
    Flags:   (was: Important)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---------------------------------------------------------------
>
>                 Key: SPARK-19123
>                 URL: https://issues.apache.org/jira/browse/SPARK-19123
>             Project: Spark
>          Issue Type: Question
>          Components: Input/Output, Java API
>    Affects Versions: 2.0.0
>         Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster version 3.5 with Hadoop version 2.7.3
>            Reporter: Saulo Ricci
>            Priority: Minor
>              Labels: newbie
>
> I created a Spark job that is intended to read a set of JSON files from an Azure Blob container. I set the key and reference to my storage, and I'm reading the files as shown in the snippet below:
> {code:java}
> SparkSession sparkSession = SparkSession.builder().appName("Pipeline")
>     .master("yarn")
>     .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
>     .config("fs.azure.account.key..blob.core.windows.net", "")
>     .getOrCreate();
> Dataset txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from the Azure storage.
> According to the trace shown below, it seems the header is too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> 	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
> 	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
> 	at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
> 	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
> 	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
> 	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> 	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
> 	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.immutable.List.foreach(List.scala:381)
> 	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 	at scala.collection.immutable.List.flatMap(List.scala:344)
> 	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
> 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
> 	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
> 	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
> 	at taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> {code}
[jira] [Commented] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808663#comment-15808663 ]

Shuai Lin commented on SPARK-19123:
-----------------------------------
IIUC {{KeyProviderException}} means the storage account key is not configured properly. Are you sure the way you specify the key is correct? Have you checked the Azure developer docs for it? BTW I don't think this is a "critical issue", so I changed it to "minor".
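The comment above points at the key configuration. For reference, the hadoop-azure (WASB) connector looks the account key up under a property whose name embeds the storage account's blob endpoint. A sketch of the corresponding core-site.xml entry, where `myaccount` and the value are hypothetical placeholders (not from this ticket):

{code:xml}
<!-- Hypothetical example: replace "myaccount" with the real storage account
     name and the value with the base64 account key from the Azure portal. -->
<property>
  <name>fs.azure.account.key.myaccount.blob.core.windows.net</name>
  <value>BASE64_ENCODED_ACCOUNT_KEY==</value>
</property>
{code}

The same pair can equivalently be set via `.config(...)` on the SparkSession builder, as the reporter's snippet does; the empty account name in that snippet suggests the property name was malformed.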
[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuai Lin updated SPARK-19123:
------------------------------
    Labels: newbie  (was: features newbie test)
[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuai Lin updated SPARK-19123:
------------------------------
    Priority: Minor  (was: Critical)
[jira] [Updated] (SPARK-18941) "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-18941:
----------------------------
    Assignee: Dongjoon Hyun

> "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18941
>                 URL: https://issues.apache.org/jira/browse/SPARK-18941
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: luat
>            Assignee: Dongjoon Hyun
>             Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Spark thrift server, Spark 2.0.2: the "drop table" command doesn't delete the directory associated with the Hive table (not an EXTERNAL table) from the HDFS file system.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18941) "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-18941:
----------------------------
    Component/s:     (was: Java API)
                 SQL
[jira] [Resolved] (SPARK-18941) "Drop Table" command doesn't delete the directory of the managed Hive table when users specifying locations
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-18941.
-----------------------------
       Resolution: Resolved
    Fix Version/s: 2.2.0
                   2.1.1
                   2.0.3
[jira] [Updated] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
[ https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated SPARK-19122:
--------------------------------
    Description:

`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in that order).

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table2")
{code}
Now, if the join predicates are specified in the query in the *same* order as the bucketing and sort order, there is no shuffle and sort:
{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND a.k=b.k").explain(true)
== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
:     +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
      +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct
{code}
The same query with join predicates in a *different* order from the bucketing and sort order leads to an extra shuffle and sort being introduced:
{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j").explain(true)
== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
:     +- *Project [i#60, j#61, k#62]
:        +- *Filter (isnotnull(k#62) && isnotnull(j#61))
:           +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct
+- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k#101, j#100, 200)
      +- *Project [i#99, j#100, k#101]
         +- *Filter (isnotnull(j#100) && isnotnull(k#101))
            +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct
{code}

  was:
`table1` and `table2` are sorted and bucketed on columns `j` and `k` (in that order).

This is how they are generated:
{code}
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table1")
df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table2")
{code}
Now, if the join predicates are specified in the query in the *same* order as the bucketing and sort order, there is no shuffle and sort:
{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND a.k=b.k").explain(true)
== Physical Plan ==
*SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
:- *Project [i#60, j#61, k#62]
:  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
:     +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct
+- *Project [i#99, j#100, k#101]
   +- *Filter (isnotnull(j#100) && isnotnull(k#101))
      +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2], PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct
{code}
The same query with join predicates in a *different* order from the bucketing and sort order leads to an extra shuffle and sort being introduced:
{code}
scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j").explain(true)
== Physical Plan ==
*SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
:- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(k#62, j#61, 200)
:     +- *Project [i#60, j#61, k#62]
:        +- *Filter (isnotnull(k#62) && isnotnull(j#61))
:           +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), Is
[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808507#comment-15808507 ]

Tejas Patil commented on SPARK-18067:
-------------------------------------
[~hvanhovell]:
* You are right about the data distribution. Both approaches are prone to OOMs, but my approach is more likely to OOM depending on the data. If the tables are bucketed + sorted on a single column, both approaches will distribute the data in the same manner. When I checked Hive's behavior, it does the same thing that I mentioned, and it works across all our workloads.
* Adding a shuffle makes the job less performant in general, irrespective of data distribution. I assume that "significant performance problems" will only happen if the data is heavily skewed on the join key?
* I have seen OOMs happen with the current Spark approach (which hashes everything). I personally feel that the row buffering should be backed by disk spill to be more reliable. It will run a bit slower, but at least it will be reliable and not page people in the middle of the night due to OOM. I have seen another scenario in SPARK-19122, but the proposed solution there will help this issue as well.

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --------------------------------------------------------------------------
>
>                 Key: SPARK-18067
>                 URL: https://issues.apache.org/jira/browse/SPARK-18067
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Paul Jones
>            Priority: Minor
>
> Basic setup:
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
>     .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
>     .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key:
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>    :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
>    +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter, our join gets less efficient and we end up with a shuffle:
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>    :- Sort [value1#1 ASC,key#0 ASC], false, 0
>    :  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>    :     +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
>    +- Sort [value2#13 ASC,key#12 ASC], false, 0
>       +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>          +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be pushed into the join:
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>    +- SortMergeJoin [key#0], [key#12]
>       :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
>       +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here?
[jira] [Comment Edited] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
[ https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808485#comment-15808485 ] Tejas Patil edited comment on SPARK-19122 at 1/8/17 1:44 AM:

[~hvanhovell] : At a broader level, I see that when the nodes for physical operators are created, the `requiredChildDistribution` and `requiredChildOrdering` are pretty much fixed at that point. Once planning is done and the child nodes are inspected, there is no way to change them. To fix this, my take would be:
* Allow the `requiredChildDistribution` and `requiredChildOrdering` to be decided based on the operator's children, i.e. change `requiredChildOrdering()` to `requiredChildOrdering(childOrderings)`. The default behavior would be to ignore the input `childOrderings`.
* For operators like `SortMergeJoinExec`, where the distribution and ordering requirements are much stricter, the operator itself could decide what ordering needs to be used.

||child output ordering||ordering of join keys in query||shuffle+sort needed||
| a, b | a, b | No |
| a, b | b, a | No |
| a, b, c, d | a, b | No |
| a, b, c, d | b, c | Yes |
| a, b | a, b, c, d | Yes |
| b, c | a, b, c, d | Yes |

Even SPARK-18067 will benefit from this change. Let me know if you have any opinion about this approach or have something better in mind.

> Unnecessary shuffle+sort added if join predicates ordering differ from
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.0
> Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in
> respective order). This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as the bucketing
> and sort order, there is no shuffle and sort.
> {code} > scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND > a.k=b.k").explain(true) > == Physical Plan == > *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner > :- *Project [i#60, j#61, k#62] > : +- *Filter (isnotnull(k#62) && isnotnull(j#61)) > : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: > ORC, Location: > InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], > ReadSchema: struct > +- *Project [i#99, j#100, k#101] >+- *Filter (isnotnull(j#100) && isnotnull(k#101)) > +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2], > PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], > ReadSchema: struct > {code} > The same query with join predicates in *different* order from bucketing and > sort order leads to extra shuffle and sort being introduced > {code} > scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j > ").explain(true) > == Physical Plan == > *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner > :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], fa
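The decision table in the comment above reduces to a single condition: the shuffle+sort is avoidable exactly when the join keys form a permutation of a prefix of the child's output ordering. A minimal sketch of that condition in plain Scala (the object and method names are hypothetical; this is not Spark's planner API):

```scala
// Sketch of the condition behind the table above: shuffle+sort can be
// skipped iff the join keys are a permutation of a prefix of the child's
// output ordering (assuming the operator may pick its own sort-key order).
object JoinOrderingCheck {
  def needsShuffleSort(childOrdering: Seq[String], joinKeys: Seq[String]): Boolean = {
    // the first joinKeys.length columns of the child's output ordering
    val prefix = childOrdering.take(joinKeys.length)
    joinKeys.length > childOrdering.length || joinKeys.toSet != prefix.toSet
  }
}
```

For example, `needsShuffleSort(Seq("a", "b", "c", "d"), Seq("b", "c"))` is `true` while `needsShuffleSort(Seq("a", "b"), Seq("b", "a"))` is `false`, matching the corresponding rows of the table.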
[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
[ https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808485#comment-15808485 ] Tejas Patil commented on SPARK-19122:

[~hvanhovell] : At a broader level, I see that when the nodes for physical operators are created, the `requiredChildDistribution` and `requiredChildOrdering` are pretty much fixed at that point. Once planning is done and the child nodes are inspected, there is no way to change them. To fix this, my take would be:
* Allow the `requiredChildDistribution` and `requiredChildOrdering` to be decided based on the operator's children, i.e. change `requiredChildOrdering()` to `requiredChildOrdering(childOrderings)`. The default behavior would be to ignore the input `childOrderings`.
* For operators like `SortMergeJoinExec`, where the distribution and ordering requirements are much stricter, the operator itself could decide what ordering needs to be used.

||child output ordering||ordering of join keys in query||shuffle+sort needed||
| a, b | a, b | No |
| a, b | b, a | No |
| a, b, c, d | a, b | No |
| a, b, c, d | b, c | Yes |
| a, b | a, b, c, d | Yes |
| b, c | a, b, c, d | Yes |

Even SPARK-17271 will benefit from this change. Let me know if you have any opinion about this approach or have something better in mind.
> Unnecessary shuffle+sort added if join predicates ordering differ from > bucketing and sorting order > -- > > Key: SPARK-19122 > URL: https://issues.apache.org/jira/browse/SPARK-19122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in > respective order) > This is how they are generated: > {code} > val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", > "k").coalesce(1) > df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, > "j", "k").sortBy("j", "k").saveAsTable("table1") > df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, > "j", "k").sortBy("j", "k").saveAsTable("table2") > {code} > Now, if join predicates are specified in query in *same* order as bucketing > and sort order, there is no shuffle and sort. > {code} > scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND > a.k=b.k").explain(true) > == Physical Plan == > *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner > :- *Project [i#60, j#61, k#62] > : +- *Filter (isnotnull(k#62) && isnotnull(j#61)) > : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: > ORC, Location: > InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], > ReadSchema: struct > +- *Project [i#99, j#100, k#101] >+- *Filter (isnotnull(j#100) && isnotnull(k#101)) > +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2], > PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], > ReadSchema: struct > {code} > The same query with join predicates in *different* order from bucketing and > sort order leads to extra shuffle and sort being introduced > {code} > scala> hc.sql("SELECT * FROM 
table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j > ").explain(true) > == Physical Plan == > *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner > :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(k#62, j#61, 200) > : +- *Project [i#60, j#61, k#62] > :+- *Filter (isnotnull(k#62) && isnotnull(j#61)) > : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, > Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], > ReadSchema: struct > +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(k#101, j#100, 200) > +- *Project [i#99, j#100, k#101] > +- *Filter (isnotnull(j#100) && isnotnull(k#101)) > +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, > Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table2], > PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], > ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
[ https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808479#comment-15808479 ] Tejas Patil commented on SPARK-19122:

When a `SortMergeJoinExec` node is created, the join keys in both the `left` and `right` relations are extracted in their order of appearance in the query (see [0]). Later, that list of keys is used as-is to define the required distribution (see [1]) and ordering (see [2]) for the sort merge join node. Since the ordering matters here (i.e. `ClusteredDistribution(a,b) != ClusteredDistribution(b,a)`), this mismatches the distribution and ordering of the children, so `EnsureRequirements` ends up adding a shuffle + sort.

[0]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L103
[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L80
[2]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L85

> Unnecessary shuffle+sort added if join predicates ordering differ from
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.0
> Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in
> respective order). This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if join predicates are specified in query in *same* order as
bucketing > and sort order, there is no shuffle and sort. > {code} > scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND > a.k=b.k").explain(true) > == Physical Plan == > *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner > :- *Project [i#60, j#61, k#62] > : +- *Filter (isnotnull(k#62) && isnotnull(j#61)) > : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: > ORC, Location: > InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], > ReadSchema: struct > +- *Project [i#99, j#100, k#101] >+- *Filter (isnotnull(j#100) && isnotnull(k#101)) > +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2], > PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], > ReadSchema: struct > {code} > The same query with join predicates in *different* order from bucketing and > sort order leads to extra shuffle and sort being introduced > {code} > scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j > ").explain(true) > == Physical Plan == > *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner > :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(k#62, j#61, 200) > : +- *Project [i#60, j#61, k#62] > :+- *Filter (isnotnull(k#62) && isnotnull(j#61)) > : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, > Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], > ReadSchema: struct > +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(k#101, j#100, 200) > +- *Project [i#99, j#100, k#101] > +- *Filter (isnotnull(j#100) && isnotnull(k#101)) > +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, > Format: ORC, 
Location: InMemoryFileIndex[file:/spark-warehouse/table2], > PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], > ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
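The order sensitivity described in the comment above can be seen with a toy model. The case class below is a simplified stand-in for illustration only, not Spark's actual `ClusteredDistribution`: because the requirement is built from the join keys in their order of appearance, a child clustered on `(j, k)` does not compare equal to a requirement expressed as `(k, j)` even though the physical data layout would satisfy it.

```scala
// Toy stand-in (not Spark's real class) showing why EnsureRequirements sees
// a mismatch when only the ordering of otherwise-identical keys differs.
case class ClusteredDistribution(keys: Seq[String])

object DistributionDemo {
  val childOutput = ClusteredDistribution(Seq("j", "k")) // from bucketing/sorting
  val required    = ClusteredDistribution(Seq("k", "j")) // from predicate order in query
  // true => an Exchange + Sort get inserted despite the compatible data layout
  def mismatch: Boolean = childOutput != required
}
```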
[jira] [Created] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
Saulo Ricci created SPARK-19123:
---
Summary: KeyProviderException when reading Azure Blobs from Apache Spark
Key: SPARK-19123
URL: https://issues.apache.org/jira/browse/SPARK-19123
Project: Spark
Issue Type: Question
Components: Input/Output, Java API
Affects Versions: 2.0.0
Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster version 3.5 with Hadoop version 2.7.3
Reporter: Saulo Ricci
Priority: Critical

I created a Spark job intended to read a set of JSON files from an Azure Blob container. I set the key and reference to my storage and I'm reading the files as shown in the snippet below:
{code:java}
SparkSession sparkSession = SparkSession.builder().appName("Pipeline")
        .master("yarn")
        .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
        .config("fs.azure.account.key..blob.core.windows.net", "")
        .getOrCreate();

Dataset txs = sparkSession.read().json("wasb://path_to_files");
{code}
The point is that I'm unfortunately getting an `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from the Azure storage.
According to the trace shown below, it seems the header is too long, but I'm still trying to figure out what exactly that means:
{code:java}
17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
140473279682200:error:0D07207B:asn1 encoding routines:ASN1_get_object:header too long:asn1_lib.c:157:
140473279682200:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
140473279682200:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
140473279682200:error:0D07207B:asn1 encoding routines:ASN1_get_object:header too long:asn1_lib.c:157:
140473279682200:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
140473279682200:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
	at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
	at taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
{code}
I'm using an Apache Spark 2.0.0 cluster set on top of an HDInsight Azure cluster and I'd like to find a solution to this.
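One observation on the trace above: on HDInsight, the storage account key in `core-site.xml` is typically stored encrypted and decrypted through `org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider`, and the `Error reading S/MIME message` output is OpenSSL failing to decrypt what it was handed. Supplying a plain-text key while the shell-decryption key provider is still configured is one known cause of this symptom. A possible workaround, sketched below under that assumption (not a verified fix; `youraccount`, `yourcontainer`, and `PLAIN_TEXT_KEY` are placeholders), is to explicitly select the plain-text `SimpleKeyProvider` for the account; the `wasb` filesystem itself is already registered on HDInsight:

```scala
// Sketch only: force the plain-text key provider for this account so that
// hadoop-azure does not try to shell-decrypt the supplied key.
// "youraccount", "yourcontainer", and "PLAIN_TEXT_KEY" are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Pipeline")
  .master("yarn")
  .config("spark.hadoop.fs.azure.account.key.youraccount.blob.core.windows.net",
          "PLAIN_TEXT_KEY")
  .config("spark.hadoop.fs.azure.account.keyprovider.youraccount.blob.core.windows.net",
          "org.apache.hadoop.fs.azure.SimpleKeyProvider")
  .getOrCreate()

val txs = spark.read.json("wasb://yourcontainer@youraccount.blob.core.windows.net/path_to_files")
```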
[jira] [Created] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
Tejas Patil created SPARK-19122: --- Summary: Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order Key: SPARK-19122 URL: https://issues.apache.org/jira/browse/SPARK-19122 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2 Reporter: Tejas Patil `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in respective order) This is how they are generated: {code} val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1) df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table1") df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table2") {code} Now, if join predicates are specified in query in *same* order as bucketing and sort order, there is no shuffle and sort. {code} scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND a.k=b.k").explain(true) == Physical Plan == *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner :- *Project [i#60, j#61, k#62] : +- *Filter (isnotnull(k#62) && isnotnull(j#61)) : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct +- *Project [i#99, j#100, k#101] +- *Filter (isnotnull(j#100) && isnotnull(k#101)) +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table2], PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct {code} The same query with join predicates in *different* order from bucketing and sort order leads to extra shuffle and sort being introduced {code} scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
").explain(true) == Physical Plan == *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(k#62, j#61, 200) : +- *Project [i#60, j#61, k#62] :+- *Filter (isnotnull(k#62) && isnotnull(j#61)) : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k#101, j#100, 200) +- *Project [i#99, j#100, k#101] +- *Filter (isnotnull(j#100) && isnotnull(k#101)) +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/spark-warehouse/table2], PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19120) Returned an Empty Result after Loading a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19120: Assignee: Xiao Li (was: Apache Spark) > Returned an Empty Result after Loading a Hive Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test (a STRING) > |STORED AS PARQUET > """.stripMargin) >spark.table("test").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test >""".stripMargin) > spark.table("test").show() > {noformat} > The returned result is empty after table loading. We should refresh the > metadata cache after loading the data to the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
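Until the fix lands, a user-side workaround consistent with the description above is to invalidate the cached metadata manually after loading (a sketch only; the path is a placeholder, and `spark`/`sql` refer to an active `SparkSession` as in the spark-shell):

```scala
// Sketch of a user-side workaround: refresh the cached table metadata after
// LOAD DATA so the newly loaded files become visible. The path is a placeholder.
sql("LOAD DATA LOCAL INPATH '/tmp/new_partition_dir' OVERWRITE INTO TABLE test")
spark.catalog.refreshTable("test") // equivalently: sql("REFRESH TABLE test")
spark.table("test").show()
```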
[jira] [Assigned] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables
[ https://issues.apache.org/jira/browse/SPARK-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19121: Assignee: Apache Spark (was: Xiao Li) > No need to refresh metadata cache for non-partitioned Hive tables > - > > Key: SPARK-19121 > URL: https://issues.apache.org/jira/browse/SPARK-19121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Only partitioned Hive tables need a metadata refresh after insertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19120) Returned an Empty Result after Loading a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808406#comment-15808406 ] Apache Spark commented on SPARK-19120: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16500 > Returned an Empty Result after Loading a Hive Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test (a STRING) > |STORED AS PARQUET > """.stripMargin) >spark.table("test").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test >""".stripMargin) > spark.table("test").show() > {noformat} > The returned result is empty after table loading. We should refresh the > metadata cache after loading the data to the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19120) Returned an Empty Result after Loading a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19120: Assignee: Apache Spark (was: Xiao Li) > Returned an Empty Result after Loading a Hive Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test (a STRING) > |STORED AS PARQUET > """.stripMargin) >spark.table("test").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test >""".stripMargin) > spark.table("test").show() > {noformat} > The returned result is empty after table loading. We should refresh the > metadata cache after loading the data to the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables
[ https://issues.apache.org/jira/browse/SPARK-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808407#comment-15808407 ] Apache Spark commented on SPARK-19121: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16500 > No need to refresh metadata cache for non-partitioned Hive tables > - > > Key: SPARK-19121 > URL: https://issues.apache.org/jira/browse/SPARK-19121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Only partitioned Hive tables need a metadata refresh after insertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables
[ https://issues.apache.org/jira/browse/SPARK-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19121: Assignee: Xiao Li (was: Apache Spark) > No need to refresh metadata cache for non-partitioned Hive tables > - > > Key: SPARK-19121 > URL: https://issues.apache.org/jira/browse/SPARK-19121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Only partitioned Hive tables need a metadata refresh after insertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19120: Description: {noformat} sql( """ |CREATE TABLE test (a STRING) |STORED AS PARQUET """.stripMargin) spark.table("test").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test """.stripMargin) spark.table("test").show() {noformat} The returned result is empty after table loading. We should refresh the metadata cache after loading the data to the table. was: {noformat} sql( """ |CREATE TABLE test_added_partitions (a STRING) |PARTITIONED BY (b INT) |STORED AS PARQUET """.stripMargin) // Create partition without data files and check whether it can be read sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") spark.table("test_added_partitions").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test_added_partitions PARTITION(b='1') """.stripMargin) spark.table("test_added_partitions").show() {noformat} The returned result is empty after table loading. We should refresh the metadata cache after loading the data to the table. This only happens if users manually add the partition by {{ALTER TABLE ADD PARTITION}} > Returned an Empty Result after Loading a Hive Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test (a STRING) > |STORED AS PARQUET > """.stripMargin) >spark.table("test").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test >""".stripMargin) > spark.table("test").show() > {noformat} > The returned result is empty after table loading. We should refresh the > metadata cache after loading the data to the table. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19120: Summary: Returned an Empty Result after Loading a Hive Table (was: Returned an Empty Result after Loading a Hive Partitioned Table) > Returned an Empty Result after Loading a Hive Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test_added_partitions (a STRING) > |PARTITIONED BY (b INT) > |STORED AS PARQUET > """.stripMargin) > // Create partition without data files and check whether it can be > read > sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") > spark.table("test_added_partitions").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test_added_partitions PARTITION(b='1') >""".stripMargin) > spark.table("test_added_partitions").show() > {noformat} > The returned result is empty after table loading. We should refresh the > metadata cache after loading the data to the table. > This only happens if users manually add the partition by {{ALTER TABLE ADD > PARTITION}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19121) No need to refresh metadata cache for non-partitioned Hive tables
Xiao Li created SPARK-19121: --- Summary: No need to refresh metadata cache for non-partitioned Hive tables Key: SPARK-19121 URL: https://issues.apache.org/jira/browse/SPARK-19121 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li Only partitioned Hive tables need a metadata refresh after insertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808216#comment-15808216 ] Saikat Kanjilal commented on SPARK-9487:

[~srowen] The build/tests have passed in Jenkins; what do you think? Should we commit a little at a time to keep this moving forward, or should I make the next change? I prefer to create a new pull request for each unit test area that I fix, so my preference would be to commit this small change for ContextCleaner.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core, SQL, Tests
> Affects Versions: 1.5.0
> Reporter: Xiangrui Meng
> Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other
> components. If the operation depends on partition IDs, e.g., random number
> generator, this will lead to different result in Python and Scala/Java. It
> would be nice to use the same number in all unit tests.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17204: Assignee: Apache Spark > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Apache Spark > > We use the {{OFF_HEAP}} storage level extensively with great success. We've > tried off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.
{code}
[jira] [Assigned] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17204: Assignee: (was: Apache Spark) > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the {{OFF_HEAP}} storage level extensively with great success. We've > tried off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon
{code}
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808197#comment-15808197 ] Apache Spark commented on SPARK-17204: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/16499 > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the {{OFF_HEAP}} storage level extensively with great success. We've > tried off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStage
{code}
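The configuration the SPARK-17204 report describes can be sketched as follows. This is an assumption based on the issue text, not a verified reproduction; in Spark 2.0 the built-in {{StorageLevel.OFF_HEAP}} uses replication factor 1, so a custom level is needed to replicate off-heap blocks:

{code}
import org.apache.spark.storage.StorageLevel

// Off-heap caching with replication factor 2, the combination the
// report says triggers deserialization errors on the executors.
val offHeapReplicated2 = StorageLevel(
  useDisk = true, useMemory = true, useOffHeap = true,
  deserialized = false, replication = 2)

val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(offHeapReplicated2)
rdd.count()  // per the report, reading a replica can hit corrupt bytes
{code}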
[jira] [Commented] (SPARK-19118) Percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808173#comment-15808173 ] gagan taneja commented on SPARK-19118: -- I had filed the original JIRA to provide the functionality for both Percentile and Approx Percentile. After reviewing the code I realized that the two function implementations are very different, and it would be best to divide them into two sub-tasks. I have created one sub-task for adding support in the Percentile function. For implementing the functionality in Approx Percentile I would need to work with a developer to find the right approach and make the code changes. This is tracked with subtask https://issues.apache.org/jira/browse/SPARK-19119 > Percentile support for frequency distribution table > --- > > Key: SPARK-19118 > URL: https://issues.apache.org/jira/browse/SPARK-19118 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.2 >Reporter: gagan taneja > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
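The feature requested above amounts to evaluating a percentile directly over a (value, count) frequency table instead of expanding each row. A minimal sketch of that computation (the linear-interpolation rule, position = p * (N - 1) over the sorted data, is an assumption matching the common SQL definition, not Spark's final implementation):

{code}
// Exact percentile from a frequency table, walking cumulative counts
// rather than materializing each row.
def percentileFromFreq(freq: Seq[(Double, Long)], p: Double): Double = {
  val items = freq.sortBy(_._1)            // sort by value
  val total = items.map(_._2).sum
  val pos   = p * (total - 1)              // fractional row index
  val loIdx = pos.toLong
  val frac  = pos - loIdx

  // Value at row index i of the (virtually) expanded data.
  def valueAt(i: Long): Double = {
    var seen = 0L
    for ((v, c) <- items) {
      seen += c
      if (i < seen) return v
    }
    items.last._1
  }

  val lo = valueAt(loIdx)
  val hi = valueAt(loIdx + 1)
  lo + frac * (hi - lo)
}
{code}

For example, {{percentileFromFreq(Seq((1.0, 2L), (10.0, 2L)), 0.5)}} interpolates between rows 1 and 2 of the expanded data and yields 5.5.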
[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19120: Description: {noformat} sql( """ |CREATE TABLE test_added_partitions (a STRING) |PARTITIONED BY (b INT) |STORED AS PARQUET """.stripMargin) // Create partition without data files and check whether it can be read sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") spark.table("test_added_partitions").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test_added_partitions PARTITION(b='1') """.stripMargin) spark.table("test_added_partitions").show() {noformat} The returned result is empty after table loading. We should refresh the metadata cache after loading the data to the table. This only happens if users manually add the partition by {{ALTER TABLE ADD PARTITION}} was: {noformat} sql( """ |CREATE TABLE test_added_partitions (a STRING) |PARTITIONED BY (b INT) |STORED AS PARQUET """.stripMargin) // Create partition without data files and check whether it can be read sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") spark.table("test_added_partitions").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test_added_partitions PARTITION(b='1') """.stripMargin) spark.table("test_added_partitions").show() {noformat} The returned result is empty after table loading. We should do metadata refresh after loading the data to the table. 
This only happens if users manually add the partition by {{ALTER TABLE ADD PARTITION}} > Returned an Empty Result after Loading a Hive Partitioned Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test_added_partitions (a STRING) > |PARTITIONED BY (b INT) > |STORED AS PARQUET > """.stripMargin) > // Create partition without data files and check whether it can be > read > sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") > spark.table("test_added_partitions").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test_added_partitions PARTITION(b='1') >""".stripMargin) > spark.table("test_added_partitions").show() > {noformat} > The returned result is empty after table loading. We should refresh the > metadata cache after loading the data to the table. > This only happens if users manually add the partition by {{ALTER TABLE ADD > PARTITION}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19120: Labels: correctness (was: ) > Returned an Empty Result after Loading a Hive Partitioned Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > {noformat} > sql( > """ > |CREATE TABLE test_added_partitions (a STRING) > |PARTITIONED BY (b INT) > |STORED AS PARQUET > """.stripMargin) > // Create partition without data files and check whether it can be > read > sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") > spark.table("test_added_partitions").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test_added_partitions PARTITION(b='1') >""".stripMargin) > spark.table("test_added_partitions").show() > {noformat} > The returned result is empty after table loading. We should do metadata > refresh after loading the data to the table. > This only happens if users manually add the partition by {{ALTER TABLE ADD > PARTITION}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808100#comment-15808100 ] Xiao Li commented on SPARK-19120: - Will submit a fix soon. > Returned an Empty Result after Loading a Hive Partitioned Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > > {noformat} > sql( > """ > |CREATE TABLE test_added_partitions (a STRING) > |PARTITIONED BY (b INT) > |STORED AS PARQUET > """.stripMargin) > // Create partition without data files and check whether it can be > read > sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") > spark.table("test_added_partitions").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test_added_partitions PARTITION(b='1') >""".stripMargin) > spark.table("test_added_partitions").show() > {noformat} > The returned result is empty after table loading. We should do metadata > refresh after loading the data to the table. > This only happens if users manually add the partition by {{ALTER TABLE ADD > PARTITION}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table
[ https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19120: Description: {noformat} sql( """ |CREATE TABLE test_added_partitions (a STRING) |PARTITIONED BY (b INT) |STORED AS PARQUET """.stripMargin) // Create partition without data files and check whether it can be read sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") spark.table("test_added_partitions").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test_added_partitions PARTITION(b='1') """.stripMargin) spark.table("test_added_partitions").show() {noformat} The returned result is empty after table loading. We should do metadata refresh after loading the data to the table. This only happens if users manually add the partition by {{ALTER TABLE ADD PARTITION}} was: {noformat} sql( """ |CREATE TABLE test_added_partitions (a STRING) |PARTITIONED BY (b INT) |STORED AS PARQUET """.stripMargin) // Create partition without data files and check whether it can be read sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") spark.table("test_added_partitions").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test_added_partitions PARTITION(b='1') """.stripMargin) spark.table("test_added_partitions").show() {noformat} The returned result is empty after table loading. We should do metadata refresh after loading the data to the table. 
> Returned an Empty Result after Loading a Hive Partitioned Table > --- > > Key: SPARK-19120 > URL: https://issues.apache.org/jira/browse/SPARK-19120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > > {noformat} > sql( > """ > |CREATE TABLE test_added_partitions (a STRING) > |PARTITIONED BY (b INT) > |STORED AS PARQUET > """.stripMargin) > // Create partition without data files and check whether it can be > read > sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") > spark.table("test_added_partitions").show() > sql( > s""" > |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE > |INTO TABLE test_added_partitions PARTITION(b='1') >""".stripMargin) > spark.table("test_added_partitions").show() > {noformat} > The returned result is empty after table loading. We should do metadata > refresh after loading the data to the table. > This only happens if users manually add the partition by {{ALTER TABLE ADD > PARTITION}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19120) Returned an Empty Result after Loading a Hive Partitioned Table
Xiao Li created SPARK-19120: --- Summary: Returned an Empty Result after Loading a Hive Partitioned Table Key: SPARK-19120 URL: https://issues.apache.org/jira/browse/SPARK-19120 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li Priority: Critical {noformat} sql( """ |CREATE TABLE test_added_partitions (a STRING) |PARTITIONED BY (b INT) |STORED AS PARQUET """.stripMargin) // Create partition without data files and check whether it can be read sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')") spark.table("test_added_partitions").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE test_added_partitions PARTITION(b='1') """.stripMargin) spark.table("test_added_partitions").show() {noformat} The returned result is empty after table loading. We should do metadata refresh after loading the data to the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
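Until the fix lands, the metadata refresh the description calls for can be applied by hand. A sketch of that workaround, following the issue's own snippet (this is an assumption based on the description, not the committed fix; the input path is illustrative):

{code}
sql("""
  |LOAD DATA LOCAL INPATH '/tmp/new_partition_dir' OVERWRITE
  |INTO TABLE test_added_partitions PARTITION(b='1')
""".stripMargin)

// Invalidate the cached (empty) file listing for the manually added
// partition so the next scan sees the newly loaded files.
spark.catalog.refreshTable("test_added_partitions")
spark.table("test_added_partitions").show()
{code}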
[jira] [Commented] (SPARK-19115) SparkSQL unsupported the command " create external table if not exist new_tbl like old_tbl"
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808091#comment-15808091 ] Xiao Li commented on SPARK-19115: - When the location of the table is not provided by the user, it becomes a managed table. This is not a bug; it is by design. > SparkSQL unsupported the command " create external table if not exist > new_tbl like old_tbl" > > > Key: SPARK-19115 > URL: https://issues.apache.org/jira/browse/SPARK-19115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: spark2.0.1 hive1.2.1 >Reporter: Xiaochen Ouyang > > Spark 2.0.1 does not support the command "create external table if not > exist new_tbl like old_tbl". > We tried to modify the sqlbase.g4 file, changing > "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier > LIKE source=tableIdentifier > #createTableLike" > to > "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier > LIKE source=tableIdentifier > #createTableLike". > We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". > After that, we found we could run the command "create external table if not > exist new_tbl like old_tbl" successfully; unfortunately, the generated > table's type is MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore > database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19110. --- Resolution: Fixed Fix Version/s: 2.2.0 2.0.3 2.1.1 Issue resolved by pull request 16491 [https://github.com/apache/spark/pull/16491] > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.1.1, 2.0.3, 2.2.0 > > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18948) Add Mean Percentile Rank metric for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-18948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808030#comment-15808030 ] Joseph K. Bradley commented on SPARK-18948: --- Oh I agree we'd need to add the initial RankingEvaluator first. It looks like there is a PR, though I haven't had a chance to look at it yet. > Add Mean Percentile Rank metric for ranking algorithms > -- > > Key: SPARK-18948 > URL: https://issues.apache.org/jira/browse/SPARK-18948 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Danilo Ascione > > Add the Mean Percentile Rank (MPR) metric for ranking algorithms, as > described in the paper : > Hu, Y., Y. Koren, and C. Volinsky. “Collaborative Filtering for Implicit > Feedback Datasets.” In 2008 Eighth IEEE International Conference on Data > Mining, 263–72, 2008. doi:10.1109/ICDM.2008.22. > (http://yifanhu.net/PUB/cf.pdf) (NB: MPR is called "Expected percentile rank" > in the paper) > The ALS algorithm for implicit feedback in Spark ML is based on the same > paper. > Spark ML lacks an implementation of an appropriate metric for implicit > feedback, so the MPR metric can fulfill this use case. > This implementation add the metric to the RankingMetrics class under > org.apache.spark.mllib.evaluation (SPARK-3568), and it uses the same input > (prediction and label pairs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
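The metric requested above is simple to state. A sketch of MPR as defined in the cited paper (the names and signatures here are illustrative, not the RankingMetrics API proposed in the PR):

{code}
// Mean Percentile Rank from Hu, Koren & Volinsky (2008):
//   MPR = sum(r_ui * rank_ui) / sum(r_ui)
// where rank_ui in [0, 1] is the percentile position of item i in user
// u's ranked recommendation list (0.0 = top of the list).
def meanPercentileRank(
    userRecs: Map[String, Seq[String]],      // user -> items, best first
    feedback: Map[(String, String), Double]  // (user, item) -> r_ui
): Double = {
  val weighted = for {
    ((user, item), r) <- feedback.toSeq
    recs <- userRecs.get(user).toSeq
  } yield {
    val idx = recs.indexOf(item)
    // Items absent from the list sit at the bottom: percentile rank 1.0.
    val rank = if (idx < 0) 1.0 else idx.toDouble / math.max(recs.size - 1, 1)
    (r * rank, r)
  }
  weighted.map(_._1).sum / weighted.map(_._2).sum
}
{code}

Lower is better; random recommendations give an expected MPR of about 0.5, which is why the paper uses it as a baseline for implicit-feedback models.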
[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19110: -- Target Version/s: 2.0.3, 2.1.1, 2.2.0 (was: 1.6.4, 2.0.3, 2.1.1, 2.2.0) > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Miao Wang > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19110: -- Assignee: Miao Wang > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Miao Wang >Assignee: Miao Wang > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19106: - Assignee: Sean Owen > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Assignee: Sean Owen >Priority: Trivial > Fix For: 2.1.1, 2.2.0 > > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19106. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16490 [https://github.com/apache/spark/pull/16490] > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Priority: Trivial > Fix For: 2.1.1, 2.2.0 > > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19118) Percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807993#comment-15807993 ]

Sean Owen commented on SPARK-19118:
-----------------------------------

Why are there two child JIRAs? They don't sound that separate, and I'm not sure why they need a parent.

> Percentile support for frequency distribution table
> ---------------------------------------------------
>
>                 Key: SPARK-19118
>                 URL: https://issues.apache.org/jira/browse/SPARK-19118
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: gagan taneja
>
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807969#comment-15807969 ]

Apache Spark commented on SPARK-9487:
-------------------------------------

User 'skanjila' has created a pull request for this issue:
https://github.com/apache/spark/pull/16498

> Use the same num. worker threads in Scala/Python unit tests
> -----------------------------------------------------------
>
>                 Key: SPARK-9487
>                 URL: https://issues.apache.org/jira/browse/SPARK-9487
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core, SQL, Tests
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>              Labels: starter
>         Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other
> components. If the operation depends on partition IDs, e.g., a random number
> generator, this will lead to different results in Python and Scala/Java. It
> would be nice to use the same number in all unit tests.
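The description above notes that operations depending on partition IDs (such as per-partition random number generators) produce different results when Scala tests run with `local[2]` and Python tests with `local[4]`. A minimal plain-Java sketch of that effect, with no Spark dependency (`firstDraws` is a hypothetical stand-in for seeding a generator from each partition ID):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: if each "partition" seeds a Random from its partition ID, the
// parallelism level (local[2] vs local[4]) changes the overall output.
public class PartitionSeedDemo {
    // One draw per partition, seeded by the partition ID (illustrative only).
    static List<Long> firstDraws(int numPartitions) {
        List<Long> out = new ArrayList<>();
        for (int pid = 0; pid < numPartitions; pid++) {
            out.add(new Random(pid).nextLong());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(firstDraws(2)); // a "local[2]"-style run
        System.out.println(firstDraws(4)); // a "local[4]"-style run: different overall result
    }
}
```

The first two draws agree across both runs (same seeds 0 and 1), but the aggregate results differ, which is exactly why mismatched thread counts across the Scala and Python suites can make otherwise identical tests disagree.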
[jira] [Assigned] (SPARK-19118) Percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19118:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-19118) Percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19118:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-19118) Percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-19118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807959#comment-15807959 ]

Apache Spark commented on SPARK-19118:
--------------------------------------

User 'tanejagagan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16497
[jira] [Created] (SPARK-19119) Approximate percentile support for frequency distribution table
gagan taneja created SPARK-19119:
--------------------------------

             Summary: Approximate percentile support for frequency distribution table
                 Key: SPARK-19119
                 URL: https://issues.apache.org/jira/browse/SPARK-19119
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: gagan taneja
[jira] [Created] (SPARK-19118) Percentile support for frequency distribution table
gagan taneja created SPARK-19118:
--------------------------------

             Summary: Percentile support for frequency distribution table
                 Key: SPARK-19118
                 URL: https://issues.apache.org/jira/browse/SPARK-19118
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: gagan taneja
[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location
[ https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807849#comment-15807849 ]

Sean Owen commented on SPARK-19117:
-----------------------------------

I think you could skip these tests in Windows. I think the 'assume' method may be good for this? or we've done the same elsewhere.
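Sean Owen's suggestion in this thread is to skip the bash-dependent tests on Windows via an `assume`-style guard (in ScalaTest, a failed assumption cancels the test rather than failing it). A plain-Java sketch of the OS check such a guard would use; the class and method names here are illustrative, not Spark's actual test utilities:

```java
// Sketch: guard script-transformation tests with an OS check so they are
// skipped, not failed, where /bin/bash cannot exist (e.g. Windows/AppVeyor).
public class BashTestGuard {
    // True when the given os.name value identifies a Windows system.
    static boolean isWindows(String osName) {
        return osName.toLowerCase().contains("windows");
    }

    public static void main(String[] args) {
        // Mirrors assume(!isWindows): on Windows the suite would cancel these
        // tests instead of reporting "Cannot run program \"/bin/bash\"" failures.
        if (isWindows(System.getProperty("os.name"))) {
            System.out.println("skipping script transformation tests: bash not expected");
        } else {
            System.out.println("running script transformation tests");
        }
    }
}
```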
[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location
[ https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807801#comment-15807801 ]

Hyukjin Kwon commented on SPARK-19117:
--------------------------------------

Ah, [~srowen], I just noticed some latest PRs dedicated to this functionality were reviewed by you. Could I please ask what you think about this?
[jira] [Comment Edited] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location
[ https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807780#comment-15807780 ]

Hyukjin Kwon edited comment on SPARK-19117 at 1/7/17 4:45 PM:
--------------------------------------------------------------

We can skip all the tests above if we are not going to support these bash in other systems. If we want to support these, we could introduce a hidden option for the bash location.

was (Author: hyukjin.kwon):
We can skip all the tests above if we are not going to support these bash in other systems. If we want to support these, we could introduce a hidden option for this.
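The alternative Hyukjin Kwon floats in this thread is a hidden option for the bash location instead of the hard-coded `/bin/bash`. A minimal plain-Java sketch of that resolution logic; the key name `spark.test.script.bash.location` is purely illustrative and not a real Spark configuration:

```java
import java.util.Map;

// Sketch: resolve the bash executable from a configuration map, falling back
// to the current hard-coded default when the (hypothetical) key is unset.
public class BashLocation {
    static final String DEFAULT_BASH = "/bin/bash";

    // Returns the configured bash path, or /bin/bash if none is configured.
    static String resolve(Map<String, String> conf) {
        return conf.getOrDefault("spark.test.script.bash.location", DEFAULT_BASH);
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of())); // unset: falls back to /bin/bash
        System.out.println(resolve(Map.of(
            "spark.test.script.bash.location", "C:/msys64/usr/bin/bash.exe")));
    }
}
```

With such an option, Windows environments that do have a bash (MSYS2, Git Bash, WSL) could point the tests at it, while all other platforms keep today's behavior unchanged.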
[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location
[ https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807784#comment-15807784 ]

Hyukjin Kwon commented on SPARK-19117:
--------------------------------------

I will submit a PR as soon as any committer decides this please.
[jira] [Resolved] (SPARK-19085) cleanup OutputWriterFactory and OutputWriter
[ https://issues.apache.org/jira/browse/SPARK-19085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19085.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

> cleanup OutputWriterFactory and OutputWriter
> --------------------------------------------
>
>                 Key: SPARK-19085
>                 URL: https://issues.apache.org/jira/browse/SPARK-19085
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.2.0
>
[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location
[ https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807780#comment-15807780 ]

Hyukjin Kwon commented on SPARK-19117:
--------------------------------------

We can skip all the tests above if we are not going to support these bash in other systems. If we want to support these, we could introduce a hidden option for this.
[jira] [Created] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location
Hyukjin Kwon created SPARK-19117:
--------------------------------

             Summary: script transformation does not work on Windows due to fixed bash executable location
                 Key: SPARK-19117
                 URL: https://issues.apache.org/jira/browse/SPARK-19117
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Hyukjin Kwon

There are some tests failed on Windows via AppVeyor as below due to this problem:

{code}
- script *** FAILED *** (553 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- Star Expansion - script transform *** FAILED *** (2 seconds, 375 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- test script transform for stdout *** FAILED *** (2 seconds, 813 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- test script transform for stderr *** FAILED *** (2 seconds, 407 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- test script transform data type *** FAILED *** (171 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- transform *** FAILED *** (359 milliseconds)
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- schema-less transform *** FAILED *** (344 milliseconds)
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- transform with custom field delimiter *** FAILED *** (296 milliseconds)
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
- transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure:
Task 0 in stage 1351.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - transform with SerDe2 *** FAILED *** (437 milliseconds) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1355.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1355.0 (TID 2403, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified - script transformation - schemaless *** FAILED *** (78 milliseconds) ... Cause: org.apache.spark.SparkException: Job aborted due to sta
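All of the failures above trace back to the hard-coded "/bin/bash" executable path, which does not exist on Windows (hence `CreateProcess error=2`). A minimal sketch of a platform-aware launcher, in Python for illustration only -- the function name and shape are hypothetical, not Spark's actual (Scala) script-transformation code:

```python
import os

def shell_command(script):
    # Pick a shell that exists on the current platform; the hypothetical
    # fix would branch like this instead of always using /bin/bash.
    if os.name == "nt":
        # Windows has no /bin/bash; CreateProcess fails with error=2 there
        return ["cmd.exe", "/c", script]
    return ["/bin/bash", "-c", script]

cmd = shell_command("cat")
```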
[jira] [Commented] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source
[ https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807615#comment-15807615 ] Apache Spark commented on SPARK-16101: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16496 > Refactoring CSV data source to be consistent with JSON data source > -- > > Key: SPARK-16101 > URL: https://issues.apache.org/jira/browse/SPARK-16101 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently, the CSV data source has a structure quite different from the JSON > data source, although the two could be largely similar. > It would be great if they shared a similar structure so that some common > issues could be resolved together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807603#comment-15807603 ] Charles Allen commented on SPARK-19111: --- I have not been able to finish root cause stuff, but I know it works for jobs except for our largest spark job. And it consistently fails for that large spark job. > S3 Mesos history upload fails silently if too large > --- > > Key: SPARK-19111 > URL: https://issues.apache.org/jira/browse/SPARK-19111 > Project: Spark > Issue Type: Bug > Components: EC2, Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > {code} > 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped > Spark web UI at http://REDACTED:4041 > 2017-01-06T21:32:32,938 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.jvmGCTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBlocksFetched > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSerializationTime > 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! 
Dropping event SparkListenerExecutorMetricsUpdate( > 364,WrappedArray()) > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSize > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.peakExecutionMemory > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.fetchWaitTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.memoryBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.diskBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.recordsRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorDeserializeTime > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorRunTime > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBlocksFetched > 2017-01-06T21:32:32,943 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' > closed. Now beginning upload > 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray()) > 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray()) > 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray()) > {code} > Running spark on mesos, some large jobs fail to upload to the history server > storage! > A successful sequence of events in the log that yield an upload are as > follows: > {code} > 2017-01-06T19:14:32,925 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp' > 2017-01-06T21:59:14,789 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > closed. Now b
[jira] [Created] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
Shea Parkes created SPARK-19116: --- Summary: LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file Key: SPARK-19116 URL: https://issues.apache.org/jira/browse/SPARK-19116 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.0.2, 2.0.1 Environment: Python 3.5.x Windows 10 Reporter: Shea Parkes We're having some modestly severe issues with broadcast join inference, and I've been chasing them through the join heuristics in the catalyst engine. I've made it as far as I can, and I've hit upon something that does not make any sense to me. I thought that loading from parquet would be a RelationPlan, which would just use the sum of default sizeInBytes for each column times the number of rows. But this trivial example shows that I am not correct: {code} import pyspark.sql.functions as F df_range = session.range(100).select(F.col('id').cast('integer')) df_range.write.parquet('c:/scratch/hundred_integers.parquet') df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet') df_parquet.explain(True) # Expected sizeInBytes integer_default_sizeinbytes = 4 print(df_parquet.count() * integer_default_sizeinbytes) # = 400 # Inferred sizeInBytes print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes()) # = 2318 # For posterity (Didn't really expect this to match anything above) print(df_range._jdf.logicalPlan().statistics().sizeInBytes()) # = 600 {code} And here's the results of explain(True) on df_parquet: {code} In [456]: == Parsed Logical Plan == Relation[id#794] parquet == Analyzed Logical Plan == id: int Relation[id#794] parquet == Optimized Logical Plan == Relation[id#794] parquet == Physical Plan == *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct {code} So basically, I'm not understanding well how the size of the parquet file is being estimated. 
I don't expect it to be extremely accurate, but empirically it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold way too much. (It's not always too high like the example above, it's often way too low.) Without deeper understanding, I'm considering a result of 2318 instead of 400 to be a bug. My apologies if I'm missing something obvious.
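The reporter's expected figure comes from multiplying the row count by per-column default sizes, whereas in this Spark version the Parquet relation's statistic appears to be derived from the files' on-disk size (where, for a 100-row file, footer metadata dominates, making a number like 2318 plausible). A sketch of the naive per-row arithmetic, with an assumed default-size table -- only IntegerType's 4 bytes is confirmed by the reporter's own calculation:

```python
# Hypothetical per-type default sizes for illustration; the 4-byte integer
# entry matches the reporter's arithmetic, the others are assumptions.
DEFAULT_SIZE = {"integer": 4, "long": 8, "double": 8}

def naive_size_in_bytes(num_rows, column_types):
    # rows * sum of per-column defaults, i.e. the 100 * 4 = 400 expected above
    return num_rows * sum(DEFAULT_SIZE[t] for t in column_types)

print(naive_size_in_bytes(100, ["integer"]))  # 400
```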
[jira] [Resolved] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18930. --- Resolution: Not A Problem > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine these numbers are num of records. But! When I do select * > from temp.test_partitioning_4 data is correct.
[jira] [Resolved] (SPARK-15984) WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid" when starting application on YARN
[ https://issues.apache.org/jira/browse/SPARK-15984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15984. --- Resolution: Not A Problem > WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max > attempts: 0 for application: 8 is invalid" when starting application on YARN > -- > > Key: SPARK-15984 > URL: https://issues.apache.org/jira/browse/SPARK-15984 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > When executing {{spark-shell}} on Spark on YARN 2.7.2 on Mac OS as follows: > {code} > YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c > spark.shuffle.service.enabled=true --deploy-mode client -c > spark.scheduler.mode=FAIR > {code} > it ends up with the following WARN in the logs: > {code} > 2016-06-16 08:33:05,308 INFO > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new > applicationId: 8 > 2016-06-16 08:33:07,305 WARN > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific > max attempts: 0 for application: 8 is invalid, because it is out of the range > [1, 2]. Use the global max attempts instead. > {code}
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807484#comment-15807484 ] Yanbo Liang commented on SPARK-10078: - +1 [~sethah] > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Resolved] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11968. --- Resolution: Not A Problem Target Version/s: (was: ) In Progress vs Open really doesn't matter. It's OK to change it back, but really this one looks like Not A Problem. > ALS recommend all methods spend most of time in GC > -- > > Key: SPARK-11968 > URL: https://issues.apache.org/jira/browse/SPARK-11968 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.2, 1.6.0 >Reporter: Joseph K. Bradley > > After adding recommendUsersForProducts and recommendProductsForUsers to ALS > in spark-perf, I noticed that it takes much longer than ALS itself. Looking > at the monitoring page, I can see it is spending about 8min doing GC for each > 10min task. That sounds fixable. Looking at the implementation, there is > clearly an opportunity to avoid extra allocations: > [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283] > CC: [~mengxr]
[jira] [Updated] (SPARK-13748) Document behavior of createDataFrame and rows with omitted fields
[ https://issues.apache.org/jira/browse/SPARK-13748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13748: -- Assignee: Hyukjin Kwon Priority: Trivial (was: Minor) Summary: Document behavior of createDataFrame and rows with omitted fields (was: createDataFrame and rows with omitted fields) > Document behavior of createDataFrame and rows with omitted fields > - > > Key: SPARK-13748 > URL: https://issues.apache.org/jira/browse/SPARK-13748 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.6.0 >Reporter: Ethan Aubin >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > I found it confusing that a Row with an omitted field is different from a row > with a field present but its value missing. This was originally problematic for > json files with varying fields, but it comes down to something like: > def test(rows): > ds = sc.parallelize(rows) > df = sqlContext.createDataFrame(ds,None,1) > print df[['y']].collect() > test([Row(x=1,y=None),Row(x=2, y='asdf')]) # Works > test([Row(x=1),Row(x=2, y='asdf')]) # Fails with an > ArrayIndexOutOfBoundsException. > Maybe more could be said in the documentation for createDataFrame or Row > about what's expected. Validation or correction would be helpful, as would a > function creating a well-formed row from a StructType and dictionary.
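The report asks for a helper that builds a well-formed row from a schema and a dictionary. A plain-Python sketch of that idea -- Spark's Row/StructType types are deliberately not used here so the example stays self-contained, and the function name is hypothetical:

```python
def normalize_row(field_names, record, default=None):
    # Build a tuple covering every schema field, filling omitted keys with a
    # default so rows like Row(x=1) and Row(x=1, y=None) become equivalent.
    return tuple(record.get(name, default) for name in field_names)

rows = [{"x": 1, "y": None}, {"x": 2, "y": "asdf"}, {"x": 3}]  # last row omits y
normalized = [normalize_row(["x", "y"], r) for r in rows]
print(normalized)  # [(1, None), (2, 'asdf'), (3, None)]
```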
[jira] [Resolved] (SPARK-13748) createDataFrame and rows with omitted fields
[ https://issues.apache.org/jira/browse/SPARK-13748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13748. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 13771 [https://github.com/apache/spark/pull/13771] > createDataFrame and rows with omitted fields > > > Key: SPARK-13748 > URL: https://issues.apache.org/jira/browse/SPARK-13748 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.6.0 >Reporter: Ethan Aubin >Priority: Minor > Fix For: 2.2.0 > > > I found it confusing that a Row with an omitted field is different from a row > with a field present but its value missing. This was originally problematic for > json files with varying fields, but it comes down to something like: > def test(rows): > ds = sc.parallelize(rows) > df = sqlContext.createDataFrame(ds,None,1) > print df[['y']].collect() > test([Row(x=1,y=None),Row(x=2, y='asdf')]) # Works > test([Row(x=1),Row(x=2, y='asdf')]) # Fails with an > ArrayIndexOutOfBoundsException. > Maybe more could be said in the documentation for createDataFrame or Row > about what's expected. Validation or correction would be helpful, as would a > function creating a well-formed row from a StructType and dictionary.
[jira] [Updated] (SPARK-19114) Backpressure could support non-integral rates (< 1)
[ https://issues.apache.org/jira/browse/SPARK-19114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19114: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Summary: Backpressure could support non-integral rates (< 1) (was: Backpressure rate is cast from double to long to double) I think the assumption that it's a long (therefore, at least 1 event per second) is embedded throughout this code; it's not just one instance of casting to a long. It does sound like quite a corner case though, using Spark Streaming to process well under 1 event per second. Is it really a streaming use case? That's not to say it couldn't be fixed but would take some surgery. > Backpressure could support non-integral rates (< 1) > --- > > Key: SPARK-19114 > URL: https://issues.apache.org/jira/browse/SPARK-19114 > Project: Spark > Issue Type: Improvement >Reporter: Tony Novak >Priority: Minor > > We have a Spark streaming job where each record takes well over a second to > execute, so the stable rate is under 1 element/second. We set > spark.streaming.backpressure.enabled=true and > spark.streaming.backpressure.pid.minRate=0.1, but backpressure did not appear > to be effective, even though the TRACE level logs from PIDRateEstimator > showed that the new rate was 0.1. > As it turns out, even though the minRate parameter is a Double, and the rate > estimate generated by PIDRateEstimator is a Double as well, RateController > casts the new rate to a Long. As a result, if the computed rate is less than > 1, it's truncated to 0, which ends up being interpreted as "no limit". > What's particularly confusing is that the Guava RateLimiter class takes a > rate limit as a double, so the long value ends up being cast back to a double. > Is there any reason not to keep the rate limit as a double all the way > through? I'm happy to create a pull request if this makes sense. 
> We encountered the bug on Spark 1.6.2, but it looks like the code in the > master branch is still affected.
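The truncation the reporter describes is easy to see in isolation. A sketch of the reported code path in Python for illustration -- this mimics the described behavior (double rate cast to a long, 0 meaning "no limit"), not Spark's actual Scala RateController:

```python
def effective_rate(new_rate):
    # The PID estimator produces a double, but it is cast to a long before
    # reaching the rate limiter; a computed rate below 1 truncates to 0,
    # which is then interpreted as "no limit at all".
    as_long = int(new_rate)          # analogous to the reported .toLong cast
    return None if as_long == 0 else float(as_long)  # None = unlimited

print(effective_rate(0.1))  # None -> backpressure silently disabled
print(effective_rate(2.7))  # 2.0  -> fractional part silently dropped
```

Keeping the rate as a double end-to-end, as the reporter suggests, would avoid both losses.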
[jira] [Resolved] (SPARK-19099) Wrong time display on Spark History Server web UI
[ https://issues.apache.org/jira/browse/SPARK-19099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19099. --- Resolution: Duplicate It's not wrong, it's in GMT > Wrong time display on Spark History Server web UI > - > > Key: SPARK-19099 > URL: https://issues.apache.org/jira/browse/SPARK-19099 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: JohnsonZhang >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > While using the Spark History Server, I got wrong job start and end times. > I tracked down the reason and found it is caused by the hard-coded TimeZone > rawOffset. > I've changed it to acquire the offset value from the System.
[jira] [Resolved] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18298. --- Resolution: Duplicate > HistoryServer use GMT time all time > --- > > Key: SPARK-18298 > URL: https://issues.apache.org/jira/browse/SPARK-18298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1 > Environment: suse 11.3 with CST time >Reporter: Tao Wang > > When I started HistoryServer for reading event logs, the timestamp readed > will be parsed using local timezone like "CST"(confirmed via debug). > But the time related columns like "Started"/"Completed"/"Last Updated" in > History Server UI using "GMT" time, which is 8 hours earlier than "CST". > {quote} > App IDApp NameStarted Completed DurationSpark > User Last UpdatedEvent Log > local-1478225166651 Spark shell 2016-11-04 02:06:06 2016-11-07 > 01:33:30 71.5 h root2016-11-07 01:33:30 > {quote} > I've checked the REST api and found the result like: > {color:red} > [ { > "id" : "local-1478225166651", > "name" : "Spark shell", > "attempts" : [ { > "startTime" : "2016-11-04T02:06:06.020GMT", > "endTime" : "2016-11-07T01:33:30.265GMT", > "lastUpdated" : "2016-11-07T01:33:30.000GMT", > "duration" : 257244245, > "sparkUser" : "root", > "completed" : true, > "lastUpdatedEpoch" : 147848241, > "endTimeEpoch" : 1478482410265, > "startTimeEpoch" : 1478225166020 > } ] > }, { > "id" : "local-1478224925869", > "name" : "Spark Pi", > "attempts" : [ { > "startTime" : "2016-11-04T02:02:02.133GMT", > "endTime" : "2016-11-04T02:02:07.468GMT", > "lastUpdated" : "2016-11-04T02:02:07.000GMT", > "duration" : 5335, > "sparkUser" : "root", > "completed" : true, > ... > {color} > So maybe the change happened in transferring between server and browser? I > have no idea where to go from this point. > Hope guys can offer some help, or just fix it if it's easy? 
:)
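The epoch fields in the REST payload above are timezone-independent; the "GMT" suffix only reflects how the server formats them, and rendering the same instant in a UTC+8 zone shifts the wall-clock time by the eight hours the reporter observed. A quick illustrative check in Python:

```python
from datetime import datetime, timezone, timedelta

start_epoch_ms = 1478225166020  # startTimeEpoch from the REST payload above

utc = datetime.fromtimestamp(start_epoch_ms / 1000, tz=timezone.utc)
cst = utc.astimezone(timezone(timedelta(hours=8)))  # "CST" here meaning UTC+8

print(utc.strftime("%Y-%m-%dT%H:%M:%S%z"))  # 2016-11-04T02:06:06+0000, matching the GMT string
print(cst.strftime("%Y-%m-%dT%H:%M:%S%z"))  # same instant, eight hours later on the wall clock
```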
[jira] [Commented] (SPARK-19103) In web ui,URL's host name should be a specific IP address.
[ https://issues.apache.org/jira/browse/SPARK-19103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807395#comment-15807395 ] Sean Owen commented on SPARK-19103: --- Are you maybe looking for SPARK_LOCAL_IP? There is no attachment 3. > In web ui,URL's host name should be a specific IP address. > -- > > Key: SPARK-19103 > URL: https://issues.apache.org/jira/browse/SPARK-19103 > Project: Spark > Issue Type: Bug > Components: Web UI > Environment: spark 2.0.2 >Reporter: guoxiaolong >Priority: Minor > Attachments: 1.png, 2.png > > > In the web UI, the URL's host name should be a specific IP address, because opening > the URL requires resolving the host name; when the host name cannot be resolved, > the URL cannot be opened. Please see the attachment. Thank you!
[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807377#comment-15807377 ] Sean Owen commented on SPARK-19111: --- Is there any more detail on why? are you sure it's a size issue, that it isn't just a problem with the network, etc? I'm not sure it's actionable like this. > S3 Mesos history upload fails silently if too large > --- > > Key: SPARK-19111 > URL: https://issues.apache.org/jira/browse/SPARK-19111 > Project: Spark > Issue Type: Bug > Components: EC2, Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > {code} > 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped > Spark web UI at http://REDACTED:4041 > 2017-01-06T21:32:32,938 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.jvmGCTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBlocksFetched > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSerializationTime > 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! 
Dropping event SparkListenerExecutorMetricsUpdate( > 364,WrappedArray()) > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSize > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.peakExecutionMemory > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.fetchWaitTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.memoryBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.diskBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.recordsRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorDeserializeTime > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorRunTime > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBlocksFetched > 2017-01-06T21:32:32,943 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' > closed. Now beginning upload > 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray()) > 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray()) > 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray()) > {code} > Running spark on mesos, some large jobs fail to upload to the history server > storage! > A successful sequence of events in the log that yield an upload are as > follows: > {code} > 2017-01-06T19:14:32,925 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp' > 2017-01-06T21:59:14,789 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > closed. Now beginning upload
[jira] [Updated] (SPARK-19112) add codec for ZStandard
[ https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19112: -- Priority: Minor (was: Major) Is this something that Spark would just get for free from HDFS APIs -- is there a change here other than using a later version of Hadoop? > add codec for ZStandard > --- > > Key: SPARK-19112 > URL: https://issues.apache.org/jira/browse/SPARK-19112 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Thomas Graves >Priority: Minor > > ZStandard: https://github.com/facebook/zstd and > http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was > recently released. Hadoop > (https://issues.apache.org/jira/browse/HADOOP-13578) and others > (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it. > Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807248#comment-15807248 ] Liang-Chi Hsieh commented on SPARK-7768: Hi Randall, With {{UDTRegistration}}, added in 2.0, we can already make a UDT work with a class from an unmodified third-party library. You can define the user-defined type ({{UserDefinedType}}) for the class from the third-party library, then register the defined type against the third-party class with {{UDTRegistration}}: {code} class ThirdPartyClassUDT extends UserDefinedType[ThirdPartyClass] { ... } UDTRegistration.register(classOf[ThirdPartyClass].getName, classOf[ThirdPartyClassUDT].getName) {code} Unfortunately, {{UDTRegistration}} is still private. I hope we can make it public soon as part of this change. > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5.
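The register-by-fully-qualified-name idea behind {{UDTRegistration}} can be sketched as a generic pattern. The following is a toy Python registry illustrating the idea only; it is not Spark's API, and the class names are hypothetical:

```python
# Toy sketch of a name-based type registry, analogous in spirit to
# Spark's UDTRegistration: user classes are mapped to serializer types
# by fully qualified name, so the third-party class itself needs no
# modification. This is NOT Spark's API; names are illustrative.
_registry: dict[str, str] = {}

def register(user_class_name: str, udt_class_name: str) -> None:
    # Last registration wins, mirroring a simple dict-backed registry.
    _registry[user_class_name] = udt_class_name

def get_udt(user_class_name: str):
    # Return the registered UDT class name, or None if unregistered.
    return _registry.get(user_class_name)

# Hypothetical third-party class and its UDT, registered by name only.
register("com.example.ThirdPartyClass", "com.example.ThirdPartyClassUDT")
print(get_udt("com.example.ThirdPartyClass"))
```

Because the mapping is by name, the registration can live entirely in the user's code, which is what makes the approach work for unmodified third-party classes.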
[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
[ https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807077#comment-15807077 ] chie hayashida edited comment on SPARK-17154 at 1/7/17 8:18 AM: [~nsyca], [~cloud_fan], [~sarutak] I have an example code below. h2. Example 1 {code} scala> val df = Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2") df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df2 = df df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= df2("value2")) 17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 'id#171 = id#171'. Perhaps you need to use aliases. df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields] scala> df3.show +---+--+--+---+--+--+ | id|value1|value2| id|value1|value2| +---+--+--+---+--+--+ | 1| 1| 1| 1| 4| 5| | 1| 1| 1| 1| 2| 3| | 1| 1| 1| 1| 1| 1| | 1| 2| 3| 1| 4| 5| | 1| 2| 3| 1| 2| 3| | 1| 2| 3| 1| 1| 1| | 1| 4| 5| 1| 4| 5| | 1| 4| 5| 1| 2| 3| | 1| 4| 5| 1| 1| 1| | 2| 2| 4| 2| 8| 8| | 2| 2| 4| 2| 5| 7| | 2| 2| 4| 2| 2| 4| | 2| 5| 7| 2| 8| 8| | 2| 5| 7| 2| 5| 7| | 2| 5| 7| 2| 2| 4| | 2| 8| 8| 2| 8| 8| | 2| 8| 8| 2| 5| 7| | 2| 8| 8| 2| 2| 4| +---+--+--+---+--+--+ scala> df3.explain == Physical Plan == *BroadcastHashJoin [id#171], [id#178], Inner, BuildRight :- LocalTableScan [id#171, value1#172, value2#173] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [id#178, value1#179, value2#180] {code} h2. Example2 {code} scala> val df = Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2") df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 
1 more field] scala> val df2 = df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22")) df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field] scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= df2("value22")) df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields] scala> df3.show +---+--+--+---+---+---+ | id|value1|value2|id2|value11|value22| +---+--+--+---+---+---+ | 1| 1| 1| 1| 4| 5| | 1| 1| 1| 1| 2| 3| | 1| 1| 1| 1| 1| 1| | 1| 2| 3| 1| 4| 5| | 1| 2| 3| 1| 2| 3| | 1| 4| 5| 1| 4| 5| | 2| 2| 4| 2| 8| 8| | 2| 2| 4| 2| 5| 7| | 2| 2| 4| 2| 2| 4| | 2| 5| 7| 2| 8| 8| | 2| 5| 7| 2| 5| 7| | 2| 8| 8| 2| 8| 8| +---+--+--+---+---+---+ scala> df3.explain == Physical Plan == *BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= value22#245) :- LocalTableScan [id#171, value1#172, value2#173] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [id2#243, value11#244, value22#245] {code} The contents of df3 differ between Example 1 and Example 2. I think the reason for this is the same as SPARK-17154. In the above case I understand the result of Example 1 is incorrect and that of Example 2 is correct. But this issue isn't trivial, and some developers may overlook this buggy code, so I think a permanent fix should be made for this issue. was (Author: hayashidac): [~nsyca], [~cloud_fan], [~sarutak] I have an example code below. h2. Example 1 {code} scala> val df = Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2") df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df2 = df df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= df2("value2")) 17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 'id#171 = id#171'. 
Perhaps you need to use aliases. df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields] scala> df3.show +---+--+--+---+--+--+ | id|value1|value2| id|value1|value2| +---+--+--+---+--+--+ | 1| 1| 1| 1| 4| 5| | 1| 1| 1| 1| 2| 3| | 1| 1| 1| 1| 1| 1| | 1| 2| 3| 1| 4| 5| | 1| 2|
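A toy model of why Example 1 misbehaves: when df2 is literally df, both sides of the join condition resolve to the same attribute reference (the id#171 in the warning), so the comparison is trivially true and the inequality is planned against the wrong columns; aliasing mints fresh references. The Python sketch below mimics Spark's expression IDs in spirit only; the names and mechanics are illustrative, not Catalyst's:

```python
# Toy model of Catalyst-style attribute references. Each column gets a
# unique expression id (like id#171); a self-join without aliasing ends
# up comparing an attribute against itself.
import itertools

_ids = itertools.count(171)

class Attr:
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_ids)   # analogous to Spark's id#171
    def __repr__(self):
        return f"{self.name}#{self.expr_id}"

df_cols = {"id": Attr("id")}
df2_cols = df_cols                  # val df2 = df: same attributes
df2_aliased = {"id2": Attr("id2")}  # .as("id2") mints a fresh attribute

# Same expr_id on both sides -> the predicate is trivially true.
assert df_cols["id"].expr_id == df2_cols["id"].expr_id
# Aliasing yields distinct expr_ids, so the predicate survives planning.
assert df_cols["id"].expr_id != df2_aliased["id2"].expr_id
```

This is why Example 2, which renames every column before the join, keeps the inequality in the physical plan while Example 1 silently loses it.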
[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807078#comment-15807078 ] Apache Spark commented on SPARK-18890: -- User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/15505 > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). 
> I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo]
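The parallelization benefit described in the ticket can be illustrated outside Spark: when one scheduler thread serializes every task, that work cannot overlap with anything else it does, while fanning serialization out to a thread pool produces identical bytes and frees the scheduler thread. A toy sketch, with pickle standing in for Spark's serializer and all names illustrative rather than Spark's:

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a TaskDescription: some metadata plus a payload.
def make_task(i):
    return {"task_id": i, "stage": 0, "payload": list(range(100))}

tasks = [make_task(i) for i in range(1000)]

# Single-threaded: the "scheduler thread" serializes every task itself,
# analogous to serialization inside the TaskSchedulerImpl thread.
serial = [pickle.dumps(t) for t in tasks]

# Fanned out: serialization handed to worker threads, as becomes
# possible once it moves out of the single scheduler thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(pickle.dumps, tasks))

# Identical bytes either way; only where the CPU time is spent changes.
assert serial == parallel
print(f"serialized {len(parallel)} tasks")
```

As the ticket cautions, this only helps when the scheduler thread is the bottleneck; if the receiving side is saturated instead, moving the work changes nothing.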
[jira] [Assigned] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18890: Assignee: (was: Apache Spark) > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). 
> I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo]
[jira] [Assigned] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18890: Assignee: Apache Spark > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Apache Spark >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). 
> I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo]
[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
[ https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807077#comment-15807077 ] chie hayashida commented on SPARK-17154: [~nsyca], [~cloud_fan], [~sarutak] I have an example code below. # Example 1 ``` scala scala> val df = Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2") df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df2 = df df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= df2("value2")) 17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 'id#171 = id#171'. Perhaps you need to use aliases. df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields] scala> df3.show +---+--+--+---+--+--+ | id|value1|value2| id|value1|value2| +---+--+--+---+--+--+ | 1| 1| 1| 1| 4| 5| | 1| 1| 1| 1| 2| 3| | 1| 1| 1| 1| 1| 1| | 1| 2| 3| 1| 4| 5| | 1| 2| 3| 1| 2| 3| | 1| 2| 3| 1| 1| 1| | 1| 4| 5| 1| 4| 5| | 1| 4| 5| 1| 2| 3| | 1| 4| 5| 1| 1| 1| | 2| 2| 4| 2| 8| 8| | 2| 2| 4| 2| 5| 7| | 2| 2| 4| 2| 2| 4| | 2| 5| 7| 2| 8| 8| | 2| 5| 7| 2| 5| 7| | 2| 5| 7| 2| 2| 4| | 2| 8| 8| 2| 8| 8| | 2| 8| 8| 2| 5| 7| | 2| 8| 8| 2| 2| 4| +---+--+--+---+--+--+ scala> df3.explain == Physical Plan == *BroadcastHashJoin [id#171], [id#178], Inner, BuildRight :- LocalTableScan [id#171, value1#172, value2#173] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [id#178, value1#179, value2#180] ``` # Example2 ```scala scala> val df = Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2") df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field] scala> val df2 = df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22")) df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 
1 more field] scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= df2("value22")) df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields] scala> df3.show +---+--+--+---+---+---+ | id|value1|value2|id2|value11|value22| +---+--+--+---+---+---+ | 1| 1| 1| 1| 4| 5| | 1| 1| 1| 1| 2| 3| | 1| 1| 1| 1| 1| 1| | 1| 2| 3| 1| 4| 5| | 1| 2| 3| 1| 2| 3| | 1| 4| 5| 1| 4| 5| | 2| 2| 4| 2| 8| 8| | 2| 2| 4| 2| 5| 7| | 2| 2| 4| 2| 2| 4| | 2| 5| 7| 2| 8| 8| | 2| 5| 7| 2| 5| 7| | 2| 8| 8| 2| 8| 8| +---+--+--+---+---+---+ scala> df3.explain == Physical Plan == *BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= value22#245) :- LocalTableScan [id#171, value1#172, value2#173] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [id2#243, value11#244, value22#245] ``` The contents of df3 differ between Example 1 and Example 2. I think the reason for this is the same as SPARK-17154. In the above case I understand the result of Example 1 is incorrect and that of Example 2 is correct. But this issue isn't trivial, and some developers may overlook this buggy code, so I think a permanent fix should be made for this issue. > Wrong result can be returned or AnalysisException can be thrown after > self-join or similar operations > - > > Key: SPARK-17154 > URL: https://issues.apache.org/jira/browse/SPARK-17154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Kousuke Saruta > Attachments: Name-conflicts-2.pdf, Solution_Proposal_SPARK-17154.pdf > > > When we join two DataFrames which originate from the same DataFrame, > operations on the joined DataFrame can fail. > One reproducible example is as follows. 
> {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val joined = filtered.join(df, filtered("col1") === df("col1"), "inner") > val selected1