[jira] [Assigned] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
[ https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23815: Assignee: (was: Apache Spark)
> Spark writer dynamic partition overwrite mode fails to write output on multi level partition
>
> Key: SPARK-23815
> URL: https://issues.apache.org/jira/browse/SPARK-23815
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Fangshi Li
> Priority: Minor
>
> SPARK-20236 introduced a new writer mode that overwrites only the affected partitions. While using this feature in our production cluster, we found a bug when writing multi-level partitions on HDFS.
> A simple test case to reproduce the issue:
> val df = Seq(("1","2","3")).toDF("col1", "col2","col3")
> df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location")
> If the HDFS location "/my/hdfs/location" does not exist, no output is written. This appears to be caused by the job commit change that SPARK-20236 made in HadoopMapReduceCommitProtocol.
> During job commit, the output has been written to the staging dir /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to /my/hdfs/location/col1=1/col2=2. In our case this rename fails on HDFS because /my/hdfs/location/col1=1 does not exist: HDFS rename cannot create more than one level of missing directories.
> The problem does not show up in the unit tests added with SPARK-20236, which run against the local file system.
> We are proposing a fix: when cleaning the current partition dir /my/hdfs/location/col1=1/col2=2 before the rename, if the delete fails (because /my/hdfs/location/col1=1/col2=2 may not exist), call mkdirs to create the parent dir /my/hdfs/location/col1=1 (if it does not already exist) so that the following rename can succeed.
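The proposed mkdirs-before-rename fix can be sketched with a local-filesystem analogy (plain JDK NIO, not the actual HadoopMapReduceCommitProtocol code; all paths are illustrative):

```scala
import java.nio.file.Files

// Local-filesystem analogy of the proposed fix (illustrative only): like
// HDFS rename, Files.move does not create missing parent directories, so
// the parent of the destination partition dir must be created first.
val base = Files.createTempDirectory("spark-23815-demo")

// Simulate the staged multi-level partition output.
val staged = base.resolve(".spark-staging.xxx/col1=1/col2=2")
Files.createDirectories(staged)

// Final location base/col1=1/col2=2, whose parent col1=1 does not exist yet.
val dest = base.resolve("col1=1/col2=2")

// The fix: ensure the parent dir exists before renaming; otherwise the
// move fails with NoSuchFileException (the analogue of the HDFS failure).
Files.createDirectories(dest.getParent)
Files.move(staged, dest)
```

Here `Files.move` plays the role of `fs.rename` and `Files.createDirectories` the role of the proposed `mkdirs` call.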
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
[ https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23815: Assignee: Apache Spark
[jira] [Commented] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
[ https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418452#comment-16418452 ] Apache Spark commented on SPARK-23815: User 'fangshil' has created a pull request for this issue: https://github.com/apache/spark/pull/20931
[jira] [Created] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
Fangshi Li created SPARK-23815: Summary: Spark writer dynamic partition overwrite mode fails to write output on multi level partition Key: SPARK-23815 URL: https://issues.apache.org/jira/browse/SPARK-23815 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Fangshi Li
[jira] [Commented] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418434#comment-16418434 ] Li Yuanjian commented on SPARK-23811: The scenario can be reproduced by the test case below, added in {{DAGSchedulerSuite}}:
{code:scala}
/**
 * This tests the case where the origin task succeeds after its speculative
 * attempt got a FetchFailed earlier.
 */
test("[SPARK-23811] Fetch failed task should kill other attempt") {
  // Create 3 RDDs with shuffle dependencies on each other: rddA <--- rddB <--- rddC
  val rddA = new MyRDD(sc, 2, Nil)
  val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
  val shuffleIdA = shuffleDepA.shuffleId

  val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
  val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))

  val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)

  submit(rddC, Array(0, 1))

  // Complete both tasks in rddA.
  assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostA", 2)),
    (Success, makeMapStatus("hostB", 2))))

  // The first task succeeds.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(0), Success, makeMapStatus("hostB", 2)))

  // The second task's speculative attempt fails first, but the task itself
  // is still running. This may be caused by an ExecutorLost event.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1),
    FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0, "ignored"),
    null))

  // Check the currently missing partition.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 1)
  val missingPartition = mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get(0)

  // The second result task itself succeeds soon after.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1), Success, makeMapStatus("hostB", 2)))

  // No missing partitions are left here, which causes the child stage to never succeed.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 0)
}
{code}
> Same tasks' FetchFailed event comes before Success will cause child stage never succeed
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Li Yuanjian
> Priority: Major
> Attachments: 1.png, 2.png
>
> This bug is caused by the abnormal scenario described below:
> # ShuffleMapTask 1.0 is running and will fetch data from ExecutorA.
> # ExecutorA is lost, which triggers `mapOutputTracker.removeOutputsOnExecutor(execId)` and changes the shuffleStatus.
> # The speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
> # ShuffleMapTask 1 is the last task of its stage, so the stage will never succeed because the DAGScheduler finds no missing task to resubmit.
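The race in the scenario above can be modeled with a toy sketch (plain Scala, not DAGScheduler code; all names are illustrative): once the speculative attempt's FetchFailed has marked the stage as failed, the original attempt's later Success still registers its map output, so the resubmitted stage finds no missing partitions and launches nothing.

```scala
import scala.collection.mutable

// Toy model of the SPARK-23811 race (illustrative only): track which
// partition of the shuffle map stage has registered its output.
val mapOutputs = mutable.Map[Int, Option[String]](
  0 -> Some("hostB"), // task for partition 0 already succeeded
  1 -> None)          // partition 1 still running

// 1. The speculative attempt for partition 1 hits FetchFailed first:
//    the stage is marked failed and scheduled for resubmission.
var stageFailedAndWillResubmit = true

// 2. The original attempt for partition 1 then reports Success,
//    registering its map output anyway.
mapOutputs(1) = Some("hostB")

// 3. The resubmitted stage computes its missing partitions: there are
//    none, so no task runs and the child stage is never triggered.
val missing = mapOutputs.collect { case (p, None) => p }.toSeq
```

In this model `missing` ends up empty even though the stage was marked failed, mirroring the hang described in the issue.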
[jira] [Assigned] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23811: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418431#comment-16418431 ] Apache Spark commented on SPARK-23811: User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/20930
[jira] [Assigned] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23811: Assignee: Apache Spark
[jira] [Commented] (SPARK-23812) DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4
[ https://issues.apache.org/jira/browse/SPARK-23812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418430#comment-16418430 ] Yuming Wang commented on SPARK-23812: [~wangtao93] Would you like to work on this?
> DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4
>
> Key: SPARK-23812
> URL: https://issues.apache.org/jira/browse/SPARK-23812
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: wangtao93
> Priority: Minor
>
> The dfs command is already supported, but SqlBase.g4 still lists it under unsupportedHiveNativeCommands.
[jira] [Updated] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.
[ https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla updated SPARK-23814: Description: When the file name contains a colon and the data contains a newline character, reading with spark.read.option("multiLine","true").csv("s3n://Directory/") throws *"java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz"*. If we remove option("multiLine","true"), it works fine even though the file name has a colon in it. It also works fine if we apply *option("multiLine","true")* to any other file whose name doesn't contain a colon. Only when both are present (a colon in the file name and a newline in the data) does it fail.
{quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:253)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 86 more
{quote}
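The root cause can be illustrated with plain java.net.URI, which Hadoop's Path builds on (a standalone sketch, not Hadoop or Spark code; the scheme/path split shown is an assumption about how Path parses the bare file name at the first colon):

```scala
import java.net.{URI, URISyntaxException}

// Minimal sketch of the failing check (plain JDK): Hadoop's Path treats
// everything before the first ':' as a URI scheme, so a bare file name
// like "2017-08-01T00:00:00Z.csv.gz" splits into scheme "2017-08-01T00"
// and relative path "00:00Z.csv.gz". java.net.URI rejects a relative
// path whenever a scheme is present, producing exactly the error in the
// stack trace above.
val message =
  try {
    new URI("2017-08-01T00", null, "00:00Z.csv.gz", null)
    "no exception"
  } catch {
    case e: URISyntaxException => e.getMessage
  }

println(message) // Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
```

This is why the error disappears without multiLine mode: only the multiLine code path builds a Path from the file name in a way that hits this constructor.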
[jira] [Updated] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.
[ https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla updated SPARK-23814: Description: corrected the example path to spark.read.option("multiLine","true").csv("s3n://DirectoryPath/"); the rest of the description is unchanged.
[jira] [Updated] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.
[ https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla updated SPARK-23814: Description: minor wording fixes; otherwise identical to the description above.
[jira] [Created] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.
bharath kumar avusherla created SPARK-23814: --- Summary: Couldn't read file with colon in name and new line character in one of the field. Key: SPARK-23814 URL: https://issues.apache.org/jira/browse/SPARK-23814 Project: Spark Issue Type: Bug Components: Spark Core, Spark Shell Affects Versions: 2.2.0 Reporter: bharath kumar avusherla When the file name has colon and new line character in data, while reading using spark.read.option("multiLine","true").csv("sn://Directory/") function. It is throwing *"**java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz"* error. If we remove the option("multiLine","true"), it is working just fine though the file name has colon in it. It is working fine, If i apply this option *option("multiLine","true")* on any other file which doesn't contain colon in it. But when both are present (colon in file name and new line in the data), it snot working. {quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz at org.apache.hadoop.fs.Path.initialize(Path.java:205) at org.apache.hadoop.fs.Path.(Path.java:171) at org.apache.hadoop.fs.Path.(Path.java:93) at org.apache.hadoop.fs.Globber.glob(Globber.java:253) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51) at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.take(RDD.scala:1327) at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) ... 
48 elided Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz at java.net.URI.checkPath(URI.java:1823) at java.net.URI.(URI.java:745) at org.apache.hadoop.fs.Path.initialize(Path.java:202) ... 86 more {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
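The failure above can be reduced to a URI-parsing rule. A minimal sketch (an assumption based on the stack trace, not Hadoop's actual code): per RFC 3986, the first path segment of a relative URI reference may not contain ":", because everything before the colon would be read as a URI scheme; a timestamped file name like the one in the error therefore trips Hadoop's Path/URI handling.

```python
# Sketch (assumption, inferred from the stack trace): a bare file name is
# treated as a URI reference, and RFC 3986 forbids ":" in the first segment
# of a relative reference, since "2017-08-01T00" would look like a scheme.
def first_segment_has_colon(name: str) -> bool:
    """True if a URI parser could misread the name as scheme:rest."""
    first_segment = name.split("/", 1)[0]
    return ":" in first_segment

assert first_segment_has_colon("2017-08-01T00:00:00Z.csv.gz")
# A common workaround is to make the reference unambiguously relative:
assert not first_segment_has_colon("./2017-08-01T00:00:00Z.csv.gz")
```

Renaming files to drop the colon, or addressing them with a "./" or absolute-path prefix, sidesteps the ambiguity.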
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418410#comment-16418410 ] yuliang commented on SPARK-5928: !image-2018-03-29-11-52-32-075.png|width=741,height=189! From the picture, the shuffle read size is over 2 GB, but the job did not fail. Can anyone explain why? > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid >Priority: Major > Attachments: image-2018-03-29-11-52-32-075.png > > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote fetches. 
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.j
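A back-of-envelope check shows why the repro above hits the limit: 1e6 records of 3000 random (incompressible) bytes all land in one partition and one shuffle block, which exceeds Netty's signed-32-bit frame limit quoted in the exception.

```python
# Rough size estimate for the repro program: one partition holds every
# record, so the single shuffle block carries ~3 GB of payload.
records = int(1e6)
bytes_per_record = int(3e3)
block_size = records * bytes_per_record   # ~3.0e9 bytes of raw payload
frame_limit = 2**31 - 1                   # 2147483647, as in the exception
assert block_size > frame_limit
# The reported frame (3021252889 bytes) is the same order of magnitude:
# payload plus serialization/framing overhead.
assert 3021252889 > frame_limit
```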
[jira] [Updated] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23811: Attachment: 2.png > Same tasks' FetchFailed event comes before Success will cause child stage > never succeed > --- > > Key: SPARK-23811 > URL: https://issues.apache.org/jira/browse/SPARK-23811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Li Yuanjian >Priority: Major > Attachments: 1.png, 2.png > > > This is a bug caused by the abnormal scenario described below: > # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA. > # ExecutorA is lost, triggering `mapOutputTracker.removeOutputsOnExecutor(execId)`; > shuffleStatus is changed. > # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately. > # ShuffleMapTask 1 is the last task of its stage, so this stage will never > succeed because there is no missing task the DAGScheduler can find. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23811: Attachment: 1.png > Same tasks' FetchFailed event comes before Success will cause child stage > never succeed > --- > > Key: SPARK-23811 > URL: https://issues.apache.org/jira/browse/SPARK-23811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Li Yuanjian >Priority: Major > Attachments: 1.png > > > This is a bug caused by the abnormal scenario described below: > # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA. > # ExecutorA is lost, triggering `mapOutputTracker.removeOutputsOnExecutor(execId)`; > shuffleStatus is changed. > # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately. > # ShuffleMapTask 1 is the last task of its stage, so this stage will never > succeed because there is no missing task the DAGScheduler can find. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuliang updated SPARK-5928: --- Attachment: image-2018-03-29-11-52-32-075.png > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid >Priority: Major > Attachments: image-2018-03-29-11-52-32-075.png > > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote fetches. 
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) >
[jira] [Updated] (SPARK-23813) [SparkSQL] the result is different between hive and spark when use PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-23813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cui Xixin updated SPARK-23813: -- Description: i am working on replacing hivesql with sparksql now, i found parse_url perform differently, for example: "select parse_url('http://spark.apache.org/path?query=1 &test=1','HOST') from dual limit 1" in hive,the result is "spark.apache.org" ,but in sparksql is "null", then i fount SPARK-16826, https://issues.apache.org/jira/browse/SPARK-16826 ,the implementation has been changed for better performence, but also lead to the difference. in fact, the main reason is the 'space' after "query=1", so if we can fix it, for example, encode the sql before new URI() and decode the result? was: i am working on replacing hivesql with sparksql now, i fount parse_url perform differently, for example: "select parse_url('http://spark.apache.org/path?query=1 &test=1','HOST') from dual limit 1" in hive,the result is "spark.apache.org" ,but in sparksql is "null", then i fount SPARK-16826, https://issues.apache.org/jira/browse/SPARK-16826 ,the implementation has been changed for better performence, but also lead to the difference. in fact, the main reason is the 'space' after "query=1", so if we can fix it, for example, encode the sql before new URI() and decode the result? 
> [SparkSQL] the result is different between hive and spark when use > PARSE_URL() > --- > > Key: SPARK-23813 > URL: https://issues.apache.org/jira/browse/SPARK-23813 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Cui Xixin >Priority: Minor > > I am working on replacing HiveQL with Spark SQL now, and I found that parse_url > performs differently. > For example: > "select parse_url('http://spark.apache.org/path?query=1 &test=1','HOST') > from dual limit 1" > in Hive, the result is "spark.apache.org", but in Spark SQL it is "null". > I then found SPARK-16826, > https://issues.apache.org/jira/browse/SPARK-16826 ; the implementation has > been changed for better performance, which also led to the difference. In > fact, the main reason is the space after "query=1", so could we fix it, > for example, by encoding the URL string before new URI() and decoding the result? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
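The fix proposed in the description can be sketched outside Spark. This is a hypothetical illustration in Python, not Spark's implementation: percent-encode the characters that a strict URI parser (such as java.net.URI) rejects, here the space, before extracting the requested part.

```python
from urllib.parse import urlsplit, quote

# Hypothetical sketch of the proposal: encode characters a strict URI
# parser would reject (the space after "query=1"), then parse and extract
# the HOST part. Function name and approach are illustrative only.
def parse_url_host(url: str) -> str:
    encoded = quote(url, safe=":/?&=#@")  # turns " " into "%20", keeps URL syntax
    return urlsplit(encoded).hostname

# With encoding, the host survives the stray space in the query string:
assert parse_url_host("http://spark.apache.org/path?query=1 &test=1") == "spark.apache.org"
```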
[jira] [Updated] (SPARK-23813) [SparkSQL] the result is different between hive and spark when use PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-23813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cui Xixin updated SPARK-23813: -- Description: i am working on replacing hivesql with sparksql now, i fount parse_url perform differently, for example: "select parse_url('http://spark.apache.org/path?query=1 &test=1','HOST') from dual limit 1" in hive,the result is "spark.apache.org" ,but in sparksql is "null", then i fount SPARK-16826, https://issues.apache.org/jira/browse/SPARK-16826 ,the implementation has been changed for better performence, but also lead to the difference. in fact, the main reason is the 'space' after "query=1", so if we can fix it, for example, encode the sql before new URI() and decode the result? > [SparkSQL] the result is different between hive and spark when use > PARSE_URL() > --- > > Key: SPARK-23813 > URL: https://issues.apache.org/jira/browse/SPARK-23813 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Cui Xixin >Priority: Minor > > i am working on replacing hivesql with sparksql now, i fount parse_url > perform differently, > for example: > "select parse_url('http://spark.apache.org/path?query=1 &test=1','HOST') > from dual limit 1" > in hive,the result is "spark.apache.org" ,but in sparksql is "null", > then i fount SPARK-16826, > https://issues.apache.org/jira/browse/SPARK-16826 ,the implementation has > been changed for better performence, but also lead to the difference. in > fact, the main reason is the 'space' after "query=1", so if we can fix it, > for example, encode the sql before new URI() and decode the result? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23813) [SparkSQL] the result is different between hive and spark when use PARSE_URL()
Cui Xixin created SPARK-23813: - Summary: [SparkSQL] the result is different between hive and spark when use PARSE_URL() Key: SPARK-23813 URL: https://issues.apache.org/jira/browse/SPARK-23813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Cui Xixin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23812) DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4
wangtao93 created SPARK-23812: - Summary: DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4 Key: SPARK-23812 URL: https://issues.apache.org/jira/browse/SPARK-23812 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: wangtao93 The dfs command is already supported, but SqlBase.g4 still lists it in unsupportedHiveNativeCommands. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed
Li Yuanjian created SPARK-23811: --- Summary: Same tasks' FetchFailed event comes before Success will cause child stage never succeed Key: SPARK-23811 URL: https://issues.apache.org/jira/browse/SPARK-23811 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0, 2.2.0 Reporter: Li Yuanjian This is a bug caused by the abnormal scenario described below: # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA. # ExecutorA is lost, triggering `mapOutputTracker.removeOutputsOnExecutor(execId)`; shuffleStatus is changed. # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately. # ShuffleMapTask 1 is the last task of its stage, so this stage will never succeed because there is no missing task the DAGScheduler can find. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
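The hang condition in the scenario above can be modeled as a toy state machine. This is one reading of the report, not Spark's actual scheduler code: a stage finishes only when every partition has a registered output, but the resubmission triggered by the early FetchFailed computes an empty missing-partition set once the late Success has registered its output, so no task is ever rerun and no completion event ever arrives.

```python
# Toy model of the reported race (my interpretation, heavily simplified).
map_outputs = {0: "execB"}   # partition -> executor holding its map output
num_partitions = 2

# FetchFailed from speculative task 1.1 arrives first: stage marked for resubmit.
resubmit_requested = True

# Success of the original task 1.0 arrives afterwards and registers partition 1.
map_outputs[1] = "execA"

# The resubmitted stage asks which partitions are still missing:
missing = [p for p in range(num_partitions) if p not in map_outputs]
# Nothing is missing, so no tasks are launched and the stage never completes.
assert resubmit_requested and missing == []
```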
[jira] [Commented] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418298#comment-16418298 ] Saisai Shao commented on SPARK-23807: - Hi [~ste...@apache.org], this is a duplicate of SPARK-23534; I think we can convert this into a subtask of SPARK-23534, what do you think? > Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and > binding > --- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Major > > Hadoop 3, and in particular Hadoop 3.1, adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of Hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supersedes the classic FileOutputCommitter, both in a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for Spark. > * Lots of other features and fixes. > The basic work of building Spark with Hadoop 3 is just a matter of doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of the ZooKeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on Curator to pull it in; this needs a profile > to declare the right ZK version, obviously. 
> To use the cloud features, the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on, and only on, the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, plus a source package which is only built and tested when building > against Hadoop 3.1+. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23675) Title add spark logo, use spark logo image
[ https://issues.apache.org/jira/browse/SPARK-23675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23675: - Assignee: guoxiaolongzte > Title add spark logo, use spark logo image > -- > > Key: SPARK-23675 > URL: https://issues.apache.org/jira/browse/SPARK-23675 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.4.0 > > Attachments: flink.png, kafka.png, nifi.png, spark_fix_after.png, > spark_fix_before.png, storm.png, storm.png, yarn.png, yarn.png > > > Add the Spark logo image to the page title. Other big data system > UIs do this, so I think Spark should add it too. > Spark fix before: !spark_fix_before.png! > > Spark fix after: !spark_fix_after.png! > > Reference Kafka UI: !kafka.png! > > Reference Storm UI: !storm.png! > > Reference YARN UI: !yarn.png! > > Reference NiFi UI: !nifi.png! > > Reference Flink UI: !flink.png! > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23675) Title add spark logo, use spark logo image
[ https://issues.apache.org/jira/browse/SPARK-23675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23675. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20818 [https://github.com/apache/spark/pull/20818] > Title add spark logo, use spark logo image > -- > > Key: SPARK-23675 > URL: https://issues.apache.org/jira/browse/SPARK-23675 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.4.0 > > Attachments: flink.png, kafka.png, nifi.png, spark_fix_after.png, > spark_fix_before.png, storm.png, storm.png, yarn.png, yarn.png > > > Add the Spark logo image to the page title. Other big data system > UIs do this, so I think Spark should add it too. > Spark fix before: !spark_fix_before.png! > > Spark fix after: !spark_fix_after.png! > > Reference Kafka UI: !kafka.png! > > Reference Storm UI: !storm.png! > > Reference YARN UI: !yarn.png! > > Reference NiFi UI: !nifi.png! > > Reference Flink UI: !flink.png! > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql
[ https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418282#comment-16418282 ] Maryann Xue commented on SPARK-20169: - This was related to the use of "withColumnRenamed", which will add a "Project" on top of the original relation. And later on, when getting the "outputPartitioning" of the "Project", it returned HashPartitioning("dst") while it should return HashPartitioning("src") since it had been renamed. As a result, an "Exchange" node was missing between the two outmost HashAggregate nodes, and that's why the aggregate result was not correct. {code:java} test("SPARK-20169") { val e = Seq((1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)).toDF("src", "dst") val r = Seq((1), (2), (3), (4)).toDF("src") val r1 = e.join(r, "src" :: Nil).groupBy("dst").count().withColumnRenamed("dst", "src") val jr = e.join(r1, "src" :: Nil) val r2 = jr.groupBy("dst").count r2.explain() r2.show } {code} The physical plan resulted from this bug turned out to be {code} == Physical Plan == *(2) HashAggregate(keys=[dst#181], functions=[count(1)]) +- *(2) HashAggregate(keys=[dst#181], functions=[partial_count(1)]) +- *(2) Project [dst#181] +- *(2) BroadcastHashJoin [src#180], [src#197], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- LocalTableScan [src#180, dst#181] +- *(2) HashAggregate(keys=[dst#181], functions=[]) +- Exchange hashpartitioning(dst#181, 5) +- *(1) HashAggregate(keys=[dst#181], functions=[]) +- *(1) Project [dst#181] +- *(1) BroadcastHashJoin [src#180], [src#187], Inner, BuildRight :- LocalTableScan [src#180, dst#181] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [src#187] {code} The correct physical plan should be {code} == Physical Plan == *(3) HashAggregate(keys=[dst#181], functions=[count(1)]) +- Exchange hashpartitioning(dst#181, 5) +- *(2) 
HashAggregate(keys=[dst#181], functions=[partial_count(1)]) +- *(2) Project [dst#181] +- *(2) BroadcastHashJoin [src#180], [src#197], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- LocalTableScan [src#180, dst#181] +- *(2) HashAggregate(keys=[dst#181], functions=[]) +- Exchange hashpartitioning(dst#181, 5) +- *(1) HashAggregate(keys=[dst#181], functions=[]) +- *(1) Project [dst#181] +- *(1) BroadcastHashJoin [src#180], [src#187], Inner, BuildRight :- LocalTableScan [src#180, dst#181] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [src#187] {code} This turns out to be the same issue as SPARK-23368; the PR for SPARK-23368 fixes it. > Groupby Bug with Sparksql > - > > Key: SPARK-20169 > URL: https://issues.apache.org/jira/browse/SPARK-20169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Bin Wu >Priority: Major > > We find a potential bug in Catalyst optimizer which cannot correctly > process "groupby". You can reproduce it by following simple example: > = > from pyspark.sql.functions import * > #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"]) > e = spark.read.csv("graph.csv", header=True) > r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src']) > r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src') > jr = e.join(r1, 'src') > jr.show() > r2 = jr.groupBy('dst').count() > r2.show() > = > FYI, "graph.csv" contains exactly the same data as the commented line. > You can find that jr is: > |src|dst|count| > | 3| 1|1| > | 1| 4|3| > | 1| 3|3| > | 1| 2|3| > | 4| 1|1| > | 2| 1|1| > But, after the last groupBy, the 3 rows with dst = 1 are not grouped together: > |dst|count| > | 1|1| > | 4|1| > | 3|1| > | 2|1| > | 1|1| > | 1|1| > If we build jr directly from raw data (commented line), this error will not > show up. 
So > we suspect that there is a bug in the Catalyst optimizer when multiple joins > and groupBy's > are being optimized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: is
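The missing-Exchange symptom described in the comment can be illustrated without Spark. A toy sketch (an analogy, not Spark's code): rows stay hash-partitioned on the pre-rename column "src", and if the final groupBy("dst") aggregates each partition locally without a re-shuffle on "dst", the same dst key can surface in several partitions' outputs, exactly like the ungrouped dst=1 rows above.

```python
# Toy model: data partitioned by "src", then aggregated by "dst" with NO
# exchange (re-shuffle) in between, mimicking the missing Exchange node.
rows = [(3, 1), (1, 4), (1, 3), (1, 2), (4, 1), (2, 1)]  # (src, dst) as in jr

num_partitions = 2
partitions = [[], []]
for src, dst in rows:
    partitions[src % num_partitions].append((src, dst))  # partitioned by src

# Per-partition groupBy("dst").count(), outputs concatenated without merging:
results = []
for part in partitions:
    counts = {}
    for _, dst in part:
        counts[dst] = counts.get(dst, 0) + 1
    results.extend(counts.items())

dst1_counts = [c for d, c in results if d == 1]
assert sum(dst1_counts) == 3   # the total is right...
assert len(dst1_counts) > 1    # ...but dst=1 appears as multiple ungrouped rows
```

With a proper exchange on "dst" before the final aggregation, each key would land in exactly one partition and appear once.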
[jira] [Assigned] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23772: Assignee: Apache Spark > Provide an option to ignore column of all null values or empty map/array > during JSON schema inference > - > > Key: SPARK-23772 > URL: https://issues.apache.org/jira/browse/SPARK-23772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > > It is common that we convert data from JSON source to structured format > periodically. In the initial batch of JSON data, if a field's values are > always null, Spark infers this field as StringType. However, in the second > batch, one non-null value appears in this field and its type turns out to be > not StringType. Then schema merging fails because of the schema inconsistency. > This also applies to empty arrays and empty objects. My proposal is to provide > an option in the Spark JSON source to omit those fields until we see a non-null > value. > This is similar to SPARK-12436 but the proposed solution is different. > cc: [~rxin] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23772: Assignee: (was: Apache Spark) > Provide an option to ignore column of all null values or empty map/array > during JSON schema inference > - > > Key: SPARK-23772 > URL: https://issues.apache.org/jira/browse/SPARK-23772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Major > > It is common that we convert data from JSON source to structured format > periodically. In the initial batch of JSON data, if a field's values are > always null, Spark infers this field as StringType. However, in the second > batch, one non-null value appears in this field and its type turns out to be > not StringType. Then schema merging fails because of the schema inconsistency. > This also applies to empty arrays and empty objects. My proposal is to provide > an option in the Spark JSON source to omit those fields until we see a non-null > value. > This is similar to SPARK-12436 but the proposed solution is different. > cc: [~rxin] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418280#comment-16418280 ] Apache Spark commented on SPARK-23772: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/20929
[jira] [Issue Comment Deleted] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better
[ https://issues.apache.org/jira/browse/SPARK-23810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dciborow updated SPARK-23810: - Comment: was deleted (was: test comment to see if i get an email...) > Matrix Multiplication is so bad, file I/O to local python is better > --- > > Key: SPARK-23810 > URL: https://issues.apache.org/jira/browse/SPARK-23810 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: dciborow >Priority: Minor > > I am trying to multiply two matrices. One is 130k by 30k. The second is 30k > by 30k. > Running this leads to heartbeat timeouts, Java heap space and garbage collection > errors. > {{rdd.toBlockMatrix.multiply(rightRdd.toBlockMatrix).toIndexedRowMatrix()}} > {{I have also tried the following, which fails on the toLocalMatrix call. > }} > val userMatrix = new CoordinateMatrix(userRDD).toIndexedRowMatrix() > val itemMatrix = new > CoordinateMatrix(itemRDD).toBlockMatrix().toLocalMatrix() > val itemMatrixBC = session.sparkContext.broadcast(itemMatrix) > val userToItemMatrix = userMatrix > .multiply(itemMatrixBC.value) > .rows.map(index => (index.index.toInt, index.vector)) > > I have instead gotten this operation "working" by saving the input > dataframes to parquet (they start as DataFrames before the .rdd call used to get > them to work with the matrix types), then loading them into > python/pandas, using numpy for the matrix multiplication, saving back to > parquet, and rereading back into Spark. 
> > Python - > import pandas as pd > import numpy as np > X = pd.read_parquet('./items-parquet', engine='pyarrow') > #Xp = np.stack(X.jaccardList) > Xp = pd.DataFrame(np.stack(X.jaccardList), X.itemID) > Xrows = pd.DataFrame(index=range(0, X.itemID.max()+1)) > Xpp = Xrows.join(Xp).fillna(0) > Y = pd.read_parquet('./users-parquet',engine='pyarrow') > Yp = np.stack(Y.flatList) > Z = np.matmul(Yp, Xpp) > Zp = pd.DataFrame(Z) > Zp.columns = list(map(str, Zp.columns)) > Zpp = pd.DataFrame() > Zpp['id'] = Zp.index > Zpp['ratings'] = Zp.values.tolist() > Zpp.to_parquet("sampleout.parquet",engine='pyarrow') > > Scala - > import sys.process._ > val result = "python matmul.py".! > val pythonOutput = > userDataFrame.sparkSession.read.parquet("./sampleout.parquet") > > I can provide code, and the data to repo. But could use some instructions how > to set that up. This is based on the MovieLens 20mil dataset, or I can > provide access to my data in Azure. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
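A back-of-envelope memory check makes the reported heap and GC failures unsurprising. Assuming dense matrices of 8-byte doubles (the real MLlib block layout adds further overhead, so these are lower bounds):

```python
# Rough dense-storage estimates for the matrix sizes in the report.
GiB = 1024 ** 3

user_item = 130_000 * 30_000 * 8 / GiB   # the 130k x 30k input, ~29 GiB
item_item = 30_000 * 30_000 * 8 / GiB    # the 30k x 30k input, ~6.7 GiB
result    = 130_000 * 30_000 * 8 / GiB   # the product, also ~29 GiB
```

In particular, the reporter's second attempt broadcasts the 30k x 30k item matrix, which ships roughly 6.7 GiB to the driver and to every executor, well past default heap settings; the distributed block multiply additionally materializes shuffle data on top of the ~29 GiB result.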
[jira] [Updated] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better
[ https://issues.apache.org/jira/browse/SPARK-23810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dciborow updated SPARK-23810: - Priority: Minor (was: Major)
[jira] [Commented] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better
[ https://issues.apache.org/jira/browse/SPARK-23810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418244#comment-16418244 ] dciborow commented on SPARK-23810: -- test comment to see if i get an email...
[jira] [Created] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better
dciborow created SPARK-23810: Summary: Matrix Multiplication is so bad, file I/O to local python is better Key: SPARK-23810 URL: https://issues.apache.org/jira/browse/SPARK-23810 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.2.0 Reporter: dciborow
[jira] [Commented] (SPARK-23809) Active SparkSession should be set by getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-23809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418211#comment-16418211 ] Apache Spark commented on SPARK-23809: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/20927 > Active SparkSession should be set by getOrCreate > > > Key: SPARK-23809 > URL: https://issues.apache.org/jira/browse/SPARK-23809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Eric Liang >Priority: Minor > > Currently, the active SparkSession is set inconsistently (e.g., in > createDataFrame, prior to query execution). Many places in Spark also > incorrectly query the active session when they should be calling > activeSession.getOrElse(defaultSession). > The semantics here can be cleaned up if we also set the active session when > the default session is set.
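The proposed semantics can be sketched as a toy model in Python (hypothetical names, not Spark's actual session code): `get_or_create` registers the default session *and* marks it active on the calling thread, and lookups follow the `activeSession.getOrElse(defaultSession)` pattern the ticket recommends.

```python
import threading

class Session:
    """Toy stand-in for SparkSession illustrating active/default semantics."""
    _default = None                  # process-wide default session
    _active = threading.local()      # per-thread active session

    @classmethod
    def get_or_create(cls):
        # Proposed fix: setting the default session also sets it active,
        # so later lookups on this thread agree with what was returned.
        if cls._default is None:
            cls._default = cls()
        cls._active.value = cls._default
        return cls._default

    @classmethod
    def active_or_default(cls):
        # The recommended lookup: activeSession.getOrElse(defaultSession).
        return getattr(cls._active, "value", None) or cls._default
```

With this shape, a thread that never explicitly set an active session still resolves to the default, and a thread that called `get_or_create` sees the same object from both paths.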
[jira] [Assigned] (SPARK-23809) Active SparkSession should be set by getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-23809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23809: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23809) Active SparkSession should be set by getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-23809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23809: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-23809) Active SparkSession should be set by getOrCreate
Eric Liang created SPARK-23809: -- Summary: Active SparkSession should be set by getOrCreate Key: SPARK-23809 URL: https://issues.apache.org/jira/browse/SPARK-23809 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Eric Liang
[jira] [Commented] (SPARK-23808) Test spark sessions should set default session
[ https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418142#comment-16418142 ] Apache Spark commented on SPARK-23808: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20926 > Test spark sessions should set default session > -- > > Key: SPARK-23808 > URL: https://issues.apache.org/jira/browse/SPARK-23808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > SparkSession.getOrCreate() ensures that the session it returns is set as the > default. Test code (TestSparkSession and TestHiveSparkSession) bypasses > this method, so a default is never set. We need to set it.
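The shape of the fix can be sketched as a toy Python model (hypothetical names; Spark's real TestSparkSession is Scala code): a session constructed directly, as the test helpers do, should still register itself as the default rather than relying on callers going through `get_or_create`.

```python
class Session:
    """Toy stand-in for SparkSession."""
    _default = None

    def __init__(self, set_default=True):
        # Proposed fix: even a directly-constructed session registers
        # itself as the default, so code that reads the default session
        # works under tests too.
        if set_default and Session._default is None:
            Session._default = self

class TestSession(Session):
    # Test code constructs this directly instead of calling a
    # get_or_create() factory; with the fix in __init__, the default
    # is still set.
    pass
```

This mirrors the ticket's point: the invariant "a session exists implies a default is set" should hold in the constructor path, not only in the factory path.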
[jira] [Assigned] (SPARK-23808) Test spark sessions should set default session
[ https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23808: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23808) Test spark sessions should set default session
[ https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23808: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-23808) Test spark sessions should set default session
Jose Torres created SPARK-23808: --- Summary: Test spark sessions should set default session Key: SPARK-23808 URL: https://issues.apache.org/jira/browse/SPARK-23808 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Jose Torres
[jira] [Commented] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
[ https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418087#comment-16418087 ] Apache Spark commented on SPARK-22941: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20925 > Allow SparkSubmit to throw exceptions instead of exiting / printing errors. > --- > > Key: SPARK-22941 > URL: https://issues.apache.org/jira/browse/SPARK-22941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > {{SparkSubmit}} can now be called by the {{SparkLauncher}} library (see > SPARK-11035). But if the caller provides incorrect or inconsistent parameters > to the app, {{SparkSubmit}} will print errors to the output and call > {{System.exit}}, which is not very user-friendly in this code path. > We should modify {{SparkSubmit}} to be friendlier when called this way, > while still maintaining the old behavior when called from the command line.
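The dual behavior being requested can be sketched as a toy Python model (hypothetical names such as `submit`, `SparkSubmitError`, and `throw_on_error`; not Spark's actual API): library callers get an exception they can catch, while the CLI path keeps the old print-and-exit behavior.

```python
import sys

class SparkSubmitError(Exception):
    """Raised instead of exiting when a library caller asks for it."""

def submit(args, throw_on_error=False):
    # Toy validation standing in for SparkSubmit's argument checks.
    if "--class" not in args:
        msg = "Error: missing --class argument"
        if throw_on_error:
            # Library path (e.g. via a launcher): surface a catchable error.
            raise SparkSubmitError(msg)
        # CLI path: preserve the historical print-and-exit behavior.
        print(msg, file=sys.stderr)
        sys.exit(1)
    return 0
```

A launcher-style caller would wrap `submit(args, throw_on_error=True)` in a try/except and report the failure through its own channel, instead of having its JVM (here, process) terminated out from under it.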
[jira] [Assigned] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
[ https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22941: Assignee: Apache Spark
[jira] [Assigned] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
[ https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22941: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-23806) Broadcast. unpersist can cause fatal exception when used with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23806: Assignee: Apache Spark > Broadcast. unpersist can cause fatal exception when used with dynamic > allocation > > > Key: SPARK-23806 > URL: https://issues.apache.org/jira/browse/SPARK-23806 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Major > > Very similar to https://issues.apache.org/jira/browse/SPARK-22618 . But this > could also apply to Broadcast.unpersist. > > 2018-03-24 05:29:17,836 [Spark Context Cleaner] ERROR > org.apache.spark.ContextCleaner - Error cleaning broadcast 85710 > org.apache.spark.SparkException: Exception thrown in awaitResult: at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at > org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:152) > at > org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) > at > org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60) > at > org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238) > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194) > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185) > at scala.Option.foreach(Option.scala:257) at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1286) at > 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) > Caused by: java.io.IOException: Failed to send RPC 7228115282075984867 to > /10.10.10.10:53804: java.nio.channels.ClosedChannelException at > org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) Caused by: > java.nio.channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
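The fix pattern referenced from SPARK-22618, applied to broadcast cleanup, can be sketched in toy Python (hypothetical names; the real code is Scala in ContextCleaner): an RPC failure against an executor that dynamic allocation has already removed is logged and swallowed, while genuinely unexpected errors still stop the context.

```python
import logging

log = logging.getLogger("ContextCleaner")

def clean_broadcast(broadcast_id, remove_fn, stop_context):
    """Toy cleanup step: remove a broadcast, tolerating dead executors.

    remove_fn     -- callable that performs the remove RPC (may raise IOError
                     if the target executor's channel is already closed)
    stop_context  -- callable invoked only for non-IO, unexpected failures
    """
    try:
        remove_fn(broadcast_id)
    except IOError as e:
        # Expected with dynamic allocation: the executor went away between
        # scheduling the cleanup and sending the RPC. Log and move on
        # instead of killing the cleaner thread / stopping the context.
        log.warning("Failed to clean broadcast %s: %s", broadcast_id, e)
    except Exception:
        stop_context()  # anything else is treated as fatal, as before
        raise
```

Under this shape, the `ClosedChannelException`-style failure in the stack trace above becomes a warning rather than an error routed through `tryOrStopSparkContext`.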
[jira] [Assigned] (SPARK-23806) Broadcast. unpersist can cause fatal exception when used with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23806: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417881#comment-16417881 ] Apache Spark commented on SPARK-23806: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/20924 > Broadcast. unpersist can cause fatal exception when used with dynamic > allocation > > > Key: SPARK-23806 > URL: https://issues.apache.org/jira/browse/SPARK-23806 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Priority: Major > > Very similar to https://issues.apache.org/jira/browse/SPARK-22618 . But this > could also apply to Broadcast.unpersist. > > 2018-03-24 05:29:17,836 [Spark Context Cleaner] ERROR > org.apache.spark.ContextCleaner - Error cleaning broadcast 85710 > org.apache.spark.SparkException: Exception thrown in awaitResult: at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at > org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:152) > at > org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) > at > org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60) > at > org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238) > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194) > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185) > at scala.Option.foreach(Option.scala:257) at > 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1286) at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) > Caused by: java.io.IOException: Failed to send RPC 7228115282075984867 to > /10.10.10.10:53804: java.nio.channels.ClosedChannelException at > org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) Caused by: > java.nio.channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
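The failure mode above is generic: the ContextCleaner background thread dies because a single RPC to an executor that dynamic allocation has already torn down raises. The fix for the sibling issue SPARK-22618 made such removal failures non-fatal, and the pull request here proposes the same treatment for broadcast cleanup. A minimal Python sketch of that best-effort pattern (all names are illustrative, not Spark's actual API):

```python
# Best-effort cleanup: one unreachable executor must not kill the cleaner.
# All names here are illustrative; this is NOT Spark's actual API.

class ExecutorGone(IOError):
    """Stand-in for the ClosedChannelException seen in the stack trace."""

def remove_broadcast(executor, broadcast_id):
    """Remove one broadcast block from one executor, or fail if it is gone."""
    if executor["alive"]:
        executor["broadcasts"].discard(broadcast_id)
    else:
        raise ExecutorGone("Failed to send RPC to " + executor["addr"])

def unpersist_best_effort(executors, broadcast_id):
    """Try every executor; collect failures instead of propagating them,
    so the cleaning loop survives executors lost to dynamic allocation."""
    failures = []
    for ex in executors:
        try:
            remove_broadcast(ex, broadcast_id)
        except ExecutorGone as e:
            # The executor may simply have been deallocated: record and
            # keep cleaning the remaining ones rather than dying here.
            failures.append((ex["addr"], str(e)))
    return failures
```

The key design choice is that per-executor errors are demoted from thread-killing exceptions to logged warnings, which matches the "Error cleaning broadcast" log line in the report.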
[jira] [Updated] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-23806: -- Description: Very similar to https://issues.apache.org/jira/browse/SPARK-22618 . But this could also apply to Broadcast.unpersist. 2018-03-24 05:29:17,836 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner - Error cleaning broadcast 85710 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:152) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185) at scala.Option.foreach(Option.scala:257) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1286) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) Caused by: java.io.IOException: Failed to send RPC 7228115282075984867 to /10.10.10.10:53804: java.nio.channels.ClosedChannelException at 
org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852) at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738) at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725) at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062) at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.channels.ClosedChannelException was: Very similar to https://issues.apache.org/jira/browse/SPARK-2261 . But this could also apply to Broadcast.unpersist. 
2018-03-24 05:29:17,836 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner - Error cleaning broadcast 85710 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:152) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun
[jira] [Assigned] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23807: Assignee: Apache Spark > Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and > binding > --- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Major > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. > To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23807: Assignee: (was: Apache Spark) > Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and > binding > --- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Major > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. > To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417864#comment-16417864 ] Apache Spark commented on SPARK-23807: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/20923 > Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and > binding > --- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Major > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. 
> To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
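For readers unfamiliar with Maven profiles, the following is a hedged sketch of the shape such a hadoop-3 profile could take: an explicit ZooKeeper pin (instead of relying on Curator's transitive resolution) plus an overridable Hadoop version. The property names and version values are assumptions for illustration, not the profile that eventually landed.

```xml
<!-- Hypothetical sketch only; see SPARK-23807 / PR 20923 for the real change. -->
<profile>
  <id>hadoop-3</id>
  <properties>
    <hadoop.version>3.1.0</hadoop.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
  <dependencyManagement>
    <dependencies>
      <!-- Declare ZooKeeper explicitly so SBT's dependency resolution
           no longer breaks on the JAR Curator would otherwise pull in. -->
      <dependency>
        <groupId>org.apache.zookeeper</groupId>
        <artifactId>zookeeper</artifactId>
        <version>${zookeeper.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>
```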
[jira] [Commented] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417848#comment-16417848 ] Apache Spark commented on SPARK-23096: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20922 > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Saisai Shao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417787#comment-16417787 ] Apache Spark commented on SPARK-23096: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20921 > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Saisai Shao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417751#comment-16417751 ] Marcelo Vanzin commented on SPARK-23782: All the information you're trying to protect can be found through other means. Looking at the even log dir in HDFS; looking at the cluster manager's UI; just running "ps" on the cluster machines. Different users seeing different listings can be confusing. "Hey I was trying to find your jobs for blah on the SHS and I don't see them." Again, knowing the existence of users and the fact that they run jobs is not a security problem. You cannot see what those jobs do. > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. 
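Whichever side of the debate one takes, the filtering Marco proposes amounts to a small predicate over the listing: show an application if the requesting user owns it, appears in its view ACL, or is an admin. A hypothetical Python sketch (field names invented for illustration; this is not the History Server code):

```python
# Illustrative per-user filtering of a history-server application listing.
# Field names ("spark_user", "view_acls") are invented for this sketch.

def visible_apps(apps, user, admin_acls):
    """Return only the applications `user` may read: their own apps,
    apps whose view ACL names them, or everything for admin users."""
    if user in admin_acls:
        return list(apps)
    return [a for a in apps
            if a["spark_user"] == user or user in a.get("view_acls", ())]
```

Under this rule, the `u1`/`u2` example in the issue would collapse to a one-row listing per normal user, while `admin` would keep the full table.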
[jira] [Created] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding
Steve Loughran created SPARK-23807: -- Summary: Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding Key: SPARK-23807 URL: https://issues.apache.org/jira/browse/SPARK-23807 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Steve Loughran
Hadoop 3, and in particular Hadoop 3.1, adds:
* Java 8 as the minimum (and currently sole) supported Java version
* A new "hadoop-cloud-storage" module intended to be a minimal dependency POM for all the cloud connectors in the version of Hadoop built against
* The ability to declare a committer for any FileOutputFormat which supersedes the classic FileOutputCommitter, both for a job and for a specific FS URI
* A shaded client JAR, though not yet one complete enough for Spark
* Lots of other features and fixes.
The basic work of building Spark with Hadoop 3 is just a matter of doing the build with {{-Dhadoop.version=3.x.y}}; however, that
* Doesn't build on SBT (dependency resolution of the ZooKeeper JAR)
* Misses the new cloud features.
The ZK dependency can be fixed everywhere by explicitly declaring the ZK artifact instead of relying on Curator to pull it in; this needs a profile to declare the right ZK version, obviously. To use the cloud features, the hadoop-3 profile should declare that the spark-hadoop-cloud module depends on, and only on, the hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud storage, plus a source package which is only built and tested when building against Hadoop 3.1+
[jira] [Created] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation
Thomas Graves created SPARK-23806: - Summary: Broadcast. unpersist can cause fatal exception when used with dynamic allocation Key: SPARK-23806 URL: https://issues.apache.org/jira/browse/SPARK-23806 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Thomas Graves Very similar to https://issues.apache.org/jira/browse/SPARK-2261 . But this could also apply to Broadcast.unpersist. 2018-03-24 05:29:17,836 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner - Error cleaning broadcast 85710 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:152) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185) at scala.Option.foreach(Option.scala:257) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1286) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) Caused by: java.io.IOException: Failed to send RPC 7228115282075984867 to 
/10.10.10.10:53804: java.nio.channels.ClosedChannelException at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852) at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738) at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725) at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062) at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0
[ https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417519#comment-16417519 ] Nathan Kleyn edited comment on SPARK-23801 at 3/28/18 3:14 PM: --- [~kiszk] Sure, we'll have our best go at trying to narrow down the job to something that fails - many thanks for the speedy reply! Since upgrading to Spark v2.3.0 a lot of the stages are listed as "ThreadPoolExecutor" now, meaning it's really difficult to actually see what they correspond to in the source. As this job is failing around the 60th stage, that's a lot of things it could be! Is there any advice on the best way to correlate these "ThreadPoolExecutor" stages with the source code now? was (Author: nathankleyn): [~kiszk] We'll have our best go at trying to narrow down the job to something that fails. Since upgrading to Spark v2.3.0 a lot of the stages are listed as "ThreadPoolExecutor" now, meaning it's really difficult to actually see what they correspond to in the source. As this job is failing around the 60th stage, that's a lot of things it could be! Is there any advice on the best way to correlate these "ThreadPoolExecutor" stages with the source code now? > Consistent SIGSEGV after upgrading to Spark v2.3.0 > -- > > Key: SPARK-23801 > URL: https://issues.apache.org/jira/browse/SPARK-23801 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 > Environment: Mesos coarse grained executor > 18 * r3.4xlarge (16 core boxes) with 105G of executor memory >Reporter: Nathan Kleyn >Priority: Major > Attachments: spark-executor-failure.coredump.log > > > After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent > segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of > executor memory). 
I've attached the full coredump but here is an except: > {code:java} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build > 1.8.0_161-b12) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode > linux-amd64 ) > # Problematic frame: > # V [libjvm.so+0x995fdc] oopDesc* > PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c > # > # Core dump written. Default location: > /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core > or core.1315 > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > #{code} > {code:java} > --- T H R E A D --- > Current thread (0x7f146005b000): GCTaskThread [stack: > 0x7f1464e2d000,0x7f1464f2e000] [id=1363] > siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: > 0x > Registers: > RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, > RDX=0x > RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, > RDI=0x7ef7bc30bda8 > R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, > R11=0x7f14671240e0 > R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, > R15=0x000d > RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, > ERR=0x > TRAPNO=0x000d > Top of Stack: (sp=0x7f1464f2c1a0) > 0x7f1464f2c1a0: 7f146005b000 0001 > 0x7f1464f2c1b0: 0004 7f14600bb640 > 0x7f1464f2c1c0: 7f1464f2c210 7f14673aeed6 > 0x7f1464f2c1d0: 7f1464f2c2c0 7f1464f2c250 > 0x7f1464f2c1e0: 7f11bde31b70 7ef9c035f8c8 > 0x7f1464f2c1f0: 7ef8a80a7060 1741 > 0x7f1464f2c200: 0002 > 0x7f1464f2c210: 7f1464f2c230 7f146742b005 > 0x7f1464f2c220: 7ef8a80a7050 1741 > 0x7f1464f2c230: 7f1464f2c2d0 7f14673ae9fb > 0x7f1464f2c240: 7f1467a5d880 7f14673ad9a0 > 0x7f1464f2c250: 
7f1464f2c9f0 7f1464f2c3d0 > 0x7f1464f2c260: 7f1464f2c3a0 7f146005b620 > 0x7f1464f2c270: 7ef8b843d7c8 00020006 > 0x7f1464f2c280: 7f1464f2c340 7f14600bb640 > 0x7f1464f2c290: 17417f1453fb9cec 7f1453fb > 0x7f1464f2c2a0: 7f1453fb819e 7f1464f2c3a0 > 0x7f1464f2c2b0: 0001 > 0x7f1464f2c2c0: 7f1464f2c3d0 7f1464f2c9d0 > 0x7f1464f2c2d0: 7f1464f2c340 7f1467025
[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0
[ https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417519#comment-16417519 ] Nathan Kleyn commented on SPARK-23801: -- [~kiszk] We'll have our best go at trying to narrow down the job to something that fails. Since upgrading to Spark v2.3.0 a lot of the stages are listed as "ThreadPoolExecutor" now, meaning it's really difficult to actually see what they correspond to in the source. As this job is failing around the 60th stage, that's a lot of things it could be! Is there any advice on the best way to correlate these "ThreadPoolExecutor" stages with the source code now? > Consistent SIGSEGV after upgrading to Spark v2.3.0 > -- > > Key: SPARK-23801 > URL: https://issues.apache.org/jira/browse/SPARK-23801 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 > Environment: Mesos coarse grained executor > 18 * r3.4xlarge (16 core boxes) with 105G of executor memory >Reporter: Nathan Kleyn >Priority: Major > Attachments: spark-executor-failure.coredump.log > > > After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent > segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of > executor memory). I've attached the full coredump but here is an except: > {code:java} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build > 1.8.0_161-b12) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode > linux-amd64 ) > # Problematic frame: > # V [libjvm.so+0x995fdc] oopDesc* > PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c > # > # Core dump written. 
Default location: > /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core > or core.1315 > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > #{code} > {code:java} > --- T H R E A D --- > Current thread (0x7f146005b000): GCTaskThread [stack: > 0x7f1464e2d000,0x7f1464f2e000] [id=1363] > siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL) > RIP=0x7f1467427fdc, EFLAGS=0x00010202 > [full register values and raw stack-memory dump omitted here; the extract was truncated, and the complete dump is in the attached spark-executor-failure.coredump.log]{code}
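A hedged aside on the crash signature above, not a diagnosis: a SIGSEGV inside GC code such as PSPromotionManager::copy_to_survivor_space frequently points at a corrupted heap rather than a collector bug, and generated (whole-stage codegen) code is one common source of corruption worth ruling out. Disabling whole-stage code generation is a purely diagnostic toggle while bisecting:

```
# spark-defaults.conf -- diagnostic only; an assumption to test, not a confirmed fix
spark.sql.codegen.wholeStage false
```

On correlating "ThreadPoolExecutor" stages with source: such stages are typically submitted from broadcast-exchange worker threads, so they inherit the pool's call site. Calling SparkContext.setJobDescription (or setJobGroup) from the driver before each logical step gives the resulting jobs and stages recognizable names in the UI.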
[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417474#comment-16417474 ] Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 3:00 PM: -- I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: {code:java} private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code} which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. Before worker dies: !screen_shot_2018-03-20_at_15.23.29.png! Heap dump of worker running for some time: !Screen Shot 2018-03-28 at 16.44.20.png! was (Author: akorzhuev): I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. 
Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: {code:java} private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code} which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png!
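The `loadedMaps` line quoted in the comment above is keyed by state version, and a new version is produced every micro-batch; with nothing evicting old versions, the map can only grow. As an illustration only — plain JDK collections with hypothetical names, not Spark's actual provider code — capping the number of retained versions would look roughly like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keep at most 'maxVersions' state-map versions in
// memory, evicting the least recently used one when the cap is exceeded.
// Stands in for the unbounded HashMap[Long, MapType] quoted above.
class BoundedVersionCache<V> extends LinkedHashMap<Long, V> {
    private final int maxVersions;

    BoundedVersionCache(int maxVersions) {
        super(16, 0.75f, true); // access-order: gets refresh an entry's recency
        this.maxVersions = maxVersions;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
        // Called after each put; returning true drops the stalest version.
        return size() > maxVersions;
    }
}
```

For the dropDuplicates case in the issue description, the documented way to keep per-key state bounded is an event-time watermark (`withWatermark(...)` before `dropDuplicates(...)`), which lets the store drop keys older than the watermark; the sketch above only illustrates why bounding retained versions stops the map from growing without limit.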
[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Korzhuev updated SPARK-23682: Attachment: screen_shot_2018-03-20_at_15.23.29.png > Memory issue with Spark structured streaming > > > Key: SPARK-23682 > URL: https://issues.apache.org/jira/browse/SPARK-23682 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 > Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3 > |spark.blacklist.decommissioning.enabled|true| > |spark.blacklist.decommissioning.timeout|1h| > |spark.cleaner.periodicGC.interval|10min| > |spark.default.parallelism|18| > |spark.dynamicAllocation.enabled|false| > |spark.eventLog.enabled|true| > |spark.executor.cores|3| > |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 > -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'| > |spark.executor.id|driver| > |spark.executor.instances|3| > |spark.executor.memory|22G| > |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2| > |spark.hadoop.parquet.enable.summary-metadata|false| > |spark.hadoop.yarn.timeline-service.enabled|false| > |spark.jars| | > |spark.master|yarn| > |spark.memory.fraction|0.9| > |spark.memory.storageFraction|0.3| > |spark.memory.useLegacyMode|false| > |spark.rdd.compress|true| > |spark.resourceManager.cleanupExpiredHost|true| > |spark.scheduler.mode|FIFO| > |spark.serializer|org.apache.spark.serializer.KryoSerializer| > |spark.shuffle.service.enabled|true| > |spark.speculation|false| > |spark.sql.parquet.filterPushdown|true| > |spark.sql.parquet.mergeSchema|false| > |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse| > |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true| > |spark.submit.deployMode|client| > |spark.yarn.am.cores|1| > |spark.yarn.am.memory|2G| > |spark.yarn.am.memoryOverhead|1G| > 
|spark.yarn.executor.memoryOverhead|3G| >Reporter: Yuriy Bondaruk >Priority: Major > Labels: Memory, memory, memory-leak > Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot > 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen > Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, > Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, > screen_shot_2018-03-20_at_15.23.29.png > > > It seems like there is an issue with memory in structured streaming. A stream > with aggregation (dropDuplicates()) and data partitioning constantly > increases memory usage and finally executors fails with exit code 137: > {quote}ExecutorLostFailure (executor 2 exited caused by one of the running > tasks) Reason: Container marked as failed: > container_1520214726510_0001_01_03 on host: > ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: > Container killed on request. Exit code is 137 > Container exited with a non-zero exit code 137 > Killed by external signal{quote} > Stream creating looks something like this: > {quote}session > .readStream() > .schema(inputSchema) > .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) > .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) > .csv("s3://test-bucket/input") > .as(Encoders.bean(TestRecord.class)) > .flatMap(mf, Encoders.bean(TestRecord.class)) > .dropDuplicates("testId", "testName") > .withColumn("year", > functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), > "")) > .writeStream() > .option("path", "s3://test-bucket/output") > .option("checkpointLocation", "s3://test-bucket/checkpoint") > .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS)) > .partitionBy("year") > .format("parquet") > .outputMode(OutputMode.Append()) > .queryName("test-stream") > .start();{quote} > Analyzing the heap dump I found that most of the memory used by > {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}} > that 
is referenced from > [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196] > > On the first glance it looks normal since that is how Spark keeps aggregation > keys in memory. However I did my testing by renaming files in source folder, > so that they could be picked up by spark again. Since input records are the > same all further rows should be rejected as duplicates and memory consumption > shouldn't increase but it's not true. Moreover, GC time took more than 30% of > total processing time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417474#comment-16417474 ] Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:56 PM: -- I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: {code:java} private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code} which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png! was (Author: akorzhuev): I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. 
Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: {code:java} private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code} which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png!
[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417474#comment-16417474 ] Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:55 PM: -- I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: {code:java} private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code} which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png! was (Author: akorzhuev): I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. 
Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: _private lazy val loadedMaps = new mutable.HashMap[Long, MapType]_ which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png!
[jira] [Commented] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417474#comment-16417474 ] Andrew Korzhuev commented on SPARK-23682: - I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on: * AWS S3 checkpoint * Spark 2.3.0 on k8s * Structured stream-stream join I managed to track the leak down to [https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]: _private lazy val loadedMaps = new mutable.HashMap[Long, MapType]_ which appears not to clean up the _UnsafeRow_ entries coming from: {code:java} type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]{code} I noticed that memory leaks more slowly if data is buffered to disk: {code:java} spark.hadoop.fs.s3a.fast.upload true spark.hadoop.fs.s3a.fast.upload.buffer disk {code} It also seems that the state persisted to S3 is never cleaned up, as both the number of objects and the volume grow indefinitely. !Screen Shot 2018-03-28 at 16.44.20.png! 
[jira] [Commented] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417466#comment-16417466 ] Thomas Graves commented on SPARK-22618: --- I'll file a separate JIRA for it and put up a PR > RDD.unpersist can cause fatal exception when used with dynamic allocation > - > > Key: SPARK-22618 > URL: https://issues.apache.org/jira/browse/SPARK-22618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Brad >Assignee: Brad >Priority: Minor > Fix For: 2.3.0 > > > If you use rdd.unpersist() with dynamic allocation, then an executor can be > deallocated while your rdd is being removed, which will throw an uncaught > exception killing your job. > I looked into different ways of preventing this error from occurring but > couldn't come up with anything that wouldn't require a big change. I propose > the best fix is just to catch and log IOExceptions in unpersist() so they > don't kill your job. This will match the effective behavior when executors > are lost from dynamic allocation in other parts of the code. > In the worst case scenario I think this could lead to RDD partitions getting > left on executors after they were unpersisted, but this is probably better > than the whole job failing. I think in most cases the IOException would be > due to the executor dying for some reason, which is effectively the same > result as unpersisting the rdd from that executor anyway. > I noticed this exception in a job that loads a 100GB dataset on a cluster > where we use dynamic allocation heavily. 
Here is the relevant stack trace > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Exception in thread "main" org.apache.spark.SparkException: Exception thrown > in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131) > at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806) > at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217) > at > com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62) > at > com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40) > at > 
com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33) > at > com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78) > at > com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially(SuiteKickoff.
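The fix proposed in the issue above amounts to making unpersist best-effort: catch the IOException from a lost executor, log it, and let the job continue. A self-contained sketch of that pattern (hypothetical names, not the actual Spark patch):

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hypothetical sketch of best-effort cleanup: an IOException from a
// deallocated executor is logged and swallowed instead of failing the job.
class BestEffortCleanup {
    private static final Logger LOG = Logger.getLogger(BestEffortCleanup.class.getName());

    // Stands in for the block-removal RPC that may fail mid-flight when
    // dynamic allocation removes an executor.
    @FunctionalInterface
    interface RemovalAction {
        void run() throws IOException;
    }

    // Returns true if removal succeeded, false if it was skipped on error.
    static boolean tryUnpersist(RemovalAction removal) {
        try {
            removal.run();
            return true;
        } catch (IOException e) {
            LOG.warning("Ignoring IOException during unpersist: " + e.getMessage());
            return false;
        }
    }
}
```

The worst case, as the report itself notes, is partitions lingering on an executor that was about to disappear anyway — the same end state as a successful unpersist on a dying executor.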
[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Korzhuev updated SPARK-23682: Attachment: Screen Shot 2018-03-28 at 16.44.20.png > Memory issue with Spark structured streaming > > > Key: SPARK-23682 > URL: https://issues.apache.org/jira/browse/SPARK-23682 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 > Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3 > |spark.blacklist.decommissioning.enabled|true| > |spark.blacklist.decommissioning.timeout|1h| > |spark.cleaner.periodicGC.interval|10min| > |spark.default.parallelism|18| > |spark.dynamicAllocation.enabled|false| > |spark.eventLog.enabled|true| > |spark.executor.cores|3| > |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails > -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 > -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'| > |spark.executor.id|driver| > |spark.executor.instances|3| > |spark.executor.memory|22G| > |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2| > |spark.hadoop.parquet.enable.summary-metadata|false| > |spark.hadoop.yarn.timeline-service.enabled|false| > |spark.jars| | > |spark.master|yarn| > |spark.memory.fraction|0.9| > |spark.memory.storageFraction|0.3| > |spark.memory.useLegacyMode|false| > |spark.rdd.compress|true| > |spark.resourceManager.cleanupExpiredHost|true| > |spark.scheduler.mode|FIFO| > |spark.serializer|org.apache.spark.serializer.KryoSerializer| > |spark.shuffle.service.enabled|true| > |spark.speculation|false| > |spark.sql.parquet.filterPushdown|true| > |spark.sql.parquet.mergeSchema|false| > |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse| > |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true| > |spark.submit.deployMode|client| > |spark.yarn.am.cores|1| > |spark.yarn.am.memory|2G| > |spark.yarn.am.memoryOverhead|1G| > 
|spark.yarn.executor.memoryOverhead|3G| >Reporter: Yuriy Bondaruk >Priority: Major > Labels: Memory, memory, memory-leak > Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot > 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, > Spark executors GC time.png, image-2018-03-22-14-46-31-960.png > > > There seems to be a memory issue in Structured Streaming. A stream > with aggregation (dropDuplicates()) and data partitioning constantly > increases memory usage, and the executors finally fail with exit code 137: > {quote}ExecutorLostFailure (executor 2 exited caused by one of the running > tasks) Reason: Container marked as failed: > container_1520214726510_0001_01_03 on host: > ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: > Container killed on request. Exit code is 137 > Container exited with a non-zero exit code 137 > Killed by external signal{quote} > Stream creation looks something like this: > {quote}session > .readStream() > .schema(inputSchema) > .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) > .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) > .csv("s3://test-bucket/input") > .as(Encoders.bean(TestRecord.class)) > .flatMap(mf, Encoders.bean(TestRecord.class)) > .dropDuplicates("testId", "testName") > .withColumn("year", > functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), > "")) > .writeStream() > .option("path", "s3://test-bucket/output") > .option("checkpointLocation", "s3://test-bucket/checkpoint") > .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS)) > .partitionBy("year") > .format("parquet") > .outputMode(OutputMode.Append()) > .queryName("test-stream") > .start();{quote} > Analyzing the heap dump, I found that most of the memory is used by > {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}}, > which is referenced from > 
[StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196] > > At first glance this looks normal, since that is how Spark keeps aggregation > keys in memory. However, I did my testing by renaming files in the source folder > so that they would be picked up by Spark again. Since the input records are the > same, all further rows should be rejected as duplicates and memory consumption > shouldn't increase, but that is not what happens. Moreover, GC time took more than 30% of > total processing time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
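The behavior described above can be modeled outside Spark. The sketch below is a toy stand-in (not Spark's actual HDFSBackedStateStoreProvider) showing why dropDuplicates() state is retention-only: every distinct key ever seen is kept, and without a watermark nothing is ever evicted, so replayed duplicates emit no rows yet the state never shrinks.

```python
# Toy model (hypothetical, not Spark's state store) of streaming dedup state:
# distinct keys accumulate forever unless a watermark allows eviction.

class DedupState:
    """Minimal stand-in for a streaming deduplication state store."""

    def __init__(self):
        self.keys = set()  # grows monotonically; no eviction without a watermark

    def process(self, key):
        """Return True if the row is new (and should be emitted), False if a duplicate."""
        if key in self.keys:
            return False
        self.keys.add(key)
        return True

state = DedupState()
batch = [("id1", "nameA"), ("id2", "nameB"), ("id1", "nameA")]
emitted = [k for k in batch if state.process(k)]   # two distinct keys pass through

# Replaying identical input (e.g. renamed source files) emits nothing new,
# but the keys stay in state forever; bounding the state requires a watermark.
replay = [k for k in batch if state.process(k)]
```

In real Spark, adding `withWatermark` before `dropDuplicates` is what lets the engine drop state for keys older than the watermark; without it, state (and the in-memory versions the 2.2 state store keeps) can only grow.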
[jira] [Commented] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
[ https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417429#comment-16417429 ] Apache Spark commented on SPARK-23040: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/20920 > BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator > or ordering is specified > > > Key: SPARK-23040 > URL: https://issues.apache.org/jira/browse/SPARK-23040 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 2.4.0 > > > For example, if ordering is specified, the returned iterator is a > CompletionIterator > {code:java} > dep.keyOrdering match { > case Some(keyOrd: Ordering[K]) => > // Create an ExternalSorter to sort the data. > val sorter = > new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), > serializer = dep.serializer) > sorter.insertAll(aggregatedIter) > context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) > context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) > > context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) > CompletionIterator[Product2[K, C], Iterator[Product2[K, > C]]](sorter.iterator, sorter.stop()) > case None => > aggregatedIter > } > {code} > However, the sorter consumes the aggregatedIter (which may be interruptible) > in sorter.insertAll, and then creates an iterator which isn't interruptible. > The problem is that the Spark task then cannot be cancelled on stage failure > (unless interruptThread is enabled, which it is not by default), which > wastes executor resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
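The interruptibility issue above can be illustrated without Spark: an iterator that re-checks a cancellation flag before yielding each element stops promptly when the task is killed, whereas an iterator produced only after eagerly draining its input (as the sorter does) cannot. This is a hypothetical Python sketch; `TaskContext` here is a stand-in, not Spark's API.

```python
class TaskContext:
    """Stand-in for Spark's task context; 'interrupted' is set on cancellation."""
    def __init__(self):
        self.interrupted = False

class InterruptibleIterator:
    """Checks for cancellation before yielding each element, so a killed
    task stops mid-stream instead of running its input to completion."""
    def __init__(self, context, delegate):
        self.context = context
        self.delegate = iter(delegate)

    def __iter__(self):
        return self

    def __next__(self):
        if self.context.interrupted:
            raise InterruptedError("task was cancelled")
        return next(self.delegate)

ctx = TaskContext()
it = InterruptibleIterator(ctx, range(1000))
first = next(it)        # normal consumption
ctx.interrupted = True  # simulate a task-kill request
try:
    next(it)
    stopped = False
except InterruptedError:
    stopped = True      # iteration aborts instead of draining the rest
```

Spark's actual fix wraps the sorter's output in a similarly interruption-aware iterator, so the check happens per element on the already-sorted output as well.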
[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x
[ https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417293#comment-16417293 ] Darek commented on SPARK-19076: --- It's done and has passed all the tests; it just needs to be merged in. I'm not sure who can merge it; I have been asking for a while now, but no one is willing to step in and merge it. If you know anyone who can merge it, that would be a great help. Thanks > Upgrade Hive dependence to Hive 2.x > --- > > Key: SPARK-19076 > URL: https://issues.apache.org/jira/browse/SPARK-19076 > Project: Spark > Issue Type: Improvement >Reporter: Dapeng Sun >Priority: Major > > Currently upstream Spark depends on Hive 1.2.1 to build its package, while Hive > 2.0 was released in February 2016 and Hive 2.0.1 and 2.1.0 have also been out > for a long time; on the Spark side, it would be better to support Hive 2.0 and above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x
[ https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417291#comment-16417291 ] Maciej Bryński commented on SPARK-19076: So it's not resolved, but in progress. > Upgrade Hive dependence to Hive 2.x > --- > > Key: SPARK-19076 > URL: https://issues.apache.org/jira/browse/SPARK-19076 > Project: Spark > Issue Type: Improvement >Reporter: Dapeng Sun >Priority: Major > > Currently upstream Spark depends on Hive 1.2.1 to build its package, while Hive > 2.0 was released in February 2016 and Hive 2.0.1 and 2.1.0 have also been out > for a long time; on the Spark side, it would be better to support Hive 2.0 and above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x
[ https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417284#comment-16417284 ] Darek commented on SPARK-19076: --- [PR 20659|https://github.com/apache/spark/pull/20659] for this issue already exists; it just needs to be merged into master. > Upgrade Hive dependence to Hive 2.x > --- > > Key: SPARK-19076 > URL: https://issues.apache.org/jira/browse/SPARK-19076 > Project: Spark > Issue Type: Improvement >Reporter: Dapeng Sun >Priority: Major > > Currently upstream Spark depends on Hive 1.2.1 to build its package, while Hive > 2.0 was released in February 2016 and Hive 2.0.1 and 2.1.0 have also been out > for a long time; on the Spark side, it would be better to support Hive 2.0 and above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23765) Supports line separator for json datasource
[ https://issues.apache.org/jira/browse/SPARK-23765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23765. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20877 [https://github.com/apache/spark/pull/20877] > Supports line separator for json datasource > --- > > Key: SPARK-23765 > URL: https://issues.apache.org/jira/browse/SPARK-23765 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > Same as its parent but this JIRA is specific to JSON datasource. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23765) Supports line separator for json datasource
[ https://issues.apache.org/jira/browse/SPARK-23765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23765: --- Assignee: Hyukjin Kwon > Supports line separator for json datasource > --- > > Key: SPARK-23765 > URL: https://issues.apache.org/jira/browse/SPARK-23765 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Same as its parent but this JIRA is specific to JSON datasource. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23565) Improved error message for when the number of sources for a query changes
[ https://issues.apache.org/jira/browse/SPARK-23565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417206#comment-16417206 ] Patrick McGloin commented on SPARK-23565: - I would like to work on this. > Improved error message for when the number of sources for a query changes > - > > Key: SPARK-23565 > URL: https://issues.apache.org/jira/browse/SPARK-23565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Patrick McGloin >Priority: Minor > > If you change the number of sources for a Structured Streaming query then you > will get an assertion error as the number of sources in the checkpoint does > not match the number of sources in the query that is starting. This can > happen if, for example, you add a union to the input of the query. This is > of course correct but the error is a bit cryptic and requires investigation. > Suggestion for a more informative error message => > The number of sources for this query has changed. There are [x] sources in > the checkpoint offsets and now there are [y] sources requested by the query. > Cannot continue. > This is the current message. 
> 02-03-2018 13:14:22 ERROR StreamExecution:91 - Query ORPositionsState to > Kafka [id = 35f71e63-dbd0-49e9-98b2-a4c72a7da80e, runId = > d4439aca-549c-4ef6-872e-29fbfde1df78] terminated with error > java.lang.AssertionError: assertion failed at > scala.Predef$.assert(Predef.scala:156) at > org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:38) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$populateStartOffsets(StreamExecution.scala:429) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:297) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
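The suggested wording could replace the bare `assertion failed` at the point where the stack trace above originates. This is an illustrative Python sketch of the check, not the actual Scala code in `OffsetSeq.toStreamProgress`:

```python
def validate_source_count(checkpoint_offsets, query_sources):
    """Raise the suggested informative error instead of a bare assertion
    when the checkpoint and the starting query disagree on source count."""
    if len(checkpoint_offsets) != len(query_sources):
        raise ValueError(
            "The number of sources for this query has changed. "
            f"There are {len(checkpoint_offsets)} sources in the checkpoint offsets "
            f"and now there are {len(query_sources)} sources requested by the query. "
            "Cannot continue.")

# Example: a union added a second source after the checkpoint was written.
try:
    validate_source_count(["kafka-source"], ["kafka-source", "file-source"])
    msg = None
except ValueError as e:
    msg = str(e)
```

The counts in the message come straight from the two lists, so the user immediately sees what changed rather than having to investigate an opaque `java.lang.AssertionError`.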
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417149#comment-16417149 ] Marco Gaido commented on SPARK-23782: - [~vanzin] sorry, but I cannot see any usability issue with different users listing different things. This is normal in every system with users: you see only what you are allowed to see. Otherwise there is little point in having users at all, if they can all see and do the same things. I commented on the PR, since some people commented there too, describing some use cases I consider critical for this. Thanks. > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but anybody can still list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > When clicking on an application, the proper permissions are applied and > each user can see only the applications they have read permission for. > Instead, each user should see only the applications they have permission to > read, and should not be able to see applications for which they lack > permission. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
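The proposed behavior amounts to applying the same per-application read ACLs to the listing itself. A hypothetical Python sketch (names and ACL layout are illustrative, not the SHS implementation), using the admin/u1/u2 example from the ticket:

```python
def visible_applications(apps, user, admin_acl, view_acls):
    """Filter the history-server listing so a user sees only applications
    they can read: their own, or ones whose view ACL (or the admin ACL)
    includes them."""
    def can_read(app):
        return (user == app["spark_user"]
                or user in admin_acl
                or user in view_acls.get(app["id"], set()))
    return [app for app in apps if can_read(app)]

apps = [
    {"id": "app-123456789", "name": "The Admin App", "spark_user": "admin"},
    {"id": "app-123456790", "name": "u1 secret app", "spark_user": "u1"},
    {"id": "app-123456791", "name": "u2 secret app", "spark_user": "u2"},
]
admin_acl = {"admin"}
u1_view = visible_applications(apps, "u1", admin_acl, {})       # only u1's app
admin_view = visible_applications(apps, "admin", admin_acl, {})  # all three
```

With this filter, u1 sees one row instead of three, matching the permissions already enforced when opening an application's detail page.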
[jira] [Assigned] (SPARK-23805) support vector-size validation and Inference
[ https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23805: Assignee: Apache Spark > support vector-size validation and Inference > > > Key: SPARK-23805 > URL: https://issues.apache.org/jira/browse/SPARK-23805 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > > I think it may be meaningful to unify the usage of \{{AttributeGroup}} and > support vector-size validation and inference in algorithms. > My thoughts are: > * In \{{transformSchema}}, validate the input vector-size if possible. If > the input vector-size can be obtained from the schema, check it. > ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will > require the vector-size to be at least 4. > ** Suppose a \{{PCAModel}} trained with vectors of length 10: > \{{transformSchema}} will require the vector-size to be 10. > * In \{{transformSchema}}, infer the output vector-size if possible. > ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will > return a schema with output vector-size=4. > ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will > return a schema with output vector-size=4. > * In \{{transform}}, infer the output vector-size if possible. > * In \{{fit}}, obtain the input vector-size from the schema if possible. This > can help eliminate redundant \{{first}} jobs. > > The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate my > idea. Since the validation and inference are quite algorithm-specific, we may need > to separate the task into several small subtasks. > What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23805) support vector-size validation and Inference
[ https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23805: Assignee: (was: Apache Spark) > support vector-size validation and Inference > > > Key: SPARK-23805 > URL: https://issues.apache.org/jira/browse/SPARK-23805 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Major > > I think it may be meaningful to unify the usage of \{{AttributeGroup}} and > support vector-size validation and inference in algorithms. > My thoughts are: > * In \{{transformSchema}}, validate the input vector-size if possible. If > the input vector-size can be obtained from the schema, check it. > ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will > require the vector-size to be at least 4. > ** Suppose a \{{PCAModel}} trained with vectors of length 10: > \{{transformSchema}} will require the vector-size to be 10. > * In \{{transformSchema}}, infer the output vector-size if possible. > ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will > return a schema with output vector-size=4. > ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will > return a schema with output vector-size=4. > * In \{{transform}}, infer the output vector-size if possible. > * In \{{fit}}, obtain the input vector-size from the schema if possible. This > can help eliminate redundant \{{first}} jobs. > > The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate my > idea. Since the validation and inference are quite algorithm-specific, we may need > to separate the task into several small subtasks. > What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23805) support vector-size validation and Inference
[ https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417088#comment-16417088 ] Apache Spark commented on SPARK-23805: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/20918 > support vector-size validation and Inference > > > Key: SPARK-23805 > URL: https://issues.apache.org/jira/browse/SPARK-23805 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Major > > I think it may be meaningful to unify the usage of \{{AttributeGroup}} and > support vector-size validation and inference in algorithms. > My thoughts are: > * In \{{transformSchema}}, validate the input vector-size if possible. If > the input vector-size can be obtained from the schema, check it. > ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will > require the vector-size to be at least 4. > ** Suppose a \{{PCAModel}} trained with vectors of length 10: > \{{transformSchema}} will require the vector-size to be 10. > * In \{{transformSchema}}, infer the output vector-size if possible. > ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will > return a schema with output vector-size=4. > ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will > return a schema with output vector-size=4. > * In \{{transform}}, infer the output vector-size if possible. > * In \{{fit}}, obtain the input vector-size from the schema if possible. This > can help eliminate redundant \{{first}} jobs. > > The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate my > idea. Since the validation and inference are quite algorithm-specific, we may need > to separate the task into several small subtasks. > What do you think about this? 
[~srowen] [~yanboliang] [~WeichenXu123] [~mlnick] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23805) support vector-size validation and Inference
zhengruifeng created SPARK-23805: Summary: support vector-size validation and Inference Key: SPARK-23805 URL: https://issues.apache.org/jira/browse/SPARK-23805 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.0 Reporter: zhengruifeng I think it may be meaningful to unify the usage of \{{AttributeGroup}} and support vector-size validation and inference in algorithms. My thoughts are: * In \{{transformSchema}}, validate the input vector-size if possible. If the input vector-size can be obtained from the schema, check it. ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will require the vector-size to be at least 4. ** Suppose a \{{PCAModel}} trained with vectors of length 10: \{{transformSchema}} will require the vector-size to be 10. * In \{{transformSchema}}, infer the output vector-size if possible. ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will return a schema with output vector-size=4. ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will return a schema with output vector-size=4. * In \{{transform}}, infer the output vector-size if possible. * In \{{fit}}, obtain the input vector-size from the schema if possible. This can help eliminate redundant \{{first}} jobs. The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate my idea. Since the validation and inference are quite algorithm-specific, we may need to separate the task into several small subtasks. What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
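The proposal above can be sketched in hypothetical Python. Spark ML actually records vector sizes in `AttributeGroup` metadata on the schema; here a plain dict of column name to known vector size (`None` for unknown) stands in, and the constraint assumed is Spark PCA's requirement that k not exceed the input dimension.

```python
def pca_transform_schema(schema, input_col, output_col, k):
    """Validate the input vector size when the schema knows it, and record
    the inferred output size (k) so downstream stages can rely on it
    without triggering a 'first' job to inspect the data."""
    in_size = schema.get(input_col)          # None means the size is unknown here
    if in_size is not None and in_size < k:
        raise ValueError(
            f"PCA with k={k} requires input vectors of size at least {k}, "
            f"but column '{input_col}' has size {in_size}")
    out = dict(schema)
    out[output_col] = k                      # inferred output vector size
    return out

# A PCA estimator with k=4 over 10-dimensional input vectors:
out_schema = pca_transform_schema({"features": 10}, "features", "pca_features", 4)
```

The same pattern generalizes per algorithm: validate what the schema already knows, and propagate whatever size can be inferred, which is why the ticket suggests splitting the work into algorithm-specific subtasks.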
[jira] [Assigned] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23705: Assignee: (was: Apache Spark)

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Khoa Tran
> Priority: Minor
> Labels: beginner, easyfix, features, newbie, starter
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> {code:java}
> package org.apache.spark.sql
> ...
> class Dataset[T] private[sql](
>   ...
>   def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>     val colNames: Seq[String] = col1 +: cols
>     RelationalGroupedDataset(
>       toDF(), colNames.map(colName => resolve(colName)),
>       RelationalGroupedDataset.GroupByType)
>   }
> {code}
> A `.distinct` should be appended to `colNames` before it is used in `groupBy`.
>
> Not sure whether the community agrees with this, or whether it is up to the users to perform the distinct operation.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
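The suggested fix is a one-liner. The sketch below is a hypothetical, Spark-free stand-in ({{groupByColNames}} is not a real Spark function) for the column-name handling inside {{Dataset.groupBy}}, with the proposed {{.distinct}} applied:

```scala
// Sketch of the proposed change: de-duplicate the grouping columns before
// resolving them, so groupBy("a", "b", "a") behaves like groupBy("a", "b").
// groupByColNames is a hypothetical stand-in for the logic in Dataset.groupBy.
def groupByColNames(col1: String, cols: String*): Seq[String] =
  (col1 +: cols).distinct
```

The trade-off the reporter raises is whether this silent de-duplication belongs in Spark itself or whether callers should be expected to pass distinct names.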
[jira] [Assigned] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23705: Assignee: Apache Spark
[jira] [Commented] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417064#comment-16417064 ] Apache Spark commented on SPARK-23705: User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/20917