[jira] [Assigned] (SPARK-20264) asm should be non-test dependency in sql/core
[ https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20264: Assignee: Reynold Xin (was: Apache Spark) > asm should be non-test dependency in sql/core > - > > Key: SPARK-20264 > URL: https://issues.apache.org/jira/browse/SPARK-20264 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > The sql/core module currently declares asm as a test-scope dependency. It > should actually be a normal dependency, since the core module already defines > it and sql/core picks it up transitively. This occasionally confuses IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20264) asm should be non-test dependency in sql/core
[ https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20264: Assignee: Apache Spark (was: Reynold Xin) > asm should be non-test dependency in sql/core > - > > Key: SPARK-20264 > URL: https://issues.apache.org/jira/browse/SPARK-20264 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > The sql/core module currently declares asm as a test-scope dependency. It > should actually be a normal dependency, since the core module already defines > it and sql/core picks it up transitively. This occasionally confuses IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20264) asm should be non-test dependency in sql/core
[ https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961700#comment-15961700 ] Apache Spark commented on SPARK-20264: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17574 > asm should be non-test dependency in sql/core > - > > Key: SPARK-20264 > URL: https://issues.apache.org/jira/browse/SPARK-20264 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > The sql/core module currently declares asm as a test-scope dependency. It > should actually be a normal dependency, since the core module already defines > it and sql/core picks it up transitively. This occasionally confuses IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20264) asm should be non-test dependency in sql/core
Reynold Xin created SPARK-20264: --- Summary: asm should be non-test dependency in sql/core Key: SPARK-20264 URL: https://issues.apache.org/jira/browse/SPARK-20264 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin The sql/core module currently declares asm as a test-scope dependency. It should actually be a normal dependency, since the core module already defines it and sql/core picks it up transitively. This occasionally confuses IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
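The gist of the fix is a one-line scope change in sql/core's build. A minimal sbt-style sketch of the before/after (the real change was in sql/core's pom.xml; the xbean-asm5-shaded coordinates and version below are assumptions for illustration):

{code}
// Before: asm is resolvable by test code only, even though main sources
// see it transitively through the core module -- this is what confuses
// IntelliJ's dependency resolution.
libraryDependencies += "org.apache.xbean" % "xbean-asm5-shaded" % "4.4" % Test

// After: a normal (compile-scope) dependency, consistent with what the
// core module already defines.
libraryDependencies += "org.apache.xbean" % "xbean-asm5-shaded" % "4.4"
{code}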
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961687#comment-15961687 ] Wenchen Fan commented on SPARK-18055: - I think this is a different issue, can you open a new ticket for it? Thanks! > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on a Dataset column which is of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesn't work on the Dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
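For anyone hitting this before the fix versions above, a slightly fuller sketch of the explode() workaround (column names taken from the schema in the report; the alias is illustrative):

{code}
import org.apache.spark.sql.functions.{col, explode}

// ds.flatMap(_.outputs) needs an encoder, whose TypeTag is resolved in a
// JavaMirror that cannot see classes from the user jar. explode() is a
// Catalyst expression and never touches Scala reflection for com.A.B:
val flattened = ds.select(col("id"), explode(col("outputs")).as("output"))
{code}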
[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets
[ https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961686#comment-15961686 ] Wenchen Fan commented on SPARK-19352: - I don't think Spark will provide API support for this feature (does Hive really have it?), but the implementation is quite stable now, so you can follow the example in this ticket to write out sorted data. > Sorting issues on relatively big datasets > - > > Key: SPARK-19352 > URL: https://issues.apache.org/jira/browse/SPARK-19352 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: Spark version 2.1.0 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102 > macOS 10.12.3 >Reporter: Ivan Gozali > > _More details, including the script to generate the synthetic dataset > (requires pandas and numpy) are in this GitHub gist._ > https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f > Given a relatively large synthetic time series dataset of various users > (4.1GB), when attempting to: > * partition this dataset by user ID > * sort the time series data for each user by timestamp > * write each partition to a single CSV file > then some files are unsorted in a very specific manner. In one of the > supposedly sorted files, the rows looked as follows: > {code} > 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39 > 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22 > 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47 > 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14 > 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24 > 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43 > {code} > The above is attempted using the following Scala/Spark code: > {code} > val inpth = "/tmp/gen_data_3cols_small" > spark > .read > .option("inferSchema", "true") > .option("header", "true") > .csv(inpth) > .repartition($"userId") > .sortWithinPartitions("timestamp") > .write > .partitionBy("userId") > .option("header", "true") > .csv(inpth + "_sorted") > {code} > This issue is not seen when using a smaller sized dataset by making the time > span smaller (354MB, with the same number of columns). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
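To make the "example in this ticket" concrete: the pattern that preserves order is to lead the within-partition sort with the partition column itself, so the file writer recognizes the data as already ordered and does not re-sort (and spill-merge) on userId alone. A sketch against the reporter's code, with inpth as defined there (this relies on the writer's required-ordering check discussed in this ticket):

{code}
import spark.implicits._

spark
  .read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(inpth)
  .repartition($"userId")
  // Partition column first, then timestamp: rows for each user stay
  // contiguous and ordered even if the task spills.
  .sortWithinPartitions($"userId", $"timestamp")
  .write
  .partitionBy("userId")
  .option("header", "true")
  .csv(inpth + "_sorted")
{code}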
[jira] [Closed] (SPARK-20262) AssertNotNull should throw NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-20262. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 > AssertNotNull should throw NullPointerException > --- > > Key: SPARK-20262 > URL: https://issues.apache.org/jira/browse/SPARK-20262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.2, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961677#comment-15961677 ] Hyukjin Kwon commented on SPARK-20259: -- Could you describe the current status and why it should be like that? > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two dataframes loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load().toDS > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load().toDS > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY p.name -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
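Until something like this lands in Catalyst, the standard way to run the join on the database side is to hand the JDBC source a parenthesized subquery as its dbtable, which the JDBC data source already accepts. A sketch under the same Northwind assumptions as the report:

{code}
// PostgreSQL executes the join and aggregation; Spark reads the result.
val ordersByProduct = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable",
    """(SELECT p.name, SUM(o.qty) AS qty
      |   FROM northwind.orders o
      |   JOIN northwind.product p ON o.product_id = p.product_id
      |  GROUP BY p.name) AS t""".stripMargin)
  .option("user", "username")
  .option("password", "password")
  .load()
{code}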
[jira] [Commented] (SPARK-19935) SparkSQL does not support creating a Hive table mapped to an HBase table
[ https://issues.apache.org/jira/browse/SPARK-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961670#comment-15961670 ] Xiaochen Ouyang commented on SPARK-19935: - Not yet! I tried to work on this issue, but so far it only supports creating and scanning the table. An exception is thrown when running an insert command, because Spark 2.x has no HBase file format for writing to HBase. The code in question is as follows: @transient private lazy val outputFormat = jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] > SparkSQL does not support creating a Hive table mapped to an HBase table > -- > > Key: SPARK-19935 > URL: https://issues.apache.org/jira/browse/SPARK-19935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark2.0.2 >Reporter: Xiaochen Ouyang > > SparkSQL does not support the following command: > CREATE TABLE spark_test(key int, value string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") > TBLPROPERTIES ("hbase.table.name" = "xyz"); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19935) SparkSQL does not support creating a Hive table mapped to an HBase table
[ https://issues.apache.org/jira/browse/SPARK-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961670#comment-15961670 ] Xiaochen Ouyang edited comment on SPARK-19935 at 4/8/17 4:04 AM: - [~wangchao2017] Not yet! I tried to work on this issue, but so far it only supports creating and scanning the table. An exception is thrown when running an insert command, because Spark 2.x has no HBase file format for writing to HBase. The code in question is as follows: @transient private lazy val outputFormat = jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] was (Author: ouyangxc.zte): Not yet! I tried to work on this issue, but so far it only supports creating and scanning the table. An exception is thrown when running an insert command, because Spark 2.x has no HBase file format for writing to HBase. The code in question is as follows: @transient private lazy val outputFormat = jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] > SparkSQL does not support creating a Hive table mapped to an HBase table > -- > > Key: SPARK-19935 > URL: https://issues.apache.org/jira/browse/SPARK-19935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark2.0.2 >Reporter: Xiaochen Ouyang > > SparkSQL does not support the following command: > CREATE TABLE spark_test(key int, value string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") > TBLPROPERTIES ("hbase.table.name" = "xyz"); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
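For context on why the insert path fails: Spark's Hive write container assumes every output format is a HiveOutputFormat, but the HBase storage handler supplies a format that (at least in the Hive versions bundled with Spark 2.x) implements only the plain Hadoop OutputFormat interface, so the cast quoted above throws at insert time. Reformatted, with the failure annotated (the jobConf wrapper is from Spark's Hive writer):

{code}
// Spark's Hive writer:
@transient private lazy val outputFormat =
  jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]

// For an HBase-backed table, getOutputFormat returns Hive's
// HiveHBaseTableOutputFormat, which is not a HiveOutputFormat,
// so the asInstanceOf fails with a ClassCastException.
{code}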
[jira] [Resolved] (SPARK-20246) Should check determinism when pushing predicates down through aggregation
[ https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20246. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.2.0 2.1.2 2.0.3 > Should check determinism when pushing predicates down through aggregation > - > > Key: SPARK-20246 > URL: https://issues.apache.org/jira/browse/SPARK-20246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Weiluo Ren >Assignee: Wenchen Fan > Labels: correctness > Fix For: 2.0.3, 2.1.2, 2.2.0 > > > {code}import org.apache.spark.sql.functions._ > spark.range(1,1000).distinct.withColumn("random", > rand()).filter(col("random") > 0.3).orderBy("random").show{code} > gives a wrong result. > In the optimized logical plan, it shows that the filter with the > non-deterministic predicate is pushed beneath the aggregate operator, which > should not happen. > cc [~lian cheng] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
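Until the fix versions above are available, one way to avoid the bad plan is to put a materialization barrier between the non-deterministic projection and the filter, so the optimizer cannot push the predicate past the aggregate. A hedged sketch (persist() as a workaround is this note's suggestion, not something prescribed in the ticket):

{code}
import org.apache.spark.sql.functions._

// Caching pins the rand() values; the later filter is evaluated against the
// materialized column instead of being pushed beneath the aggregate.
val withRandom = spark.range(1, 1000).distinct
  .withColumn("random", rand())
  .persist()

withRandom.filter(col("random") > 0.3).orderBy("random").show()
{code}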
[jira] [Updated] (SPARK-20263) create empty dataframes in sparkR
[ https://issues.apache.org/jira/browse/SPARK-20263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ott Toomet updated SPARK-20263: --- Priority: Minor (was: Trivial) Description: SparkR 2.1 does not support creating empty dataframes, nor conversion of empty R dataframes to Spark ones: createDataFrame(data.frame(a=integer())) gives Error in takeRDD(x, 1)[[1]] : subscript out of bounds was: Spark 2.1 does not support creating empty dataframes, nor conversion of empty R dataframes to Spark ones: createDataFrame(data.frame(a=integer())) gives Error in takeRDD(x, 1)[[1]] : subscript out of bounds > create empty dataframes in sparkR > - > > Key: SPARK-20263 > URL: https://issues.apache.org/jira/browse/SPARK-20263 > Project: Spark > Issue Type: Wish > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Ott Toomet >Priority: Minor > > SparkR 2.1 does not support creating empty dataframes, nor conversion of > empty R dataframes to Spark ones: > createDataFrame(data.frame(a=integer())) > gives > Error in takeRDD(x, 1)[[1]] : subscript out of bounds -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20263) create empty dataframes in sparkR
Ott Toomet created SPARK-20263: -- Summary: create empty dataframes in sparkR Key: SPARK-20263 URL: https://issues.apache.org/jira/browse/SPARK-20263 Project: Spark Issue Type: Wish Components: SparkR Affects Versions: 2.1.0 Reporter: Ott Toomet Priority: Trivial Spark 2.1 does not support creating empty dataframes, nor conversion of empty R dataframes to Spark ones: createDataFrame(data.frame(a=integer())) gives Error in takeRDD(x, 1)[[1]] : subscript out of bounds -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
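For comparison, this already works through the Scala API, which is essentially what the wish asks SparkR to mirror. A minimal sketch of the Scala analogue of createDataFrame(data.frame(a = integer())):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("a", IntegerType)))
// An empty DataFrame with a declared schema and zero rows.
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
emptyDF.show()  // prints the header only
{code}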
[jira] [Assigned] (SPARK-20262) AssertNotNull should throw NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20262: Assignee: Reynold Xin (was: Apache Spark) > AssertNotNull should throw NullPointerException > --- > > Key: SPARK-20262 > URL: https://issues.apache.org/jira/browse/SPARK-20262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20262) AssertNotNull should throw NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20262: Assignee: Apache Spark (was: Reynold Xin) > AssertNotNull should throw NullPointerException > --- > > Key: SPARK-20262 > URL: https://issues.apache.org/jira/browse/SPARK-20262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20262) AssertNotNull should throw NullPointerException
Reynold Xin created SPARK-20262: --- Summary: AssertNotNull should throw NullPointerException Key: SPARK-20262 URL: https://issues.apache.org/jira/browse/SPARK-20262 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
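The ticket carries no description, but the title implies a small behavioral change: the runtime check generated by AssertNotNull used to surface as a generic RuntimeException, and the change makes it a NullPointerException. A hypothetical repro sketch (the case class and query are illustrative, not from the ticket):

{code}
import spark.implicits._
case class Point(x: Int)  // non-nullable primitive field

// A NULL flowing into a primitive-typed field trips AssertNotNull at
// runtime; with this change the error surfaces as a NullPointerException.
spark.sql("SELECT CAST(NULL AS INT) AS x").as[Point].collect()
{code}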
[jira] [Commented] (SPARK-20262) AssertNotNull should throw NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961584#comment-15961584 ] Apache Spark commented on SPARK-20262: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17573 > AssertNotNull should throw NullPointerException > --- > > Key: SPARK-20262 > URL: https://issues.apache.org/jira/browse/SPARK-20262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20261) EventLoggingListener may not truly flush the logger when a compression codec is used
Brian Cho created SPARK-20261: - Summary: EventLoggingListener may not truly flush the logger when a compression codec is used Key: SPARK-20261 URL: https://issues.apache.org/jira/browse/SPARK-20261 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Brian Cho Priority: Minor Log events with flushLogger set to true are supposed to immediately flush to update the event history. However, this does not happen when using some compression codecs e.g., LZ4BlockOutputStream, because the compressed stream can hold on to the update until the compression block is filled. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
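To make the failure mode concrete: a block-compressing stream buffers bytes until a block fills, and flush() may not force a partial block out (behavior varies by codec and library version). A standalone sketch against lz4-java directly, not Spark's wrapper (assumes lz4-java on the classpath):

{code}
import java.io.ByteArrayOutputStream
import net.jpountz.lz4.LZ4BlockOutputStream

val sink = new ByteArrayOutputStream()
val lz4 = new LZ4BlockOutputStream(sink)  // buffers into fixed-size blocks

lz4.write("event: job started\n".getBytes("UTF-8"))
lz4.flush()
// Depending on the lz4-java version, sink.size() can still be 0 here: the
// bytes sit in an unfinished block, so the event log does not yet reflect
// the "flushed" event -- exactly the symptom described above.
println(s"bytes visible downstream after flush: ${sink.size()}")
{code}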
[jira] [Assigned] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20260: Assignee: (was: Apache Spark) > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Priority: Minor > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20260: Assignee: Apache Spark > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Assignee: Apache Spark >Priority: Minor > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961491#comment-15961491 ] Apache Spark commented on SPARK-20260: -- User 'vijaykramesh' has created a pull request for this issue: https://github.com/apache/spark/pull/17572 > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Priority: Minor > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
Vijay Krishna Ramesh created SPARK-20260: Summary: MLUtils parseLibSVMRecord has incorrect string interpolation for error message Key: SPARK-20260 URL: https://issues.apache.org/jira/browse/SPARK-20260 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Vijay Krishna Ramesh Priority: Minor There is missing string interpolation for the error message, which causes it to not actually display the line that failed. See https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
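The class of bug is easy to see in isolation; the message below is illustrative, not the exact MLUtils string:

{code}
val line = "1 1:0.5 bad-token"
// Missing `s` prefix: the placeholder is printed literally.
println("failed to parse line: $line")   // failed to parse line: $line
// With the interpolator, the offending input is actually shown.
println(s"failed to parse line: $line")  // failed to parse line: 1 1:0.5 bad-token
{code}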
[jira] [Resolved] (SPARK-20255) FileIndex hierarchy inconsistency
[ https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20255. - Resolution: Fixed Assignee: Adrian Ionescu Fix Version/s: 2.2.0 > FileIndex hierarchy inconsistency > - > > Key: SPARK-20255 > URL: https://issues.apache.org/jira/browse/SPARK-20255 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Assignee: Adrian Ionescu >Priority: Minor > Fix For: 2.2.0 > > > Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the > following inconsistency: > On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and > {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements > {{listLeafFiles}} which does all the listing of files. However, the latter is > only used by {{InMemoryFileIndex}}. > I'm hereby proposing to move this method (and all its dependencies) to the > implementation class that actually uses it, and thus unclutter the > {{PartitioningAwareFileIndex}} interface. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
[ https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20258: - Fix Version/s: 2.2.0 > SparkR logistic regression example did not converge in programming guide > > > Key: SPARK-20258 > URL: https://issues.apache.org/jira/browse/SPARK-20258 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Fix For: 2.2.0 > > > SparkR logistic regression example did not converge in programming guide. All > estimates are essentially zero: > {code} > training2 <- read.df("data/mllib/sample_binary_classification_data.txt", > source = "libsvm") > df_list2 <- randomSplit(training2, c(7,3), 2) > binomialDF <- df_list2[[1]] > binomialTestDF <- df_list2[[2]] > binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") > 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to > singular covariance matrix. Retrying with Quasi-Newton solver. > > summary(binomialGLM) > Deviance Residuals: > (Note: These are approximate quantiles with relative error <= 0.01) > Min 1Q Median 3Q Max > -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 > Coefficients: > Estimate > (Intercept) 9.0255e+00 > features_0 0.0000e+00 > features_1 0.0000e+00 > features_2 0.0000e+00 > features_3 0.0000e+00 > features_4 0.0000e+00 > features_5 0.0000e+00 > features_6 0.0000e+00 > features_7 0.0000e+00 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
[ https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20258. -- Resolution: Fixed Assignee: Wayne Zhang > SparkR logistic regression example did not converge in programming guide > > > Key: SPARK-20258 > URL: https://issues.apache.org/jira/browse/SPARK-20258 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > > SparkR logistic regression example did not converge in programming guide. All > estimates are essentially zero: > {code} > training2 <- read.df("data/mllib/sample_binary_classification_data.txt", > source = "libsvm") > df_list2 <- randomSplit(training2, c(7,3), 2) > binomialDF <- df_list2[[1]] > binomialTestDF <- df_list2[[2]] > binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") > 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to > singular covariance matrix. Retrying with Quasi-Newton solver. > > summary(binomialGLM) > Deviance Residuals: > (Note: These are approximate quantiles with relative error <= 0.01) > Min 1Q Median 3Q Max > -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 > Coefficients: > Estimate > (Intercept) 9.0255e+00 > features_0 0.0000e+00 > features_1 0.0000e+00 > features_2 0.0000e+00 > features_3 0.0000e+00 > features_4 0.0000e+00 > features_5 0.0000e+00 > features_6 0.0000e+00 > features_7 0.0000e+00 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs
[ https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961390#comment-15961390 ] Charles Pritchard commented on SPARK-18934: --- Possibly fixed in https://issues.apache.org/jira/browse/SPARK-19563; it appears to be out of scope in Spark per a comment in https://issues.apache.org/jira/browse/SPARK-19352 > Writing to dynamic partitions does not preserve sort order if spill occurs > -- > > Key: SPARK-18934 > URL: https://issues.apache.org/jira/browse/SPARK-18934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Junegunn Choi > > When writing to dynamic partitions, the task sorts the input data by the > partition key (also with bucket key if used), so that it can write to one > partition at a time using a single writer. And if spill occurs during the > process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of > data. > However, the merge process only considers the partition key, so that the sort > order within a partition specified via {{sortWithinPartitions}} or {{SORT > BY}} is not preserved. > We can reproduce the problem on Spark shell. Make sure to start shell in > local mode with small driver memory (e.g. 1G) so that spills occur. > {code} > // FileFormatWriter > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").format("orc").partitionBy("part") > .saveAsTable("test_sort_within") > spark.read.table("test_sort_within").show > {code} > {noformat} > +-------+----+ > | value|part| > +-------+----+ > | 2| 0| > |8388610| 0| > | 4| 0| > |8388612| 0| > | 6| 0| > |8388614| 0| > | 8| 0| > |8388616| 0| > | 10| 0| > |8388618| 0| > | 12| 0| > |8388620| 0| > | 14| 0| > |8388622| 0| > | 16| 0| > |8388624| 0| > | 18| 0| > |8388626| 0| > | 20| 0| > |8388628| 0| > +-------+----+ > {noformat} > We can confirm the issue using orc dump. > {noformat} > > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d > > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head -20 > {"value":2} > {"value":8388610} > {"value":4} > {"value":8388612} > {"value":6} > {"value":8388614} > {"value":8} > {"value":8388616} > {"value":10} > {"value":8388618} > {"value":12} > {"value":8388620} > {"value":14} > {"value":8388622} > {"value":16} > {"value":8388624} > {"value":18} > {"value":8388626} > {"value":20} > {"value":8388628} > {noformat} > {{SparkHiveDynamicPartitionWriterContainer}} has the same problem. > {code} > // Insert into an existing Hive table with dynamic partitions > // CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) > STORED AS ORC > spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").insertInto("test_sort_within") > spark.read.table("test_sort_within").show > {code} > I was able to fix the problem by appending a numeric index column to the > sorting key which effectively makes the sort stable. I'll create a pull > request on GitHub but since I'm not really familiar with the internals of > Spark, I'm not sure if my approach is valid or idiomatic. So please let me > know if there are better ways to handle this, or if you want to address the > issue differently.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
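A user-side mitigation, in the same spirit as the stable-sort idea above, is to lead the within-partition sort with the partition key so the writer accepts the existing ordering instead of re-sorting (and spill-merging) on the partition key alone. A sketch against the repro (this relies on the writer's required-ordering check, as discussed in SPARK-19352):

{code}
import spark.implicits._

sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part)
  // "part" first, then "value": the writer sees its required ordering
  // already satisfied and keeps the secondary sort intact.
  .sortWithinPartitions("part", "value")
  .write.mode("overwrite").format("orc").partitionBy("part")
  .saveAsTable("test_sort_within")
{code}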
[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets
[ https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961389#comment-15961389 ] Charles Pritchard commented on SPARK-19352: --- [~cloud_fan] Is there something on the roadmap to get that guarantee? We need guaranteed sorting from a general performance perspective, but it's also a baseline feature of Hive (AKA: "SORT BY") to be able to sort data into a file in a partition. > Sorting issues on relatively big datasets > - > > Key: SPARK-19352 > URL: https://issues.apache.org/jira/browse/SPARK-19352 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: Spark version 2.1.0 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102 > macOS 10.12.3 >Reporter: Ivan Gozali > > _More details, including the script to generate the synthetic dataset > (requires pandas and numpy) are in this GitHub gist._ > https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f > Given a relatively large synthetic time series dataset of various users > (4.1GB), when attempting to: > * partition this dataset by user ID > * sort the time series data for each user by timestamp > * write each partition to a single CSV file > then some files are unsorted in a very specific manner. In one of the > supposedly sorted files, the rows looked as follows: > {code} > 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39 > 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22 > 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47 > 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14 > 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24 > 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43 > {code} > The above is attempted using the following Scala/Spark code: > {code} > val inpth = "/tmp/gen_data_3cols_small" > spark > .read > .option("inferSchema", "true") > .option("header", "true") > .csv(inpth) > .repartition($"userId") > .sortWithinPartitions("timestamp") > .write > .partitionBy("userId") > .option("header", "true") > .csv(inpth + "_sorted") > {code} > This issue is not seen when using a smaller sized dataset by making the time > span smaller (354MB, with the same number of columns). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
John Muller created SPARK-20259: --- Summary: Support push down join optimizations in DataFrameReader when loading from JDBC Key: SPARK-20259 URL: https://issues.apache.org/jira/browse/SPARK-20259 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0, 1.6.2 Reporter: John Muller Priority: Minor Given two dataframes loaded from the same JDBC connection: {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} val ordersDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "northwind.orders") .option("user", "username") .option("password", "password") .load().toDS val productDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "northwind.product") .option("user", "username") .option("password", "password") .load().toDS ordersDF.createOrReplaceTempView("orders") productDF.createOrReplaceTempView("product") // Followed by a join between them: val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") {code} Catalyst should optimize the query to be: SELECT northwind.product.name, SUM(northwind.orders.qty) FROM northwind.orders INNER JOIN northwind.product ON northwind.orders.product_id = northwind.product.product_id GROUP BY p.name -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
[ https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20258: Assignee: Apache Spark > SparkR logistic regression example did not converge in programming guide > > > Key: SPARK-20258 > URL: https://issues.apache.org/jira/browse/SPARK-20258 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang >Assignee: Apache Spark > > SparkR logistic regression example did not converge in programming guide. All > estimates are essentially zero: > {code} > training2 <- read.df("data/mllib/sample_binary_classification_data.txt", > source = "libsvm") > df_list2 <- randomSplit(training2, c(7,3), 2) > binomialDF <- df_list2[[1]] > binomialTestDF <- df_list2[[2]] > binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") > 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to > singular covariance matrix. Retrying with Quasi-Newton solver. > > summary(binomialGLM) > Deviance Residuals: > (Note: These are approximate quantiles with relative error <= 0.01) > Min 1Q Median 3Q Max > -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 > Coefficients: > Estimate > (Intercept) 9.0255e+00 > features_0 0.0000e+00 > features_1 0.0000e+00 > features_2 0.0000e+00 > features_3 0.0000e+00 > features_4 0.0000e+00 > features_5 0.0000e+00 > features_6 0.0000e+00 > features_7 0.0000e+00 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
[ https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961274#comment-15961274 ] Felix Cheung commented on SPARK-20258: -- Thanks! > SparkR logistic regression example did not converge in programming guide > > > Key: SPARK-20258 > URL: https://issues.apache.org/jira/browse/SPARK-20258 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > SparkR logistic regression example did not converge in programming guide. All > estimates are essentially zero: > {code} > training2 <- read.df("data/mllib/sample_binary_classification_data.txt", > source = "libsvm") > df_list2 <- randomSplit(training2, c(7,3), 2) > binomialDF <- df_list2[[1]] > binomialTestDF <- df_list2[[2]] > binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") > 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to > singular covariance matrix. Retrying with Quasi-Newton solver. > > summary(binomialGLM) > Deviance Residuals: > (Note: These are approximate quantiles with relative error <= 0.01) > Min 1Q Median 3Q Max > -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 > Coefficients: > Estimate > (Intercept) 9.0255e+00 > features_0 0.0000e+00 > features_1 0.0000e+00 > features_2 0.0000e+00 > features_3 0.0000e+00 > features_4 0.0000e+00 > features_5 0.0000e+00 > features_6 0.0000e+00 > features_7 0.0000e+00 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
[ https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961273#comment-15961273 ] Apache Spark commented on SPARK-20258: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/17571 > SparkR logistic regression example did not converge in programming guide > > > Key: SPARK-20258 > URL: https://issues.apache.org/jira/browse/SPARK-20258 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > SparkR logistic regression example did not converge in programming guide. All > estimates are essentially zero: > {code} > training2 <- read.df("data/mllib/sample_binary_classification_data.txt", > source = "libsvm") > df_list2 <- randomSplit(training2, c(7,3), 2) > binomialDF <- df_list2[[1]] > binomialTestDF <- df_list2[[2]] > binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") > 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to > singular covariance matrix. Retrying with Quasi-Newton solver. > > summary(binomialGLM) > Deviance Residuals: > (Note: These are approximate quantiles with relative error <= 0.01) > Min 1Q Median 3Q Max > -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 > Coefficients: > Estimate > (Intercept) 9.0255e+00 > features_0 0.0000e+00 > features_1 0.0000e+00 > features_2 0.0000e+00 > features_3 0.0000e+00 > features_4 0.0000e+00 > features_5 0.0000e+00 > features_6 0.0000e+00 > features_7 0.0000e+00 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
[ https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20258: Assignee: (was: Apache Spark) > SparkR logistic regression example did not converge in programming guide > > > Key: SPARK-20258 > URL: https://issues.apache.org/jira/browse/SPARK-20258 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > SparkR logistic regression example did not converge in programming guide. All > estimates are essentially zero: > {code} > training2 <- read.df("data/mllib/sample_binary_classification_data.txt", > source = "libsvm") > df_list2 <- randomSplit(training2, c(7,3), 2) > binomialDF <- df_list2[[1]] > binomialTestDF <- df_list2[[2]] > binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") > 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to > singular covariance matrix. Retrying with Quasi-Newton solver. > > summary(binomialGLM) > Deviance Residuals: > (Note: These are approximate quantiles with relative error <= 0.01) > Min 1Q Median 3Q Max > -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 > Coefficients: > Estimate > (Intercept) 9.0255e+00 > features_0 0.0000e+00 > features_1 0.0000e+00 > features_2 0.0000e+00 > features_3 0.0000e+00 > features_4 0.0000e+00 > features_5 0.0000e+00 > features_6 0.0000e+00 > features_7 0.0000e+00 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20258) SparkR logistic regression example did not converge in programming guide
Wayne Zhang created SPARK-20258: --- Summary: SparkR logistic regression example did not converge in programming guide Key: SPARK-20258 URL: https://issues.apache.org/jira/browse/SPARK-20258 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Wayne Zhang SparkR logistic regression example did not converge in programming guide. All estimates are essentially zero: {code} training2 <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm") df_list2 <- randomSplit(training2, c(7,3), 2) binomialDF <- df_list2[[1]] binomialTestDF <- df_list2[[2]] binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial") 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver. > summary(binomialGLM) Deviance Residuals: (Note: These are approximate quantiles with relative error <= 0.01) Min 1Q Median 3Q Max -2.4828e-06 -2.4063e-06 2.2778e-06 2.4350e-06 2.7722e-06 Coefficients: Estimate (Intercept) 9.0255e+00 features_0 0.0000e+00 features_1 0.0000e+00 features_2 0.0000e+00 features_3 0.0000e+00 features_4 0.0000e+00 features_5 0.0000e+00 features_6 0.0000e+00 features_7 0.0000e+00 {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20257) Fix test for directory created to work when running as R CMD check
Felix Cheung created SPARK-20257: Summary: Fix test for directory created to work when running as R CMD check Key: SPARK-20257 URL: https://issues.apache.org/jira/browse/SPARK-20257 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20257) Fix test for directory created to work when running as R CMD check
[ https://issues.apache.org/jira/browse/SPARK-20257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961221#comment-15961221 ] Felix Cheung commented on SPARK-20257: -- Please see the PR https://github.com/apache/spark/pull/17516 for more information. This is the test: https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3058 "No extra files are created in SPARK_HOME by starting session and making calls" > Fix test for directory created to work when running as R CMD check > -- > > Key: SPARK-20257 > URL: https://issues.apache.org/jira/browse/SPARK-20257 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20197) CRAN check fail with package installation
[ https://issues.apache.org/jira/browse/SPARK-20197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20197: - Target Version/s: 2.1.1, 2.2.0 Fix Version/s: 2.2.0 2.1.1 > CRAN check fail with package installation > -- > > Key: SPARK-20197 > URL: https://issues.apache.org/jira/browse/SPARK-20197 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.1.1 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > Failed > - > 1. Failure: No extra files are created in SPARK_HOME by starting session > and making calls (@test_sparkSQL.R#2858) > length(sparkRFilesBefore) > 0 isn't true. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961193#comment-15961193 ] Xin Wu commented on SPARK-20256: I am working on a fix and creating simulated test cases for this issue. > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) > ); > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at >
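A common mitigation while this is open (an assumption of this note, not something stated in the ticket) is to point the session at a warehouse directory the user can actually write, so session startup is less likely to touch the Hive-owned path; the path below is hypothetical:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-enabled-session")
  // spark.sql.warehouse.dir takes precedence over hive.metastore.warehouse.dir
  .config("spark.sql.warehouse.dir", "/user/notebook/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()
{code}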
[jira] [Updated] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-20256: --- Description: In a cluster setup with production Hive running, when the user wants to run spark-shell using the production Hive metastore, hive-site.xml is copied to SPARK_HOME/conf. So when spark-shell is being started, it tries to check database existence of "default" database from Hive metastore. Yet, since this user may not have READ/WRITE access to the configured Hive warehouse directory done by Hive itself, such permission error will prevent spark-shell or any spark application with Hive support enabled from starting at all. Example error: {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) ... 
47 elided Caused by: java.lang.reflect.InvocationTargetException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) ); at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978) ... 58 more Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) at
[jira] [Resolved] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20026. -- Resolution: Fixed Assignee: Wayne Zhang Fix Version/s: 2.2.0 > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Wayne Zhang > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
Xin Wu created SPARK-20256: -- Summary: Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir Key: SPARK-20256 URL: https://issues.apache.org/jira/browse/SPARK-20256 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.1.1, 2.2.0 Reporter: Xin Wu Priority: Critical In a cluster setup with a production Hive deployment, when a user wants to run spark-shell against the production Hive metastore, hive-site.xml is copied to SPARK_HOME/conf. When spark-shell starts, it therefore checks whether the "default" database exists in the Hive metastore. Since the user may not have READ/WRITE access to the Hive warehouse directory configured by Hive itself, the resulting permission error prevents spark-shell, or any Spark application with Hive support enabled, from starting at all. Example error: {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) ... 
47 elided Caused by: java.lang.reflect.InvocationTargetException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) ); at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978) ... 58 more Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException:
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961180#comment-15961180 ] Paul Zaczkieiwcz commented on SPARK-18055: -- [~marmbrus]: I ran into this issue when using a custom org.apache.spark.sql.expressions.Aggregator in Spark 2.0.2. {code:java} val grouped = df.groupByKey(s => CookieId(s.cookie_id)).agg(aggregator.toColumn) {code} I got a very similar {{scala.ScalaReflectionException}}, which is how I found this ticket. Is there an easy way around this short of either converting my brand-new {{Aggregator}} into a {{UserDefinedAggregateFunction}} or installing a patched version of Spark on my cluster? > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on a Dataset column which is of type > com.A.B > Here's the schema of the dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > | |-- element: string > {code} > flatMap works on the RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesn't work on the Dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
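For reference, the kind of custom {{Aggregator}} the comment describes looks roughly like the sketch below; the case classes, field names, and aggregation logic are illustrative assumptions, not code from this ticket.
{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input type and grouping key, mirroring the names in the comment.
case class Session(cookie_id: String, hits: Long)
case class CookieId(id: String)

// A minimal Aggregator that sums hits per cookie; bufferEncoder and
// outputEncoder are required by the Aggregator contract in Spark 2.x.
object HitCount extends Aggregator[Session, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, s: Session): Long = acc + s.hits
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Usage, as in the comment above:
// ds.groupByKey(s => CookieId(s.cookie_id)).agg(HitCount.toColumn)
{code}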
[jira] [Assigned] (SPARK-20255) FileIndex hierarchy inconsistency
[ https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20255: Assignee: Apache Spark > FileIndex hierarchy inconsistency > - > > Key: SPARK-20255 > URL: https://issues.apache.org/jira/browse/SPARK-20255 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Assignee: Apache Spark >Priority: Minor > > Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the > following inconsistency: > On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and > {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements > {{listLeafFiles}} which does all the listing of files. However, the latter is > only used by {{InMemoryFileIndex}}. > I'm hereby proposing to move this method (and all its dependencies) to the > implementation class that actually uses it, and thus unclutter the > {{PartitioningAwareFileIndex}} interface. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20255) FileIndex hierarchy inconsistency
[ https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961173#comment-15961173 ] Apache Spark commented on SPARK-20255: -- User 'adrian-ionescu' has created a pull request for this issue: https://github.com/apache/spark/pull/17570 > FileIndex hierarchy inconsistency > - > > Key: SPARK-20255 > URL: https://issues.apache.org/jira/browse/SPARK-20255 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Priority: Minor > > Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the > following inconsistency: > On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and > {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements > {{listLeafFiles}} which does all the listing of files. However, the latter is > only used by {{InMemoryFileIndex}}. > I'm hereby proposing to move this method (and all its dependencies) to the > implementation class that actually uses it, and thus unclutter the > {{PartitioningAwareFileIndex}} interface. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20255) FileIndex hierarchy inconsistency
[ https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20255: Assignee: (was: Apache Spark) > FileIndex hierarchy inconsistency > - > > Key: SPARK-20255 > URL: https://issues.apache.org/jira/browse/SPARK-20255 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Priority: Minor > > Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the > following inconsistency: > On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and > {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements > {{listLeafFiles}} which does all the listing of files. However, the latter is > only used by {{InMemoryFileIndex}}. > I'm hereby proposing to move this method (and all its dependencies) to the > implementation class that actually uses it, and thus unclutter the > {{PartitioningAwareFileIndex}} interface. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961171#comment-15961171 ] Apache Spark commented on SPARK-20253: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/17569 > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > While we know that several Spark runtime routines never return null (e.g. > {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always > checks whether the return value is null. > Removing these null checks would reduce both the Java bytecode size and the > resulting native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
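To illustrate the shape of the problem, a hand-written Scala sketch (not actual Catalyst output): the ticket's point is that calls like {{UnsafeArrayData.toDoubleArray()}} are provably non-null, so the generated guard and its null branch are dead code.
{code}
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

// fromPrimitiveArray and toDoubleArray never return null, so the null check
// below is provably false and only inflates the generated method body.
val unsafeArray: UnsafeArrayData = UnsafeArrayData.fromPrimitiveArray(Array(1.1, 2.2))
val values: Array[Double] = unsafeArray.toDoubleArray()
val result: Array[Double] = if (values == null) null else values // dead branch
{code}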
[jira] [Assigned] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20253: Assignee: (was: Apache Spark) > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > While we know that several Spark runtime routines never return null (e.g. > {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always > checks whether the return value is null. > Removing these null checks would reduce both the Java bytecode size and the > resulting native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20253: Assignee: Apache Spark > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > While we know that several Spark runtime routines never return null (e.g. > {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always > checks whether the return value is null. > Removing these null checks would reduce both the Java bytecode size and the > resulting native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961167#comment-15961167 ] Apache Spark commented on SPARK-20254: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/17569 > SPARK-19716 generates unnecessary data conversion for Dataset with primitive > array > -- > > Key: SPARK-20254 > URL: https://issues.apache.org/jira/browse/SPARK-20254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the > current implementation generates {{mapobjects()}} at {{DeserializeToObject}} > in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to > store an array into {{GenericArrayData}}. > cc: [~cloud_fan] > > {code} > val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache > ds.count > val ds2 = ds.map(e => e) > ds2.explain(true) > ds2.show > {code} > Plans before SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(upcast(getcolumnbyordinal(0, > ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: > "scala.Array").toDoubleArray), obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject cast(value#2 as array).toDoubleArray, > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Optimized Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] >+- Scan ExternalRDDScan[obj#1] > == Physical Plan == > *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- *MapElements , obj#24: [D >+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- *InMemoryTableScan [value#2] > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, > memory, 
deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- Scan ExternalRDDScan[obj#1] > {code} > Plans after SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(unresolvedmapobjects(, > getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961170#comment-15961170 ] Andrew Ash commented on SPARK-20144: This is a regression from 1.6 to the 2.x line. [~marmbrus] recommended modifying {{spark.sql.files.openCostInBytes}} as a workaround in this post: http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-within-partitions-is-not-maintained-in-parquet-td18618.html#a18627 > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > that when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was produced from. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so I'm not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
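A sketch of the workaround referenced above, assuming an active session named {{spark}}; the chosen value and path are illustrative and would need tuning per workload.
{code}
// Raising spark.sql.files.openCostInBytes makes the planner treat each file
// as more expensive to open, so it packs fewer files into one partition and
// is less inclined to combine and reorder them. 128 MB here is an assumption.
spark.conf.set("spark.sql.files.openCostInBytes", (128L * 1024 * 1024).toString)
val df = spark.read.parquet("/path/to/data") // hypothetical path
{code}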
[jira] [Created] (SPARK-20255) FileIndex hierarchy inconsistency
Adrian Ionescu created SPARK-20255: -- Summary: FileIndex hierarchy inconsistency Key: SPARK-20255 URL: https://issues.apache.org/jira/browse/SPARK-20255 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Adrian Ionescu Priority: Minor Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the following inconsistency: On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements {{listLeafFiles}} which does all the listing of files. However, the latter is only used by {{InMemoryFileIndex}}. I'm hereby proposing to move this method (and all its dependencies) to the implementation class that actually uses it, and thus unclutter the {{PartitioningAwareFileIndex}} interface. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
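A minimal sketch of the proposed shape; only the class and method names come from the ticket, while the signatures and bodies are simplified assumptions.
{code}
import org.apache.hadoop.fs.{FileStatus, Path}

// The base class keeps only the abstract listing contract...
abstract class PartitioningAwareFileIndex {
  protected def leafFiles: Map[Path, FileStatus]
  protected def leafDirToChildrenFiles: Map[Path, Array[FileStatus]]
}

// ...while the concrete listing logic moves into the only class that uses it.
class InMemoryFileIndex(rootPaths: Seq[Path]) extends PartitioningAwareFileIndex {
  // Actual HDFS listing elided in this sketch.
  private def listLeafFiles(paths: Seq[Path]): Seq[FileStatus] = ???

  override protected lazy val leafFiles: Map[Path, FileStatus] =
    listLeafFiles(rootPaths).map(f => f.getPath -> f).toMap

  override protected lazy val leafDirToChildrenFiles: Map[Path, Array[FileStatus]] =
    listLeafFiles(rootPaths).groupBy(_.getPath.getParent).mapValues(_.toArray)
}
{code}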
[jira] [Assigned] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20254: Assignee: (was: Apache Spark) > SPARK-19716 generates unnecessary data conversion for Dataset with primitive > array > -- > > Key: SPARK-20254 > URL: https://issues.apache.org/jira/browse/SPARK-20254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the > current implementation generates {{mapobjects()}} at {{DeserializeToObject}} > in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to > store an array into {{GenericArrayData}}. > cc: [~cloud_fan] > > {code} > val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache > ds.count > val ds2 = ds.map(e => e) > ds2.explain(true) > ds2.show > {code} > Plans before SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(upcast(getcolumnbyordinal(0, > ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: > "scala.Array").toDoubleArray), obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject cast(value#2 as array).toDoubleArray, > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Optimized Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] >+- Scan ExternalRDDScan[obj#1] > == Physical Plan == > *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- *MapElements , obj#24: [D >+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- *InMemoryTableScan [value#2] > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- Scan ExternalRDDScan[obj#1] > {code} > Plans after SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(unresolvedmapobjects(, > getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class >
[jira] [Assigned] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20254: Assignee: Apache Spark > SPARK-19716 generates unnecessary data conversion for Dataset with primitive > array > -- > > Key: SPARK-20254 > URL: https://issues.apache.org/jira/browse/SPARK-20254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the > current implementation generates {{mapobjects()}} at {{DeserializeToObject}} > in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to > store an array into {{GenericArrayData}}. > cc: [~cloud_fan] > > {code} > val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache > ds.count > val ds2 = ds.map(e => e) > ds2.explain(true) > ds2.show > {code} > Plans before SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(upcast(getcolumnbyordinal(0, > ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: > "scala.Array").toDoubleArray), obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject cast(value#2 as array).toDoubleArray, > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Optimized Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] >+- Scan ExternalRDDScan[obj#1] > == Physical Plan == > *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- *MapElements , obj#24: [D >+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- *InMemoryTableScan [value#2] > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- Scan ExternalRDDScan[obj#1] > {code} > Plans after SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(unresolvedmapobjects(, > getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class
[jira] [Commented] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961126#comment-15961126 ] Apache Spark commented on SPARK-20254: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/17568 > SPARK-19716 generates unnecessary data conversion for Dataset with primitive > array > -- > > Key: SPARK-20254 > URL: https://issues.apache.org/jira/browse/SPARK-20254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the > current implementation generates {{mapobjects()}} at {{DeserializeToObject}} > in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to > store an array into {{GenericArrayData}}. > cc: [~cloud_fan] > > {code} > val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache > ds.count > val ds2 = ds.map(e => e) > ds2.explain(true) > ds2.show > {code} > Plans before SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(upcast(getcolumnbyordinal(0, > ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: > "scala.Array").toDoubleArray), obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject cast(value#2 as array).toDoubleArray, > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Optimized Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] >+- Scan ExternalRDDScan[obj#1] > == Physical Plan == > *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- *MapElements , obj#24: [D >+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- *InMemoryTableScan [value#2] > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, > memory, 
deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- Scan ExternalRDDScan[obj#1] > {code} > Plans after SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(unresolvedmapobjects(, > getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed
[jira] [Updated] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20254: - Summary: SPARK-19716 generates unnecessary data conversion for Dataset with primitive array (was: SPARK-19716 generates inefficient Java code from a primitive array of Dataset) > SPARK-19716 generates unnecessary data conversion for Dataset with primitive > array > -- > > Key: SPARK-20254 > URL: https://issues.apache.org/jira/browse/SPARK-20254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the > current implementation generates {{mapobjects()}} at {{DeserializeToObject}} > in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to > store an array into {{GenericArrayData}}. > cc: [~cloud_fan] > > {code} > val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache > ds.count > val ds2 = ds.map(e => e) > ds2.explain(true) > ds2.show > {code} > Plans before SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(upcast(getcolumnbyordinal(0, > ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: > "scala.Array").toDoubleArray), obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject cast(value#2 as array).toDoubleArray, > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Optimized Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] >+- Scan ExternalRDDScan[obj#1] > == Physical Plan == > *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- *MapElements , obj#24: [D >+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- *InMemoryTableScan [value#2] > +- InMemoryRelation 
[value#2], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- Scan ExternalRDDScan[obj#1] > {code} > Plans after SPARK-19716 > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(unresolvedmapobjects(, > getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2]
[jira] [Issue Comment Deleted] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20243: -- Comment: was deleted (was: This needs detail to be a JIRA.) > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > > Introduced by SPARK-19946. > DebugFilesystem.assertNoOpenStreams gets the size of the openStreams > ConcurrentHashMap and then later, if the size was > 0, accesses the first > element in openStreams.values. But, the ConcurrentHashMap might be cleared by > another thread between getting its size and accessing it, resulting in an > exception when trying to call .head on an empty collection. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
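One race-free shape for this check, as an illustrative sketch rather than the actual fix: snapshot the map's values once, then derive both the emptiness test and the reported stream from that single snapshot.
{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// openStreams maps a stream key to the Throwable recording where it was
// opened; another thread may clear the map at any time.
def assertNoOpenStreams(openStreams: ConcurrentHashMap[String, Throwable]): Unit = {
  val snapshot = openStreams.values().asScala.toList // one consistent read
  if (snapshot.nonEmpty) {
    throw new IllegalStateException(
      s"There are ${snapshot.size} possibly leaked file streams.", snapshot.head)
  }
}
{code}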
[jira] [Commented] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
[ https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961110#comment-15961110 ] Apache Spark commented on SPARK-19991: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/17567 > FileSegmentManagedBuffer performance improvement. > - > > Key: SPARK-19991 > URL: https://issues.apache.org/jira/browse/SPARK-19991 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.0.2, 2.1.0 >Reporter: Guoqiang Li >Priority: Minor > > When we do not set the values of the configuration items > {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, > each call to the FileSegmentManagedBuffer.nioByteBuffer or > FileSegmentManagedBuffer.createInputStream method creates a > NoSuchElementException instance. Creating the exception, which fills in a > stack trace, is a time-consuming operation. > The shuffle-server thread's stack: > {noformat} > "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 > nid=0x28d12 runnable [0x7f71af93e000] >java.lang.Thread.State: RUNNABLE > at java.lang.Throwable.fillInStackTrace(Native Method) > at java.lang.Throwable.fillInStackTrace(Throwable.java:783) > - locked <0x0007a930f080> (a java.util.NoSuchElementException) > at java.lang.Throwable.(Throwable.java:265) > at java.lang.Exception.(Exception.java:66) > at java.lang.RuntimeException.(RuntimeException.java:62) > at > java.util.NoSuchElementException.(NoSuchElementException.java:57) > at > org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) > at > org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) > at > org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) > at > org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) > at > org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) > at > org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) > at > 
org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) > at > org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) > at > org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) > at > org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at >
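The general fix pattern here, sketched under assumed names (this is not the actual patch): resolve both configuration values once and reuse them, instead of letting every shuffle fetch construct, throw, and catch a NoSuchElementException.
{code}
// Hypothetical config wrapper: defaults are resolved once at construction,
// so the per-fetch hot path never creates an exception or fills in a stack.
class CachedTransportConf(conf: Map[String, String]) {
  private val lazyFD: Boolean =
    conf.getOrElse("spark.shuffle.io.lazyFD", "true").toBoolean
  private val memoryMapBytes: Long =
    conf.getOrElse("spark.storage.memoryMapThreshold", (2L * 1024 * 1024).toString).toLong

  def lazyFileDescriptor(): Boolean = lazyFD
  def memoryMapThreshold(): Long = memoryMapBytes
}
{code}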
[jira] [Commented] (SPARK-20219) Schedule tasks based on size of input from ShuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961104#comment-15961104 ] Imran Rashid commented on SPARK-20219: -- I'm not saying I don't think this is a good proposal. I'm just trying to be realistic about this change vs. other things in the works, and to try to set priorities and expectations. I'll take another quick look at the PR and see if I can think of another way to do this. > Schedule tasks based on size of input from ShuffledRDD > --- > > Key: SPARK-20219 > URL: https://issues.apache.org/jira/browse/SPARK-20219 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: screenshot-1.png > > > When data is highly skewed in a ShuffledRDD, it makes sense to launch the > tasks that process much more input as soon as possible. The current > scheduling mechanism in *TaskSetManager* is quite simple: > {code} > for (i <- (0 until numTasks).reverse) { > addPendingTask(i) > } > {code} > In the scenario where "large tasks" sit in the bottom half of the task array, > launching the tasks with much more input early can significantly reduce the > time cost and save resources when *"dynamic allocation"* is disabled. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
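A minimal sketch of the proposed ordering, assuming a hypothetical per-task input-size lookup and the existing LIFO dequeue of pending tasks (the last task added is offered first):
{code}
// taskInputSizes(i) is an assumed lookup of task i's input bytes; numTasks
// and addPendingTask are as in the TaskSetManager snippet above. Enqueueing
// in ascending size means the largest tasks are dequeued, and launched, first.
val byAscendingInput: Seq[Int] = (0 until numTasks).sortBy(i => taskInputSizes(i))
for (i <- byAscendingInput) {
  addPendingTask(i)
}
{code}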
[jira] [Updated] (SPARK-20254) SPARK-19716 generates inefficient Java code from a primitive array of Dataset
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20254: - Description: Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the current implementation generates {{mapobjects()}} at {{DeserializeToObject}} in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to store an array into {{GenericArrayData}}. cc: [~cloud_fan] {code} val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache ds.count val ds2 = ds.map(e => e) ds2.explain(true) ds2.show {code} Plans before SPARK-19716 {code} == Parsed Logical Plan == 'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- 'MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Analyzed Logical Plan == value: array SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject cast(value#2 as array).toDoubleArray, obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Optimized Logical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject value#2.toDoubleArray, obj#23: [D +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- Scan ExternalRDDScan[obj#1] == Physical Plan == *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- *MapElements , obj#24: [D +- *DeserializeToObject value#2.toDoubleArray, obj#23: [D +- *InMemoryTableScan [value#2] +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- Scan ExternalRDDScan[obj#1] {code} Plans after SPARK-19716 {code} == Parsed Logical Plan == 'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- 'MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- 'DeserializeToObject unresolveddeserializer(unresolvedmapobjects(, getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Analyzed Logical Plan == value: array SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject mapobjects(MapObjects_loopValue4, MapObjects_loopIsNull4, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue4, MapObjects_loopIsNull4, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue4).toDoubleArray, obj#23: [D +- SerializeFromObject [staticinvoke(class
[jira] [Updated] (SPARK-20254) SPARK-19716 generates inefficient Java code from a primitive array of Dataset
[ https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20254: - Summary: SPARK-19716 generates inefficient Java code from a primitive array of Dataset (was: SPARK-19716 generate inefficient Java code from a primitive array of Dataset) > SPARK-19716 generates inefficient Java code from a primitive array of Dataset > - > > Key: SPARK-20254 > URL: https://issues.apache.org/jira/browse/SPARK-20254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the > current implementation generates {{mapobjects()}} at {{DeserializeToObject}}. > This {{mapObject()}} introduces Java code to store an array into > {{GenericArrayData}}. > cc: [~cloud_fan] > > {code} > val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache > ds.count > val ds2 = ds.map(e => e) > ds2.explain(true) > ds2.show > {code} > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(upcast(getcolumnbyordinal(0, > ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: > "scala.Array").toDoubleArray), obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject cast(value#2 as array).toDoubleArray, > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Optimized Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] >+- Scan ExternalRDDScan[obj#1] > == Physical Plan == > *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- *MapElements , obj#24: [D >+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D > +- *InMemoryTableScan [value#2] > +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, > memory, deserialized, 1 
replicas) > +- *SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- Scan ExternalRDDScan[obj#1] > {code} > {code} > == Parsed Logical Plan == > 'SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#25] > +- 'MapElements , class [D, > [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D >+- 'DeserializeToObject > unresolveddeserializer(unresolvedmapobjects(, > getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), > obj#23: [D > +- SerializeFromObject [staticinvoke(class > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, > ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS > value#2] > +- ExternalRDD [obj#1] > == Analyzed Logical Plan == > value: array > SerializeFromObject
[jira] [Created] (SPARK-20254) SPARK-19716 generate inefficient Java code from a primitive array of Dataset
Kazuaki Ishizaki created SPARK-20254: Summary: SPARK-19716 generate inefficient Java code from a primitive array of Dataset Key: SPARK-20254 URL: https://issues.apache.org/jira/browse/SPARK-20254 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kazuaki Ishizaki Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the current implementation generates {{mapobjects()}} at {{DeserializeToObject}}. This {{mapObject()}} introduces Java code to store an array into {{GenericArrayData}}. cc: [~cloud_fan] {code} val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache ds.count val ds2 = ds.map(e => e) ds2.explain(true) ds2.show {code} {code} == Parsed Logical Plan == 'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- 'MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Analyzed Logical Plan == value: array SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject cast(value#2 as array).toDoubleArray, obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Optimized Logical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject value#2.toDoubleArray, obj#23: [D +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- Scan ExternalRDDScan[obj#1] == Physical Plan == *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- *MapElements , obj#24: [D +- *DeserializeToObject value#2.toDoubleArray, obj#23: [D +- *InMemoryTableScan [value#2] +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- Scan ExternalRDDScan[obj#1] {code} {code} == Parsed Logical Plan == 'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25] +- 'MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- 'DeserializeToObject unresolveddeserializer(unresolvedmapobjects(, getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Analyzed Logical Plan == value: array SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements , class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject mapobjects(MapObjects_loopValue4, MapObjects_loopIsNull4, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue4, MapObjects_loopIsNull4, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None,
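The extra copy described above can be confirmed by printing the generated Java code. A minimal sketch, assuming a spark-shell session (a SparkSession named {{spark}} with its implicits available); {{debugCodegen()}} comes from {{org.apache.spark.sql.execution.debug}}:
{code}
// Sketch: inspect the whole-stage-generated Java for the map over a cached
// Dataset of primitive arrays, following the repro in the ticket description.
import org.apache.spark.sql.execution.debug._
import spark.implicits._

val ds = spark.sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
ds.count
val ds2 = ds.map(e => e)

// With the behavior described above, the deserializer shows up as
// mapobjects(...) copying elements through GenericArrayData, rather than
// a direct value#2.toDoubleArray call.
ds2.debugCodegen()
{code}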
[jira] [Assigned] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19518: Assignee: Apache Spark > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi >Assignee: Apache Spark > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark 2; > however, it does not work in SQL statements, as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19518: Assignee: (was: Apache Spark) > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark 2; > however, it does not work in SQL statements, as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960985#comment-15960985 ] Apache Spark commented on SPARK-19518: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17566 > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark 2; > however, it does not work in SQL statements, as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
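For context on what works today versus what this ticket asks for: SPARK-13049 exposed the ignore-nulls behavior through the DataFrame API, while the SQL-level {{IGNORE NULLS}} syntax is the missing piece. A minimal sketch, assuming a DataFrame {{df}} with hypothetical columns {{k}}, {{t}}, and {{value}}:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

// Works in Spark 2.x: first() takes an ignoreNulls flag (SPARK-13049).
val w = Window.partitionBy(col("k")).orderBy(col("t"))
val firstNonNull =
  df.select(first(col("value"), ignoreNulls = true).over(w).as("first_non_null"))

// What this ticket asks for, not yet accepted by the SQL parser:
// SELECT first_value(value) IGNORE NULLS OVER (PARTITION BY k ORDER BY t) FROM tbl
{code}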
[jira] [Closed] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port
[ https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Derek Dagit closed SPARK-20181. --- Resolution: Invalid This is no longer an issue in master because the log level is already set such that the message with the stack trace does not appear. > Avoid noisy Jetty WARN log when failing to bind a port > -- > > Key: SPARK-20181 > URL: https://issues.apache.org/jira/browse/SPARK-20181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Derek Dagit >Priority: Minor > > As a user, I would like to suppress the Jetty WARN log about failing to bind > to a port already in use, so that my logs are less noisy. > Currently, Jetty code prints the stack trace of the BindException at WARN > level. In the context of starting a service on an ephemeral port, this is not > a useful warning, and it is exceedingly verbose. > {noformat} > 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED > ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: > Address already in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321) > at > org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > at > org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.spark_project.jetty.server.Server.doStart(Server.java:366) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:139) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:448) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) > at
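Independent of the resolution here, the noise can also be suppressed from the logging configuration, since Spark 2.x shades Jetty under {{org.spark_project.jetty}} (the package visible in the stack trace above). A sketch of the relevant {{conf/log4j.properties}} line; verify the shaded package name against your Spark build:
{code}
# Raise the shaded Jetty logger above WARN so the expected BindException from
# ephemeral-port probing is not printed with a full stack trace.
log4j.logger.org.spark_project.jetty=ERROR
{code}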
[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns
[ https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953 ] Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:28 PM: -- Well, you were right to ask. After further investigation, it seems the job does not hang forever, but takes much longer to finish when n is large enough. I made a mistake when I mentioned the job finishes in a sensible amount of time for n = 20. Really, it starts to hang badly from n = 13 (so 12 successive aggregations). Please see the following chart: !https://image.ibb.co/fxRL85/figure_1.png! n on x axis, time for the job to complete in seconds on y axis. The job takes as much as 1 hour and 47 minutes to complete for n = 14. For the most part, the amount of time the job takes to complete from n = 13 is spent after this message: {code} 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 0) 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1. 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster. 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on ip-172-30-0-149.ec2.internal killed by driver. 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 0) {code} Except for the master, the cluster remains idle. Right after that, the job starts again and completes. was (Author: quentin): Well, you were right to ask. After further investigation, it seems the job does not hang forever, but takes much longer to finish when n is large enough. I made a mistake when I mentioned the job finishes in a sensible amount of time for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive aggregations). Please see the following chart: !https://image.ibb.co/fxRL85/figure_1.png! n on x axis, time for the job to complete in seconds on y axis. The job takes as much as 1 hour and 47 minutes to complete for n = 14. For the most part, the amount of time the job takes to complete from n = 12 is spent after this message: {code} 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 0) 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1. 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster. 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on ip-172-30-0-149.ec2.internal killed by driver. 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 0) {code} Except for the master, the cluster remains idle. Right after that, the job starts again and completes. 
> Job hangs when joining a lot of aggregated columns > -- > > Key: SPARK-20227 > URL: https://issues.apache.org/jira/browse/SPARK-20227 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge >Reporter: Quentin Auge > > I'm trying to replace a lot of different columns in a dataframe with > aggregates of themselves, and then join the resulting dataframe. > {code} > # Create a dataframe with 1 row and 50 columns > n = 50 > df = sc.parallelize([Row(*range(n))]).toDF() > cols = df.columns > # Replace each column values with aggregated values > window = Window.partitionBy(cols[0]) > for col in cols[1:]: > df = df.withColumn(col, sum(col).over(window)) > # Join > other_df = sc.parallelize([Row(0)]).toDF() > result = other_df.join(df, on = cols[0]) > result.show() > {code} > Spark hangs forever when executing the last line. The strange thing is, it > depends on the number of columns. Spark does not hang for n = 5, 10, or 20 > columns. For n = 50 and beyond, it does. > {code} > 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because > it has been idle for 60 seconds (new desired total will be 0) > 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling > executor 1. > 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch
[jira] [Commented] (SPARK-20227) Job hangs when joining a lot of aggregated columns
[ https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960965#comment-15960965 ] Sean Owen commented on SPARK-20227: --- See https://issues.apache.org/jira/browse/SPARK-20226 for something possibly similar. It'd be useful to get a thread dump to see where the time is being spent. > Job hangs when joining a lot of aggregated columns > -- > > Key: SPARK-20227 > URL: https://issues.apache.org/jira/browse/SPARK-20227 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge >Reporter: Quentin Auge > > I'm trying to replace a lot of different columns in a dataframe with > aggregates of themselves, and then join the resulting dataframe. > {code} > # Create a dataframe with 1 row and 50 columns > n = 50 > df = sc.parallelize([Row(*range(n))]).toDF() > cols = df.columns > # Replace each column values with aggregated values > window = Window.partitionBy(cols[0]) > for col in cols[1:]: > df = df.withColumn(col, sum(col).over(window)) > # Join > other_df = sc.parallelize([Row(0)]).toDF() > result = other_df.join(df, on = cols[0]) > result.show() > {code} > Spark hangs forever when executing the last line. The strange thing is, it > depends on the number of columns. Spark does not hang for n = 5, 10, or 20 > columns. For n = 50 and beyond, it does. > {code} > 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because > it has been idle for 60 seconds (new desired total will be 0) > 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling > executor 1. > 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor > 1 from BlockManagerMaster. > 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager > BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) > 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on > ip-172-30-0-149.ec2.internal killed by driver. > 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has > been removed (new total is 0) > {code} > All executors are inactive and thus killed after 60 seconds, the master > spends some CPU on a process that hangs indefinitely, and the workers are > idle. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
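One reshaping that is often suggested for this kind of analysis blow-up is to compute every windowed aggregate in a single projection, so the analyzer resolves one {{select}} instead of n nested {{withColumn}} rewrites. A hedged sketch in Scala (the repro above is PySpark; this is an equivalent restructuring under the same assumptions, not a confirmed fix):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Assumes df and cols = df.columns as in the repro above.
val window = Window.partitionBy(col(cols.head))

// Build all aggregate columns up front and apply them in a single select(),
// instead of chaining n withColumn() calls in a loop.
val exprs = col(cols.head) +: cols.tail.map(c => sum(col(c)).over(window).as(c))
val aggregated = df.select(exprs: _*)
{code}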
[jira] [Commented] (SPARK-20227) Job hangs when joining a lot of aggregated columns
[ https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953 ] Quentin Auge commented on SPARK-20227: -- Well, you were right to ask. After further investigation, it seems the job does not hang forever, but takes much longer to finish when n is large enough. I made a mistake when I mentioned the job finished in a sensible amount of time for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive aggregations). See the following chart: !https://ibb.co/d9QrFk|n on x axis, time for the job to complete in seconds on y axis! The significant amount of time the job takes from n = 12 is spent after this message: {code} 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 0) 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1. 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster. 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on ip-172-30-0-149.ec2.internal killed by driver. 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 0) {code} Right after that, the job starts again and completes. > Job hangs when joining a lot of aggregated columns > -- > > Key: SPARK-20227 > URL: https://issues.apache.org/jira/browse/SPARK-20227 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge >Reporter: Quentin Auge > > I'm trying to replace a lot of different columns in a dataframe with > aggregates of themselves, and then join the resulting dataframe. > {code} > # Create a dataframe with 1 row and 50 columns > n = 50 > df = sc.parallelize([Row(*range(n))]).toDF() > cols = df.columns > # Replace each column values with aggregated values > window = Window.partitionBy(cols[0]) > for col in cols[1:]: > df = df.withColumn(col, sum(col).over(window)) > # Join > other_df = sc.parallelize([Row(0)]).toDF() > result = other_df.join(df, on = cols[0]) > result.show() > {code} > Spark hangs forever when executing the last line. The strange thing is, it > depends on the number of columns. Spark does not hang for n = 5, 10, or 20 > columns. For n = 50 and beyond, it does. > {code} > 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because > it has been idle for 60 seconds (new desired total will be 0) > 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling > executor 1. > 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor > 1 from BlockManagerMaster. > 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager > BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None) > 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on > ip-172-30-0-149.ec2.internal killed by driver.
> 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has > been removed (new total is 0) > {code} > All executors are inactive and thus killed after 60 seconds, the master > spends some CPU on a process that hangs indefinitely, and the workers are > idle. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
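To make the report above easy to replay, here is a self-contained Scala equivalent of the PySpark reproduction (my translation, not the reporter's code; it assumes a Spark 2.x spark-shell where {{spark}} is in scope, and the column names c0..c49 are placeholders):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum}

// One row, n columns, mirroring the PySpark reproduction in the report.
val n = 50
var df = spark.range(1).select((0 until n).map(i => lit(i).as(s"c$i")): _*)
val cols = df.columns

// Replace each column's values with an aggregate over a window on the first column.
val window = Window.partitionBy(cols.head)
for (c <- cols.tail) {
  df = df.withColumn(c, sum(col(c)).over(window))
}

// The join on the first column is the step that reportedly slows down
// dramatically once n grows past roughly 12.
val otherDf = spark.range(1).select(lit(0).as(cols.head))
val result = otherDf.join(df, Seq(cols.head))
result.show()
{code}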
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960925#comment-15960925 ] Steve Loughran commented on SPARK-2984: --- For s3a commits, HADOOP-13786 is going to be the fix. This is not some trivial "directory missing" problem; it is the fundamental issue that rename() isn't how you commit work into an eventually consistent object store whose S3 client mimics rename by copying all the data from one blob to another. See [the design document|https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer.md]. It's currently at demo quality: it appears to work, but it has not been tested at scale or with fault injection. Netflix have been using the core code, though. That works for ORC; Parquet will take a bit more work due to bits of the Spark code which explicitly look for ParquetOutputCommitter subclasses, so something will need to be done there. > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. 
> at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > -- Chen Song at >
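While those committers mature, a common stop-gap for the speculation race described in this ticket (my suggestion, not something the ticket itself prescribes) is simply to turn speculative execution off so two attempts of the same task never race on the shared _temporary path:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// With speculation disabled, a second attempt of the same task can no longer
// race the first attempt on the shared _temporary directory during commit.
val conf = new SparkConf()
  .setAppName("no-speculation-example")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}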
[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960922#comment-15960922 ] Torsten Scholak edited comment on SPARK-16784 at 4/7/17 3:01 PM: - I am having this exact problem. I need to be able to change the log settings depending on the job and/or the application. The method illustrated above, i.e. specifying spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties, spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties using spark-submit with "--files log4j.properties", does indeed NOT work. However, I was surprised to find it suggested as a solution on Stack Overflow (http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs), since it is in direct contradiction to the issue described in this ticket. [~jbacon] have you created a follow-up ticket? was (Author: tscholak): I having this exact problem. I need to be able to change the log settings depending on the job and/or the application. The method illustrated above, i.e. specifying spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties, spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties using spark-submit with "--files log4j.properties" does indeed NOT work. However, I was surprised to find it suggested as solution on stackoverflow (http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs), since it is in direct contradiction to issue described in this ticket. [~jbacon] have you created a follow-up ticket? > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.1.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960922#comment-15960922 ] Torsten Scholak commented on SPARK-16784: - I am having this exact problem. I need to be able to change the log settings depending on the job and/or the application. The method illustrated above, i.e. specifying spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties, spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties using spark-submit with "--files log4j.properties", does indeed NOT work. However, I was surprised to find it suggested as a solution on Stack Overflow (http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs), since it is in direct contradiction to the issue described in this ticket. [~jbacon] have you created a follow-up ticket? > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.1.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
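A narrow runtime workaround worth noting here (my addition; it does not resolve what this ticket asks for, since it only affects the current application's logger and not the executor log4j files shipped with the distribution):
{code}
// Adjust log verbosity for the current application at runtime.
// Valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN.
spark.sparkContext.setLogLevel("WARN")
{code}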
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960906#comment-15960906 ] Hemang Nagar commented on SPARK-2984: - Is there any work going on for this issue, or anything related to it? It seems nobody has been able to resolve it, and a lot of people, including me, are hitting it. > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > -- Chen Song at > http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html > {noformat} > I am running a Spark Streaming job that uses saveAsTextFiles to save results > into hdfs files. However, it has an exception after 20 batches > result-140631234/_temporary/0/task_201407251119__m_03 does not > exist. > {noformat} > and > {noformat} > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. > Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files. > at >
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960868#comment-15960868 ] Barry Becker commented on SPARK-20226: -- Only 11 columns. I did not want to wait for 10 or 20 minutes on each run, so I only used 11. If I went to 14 it would take over 10 minutes (or longer). I guess I could try it again with 14 columns and see how much it helps. Maybe in that case it would make a bigger difference, but even waiting a minute for such a small dataset seems too long. > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. > The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons > Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons > Number_CLEANED__" >} >}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333202079,"sparkVersion":"2.1.0", > "uid":"bucketizer_f5db4fb8120e", > "paramMap":{ > >
[jira] [Created] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
Kazuaki Ishizaki created SPARK-20253: Summary: Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code Key: SPARK-20253 URL: https://issues.apache.org/jira/browse/SPARK-20253 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Kazuaki Ishizaki While we know several Spark runtime routines never return null (e.g. {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always checks whether the return value is null or not. Removing this null check would reduce Java bytecode size and, in turn, the native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
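To make the pattern concrete, here is a hand-written Scala illustration of the redundant check (not actual Catalyst codegen output; the routine below is a stand-in for something like UnsafeArrayData.toDoubleArray()):
{code}
// A stand-in for a Spark runtime routine that provably never returns null.
def toDoubleArray(): Array[Double] = Array(1.0, 2.0)

// What the generated code effectively does today:
val value = toDoubleArray()
val isNull = value == null                // always false for such routines
val result = if (isNull) null else value  // dead branch, extra bytecode

// What it could do once the null check is elided:
val resultNoCheck = toDoubleArray()
{code}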
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960835#comment-15960835 ] Liang-Chi Hsieh commented on SPARK-20226: - How many columns are added in the above runs? I didn't see the long running time (> 10 mins at least) that you reported in the JIRA description. For big query plans, constraint propagation can hit a combinatorial explosion and block the driver for a long time, so we have the flag "spark.sql.constraintPropagation.enabled" to disable it. For relatively small query plans (which I suppose the above runs are, given the shorter running times), this flag doesn't make a significant difference. Every time you cache the table after adding a column, the query plan is fully planned at that point, so you will not hit the constraint propagation issue. > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. > The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation > Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0", > "uid":"bucketizer_6f65ca9fa813", > "paramMap":{ > "outputCol":"Summons >
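For anyone who wants to try the flag Liang-Chi mentions, a minimal sketch (my illustration; the table name is a placeholder and {{spark}} is the Spark 2.x session object):
{code}
// Disable constraint propagation during planning; on very wide or deeply
// chained plans this sidesteps a combinatorial blow-up in the optimizer.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

// Then re-run the expensive step and compare timings.
spark.catalog.cacheTable("my_table")  // "my_table" is a placeholder name
{code}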
[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960806#comment-15960806 ] Barry Becker commented on SPARK-20226: -- OK, I set the flag using sqlContext.setConf("spark.sql.constraintPropagation.enabled", "false") to be sure it's set. Here are my results. I averaged a few runs in each instance to get a more accurate reading. The variance was pretty high. {code}
cacheTable   applyPipeline   total    cache after add column   "spark.sql.constraintPropagation.enabled"
----------   -------------   ------   ----------------------   ------------------------------------------
71s          8.5s            1m 30s   no caching                "true"
53s          8.5s            1m 10s   no caching                "false"
65s          8.5s            1m 23s   no caching                nothing set (true is default I assume)
1s           8.4s            21s      caching                   nothing set
{code} As you can see, it gets a little better with constraintPropagation off, but not nearly as good as caching the dataframe before applying the pipeline. Why is caching before the pipeline such a big win, if the pipeline stages are applied linearly? > Call to sqlContext.cacheTable takes an incredibly long time in some cases > - > > Key: SPARK-20226 > URL: https://issues.apache.org/jira/browse/SPARK-20226 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: linux or windows >Reporter: Barry Becker > Labels: cache > Attachments: profile_indexer2.PNG, xyzzy.csv > > > I have a case where the call to sqlContext.cacheTable can take an arbitrarily > long time depending on the number of columns that are referenced in a > withColumn expression applied to a dataframe. > The dataset is small (20 columns 7861 rows). The sequence to reproduce is the > following: > 1) add a new column that references 8 - 14 of the columns in the dataset. >- If I add 8 columns, then the call to cacheTable is fast - like *5 > seconds* >- If I add 11 columns, then it is slow - like *60 seconds* >- and if I add 14 columns, then it basically *takes forever* - I gave up > after 10 minutes or so. > The Column expression that is added, is basically just concatenating > the columns together in a single string. If a number is concatenated on a > string (or vice versa) the number is first converted to a string. > The expression looks something like this: > {code} > `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + > `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + > `Penalty Amount` + `Interest Amount` > {code} > which we then convert to a Column expression that looks like this: > {code} > UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), > UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), > UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), > UDF('Interest Amount)) > {code} >where the UDFs are very simple functions that basically call toString > and + as needed. > 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet) > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License > Type_CLEANED__","handleInvalid":"skip","outputCol":"License > Type_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing > Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code} > - > {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation >
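Barry's fast row in the table above corresponds to materializing the DataFrame before the pipeline runs, so each stage starts from an in-memory relation rather than re-planning the long withColumn expression chain. A hedged sketch of that variant (dfWithNewColumn and the path are placeholder names, not code from this ticket):
{code}
import org.apache.spark.ml.PipelineModel

// dfWithNewColumn: the DataFrame from step 1 (placeholder name).
// Materialize it first so the pipeline transforms a cached relation.
val cached = dfWithNewColumn.cache()
cached.count()  // force materialization

// The saved pipeline then transforms an already-analyzed relation.
val pipeline = PipelineModel.load("/path/to/saved/pipeline")  // placeholder path
val transformed = pipeline.transform(cached)
{code}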
[jira] [Commented] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
[ https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960762#comment-15960762 ] Peter Mead commented on SPARK-20252: I'm not sure how this explains why it works the first (and every) time if the Spark context is not changed. There must be a discrepancy between the way DSE creates the Spark context the first time through and the way I create it after sc.stop(). > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > --- > > Key: SPARK-20252 > URL: https://issues.apache.org/jira/browse/SPARK-20252 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.3 > Environment: Datastax DSE dual node SPARK cluster > [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native > protocol v4] >Reporter: Peter Mead > > After starting a spark shell using DSE -u -p x spark > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable("hcl","videos_by_actor") > vids: > com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] > = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15 > scala> vids.count > res0: Long = 114961 > Works OK!! > BUT if the spark context is stopped and recreated THEN: > scala> sc.stop() > scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, > org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import org.apache.spark.SparkConf > scala> :paste > // Entering paste mode (ctrl-D to finish) > val conf = new SparkConf(true) > .set("spark.cassandra.connection.host", "redacted") > .set("spark.cassandra.auth.username", "redacted") > .set("spark.cassandra.auth.password", "redacted") > // Exiting paste mode, now interpreting. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342 > scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf) > sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8 > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> vids.count > [Stage 0:> (0 + 2) / > 2]WARN 2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: > Lost task 0.0 in stage 0.0 (TID 0, cassandra183): > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > FAILS!! > I have been unable to get this to work from a remote SPARK shell! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
[ https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20252. --- Resolution: Duplicate This is basically a limitation of how the shell and classloaders work. Simpler variants on this work, as do compiled apps. (Also, it's not necessarily going to work to stop and restart the context.) > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > --- > > Key: SPARK-20252 > URL: https://issues.apache.org/jira/browse/SPARK-20252 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.3 > Environment: Datastax DSE dual node SPARK cluster > [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native > protocol v4] >Reporter: Peter Mead > > After starting a spark shell using DSE -u -p x spark > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable("hcl","videos_by_actor") > vids: > com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] > = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15 > scala> vids.count > res0: Long = 114961 > Works OK!! > BUT if the spark context is stopped and recreated THEN: > scala> sc.stop() > scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, > org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import org.apache.spark.SparkConf > scala> :paste > // Entering paste mode (ctrl-D to finish) > val conf = new SparkConf(true) > .set("spark.cassandra.connection.host", "redacted") > .set("spark.cassandra.auth.username", "redacted") > .set("spark.cassandra.auth.password", "redacted") > // Exiting paste mode, now interpreting. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342 > scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf) > sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8 > scala> case class movie_row (actor: String, character_name: String, video_id: > java.util.UUID, video_year: Int, title: String) > defined class movie_row > scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") > vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = > CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 > scala> vids.count > [Stage 0:> (0 + 2) / > 2]WARN 2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: > Lost task 0.0 in stage 0.0 (TID 0, cassandra183): > java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row > FAILS!! > I have been unable to get this to work from a remote SPARK shell! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
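The "compiled apps work" point amounts to defining the case class at the top level of a jar, so executors resolve a stable class name instead of a REPL-generated $iwC wrapper name. A minimal sketch (my illustration; the host is a placeholder, the DataStax connector is assumed on the classpath, and the connector maps camelCase fields to snake_case columns):
{code}
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Top-level case class in a compiled jar: executors can load it by name.
case class MovieRow(actor: String, characterName: String,
                    videoId: java.util.UUID, videoYear: Int, title: String)

object CountVideos {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)
    val vids = sc.cassandraTable[MovieRow]("hcl", "videos_by_actor")
    println(vids.count())
    sc.stop()
  }
}
{code}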
[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960703#comment-15960703 ] Roman Studenikin commented on SPARK-20251: -- We've spent quite a lot of time investigating this already, and I didn't find anything like it on Google. This is an app without checkpoints, so the problem happens in the middle of a run: it simply fails to produce data to Kafka and skips the batch. Could you please refer me to any specific cases where this behaviour is expected, or point to any docs I could read about it? > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of spark streaming application. > Sometimes it just skips batch in a case of job failure and starts working on > the next one. > We expect it to attempt to reprocess batch, but not to skip it. Is it a bug > or we are missing any important configuration params? > Screenshots from spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
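Since the app runs without checkpoints, it may help to show the standard driver-recovery setup that batch-reprocessing semantics usually hinge on. This is a sketch only (the path and batch interval are placeholders), not a claim that it fixes this particular report:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // ... define input streams and output operations here ...
  ssc
}

// Recover from the checkpoint if one exists; otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}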
[jira] [Commented] (SPARK-19935) SparkSQL unsupports to create a hive table which is mapped for HBase table
[ https://issues.apache.org/jira/browse/SPARK-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960682#comment-15960682 ] sydt commented on SPARK-19935: -- Have you resolved this problem of creating a Hive table mapped to an HBase table in Spark SQL? > SparkSQL unsupports to create a hive table which is mapped for HBase table > -- > > Key: SPARK-19935 > URL: https://issues.apache.org/jira/browse/SPARK-19935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark2.0.2 >Reporter: Xiaochen Ouyang > > Spark SQL does not support the following command: > CREATE TABLE spark_test(key int, value string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") > TBLPROPERTIES ("hbase.table.name" = "xyz"); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
Peter Mead created SPARK-20252: -- Summary: java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row Key: SPARK-20252 URL: https://issues.apache.org/jira/browse/SPARK-20252 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.6.3 Environment: Datastax DSE dual node SPARK cluster [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native protocol v4] Reporter: Peter Mead After starting a spark shell using DSE -u -p x spark scala> case class movie_row (actor: String, character_name: String, video_id: java.util.UUID, video_year: Int, title: String) defined class movie_row scala> val vids=sc.cassandraTable("hcl","videos_by_actor") vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15 scala> vids.count res0: Long = 114961 Works OK!! BUT if the spark context is stopped and recreated THEN: scala> sc.stop() scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf scala> :paste // Entering paste mode (ctrl-D to finish) val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "redacted") .set("spark.cassandra.auth.username", "redacted") .set("spark.cassandra.auth.password", "redacted") // Exiting paste mode, now interpreting. conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342 scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf) sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8 scala> case class movie_row (actor: String, character_name: String, video_id: java.util.UUID, video_year: Int, title: String) defined class movie_row scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor") vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15 scala> vids.count [Stage 0:> (0 + 2) / 2]WARN 2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, cassandra183): java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row FAILS!! I have been unable to get this to work from a remote SPARK shell! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19900. --- Resolution: Cannot Reproduce > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when node, where driver is running, has unstable > network. A situation is possible when two identical applications are running > on a cluster. > *Steps to Reproduce:* > # prepare 3 node. One for the spark master and two for the spark workers. > # submit an application with parameter spark.driver.supervise = true > # go to the node where driver is running (for example spark-worker-1) and > close 7077 port > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more 60 seconds > # look at the spark master UI > There are two spark applications and one driver. The new application has > WAITING state and the second application has RUNNING state. Driver has > RUNNING or RELAUNCHING state (It depends on the resources available, as I > understand it) and it launched on other node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look an the spark UI again > There are no changes > In addition, if you look at the processes on the node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still working! > *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 > seconds > 17/03/10 05:26:27 INFO Master: Removing worker > worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0 > 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347- > 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on > worker worker-20170310052411-spark-worker-2-40473 > 17/03/10 05:26:35 INFO Master: Registering app TestApplication > 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID > app-20170310052635-0001 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/1 > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/0 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. 
Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07
[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960671#comment-15960671 ] Sean Owen commented on SPARK-20251: --- This depends on too many things, like how you've set up your app and what recovery semantics you've implemented. I'd close this as something that should start with more research and a question on the user@ list. > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of spark streaming application. > Sometimes it just skips batch in a case of job failure and starts working on > the next one. > We expect it to attempt to reprocess batch, but not to skip it. Is it a bug > or we are missing any important configuration params? > Screenshots from spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20218) '/applications/[app-id]/stages' in REST API,add description.
[ https://issues.apache.org/jira/browse/SPARK-20218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20218. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 17534 [https://github.com/apache/spark/pull/17534] > '/applications/[app-id]/stages' in REST API,add description. > > > Key: SPARK-20218 > URL: https://issues.apache.org/jira/browse/SPARK-20218 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Trivial > Fix For: 2.2.0, 2.1.2 > > > 1. For '/applications/[app-id]/stages' in the REST API, add the description > '?status=[active|complete|pending|failed] list only stages in the state.' > This description is currently missing, so users of this API do not know that > the stage list can be filtered by status. > 2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the > redundant description '?status=[active|complete|pending|failed] list only > stages in the state.', because a single stage is already determined by its > stage-id. > code: > @GET > def stageList(@QueryParam("status") statuses: JList[StageStatus]): > Seq[StageData] = { > val listener = ui.jobProgressListener > val stageAndStatus = AllStagesResource.stagesAndStatus(ui) > val adjStatuses = { > if (statuses.isEmpty()) { > Arrays.asList(StageStatus.values(): _*) > } else { > statuses > } > }; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20218) '/applications/[app-id]/stages' in REST API,add description.
[ https://issues.apache.org/jira/browse/SPARK-20218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20218: - Assignee: guoxiaolongzte Priority: Trivial (was: Minor) > '/applications/[app-id]/stages' in REST API,add description. > > > Key: SPARK-20218 > URL: https://issues.apache.org/jira/browse/SPARK-20218 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Trivial > > 1. For '/applications/[app-id]/stages' in the REST API, add the description > '?status=[active|complete|pending|failed] list only stages in the state.' > This description is currently missing, so users of this API do not know that > the stage list can be filtered by status. > 2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the > redundant description '?status=[active|complete|pending|failed] list only > stages in the state.', because a single stage is already determined by its > stage-id. > code: > @GET > def stageList(@QueryParam("status") statuses: JList[StageStatus]): > Seq[StageData] = { > val listener = ui.jobProgressListener > val stageAndStatus = AllStagesResource.stagesAndStatus(ui) > val adjStatuses = { > if (statuses.isEmpty()) { > Arrays.asList(StageStatus.values(): _*) > } else { > statuses > } > }; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Studenikin updated SPARK-20251: - Description: We are experiencing strange behaviour of spark streaming application. Sometimes it just skips batch in a case of job failure and starts working on the next one. We expect it to attempt to reprocess batch, but not to skip it. Is it a bug or we are missing any important configuration params? Screenshots from spark UI: http://pasteboard.co/1oRW0GDUX.png http://pasteboard.co/1oSjdFpbc.png was: We are experiencing strange behaviour of spark streaming application. Sometimes it just skips batch in a case of job failure and starts working on the next one. We expect it to attempt to reprocess batch, but not to skip it. Is it a bug or we are missing any important configuration params? > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of spark streaming application. > Sometimes it just skips batch in a case of job failure and starts working on > the next one. > We expect it to attempt to reprocess batch, but not to skip it. Is it a bug > or we are missing any important configuration params? > Screenshots from spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20251) Spark streaming skips batches in a case of failure
Roman Studenikin created SPARK-20251: Summary: Spark streaming skips batches in a case of failure Key: SPARK-20251 URL: https://issues.apache.org/jira/browse/SPARK-20251 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Roman Studenikin We are experiencing strange behaviour of spark streaming application. Sometimes it just skips batch in a case of job failure and starts working on the next one. We expect it to attempt to reprocess batch, but not to skip it. Is it a bug or we are missing any important configuration params? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20219) Schedule tasks based on size of input from ScheduledRDD
[ https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960608#comment-15960608 ] jin xing commented on SPARK-20219: -- [~kayousterhout] [~irashid] Thanks a lot for taking a look at this :) And sorry for the late reply. The use cases are as below: 1. It is a Spark SQL job in my cluster. The SQL is quite long and I'm hesitant to post it here (I can post it later if people want to see it :)). There are 3 stages in the job: Stage-1, Stage-2, Stage-3. Stage-3 shuffle-reads from Stage-1 and Stage-2. There are 2000 partitions in Stage-3 (we set spark.sql.shuffle.partitions=2000). The distribution of shuffle-read sizes is shown in the attached screenshot. Running with the change in the PR, the total time cost of Stage-3 is 3654 seconds; without the change, it costs 4934 seconds. I supplied 50 executors to Stage-3 (this is common in a data warehouse when the job fails to acquire enough containers from YARN). I think the improvement here is a good one. 2. I also did a small test in my local environment. The code is like below: {code}
val rdd = sc.textFile("/tmp/data", 9)
rdd.map { case num => (num, 1) }
  .groupByKey
  .map { case (key, iter) => (key, iter.sum) }
  .collect
  .foreach(println)
{code} There are 200m lines in the RDD; the content is people's names. In the ResultStage, the first 8 partitions are almost of the same size and the 9th partition is 10 times the size of the first 8. Running with the change, the result is: 17/04/07 11:50:52 INFO DAGScheduler: ResultStage 1 (collect at SparkArchetype.scala:26) finished in 23.027 s. Running without the change, the result is: 17/04/07 11:54:27 INFO DAGScheduler: ResultStage 1 (collect at SparkArchetype.scala:26) finished in 34.546 s. In my warehouse there are lots of cases like the first one described above, so I really hope this idea can be taken into consideration. I'm sorry to bring in extra complexity, and I'd be very thankful for any advice on a better implementation. > Schedule tasks based on size of input from ScheduledRDD > --- > > Key: SPARK-20219 > URL: https://issues.apache.org/jira/browse/SPARK-20219 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: screenshot-1.png > > > When data is highly skewed on ShuffledRDD, it make sense to launch those > tasks which process much more input as soon as possible. The current > scheduling mechanism in *TaskSetManager* is quite simple: > {code} > for (i <- (0 until numTasks).reverse) { > addPendingTask(i) > } > {code} > In scenario that "large tasks" locate at bottom half of tasks array, if tasks > with much more input are launched early, we can significantly reduce the time > cost and save resource when *"dynamic allocation"* is disabled. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
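The proposal boils down to replacing the reverse-index enqueue quoted in the description with an ordering by estimated input size. A standalone toy sketch of that ordering (the byte counts and the addPendingTask stub are hypothetical; the real change lives in TaskSetManager and the linked PR):
{code}
// Hypothetical per-partition shuffle-input sizes.
val bytesByPartition = Array(10L, 500L, 20L, 7000L, 30L)
val numTasks = bytesByPartition.length

// Stub standing in for TaskSetManager.addPendingTask.
def addPendingTask(i: Int): Unit =
  println(s"enqueue task $i (${bytesByPartition(i)} bytes)")

// Enqueue the largest inputs first instead of plain reversed index order,
// so the likely stragglers start running as early as possible.
for (i <- (0 until numTasks).sortBy(i => -bytesByPartition(i))) {
  addPendingTask(i)
}
{code}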
[jira] [Updated] (SPARK-19282) RandomForestRegressionModel summary should expose getMaxDepth
[ https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19282: -- Fix Version/s: (was: 2.2.0) > RandomForestRegressionModel summary should expose getMaxDepth > - > > Key: SPARK-19282 > URL: https://issues.apache.org/jira/browse/SPARK-19282 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark, SparkR >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear how to get the max depth of a > RandomForestRegressionModel (eg, after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19522: -- Target Version/s: 2.0.3, 2.1.2, 2.2.0 (was: 2.0.3, 2.1.1) > --executor-memory flag doesn't work in local-cluster mode > - > > Key: SPARK-19522 > URL: https://issues.apache.org/jira/browse/SPARK-19522 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > bin/spark-shell --master local-cluster[2,1,2048] > {code} > doesn't do what you think it does. You'll end up getting executors with 1GB > each because that's the default. This is because the executor memory flag > isn't actually read in local-cluster mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
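Until the flag is honored on this code path, a workaround sketch (assuming Spark 2.x, where the third bracket argument of local-cluster is the per-executor memory in MB):
{code}
import org.apache.spark.sql.SparkSession

// Size executors through the master URL itself: 2 executors, 1 core each,
// 4096 MB each. --executor-memory would be ignored on this code path.
val spark = SparkSession.builder()
  .master("local-cluster[2,1,4096]")
  .appName("local-cluster-memory-demo")
  .getOrCreate()
{code}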
[jira] [Updated] (SPARK-19035) rand() function in CASE WHEN causes failure
[ https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19035: -- Target Version/s: 2.0.3, 2.1.2, 2.2.0 (was: 2.0.3, 2.1.1, 2.2.0) > rand() function in CASE WHEN causes failure > - > > Key: SPARK-19035 > URL: https://issues.apache.org/jira/browse/SPARK-19035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Feng Yuan > > *In this case:* >select >case when a=1 then 1 else concat(a,cast(rand() as > string)) end b,count(1) >from >yuanfeng1_a >group by >case when a=1 then 1 else concat(a,cast(rand() as > string)) end; > *Thrown error:* > Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group > by, nor is it an aggregate function. Add to group by or wrap in first() (or > first_value) if you don't care which value you get.;; > Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE > concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) > END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 > as string), cast(rand(8090243936131101651) as string)) END AS b#2074] > +- MetastoreRelation default, yuanfeng1_a > select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group > by case when a=1 then rand() end also outputs this. > *Notice*: > If rand() is replaced with 1, it works. > A simpler way to reproduce this bug: `SELECT a + rand() FROM t GROUP BY a + > rand()`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
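A workaround sketch until the analyzer handles this (assumes a SparkSession named {{spark}} and the table above): evaluate rand() once in a subquery and group by the alias, so the projection and the grouping resolve to the same expression instance instead of two differently seeded rand() calls:
{code}
// Materialize the non-deterministic expression once, then group by the alias.
val result = spark.sql(
  """SELECT b, count(1)
    |FROM (SELECT CASE WHEN a = 1 THEN '1'
    |             ELSE concat(a, CAST(rand() AS string)) END AS b
    |      FROM yuanfeng1_a) t
    |GROUP BY b""".stripMargin)
{code}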
[jira] [Updated] (SPARK-20219) Schedule tasks based on size of input from ShuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-20219: - Attachment: screenshot-1.png > Schedule tasks based on size of input from ShuffledRDD > --- > > Key: SPARK-20219 > URL: https://issues.apache.org/jira/browse/SPARK-20219 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: screenshot-1.png > > > When data is highly skewed in a ShuffledRDD, it makes sense to launch the > tasks that process much more input as soon as possible. The current > scheduling mechanism in *TaskSetManager* is quite simple: > {code} > for (i <- (0 until numTasks).reverse) { > addPendingTask(i) > } > {code} > In the scenario where "large tasks" sit in the bottom half of the task array, > launching the tasks with much more input early can significantly reduce the > time cost and save resources when *"dynamic allocation"* is disabled. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20250) Improper OOM error when a task is killed while spilling data
[ https://issues.apache.org/jira/browse/SPARK-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-20250: - Description: When a task calls spill() but receives a kill request from the driver (e.g., as a speculative task), the TaskMemoryManager throws an OOM exception. The executor then treats it as an uncaught exception, which is handled by SparkUncaughtExceptionHandler, and the executor is consequently shut down. This error may then cause the whole application to fail once the "max number of executor failures (30) reached" limit is hit. In our production environment, we have encountered a lot of such cases. \\
{noformat}
17/04/05 06:41:27 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data of 928.0 MB to disk (1 time so far)
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Spill file:/data/usercache/application_1482394966158_87487271/blockmgr-85c25fa8-06b4/32/temp_local_b731
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:30 INFO executor.Executor: Executor is trying to kill task 16.0 in stage 3.0 (TID 857)
17/04/05 06:41:30 ERROR memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
    at org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
    at org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:302)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:346)
17/04/05 06:41:30 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data of 928.0 MB to disk (2 times so far)
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Spill file:/data/usercache/appcache/application_1482394966158_87487271/blockmgr-573312a3-bd46-4c5c-9293-1021cc34c77
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:31 ERROR memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
    at org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
    at org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
    .
17/04/05 06:41:31 WARN memory.TaskMemoryManager: leak 32.0 KB memory from org.apache.spark.shuffle.sort.ShuffleExternalSorter@513661a6
17/04/05 06:41:31 ERROR executor.Executor: Managed memory leak detected; size = 26010016 bytes, TID = 857
17/04/05 06:41:31 ERROR executor.Executor: Exception in task 16.0 in stage 3.0 (TID 857)
java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : null
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
{noformat}
[jira] [Updated] (SPARK-20250) Improper OOM error when a task is killed while spilling data
[ https://issues.apache.org/jira/browse/SPARK-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-20250: - Description: While a task is calling spill(), if it receives a kill request from the driver (e.g., as a speculative task), the TaskMemoryManager throws an OOM exception. The executor then treats it as an uncaught exception, which is handled by SparkUncaughtExceptionHandler, and the executor is consequently shut down. This error may then cause the whole application to fail once the "max number of executor failures (30) reached" limit is hit. In our production environment, we have encountered a lot of such cases. \\
{noformat}
17/04/05 06:41:27 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data of 928.0 MB to disk (1 time so far)
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Spill file:/data/usercache/application_1482394966158_87487271/blockmgr-85c25fa8-06b4/32/temp_local_b731
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:30 INFO executor.Executor: Executor is trying to kill task 16.0 in stage 3.0 (TID 857)
17/04/05 06:41:30 ERROR memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
    at org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
    at org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:302)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:346)
17/04/05 06:41:30 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data of 928.0 MB to disk (2 times so far)
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Spill file:/data/usercache/appcache/application_1482394966158_87487271/blockmgr-573312a3-bd46-4c5c-9293-1021cc34c77
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:31 ERROR memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
    at org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
    at org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
    .
17/04/05 06:41:31 WARN memory.TaskMemoryManager: leak 32.0 KB memory from org.apache.spark.shuffle.sort.ShuffleExternalSorter@513661a6
17/04/05 06:41:31 ERROR executor.Executor: Managed memory leak detected; size = 26010016 bytes, TID = 857
17/04/05 06:41:31 ERROR executor.Executor: Exception in task 16.0 in stage 3.0 (TID 857)
java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : null
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
{noformat}
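A sketch of the mitigation direction (illustrative only, not the actual patch; the wrapper below is hypothetical): when spill() fails because the kill interrupt closed the spill channel, surface the failure as a task kill rather than wrapping it in an OutOfMemoryError that SparkUncaughtExceptionHandler treats as executor-fatal.
{code}
import java.nio.channels.ClosedByInterruptException

// Hypothetical wrapper around a spill attempt. A ClosedByInterruptException
// here means the kill interrupt closed the spill file channel, so the task
// should fail as "killed" instead of as an executor-fatal OutOfMemoryError.
def spillOrSignalKill(spill: () => Long): Long =
  try {
    spill()
  } catch {
    case e: ClosedByInterruptException =>
      val killed = new InterruptedException("task killed while spilling")
      killed.initCause(e)
      throw killed
  }
{code}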