[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121968#comment-15121968 ]

Cristian commented on SPARK-9301:
---------------------------------

Seconded. It looks like MutableAggregationBuffer is not so mutable after all: everything gets converted to catalyst types and back every time, which makes it impossible to implement anything that collects a larger amount of data to evaluate later.

> collect_set and collect_list aggregate functions
> ------------------------------------------------
>
>                 Key: SPARK-9301
>                 URL: https://issues.apache.org/jira/browse/SPARK-9301
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Nick Buroojy
>            Priority: Critical
>             Fix For: 1.6.0
>
> A short introduction on how to build aggregate functions based on our new
> interface can be found at
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
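The conversion overhead described in the comment above comes from the 1.5/1.6-era UDAF interface itself. As a rough sketch (a hypothetical collect_list-style UDAF, not Spark's actual implementation): every update() has to read the accumulated sequence back out of the buffer's catalyst representation and write the whole thing back, so appending the n-th row costs O(n) conversions rather than O(1).

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical collect_list-style UDAF, for illustration only.
class CollectListUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType  = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("items", ArrayType(LongType)) :: Nil)
  def dataType: DataType       = ArrayType(LongType)
  def deterministic: Boolean   = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Long]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    // The whole array is deserialized from catalyst form, appended to,
    // and re-serialized on every row -- the cost the comment complains about.
    buffer(0) = buffer.getAs[Seq[Long]](0) :+ input.getLong(0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getAs[Seq[Long]](0) ++ buffer2.getAs[Seq[Long]](0)

  def evaluate(buffer: Row): Any = buffer.getAs[Seq[Long]](0)
}
```

This requires a Spark runtime, so it is shown here only to make the per-row round-trip concrete.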
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035700#comment-15035700 ]

Cristian commented on SPARK-11596:
----------------------------------

That's great, thank you.

> SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-11596
>                 URL: https://issues.apache.org/jira/browse/SPARK-11596
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Cristian
>            Assignee: Yin Huai
>             Fix For: 1.6.0
>
>         Attachments: screenshot-1.png
>
> For nested query plans like a recursive unionAll, withExecutionId is
> extremely slow, likely because of repeated string concatenation in
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   //union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034058#comment-15034058 ]

Cristian commented on SPARK-11596:
----------------------------------

Any chance this can be fixed soon? It looks like a straightforward bug to me, but it can have quite an impact.
[jira] [Created] (SPARK-11879) Checkpoint support for DataFrame
Cristian created SPARK-11879:
--------------------------------

             Summary: Checkpoint support for DataFrame
                 Key: SPARK-11879
                 URL: https://issues.apache.org/jira/browse/SPARK-11879
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.5.2
            Reporter: Cristian

Explicit support for checkpointing DataFrames is needed in order to truncate lineages, prune the query plan (particularly the logical plan) and allow transparent failure recovery. While saving to a Parquet file may be sufficient for recovery, actually using that file as a checkpoint (and truncating the lineage) requires reading the files back.

This is required to be able to use DataFrames in iterative scenarios like Streaming and ML, as well as to avoid expensive re-computations in case of executor failure when executing a complex chain of queries on very large datasets.
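The interim workaround hinted at in the description (save to Parquet, then read the files back) can be sketched roughly as follows, assuming a live SQLContext and the 1.5-era DataFrame reader/writer API. checkpointViaParquet is a hypothetical helper for illustration, not a Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: materializes the DataFrame to Parquet and reads it
// back, so the returned frame's plan is a plain file scan instead of the
// accumulated query lineage.
def checkpointViaParquet(df: DataFrame, path: String)(implicit sqlContext: SQLContext): DataFrame = {
  df.write.parquet(path)        // executes the full plan once and persists it
  sqlContext.read.parquet(path) // new lineage: just a scan of the saved files
}
```

The round-trip through storage is exactly the overhead the issue asks to make first-class: the re-read is needed only to cut the plan, not for the data itself.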
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019158#comment-15019158 ]

Cristian commented on SPARK-11596:
----------------------------------

You are right, though, that this does not reproduce without caching. It does get slower, because it needs to recompute the chain every time, but without caching it actually finishes all 100 iterations in a reasonable amount of time. With caching it virtually stops after about 20 iterations and spends all its time in toString.

I suspect this is because the query plans are different, and some query plan introduced by caching has a very expensive toString?
[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristian updated SPARK-11596:
--------------------------------
    Description: 
For nested query plans like a recursive unionAll, withExecutionId is extremely slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:
{code}
(1 to 100).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  union.cache()
  println(">>" + union.count)
  //union.show()
  Some(union)
}
{code}

  was:
For nested query plans like a recursive unionAll, withExecutionId is extremely slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:
{code}
(1 to 100).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  //union.show()
  Some(union)
}
{code}
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019096#comment-15019096 ]

Cristian commented on SPARK-11596:
----------------------------------

Sorry, my repro code was missing a cache() statement. I added it now, and that should make the actual execution (triggered intentionally by count(), yes) very cheap. You'll still see that after only about 10 iterations it becomes extremely slow.

The issue is in QueryExecution.toString(), which is called from SQLExecution.withNewExecutionId(). I verified this by removing the transformation of the plans into strings there, rebuilding and rerunning, and the problem goes away.
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019124#comment-15019124 ]

Cristian commented on SPARK-11596:
----------------------------------

An easy way to check this is to run the code in local mode, and when it becomes very slow use jstack; you'll get the stack trace I pasted above. Interestingly, you'll almost always find the following at the top of the stack, which seems to suggest an issue with getting the simple name of an anonymous class (maybe a Scala closure?). The issue is generating the execution ID itself, not the execution, that's for certain:

{quote}
at java.lang.Class.getEnclosingMethod0(Native Method)
at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
at java.lang.Class.getEnclosingClass(Class.java:1272)
at java.lang.Class.getSimpleBinaryName(Class.java:1443)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode.nodeName(TreeNode.scala:377)
at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:395)
at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:429)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at scala.collection.immutable.List.foreach(List.scala:318)
{quote}
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019241#comment-15019241 ]

Cristian commented on SPARK-11596:
----------------------------------

OK, I found a much simpler repro. Note that the code below does not actually execute anything; it only generates the plans. explain() takes a very long time here:

{code}
val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  union.cache()
  Some(union)
}

c.get.explain(true) // <--- this is very expensive
{code}
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019267#comment-15019267 ] Cristian commented on SPARK-11596:
--
Looks like the problem is here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L358
{code}
case tn: TreeNode[_] if tn.toString contains "\n" => s"(${tn.simpleString})" :: Nil
{code}
When this condition is true, tree evaluation becomes exponential: each node renders its child subtrees in full just to decide how to format them, and the subtrees are then rendered again for the tree layout.
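The exponential behavior can be reproduced without Spark. Below is a minimal sketch (a toy tree with hypothetical names, not Spark's actual TreeNode internals) in which simpleString renders every child in full just to test for a newline, the analogue of the quoted condition, and treeString then renders each child again for the layout, so the total rendering work doubles with every level of nesting:

```scala
object ExponentialRender {
  var renders = 0

  case class Node(name: String, children: List[Node]) {
    // Renders every child in full just to decide whether to
    // parenthesize it (the analogue of `tn.toString contains "\n"`)
    def simpleString: String = {
      val args = children.map { c =>
        val s = c.treeString
        if (s.contains("\n")) "(" + s.replace("\n", " ") + ")" else s
      }
      name + args.mkString(" [", ", ", "]")
    }

    // Renders every child a second time for the tree layout, so each
    // nesting level doubles the total rendering work
    def treeString: String = {
      renders += 1
      (simpleString :: children.map(_.treeString)).mkString("\n")
    }
  }

  // A linear chain of `depth` unary operators over a leaf, mimicking
  // the plan shape produced by a recursive unionAll
  def chain(depth: Int): Node =
    (1 to depth).foldLeft(Node("leaf", Nil))((n, i) => Node(s"op$i", List(n)))

  def countRenders(depth: Int): Int = {
    renders = 0
    chain(depth).treeString
    renders
  }
}
```

Here countRenders follows R(d) = 2 * R(d-1) + 1, so a plan only 20 operators deep, as in the repro above, already triggers on the order of two million node renders.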
[jira] [Issue Comment Deleted] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristian updated SPARK-11596:
--
Comment: was deleted
(was: An easy way to check this is to run the code in local mode, and when it becomes very slow use jstack; you'll get the stack trace I pasted above. Interestingly, you will almost always find the following at the top of the stack, which seems to suggest an issue with getting the simple name of an anonymous class, maybe a Scala closure? What is certain is that the issue is generating the execution ID itself, not the execution.
{quote}
at java.lang.Class.getEnclosingMethod0(Native Method)
at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
at java.lang.Class.getEnclosingClass(Class.java:1272)
at java.lang.Class.getSimpleBinaryName(Class.java:1443)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode.nodeName(TreeNode.scala:377)
at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:395)
at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:429)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at scala.collection.immutable.List.foreach(List.scala:318)
{quote})
[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012386#comment-15012386 ] Cristian commented on SPARK-11596:
--
This is not what this issue is about. It's very useful to have unionAll work recursively or iteratively. The problem is not the execution of the actual plan, but the creation of an execution ID: QueryPlan.simpleString is very inefficient, I suspect because it copies Strings instead of using StringBuffers all the way through. Basically this is a very expensive toString, with a very simple fix. Any chance you might look into it? It should be straightforward to fix; for objective reasons I'm not able to.
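The StringBuffer suggestion above can be sketched in plain Scala (a hypothetical Node shape, not Spark's TreeNode): concatenating the strings returned by each child copies the whole subtree text at every level, while threading one StringBuilder through the recursion appends each node exactly once.

```scala
object TreeStrings {
  case class Node(name: String, children: List[Node])

  // Quadratic in output size: each level copies the full rendered text
  // of everything below it when concatenating the child strings
  def slow(n: Node, indent: String = ""): String =
    indent + n.name + "\n" + n.children.map(c => slow(c, indent + "  ")).mkString

  // Linear: every level appends into one shared buffer in place
  def fast(n: Node, sb: StringBuilder = new StringBuilder, indent: String = ""): StringBuilder = {
    sb.append(indent).append(n.name).append('\n')
    n.children.foreach(c => fast(c, sb, indent + "  "))
    sb
  }

  // A deep unary chain, the plan shape a recursive unionAll produces
  def chain(depth: Int): Node =
    (1 to depth).foldLeft(Node("leaf", Nil))((n, i) => Node(s"op$i", List(n)))
}
```

Both versions produce the same text; only the allocation behavior differs, which is why this reads as a pure toString fix.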
[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristian updated SPARK-11596:
--
Description: changed the test-case block from {code} to {code:scala} and replaced the invalid {block} markup around the stack trace with {quote}; the test case and stack trace themselves are unchanged from the issue description.
[jira] [Created] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
Cristian created SPARK-11596:
--
Summary: SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
Key: SPARK-11596
URL: https://issues.apache.org/jira/browse/SPARK-11596
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.1
Reporter: Cristian

For nested query plans like a recursive unionAll, withExecutionId is extremely slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:
{code}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  //union.show()
  Some(union)
}
{code}

Stack trace:
{quote}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
{quote}
[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId
[ https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristian updated SPARK-11596:
--
Description: reverted the {code:scala} hint on the test-case block to plain {code}; the test case and stack trace themselves are unchanged from the issue description.
[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634316#comment-14634316 ] Cristian commented on SPARK-4849:
--
I would argue that the priority for this is not Minor: if resolved, it will enable many use cases where data is stored in memory and queried repeatedly at low latency. For example, in Spark Streaming applications it's common to join incoming data with a memory-resident dataset for enrichment. If that join could be performed without a shuffle, it would enable important use cases that are currently too high-latency to implement with Spark. It also appears to be a fairly straightforward fix, so any chance it can get some priority?

Pass partitioning information (distribute by) to In-memory caching
--
Key: SPARK-4849
URL: https://issues.apache.org/jira/browse/SPARK-4849
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

HQL distribute by column_name partitions data based on the specified column's values. We can pass this information to in-memory caching for further performance improvements; e.g. in joins, an extra partition step can be saved based on this information. Refer: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
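The shuffle-free join this would enable can be illustrated in plain Scala (a conceptual sketch with hypothetical helpers, not Spark's API): when both sides are hash-partitioned on the join key by the same partitioner, matching keys land in the same partition index, so each partition pair can be joined locally with no redistribution step. Preserving that partitioner through caching is exactly what the ticket asks for.

```scala
object CoPartitionedJoin {
  // Hash-partition a dataset on its key, mirroring `distribute by`
  def partitionBy[K, V](data: Seq[(K, V)], numParts: Int): Vector[Seq[(K, V)]] =
    Vector.tabulate(numParts) { p =>
      data.filter { case (k, _) => ((k.hashCode % numParts) + numParts) % numParts == p }
    }

  // Because both sides used the same partitioner, keys that match are
  // guaranteed to sit at the same partition index, so the join runs
  // partition-by-partition without any shuffle
  def join[K, V, W](a: Vector[Seq[(K, V)]], b: Vector[Seq[(K, W)]]): Seq[(K, (V, W))] =
    a.zip(b).flatMap { case (pa, pb) =>
      val index = pb.groupBy(_._1) // local hash index per partition
      pa.flatMap { case (k, v) =>
        index.getOrElse(k, Nil).map { case (_, w) => (k, (v, w)) }
      }
    }
}
```

In the streaming-enrichment scenario from the comment, the cached memory-resident side would keep its partitioning, and only the small incoming batch would need to be partitioned before the local join.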
[jira] [Created] (SPARK-8435) Cannot create tables in a specific database using a provider
Cristian created SPARK-8435:
--
Summary: Cannot create tables in a specific database using a provider
Key: SPARK-8435
URL: https://issues.apache.org/jira/browse/SPARK-8435
Project: Spark
Issue Type: Bug
Environment: Spark SQL 1.4.0 (Spark-Shell), Hive metastore, MySQL Driver, Linux
Reporter: Cristian

Hello, I've been trying to create tables in different databases using a Hive metastore, and when I execute the CREATE statement I realized that the table is created in the default database. This is what I'm trying:
{quote}
scala> sqlContext.sql("CREATE DATABASE IF NOT EXISTS testmetastore COMMENT 'Testing catalogs'")
scala> sqlContext.sql("USE testmetastore")
scala> sqlContext.sql("CREATE TABLE students USING org.apache.spark.sql.parquet OPTIONS (path '/user/hive', highavailability 'true', DefaultLimit '1000')")
{quote}
And this is what I get. I can see that it is partly working, because when it checks whether the table exists, it searches in the correct database (*testmetastore*). But when it finally creates the table, it uses the *default* database.
{quote}
scala> sqlContext.sql("CREATE TABLE students USING a OPTIONS (highavailability 'true', DefaultLimit '1000')").show
15/06/18 10:28:48 INFO HiveMetaStore: 0: get_table : db=*testmetastore* tbl=students
15/06/18 10:28:48 INFO audit: ugi=ccaballero ip=unknown-ip-addr cmd=get_table : db=testmetastore tbl=students
15/06/18 10:28:48 INFO Persistence: Request to load fields comment,name,type of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored
15/06/18 10:28:48 INFO Persistence: Request to load fields comment,name,type of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored
15/06/18 10:28:48 INFO HiveMetaStore: 0: create_table: Table(tableName:students, dbName:*default*, owner:ccaballero, createTime:1434616128, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{DefaultLimit=1000, serialization.format=1, highavailability=true}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=a}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
15/06/18 10:28:48 INFO audit: ugi=ccaballero ip=unknown-ip-addr cmd=create_table: Table(tableName:students, dbName:default, owner:ccaballero, createTime:1434616128, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{DefaultLimit=1000, serialization.format=1, highavailability=true}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=a}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
15/06/18 10:28:49 INFO SparkContext: Starting job: show at <console>:20
15/06/18 10:28:49 INFO DAGScheduler: Got job 2 (show at <console>:20) with 1 output partitions (allowLocal=false)
15/06/18 10:28:49 INFO DAGScheduler: Final stage: ResultStage 2 (show at <console>:20)
15/06/18 10:28:49 INFO DAGScheduler: Parents of final stage: List()
15/06/18 10:28:49 INFO DAGScheduler: Missing parents: List()
15/06/18 10:28:49 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[6] at show at <console>:20), which has no missing parents
15/06/18 10:28:49 INFO MemoryStore: ensureFreeSpace(1792) called with curMem=0, maxMem=278302556
15/06/18 10:28:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1792.0 B, free 265.4 MB)
15/06/18 10:28:49 INFO MemoryStore: ensureFreeSpace(1139) called with curMem=1792, maxMem=278302556
15/06/18 10:28:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1139.0 B, free 265.4 MB)
15/06/18 10:28:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:59110 (size: 1139.0 B, free: 265.4 MB)
15/06/18 10:28:49 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
{quote}
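The log above shows the existence check running against db=testmetastore while create_table runs against dbName:default. A toy sketch of that kind of inconsistency (a hypothetical catalog, not Spark's actual metastore code) makes the shape of the bug, and of the likely fix, explicit:

```scala
object CatalogSketch {
  final case class TableId(db: String, name: String)

  class Catalog {
    var currentDb: String = "default"
    private val tables = scala.collection.mutable.Set.empty[TableId]

    def use(db: String): Unit = currentDb = db

    // Buggy shape: the existence check resolves the name against the
    // session's current database, but the create path ignores it and
    // falls back to a hard-coded "default"
    def createTableBuggy(name: String): TableId = {
      require(!tables.contains(TableId(currentDb, name)), s"$name already exists")
      val id = TableId("default", name) // <- the inconsistency
      tables += id
      id
    }

    // Fixed shape: both paths resolve against the same current database
    def createTable(name: String): TableId = {
      val id = TableId(currentDb, name)
      require(!tables.contains(id), s"$name already exists")
      tables += id
      id
    }
  }
}
```

After USE testmetastore, the buggy path still lands the table in "default", which matches the mismatch visible in the get_table/create_table log lines.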
[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375243#comment-14375243 ] Cristian commented on SPARK-5863:
--
I'm a bit confused. The original jira refers to a very specific issue that has a very simple solution (I think).
{code}
r.toSeq.zip(schema.fields.map(_.dataType))
{code}
can be written as:
{code}
r.toSeq.view.zip(schema.fields.view.map(_.dataType))
{code}
Or, in fact, schema.fields.map(_.dataType) can be computed once and passed into the function instead of being recomputed every time. Also, this is a clear regression, as I showed initially, and it seems to have a simple fix, so why not fix the 1.2 branch as well?

Improve performance of convertToScala codepath.
--
Key: SPARK-5863
URL: https://issues.apache.org/jira/browse/SPARK-5863
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian
Priority: Critical

Was doing some perf testing on reading parquet files and noticed that moving from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the culprit showed up as being in ScalaReflection.convertRowToScala. In particular, this zip is the issue:
{code}
r.toSeq.zip(schema.fields.map(_.dataType))
{code}
I see there's a comment on that currently saying that it is slow, but it wasn't fixed. This actually produces a 3x degradation in parquet read performance, at least in my test case.

Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to change to using seq.view, which would allocate iterators instead.
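Both fixes proposed above can be sketched side by side (with hypothetical Field and Schema shapes, not Spark's actual types): the per-row version rebuilds schema.fields.map(_.dataType) on every call, while the hoisted version computes it once outside the row loop.

```scala
object ConvertRows {
  final case class Field(name: String, dataType: String)
  final case class Schema(fields: Vector[Field])

  // Per-row work as described in the ticket: a fresh dataType sequence
  // and a fresh zipped buffer are allocated for every single row
  def perRow(row: Seq[Any], schema: Schema): Seq[(Any, String)] =
    row.zip(schema.fields.map(_.dataType))

  // Hoisted: the dataTypes are materialized once and reused for all rows
  def convertAll(rows: Seq[Seq[Any]], schema: Schema): Seq[Seq[(Any, String)]] = {
    val dataTypes = schema.fields.map(_.dataType) // computed once, outside the loop
    rows.map(_.zip(dataTypes))
  }
}
```

The seq.view variant from the comment achieves a similar effect by making the zip lazy, trading the upfront buffer allocation for recomputation on each traversal.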