[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions

2016-01-28 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121968#comment-15121968
 ] 

Cristian commented on SPARK-9301:
-

Seconded. It looks like MutableAggregationBuffer is not so mutable after all: 
everything gets converted to Catalyst types and back every time, which makes it 
impossible to implement anything that collects a larger amount of data to 
evaluate later.
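
For context, here is a minimal collect_list-style UDAF sketch (mine, not from 
this issue) showing the pattern in question: the whole accumulated sequence 
lives in the aggregation buffer, so every update reads it out of the Catalyst 
representation and writes a converted copy back.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectListUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("items", ArrayType(StringType)) :: Nil)
  def dataType: DataType = ArrayType(StringType)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[String]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    // Despite the "mutable" name, this rewrites the entire array once per
    // input row, converting between Catalyst and Scala types each time.
    buffer(0) = buffer.getAs[Seq[String]](0) :+ input.getString(0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getAs[Seq[String]](0) ++ buffer2.getAs[Seq[String]](0)

  def evaluate(buffer: Row): Any = buffer.getAs[Seq[String]](0)
}
{code}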

> collect_set and collect_list aggregate functions
> 
>
> Key: SPARK-9301
> URL: https://issues.apache.org/jira/browse/SPARK-9301
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Nick Buroojy
>Priority: Critical
> Fix For: 1.6.0
>
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.






[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-12-02 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035700#comment-15035700
 ] 

Cristian commented on SPARK-11596:
--

That's great, thank you.

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
>Assignee: Yin Huai
> Fix For: 1.6.0
>
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-12-01 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034058#comment-15034058
 ] 

Cristian commented on SPARK-11596:
--

Any chance this can be fixed soon? It looks like a straightforward bug to me, 
but it can have quite an impact.

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Created] (SPARK-11879) Checkpoint support for DataFrame

2015-11-20 Thread Cristian (JIRA)
Cristian created SPARK-11879:


 Summary: Checkpoint support for DataFrame
 Key: SPARK-11879
 URL: https://issues.apache.org/jira/browse/SPARK-11879
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.2
Reporter: Cristian


Explicit support for checkpointing DataFrames is needed to be able to truncate 
lineages, prune the query plan (particularly the logical plan), and enable 
transparent failure recovery.

While saving to a Parquet file may be sufficient for recovery, actually using 
that as a checkpoint (and truncating the lineage) requires reading the files 
back.

This is required to be able to use DataFrames in iterative scenarios like 
Streaming and ML, as well as to avoid expensive re-computations in case of 
executor failure when executing a complex chain of queries on very large 
datasets.
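
As a stopgap, the lineage can be truncated manually along the lines of the 
sketch below. This is only an illustration of the save-and-read-back approach 
described above, not a proposed API; the point of this issue is to make that 
explicit and transparent.

{code}
import org.apache.spark.sql.DataFrame

// Write the DataFrame out and read it back: the returned DataFrame's plan is
// just a Parquet scan, so the accumulated lineage is discarded.
def manualCheckpoint(df: DataFrame, path: String): DataFrame = {
  df.write.parquet(path)
  df.sqlContext.read.parquet(path)
}
{code}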






[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019158#comment-15019158
 ] 

Cristian commented on SPARK-11596:
--

You are right, though, that this does not reproduce without caching. It does 
get slower because it needs to recompute the chain every time, but without 
caching it actually finishes all 100 iterations in a reasonable amount of time. 
With caching it virtually stops after about 20 iterations and spends all the 
time in toString.

I suspect this is because the query plans are different, and caching introduces 
some query plan node with a very expensive toString?
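
One way to isolate this (a diagnostic sketch of mine; queryExecution is Spark's 
developer-facing hook into the plan) is to time the plan rendering directly, 
without executing anything:

{code}
// Run on the last DataFrame produced by the loop above (here called `union`):
val t0 = System.nanoTime()
val rendered = union.queryExecution.toString // the string withNewExecutionId builds
println(s"plan rendering: ${(System.nanoTime() - t0) / 1e6} ms, ${rendered.length} chars")
{code}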

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian updated SPARK-11596:
-
Description: 
For nested query plans like a recursive unionAll, withExecutionId is extremely 
slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:

{code}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  union.cache()
  println(">>" + union.count)
  // union.show()
  Some(union)
}
{code}

Stack trace:
{quote}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
{quote}

  was:
For nested query plans like a recursive unionAll, withExecutionId is extremely 
slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:

{code}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  // union.show()
  Some(union)
}
{code}

Stack trace:
{quote}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)

[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019096#comment-15019096
 ] 

Cristian commented on SPARK-11596:
--

Sorry, my repro code was missing a cache() statement. I added it now, and that 
should make the actual execution (triggered intentionally by count(), yes) very 
cheap.

You'll still see that after only about 10 iterations it becomes extremely slow.

The issue is in QueryExecution.toString(), which is called from 
SQLExecution.withNewExecutionId().

I verified this by removing the transformation of the plans into strings there, 
rebuilding, and rerunning: the problem goes away.

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019124#comment-15019124
 ] 

Cristian commented on SPARK-11596:
--

An easy way to check this is to run the code in local mode and, when it becomes 
very slow, take a jstack dump; you'll get the stack trace I pasted above.

Interestingly, you'll almost always find the frames below at the top of the 
stack, which seems to suggest an issue with getting the simple name of an 
anonymous class, maybe a Scala closure?

The issue is in generating the execution ID itself, not in the execution, 
that's for certain:

{quote}
at java.lang.Class.getEnclosingMethod0(Native Method)
at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
at java.lang.Class.getEnclosingClass(Class.java:1272)
at java.lang.Class.getSimpleBinaryName(Class.java:1443)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode.nodeName(TreeNode.scala:377)
at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:395)
at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:429)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at scala.collection.immutable.List.foreach(List.scala:318)
{quote}
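
A quick micro-check of that hypothesis (my own sketch, not from the issue): per 
the frames above, each getSimpleName call on an anonymous class goes through 
the native getEnclosingMethod0, so calling it once per tree node on every 
rendering adds up quickly.

{code}
val anon = new Runnable { def run(): Unit = () }
val t0 = System.nanoTime()
var i = 0
while (i < 1000000) { anon.getClass.getSimpleName; i += 1 }
println(s"1e6 getSimpleName calls took ${(System.nanoTime() - t0) / 1e6} ms")
{code}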

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> 

[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019241#comment-15019241
 ] 

Cristian commented on SPARK-11596:
--

Ok, I found a much simpler repro. Note that the code below does not actually 
execute anything; it just generates the plans. DataFrame.explain() takes a very 
long time here:

{code}
val c = (1 to 20).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  union.cache()
  Some(union)
}

c.get.explain(true) // <--- this is very expensive
{code}

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019267#comment-15019267
 ] 

Cristian commented on SPARK-11596:
--

Looks like the problem is here:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L358

{code}
case tn: TreeNode[_] if tn.toString contains "\n" => s"(${tn.simpleString})" :: Nil
{code}

When this condition is true, the tree evaluation becomes exponential.
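
A toy model (my own sketch, not Spark code) of why it is exponential: 
simpleString renders each child via toString just to test for a newline, and 
toString renders the node plus each child's full toString, so for a left-deep 
chain of unions each extra level roughly doubles the work.

{code}
case class Node(name: String, children: List[Node]) {
  def simpleString: String = {
    val args = children.map { c =>
      // Same shape as the TreeNode case above: render the child in full,
      // then render it again via simpleString if it was multi-line.
      if (c.toString contains "\n") s"(${c.simpleString})" else c.toString
    }
    s"$name(${args.mkString(", ")})"
  }
  override def toString: String =
    (simpleString :: children.map(_.toString)).mkString("\n")
}

// A left-deep chain like the repeated unionAll plan: runtime roughly doubles
// with each extra level, so even modest depths take noticeable time.
val deep = (1 to 22).foldLeft(Node("leaf", Nil))((c, i) => Node(s"union$i", List(c)))
println(deep.toString.length)
{code}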

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Issue Comment Deleted] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-20 Thread Cristian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian updated SPARK-11596:
-
Comment: was deleted

(was: An easy way to check this is to run the code in local mode and, when it 
becomes very slow, take a jstack dump; you'll get the stack trace I pasted 
above.

Interestingly, you'll almost always find the frames below at the top of the 
stack, which seems to suggest an issue with getting the simple name of an 
anonymous class, maybe a Scala closure?

The issue is in generating the execution ID itself, not in the execution, 
that's for certain:

{quote}
at java.lang.Class.getEnclosingMethod0(Native Method)
at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
at java.lang.Class.getEnclosingClass(Class.java:1272)
at java.lang.Class.getSimpleBinaryName(Class.java:1443)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode.nodeName(TreeNode.scala:377)
at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:395)
at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:429)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:431)
at scala.collection.immutable.List.foreach(List.scala:318)
{quote})

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
> Attachments: screenshot-1.png
>
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   union.cache()
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> 

[jira] [Commented] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-18 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012386#comment-15012386
 ] 

Cristian commented on SPARK-11596:
--

This is not what this is about. It's very useful to have unionAll work 
recursively or iteratively.

The problem is not with the execution of the actual plan but with creating an 
execution ID: QueryPlan.simpleString is very inefficient, I suspect because it 
copies Strings instead of using StringBuffers all the way through.

Basically this is a very expensive toString with a very simple fix. Any chance 
you might look into it? It should be straightforward to fix. For objective 
reasons I'm not able to.
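
To illustrate the concatenation point (an illustrative sketch, not the actual 
QueryPlan code): building a large string by repeated concatenation copies the 
accumulated prefix on every step, so the total work is quadratic, while a 
single StringBuilder keeps it linear.

{code}
def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val r = f
  println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
  r
}

val parts = Seq.fill(20000)("node")

// Each step allocates and copies the whole accumulated string: O(n^2) overall.
time("naive concat") {
  var s = ""
  for (p <- parts) s = s + p + "\n"
  s.length
}

// One growable buffer keeps the total work linear.
time("StringBuilder") {
  val sb = new StringBuilder
  for (p <- parts) sb.append(p).append('\n')
  sb.length
}
{code}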

> SQL execution very slow for nested query plans because of 
> DataFrame.withNewExecutionId
> --
>
> Key: SPARK-11596
> URL: https://issues.apache.org/jira/browse/SPARK-11596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cristian
>
> For nested query plans like a recursive unionAll, withExecutionId is 
> extremely slow, likely because of repeated string concatenation in 
> QueryPlan.simpleString
> Test case:
> {code}
> (1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
>   println(s"PROCESSING >>> $idx")
>   val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   val union = curr.map(_.unionAll(df)).getOrElse(df)
>   println(">>" + union.count)
>   // union.show()
>   Some(union)
> }
> {code}
> Stack trace:
> {quote}
> scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
> scala.collection.AbstractIterator.addString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
> scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
> org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
> org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
> scala.collection.immutable.List.foreach(List.scala:318)
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
> org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
> org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> {quote}






[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-09 Thread Cristian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian updated SPARK-11596:
-
Description: 
For nested query plans like a recursive unionAll, withExecutionId is extremely 
slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:

{code:scala}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  // union.show()
  Some(union)
}
{code}

Stack trace:
{quote}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
{quote}

  was:
For nested query plans like a recursive unionAll, withExecutionId is extremely 
slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:

{code}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  // union.show()
  Some(union)
}
{code}

Stack trace:
{block}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)

[jira] [Created] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-09 Thread Cristian (JIRA)
Cristian created SPARK-11596:


 Summary: SQL execution very slow for nested query plans because of 
DataFrame.withNewExecutionId
 Key: SPARK-11596
 URL: https://issues.apache.org/jira/browse/SPARK-11596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Cristian


For nested query plans like a recursive unionAll, withExecutionId is extremely 
slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:

{code}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  // union.show()
  Some(union)
}
{code}

Stack trace:
{block}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
{block}






[jira] [Updated] (SPARK-11596) SQL execution very slow for nested query plans because of DataFrame.withNewExecutionId

2015-11-09 Thread Cristian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian updated SPARK-11596:
-
Description: 
For nested query plans like a recursive unionAll, withExecutionId is extremely 
slow, likely because of repeated string concatenation in QueryPlan.simpleString

Test case:

{code}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  // union.show()
  Some(union)
}
{code}

Stack trace:
{quote}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:372)
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:369)
org.apache.spark.sql.SQLContext$QueryExecution.stringOrError(SQLContext.scala:936)
org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:949)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
{quote}

  was:
For nested query plans like a recursive unionAll, withNewExecutionId is 
extremely slow, likely because of repeated string concatenation in 
QueryPlan.simpleString.

Test case:

{code:scala}
(1 to 100).foldLeft[Option[DataFrame]](None) { (curr, idx) =>
  println(s"PROCESSING >>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  println(">>" + union.count)
  //union.show()
  Some(union)
}
{code}

Stack trace:
{quote}
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
scala.collection.AbstractIterator.addString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:364)
org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:367)
org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:168)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:401)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$generateTreeString$1.apply(TreeNode.scala:403)
scala.collection.immutable.List.foreach(List.scala:318)

[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2015-07-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634316#comment-14634316
 ] 

Cristian commented on SPARK-4849:
-

I would argue that the priority for this is higher than Minor: resolving it 
would enable many use cases where data is stored in memory and queried 
repeatedly at low latency.

For example, in Spark Streaming applications it is common to join incoming 
data with a memory-resident dataset for enrichment. If that join could be 
performed without a shuffle, it would enable important use cases that are 
currently too high-latency to implement with Spark.

It appears this is also a fairly straightforward fix, so any chance it can get 
some priority?
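
A minimal sketch of the shuffle-free join in question, at the RDD level where 
Spark already honors partitioning (the point of this issue is that the SQL 
in-memory cache does not); names and data are placeholders, and sc is the 
usual shell SparkContext:

{code}
// Editor's sketch, assuming a Spark shell (sc in scope). When both sides of a
// join share the same partitioner, join() reuses the existing layout and the
// cached side is never reshuffled.
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(32)

// Memory-resident reference data: hash-partitioned once, then cached.
val reference = sc.parallelize(Seq((1, "gold"), (2, "silver")))
  .partitionBy(partitioner)
  .cache()

// Each incoming batch, partitioned the same way, joins without a shuffle.
val incoming = sc.parallelize(Seq((1, 10.0), (2, 20.0))).partitionBy(partitioner)
val enriched = incoming.join(reference)
{code}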

> Pass partitioning information (distribute by) to In-memory caching
> --
>
> Key: SPARK-4849
> URL: https://issues.apache.org/jira/browse/SPARK-4849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Goyal
>Priority: Minor
>
> HQL "distribute by column_name" partitions data based on the specified 
> column's values. We can pass this information to in-memory caching for 
> further performance improvements; e.g. in joins, an extra partitioning step 
> can be saved based on this information.
> Refer - 
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
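
For concreteness, an editor's illustration of the layout being discussed, with 
hypothetical table and column names and assuming a HiveContext-backed 
sqlContext (DISTRIBUTE BY comes from HiveQL):

{code}
// Editor's illustration, hypothetical names. DISTRIBUTE BY lays rows out by
// customer_id before caching; this issue asks the in-memory cache to remember
// that layout so a later join on customer_id can skip its partitioning step.
val byCustomer = sqlContext.sql("SELECT * FROM orders DISTRIBUTE BY customer_id")
byCustomer.registerTempTable("orders_by_customer")
sqlContext.cacheTable("orders_by_customer")
{code}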



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8435) Cannot create tables in a specific database using a provider

2015-06-18 Thread Cristian (JIRA)
Cristian created SPARK-8435:
---

 Summary: Cannot create tables in a specific database using a 
provider
 Key: SPARK-8435
 URL: https://issues.apache.org/jira/browse/SPARK-8435
 Project: Spark
  Issue Type: Bug
 Environment: Spark SQL 1.4.0 (Spark-Shell), Hive metastore, MySQL 
Driver, Linux
Reporter: Cristian


Hello,

I've been trying to create tables in different catalogs using a Hive metastore, 
and when I execute the CREATE statement I see that the table is created in the 
default catalog instead.

This is what I'm trying:
{quote}
scala> sqlContext.sql("CREATE DATABASE IF NOT EXISTS testmetastore COMMENT 
'Testing catalogs'")
scala> sqlContext.sql("USE testmetastore")
scala> sqlContext.sql("CREATE TABLE students USING org.apache.spark.sql.parquet 
OPTIONS (path '/user/hive', highavailability 'true', DefaultLimit '1000')")
{quote}

And this is what I get. I can see that it is partially working: when it checks 
whether the table exists, it searches the correct catalog (testmetastore). But 
when it finally creates the table, it uses the default catalog.

{quote}
scala> sqlContext.sql("CREATE TABLE students USING a OPTIONS (highavailability 
'true', DefaultLimit '1000')").show
15/06/18 10:28:48 INFO HiveMetaStore: 0: get_table : db=*testmetastore* 
tbl=students
15/06/18 10:28:48 INFO audit: ugi=ccaballero ip=unknown-ip-addr  
cmd=get_table : db=testmetastore tbl=students   
15/06/18 10:28:48 INFO Persistence: Request to load fields comment,name,type 
of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is 
embedded, so ignored
15/06/18 10:28:48 INFO Persistence: Request to load fields comment,name,type 
of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is 
embedded, so ignored
15/06/18 10:28:48 INFO HiveMetaStore: 0: create_table: 
Table(tableName:students, dbName:*default*, owner:ccaballero, 
createTime:1434616128, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, 
comment:from deserializer)], location:null, 
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, 
parameters:{DefaultLimit=1000, serialization.format=1, highavailability=true}), 
bucketCols:[], sortCols:[], parameters:{}, 
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, 
spark.sql.sources.provider=a}, viewOriginalText:null, viewExpandedText:null, 
tableType:MANAGED_TABLE)
15/06/18 10:28:48 INFO audit: ugi=ccaballero ip=unknown-ip-addr  
cmd=create_table: Table(tableName:students, dbName:default, owner:ccaballero, 
createTime:1434616128, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, 
comment:from deserializer)], location:null, 
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, 
parameters:{DefaultLimit=1000, serialization.format=1, highavailability=true}), 
bucketCols:[], sortCols:[], parameters:{}, 
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, 
spark.sql.sources.provider=a}, viewOriginalText:null, viewExpandedText:null, 
tableType:MANAGED_TABLE)  
15/06/18 10:28:49 INFO SparkContext: Starting job: show at <console>:20
15/06/18 10:28:49 INFO DAGScheduler: Got job 2 (show at <console>:20) with 1 
output partitions (allowLocal=false)
15/06/18 10:28:49 INFO DAGScheduler: Final stage: ResultStage 2(show at 
<console>:20)
15/06/18 10:28:49 INFO DAGScheduler: Parents of final stage: List()
15/06/18 10:28:49 INFO DAGScheduler: Missing parents: List()
15/06/18 10:28:49 INFO DAGScheduler: Submitting ResultStage 2 
(MapPartitionsRDD[6] at show at <console>:20), which has no missing parents
15/06/18 10:28:49 INFO MemoryStore: ensureFreeSpace(1792) called with curMem=0, 
maxMem=278302556
15/06/18 10:28:49 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 1792.0 B, free 265.4 MB)
15/06/18 10:28:49 INFO MemoryStore: ensureFreeSpace(1139) called with 
curMem=1792, maxMem=278302556
15/06/18 10:28:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in 
memory (estimated size 1139.0 B, free 265.4 MB)
15/06/18 10:28:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 
localhost:59110 (size: 1139.0 B, free: 265.4 MB)
15/06/18 10:28:49 INFO SparkContext: Created broadcast 2 from broadcast at 
DAGScheduler.scala:874
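
The log above is cut off in the archive. As an editor's note, not from the 
reporter, a quick way to confirm which database actually received the table, 
using APIs available in 1.4:

{code}
// Editor's sketch: list the tables in each database and compare. Per the log,
// "students" is expected under testmetastore but actually lands under default.
sqlContext.tables("testmetastore").show()
sqlContext.tables("default").show()
{code}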

[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375243#comment-14375243
 ] 

Cristian commented on SPARK-5863:
-

I'm a bit confused. The original JIRA refers to a very specific issue that has 
a very simple solution (I think).

{code} 
r.toSeq.zip(schema.fields.map(_.dataType)) 
{code} 

Can be written as:

{code} 
r.toSeq.view.zip(schema.fields.view.map(_.dataType)) 
{code} 

Or, in fact, schema.fields.map(_.dataType) can be computed once and passed 
into the function, instead of being recalculated on every call.
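
A sketch of that "compute once" variant (editor's addition; the helper names 
are assumptions, not verified signatures):

{code}
// Editor's sketch: hoist the dataType lookup out of the per-row loop.
// convertToScala stands in for the existing per-value conversion in
// ScalaReflection (a naming assumption); import paths are the 1.3+ ones.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType

val dataTypes: Array[DataType] = schema.fields.map(_.dataType) // once per schema
def convertRowToScala(r: Row): Row =
  Row.fromSeq(r.toSeq.zip(dataTypes).map { case (v, t) => convertToScala(v, t) })
{code}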

Also, this is a clear regression, as I showed initially, and it seems to have 
a simple fix, so why not fix the 1.2 branch as well?



> Improve performance of convertToScala codepath.
> ---
>
> Key: SPARK-5863
> URL: https://issues.apache.org/jira/browse/SPARK-5863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cristian
>Priority: Critical
>
> Was doing some perf testing on reading parquet files and noticed that moving 
> from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the 
> culprit showed up as being in ScalaReflection.convertRowToScala.
> Particularly this zip is the issue:
> {code}
> r.toSeq.zip(schema.fields.map(_.dataType))
> {code}
> I see there's currently a comment noting that this is slow, but it wasn't 
> fixed. This actually produces a 3x degradation in parquet read performance, 
> at least in my test case.
> Edit: the map is part of the issue as well. This whole code block is in a 
> tight loop and allocates a new ListBuffer that needs to grow for each 
> transformation. A possible solution is to change to using seq.view, which 
> would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org