Seems you're hitting the self-join, currently Spark SQL won't cache any 
result/logical tree for further analyzing or computing for self-join. Since the 
logical tree is huge, it's reasonable to take long time in generating its tree 
string recursively. And I also doubt the computing can finish within a 
reasonable time, as there probably be lots of partitions (grows exponentially) 
of the intermediate result.

As a workaround, you can break the iterations into smaller ones and trigger 
them manually in sequence.

-----Original Message-----
From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com] 
Sent: Wednesday, June 17, 2015 6:17 PM
To: User
Subject: generateTreeString causes huge performance problems on dataframe 
persistence

Hey,
I noticed that my code spends hours with `generateTreeString` even though the 
actual dag/dataframe execution takes seconds.

I’m running a query that grows exponential in the number of iterations when 
evaluated without caching, but should be linear when caching previous results.

E.g.

    result_i+1 = distinct(join(result_i, result_i))

Which evaluates exponentially like this this without caching.

Iteration | Dataframe Plan Tree
0            |        /\
1            |     /\    /\
2            |    /\/\  /\/\
n            |    ……….

But should be linear with caching.

Iteration | Dataframe Plan Tree
0            |     /\
              |     \/
1            |     /\
              |     \/
2            |     /\
              |     \/
n            | ……….


It seems that even though the DAG will have the later form, 
`generateTreeString` will walk the entire plan naively as if no caching was 
done.

The spark webui also shows no active jobs even though my CPU uses one core 
fully, calculating that string.

Below is the piece of stacktrace that starts the entire walk.

^
|
Thousands of calls to  `generateTreeString`.
|
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(int, 
StringBuilder) TreeNode.scala:431
org.apache.spark.sql.catalyst.trees.TreeNode.treeString() TreeNode.scala:400
org.apache.spark.sql.catalyst.trees.TreeNode.toString() TreeNode.scala:397
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() 
InMemoryColumnarTableScan.scala:164
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() 
InMemoryColumnarTableScan.scala:164
scala.Option.getOrElse(Function0) Option.scala:120
org.apache.spark.sql.columnar.InMemoryRelation.buildBuffers() 
InMemoryColumnarTableScan.scala:164
org.apache.spark.sql.columnar.InMemoryRelation.<init>(Seq, boolean, int, 
StorageLevel, SparkPlan, Option, RDD, Statistics, Accumulable) 
InMemoryColumnarTableScan.scala:112
org.apache.spark.sql.columnar.InMemoryRelation$.apply(boolean, int, 
StorageLevel, SparkPlan, Option) InMemoryColumnarTableScan.scala:45
org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply() 
CacheManager.scala:102
org.apache.spark.sql.execution.CacheManager.writeLock(Function0) 
CacheManager.scala:70 
org.apache.spark.sql.execution.CacheManager.cacheQuery(DataFrame, Option, 
StorageLevel) CacheManager.scala:94
org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320 ^
|
Application logic.
|

Could someone confirm my suspicion?
And does somebody know why it’s called while caching, and why it walks the 
entire tree including cached results?

Cheers, Jan-Paul
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org

Reply via email to