Joseph K. Bradley created SPARK-13346:
-----------------------------------------

             Summary: DataFrame caching is not handled well during planning or execution
                 Key: SPARK-13346
                 URL: https://issues.apache.org/jira/browse/SPARK-13346
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Joseph K. Bradley


I have an iterative algorithm based on DataFrames, and the query plan grows very quickly with each iteration.  Caching the current DataFrame at the end of an iteration does not fix the problem, but converting the DataFrame to an RDD and back at the end of each iteration does.
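
For the record, the workaround looks roughly like this (a sketch, assuming a SparkSession named {{spark}} and the 2.0 APIs; the helper name is made up):

{code:scala}
import org.apache.spark.sql.DataFrame

// Rebuilding the DataFrame from its RDD gives it a fresh logical plan (a bare
// scan over the RDD), discarding the lineage accumulated in earlier iterations.
def truncatePlan(df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, df.schema)
{code}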

Printing the query plans shows that the plan explodes with successive iterations (from roughly 10 lines, to several hundred, to several thousand, ...).
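
(To observe this, assuming a DataFrame {{df}}, the plan can be printed or measured like so:)

{code:scala}
// Print the parsed, analyzed, optimized, and physical plans.
df.explain(true)

// Or measure the size of the analyzed plan programmatically:
val planLines = df.queryExecution.analyzed.toString.split("\n").length
println(planLines + " plan lines")
{code}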

The desired behavior is for the analyzer to recognize that a big chunk of the query plan does not need to be computed since it is already cached, so that the amount of computation per iteration stays constant.

If useful, I can push my (complex) code to reproduce the issue, but it should be easy to reproduce with any iterative algorithm that derives a new DataFrame from the previous one on each iteration; a minimal sketch follows.
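
A minimal sketch of such a loop (illustrative only; assuming a SparkSession {{spark}}, with trivial column arithmetic standing in for the real per-iteration work, so the growth here is milder than in the real algorithm):

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

var df: DataFrame = spark.range(10).toDF("x")
for (i <- 1 to 5) {
  // Derive a new DataFrame from the previous one and cache it.
  df = df.select((col("x") + 1).as("x")).cache()
  df.count()  // materialize the cache

  // Despite the cache, the printed plan keeps growing with each iteration.
  val planLines = df.queryExecution.analyzed.toString.split("\n").length
  println(s"iteration $i: $planLines plan lines")
}
{code}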


