Eron Wright created SPARK-8794:
-----------------------------------

             Summary: Column pruning isn't applied to certain transformed DataFrames
                 Key: SPARK-8794
                 URL: https://issues.apache.org/jira/browse/SPARK-8794
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Eron Wright 


I observe that certain transformations (e.g. sample) on a DataFrame cause the 
underlying relation's support for column pruning to be disregarded in 
subsequent queries.

I encountered this issue while using an ML pipeline with a typical dataset of 
(label, features). For my particular data source (which implements 
PrunedScan), the 'features' column is expensive to compute while the 'label' 
column is cheap. The first stage of the pipeline, StringIndexer, operates 
only on the label and so should be quick. Yet I found that the 'features' 
column was materialized anyway. On investigation, I found that the issue 
occurs when the dataset is split into train/test sets with sampling: the 
sampling transformation causes the pruning optimization to be lost.
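
To make the symptom concrete, here is a minimal toy model in plain Python. It
is not Spark code and does not reflect how Catalyst is implemented; the names
Scan, Sample, Project and scanned_columns are hypothetical stand-ins chosen
for illustration. It only shows how a required-column list can fail to reach
the relation when a sampling node sits between the projection and the scan.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Scan:
    """Hypothetical stand-in for a relation that supports pruned scans."""
    columns: List[str]


@dataclass
class Sample:
    """Row-level sampling: it drops rows, never columns."""
    fraction: float
    child: "Plan"


@dataclass
class Project:
    """Keeps only the named columns of its child."""
    columns: List[str]
    child: "Plan"


Plan = Union[Scan, Sample, Project]


def scanned_columns(plan: Plan,
                    required: Optional[List[str]] = None,
                    push_through_sample: bool = False) -> List[str]:
    """Return the columns the bottom Scan is asked to produce.

    `required` is None when no pruning information has reached this node,
    in which case the scan falls back to producing every column.
    """
    if isinstance(plan, Scan):
        if required is None:
            return list(plan.columns)
        return [c for c in plan.columns if c in required]
    if isinstance(plan, Project):
        needed = plan.columns if required is None else [
            c for c in plan.columns if c in required]
        return scanned_columns(plan.child, needed, push_through_sample)
    if isinstance(plan, Sample):
        # Forwarding the requirement is safe because sampling only drops rows.
        # When it is not forwarded, the pruning information is lost here.
        passed = required if push_through_sample else None
        return scanned_columns(plan.child, passed, push_through_sample)
    raise TypeError(f"unknown plan node: {plan!r}")


relation = Scan(["label", "features"])
direct = Project(["label"], relation)
sampled = Project(["label"], Sample(0.8, relation))

print(scanned_columns(direct))                             # ['label']
print(scanned_columns(sampled))                            # ['label', 'features']
print(scanned_columns(sampled, push_through_sample=True))  # ['label']
```

In this model the second call corresponds to the behavior reported here (the
expensive 'features' column is requested even though only 'label' is
selected), and the third shows the result one would expect if the projection
were carried through the sampling step.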

See this gist for a sample program demonstrating the issue:
[https://gist.github.com/EronWright/cb5fb9af46fd810194f8]




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
