Eron Wright created SPARK-8794:
-----------------------------------

             Summary: Column pruning isn't applied to certain transformed DataFrames
                 Key: SPARK-8794
                 URL: https://issues.apache.org/jira/browse/SPARK-8794
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Eron Wright

Certain DataFrame transformations (e.g. sample) cause the underlying relation's support for column pruning to be disregarded in subsequent queries.

I encountered this issue while using an ML pipeline with a typical dataset of (label, features). For my particular data source (which implements PrunedScan), the 'features' column is expensive to compute while the 'label' column is cheap. The first stage of the pipeline, StringIndexer, operates only on the label and so should be quick. Yet I found that the 'features' column was materialized anyway.

Upon investigation, the issue occurs when the dataset is split into train/test sets by sampling: the sampling transformation causes the pruning optimization to be lost.

See this gist for a sample program demonstrating the issue:
[https://gist.github.com/EronWright/cb5fb9af46fd810194f8]
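
For readers without a Spark setup, the symptom can be modelled with a small toy in plain Python. This is only a sketch under stated assumptions: it is not Spark or Catalyst code, it does not reproduce the linked gist, and it makes no claim about the actual cause in 1.4.0. All names here (ToyPrunedRelation, Scan, Sample, Project, execute, push_through_sample) are hypothetical. The relation records which columns each scan asks for, so it is easy to see whether a projection on 'label' reaches the source when a sample step sits in between.

```python
import random
from dataclasses import dataclass
from typing import List, Union

COLUMNS = ["label", "features"]


class ToyPrunedRelation:
    """Hypothetical stand-in for a PrunedScan-style source.

    It records the columns requested by every scan so we can inspect them.
    """

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.scans = []  # one entry per scan: the list of requested columns

    def build_scan(self, required_columns):
        self.scans.append(list(required_columns))
        rows = []
        for i in range(self.n_rows):
            row = {}
            if "label" in required_columns:
                row["label"] = i % 2  # cheap column
            if "features" in required_columns:
                row["features"] = [float(i)] * 3  # stands in for the expensive column
            rows.append(row)
        return rows


@dataclass
class Scan:
    relation: ToyPrunedRelation


@dataclass
class Sample:
    fraction: float
    seed: int
    child: "Plan"


@dataclass
class Project:
    columns: List[str]
    child: "Plan"


Plan = Union[Scan, Sample, Project]


def execute(plan, required=None, push_through_sample=False):
    """Run a toy plan; `required` lists the columns the parent needs (None = all)."""
    if isinstance(plan, Scan):
        cols = COLUMNS if required is None else [c for c in COLUMNS if c in required]
        return plan.relation.build_scan(cols)
    if isinstance(plan, Sample):
        # Modelled symptom: unless the requirement is forwarded through the
        # sample step, the child is asked for every column.
        child_required = required if push_through_sample else None
        rows = execute(plan.child, child_required, push_through_sample)
        rng = random.Random(plan.seed)
        return [r for r in rows if rng.random() < plan.fraction]
    if isinstance(plan, Project):
        rows = execute(plan.child, plan.columns, push_through_sample)
        return [{c: r[c] for c in plan.columns} for r in rows]
    raise TypeError("unknown plan node: %r" % (plan,))


relation = ToyPrunedRelation(n_rows=10)
query = Project(["label"], Sample(0.5, 42, Scan(relation)))

execute(query)                            # reported behaviour: pruning lost
execute(query, push_through_sample=True)  # expected behaviour: only 'label' scanned
print(relation.scans)  # [['label', 'features'], ['label']]
```

In the first run the source is asked for both columns even though the query only selects 'label'; in the second, the requirement passes through the sample step and only 'label' is requested. Both runs return the same sampled labels, so the difference is purely in how much the source has to compute.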