[jira] [Commented] (SPARK-3462) parquet pushdown for unionAll
[ https://issues.apache.org/jira/browse/SPARK-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128451#comment-14128451 ]

Cody Koeninger commented on SPARK-3462:
---------------------------------------

Created a PR for feedback.

https://github.com/apache/spark/pull/2345

Seems to do the right thing locally, will see about testing on a cluster.

parquet pushdown for unionAll
-----------------------------

Key: SPARK-3462
URL: https://issues.apache.org/jira/browse/SPARK-3462
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.1.0
Reporter: Cody Koeninger

http://apache-spark-developers-list.1001551.n3.nabble.com/parquet-predicate-projection-pushdown-into-unionAll-td8339.html

// single table, pushdown
scala> p.where('age 40).select('name)
res36: org.apache.spark.sql.SchemaRDD =
SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 40)]

// union of 2 tables, no pushdown
scala> b.where('age 40).select('name)
res37: org.apache.spark.sql.SchemaRDD =
SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation /var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
  ,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation /var/tmp/people2, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
  ]

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3462) parquet pushdown for unionAll
[ https://issues.apache.org/jira/browse/SPARK-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128478#comment-14128478 ]

Apache Spark commented on SPARK-3462:
-------------------------------------

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/2345
[jira] [Commented] (SPARK-3462) parquet pushdown for unionAll
[ https://issues.apache.org/jira/browse/SPARK-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128682#comment-14128682 ]

Cody Koeninger commented on SPARK-3462:
---------------------------------------

Tested this on a cluster against unions of 2 and 3 parquet tables, around 2 billion records. Seems like a big performance win: previously, simple queries (e.g. count, approx distinct count of a single column) against a union of 2 tables were taking 5 to 10x as long as against a single table. Now it's closer to linear, e.g. 35 secs for one table, 74 for a union of 2, etc.
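
For readers following the thread, the kind of rewrite being asked for can be sketched on a toy plan tree. This is only an illustration under stated assumptions, not the code in the pull request: the Plan, Scan, Filter, Project and Union classes and the pushDown function below are hypothetical stand-ins rather than Spark's Catalyst classes, columns are matched by plain name (the real plans use per-relation attribute ids such as name#3 and name#0, which would need remapping), and the predicate is kept as an opaque string.

```scala
// Toy model only: hypothetical stand-ins, NOT Spark's Catalyst plan classes.
sealed trait Plan
case class Scan(table: String, columns: Seq[String], pushedFilters: Seq[String] = Nil) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Union(children: Seq[Plan]) extends Plan

object UnionPushdownSketch {
  // Rewrites children first, then moves a Filter or Project sitting on top of
  // a Union into every branch of that Union. A Filter that ends up directly
  // above a Scan is folded into the scan's pushedFilters.
  def pushDown(plan: Plan): Plan = plan match {
    case Filter(cond, child) =>
      pushDown(child) match {
        case Union(children) => Union(children.map(c => pushDown(Filter(cond, c))))
        case scan: Scan      => scan.copy(pushedFilters = scan.pushedFilters :+ cond)
        case other           => Filter(cond, other)
      }
    case Project(cols, child) =>
      pushDown(child) match {
        case Union(children) => Union(children.map(c => pushDown(Project(cols, c))))
        case other           => Project(cols, other)
      }
    case Union(children) => Union(children.map(pushDown))
    case scan: Scan      => scan
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("name", "age", "phones")
    // Same shape as the "union of 2 tables" plan: Project over Filter over Union.
    // "age-predicate" is an opaque placeholder for the filter condition.
    val before = Project(Seq("name"),
      Filter("age-predicate",
        Union(Seq(Scan("/var/tmp/people", cols), Scan("/var/tmp/people2", cols)))))
    println(pushDown(before))
  }
}
```

In this sketch the Union ends up at the root, and each branch becomes a Project over a Scan that carries the predicate, which mirrors the single-table plan shape quoted in the issue description.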