Shuai Lin created SPARK-17489:
---------------------------------

             Summary: Improve filtering for bucketed tables
                 Key: SPARK-17489
                 URL: https://issues.apache.org/jira/browse/SPARK-17489
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Shuai Lin
The datasource API allows creating bucketed tables. Can we optimize query planning when there is a filter on the bucketed column? For example:

{code}
select * from bucketed_table where bucketed_col = "foo"
{code}

Given the above query, Spark should only load the bucket files corresponding to the bucket of the value "foo". But the current implementation loads all the files. Here is a small program to demonstrate:

{code}
# bin/spark-shell --master="local[2]"
case class Foo(name: String, age: Int)
spark.createDataFrame(Seq(
    Foo("aaa", 1),
    Foo("aaa", 2),
    Foo("bbb", 3),
    Foo("bbb", 4)))
  .write
  .format("json")
  .mode("overwrite")
  .bucketBy(2, "name")
  .saveAsTable("foo")
spark.sql("select * from foo where name = 'aaa'").show()
{code}

Then use sysdig to capture the file read events:

{code}
$ sudo sysdig -A -p "*%evt.time %evt.buffer" "fd.name contains spark-warehouse" and "evt.buffer contains bbb"
05:36:59.430426611 {\"name\":\"bbb\",\"age\":3}
{\"name\":\"bbb\",\"age\":4}
{code}

Sysdig shows that bucket files which obviously don't match the filter (name = "aaa") are also read by Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
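The pruning this issue asks for can be sketched in plain Scala. Note this is only an illustration of the mechanics, not Spark's actual implementation: real Spark bucketing hashes with Murmur3, whereas this sketch uses `String.hashCode` as a stand-in, and the file-name layout and the `BucketPruningSketch` object are hypothetical.

{code}
// Illustrative sketch of bucket pruning for an equality filter.
// Assumptions (not from Spark): String.hashCode as the bucket hash
// (real Spark uses Murmur3), and bucket ids embedded in file names
// as a five-digit "_NNNNN." suffix.
object BucketPruningSketch {

  // Stand-in for Spark's bucket hash: non-negative modulo of hashCode.
  def bucketIdFor(value: String, numBuckets: Int): Int = {
    val h = value.hashCode
    ((h % numBuckets) + numBuckets) % numBuckets
  }

  // Pull the bucket id out of a file name like "part-00000-uuid_00001.json".
  def extractBucketId(fileName: String): Int =
    "_(\\d{5})\\.".r.findFirstMatchIn(fileName)
      .map(_.group(1).toInt)
      .getOrElse(-1)

  // Given all bucket files and an equality-filter value, keep only the
  // files belonging to the bucket that value hashes into.
  def pruneFiles(files: Seq[String], filterValue: String, numBuckets: Int): Seq[String] = {
    val wanted = bucketIdFor(filterValue, numBuckets)
    files.filter(f => extractBucketId(f) == wanted)
  }

  def main(args: Array[String]): Unit = {
    val files = Seq("part-00000-uuid_00000.json", "part-00000-uuid_00001.json")
    // Only one of the two bucket files should survive the filter on "aaa".
    println(pruneFiles(files, "aaa", numBuckets = 2))
  }
}
{code}

With pruning like this in the planner, the query in the report would read one bucket file instead of two, and sysdig would no longer show reads of the "bbb" bucket.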