Shuai Lin created SPARK-17489:
---------------------------------

             Summary: Improve filtering for bucketed tables
                 Key: SPARK-17489
                 URL: https://issues.apache.org/jira/browse/SPARK-17489
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Shuai Lin


The datasource API allows creating bucketed tables. Can we optimize query 
planning when there is a filter on the bucketed column?

For example:

{code}
select * from bucketed_table where bucketed_col = "foo"
{code}

Given the above query, spark should only load the bucket files corresponding to 
the bucket for the value "foo".

But the current implementation loads all the files. Here is a small program 
to demonstrate:

{code}
// launched with: bin/spark-shell --master="local[2]"

case class Foo(name: String, age: Int)
spark.createDataFrame(Seq(
  Foo("aaa", 1),
  Foo("aaa", 2), 
  Foo("bbb", 3), 
  Foo("bbb", 4)))
  .write
  .format("json")
  .mode("overwrite")
  .bucketBy(2, "name")
  .saveAsTable("foo")

spark.sql("select * from foo where name = 'aaa'").show()

{code}

Then use sysdig to capture the file read events:

{code}
$ sudo sysdig -A -p "*%evt.time %evt.buffer" \
    "fd.name contains spark-warehouse and evt.buffer contains bbb"

05:36:59.430426611 
{"name":"bbb","age":3}
{"name":"bbb","age":4}
{code}

Sysdig shows that bucket files which obviously don't match the filter (name = 
"aaa") are also read by spark.
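A fix could resolve the filter literal to a bucket id at planning time and keep only the matching files. A minimal sketch of that selection step (the hash function and the file-to-bucket mapping here are illustrative only; Spark actually uses a Murmur3-based hash over the row, not `String.hashCode`):

```scala
object BucketPruningSketch {
  // Illustrative bucket assignment: hash the column value and take it
  // modulo the bucket count. (Spark's real implementation hashes with
  // Murmur3; this stand-in only demonstrates the pruning idea.)
  def bucketId(value: String, numBuckets: Int): Int =
    Math.floorMod(value.hashCode, numBuckets)

  // Given (bucketId, path) pairs for the table's files, keep only the
  // files belonging to the bucket that can contain the filter value.
  def prune(files: Seq[(Int, String)],
            filterValue: String,
            numBuckets: Int): Seq[String] = {
    val wanted = bucketId(filterValue, numBuckets)
    files.collect { case (id, path) if id == wanted => path }
  }
}
```

With this in place, an equality filter on the bucketed column would scan only one of the N bucket files instead of all of them.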



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
