[ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Chawade updated SPARK-13219:
------------------------------------
    Description: 
When 2 or more tables are joined in SparkSQL and there is an equality clause in 
query on attributes used to perform the join, it is useful to apply that clause 
on scans for both table. If this is not done, one of the tables results in full 
scan which can reduce the query dramatically. Consider following example with 2 
tables being joined.

{code}
CREATE TABLE assets (
    assetid int PRIMARY KEY,
    address text,
    propertyname text
)
CREATE TABLE tenants (
    assetid int PRIMARY KEY,
    name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, 
tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, 
Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.

  was:
When 2 or more tables are joined in SparkSQL and there is an equality clause in 
query on attributes used to perform the join, it is useful to apply that clause 
on scans for both table. If this is not done, one of the tables results in full 
scan which can reduce the query dramatically. Consider following example with 2 
tables being joined.

{code}
CREATE TABLE assets (
    assetid int PRIMARY KEY,
    address text,
    propertyname text
)
CREATE TABLE element22082.tenants (
    assetid int PRIMARY KEY,
    name text
)
spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
t.assetid and t.assetid='1201';
WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
== Physical Plan ==
Project [name#14]
 ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
  Exchange (HashPartitioning 200)
   Filter (CAST(assetid#13, DoubleType) = 1201.0)
    HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, 
tenants, Some(t)), None
  Exchange (HashPartitioning 200)
   HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, 
Some(a)), None
Time taken: 1.354 seconds, Fetched 8 row(s)
{code}

The simple workaround is to add another equality condition for each table but 
it becomes cumbersome. It will be helpful if the query planner could improve 
filter propagation.


> Pushdown predicate propagation in SparkSQL with join
> ----------------------------------------------------
>
>                 Key: SPARK-13219
>                 URL: https://issues.apache.org/jira/browse/SPARK-13219
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.1, 1.6.0
>         Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>            Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause 
> in query on attributes used to perform the join, it is useful to apply that 
> clause on scans for both table. If this is not done, one of the tables 
> results in full scan which can reduce the query dramatically. Consider 
> following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
>     assetid int PRIMARY KEY,
>     address text,
>     propertyname text
> )
> CREATE TABLE tenants (
>     assetid int PRIMARY KEY,
>     name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>    Filter (CAST(assetid#13, DoubleType) = 1201.0)
>     HiveTableScan [assetid#13,name#14], (MetastoreRelation element22082, 
> tenants, Some(t)), None
>   Exchange (HashPartitioning 200)
>    HiveTableScan [assetid#15], (MetastoreRelation element22082, assets, 
> Some(a)), None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It will be helpful if the query planner could improve 
> filter propagation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to