[ https://issues.apache.org/jira/browse/SPARK-23012 ]
Rick Kramer resolved SPARK-23012.
---------------------------------
    Resolution: Fixed

> Support for predicate pushdown and partition pruning when left joining large
> Hive tables
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-23012
>                 URL: https://issues.apache.org/jira/browse/SPARK-23012
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Rick Kramer
>            Priority: Major
>             Fix For: 2.4.0
>
> We have a Hive view that left outer joins several large, partitioned ORC
> Hive tables on date. When the view is used in a Hive query, Hive pushes the
> date predicate down into the joins and prunes the partitions of all tables.
> When the same view is used from pyspark, the predicate only prunes the
> left-most table, and all partitions of the additional tables are scanned.
>
> For example, consider two partitioned Hive tables a & b joined in a view:
>
>     create table a (
>       a_val string
>     )
>     partitioned by (ds string)
>     stored as orc;
>
>     create table b (
>       b_val string
>     )
>     partitioned by (ds string)
>     stored as orc;
>
>     -- ds is qualified as a.ds below; the bare column name would be
>     -- ambiguous, since both tables carry the ds partition column.
>     create view example_view as
>     select
>       a_val
>     , b_val
>     , a.ds
>     from a
>     left outer join b on b.ds = a.ds;
>
> Then in pyspark you might query the view, filtering on ds:
>
>     from pyspark.sql import functions as F
>     spark.table('example_view').filter(F.col('ds') == '2018-01-01')
>
> If tables a and b are large, this results in a plan that filters a on
> ds = '2018-01-01' but scans all partitions of table b.
>
> If the join in the view is changed to an inner join, the predicate gets
> pushed down to both a & b and the partitions are pruned as you'd expect.
>
> In practice, the view is fairly complex and contains a lot of business
> logic we'd prefer not to replicate in pyspark if we can avoid it.
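For anyone on an affected 2.2.x build before the fix in 2.4.0, one workaround is to apply the partition predicate to each base table before joining, rather than filtering the view after the fact. The sketch below is a minimal, hypothetical illustration, not the fix that resolved this issue: it assumes the tables a and b from the example above exist in the Hive metastore and that a Hive-enabled SparkSession is available.

    # Workaround sketch (hypothetical, for affected 2.2.x builds): filter each
    # partitioned table on its own partition column before the join, so Spark
    # can prune partitions on both sides despite the left outer join.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    ds = '2018-01-01'
    a = spark.table('a').filter(F.col('ds') == ds)
    b = spark.table('b').filter(F.col('ds') == ds)

    joined = (a.join(b, a['ds'] == b['ds'], 'left_outer')
               .select(a['a_val'], b['b_val'], a['ds']))

    # Each HiveTableScan in the physical plan should now report a single
    # matching partition instead of a full scan of table b.
    joined.explain(True)

The trade-off is exactly the one the reporter mentions: the filter-then-join has to be replicated in pyspark for each table, which is why pushing the predicate through the view automatically (as the 2.4.0 fix does) is preferable for complex views.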