[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833558#comment-16833558 ]
Song Jun commented on SPARK-27227: ---------------------------------- [~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? thanks very much! > Spark Runtime Filter > -------------------- > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Song Jun > Priority: Major > > When we equi-join one big table with a smaller table, we can collect some > statistics from the smaller table side, and use it to the scan of big table > to do partition prune or data filter before execute the join. > This can significantly improve SQL perfermance. > For a simple example: > select * from A, B where A.a = B.b > A is big table ,B is small table. > There are two scenarios: > 1. A.a is a partition column of table A > we can collect all the values of B.b, and send it to table A to do > partition prune on A.a. > 2. A.a is not a partition column of table A > we can collect real-time some statistics(such as min/max/bloomfilter) of > B.b by execute extra sql(select max(b),min(b),bbf(b) from B), and send it to > table A to do filter on A.a. > Addititionaly, if a more complex query select * from A join (select * from > B where B.c = 1) X on A.a = B.b, then we collect real-time statistics(such as > min/max/bloomfilter) of X by execute extra sql(select max(b),min(b),bbf(b) > from X) > Above two scenarios, we can filter out lots of data by partition prune or > data filter, thus we can imporve perfermance. > 10TB TPC-DS gain about 35% improvement in our test. > I will submit a SPIP later. > SPIP: > https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org