liyunzhang_intel created HIVE-16046:
---------------------------------------

             Summary: Broadcasting small table for Hive on Spark
                 Key: HIVE-16046
                 URL: https://issues.apache.org/jira/browse/HIVE-16046
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel


Currently the Spark plan is:
{code}
1. TS(Small table) -> Sel/Fil -> HashTableSink

2. TS(Small table) -> Sel/Fil -> HashTableSink

3.                     HashTableDummy --
                                       |
                       HashTableDummy --
                                       |
   RootTS(Big table) -> Sel/Fil -> MapJoin -> Sel/Fil -> FileSink
{code}
1. Run the small-table SparkWorks on the Spark cluster; each dumps its hash table to a file.
2. Run the SparkWork for the big table on the Spark cluster. Mappers look up the small-table hash tables from those files using HashTableDummy's loader.
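The two steps above can be reduced to their essence in a minimal sketch (plain Java, no Spark or Hive APIs; table contents and the key/value layout are made up for illustration): step 1 builds a hash table from the small table, step 2 has each big-table mapper probe it row by row.

```java
import java.util.*;

public class MapJoinSketch {
    // Step 1: the small-table work builds a hash table (join key -> row payload).
    // In Hive on Spark this result is what HashTableSink dumps to a file.
    static Map<String, String> buildHashTable(List<String[]> smallTable) {
        Map<String, String> ht = new HashMap<>();
        for (String[] row : smallTable) {
            ht.put(row[0], row[1]);          // row[0] = join key, row[1] = value
        }
        return ht;
    }

    // Step 2: a big-table mapper streams its rows and looks up each key.
    static List<String> mapJoin(List<String[]> bigTable, Map<String, String> ht) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            String match = ht.get(row[0]);
            if (match != null) {             // inner-join semantics
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> ht = buildHashTable(Arrays.asList(
                new String[]{"1", "us"}, new String[]{"2", "uk"}));
        List<String> joined = mapJoin(Arrays.asList(
                new String[]{"1", "a"}, new String[]{"3", "b"}), ht);
        System.out.println(joined);  // [1,a,us]
    }
}
```

The cost being discussed in this issue is not the probe itself but getting the hash table file from step 1 onto every mapper in step 2.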

The disadvantage of the current implementation is that it takes a long time to distribute the hash table via the distributed cache when the hash table is large. The proposal here is to use sparkContext.broadcast() to distribute the small table instead, although this keeps the broadcast variable on the driver and causes some performance decline on the driver.
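A minimal sketch of the semantics the broadcast variant would have (plain Java, runnable standalone; the nested Broadcast class and its value() method only mirror the shape of Spark's Broadcast API, they are not Spark code, and the table data is invented):

```java
import java.util.*;

public class BroadcastSketch {
    // Stand-in for Spark's Broadcast<T>: built once on the driver,
    // read-only on the executors that receive a copy.
    static final class Broadcast<T> {
        private final T value;
        Broadcast(T value) { this.value = value; }
        T value() { return value; }
    }

    static List<String> run() {
        // Driver side: materialize the small table and broadcast it,
        // instead of dumping a hash table file to the distributed cache.
        Map<String, String> smallTable = new HashMap<>();
        smallTable.put("1", "us");
        smallTable.put("2", "uk");
        Broadcast<Map<String, String>> bv = new Broadcast<>(smallTable);

        // Executor side: each mapper probes bv.value() directly,
        // so no HashTableDummy loader or file lookup is needed.
        List<String> out = new ArrayList<>();
        for (String[] row : new String[][]{{"1", "a"}, {"3", "b"}}) {
            String match = bv.value().get(row[0]);
            if (match != null) out.add(row[0] + "," + row[1] + "," + match);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [1,a,us]
    }
}
```

The trade-off named above shows up in where the value lives: the driver must hold (and serve) the whole small-table map, in exchange for executors skipping the file-based distribution step.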
[~Fred], [~xuefuz], [~lirui] and [~csun], please share your suggestions on this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)