liyunzhang_intel created HIVE-16046:
---------------------------------------

             Summary: Broadcasting small table for Hive on Spark
                 Key: HIVE-16046
                 URL: https://issues.apache.org/jira/browse/HIVE-16046
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel


Currently the Spark plan is:
{code}
1. TS(Small table) -> Sel/Fil -> HashTableSink

2. TS(Small table) -> Sel/Fil -> HashTableSink

3.                     HashTableDummy --
                                       |
                       HashTableDummy --
                                       |
   RootTS(Big table) -> Sel/Fil -> MapJoin -> Sel/Fil -> FileSink
{code}
1. Run the small-table SparkWorks on the Spark cluster; each dumps its hash table to a file.
2. Run the SparkWork for the big table on the Spark cluster. Mappers look up the small-table hash tables from those files using HashTableDummy's loader.
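The two steps above can be reduced to their essence in a minimal sketch (plain Java, no Spark or Hive APIs; table contents and the key/value layout are made up for illustration): step 1 builds a hash table from the small table, step 2 has each big-table mapper probe it row by row.

```java
import java.util.*;

public class MapJoinSketch {
    // Step 1: the small-table work builds a hash table (join key -> row payload).
    // In Hive on Spark this result is what HashTableSink dumps to a file.
    static Map<String, String> buildHashTable(List<String[]> smallTable) {
        Map<String, String> ht = new HashMap<>();
        for (String[] row : smallTable) {
            ht.put(row[0], row[1]);          // row[0] = join key, row[1] = value
        }
        return ht;
    }

    // Step 2: a big-table mapper streams its rows and looks up each key.
    static List<String> mapJoin(List<String[]> bigTable, Map<String, String> ht) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            String match = ht.get(row[0]);
            if (match != null) {             // inner-join semantics
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> ht = buildHashTable(Arrays.asList(
                new String[]{"1", "us"}, new String[]{"2", "uk"}));
        List<String> joined = mapJoin(Arrays.asList(
                new String[]{"1", "a"}, new String[]{"3", "b"}), ht);
        System.out.println(joined);  // [1,a,us]
    }
}
```

The cost being discussed in this issue is not the probe itself but getting the hash table file from step 1 onto every mapper in step 2.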

The disadvantage of the current implementation is that it takes a long time to distribute the hash table via the distributed cache when the hash table is large. The proposal here is to use sparkContext.broadcast() to distribute the small table instead, although this keeps the broadcast variable on the driver and causes some performance decline on the driver.
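A minimal sketch of the semantics the broadcast variant would have (plain Java, runnable standalone; the nested Broadcast class and its value() method only mirror the shape of Spark's Broadcast API, they are not Spark code, and the table data is invented):

```java
import java.util.*;

public class BroadcastSketch {
    // Stand-in for Spark's Broadcast<T>: built once on the driver,
    // read-only on the executors that receive a copy.
    static final class Broadcast<T> {
        private final T value;
        Broadcast(T value) { this.value = value; }
        T value() { return value; }
    }

    static List<String> run() {
        // Driver side: materialize the small table and broadcast it,
        // instead of dumping a hash table file to the distributed cache.
        Map<String, String> smallTable = new HashMap<>();
        smallTable.put("1", "us");
        smallTable.put("2", "uk");
        Broadcast<Map<String, String>> bv = new Broadcast<>(smallTable);

        // Executor side: each mapper probes bv.value() directly,
        // so no HashTableDummy loader or file lookup is needed.
        List<String> out = new ArrayList<>();
        for (String[] row : new String[][]{{"1", "a"}, {"3", "b"}}) {
            String match = bv.value().get(row[0]);
            if (match != null) out.add(row[0] + "," + row[1] + "," + match);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [1,a,us]
    }
}
```

The trade-off named above shows up in where the value lives: the driver must hold (and serve) the whole small-table map, in exchange for executors skipping the file-based distribution step.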
[~Fred], [~xuefuz], [~lirui] and [~csun], please share your suggestions on this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)