liyunzhang_intel created HIVE-16046:
---------------------------------------
Summary: Broadcasting small table for Hive on Spark
Key: HIVE-16046
URL: https://issues.apache.org/jira/browse/HIVE-16046
Project: Hive
Issue Type: Bug
Reporter: liyunzhang_intel
Currently the Spark plan is:
{code}
1. TS(Small table) -> Sel/Fil -> HashTableSink
2. TS(Small table) -> Sel/Fil -> HashTableSink
3. HashTableDummy ----
                      |
   HashTableDummy ----
                      |
   RootTS(Big table) -> Sel/Fil -> MapJoin -> Sel/Fil -> FileSink
{code}
1. Run the small-table SparkWorks on the Spark cluster; each dumps its hash
table to a hashmap file.
2. Run the SparkWork for the big table on the Spark cluster. Mappers
look up the small-table hash maps from the files using HashTableDummy's
loader.
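The two steps above can be sketched as follows. This is a minimal, illustrative simulation (not Hive code): the hash table, file path, and mapper function are all hypothetical stand-ins for HashTableSink, the hashmap file, and HashTableDummy's loader.

```python
import os
import pickle
import tempfile

# Step 1: the small-table SparkWork builds a hash table and dumps it
# to a hashmap file (stands in for TS -> Sel/Fil -> HashTableSink).
small_table = {1: "a", 2: "b", 3: "c"}
fd, hashmap_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    pickle.dump(small_table, f)

# Step 2: each mapper of the big-table SparkWork re-loads the hashmap
# from the file and probes it row by row (stands in for the
# HashTableDummy loader feeding MapJoin).
def mapper(big_table_rows, path):
    with open(path, "rb") as f:
        hash_table = pickle.load(f)  # one load per mapper
    return [(k, v, hash_table[k]) for k, v in big_table_rows if k in hash_table]

big_table = [(1, "x"), (2, "y"), (4, "z")]
joined = mapper(big_table, hashmap_path)
print(joined)  # [(1, 'x', 'a'), (2, 'y', 'b')]
os.remove(hashmap_path)
```

Note that every mapper pays the file-load cost, which is the overhead the proposal below targets.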
The disadvantage of the current implementation is that distributing the hash
table via the distributed cache takes a long time when the hash table is
large. Here we want to use sparkContext.broadcast() to ship the small table
instead, although this keeps the broadcast variable on the driver and causes
some performance decline there.
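To illustrate the intended semantics, here is a tiny simulation of a broadcast variable; the Broadcast class below is a hypothetical stand-in for Spark's broadcast mechanism, not the real API, assuming tasks read the shared table through a single .value reference rather than re-loading a file:

```python
# Stand-in for sparkContext.broadcast(): the driver wraps the small
# table once; every task reads it via .value, so there is no
# per-mapper hashmap-file load.
class Broadcast:
    def __init__(self, value):
        self._value = value  # held on the driver, shipped once per executor

    @property
    def value(self):
        return self._value

small_table_bc = Broadcast({1: "a", 2: "b", 3: "c"})

def mapper(big_table_rows, bc):
    hash_table = bc.value  # shared lookup, no file I/O
    return [(k, v, hash_table[k]) for k, v in big_table_rows if k in hash_table]

joined = mapper([(1, "x"), (2, "y"), (4, "z")], small_table_bc)
print(joined)  # [(1, 'x', 'a'), (2, 'y', 'b')]
```

The trade-off mentioned above shows up here: the driver must hold the wrapped table for the lifetime of the job.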
[~Fred], [~xuefuz], [~lirui] and [~csun], please give some suggestions on it.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)