liyunzhang_intel created HIVE-16046:
------------------------------------

             Summary: Broadcasting small table for Hive on Spark
                 Key: HIVE-16046
                 URL: https://issues.apache.org/jira/browse/HIVE-16046
             Project: Hive
          Issue Type: Bug
            Reporter: liyunzhang_intel

Currently the Spark plan is:
{code}
1. TS(Small table) -> Sel/Fil -> HashTableSink
2. TS(Small table) -> Sel/Fil -> HashTableSink
3. HashTableDummy --|
   HashTableDummy --|
   RootTS(Big table) -> Sel/Fil -> MapJoin -> Sel/Fil -> FileSink
{code}
1. Run the small-table SparkWorks on the Spark cluster, which dump their hash tables to files.
2. Run the SparkWork for the big table on the Spark cluster. Mappers look up the small-table hash table from the file using HashTableDummy's loader.

The disadvantage of the current implementation is that distributing the hash table through the cache takes a long time when the hash table is large. The proposal here is to use sparkContext.broadcast() to hold the small table instead, even though this keeps the broadcast variable on the driver and causes some performance decline there.

[~Fred], [~xuefuz], [~lirui] and [~csun], please give some suggestions on it.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
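To make the proposed change concrete, here is a minimal, self-contained sketch of the map-join mechanics it would preserve: the small table is hashed once (on the driver, where sparkContext.broadcast() would ship it to executors), and every mapper probes that shared hash table instead of reloading it from a distributed-cache file. This is plain Python with illustrative names (build_hash_table, map_join, key_fn), not the actual Hive on Spark code:

```python
# Conceptual sketch of a broadcast-style map join. In the proposed design,
# the dict returned by build_hash_table would be wrapped in
# sparkContext.broadcast() rather than dumped to a hashmap file.

def build_hash_table(small_table, key_fn):
    """Build the small-table hash table once (driver side)."""
    table = {}
    for row in small_table:
        table.setdefault(key_fn(row), []).append(row)
    return table

def map_join(big_table_partition, hash_table, key_fn):
    """Each mapper probes the shared hash table instead of loading a file."""
    for row in big_table_partition:
        for match in hash_table.get(key_fn(row), []):
            yield row + match

# Usage: join on the first column of each row.
small = [(1, "a"), (2, "b")]
big = [(1, "x"), (1, "y"), (3, "z")]
ht = build_hash_table(small, key_fn=lambda r: r[0])
joined = list(map_join(big, ht, key_fn=lambda r: r[0]))
# joined == [(1, "x", 1, "a"), (1, "y", 1, "a")]
```

The trade-off discussed in the issue is visible here: the hash table lives in one place (the driver) and is shared, rather than being serialized to files that every mapper must fetch and deserialize.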