[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185944#comment-14185944 ]
Szehon Ho commented on HIVE-8621:
---------------------------------

Hi Suhas, thanks for creating the JIRA.

I think we should actually have m x n variables (m = numSmallTables, n = numBuckets). If you read the code of MapJoinOperator, its processing logic, as I understand it, keeps them in separate data structures. It would be better if we could re-use that operator.

We can still do everything else as planned (m MapTasks that are union'ed), but during the collection phase it should be easy for us to make one variable per table by checking the alias tag. This is the same logic MapReduce uses to divide the data per table (see HashTableSinkOperator.flushToFile()). Let me know if that makes sense.

The trickier part is the bucket join: how to get one variable per bucket after the results are collected. More research is needed there.

> Aggregate all small table join data into 1 broadcast variable
> -------------------------------------------------------------
>
>                 Key: HIVE-8621
>                 URL: https://issues.apache.org/jira/browse/HIVE-8621
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
>
> This is a sub-task of map-join for spark
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
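
For illustration only, here is a minimal sketch of the "one variable per table" idea from the comment: rows collected from the union'ed map tasks are split by alias tag, so each small table ends up in its own structure that could then be broadcast separately. TaggedRow and groupByAlias are hypothetical names invented for this sketch. They are not Hive classes, and this is not how MapJoinOperator or HashTableSinkOperator are actually written. The per-bucket dimension is deliberately left out, since the comment notes it still needs research.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hive code: split collected small-table rows
// into one hash table per alias tag.
final class PerTableCollector {

    /** Hypothetical record: a small-table row plus the alias tag of its table. */
    static final class TaggedRow {
        final byte aliasTag;
        final String joinKey;
        final String value;

        TaggedRow(byte aliasTag, String joinKey, String value) {
            this.aliasTag = aliasTag;
            this.joinKey = joinKey;
            this.value = value;
        }
    }

    /**
     * Groups rows first by alias tag (one entry per small table), then by
     * join key. Each inner map is a candidate for its own broadcast variable.
     */
    static Map<Byte, Map<String, List<String>>> groupByAlias(List<TaggedRow> collected) {
        Map<Byte, Map<String, List<String>>> perTable = new HashMap<>();
        for (TaggedRow row : collected) {
            perTable
                .computeIfAbsent(row.aliasTag, tag -> new HashMap<>())
                .computeIfAbsent(row.joinKey, key -> new ArrayList<>())
                .add(row.value);
        }
        return perTable;
    }
}
```

With m small tables, the outer map has m entries. Keeping them separate, rather than merging everything into a single structure, is what would let the existing per-table processing logic be re-used.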