[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185944#comment-14185944 ]
Szehon Ho commented on HIVE-8621:
---------------------------------
Hi Suhas, thanks for creating the JIRA. I think that we should actually have
m x n variables (m = numSmallTables, n = numBuckets). If you read the code of
MapJoinOperator, its processing logic, as I understand it, keeps them in
separate data structures. It would be better if we can re-use that operator.
We can still do everything else as planned (m MapTasks that are union'ed), but
during the collection phase it should be easy for us to make one variable per
table by checking the alias tag. It is the same logic that MapReduce uses to
divide the data per table (see HashTableSinkOperator.flushToFile()). Let me
know if that makes sense.
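To illustrate the per-table split, here is a rough sketch in plain Java. It is
not Hive or Spark code: TaggedRow, groupByTag and PerTableGrouping are made-up
names, and the String keys/values only stand in for whatever the real hash
table containers hold. The idea is just that the unioned output can be walked
once and routed by alias tag, so each table ends up in its own map, which could
then be broadcast as its own variable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Hive or Spark classes.
class PerTableGrouping {

    /** One collected small-table row; the alias tag identifies its source table. */
    static final class TaggedRow {
        final int tag;       // alias tag of the small table (assumed to be 0..m-1)
        final String key;    // join key (placeholder type)
        final String value;  // row payload (placeholder type)

        TaggedRow(int tag, String key, String value) {
            this.tag = tag;
            this.key = key;
            this.value = value;
        }
    }

    /** Splits the unioned rows into one key -> values map per alias tag. */
    static Map<Integer, Map<String, List<String>>> groupByTag(List<TaggedRow> rows) {
        Map<Integer, Map<String, List<String>>> perTable = new HashMap<>();
        for (TaggedRow row : rows) {
            perTable.computeIfAbsent(row.tag, t -> new HashMap<>())
                    .computeIfAbsent(row.key, k -> new ArrayList<>())
                    .add(row.value);
        }
        return perTable;
    }

    public static void main(String[] args) {
        // Rows from two small tables, interleaved as they might be after a union.
        List<TaggedRow> collected = new ArrayList<>();
        collected.add(new TaggedRow(0, "k1", "a"));
        collected.add(new TaggedRow(1, "k1", "b"));
        collected.add(new TaggedRow(0, "k1", "c"));

        Map<Integer, Map<String, List<String>>> perTable = groupByTag(collected);

        // Each entry here would become a separate variable, one per table.
        for (Map.Entry<Integer, Map<String, List<String>>> e : perTable.entrySet()) {
            System.out.println("tag " + e.getKey() + " -> " + e.getValue());
        }
    }
}
```

This only covers the per-table (m) dimension; it says nothing about how to
split further per bucket.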
The trickier part is for bucket join: how to get one variable per bucket after
the results are collected. More research is needed there.
> Aggregate all small table join data into 1 broadcast variable
> -------------------------------------------------------------
>
> Key: HIVE-8621
> URL: https://issues.apache.org/jira/browse/HIVE-8621
> Project: Hive
> Issue Type: Sub-task
> Reporter: Suhas Satish
> Assignee: Suhas Satish
>
> This is a sub-task of map-join for spark
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)