[ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185944#comment-14185944
 ] 

Szehon Ho commented on HIVE-8621:
---------------------------------

Hi Suhas, thanks for creating the JIRA.  I think we should actually have m x n
variables (m = numSmallTables, n = numBuckets).  If you read the code of
MapJoinOperator, its processing logic, as I understand it, keeps each small
table's data in a separate data structure.  It would be better if we could
reuse that operator.

We can still do everything else as planned (m MapTasks that are union'ed), but
during the collection phase it should be easy for us to make one variable per
table by checking the alias tag.  This is the same logic MapReduce uses to
divide the data per table (see HashTableSinkOperator.flushToFile()).  Let me
know if that makes sense.
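
To make the per-table split concrete, here is a rough sketch of the idea in
plain Java.  It uses no Hive or Spark APIs and does not reproduce
HashTableSinkOperator.flushToFile(); TaggedRow and groupByTag are hypothetical
names made up for illustration, and rows are simplified to string key/value
pairs.  It only shows collected rows being routed into one map per alias tag,
so that each map could then become its own broadcast variable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not Hive code, all names here are assumptions.
class PerTableCollector {

    /** Hypothetical stand-in for a collected small-table row with its alias tag. */
    static final class TaggedRow {
        final byte tag;      // identifies which small table the row came from
        final String key;    // join key, simplified to a String
        final String value;  // row payload, simplified to a String

        public TaggedRow(byte tag, String key, String value) {
            this.tag = tag;
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Routes the rows collected from the union'ed map tasks into one
     * hash table per alias tag: tag -> (join key -> rows with that key).
     */
    static Map<Byte, Map<String, List<String>>> groupByTag(List<TaggedRow> rows) {
        Map<Byte, Map<String, List<String>>> perTable = new HashMap<>();
        for (TaggedRow row : rows) {
            perTable.computeIfAbsent(row.tag, t -> new HashMap<>())
                    .computeIfAbsent(row.key, k -> new ArrayList<>())
                    .add(row.value);
        }
        return perTable;
    }

    public static void main(String[] args) {
        List<TaggedRow> collected = new ArrayList<>();
        collected.add(new TaggedRow((byte) 0, "k1", "a"));
        collected.add(new TaggedRow((byte) 1, "k1", "b"));
        collected.add(new TaggedRow((byte) 0, "k2", "c"));

        // Each entry of the result would be wrapped in its own broadcast variable.
        Map<Byte, Map<String, List<String>>> perTable = groupByTag(collected);
        for (Map.Entry<Byte, Map<String, List<String>>> e : perTable.entrySet()) {
            System.out.println("table tag " + e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Splitting further per bucket would mean adding the bucket to the outer key,
but how to recover the bucket after collection is still open.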

The trickier part is the bucket join: how to get one variable per bucket after
the results are collected.  More research is needed there.

> Aggregate all small table join data into 1 broadcast variable
> -------------------------------------------------------------
>
>                 Key: HIVE-8621
>                 URL: https://issues.apache.org/jira/browse/HIVE-8621
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
>
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
