[
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189531#comment-14189531
]
Suhas Satish commented on HIVE-8621:
------------------------------------
So far in the Spark implementation we are not tagging the small
tables, but I realized that we need to tag them to be able to use different
broadcast variables for different tables.
Also, we have 2 reduce sinks (RS) for the 2 small tables in a 3-way map-join.
In M/R, we have only one HashTableSink Operator (HTS) for all small tables
combined. This conversion from RS -> HTS
happens in LocalMapJoinProcFactory and is triggered by rule R7
(MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the
TaskCompiler.optimizeTaskPlan phase.
Using logic similar to LocalMapJoinProcFactory in SparkMapJoinResolver, we
will end up with 2 HashTableSinks (or, in general, (n-1) HTS for an n-way join).
Each of these will generate its own broadcast variable.
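To illustrate the tagging idea, here is a rough Python sketch (not Hive or Spark code). It assumes each small table is just a list of (join_key, value) rows and uses the table's position as its tag; build_tagged_hash_tables and the sample rows are hypothetical names made up for this example. Each tag gets its own hash table, which is the piece that would back one broadcast variable.

```python
from collections import defaultdict


def build_tagged_hash_tables(small_tables):
    """Return {tag: {join_key: [values]}}, one hash table per small table.

    The tag is simply the table's index in the input list (an assumption
    made for this sketch, not how Hive assigns tags).
    """
    tagged = {}
    for tag, rows in enumerate(small_tables):
        table = defaultdict(list)
        for join_key, value in rows:
            table[join_key].append(value)
        tagged[tag] = dict(table)
    return tagged


# A 3-way map-join has 2 small tables, so we expect 2 tagged hash tables.
small_tables = [
    [(1, "a"), (2, "b")],   # small table with tag 0
    [(1, "x"), (1, "y")],   # small table with tag 1
]
tagged = build_tagged_hash_tables(small_tables)
```

Keeping the tables keyed by tag means the probe side can look up the right hash table for each small-table input instead of sharing one combined structure.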
After going through Sandy Ryza's Spark presentation here,
http://www.slideshare.net/SandyRyza/spark-job-failures-talk
it looks like the recommended way to distribute compute in Spark is to have a
large number of SparkTasks. So I think it's better to have each MapWork from
each small table as a separate SparkTask. This can be tackled independently in
this jira if you guys agree:
https://issues.apache.org/jira/browse/HIVE-8622
> Dump small table join data into appropriate number of broadcast variables
> [Spark Branch]
> ----------------------------------------------------------------------------------------
>
> Key: HIVE-8621
> URL: https://issues.apache.org/jira/browse/HIVE-8621
> Project: Hive
> Issue Type: Sub-task
> Reporter: Suhas Satish
> Assignee: Suhas Satish
>
> The number of broadcast variables that must be created is m x n, where
> 'm' is the number of small tables in the (m+1)-way join and 'n' is the number
> of buckets in the tables. If unbucketed, n=1.
> This is a sub-task of map-join for Spark
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
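To make the m x n count in the description concrete, here is a small Python sketch that enumerates one (table tag, bucket) pair per broadcast variable. The function name broadcast_variable_keys and the use of integer tags and bucket ids are assumptions for illustration only.

```python
def broadcast_variable_keys(num_small_tables, num_buckets=1):
    """List one (table_tag, bucket_id) key per broadcast variable.

    num_small_tables is 'm' and num_buckets is 'n' from the description;
    num_buckets defaults to 1 for the unbucketed case.
    """
    return [
        (tag, bucket)
        for tag in range(num_small_tables)
        for bucket in range(num_buckets)
    ]


# 3-way join (m = 2 small tables) with n = 4 buckets: 2 x 4 = 8 variables.
bucketed_keys = broadcast_variable_keys(2, 4)

# Same join, unbucketed (n = 1): one variable per small table.
unbucketed_keys = broadcast_variable_keys(2)
```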