> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
> >  line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent 
> > on each other, which isn't true in general. Let's say there are two works 
> > (w1 and w2), each having a map join operator. w1 and w2 are connected to 
> > w3 via HTS. w3 also contains a map join operator. The dependency in this 
> > scenario will be a graph rather than linear.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, 
> they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine; they will be in the same SparkWork. This SparkWork 
> will depend on both the SparkWork generated at w1 and the SparkWork 
> generated at w2. This dependency is not linear.
>     
>     In more detail: for each work that has a map join op, we need to create 
> a SparkWork to handle its small tables. So both w1 and w2 will need to 
> create such a SparkWork. While w1 and w2 are in the same SparkWork, this 
> SparkWork depends on the two SparkWorks that were created.
> 
> Chao Sun wrote:
>     I'm not getting it. Why is this dependency not linear? Can you give a 
> counterexample?
>     Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:
>     
>          HTS_1   HTS_2     HTS_3    HTS_4
>            \      /           \     /
>             \    /             \   /
>               MJ_1              MJ_2
>                |                 |
>                |                 |
>               HTS_5            HTS_6
>                   \            /
>                    \          /
>                     \        /
>                      \      /
>                       \    /
>                         MJ_3
>                         
>     Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the 
> same SparkWork, say SW_1; MJ_1, MJ_2, HTS_5, and HTS_6 in another SparkWork, 
> SW_2; and MJ_3 in a third SparkWork, SW_3.
>     The dependency is then SW_1 -> SW_2 -> SW_3.
> 
> Xuefu Zhang wrote:
>     I don't think we should put (HTS_1, HTS_2) and (HTS_3, HTS_4) in the 
> same SparkWork. They belong to different MJs and handle different sets of 
> small tables. Combining them will complicate things, making 
> HashTableSinkOperator and HashTableLoader more complicated.
>     
>     As for the dependency, MJ_1 doesn't need to wait for HTS_3/HTS_4 in 
> order to run, and vice versa.
>     
>     Please refer to the pseudo code posted in the JIRA for implementation 
> ideas. Thanks.

Resolved via an offline chat.
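
For the record, here is a minimal, self-contained Java sketch of the dependency shape Xuefu describes above. The Work class and all names are placeholders (not the actual SparkWork/SparkMapJoinResolver code): each map join gets its own small-table SparkWork, so the SparkWork containing MJ_1 and MJ_2 ends up with two parents, i.e. a DAG rather than a linear chain.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Placeholder sketch; "Work" stands in for SparkWork and is not the Hive class.
    public class SparkWorkDagSketch {
        static class Work {
            final String name;
            final List<Work> parents = new ArrayList<>();
            Work(String name) { this.name = name; }
            void dependsOn(Work parent) { parents.add(parent); }
            @Override public String toString() { return name; }
        }

        public static void main(String[] args) {
            // One small-table SparkWork per map join operator.
            Work smallMj1 = new Work("SW(HTS_1, HTS_2)");   // small tables of MJ_1
            Work smallMj2 = new Work("SW(HTS_3, HTS_4)");   // small tables of MJ_2

            // w1 and w2 can share a SparkWork, but that SparkWork depends on BOTH
            // small-table works above -- two parents, so not a linear chain.
            Work mj1mj2 = new Work("SW(MJ_1, MJ_2, HTS_5, HTS_6)");
            mj1mj2.dependsOn(smallMj1);
            mj1mj2.dependsOn(smallMj2);

            // The SparkWork containing MJ_3 depends on the work that produces
            // its small tables (HTS_5/HTS_6).
            Work mj3 = new Work("SW(MJ_3)");
            mj3.dependsOn(mj1mj2);

            for (Work w : Arrays.asList(smallMj1, smallMj2, mj1mj2, mj3)) {
                System.out.println(w + " depends on " + w.parents);
            }
        }
    }

Running it prints each placeholder SparkWork with its parents, which makes the two-parent edge at SW(MJ_1, MJ_2, HTS_5, HTS_6) visible.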


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 9, 2014, 10:39 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2014, 10:39 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map join for Spark:
> https://issues.apache.org/jira/browse/HIVE-7613
> It can use the baseline patch for map join:
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
