Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

Chao Sun Sat, 08 Nov 2014 21:57:08 -0800


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
> >  line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent 
> > on each other, which isn't true in general. Let's say there are two works 
> > (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 
> > via HTS. w3 also contains a map join operator. The dependency in this 
> > scenario will be a graph rather than a linear chain.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, 
> they can be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine. They will be in the same SparkWork. This SparkWork 
> will depend on both the SparkWork generated at w1 and the SparkWork generated 
> at w2. This dependency is not linear.
>     
>     To give more detail: for each work that has a map join op, we need to 
> create a SparkWork to handle its small tables. So, both w1 and w2 will need 
> to create such a SparkWork. While w1 and w2 are in the same SparkWork, this 
> SparkWork depends on the two SparkWorks created.


I don't follow: why is this dependency "not linear"? Can you give a 
counterexample?
Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are arranged as follows:

     HTS_1   HTS_2     HTS_3    HTS_4
       \      /           \     /
        \    /             \   /
          MJ_1              MJ_2
           |                 |
           |                 |
          HTS_5            HTS_6
              \            /
               \          /
                \        /
                 \      /
                  \    /
                    MJ_3
                    
What I'm doing is putting HTS_1, HTS_2, HTS_3, and HTS_4 in the same 
SparkWork, say SW_1. Then MJ_1, MJ_2, HTS_5, and HTS_6 go in another 
SparkWork, SW_2, and MJ_3 goes in a third SparkWork, SW_3. The resulting 
dependency is linear:
SW_1 -> SW_2 -> SW_3.
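
To make the grouping above concrete, here is a minimal sketch of the idea. It 
is not code from the patch: the class SparkWorkStages, its two maps, and the 
methods stageOf and groupByStage are hypothetical names used only for 
illustration, and plain strings stand in for the real operators. The assumed 
rule is that an HTS with no map join above it lands in stage 1, a map join 
lands one stage after its deepest HTS input, and an HTS fed by a map join 
stays in that map join's stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; these names do not exist in Hive.
final class SparkWorkStages {
    // Map join -> the HTS operators that feed its small tables (diagram above).
    static final Map<String, List<String>> SMALL_TABLE_INPUTS = new HashMap<>();
    // HTS -> the map join sitting directly above it in the same work.
    static final Map<String, String> UPSTREAM_JOIN = new HashMap<>();

    static {
        SMALL_TABLE_INPUTS.put("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
        SMALL_TABLE_INPUTS.put("MJ_2", Arrays.asList("HTS_3", "HTS_4"));
        SMALL_TABLE_INPUTS.put("MJ_3", Arrays.asList("HTS_5", "HTS_6"));
        UPSTREAM_JOIN.put("HTS_5", "MJ_1");
        UPSTREAM_JOIN.put("HTS_6", "MJ_2");
    }

    // Stage number (1-based) of an operator under the assumed rule.
    static int stageOf(String op) {
        List<String> inputs = SMALL_TABLE_INPUTS.get(op);
        if (inputs != null) {
            // A map join runs one stage after its deepest small-table input.
            int deepest = 0;
            for (String hts : inputs) {
                deepest = Math.max(deepest, stageOf(hts));
            }
            return deepest + 1;
        }
        // An HTS shares the stage of the map join above it, or starts at 1.
        String upstream = UPSTREAM_JOIN.get(op);
        return upstream == null ? 1 : stageOf(upstream);
    }

    // Groups every operator by stage; stage n corresponds to SW_n.
    static Map<Integer, List<String>> groupByStage() {
        TreeSet<String> ops = new TreeSet<>(SMALL_TABLE_INPUTS.keySet());
        for (List<String> inputs : SMALL_TABLE_INPUTS.values()) {
            ops.addAll(inputs);
        }
        Map<Integer, List<String>> stages = new TreeMap<>();
        for (String op : ops) {
            stages.computeIfAbsent(stageOf(op), k -> new ArrayList<>()).add(op);
        }
        return stages;
    }
}
```

For the diagram above, groupByStage() puts HTS_1 through HTS_4 in stage 1, 
MJ_1, MJ_2, HTS_5, and HTS_6 in stage 2, and MJ_3 in stage 3, which matches 
SW_1, SW_2, and SW_3. This toy model only covers this one example and says 
nothing about plans where the two branches have different depths.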


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for Spark:
> https://issues.apache.org/jira/browse/HIVE-7613
> It can build on the baseline patch for map-join:
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
