Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-09 Thread Xuefu Zhang


 On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 214
  https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214
 
  This assumes that the resulting SparkWorks will be linearly dependent on each
  other, which isn't true in general. Let's say there are two works (w1 and w2),
  each having a map join operator. w1 and w2 are connected to w3 via HTS. w3
  also contains a map join operator. The dependency in this scenario will be
  a graph rather than a linear chain.
 
 Chao Sun wrote:
 I was thinking, in this case, if there's no dependency between w1 and w2, 
 they can be put in the same SparkWork, right?
 Otherwise, they will form a linear dependency too.
 
 Xuefu Zhang wrote:
 w1 and w2 are fine. They will be in the same SparkWork. This SparkWork
 will depend on both the SparkWork generated for w1 and the SparkWork generated
 for w2. This dependency is not linear.
 
 In more detail: for each work that has a map join op, we need to
 create a SparkWork to handle its small tables. So, both w1 and w2 will need
 to create such a SparkWork. While w1 and w2 are in the same SparkWork, this
 SparkWork depends on the two SparkWorks created.
 
 Chao Sun wrote:
 I'm not getting it. Why is this dependency not linear? Can you give a
 counterexample?
 Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:
 
  HTS_1   HTS_2   HTS_3   HTS_4
     \     /         \     /
      \   /           \   /
      MJ_1            MJ_2
        |               |
        |               |
      HTS_5           HTS_6
         \             /
          \           /
           \         /
            \       /
             \     /
              MJ_3
 
 Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same 
 SparkWork, say SW_1
 then, MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork SW_2, and 
 MJ_3 in another SparkWork SW_3.
 SW_1 - SW_2 - SW_3.

I don't think we should put (HTS1, HTS2) and (HTS3, HTS4) in the same SparkWork.
They belong to different MJs, each handling a different set of small tables.
Putting them together will complicate HashTableSinkOperator and HashTableLoader.

In terms of dependencies, MJ1 doesn't need to wait for HTS3/HTS4 in order to
run, and vice versa.

Please refer to pseudo code posted in the JIRA for implementation ideas. Thanks.
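
To make the dependency point concrete, here is a minimal Java sketch. It is not the pseudo code from the JIRA, which is not reproduced in this thread. It models the example plan with plain strings in place of Hive's BaseWork objects (the class and method names are made up for illustration) and computes which works a given work has to wait for.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: works are plain strings, not Hive's BaseWork.
class WorkAncestors {
  /** Returns every work that must finish before {@code work} can run. */
  static Set<String> ancestors(Map<String, List<String>> parents, String work) {
    Set<String> seen = new TreeSet<>();
    Deque<String> stack =
        new ArrayDeque<>(parents.getOrDefault(work, Collections.<String>emptyList()));
    while (!stack.isEmpty()) {
      String w = stack.pop();
      if (seen.add(w)) {
        // First visit: also walk this work's own parents.
        stack.addAll(parents.getOrDefault(w, Collections.<String>emptyList()));
      }
    }
    return seen;
  }

  /** Work -> parent works for the three-map-join example in this thread. */
  static Map<String, List<String>> examplePlan() {
    Map<String, List<String>> parents = new HashMap<>();
    parents.put("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
    parents.put("MJ_2", Arrays.asList("HTS_3", "HTS_4"));
    parents.put("HTS_5", Arrays.asList("MJ_1"));
    parents.put("HTS_6", Arrays.asList("MJ_2"));
    parents.put("MJ_3", Arrays.asList("HTS_5", "HTS_6"));
    return parents;
  }

  public static void main(String[] args) {
    Map<String, List<String>> plan = examplePlan();
    System.out.println("MJ_1 waits for " + ancestors(plan, "MJ_1"));
    System.out.println("MJ_2 waits for " + ancestors(plan, "MJ_2"));
  }
}
```

In this model MJ_1 waits only for HTS_1 and HTS_2, so grouping all four hash table sinks into one SparkWork would make it wait for work it does not need.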


- Xuefu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
---


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/27627/
 ---
 
 (Updated Nov. 7, 2014, 6:07 p.m.)
 
 
 Review request for hive.
 
 
 Bugs: HIVE-8622
 https://issues.apache.org/jira/browse/HIVE-8622
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616
 
 
 Diffs
 -
 
   
 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
  PRE-CREATION 
   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
 
 Diff: https://reviews.apache.org/r/27627/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Chao Sun
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-09 Thread Chao Sun

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
---

(Updated Nov. 9, 2014, 10:39 p.m.)


Review request for hive.


Changes
---

Adopting Xuefu's pseudo code. Now, for each BaseWork with an MJ operator, a
separate SparkWork is used for its parent BaseWorks that contain a
HashTableSinkOperator.
I manually tested this patch with several qfiles containing map-join queries,
and the results look correct.
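
As a rough illustration of this change (a sketch under assumptions, not the patch itself): the Java snippet below uses plain strings for BaseWorks and treats the "MJ"/"HTS" name prefixes as a stand-in for checking which operator a work contains. For each map-join work it collects the HTS parent works that would go into their own SparkWork. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: work names stand in for Hive's BaseWork objects, and the
// "MJ"/"HTS" name prefixes stand in for checking which operator a work contains.
class SmallTableSplit {
  /** For each map-join work, lists the HTS parent works that would get their own SparkWork. */
  static Map<String, List<String>> smallTableWorks(Map<String, List<String>> parents) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : parents.entrySet()) {
      if (!entry.getKey().startsWith("MJ")) {
        continue; // works without a map join need no small-table SparkWork
      }
      List<String> htsParents = new ArrayList<>();
      for (String parent : entry.getValue()) {
        if (parent.startsWith("HTS")) {
          htsParents.add(parent);
        }
      }
      if (!htsParents.isEmpty()) {
        result.put(entry.getKey(), htsParents);
      }
    }
    return result;
  }

  /** Work -> parent works for the three-map-join example from the review discussion. */
  static Map<String, List<String>> examplePlan() {
    Map<String, List<String>> parents = new LinkedHashMap<>();
    parents.put("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
    parents.put("MJ_2", Arrays.asList("HTS_3", "HTS_4"));
    parents.put("HTS_5", Arrays.asList("MJ_1"));
    parents.put("HTS_6", Arrays.asList("MJ_2"));
    parents.put("MJ_3", Arrays.asList("HTS_5", "HTS_6"));
    return parents;
  }

  public static void main(String[] args) {
    for (Map.Entry<String, List<String>> e : smallTableWorks(examplePlan()).entrySet()) {
      System.out.println("SparkWork for small tables of " + e.getKey() + ": " + e.getValue());
    }
  }
}
```

With the example plan this yields three small-table groups, one per map join, rather than a single group holding every hash table sink.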


Bugs: HIVE-8622
https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
---

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs (updated)
-

  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
 PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
---


Thanks,

Chao Sun



Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-09 Thread Chao Sun


 On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 214
  https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214
 
  This assumes that the resulting SparkWorks will be linearly dependent on each
  other, which isn't true in general. Let's say there are two works (w1 and w2),
  each having a map join operator. w1 and w2 are connected to w3 via HTS. w3
  also contains a map join operator. The dependency in this scenario will be
  a graph rather than a linear chain.
 
 Chao Sun wrote:
 I was thinking, in this case, if there's no dependency between w1 and w2, 
 they can be put in the same SparkWork, right?
 Otherwise, they will form a linear dependency too.
 
 Xuefu Zhang wrote:
 w1 and w2 are fine. They will be in the same SparkWork. This SparkWork
 will depend on both the SparkWork generated for w1 and the SparkWork generated
 for w2. This dependency is not linear.
 
 In more detail: for each work that has a map join op, we need to
 create a SparkWork to handle its small tables. So, both w1 and w2 will need
 to create such a SparkWork. While w1 and w2 are in the same SparkWork, this
 SparkWork depends on the two SparkWorks created.
 
 Chao Sun wrote:
 I'm not getting it. Why is this dependency not linear? Can you give a
 counterexample?
 Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:
 
  HTS_1   HTS_2   HTS_3   HTS_4
     \     /         \     /
      \   /           \   /
      MJ_1            MJ_2
        |               |
        |               |
      HTS_5           HTS_6
         \             /
          \           /
           \         /
            \       /
             \     /
              MJ_3
 
 Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same 
 SparkWork, say SW_1
 then, MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork SW_2, and 
 MJ_3 in another SparkWork SW_3.
 SW_1 - SW_2 - SW_3.
 
 Xuefu Zhang wrote:
 I don't think we should put (HTS1, HTS2) and (HTS3, HTS4) in the same
 SparkWork. They belong to different MJs, each handling a different set of
 small tables. Putting them together will complicate HashTableSinkOperator
 and HashTableLoader.
 
 In terms of dependencies, MJ1 doesn't need to wait for HTS3/HTS4 in order
 to run, and vice versa.
 
 Please refer to pseudo code posted in the JIRA for implementation ideas. 
 Thanks.

Resolved via an offline chat.


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
---


On Nov. 9, 2014, 10:39 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-09 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60530
---

Ship it!



- Xuefu Zhang


On Nov. 9, 2014, 10:39 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-08 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
---



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101882

This assumes that the resulting SparkWorks will be linearly dependent on each
other, which isn't true in general. Let's say there are two works (w1 and w2),
each having a map join operator. w1 and w2 are connected to w3 via HTS. w3 also
contains a map join operator. The dependency in this scenario will be a graph
rather than a linear chain.


- Xuefu Zhang


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-08 Thread Chao Sun


 On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 214
  https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214
 
  This assumes that the resulting SparkWorks will be linearly dependent on each
  other, which isn't true in general. Let's say there are two works (w1 and w2),
  each having a map join operator. w1 and w2 are connected to w3 via HTS. w3
  also contains a map join operator. The dependency in this scenario will be
  a graph rather than a linear chain.

I was thinking, in this case, if there's no dependency between w1 and w2, they 
can be put in the same SparkWork, right?
Otherwise, they will form a linear dependency too.


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
---


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-08 Thread Chao Sun


 On Nov. 8, 2014, 12:44 a.m., Szehon Ho wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 224
  https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line224
 
  I've been thinking about this, as you had brought up a pretty rare
  use-case where a big-table parent of mapjoin1 still had an HTS, but it's for
  another(!) mapjoin. I don't know if this is still a valid case, but do you
  think this handles it, as it just indiscriminately adds it to the parent
  map if it has an HTS?

Fixed through an offline chat.


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60380
---


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-08 Thread Chao Sun


 On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 100
  https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line100
 
  It seems possible that current is MJwork, right? Are you going to add 
  it to the target?

Yes, it's possible. But that MJwork will be one whose HTSs have all already
been handled, so we can go through it to reach HTSs for other MJworks.


 On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 115
  https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line115
 
  Frankly, I'm not 100% following the logic. The diagram has operators 
  mixed with works, which makes it hard. But I'm seeing where you're coming 
  from. Maybe you can explain to me better in person.

Here an operator name (MJ, HTS) denotes a work that contains that operator, so
MJ is a BaseWork containing an MJ operator, and likewise for HTS.
Yes, I think explaining in person would be better.


 On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 155
  https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line155
 
  I think there is a separate JIRA handling combining mapjoins, owned by 
  Szehon.

In my understanding, Szehon's JIRA tries to put MJ operators in the same
BaseWork. But there are some cases where we cannot apply this optimization, and
the MJ operators will be in different BaseWorks. My work here tries to put them
in the same SparkWork if there's no dependency among them.


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60403
---


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-08 Thread Xuefu Zhang


 On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 214
  https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214
 
  This assumes that the resulting SparkWorks will be linearly dependent on each
  other, which isn't true in general. Let's say there are two works (w1 and w2),
  each having a map join operator. w1 and w2 are connected to w3 via HTS. w3
  also contains a map join operator. The dependency in this scenario will be
  a graph rather than a linear chain.
 
 Chao Sun wrote:
 I was thinking, in this case, if there's no dependency between w1 and w2, 
 they can be put in the same SparkWork, right?
 Otherwise, they will form a linear dependency too.

w1 and w2 are fine. They will be in the same SparkWork. This SparkWork will
depend on both the SparkWork generated for w1 and the SparkWork generated for
w2. This dependency is not linear.

In more detail: for each work that has a map join op, we need to create a
SparkWork to handle its small tables. So, both w1 and w2 will need to create
such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork
depends on the two SparkWorks created.


- Xuefu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
---


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-08 Thread Chao Sun


 On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 214
  https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214
 
  This assumes that the resulting SparkWorks will be linearly dependent on each
  other, which isn't true in general. Let's say there are two works (w1 and w2),
  each having a map join operator. w1 and w2 are connected to w3 via HTS. w3
  also contains a map join operator. The dependency in this scenario will be
  a graph rather than a linear chain.
 
 Chao Sun wrote:
 I was thinking, in this case, if there's no dependency between w1 and w2, 
 they can be put in the same SparkWork, right?
 Otherwise, they will form a linear dependency too.
 
 Xuefu Zhang wrote:
 w1 and w2 are fine. They will be in the same SparkWork. This SparkWork
 will depend on both the SparkWork generated for w1 and the SparkWork generated
 for w2. This dependency is not linear.
 
 In more detail: for each work that has a map join op, we need to
 create a SparkWork to handle its small tables. So, both w1 and w2 will need
 to create such a SparkWork. While w1 and w2 are in the same SparkWork, this
 SparkWork depends on the two SparkWorks created.

I'm not getting it. Why is this dependency not linear? Can you give a
counterexample?
Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:

 HTS_1   HTS_2   HTS_3   HTS_4
    \     /         \     /
     \   /           \   /
     MJ_1            MJ_2
       |               |
       |               |
     HTS_5           HTS_6
        \             /
         \           /
          \         /
           \       /
            \     /
             MJ_3

Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same 
SparkWork, say SW_1
then, MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork SW_2, and MJ_3 
in another SparkWork SW_3.
SW_1 - SW_2 - SW_3.
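
One way to read this grouping, sketched in Java under my own assumptions (strings stand in for BaseWorks, the "MJ"/"HTS" prefixes stand in for operator checks, and the class name is made up): a new SparkWork starts wherever an HTS work feeds an MJ work, and the index of a work's SparkWork is the largest number of such boundaries on any path leading to it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the linear grouping; names stand in for BaseWorks.
class LinearStages {
  /** Stage index of a work: the most HTS -> MJ boundaries crossed on any path leading to it. */
  static int stage(Map<String, List<String>> parents, String work) {
    int best = 0;
    for (String parent : parents.getOrDefault(work, Collections.<String>emptyList())) {
      // A new SparkWork starts only where a hash table sink feeds a map join.
      int boundary = parent.startsWith("HTS") && work.startsWith("MJ") ? 1 : 0;
      best = Math.max(best, stage(parents, parent) + boundary);
    }
    return best;
  }

  /** Work -> parent works for the diagram above. */
  static Map<String, List<String>> examplePlan() {
    Map<String, List<String>> parents = new HashMap<>();
    parents.put("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
    parents.put("MJ_2", Arrays.asList("HTS_3", "HTS_4"));
    parents.put("HTS_5", Arrays.asList("MJ_1"));
    parents.put("HTS_6", Arrays.asList("MJ_2"));
    parents.put("MJ_3", Arrays.asList("HTS_5", "HTS_6"));
    return parents;
  }

  public static void main(String[] args) {
    Map<String, List<String>> plan = examplePlan();
    for (String work : Arrays.asList("HTS_1", "MJ_1", "HTS_5", "MJ_3")) {
      System.out.println(work + " -> SW_" + (stage(plan, work) + 1));
    }
  }
}
```

Under this reading the four top-level HTS works get index 0 (SW_1), MJ_1, MJ_2, HTS_5 and HTS_6 get index 1 (SW_2), and MJ_3 gets index 2 (SW_3).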


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
---


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-07 Thread Chao Sun

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
---

(Updated Nov. 7, 2014, 3:57 p.m.)


Review request for hive.


Changes
---

Another patch with what is, in my opinion, a cleaner solution. I tested it with
subquery_multiinsert.q and the result looks fine. Please give suggestions!


Bugs: HIVE-8622
https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
---

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs (updated)
-

  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
 PRE-CREATION 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
---


Thanks,

Chao Sun



Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-07 Thread Chao Sun

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/
---

(Updated Nov. 7, 2014, 6:07 p.m.)


Review request for hive.


Changes
---

Instead of using a Set, we should use a Map from a BaseWork w/ MJ to all its
parent BaseWorks w/ HTSs. The principle is that we cannot process any of the
BaseWorks below this MJ until all of its HTSs are processed.
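
A minimal sketch of that bookkeeping, assuming plain strings in place of BaseWork objects (the class and method names are hypothetical and not taken from the patch): the map records, per MJ work, the HTS parents still outstanding, and an MJ work is released only once that set is empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: tracks which HTS parents each map-join work still waits on.
class PendingHashTables {
  private final Map<String, Set<String>> pending = new HashMap<>();

  /** Records the HTS parent works that must be processed before mjWork. */
  void register(String mjWork, Collection<String> htsParents) {
    pending.put(mjWork, new HashSet<>(htsParents));
  }

  /** Marks an HTS work as processed and returns the MJ works that just became ready. */
  List<String> markProcessed(String htsWork) {
    List<String> ready = new ArrayList<>();
    Iterator<Map.Entry<String, Set<String>>> it = pending.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Set<String>> entry = it.next();
      if (entry.getValue().remove(htsWork) && entry.getValue().isEmpty()) {
        ready.add(entry.getKey());
        it.remove(); // nothing left to wait for
      }
    }
    return ready;
  }

  /** True while mjWork still has unprocessed HTS parents. */
  boolean isBlocked(String mjWork) {
    return pending.containsKey(mjWork);
  }

  public static void main(String[] args) {
    PendingHashTables tracker = new PendingHashTables();
    tracker.register("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
    System.out.println(tracker.markProcessed("HTS_1")); // MJ_1 still blocked
    System.out.println(tracker.markProcessed("HTS_2")); // MJ_1 released
  }
}
```

A Set of HTS works alone could not say which MJ work each one belongs to, which is the information this map keeps.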


Bugs: HIVE-8622
https://issues.apache.org/jira/browse/HIVE-8622


Repository: hive-git


Description
---

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


Diffs (updated)
-

  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
 PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 

Diff: https://reviews.apache.org/r/27627/diff/


Testing
---


Thanks,

Chao Sun



Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-07 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60403
---



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101790

Nit: need a space before and after -. Same below in multiple occurrences.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101795

It seems possible that current is MJwork, right? Are you going to add it to 
the target?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101808

Frankly, I'm not 100% following the logic. The diagram has operators mixed 
with works, which makes it hard. But I'm seeing where you're coming from. Maybe 
you can explain to me better in person.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101799

I think there is a separate JIRA handling combining mapjoins, owned by 
Szehon.


- Xuefu Zhang


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-07 Thread Szehon Ho

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60380
---



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101745

We can add a comment.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101746

We should not start this with capital letters.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101846

I've been thinking about this, as you had brought up a pretty rare use-case
where a big-table parent of mapjoin1 still had an HTS, but it's for another(!)
mapjoin. I don't know if this is still a valid case, but do you think this
handles it, as it just indiscriminately adds it to the parent map if it has an
HTS?


- Szehon Ho


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-05 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review59987
---



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101309

Do you mean parentTasks != null?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101335

It seems this check should be the first line in the outer for loop, for
better efficiency and clarity.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
https://reviews.apache.org/r/27627/#comment101336

Merge with itself?


- Xuefu Zhang


On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/27627/
 ---
 
 (Updated Nov. 5, 2014, 5:51 p.m.)
 
 
 Review request for hive.
 
 
 Bugs: HIVE-8622
 https://issues.apache.org/jira/browse/HIVE-8622
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616
 
 
 Diffs
 -
 
   
 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
  PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/27627/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Chao Sun
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-05 Thread Szehon Ho

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60034
---


Hi Chao, I left a review for a form of this patch at 
https://reviews.apache.org/r/27640/, as Suhas put it up for a separate review 
in combination with his patch.

- Szehon Ho


On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-05 Thread Chao Sun


 On Nov. 5, 2014, 9:24 p.m., Szehon Ho wrote:
  Hi Chao, I left a review for a form of this patch at 
  https://reviews.apache.org/r/27640/, as Suhas put it up for a separate 
  review in combination with his patch.

Thanks, I'll take a look there.


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60034
---


On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote:
 
 




Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-05 Thread Chao Sun


 On Nov. 5, 2014, 7:16 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 128
  https://reviews.apache.org/r/27627/diff/1/?file=750389#file750389line128
 
  Do you mean parentTasks != null?

That was a silly mistake.


 On Nov. 5, 2014, 7:16 p.m., Xuefu Zhang wrote:
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java,
   line 185
  https://reviews.apache.org/r/27627/diff/1/?file=750389#file750389line185
 
  Merge with itself?

Yes, in this case (current BaseWork has no MJ), we merge all parent SparkWorks 
into the current SparkWork.
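
For illustration only, a small Java sketch of that merge step, with each SparkWork modelled as a set of work names. The names Map_1, Map_2, Reduce_1 and Reduce_2 are invented for the example, and the class is hypothetical rather than part of the patch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: a SparkWork is just the set of work names it contains.
class SparkWorkMerge {
  /** Merges all parent SparkWorks and the current work (which has no MJ) into one SparkWork. */
  static Set<String> mergeIntoCurrent(String currentWork, List<Set<String>> parentSparkWorks) {
    Set<String> merged = new LinkedHashSet<>();
    for (Set<String> parent : parentSparkWorks) {
      merged.addAll(parent);
    }
    merged.add(currentWork);
    return merged;
  }

  public static void main(String[] args) {
    Set<String> first = new LinkedHashSet<>(Arrays.asList("Map_1"));
    Set<String> second = new LinkedHashSet<>(Arrays.asList("Map_2", "Reduce_1"));
    System.out.println(mergeIntoCurrent("Reduce_2", Arrays.asList(first, second)));
  }
}
```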


- Chao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review59987
---


On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote:
 