> On Sept. 19, 2014, 5:45 p.m., Xuefu Zhang wrote:
> > Nice work.
> > 
> > Besides the comments below, I think there are some improvements that can 
> > be done, either here or in a different patch:
> > 
> > 1. If we have a module that can compile an op tree (given by top ops) into 
> > a spark task, then we can reuse it after the original op tree is broken 
> > into several trees. Each tree is compiled into a spark task, and in the 
> > end we hook up the parent-child relationships. The current logic is a 
> > little complicated and hard to understand.
> > 2. Tests 
> > 3. Optimizations

I agree. I can do these in separate follow-up patches.
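
For point 1, something along these lines is what I have in mind (just a 
standalone toy sketch; the Tree/Task classes and compileTree are placeholders, 
not the actual GenSparkWork API): compile each broken-off tree independently, 
then wire up the resulting tasks in a second pass.

    import java.util.*;

    public class ModularCompileSketch {
      // Toy placeholders for an operator tree and the task it compiles into.
      static class Tree {
        final String name;
        final List<Tree> parents = new ArrayList<>();
        Tree(String name) { this.name = name; }
      }
      static class Task {
        final String name;
        final List<Task> children = new ArrayList<>();
        Task(String name) { this.name = name; }
      }

      // One small module compiles a single tree; it knows nothing about the others.
      static Task compileTree(Tree tree) { return new Task("task(" + tree.name + ")"); }

      public static void main(String[] args) {
        Tree map = new Tree("map-side");
        Tree red1 = new Tree("reduce-1"); red1.parents.add(map);
        Tree red2 = new Tree("reduce-2"); red2.parents.add(map);

        // Pass 1: compile every tree independently.
        Map<Tree, Task> compiled = new LinkedHashMap<>();
        for (Tree t : Arrays.asList(map, red1, red2)) {
          compiled.put(t, compileTree(t));
        }

        // Pass 2: hook up the parent-child relationship between the tasks.
        for (Map.Entry<Tree, Task> e : compiled.entrySet()) {
          for (Tree parent : e.getKey().parents) {
            compiled.get(parent).children.add(e.getValue());
          }
        }
        System.out.println(compiled.get(map).children.size() + " child tasks under map-side");
      }
    }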


> On Sept. 19, 2014, 5:45 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java,
> >  line 142
> > <https://reviews.apache.org/r/25394/diff/3/?file=693788#file693788line142>
> >
> >     Here we are mapping the children of lca to lca itself. Why is this 
> > necessary, since you can find the children of lca later without the map? 
> > Can't we just store lca here?

The problem is that we only generate one FS but multiple TSs. After the FS and 
the first TS are generated, the child-parent relation is lost (since the op 
tree is modified), so we need to store this information somewhere else, to be 
used when processing the remaining TSs.
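
To make that concrete, here is a minimal standalone sketch of the idea (a toy 
Op class rather than Hive's real Operator, and the map name is made up): each 
child of the LCA is recorded against the LCA before the tree is cut, so later 
passes over the remaining TSs can still find the old attachment point.

    import java.util.*;

    public class LcaChildMapSketch {
      // Toy stand-in for Hive's Operator<?>; not the real API.
      static class Op {
        final String name;
        final List<Op> parents = new ArrayList<>();
        final List<Op> children = new ArrayList<>();
        Op(String name) { this.name = name; }
      }

      public static void main(String[] args) {
        Op lca = new Op("FOR");           // the common parent found earlier
        Op childA = new Op("SEL-A");
        Op childB = new Op("SEL-B");
        lca.children.addAll(Arrays.asList(childA, childB));
        childA.parents.add(lca);
        childB.parents.add(lca);

        // Record child -> LCA before the tree is modified, so the link
        // survives the cut made when the first TS/FS pair is generated.
        Map<Op, Op> childToLca = new HashMap<>();
        for (Op child : lca.children) {
          childToLca.put(child, lca);
        }

        // Simulate breaking the tree: the children are detached from the LCA.
        lca.children.clear();
        childA.parents.clear();
        childB.parents.clear();

        // When processing the remaining TSs, the old attachment point can
        // still be looked up even though the op tree no longer carries it.
        System.out.println("SEL-B used to hang off: " + childToLca.get(childB).name);
      }
    }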


> On Sept. 19, 2014, 5:45 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java,
> >  line 140
> > <https://reviews.apache.org/r/25394/diff/3/?file=693788#file693788line140>
> >
> >     This seems to cover only the case where all FSs have a common FORWARD 
> > parent. What if only some of them share a FORWARD parent, but the other FSs 
> > and the FORWARD operator share some common parent?
> >     
> >     I think the rule for whether to break the plan goes like this:
> >     
> >     A plan needs to be broken if and only if there is more than one 
> > FileSinkOperator that can be traced back to a common parent and the tracing 
> > has to pass a ReduceSinkOperator on the way.

In that case the LCA is not a FOR, so breaking at that point is safe (though 
maybe not optimal), is that right?
Personally, after so many attempts, I'm a bit inclined to just do what MR does: 
go top-down and keep the first RS in the same SparkWork, and for the remaining 
RSs, just break the plan.
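
Roughly what I mean, as a standalone toy model (not the actual 
GenSparkWork/SparkTableScanProcessor code): walk the tree top-down, let the 
first RS on a path stay in the current work, and mark every RS encountered 
below it as a point where the plan gets broken.

    import java.util.*;

    public class TopDownBreakSketch {
      // Toy stand-in for Hive operators; not the real Operator API.
      static class Op {
        final String name;
        final boolean isReduceSink;
        final List<Op> children = new ArrayList<>();
        Op(String name, boolean isReduceSink) {
          this.name = name;
          this.isReduceSink = isReduceSink;
        }
      }

      // Collect the operators at which the plan would be broken: every RS
      // reached after we have already passed one RS on the same path.
      static void findBreakPoints(Op op, boolean seenRs, List<Op> breaks) {
        if (op.isReduceSink && seenRs) {
          breaks.add(op);          // second (or later) RS on this path: break here
        }
        for (Op child : op.children) {
          findBreakPoints(child, seenRs || op.isReduceSink, breaks);
        }
      }

      public static void main(String[] args) {
        // TS -> RS1 -> JOIN -> RS2 -> FS: RS1 stays in the current work,
        // RS2 is where the plan gets broken.
        Op ts = new Op("TS", false);
        Op rs1 = new Op("RS1", true);
        Op join = new Op("JOIN", false);
        Op rs2 = new Op("RS2", true);
        Op fs = new Op("FS", false);
        ts.children.add(rs1);
        rs1.children.add(join);
        join.children.add(rs2);
        rs2.children.add(fs);

        List<Op> breaks = new ArrayList<>();
        findBreakPoints(ts, false, breaks);
        for (Op b : breaks) {
          System.out.println("break the plan at " + b.name);  // prints RS2 only
        }
      }
    }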


> On Sept. 19, 2014, 5:45 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java,
> >  line 120
> > <https://reviews.apache.org/r/25394/diff/3/?file=693788#file693788line120>
> >
> >     I feel that the logic here can be simplified. Could we just pop all the 
> > paths, check whether the popped element is the same in each, and keep doing 
> > so until the common parent is found?

I'm not quite sure. I would happily accept a better algorithm if you have one 
:) (the one I'm using is just a standard algorithm for finding the LCA).
The LCA could be at a different position in each path, so how would you pop 
all the paths in lockstep? Also, there could be multiple common parents, and 
we need to identify the lowest one.
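
For reference, here is a minimal standalone sketch of the kind of standard 
approach I mean (a toy single-parent Op class, unlike the real operator DAG 
where an operator can have several parents): collect each node's ancestor 
chain up to the root, intersect the chains, and take the lowest operator in 
the intersection.

    import java.util.*;

    public class LcaSketch {
      // Toy stand-in for Hive operators: single parent for simplicity,
      // unlike the real operator DAG.
      static class Op {
        final String name;
        final Op parent;
        Op(String name, Op parent) { this.name = name; this.parent = parent; }
      }

      // Ancestors of op, listed from the op itself up to the root.
      static List<Op> ancestors(Op op) {
        List<Op> chain = new ArrayList<>();
        for (Op cur = op; cur != null; cur = cur.parent) {
          chain.add(cur);
        }
        return chain;
      }

      // LCA of several nodes: keep only ancestors common to all chains; since
      // the first chain is ordered from the node upward, the first surviving
      // element is the lowest common parent.
      static Op lca(List<Op> nodes) {
        Set<Op> common = new LinkedHashSet<>(ancestors(nodes.get(0)));
        for (Op node : nodes.subList(1, nodes.size())) {
          common.retainAll(new HashSet<>(ancestors(node)));
        }
        return common.isEmpty() ? null : common.iterator().next();
      }

      public static void main(String[] args) {
        Op ts  = new Op("TS", null);
        Op fwd = new Op("FOR", ts);
        Op fs1 = new Op("FS1", new Op("SEL1", fwd));
        Op fs2 = new Op("FS2", new Op("SEL2", fwd));
        System.out.println(lca(Arrays.asList(fs1, fs2)).name);  // prints FOR
      }
    }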


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25394/#review53871
-----------------------------------------------------------


On Sept. 18, 2014, 6:38 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25394/
> -----------------------------------------------------------
> 
> (Updated Sept. 18, 2014, 6:38 p.m.)
> 
> 
> Review request for hive, Brock Noland and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7503
>     https://issues.apache.org/jira/browse/HIVE-7503
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> For Hive's multi-insert query 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there 
> may be an MR job for each insert. When we achieve this with Spark, it would 
> be nice if all the inserts could happen concurrently.
> It seems that this functionality isn't available in Spark. To make things 
> worse, the source of the insert may be re-computed unless it's staged. Even 
> with staging, the inserts will happen sequentially, hurting performance.
> This task is to find out what it takes in Spark to enable this without 
> requiring staging the source or sequential insertion. If this has to be 
> solved in Hive, find an optimal way to do it.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 
> 4211a0703f5b6bfd8a628b13864fac75ef4977cf 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 
> 695d8b90cb1989805a7ff4e39a9635bbcea9c66c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 
> 864965e03a3f9d665e21e1c1b10b19dc286b842f 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
> 76fc290f00430dbc34dbbc1a0cef0d0eb59e6029 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java
>  5fcaf643a0e90fc4acc21187f6d78cefdb1b691a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/25394/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>
