Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Na Yang Thu, 06 Nov 2014 22:34:59 -0800


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 220
> > <https://reviews.apache.org/r/27719/diff/1/?file=754282#file754282line220>
> >
> >     Could you please remove the trailing spaces?


Sure.


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/test/results/clientpositive/spark/stats1.q.out, line 182
> > <https://reviews.apache.org/r/27719/diff/1/?file=754283#file754283line182>
> >
> >     This seems slightly different from MR's output. I'm wondering if this 
> > is expected.

Xuefu, thank you for doing the code review. The Spark output is missing the 
stats data for one FileSinkOperator. I need to fix that.


On Nov. 7, 2014, 3:32 a.m., Na Yang wrote:
> > The original code is pretty much cloned from Tez, I'm wondering if Tez 
> > suffers the same problem.

We modified the remove-union code in Spark to remove the newly cloned 
FileSinkOperators from the fileSinkSet, so that multiple duplicate merge tasks 
are not generated. However, this left the stats flag missing from the cloned 
FileSinkOperators, which are the ones actually used in the SparkWork. My current 
patch adds the stats flag to only one of the cloned FileSinkOperators rather 
than to all of them, which causes the wrong output. I will reconsider the fix 
and update the patch accordingly. Thank you, Xuefu, for the code review!
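
To make the problem concrete, here is a minimal, self-contained sketch of the 
behaviour described above. It does not use the Hive classes: Sink, gatherStats, 
cloneTwice, flagFirstOnly and flagAll are hypothetical names invented for this 
illustration, and the assumption that the tree is cloned before the flag is set 
on the original is mine.

```java
import java.util.ArrayList;
import java.util.List;

public class CloneStatsSketch {
    // Hypothetical stand-in for a file sink operator (not Hive's FileSinkOperator).
    static class Sink {
        final String name;
        boolean gatherStats; // stand-in for the stats flag

        Sink(String name, boolean gatherStats) {
            this.name = name;
            this.gatherStats = gatherStats;
        }

        // A clone copies the flag value as it is at clone time.
        Sink copy(String newName) {
            return new Sink(newName, gatherStats);
        }
    }

    // Mimics a union branch split: one original sink becomes two clones.
    static List<Sink> cloneTwice(Sink original) {
        List<Sink> clones = new ArrayList<>();
        clones.add(original.copy(original.name + "_clone1"));
        clones.add(original.copy(original.name + "_clone2"));
        return clones;
    }

    // Sets the flag on the first clone only (the partial fix).
    static void flagFirstOnly(List<Sink> clones) {
        if (!clones.isEmpty()) {
            clones.get(0).gatherStats = true;
        }
    }

    // Sets the flag on every clone (the intended behaviour).
    static void flagAll(List<Sink> clones) {
        for (Sink s : clones) {
            s.gatherStats = true;
        }
    }

    static long countFlagged(List<Sink> sinks) {
        return sinks.stream().filter(s -> s.gatherStats).count();
    }

    public static void main(String[] args) {
        Sink original = new Sink("FS", false);
        List<Sink> clones = cloneTwice(original); // assumed: clone happens first
        original.gatherStats = true;              // flag lands on the original only

        System.out.println("after flagging original: " + countFlagged(clones));
        flagFirstOnly(clones);
        System.out.println("after partial fix:       " + countFlagged(clones));
        flagAll(clones);
        System.out.println("after full fix:          " + countFlagged(clones));
    }
}
```

In this sketch the clones are the objects that would actually run, so a flag 
set only on the original, or on just one clone, leaves at least one sink 
without stats. That matches the missing stats entry in stats1.q.out.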


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60294
-----------------------------------------------------------


On Nov. 7, 2014, 2:35 a.m., Na Yang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27719/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 2:35 a.m.)
> 
> 
> Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.
> 
> 
> Bugs: Hive-8756
>     https://issues.apache.org/jira/browse/Hive-8756
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> numRows and rawDataSize are not collected by the Spark stats. The cause is 
> that the stats config is not set on the FileSinkOperator in the ReduceWork. 
> In GenSparkUtils.removeUnionOperators, the operator tree is cloned, and a new 
> FileSinkOperator is generated and set on the reduce work. However, during 
> processFileSink, GenMapRedUtils.addStatsTask sets the collectStats tag on the 
> original FileSinkOperator, not on the new FileSinkOperator that is used in 
> the ReduceWork.
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties 79a0132 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7 
>   ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27719/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Na Yang
> 
>
