> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 220
> > <https://reviews.apache.org/r/27719/diff/1/?file=754282#file754282line220>
> >
> >     Could you please remove the trailing spaces?

Sure.


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/test/results/clientpositive/spark/stats1.q.out, line 182
> > <https://reviews.apache.org/r/27719/diff/1/?file=754283#file754283line182>
> >
> >     This seems slightly different from MR's output. I'm wondering if this is expected.

Xuefu, thank you for doing the code review. The Spark output is missing the stats data for one FileSinkOperator. I need to fix that.


On Nov. 7, 2014, 3:32 a.m., Na Yang wrote:
> > The original code is pretty much cloned from Tez, I'm wondering if Tez suffers the same problem.

We modified the remove-union code in Spark to remove the newly cloned FileSinkOperators from the fileSinkSet, so that duplicate merge tasks are not generated. However, this left the stats flag missing from the cloned FileSinkOperators, which are the ones actually used in the SparkWork. My current patch adds the stats flag to only one of the cloned FileSinkOperators, not to all of them, and that causes the wrong output. I will reconsider the fix and update the patch accordingly.

Thank you Xuefu for the code review!


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60294
-----------------------------------------------------------


On Nov. 7, 2014, 2:35 a.m., Na Yang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27719/
> -----------------------------------------------------------
>
> (Updated Nov. 7, 2014, 2:35 a.m.)
>
>
> Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.
>
>
> Bugs: Hive-8756
>     https://issues.apache.org/jira/browse/Hive-8756
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> numRows and rawDataSize are not collected by the Spark stats.
> That is because the FileSinkOperator in the ReduceWork does not have the
> stats config set. In GenSparkUtils.removeUnionOperators, the operator tree
> gets cloned, and a new FileSinkOperator is generated and set on the reduce
> work. However, during processFileSink, GenMapRedUtils.addStatsTask sets the
> collectStats flag on the original FileSinkOperator, not on the new
> FileSinkOperator that is used in the ReduceWork.
>
>
> Diffs
> -----
>
>   itests/src/test/resources/testconfiguration.properties 79a0132
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7
>   ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION
>
> Diff: https://reviews.apache.org/r/27719/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Na Yang
>
>
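
The clone problem described in this thread can be sketched in plain Java. This is only an illustration under stated assumptions: `Sink`, `copyUnflagged`, `flagOriginalOnly`, and `flagOriginalAndClones` are hypothetical names invented for the example, not Hive classes or methods, and the second method is one possible direction rather than the fix that was committed.

```java
import java.util.ArrayList;
import java.util.List;

class CloneStatsSketch {
    /** Hypothetical stand-in for a file sink operator; NOT Hive's FileSinkOperator. */
    static final class Sink {
        final String name;
        boolean collectStats; // stand-in for the stats flag discussed in the thread

        Sink(String name) { this.name = name; }

        /** Models cloning that happens before the flag is set: the copy starts unflagged. */
        Sink copyUnflagged(String newName) { return new Sink(newName); }
    }

    /** Flags only the original; clones that are actually used stay unflagged. */
    static void flagOriginalOnly(Sink original) {
        original.collectStats = true;
    }

    /** Flags the original and every clone made from it (assumed remedy, for illustration). */
    static void flagOriginalAndClones(Sink original, List<Sink> clones) {
        original.collectStats = true;
        for (Sink clone : clones) {
            clone.collectStats = true;
        }
    }

    /** Counts how many of the given sinks would collect stats. */
    static long countFlagged(List<Sink> sinks) {
        return sinks.stream().filter(s -> s.collectStats).count();
    }

    public static void main(String[] args) {
        Sink original = new Sink("FS_original");
        List<Sink> usedInWork = new ArrayList<>();
        usedInWork.add(original.copyUnflagged("FS_clone_1"));
        usedInWork.add(original.copyUnflagged("FS_clone_2"));

        flagOriginalOnly(original);
        System.out.println("clones flagged (original only): " + countFlagged(usedInWork));

        flagOriginalAndClones(original, usedInWork);
        System.out.println("clones flagged (propagated): " + countFlagged(usedInWork));
    }
}
```

Flagging only some of the clones, as the first patch did, falls between these two cases: some of the sinks that actually run still collect no stats.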