Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Na Yang Fri, 07 Nov 2014 13:16:35 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/
-----------------------------------------------------------


(Updated Nov. 7, 2014, 9:16 p.m.)


Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.


Changes
-------

1. removed whitespace characters
2. handle operators which have multiple children
3. update stats config info for all cloned FileSinkOperators


Bugs: Hive-8756
    https://issues.apache.org/jira/browse/Hive-8756


Repository: hive-git


Description
-------

numRows and rawDataSize are not collected by the Spark stats. This happens 
because the stats config is not set on the FileSinkOperator in the ReduceWork. 
In GenSparkUtils.removeUnionOperators, the operator tree is cloned, and a new 
FileSinkOperator is generated and set on the reduce work. However, during 
processFileSink, GenMapRedUtils.addStatsTask sets the collectStats flag on the 
original FileSinkOperator, not on the new FileSinkOperator that is used in the 
ReduceWork.
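
To make the failure mode concrete, here is a minimal, self-contained sketch in 
plain Java. It is not the Hive code and not the patch itself: the Op class, its 
collectStats field, cloneTree, and copyStatsFlags are hypothetical stand-ins 
for the operator tree and the stats config, and it assumes the clone is taken 
before the flag is set on the original. It shows why the clone misses the flag, 
and one possible way to propagate it by walking both trees in parallel, 
including operators with multiple children.

```java
import java.util.ArrayList;
import java.util.List;

class StatsFlagPropagation {

    /** Hypothetical stand-in for an operator; not Hive's Operator class. */
    static class Op {
        final String name;
        final boolean isFileSink;
        boolean collectStats; // stands in for the stats config on a file sink
        final List<Op> children = new ArrayList<>();

        Op(String name, boolean isFileSink) {
            this.name = name;
            this.isFileSink = isFileSink;
        }

        Op addChild(Op child) {
            children.add(child);
            return this;
        }

        /** Deep-copies this subtree, preserving child order. */
        Op cloneTree() {
            Op copy = new Op(name, isFileSink);
            copy.collectStats = collectStats;
            for (Op child : children) {
                copy.children.add(child.cloneTree());
            }
            return copy;
        }
    }

    /**
     * Walks the original and cloned trees in parallel and copies the flag
     * from every original file sink to its clone. Assumes both trees have
     * the same shape, which holds for a tree produced by cloneTree().
     */
    static void copyStatsFlags(Op original, Op clone) {
        if (original.isFileSink && clone.isFileSink) {
            clone.collectStats = original.collectStats;
        }
        for (int i = 0; i < original.children.size(); i++) {
            copyStatsFlags(original.children.get(i), clone.children.get(i));
        }
    }

    public static void main(String[] args) {
        Op sinkA = new Op("FS_A", true);
        Op sinkB = new Op("FS_B", true);
        Op root = new Op("SEL", false).addChild(sinkA).addChild(sinkB);

        Op cloned = root.cloneTree(); // the clone is what would actually run

        // The flag is set afterwards, and only on the original sinks.
        sinkA.collectStats = true;
        sinkB.collectStats = true;

        System.out.println(cloned.children.get(0).collectStats); // false
        copyStatsFlags(root, cloned);
        System.out.println(cloned.children.get(0).collectStats); // true
        System.out.println(cloned.children.get(1).collectStats); // true
    }
}
```

The parallel walk is only one option; keeping a map from each original sink to 
its clones would work as well and avoids relying on identical tree shapes.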


Diffs (updated)
-----

  itests/src/test/resources/testconfiguration.properties 79a0132 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7 
  ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out 8d237c5 
  ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 4946815 
  ql/src/test/results/clientpositive/spark/semijoin.q.out 9b6802d 
  ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/27719/diff/


Testing
-------


Thanks,

Na Yang
