Re: Review Request 68121: HIVE-20220 : Incorrect result when hive.groupby.skewindata is enabled
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68121/
---

(Updated Aug. 10, 2018, 6:49 p.m.)

Review request for hive and Ashutosh Chauhan.

Repository: hive-git

Description
---

hive.groupby.skewindata uses the rand UDF to distribute group-by keys randomly across the reducers, which avoids overloading a single reducer when the data is skewed. This random distribution of keys breaks when a reducer fails to fetch the mapper output because of a faulty datanode or for any other reason. When the reducer finds that it cannot fetch the mapper output, it signals the Application Master to reattempt the corresponding map task. The reattempted map task gets different random values from the rand function, so the keys it distributes to each reducer are no longer the same as in the previous run.

Steps to reproduce:

create table test(id int);
insert into test values (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
SET hive.groupby.skewindata=true;
SET mapreduce.job.reduces=2;
-- Attach a debug port to the reducer.
select count(1) from test group by id;
-- While the query is suspended, remove the mapper's intermediate output file after the
-- map stage has completed and one of the two reduce tasks has finished, then continue
-- the run. This makes the second reducer send an event to the Application Master to
-- rerun the map task.

The expected result is:

1
2
3
4
5
6
8
8
9

But you may get a different result, because the rand function returns different values in the second run and the keys are therefore distributed differently. This needs to be fixed so that the mapper distributes the same keys even if it is reattempted multiple times.
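The direction of the fix suggested by the diff below (the actual change lives in PlanUtils.java and is not reproduced here) is to make the rand-based distribution stable across task re-attempts, for example by giving rand a seed that does not change when a map task is re-executed. The following standalone Java sketch uses made-up names and a hard-coded seed purely to illustrate why a stable seed removes the inconsistency:

import java.util.Arrays;
import java.util.Random;

public class SkewDistributionDemo {

  // Simulates one attempt of a map task assigning each row's key to a reducer.
  // Passing seed == null mimics today's unseeded rand(); a non-null seed mimics
  // a rand() seeded with a value that is stable across task re-attempts.
  static int[] assignReducers(int[] keys, int numReducers, Long seed) {
    Random rand = (seed == null) ? new Random() : new Random(seed);
    int[] assignment = new int[keys.length];
    for (int i = 0; i < keys.length; i++) {
      // Partitioning on rand(): the key value itself does not decide the reducer.
      assignment[i] = rand.nextInt(numReducers);
    }
    return assignment;
  }

  public static void main(String[] args) {
    int[] keys = {1, 2, 2, 3, 3, 3, 4, 4, 4, 4};

    // Two attempts of the "same" map task without a seed: the assignments
    // almost certainly differ, which is the bug described above.
    System.out.println(Arrays.equals(
        assignReducers(keys, 2, null), assignReducers(keys, 2, null)));     // usually false

    // Two attempts with the same fixed seed: the assignments are identical,
    // so a reattempted map task distributes the same keys as the first attempt.
    System.out.println(Arrays.equals(
        assignReducers(keys, 2, 12345L), assignReducers(keys, 2, 12345L))); // true
  }
}

Run as a plain Java program, the first comparison is almost always false while the second is always true: with a seed that survives re-execution, a reattempted map task feeds the reducers exactly the same rows as the original attempt.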
Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java 250a085084
  ql/src/test/results/clientpositive/autoColumnStats_7.q.out 8c07d61390
  ql/src/test/results/clientpositive/groupby1.q.out 70b39d887d
  ql/src/test/results/clientpositive/groupby10.q.out 665bf929b5
  ql/src/test/results/clientpositive/groupby11.q.out fd13d3a7bb
  ql/src/test/results/clientpositive/groupby1_map_skew.q.out 91bfc00648
  ql/src/test/results/clientpositive/groupby3.q.out aff57704d8
  ql/src/test/results/clientpositive/groupby4.q.out 976376118a
  ql/src/test/results/clientpositive/groupby5.q.out 7e9d928545
  ql/src/test/results/clientpositive/groupby6.q.out 5928da6411
  ql/src/test/results/clientpositive/groupby6_map_skew.q.out 71aa30c8bf
  ql/src/test/results/clientpositive/groupby7_map_skew.q.out 10a3ae48b7
  ql/src/test/results/clientpositive/groupby8.q.out ceb8a5b61a
  ql/src/test/results/clientpositive/groupby_cube1.q.out da87cbf4a2
  ql/src/test/results/clientpositive/groupby_rollup1.q.out 5f9d2d5691
  ql/src/test/results/clientpositive/groupby_sort_skew_1_23.q.out 1eddcf57c7
  ql/src/test/results/clientpositive/llap/explainuser_1.q.out a98191653f
  ql/src/test/results/clientpositive/llap/groupby1.q.out e1cc298415
  ql/src/test/results/clientpositive/llap/groupby2.q.out 434be1710d
  ql/src/test/results/clientpositive/llap/groupby3.q.out 896a2ba505
  ql/src/test/results/clientpositive/llap/groupby_resolution.q.out 11bb452135
  ql/src/test/results/clientpositive/llap/vector_groupby4.q.out 6912d7b80f
  ql/src/test/results/clientpositive/llap/vector_groupby6.q.out d3c654896a
  ql/src/test/results/clientpositive/llap/vector_groupby_cube1.q.out 5c0d6bbb73
  ql/src/test/results/clientpositive/llap/vector_groupby_grouping_id2.q.out dce2930943
  ql/src/test/results/clientpositive/llap/vector_groupby_rollup1.q.out d1f8ac5505
  ql/src/test/results/clientpositive/mapjoin_distinct.q.out 19ab4e5137
  ql/src/test/results/clientpositive/nullgroup.q.out 04b1ad967a
  ql/src/test/results/clientpositive/nullgroup2.q.out 60b6aefe49
  ql/src/test/results/clientpositive/spark/groupby1.q.out 802b453d26
  ql/src/test/results/clientpositive/spark/groupby1_map_skew.q.out cb909cb8c1
  ql/src/test/results/clientpositive/spark/groupby4.q.out 296d0a009e
  ql/src/test/results/clientpositive/spark/groupby5.q.out 22dacc5721
  ql/src/test/results/clientpositive/spark/groupby6.q.out d31c48353b
  ql/src/test/results/clientpositive/spark/groupby6_map_skew.q.out a694d55084
  ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out 3d23715eba
  ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 4a3a450d0c
  ql/src/test/results/clientpositive/spark/groupby_resolution.q.out a6e9b46afe
  ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out dea43e0a54
  ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 016cb3bcfd
  ql/src/test/results/clientpositive/spark/mapjoin_distinct.q.out e54e46a8ba
  ql/src/test/results/clientpositive/spark/nullgroup
Re: Review Request 68121: HIVE-20220 : Incorrect result when hive.groupby.skewindata is enabled
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68121/
---

(Updated July 31, 2018, 2:06 p.m.)

Review request for hive and Ashutosh Chauhan.

Repository: hive-git

Description (unchanged from the Aug. 10, 2018 update above)
---

Diffs (updated)
---

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 39c77b3fe5
  ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java 250a085084
  ql/src/test/queries/clientpositive/groupby_skew_rand_seed.q PRE-CREATION
  ql/src/test/results/clientpositive/groupby_skew_rand_seed.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/68121/diff/2/

Changes: https://reviews.apache.org/r/68121/diff/1-2/

Testing
---

Qtests added.

Thanks,

Ganesha Shreedhara