Hi,

We are using DISTRIBUTE BY with custom reducer scripts in our query workload.
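To make concrete what such a script depends on, here is a minimal sketch of a streaming reducer of this kind. It is illustrative only and is not our actual reducer.py: the function name reduce_stream, the tab-separated input format, and the choice of summing val2 per grp are all assumptions. The point is that the script emits one output row per run of consecutive rows with the same grp, so it gives one row per grp only if every row for that grp reaches the same reducer.

```python
from itertools import groupby


def reduce_stream(lines):
    """Yield one (grp, reduced_value) pair per run of consecutive rows sharing grp.

    Hypothetical reducer logic: each input line is assumed to be "grp<TAB>val2".
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for grp, run in groupby(rows, key=lambda r: r[0]):
        # Assumed reduction: sum the integer val2 column within the run.
        yield grp, sum(int(r[1]) for r in run)


if __name__ == "__main__":
    # In a real reducer script the rows would be read from sys.stdin.
    sample = ["a\t1\n", "a\t2\n", "b\t5\n"]

    # All rows for "a" arrive at one reducer: one output row per grp.
    print(list(reduce_stream(sample)))

    # Rows for "a" are split across two reducers: "a" appears twice.
    print(list(reduce_stream(sample[:1])) + list(reduce_stream(sample[1:])))
```

With the sample data, the first call yields a single row for "a", while the split case yields two rows for "a", one from each simulated reducer.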
After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY and custom reducer scripts produce incorrect results. In particular, rows with the same value in the DISTRIBUTE BY column end up in multiple reducers and thus produce multiple rows in the final result, where we expect only one.

I investigated a little and found the following behavior in Hive 0.11:

- Hive 0.11 produces a different plan for the queries that give incorrect results. The extra stage for the DISTRIBUTE BY + Transform is missing, and the Transform operator for the custom reducer script is pushed into the reduce operator tree that contains the GROUP BY itself.
- However, *if the SORT BY in the query has a DESC order in it*, the right plan is produced, and the results look correct too.

Hive 0.10 produces the expected plan with the right results in all cases.

To illustrate, here is a simplified repro setup:

Table:

    CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3 STRING, val4 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

Query:

    ADD FILE reducer.py;

    FROM (
      SELECT grp, val2
      FROM test_cluster
      GROUP BY grp, val2
      DISTRIBUTE BY grp
      SORT BY grp, val2  -- add DESC here to get correct results
    ) a
    REDUCE a.*
    USING 'reducer.py'
    AS grp, reducedValue

If I understand correctly, this is a bug. Is this a known issue? Any other insights?

We have reverted to Hive 0.10 to avoid the incorrect results while we investigate this.

I have the repro sample, with test data and scripts, if anybody is interested.

Thanks,
pala