Hi,

We are using DISTRIBUTE BY with custom reducer scripts in our query
workload.

After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY and
custom reducer scripts produced incorrect results. In particular, rows with
the same value in the DISTRIBUTE BY column end up in multiple reducers and
thus produce multiple rows in the final result, where we expect only one.

I investigated a little bit and discovered the following behavior for Hive
0.11:

- For the queries with incorrect results, Hive 0.11 produces a different
plan: the extra stage for the DISTRIBUTE BY + Transform is missing, and
the Transform operator for the custom reducer script is pushed into the
reduce operator tree containing the GROUP BY itself.

- However, *if the SORT BY in the query has a DESC order in it*, the right
plan is produced, and the results look correct too.

Hive 0.10 produces the expected plan with the right results in all cases.


To illustrate, here is a simplified repro setup:

Table:

CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3 STRING,
val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY
'\n' STORED AS TEXTFILE;

Query:

ADD FILE reducer.py;

FROM (
  SELECT grp, val2
  FROM test_cluster
  GROUP BY grp, val2
  DISTRIBUTE BY grp
  SORT BY grp, val2  -- add DESC here to get correct results
) a
REDUCE a.*
USING 'reducer.py'
AS grp, reducedValue
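
To make the symptom concrete, here is a minimal Python sketch of the kind of
logic a reducer script typically relies on. It is not the actual reducer.py
from the repro: the function name reduce_rows and the row-count reduction are
hypothetical, and it assumes tab-separated input rows (Hive's default format
for REDUCE/TRANSFORM scripts) with all rows for a given grp arriving
contiguously on one reducer.

```python
from itertools import groupby


def reduce_rows(lines):
    """Yield one (grp, reducedValue) pair per contiguous run of equal grp.

    Hypothetical stand-in for the reducer logic; the reduction here is just
    a count of the rows seen for each grp.
    """
    # Split each non-empty line into its tab-separated columns.
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only merges adjacent rows, so it depends on the input being
    # distributed and sorted by grp.
    for grp, run in groupby(rows, key=lambda r: r[0]):
        yield grp, str(sum(1 for _ in run))


# All rows for grp "a" reach the same reducer: one output row per grp.
one_reducer = list(reduce_rows(["a\t1\n", "a\t2\n", "b\t1\n"]))

# Same rows, but grp "a" is split across two reducers: "a" is emitted twice.
split_reducers = (list(reduce_rows(["a\t1\n", "b\t1\n"]))
                  + list(reduce_rows(["a\t2\n"])))
```

With everything on one reducer the sketch emits a single row for "a"; once
the rows for "a" are split across two reducers, each reducer emits its own
row for "a", which matches the duplicate rows described above.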


If I understand correctly, this is a bug. Is this a known issue? Any other
insights? We have reverted to Hive 0.10 to avoid the incorrect results
while we investigate this.

I have the repro sample, with test data and scripts, if anybody is
interested.



Thanks,
pala
