I think the problem lies within the GROUP BY operation. For this optimization to work, the GROUP BY's partitioning should be on the first column only (grp in your example).
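To make the partitioning point concrete, here is a small sketch in Python, not Hive code. It assumes two reducers and uses a toy hash (`toy_hash` is a made-up stand-in, not Hive's actual partitioner) to show how rows with the same grp can land on different reducers when the partition key is (grp, val2), but stay together when the key is grp alone.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # assumed reducer count for the illustration


def toy_hash(key_parts):
    """Deterministic stand-in for a partitioner hash (not Hive's real one)."""
    return sum(ord(ch) for part in key_parts for ch in str(part))


def reducers_per_group(rows, partition_cols):
    """Map each grp value to the set of reducers that receive its rows."""
    seen = defaultdict(set)
    for row in rows:
        key = [row[col] for col in partition_cols]
        seen[row["grp"]].add(toy_hash(key) % NUM_REDUCERS)
    return dict(seen)


# Hypothetical sample rows, shaped like the output of GROUP BY grp, val2.
rows = [
    {"grp": "a", "val2": 1},
    {"grp": "a", "val2": 2},
    {"grp": "b", "val2": 1},
]

# Partitioning on all GROUP BY keys: grp "a" is split across reducers.
by_group_keys = reducers_per_group(rows, ["grp", "val2"])
# Partitioning on grp only: every grp maps to exactly one reducer.
by_grp_only = reducers_per_group(rows, ["grp"])

print(by_group_keys)
print(by_grp_only)
```

If the reducer script runs in the same reduce stage as the GROUP BY, it sees whatever partitioning that stage used, so a split like the first case would give one output row per reducer for the same grp.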
Partitioning the GROUP BY on the first column alone won't affect its correctness. It can make the GROUP BY itself slower, but in this case it should speed up the overall query.

On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> I have attached the hive 10 and 11 query plans, for the sample query
> below, for illustration.
>
>
> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>> workload.
>>
>> After upgrade to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY
>> and custom reducer scripts produced incorrect results. Particularly, rows
>> with same value on DISTRIBUTE BY column ends up in multiple reducers and
>> thus produce multiple rows in final result, when we expect only one.
>>
>> I investigated a little bit and discovered the following behavior for
>> Hive 0.11:
>>
>> - Hive 0.11 produces a different plan for these queries with incorrect
>> results. The extra stage for the DISTRIBUTE BY + Transform is missing and
>> the Transform operator for the custom reducer script is pushed into the
>> reduce operator tree containing GROUP BY itself.
>>
>> - However, *if the SORT BY in the query has a DESC order in it*, the
>> right plan is produced, and the results look correct too.
>>
>> Hive 0.10 produces the expected plan with right results in all cases.
>>
>>
>> To illustrate, here is a simplified repro setup:
>>
>> Table:
>>
>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>
>> Query:
>>
>> ADD FILE reducer.py;
>>
>> FROM(
>>   SELECT grp, val2
>>   FROM test_cluster
>>   GROUP BY grp, val2
>>   DISTRIBUTE BY grp
>>   SORT BY grp, val2 -- add DESC here to get correct results
>> ) a
>>
>> REDUCE a.*
>> USING 'reducer.py'
>> AS grp, reducedValue
>>
>>
>> If i understand correctly, this is a bug. Is this a known issue? Any
>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>> results while we investigate this.
>>
>> I have the repro sample, with test data and scripts, if anybody is
>> interested.
>>
>>
>>
>> Thanks,
>> pala
>>
>