It seems ReduceSinkDeDuplication picked the wrong partitioning columns.
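To illustrate why the choice of partitioning columns matters for the report quoted below, here is a minimal Python sketch. It assumes the merged reduce stage ended up partitioning on (grp, val2) rather than on grp alone, which is one reading of the symptom and not something confirmed from the plans. `toy_partition` and `reducers_per_group` are hypothetical names, and the hash is a toy stand-in, not Hive's actual hash function.

```python
from collections import defaultdict

def toy_partition(key_values, num_reducers):
    """Toy stand-in for a hash partitioner (NOT Hive's real hash function)."""
    return sum(ord(ch) for value in key_values for ch in str(value)) % num_reducers

def reducers_per_group(rows, key_columns, num_reducers=2):
    """Map each grp value to the set of reducers that receive its rows."""
    seen = defaultdict(set)
    for row in rows:
        key = [row[col] for col in key_columns]
        seen[row["grp"]].add(toy_partition(key, num_reducers))
    return dict(seen)

# Tiny made-up sample, only for illustration.
rows = [
    {"grp": "a", "val2": 1},
    {"grp": "a", "val2": 2},
    {"grp": "b", "val2": 1},
]

# Partitioning on grp alone: every grp lands on exactly one reducer.
by_grp = reducers_per_group(rows, ["grp"])

# Partitioning on (grp, val2): rows for grp "a" are split across reducers.
by_grp_val2 = reducers_per_group(rows, ["grp", "val2"])
```

With partitioning on grp alone, a reducer script sees all rows for a given grp. With the wider key, the same grp can reach several reducers, which would match the duplicate output rows described in the report.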
On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <s...@rocketfuel.com> wrote:

> I think the problem lies within the group by operation. For this
> optimization to work, the group by's partitioning should be on column 1
> only.
>
> It won't affect the correctness of the group by. It can make the group by
> slower, but in this case it will speed up the overall query.
>
>
> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> I have attached the Hive 0.10 and 0.11 query plans for the sample query
>> below, for illustration.
>>
>>
>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com> wrote:
>>
>>> Hi,
>>>
>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>>> workload.
>>>
>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY
>>> and custom reducer scripts produced incorrect results. In particular, rows
>>> with the same value in the DISTRIBUTE BY column end up in multiple reducers
>>> and thus produce multiple rows in the final result, when we expect only one.
>>>
>>> I investigated a little and discovered the following behavior for
>>> Hive 0.11:
>>>
>>> - Hive 0.11 produces a different plan for the queries that return
>>> incorrect results. The extra stage for the DISTRIBUTE BY + Transform is
>>> missing, and the Transform operator for the custom reducer script is
>>> pushed into the reduce operator tree that contains the GROUP BY itself.
>>>
>>> - However, *if the SORT BY in the query has a DESC order in it*, the
>>> right plan is produced, and the results look correct too.
>>>
>>> Hive 0.10 produces the expected plan with the right results in all cases.
>>>
>>>
>>> To illustrate, here is a simplified repro setup:
>>>
>>> Table:
>>>
>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>>
>>> Query:
>>>
>>> ADD FILE reducer.py;
>>>
>>> FROM (
>>>   SELECT grp, val2
>>>   FROM test_cluster
>>>   GROUP BY grp, val2
>>>   DISTRIBUTE BY grp
>>>   SORT BY grp, val2 -- add DESC here to get correct results
>>> ) a
>>>
>>> REDUCE a.*
>>> USING 'reducer.py'
>>> AS grp, reducedValue
>>>
>>>
>>> If I understand correctly, this is a bug. Is this a known issue? Any
>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>>> results while we investigate this.
>>>
>>> I have the repro sample, with test data and scripts, if anybody is
>>> interested.
>>>
>>>
>>>
>>> Thanks,
>>> pala
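
The thread does not include reducer.py itself, so the following is only a hypothetical sketch of the kind of reducer script that would expose the problem. It assumes the script sums val2 per grp and relies on all rows for a grp arriving together on one reducer, in tab-separated form (the default format Hive uses when streaming rows to a script). `reduce_lines` is an illustrative name, not taken from the original script.

```python
from itertools import groupby

def reduce_lines(lines):
    """Sum val2 per run of consecutive rows that share the same grp.

    Assumes tab-separated "grp<TAB>val2" input lines.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    out = []
    for grp, group_rows in groupby(parsed, key=lambda cols: cols[0]):
        total = sum(int(cols[1]) for cols in group_rows)
        out.append(f"{grp}\t{total}")
    return out

# One reducer sees every row for "a": one output row per grp.
whole = reduce_lines(["a\t1\n", "a\t2\n", "b\t5\n"])

# Rows for "a" split across two reducers: "a" is emitted twice overall.
split = reduce_lines(["a\t1\n"]) + reduce_lines(["a\t2\n", "b\t5\n"])
```

A real script would pass `sys.stdin` to `reduce_lines` and print each returned line. The second call pattern mimics the reported symptom: when rows for one grp are spread over several reducers, each reducer emits its own row for that grp.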