I created a JIRA for this: https://issues.apache.org/jira/browse/HIVE-5149

On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <huaiyin....@gmail.com> wrote:

> Seems ReduceSinkDeDuplication picked the wrong partitioning columns.
>
>
> On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <s...@rocketfuel.com> wrote:
>
>> I think the problem lies within the group by operation. For this
>> optimization to work, the group by's partitioning should be on
>> column 1 only.
>>
>> It won't affect the correctness of the group by. It can make the
>> group by slower, but in this case it will speed up the overall query.
>>
>>
>> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com> wrote:
>>
>>> I have attached the Hive 0.10 and 0.11 query plans for the sample
>>> query below, for illustration.
>>>
>>>
>>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
>>> mchett...@rocketfuelinc.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>>>> workload.
>>>>
>>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY
>>>> and custom reducer scripts produced incorrect results. In particular, rows
>>>> with the same value in the DISTRIBUTE BY column end up in multiple reducers
>>>> and thus produce multiple rows in the final result, when we expect only one.
>>>>
>>>> I investigated a little bit and discovered the following behavior for
>>>> Hive 0.11:
>>>>
>>>> - Hive 0.11 produces a different plan for the queries with incorrect
>>>> results. The extra stage for the DISTRIBUTE BY + Transform is missing, and
>>>> the Transform operator for the custom reducer script is pushed into the
>>>> reduce operator tree containing the GROUP BY itself.
>>>>
>>>> - However, *if the SORT BY in the query has a DESC order in it*, the
>>>> right plan is produced, and the results look correct too.
>>>>
>>>> Hive 0.10 produces the expected plan with the right results in all cases.
>>>>
>>>>
>>>> To illustrate, here is a simplified repro setup:
>>>>
>>>> Table:
>>>>
>>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>>>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>>>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>>>
>>>> Query:
>>>>
>>>> ADD FILE reducer.py;
>>>>
>>>> FROM(
>>>>   SELECT grp, val2
>>>>   FROM test_cluster
>>>>   GROUP BY grp, val2
>>>>   DISTRIBUTE BY grp
>>>>   SORT BY grp, val2 -- add DESC here to get correct results
>>>> ) a
>>>>
>>>> REDUCE a.*
>>>> USING 'reducer.py'
>>>> AS grp, reducedValue
>>>>
>>>>
>>>> If I understand correctly, this is a bug. Is this a known issue? Any
>>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>>>> results while we investigate this.
>>>>
>>>> I have the repro sample, with test data and scripts, if anybody is
>>>> interested.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> pala
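
For anyone trying to picture the symptom in the original report, here is a
minimal Python sketch. It is not Hive code. It assumes a toy hash (sum of the
key bytes modulo the reducer count) and a hypothetical stand-in for
reducer.py that emits one (grp, sum of val2) row per grp it sees; neither
comes from this thread. It compares partitioning on grp alone with
partitioning on (grp, val2), which is one reading of the "wrong partitioning
columns" diagnosis quoted above.

```python
from collections import defaultdict


def partition(rows, key_cols, num_reducers):
    """Assign rows to reducers with a toy hash (NOT Hive's real hash)."""
    buckets = defaultdict(list)
    for row in rows:
        key = "\x01".join(str(row[c]) for c in key_cols)
        # Toy hash: sum of the key's bytes, modulo the number of reducers.
        buckets[sum(key.encode("utf-8")) % num_reducers].append(row)
    return buckets


def reduce_one_row_per_grp(rows):
    """Hypothetical stand-in for reducer.py: one (grp, sum(val2)) per grp."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["grp"]] += row["val2"]
    return sorted(totals.items())


def run(rows, key_cols, num_reducers=4):
    """Partition on key_cols, run the reducer per bucket, merge the output."""
    out = []
    for reducer_rows in partition(rows, key_cols, num_reducers).values():
        out.extend(reduce_one_row_per_grp(reducer_rows))
    return sorted(out)


# Toy data: two groups, eight distinct val2 values each.
rows = [{"grp": g, "val2": v} for g in ("a", "b") for v in range(8)]

expected = run(rows, ["grp"])         # what DISTRIBUTE BY grp asks for
suspect = run(rows, ["grp", "val2"])  # partitioning on the GROUP BY keys

print("partitioned on grp:        ", expected)
print("partitioned on (grp, val2):", suspect)
```

With this toy data, the grp-only run yields one row per grp, while the
(grp, val2) run yields several rows per grp, which has the same shape as the
incorrect results reported for Hive 0.11.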