Forgot to add in my last reply: to generate correct results, you can set hive.optimize.reducededuplication to false, which turns off ReduceSinkDeDuplication.
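
A minimal sketch of applying that workaround, assuming the property is set at the session level right before the affected query. The table, script name, and query are taken from the repro in the original report quoted in this thread; nothing else is assumed about the setup.

```sql
-- Workaround suggested in this thread: disable ReduceSinkDeDuplication
-- for the current session only (assumption: a session-level SET is acceptable).
SET hive.optimize.reducededuplication=false;

ADD FILE reducer.py;

-- Repro query from the original report, unchanged apart from layout.
FROM (
  SELECT grp, val2
  FROM test_cluster
  GROUP BY grp, val2
  DISTRIBUTE BY grp
  SORT BY grp, val2
) a
REDUCE a.*
USING 'reducer.py'
AS grp, reducedValue;
```

Prefixing the query with EXPLAIN before and after changing the setting should show whether the separate stage for DISTRIBUTE BY plus the Transform operator is present in the plan.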

On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai <huaiyin....@gmail.com> wrote:

> Created a jira https://issues.apache.org/jira/browse/HIVE-5149
>
>
> On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <huaiyin....@gmail.com> wrote:
>
>> Seems ReduceSinkDeDuplication picked the wrong partitioning columns.
>>
>>
>> On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <s...@rocketfuel.com> wrote:
>>
>>> I think the problem lies within the group by operation. For this
>>> optimization to work, the group by's partitioning should be on
>>> column 1 only.
>>>
>>> It won't affect the correctness of the group by. It can make the
>>> group by slower, but in this case it will speed up the overall query.
>>>
>>>
>>> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>>>
>>>> I have attached the Hive 0.10 and 0.11 query plans, for the sample
>>>> query below, for illustration.
>>>>
>>>>
>>>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>>>>> workload.
>>>>>
>>>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT
>>>>> BY and custom reducer scripts produced incorrect results. In particular,
>>>>> rows with the same value in the DISTRIBUTE BY column end up in multiple
>>>>> reducers and thus produce multiple rows in the final result, when we
>>>>> expect only one.
>>>>>
>>>>> I investigated a little bit and discovered the following behavior for
>>>>> Hive 0.11:
>>>>>
>>>>> - Hive 0.11 produces a different plan for the queries that return
>>>>> incorrect results. The extra stage for the DISTRIBUTE BY + Transform is
>>>>> missing, and the Transform operator for the custom reducer script is
>>>>> pushed into the reduce operator tree containing the GROUP BY itself.
>>>>>
>>>>> - However, *if the SORT BY in the query has a DESC order in it*, the
>>>>> right plan is produced, and the results look correct too.
>>>>>
>>>>> Hive 0.10 produces the expected plan with the right results in all cases.
>>>>>
>>>>>
>>>>> To illustrate, here is a simplified repro setup:
>>>>>
>>>>> Table:
>>>>>
>>>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>>>>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>>>>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>>>>
>>>>> Query:
>>>>>
>>>>> ADD FILE reducer.py;
>>>>>
>>>>> FROM(
>>>>>   SELECT grp, val2
>>>>>   FROM test_cluster
>>>>>   GROUP BY grp, val2
>>>>>   DISTRIBUTE BY grp
>>>>>   SORT BY grp, val2 -- add DESC here to get correct results
>>>>> ) a
>>>>>
>>>>> REDUCE a.*
>>>>> USING 'reducer.py'
>>>>> AS grp, reducedValue
>>>>>
>>>>>
>>>>> If I understand correctly, this is a bug. Is this a known issue? Any
>>>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>>>>> results while we investigate this.
>>>>>
>>>>> I have the repro sample, with test data and scripts, if anybody is
>>>>> interested.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> pala
>>>>>
>>>>
>>>
>>
>