Thanks for following up, Yin. We later realized this was caused by the reduce deduplication optimization (ReduceSinkDeDuplication), and found that turning off the flag avoids the issue.
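
For anyone who hits the same problem, here is a minimal sketch of the workaround applied to the repro query quoted below. The property name is the one Yin gave; setting it at the session level right before the query is an assumption for illustration (it could equally be set in a script or in hive-site.xml).

```sql
-- Workaround sketch: turn off ReduceSinkDeDuplication for this session.
-- Assumption: a session-level SET placed before the affected query.
SET hive.optimize.reducededuplication=false;

ADD FILE reducer.py;

-- Same repro query as in the original report, unchanged.
FROM (
  SELECT grp, val2
  FROM test_cluster
  GROUP BY grp, val2
  DISTRIBUTE BY grp
  SORT BY grp, val2
) a
REDUCE a.*
USING 'reducer.py'
AS grp, reducedValue;
```

Running EXPLAIN on the query before and after the SET is one way to check whether the separate stage for the DISTRIBUTE BY + Transform shows up in the plan again.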
-pala

On Mon, Aug 26, 2013 at 4:40 AM, Yin Huai <huaiyin....@gmail.com> wrote:

> forgot to add in my last reply.... To generate correct results, you can
> set hive.optimize.reducededuplication to false to turn off
> ReduceSinkDeDuplication
>
>
> On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai <huaiyin....@gmail.com> wrote:
>
> > Created a jira https://issues.apache.org/jira/browse/HIVE-5149
> >
> >
> > On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <huaiyin....@gmail.com> wrote:
> >
> >> Seems ReduceSinkDeDuplication picked the wrong partitioning columns.
> >>
> >>
> >> On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <s...@rocketfuel.com> wrote:
> >>
> >>> I think the problem lies within the group by operation. For this
> >>> optimization to work, the group by's partitioning should be on
> >>> column 1 only.
> >>>
> >>> It won't affect the correctness of the group by; it can make it
> >>> slower, but in this case it will speed up the overall query
> >>> performance.
> >>>
> >>>
> >>> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
> >>>
> >>>> I have attached the hive 10 and 11 query plans, for the sample query
> >>>> below, for illustration.
> >>>>
> >>>>
> >>>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
> >>>>> workload.
> >>>>>
> >>>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT
> >>>>> BY and custom reducer scripts produced incorrect results. In
> >>>>> particular, rows with the same value in the DISTRIBUTE BY column end
> >>>>> up in multiple reducers and thus produce multiple rows in the final
> >>>>> result, when we expect only one.
> >>>>>
> >>>>> I investigated a little bit and discovered the following behavior for
> >>>>> Hive 0.11:
> >>>>>
> >>>>> - Hive 0.11 produces a different plan for these queries with incorrect
> >>>>> results. The extra stage for the DISTRIBUTE BY + Transform is missing
> >>>>> and the Transform operator for the custom reducer script is pushed
> >>>>> into the reduce operator tree containing GROUP BY itself.
> >>>>>
> >>>>> - However, *if the SORT BY in the query has a DESC order in it*, the
> >>>>> right plan is produced, and the results look correct too.
> >>>>>
> >>>>> Hive 0.10 produces the expected plan with the right results in all
> >>>>> cases.
> >>>>>
> >>>>>
> >>>>> To illustrate, here is a simplified repro setup:
> >>>>>
> >>>>> Table:
> >>>>>
> >>>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
> >>>>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
> >>>>> TERMINATED BY '\n' STORED AS TEXTFILE;
> >>>>>
> >>>>> Query:
> >>>>>
> >>>>> ADD FILE reducer.py;
> >>>>>
> >>>>> FROM(
> >>>>>   SELECT grp, val2
> >>>>>   FROM test_cluster
> >>>>>   GROUP BY grp, val2
> >>>>>   DISTRIBUTE BY grp
> >>>>>   SORT BY grp, val2 -- add DESC here to get correct results
> >>>>> ) a
> >>>>>
> >>>>> REDUCE a.*
> >>>>> USING 'reducer.py'
> >>>>> AS grp, reducedValue
> >>>>>
> >>>>>
> >>>>> If I understand correctly, this is a bug. Is this a known issue? Any
> >>>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
> >>>>> results while we investigate this.
> >>>>>
> >>>>> I have the repro sample, with test data and scripts, if anybody is
> >>>>> interested.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> pala