It seems ReduceSinkDeDuplication picked the wrong partitioning columns.
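
To make the symptom concrete, here is a minimal Python sketch (not Hive code) of what happens when rows are partitioned on the GROUP BY keys (grp, val2) instead of the DISTRIBUTE BY key (grp). It assumes reducers are chosen by hashing the partitioning columns modulo the number of reducers. The function names are hypothetical, and crc32 is only a stand-in for whatever hash Hive actually uses.

```python
import zlib
from collections import defaultdict


def reducer_for(key_columns, num_reducers):
    """Pick a reducer by hashing the partitioning columns (stable across runs)."""
    key = "\x01".join(str(c) for c in key_columns).encode("utf-8")
    return zlib.crc32(key) % num_reducers


def reducers_per_group(rows, partition_cols, num_reducers):
    """Map each grp value to the set of reducers that receive its rows."""
    seen = defaultdict(set)
    for row in rows:
        key = [row[c] for c in partition_cols]
        seen[row["grp"]].add(reducer_for(key, num_reducers))
    return dict(seen)


# Toy data shaped like the output of GROUP BY grp, val2 in the repro query.
rows = [{"grp": g, "val2": v} for g in ("a", "b", "c") for v in range(20)]

# Partitioning on grp alone: every grp value lands on exactly one reducer.
by_grp = reducers_per_group(rows, ["grp"], num_reducers=4)

# Partitioning on (grp, val2): rows of one grp can be spread over several reducers.
by_grp_val2 = reducers_per_group(rows, ["grp", "val2"], num_reducers=4)

if __name__ == "__main__":
    print("partitioned on grp:       ", by_grp)
    print("partitioned on grp, val2: ", by_grp_val2)
```

With the second partitioning, a reducer script that emits one row per grp it sees would emit several rows for the same grp, which matches the duplicate rows reported below.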

On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <s...@rocketfuel.com> wrote:

> I think the problem lies within the GROUP BY operation. For this
> optimization to work, the GROUP BY's partitioning should be on column 1
> only.
>
> It won't affect the correctness of the GROUP BY. It can make the GROUP BY
> slower, but in this case it will speed up the overall query.
>
>
> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> I have attached the hive 10 and 11 query plans, for the sample query
>> below, for illustration.
>>
>>
>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com> wrote:
>>
>>> Hi,
>>>
>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>>> workload.
>>>
>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY
>>> and custom reducer scripts produce incorrect results. In particular, rows
>>> with the same value in the DISTRIBUTE BY column end up in multiple reducers
>>> and thus produce multiple rows in the final result, when we expect only one.
>>>
>>> I investigated a little bit and discovered the following behavior for
>>> Hive 0.11:
>>>
>>> - For the queries with incorrect results, Hive 0.11 produces a different
>>> plan. The extra stage for the DISTRIBUTE BY + Transform is missing, and
>>> the Transform operator for the custom reducer script is pushed into the
>>> reduce operator tree that contains the GROUP BY itself.
>>>
>>> - However, *if the SORT BY in the query has a DESC order in it*, the
>>> right plan is produced, and the results look correct too.
>>>
>>> Hive 0.10 produces the expected plan with right results in all cases.
>>>
>>>
>>> To illustrate, here is a simplified repro setup:
>>>
>>> Table:
>>>
>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>>
>>> Query:
>>>
>>> ADD FILE reducer.py;
>>>
>>> FROM (
>>>   SELECT grp, val2
>>>   FROM test_cluster
>>>   GROUP BY grp, val2
>>>   DISTRIBUTE BY grp
>>>   SORT BY grp, val2  -- add DESC here to get correct results
>>> ) a
>>>
>>> REDUCE a.*
>>> USING 'reducer.py'
>>> AS grp, reducedValue
>>>
>>>
>>> If I understand correctly, this is a bug. Is this a known issue? Any
>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>>> results while we investigate this.
>>>
>>> I have the repro sample, with test data and scripts, if anybody is
>>> interested.
>>>
>>>
>>>
>>> Thanks,
>>> pala
>>>
>>
>>
>
