I think the problem lies within the GROUP BY operation. For this
optimization to work, the GROUP BY's partitioning should be on column 1
(grp in your example) only.

That won't affect the correctness of the GROUP BY. It can make the GROUP BY
itself slower, but in this case it will speed up the overall query.
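
To make the partitioning point concrete, here is a toy Python sketch (not
Hive code). The hash function, the two-reducer setup, and the sample rows are
all made up for illustration; Hive's real partitioner will differ. It only
shows why partitioning on (grp, val2) can send rows with the same grp to
different reducers, while partitioning on grp alone cannot.

```python
from collections import defaultdict


def toy_hash(fields):
    # Deterministic stand-in for a partition hash.
    # This is an assumption for illustration, not Hive's actual hash function.
    return sum(ord(ch) for field in fields for ch in str(field))


def reducers_per_group(rows, key_cols, num_reducers):
    """Map each grp value (column 0) to the set of reducers its rows land on."""
    seen = defaultdict(set)
    for row in rows:
        key = [row[i] for i in key_cols]
        seen[row[0]].add(toy_hash(key) % num_reducers)
    return dict(seen)


# Hypothetical (grp, val2) rows, as they would look after the GROUP BY.
rows = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]

# Partitioning on both GROUP BY columns: one grp can span several reducers.
both = reducers_per_group(rows, key_cols=(0, 1), num_reducers=2)

# Partitioning on grp only: every grp stays on exactly one reducer.
grp_only = reducers_per_group(rows, key_cols=(0,), num_reducers=2)

print("partition on (grp, val2):", both)
print("partition on grp only:   ", grp_only)
```

With these sample rows, each grp is spread over both reducers in the first
case and pinned to a single reducer in the second, which is what the custom
reducer script needs in order to emit one row per grp.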


On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> I have attached the hive 10 and 11 query plans, for the sample query
> below, for illustration.
>
>
> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>> workload.
>>
>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY
>> and custom reducer scripts produce incorrect results. In particular, rows
>> with the same value in the DISTRIBUTE BY column end up in multiple reducers
>> and thus produce multiple rows in the final result, when we expect only one.
>>
>> I investigated a little bit and discovered the following behavior for
>> Hive 0.11:
>>
>> - Hive 0.11 produces a different plan for these queries with incorrect
>> results. The extra stage for the DISTRIBUTE BY + Transform is missing and
>> the Transform operator for the custom reducer script is pushed into the
>> reduce operator tree containing GROUP BY itself.
>>
>> - However, *if the SORT BY in the query has a DESC order in it*, the
>> right plan is produced, and the results look correct too.
>>
>> Hive 0.10 produces the expected plan with the right results in all cases.
>>
>>
>> To illustrate, here is a simplified repro setup:
>>
>> Table:
>>
>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>
>> Query:
>>
>> ADD FILE reducer.py;
>>
>> FROM (
>>   SELECT grp, val2
>>   FROM test_cluster
>>   GROUP BY grp, val2
>>   DISTRIBUTE BY grp
>>   SORT BY grp, val2  -- add DESC here to get correct results
>> ) a
>>
>> REDUCE a.*
>> USING 'reducer.py'
>> AS grp, reducedValue
>>
>>
>> If I understand correctly, this is a bug. Is this a known issue? Any
>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>> results while we investigate this.
>>
>> I have the repro sample, with test data and scripts, if anybody is
>> interested.
>>
>>
>>
>> Thanks,
>> pala
>>
>
>
