[ https://issues.apache.org/jira/browse/PIG-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1169:
------------------------------

    Description:
??We tried to get top N results after a groupby and sort, and got different results with or without storing the full sorted results. Here is a skeleton of our pig script.??

{code}
raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
grouped = group raw_data by (f1, f2);
data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
ordered = order data by value DESC parallel 10;
topn = limit ordered 10;
store ordered into 'outputdir/full';
store topn into 'outputdir/topn';
{code}

??With the statement 'store ordered ...', top N results are incorrect, but without the statement, results are correct. Has anyone seen this before? I know a similar bug has been fixed in the multi-query release. We are on pig .4 and hadoop .20.1.??

  was:
Recently, a couple of problems related to the Top N queries were reported by users.

* From Chuang Liu:

We tried to get top N results after a groupby and sort, and got different results with or without storing the full sorted results. Here is a skeleton of our pig script.

{code}
raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
grouped = group raw_data by (f1, f2);
data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
ordered = order data by value DESC parallel 10;
topn = limit ordered 10;
store ordered into 'outputdir/full';
store topn into 'outputdir/topn';
{code}

With the statement 'store ordered ...', top N results are incorrect, but without the statement, results are correct. Has anyone seen this before? I know a similar bug has been fixed in the multi-query release. We are on pig .4 and hadoop .20.1.

* From Corry Haines:

I am not sure if this is a bug, or something more subtle, but here is the problem that I am having.

When I LOAD a dataset, change it with an ORDER, LIMIT it, then CROSS it with itself, the results are not correct.
I expect to see the cross of the limited, ordered dataset, but instead I see the cross of the limited dataset. Effectively, it's like the LIMIT is being excluded.

Pig Version: 0.5.0
Hadoop Version: 0.20.1

I would greatly appreciate some help, as this is somewhat frustrating.

Example code (and output) follows:

{code}
A = load 'foo' as (f1:int, f2:int, f3:int);
B = load 'foo' as (f1:int, f2:int, f3:int);
a = ORDER A BY f1 DESC;
b = ORDER B BY f1 DESC;
aa = LIMIT a 1;
bb = LIMIT b 1;
C = CROSS aa, bb;
DUMP C;
{code}

        Summary: Top-N queries produce incorrect results when a store statement is added between order by and limit statement  (was: Problems with some top N queries)

> Top-N queries produce incorrect results when a store statement is added
> between order by and limit statement
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1169
>                 URL: https://issues.apache.org/jira/browse/PIG-1169
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>
> ??We tried to get top N results after a groupby and sort, and got different
> results with or without storing the full sorted results. Here is a skeleton
> of our pig script.??
> {code}
> raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
> grouped = group raw_data by (f1, f2);
> data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
> ordered = order data by value DESC parallel 10;
> topn = limit ordered 10;
> store ordered into 'outputdir/full';
> store topn into 'outputdir/topn';
> {code}
> ??With the statement 'store ordered ...', top N results are incorrect, but
> without the statement, results are correct. Has anyone seen this before? I
> know a similar bug has been fixed in the multi-query release. We are on pig
> .4 and hadoop .20.1.??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.