[ 
https://issues.apache.org/jira/browse/PIG-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179385#comment-15179385
 ] 

Daniel Dai commented on PIG-4789:
---------------------------------

PIG-4690 fixed this issue.

> Pig on TEZ creates wrong result with replicated join
> ----------------------------------------------------
>
>                 Key: PIG-4789
>                 URL: https://issues.apache.org/jira/browse/PIG-4789
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>    Affects Versions: 0.15.0
>            Reporter: Michael Prim
>            Priority: Critical
>         Attachments: tez_bug.pig, tez_bug_input1.csv, tez_bug_input2.csv, 
> tez_bug_input3.csv
>
>
> Below is a minimal Pig script that uses SPLIT and replicated joins and whose 
> output differs between the MapReduce and TEZ execution engines. The 
> attachments contain the sample input data.
> The expected output, as produced by the MapReduce engine, is:
> {code}
> (id1,123,A,)
> (id2,234,,B)
> (id3,456,,)
> (id4,567,A,)
> {code}
> whereas TEZ produces:
> {code}
> (id1,123,A,A)
> (id2,234,B,B)
> (id3,456,,)
> (id4,567,A,A)
> {code}
> Removing the {{USING 'replicated'}} clause and using a regular join yields 
> correct results. I am not sure whether this is a Pig issue or a TEZ issue. 
> However, because it can silently lead to data corruption, I rated it 
> critical. So far, searching has not turned up a similar bug or anyone else 
> who is aware of it.
> {code}
> classdata = LOAD '/tez_bug_input1.csv' USING PigStorage(',')
>     AS (classid:chararray, class:chararray);
> data = LOAD '/tez_bug_input2.csv' USING PigStorage(',')
>     AS (eventid:chararray, classid:chararray);
> basedata = LOAD '/tez_bug_input3.csv' USING PigStorage(',')
>     AS (eventid:chararray, foo:int);
> dataJclassdata = JOIN classdata BY classid, data BY classid;
> SPLIT dataJclassdata INTO classA IF class == 'A', classB IF class == 'B';
> dataA = JOIN basedata BY eventid LEFT OUTER, classA BY data::eventid
>     USING 'replicated';
> dataA = FOREACH dataA GENERATE basedata::eventid AS eventid
>       , basedata::foo AS foo
>       , classA::classdata::class AS classA;
> dataB = JOIN dataA BY eventid LEFT OUTER, classB BY eventid
>     USING 'replicated';
> dataB = FOREACH dataB GENERATE dataA::eventid AS eventid
>       , dataA::foo AS foo
>       , dataA::classA AS classA
>       , classB::classdata::class AS classB;
> DUMP dataB;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
