[ https://issues.apache.org/jira/browse/PIG-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nathan Smith updated PIG-4834:
------------------------------
    Attachment: non-skewed-version.png
                skewed-version.png

> Left Outer Skewed Join produces incorrect results
> -------------------------------------------------
>
>                 Key: PIG-4834
>                 URL: https://issues.apache.org/jira/browse/PIG-4834
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>         Environment: HDP 2.3.2
> Pig 0.15.0.2.3.2.0-2950
> 5 node cluster (2 name, 3 data)
>            Reporter: Nathan Smith
>         Attachments: non-skewed-version.png, skewed-version.png
>
>
> I've been working on a Pig script that joins several datasets, and I think 
> I found a bug in Left Outer Join with USING 'skewed'. To speed up some joins 
> whose keys appeared to be skewed, I added the 'skewed' keyword, but the 
> skewed version produced far fewer output records (7,916,768 versus 
> 16,891,935). The dataflow is quite complicated, but I've isolated the jobs 
> where the results start to differ.
> Non-skewed version:
> * 36 map tasks
> * 5 reduce tasks
> * shortest reducer: 46sec
> * longest reducer: 7min, 9sec
> * input records: 16,903,866
> * output records: 16,891,935
> {code}
> out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1;
> {code}
> Skewed version:
> * 36 map tasks
> * 5 reduce tasks
> * shortest reducer: 1min, 34sec
> * longest reducer: 2min, 15sec
> * input records: 16,903,866
> * output records: 7,916,768
> {code}
> out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1 USING 'skewed';
> {code}
> The two scripts are identical except that each join in the second one adds 
> {{USING 'skewed'}}. My understanding is that a skewed join should produce 
> the same results as the default join, the only difference being a 
> preliminary scan to determine the best reducer distribution scheme.
> See the attached screenshots of the counters page for both versions.
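
One way to decide which of the two counts is wrong is to compute the row count that a left outer equi-join has to produce from the key frequencies alone, since that number does not depend on the join strategy. The following is a minimal sketch of that check as a small reference model in Python, not Pig code. The function name and the sample keys are hypothetical, and it assumes that null join keys (modelled as `None`) never match but are still emitted from the left side.

```python
from collections import Counter


def expected_left_outer_count(left_keys, right_keys):
    """Row count a left outer equi-join should emit, whatever the strategy."""
    # Assumption: a null key (None) on the right side can never match.
    right_freq = Counter(k for k in right_keys if k is not None)
    total = 0
    for key, n_left in Counter(left_keys).items():
        # Assumption: a null key on the left side is kept but never matched.
        n_right = 0 if key is None else right_freq.get(key, 0)
        # Each left row pairs with every matching right row, or is emitted
        # once with nulls when nothing matches.
        total += n_left * max(n_right, 1)
    return total


# Hypothetical sample keys, for illustration only.
left = ["a", "a", "b", None, "c"]
right = ["a", "a", "a", "b", "d"]
expected = expected_left_outer_count(left, right)
print(expected)
```

Because every left row contributes at least one output row, the result can never be smaller than the number of left-side rows. Comparing each job's output record count against that lower bound, using per-key counts of the real join inputs, would show which version is dropping rows.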



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
