[ https://issues.apache.org/jira/browse/PIG-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai resolved PIG-4834.
-----------------------------
    Resolution: Duplicate

This should be PIG-4587.

> Left Outer Skewed Join produces incorrect results
> -------------------------------------------------
>
>                 Key: PIG-4834
>                 URL: https://issues.apache.org/jira/browse/PIG-4834
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>         Environment: HDP 2.3.2
> Pig 0.15.0.2.3.2.0-2950
> 5 node cluster (2 name, 3 data)
>            Reporter: Nathan Smith
>         Attachments: non-skewed-version.png, skewed-version.png
>
>
> I've been working on a Pig script to join some datasets recently and I think
> I found a bug in Left Outer Join using "skewed". In an attempt to speed up
> what seemed to be some joins on skewed data I used the 'skewed' keyword, but
> the skewed version produced a different number of results. The dataflow is
> quite large, but I've isolated the jobs where the results start to differ.
> Non-skewed version:
> * 36 map tasks
> * 5 reduce tasks
> * shortest reducer: 46sec
> * longest reducer: 7min, 9sec
> * input records: 16,903,866
> * output records: 16,891,935
> {code}
> out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1;
> {code}
> Skewed version:
> * 36 map tasks
> * 5 reduce tasks
> * shortest reducer: 1min, 34sec
> * longest reducer: 2min, 15sec
> * input records: 16,903,866
> * output records: 7,916,768
> {code}
> out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1 USING 'skewed';
> {code}
> The two scripts are identical except that each join in the skewed version
> adds {{USING 'skewed'}}. My understanding is that using "skewed" should
> produce the same results, except that it does a preliminary scan to
> determine the best reducer distribution scheme.
> See attached for screenshots of the counters page for both versions.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
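
The expectation in the report, that the join strategy should not change the result, follows from left outer join semantics: every left-side row appears in the output at least once, so the output can never have fewer rows than the left relation. A minimal Python sketch of that invariant is below. It models join semantics only and is not Pig's implementation; the function names, parameters, and sample rows are hypothetical.

```python
from collections import defaultdict


def left_outer_join(left, right, left_key, right_key, right_width):
    """Reference left outer join over lists of tuples (illustrative only)."""
    # Index the right side by key. Null keys are skipped because a null
    # join key never matches anything.
    index = defaultdict(list)
    for row in right:
        if row[right_key] is not None:
            index[row[right_key]].append(row)

    joined = []
    for row in left:
        matches = index.get(row[left_key], [])
        if matches:
            # One output row per matching right-side row.
            joined.extend(row + match for match in matches)
        else:
            # No match: keep the left row and pad the right side with nulls.
            joined.append(row + (None,) * right_width)
    return joined


def row_count_is_plausible(left, joined):
    """A left outer join output has at least as many rows as the left input."""
    return len(joined) >= len(left)


if __name__ == "__main__":
    sample_left = [(1, "a"), (2, "b"), (2, "c"), (None, "d")]
    sample_right = [(2, "x"), (2, "y"), (3, "z")]
    result = left_outer_join(sample_left, sample_right, 0, 0, 2)
    print(len(sample_left), len(result), row_count_is_plausible(sample_left, result))
```

Applied to the jobs above, the same check amounts to comparing the row count of leftrel alone with the output count of each join. The "input records" counter may cover both relations, so it is not necessarily the left-side count.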