[ https://issues.apache.org/jira/browse/PIG-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954341#comment-14954341 ]

Rohini Palaniswamy commented on PIG-4695:
-----------------------------------------

With current trunk code, I get the right results. Haven't checked with 0.15 though.

> Using 'replicated' left join results in different result from regular left join.
> --------------------------------------------------------------------------------
>
>                 Key: PIG-4695
>                 URL: https://issues.apache.org/jira/browse/PIG-4695
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.15.0
>            Reporter: Zbigniew Rzepka
>
> There seems to be a difference in results between regular LEFT JOIN and
> replicated LEFT JOIN. This may be a case only with very small data sets, as
> we're using piece of code shown below in production with correct results.
> EDIT:
> This issue only occurs when running PIG on Tez. (We're using Tez 7.0).
> Example:
> I have two data sets:
> first_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> (108,17,all_users,all_users)
> (138,11,all_users,all_users)
> {code}
> second_period_users:
> {code}
> (108,11,all_users,all_users)
> (108,13,all_users,all_users)
> {code}
> When I use regular LEFT JOIN on these two I get the correct output:
> {code:sql}
> joined_periods_users = JOIN
>     $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
>     $second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (138,11,all_users,all_users,,,,)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users,,,,)
> {code}
> BUT, if I add {{USING 'replicated'}}, the result is completely different:
> {code}
> $joined_periods_users = JOIN
>     $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
>     $second_period_users BY (user_id, gg_id, dimension_name, dimension_value)
>     USING 'replicated';
> {code}
> output:
> {code}
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,11,all_users,all_users,108,11,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,13,all_users,all_users,108,13,all_users,all_users)
> (108,17,all_users,all_users,,,,)
> (138,11,all_users,all_users,,,,)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)