[ 
https://issues.apache.org/jira/browse/PIG-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianda Ke updated PIG-4848:
---------------------------
    Attachment: PIG-4848.patch

Set the flag pig.noSplitCombination=true internally for merge join.

> pig.noSplitCombination=true should always be set internally for a merge join
> ----------------------------------------------------------------------------
>
>                 Key: PIG-4848
>                 URL: https://issues.apache.org/jira/browse/PIG-4848
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4848.patch
>
>
> In Spark mode, the flag pig.noSplitCombination is NOT set to true internally
> for a merge join. As a result, the input splits are ordered by file size and
> the output is out of order.
> Scenario:
> cat input1
> {code}
> 1     1
> {code}
> cat input2
> {code}
> 2     2
> {code}
> cat input3
> {code}
> 33    33
> {code}
> A = LOAD 'input*' as (a:int, b:int);
> B = LOAD 'input*' as (a:int, b:int);
> C = JOIN A BY $0, B BY $0 USING 'merge';
> DUMP C;
> expected result:
> {code}
> (1,1,1,1)
> (2,2,2,2)
> (33,33,33,33)
> {code}
> actual result:
> {code}
> (33,33,33,33)
> (1,1,1,1)
> (2,2,2,2)
> {code}
> In MR mode, the flag was set to true internally for a merge join (see
> PIG-2773). However, that no longer works: the output is still out of order,
> because the splits are ordered again by hadoop-client. In Spark mode, we
> can solve this issue.
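
The ordering problem described above can be illustrated with a small simulation. This is only a sketch, not Pig or Spark code: the byte sizes, the largest-first ordering, and the helper names (read_keys, order_by_size_desc, order_by_name, is_sorted) are assumptions made for illustration. The issue only states that the splits end up "in the order of file size"; largest-first is assumed here because it reproduces the actual result shown.

```python
# Hypothetical stand-ins for the three input files: (name, size in bytes, join keys).
# The sizes are assumed ("1<TAB>1<NL>" is 4 bytes, "33<TAB>33<NL>" is 6 bytes).
SPLITS = [
    ("input1", 4, [1]),
    ("input2", 4, [2]),
    ("input3", 6, [33]),
]


def read_keys(splits):
    """Concatenate the join keys in the order the splits are processed."""
    return [key for _name, _size, keys in splits for key in keys]


def order_by_size_desc(splits):
    """Assumed behaviour when splits are ordered by file size, largest first."""
    return sorted(splits, key=lambda s: s[1], reverse=True)


def order_by_name(splits):
    """Assumed behaviour when the original file order is preserved."""
    return sorted(splits, key=lambda s: s[0])


def is_sorted(keys):
    """A merge join needs its input keys in non-decreasing order."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


by_size = read_keys(order_by_size_desc(SPLITS))
by_name = read_keys(order_by_name(SPLITS))
print(by_size, is_sorted(by_size))
print(by_name, is_sorted(by_name))
```

Under these assumptions the size-ordered run reads key 33 first, which matches the actual result in the report, while the name-ordered run keeps the keys sorted as a merge join requires.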



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
