[ 
https://issues.apache.org/jira/browse/PIG-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963793#comment-15963793
 ] 

liyunzhang_intel commented on PIG-5212:
---------------------------------------

script:
{code}
a = load './studenttab10k.mk';
b = filter a by $1 > 25;
c = join a by $0, b by $0 using 'skewed' parallel 7;
store c into './skewed.out';
{code}

physical plan
{code}

c: 
Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage)
 - scope-14
|
|---c: SkewedJoin[tuple] - scope-13
    |   |
    |   Project[bytearray][0] - scope-11
    |   |
    |   Project[bytearray][0] - scope-12
    |
    |---a: Filter[bag] - scope-2
    |   |   |
    |   |   Constant(true) - scope-3
    |   |
    |   |---a: Split - scope-1
    |       |
    |       |---a: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
 - scope-0
    |
    |---b: Filter[bag] - scope-6
        |   |
        |   Greater Than[boolean] - scope-10
        |   |
        |   |---Cast[int] - scope-8
        |   |   |
        |   |   |---Project[bytearray][1] - scope-7
        |   |
        |   |---Constant(25) - scope-9
        |
        |---a: Filter[bag] - scope-4
            |   |
            |   Constant(true) - scope-5
            |
            |---a: Split - scope-1
                |
                |---a: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
 - scope-0
{code}

spark plan
{code}

scope-15->scope-36
scope-21->scope-36
scope-36
#--------------------------------------------------
# Spark Plan                                 
#--------------------------------------------------

Spark node scope-21
BroadcastSpark - scope-35
|
|---New For Each(false)[tuple] - scope-34
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] - 
scope-33
    |   |
    |   |---Project[tuple][*] - scope-32
    |
    |---New For Each(false,false)[tuple] - scope-31
        |   |
        |   Constant(7) - scope-30
        |   |
        |   Project[bag][1] - scope-29
        |
        |---POSparkSort[tuple]() - scope-13
            |   |
            |   Project[bytearray][0] - scope-11
            |
            |---New For Each(false,true)[tuple] - scope-28
                |   |
                |   Project[bytearray][0] - scope-11
                |   |
                |   
POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-26
                |   |
                |   |---Project[tuple][*] - scope-25
                |
                |---PoissonSampleSpark - scope-27
                    |
                    
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
 - scope-24--------

Spark node scope-36
c: 
Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage)
 - scope-14
|
|---c: SkewedJoin[tuple] - scope-13
    |   |
    |   Project[bytearray][0] - scope-11
    |   |
    |   Project[bytearray][0] - scope-12
    |
    |---b: Filter[bag] - scope-6
    |   |   |
    |   |   Greater Than[boolean] - scope-10
    |   |   |
    |   |   |---Cast[int] - scope-8
    |   |   |   |
    |   |   |   |---Project[bytearray][1] - scope-7
    |   |   |
    |   |   |---Constant(25) - scope-9
    |   |
    |   
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
 - scope-19
    |
    
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
 - scope-17--------

Spark node scope-15
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
 - scope-16
|
|---a: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
 - scope-0--------
+
{code}

the reason why the result is inverted comparing with mr is because the 
predecessor of scope-13(Skewedjoin) is scope-6(Filter) and scope-17(Load) after 
POSplit splits the physical plan. While in previous physical plan the 
predecessor of scope-13(Skewedjoin) is scope-2(Load) and scope-6(Filter).

also print the mr plan in mr mode, this problem is not reproduced in MR because 
the predecessor of Union(scope-44) is LocalRearrange(scope-40) and Partition 
Rearrange(scope-41).
{code}
#--------------------------------------------------
# Map Reduce Plan                                 
#--------------------------------------------------
MapReduce node scope-15
Map Plan
Split - scope-51
|   |
|   
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1966124615:org.apache.pig.impl.io.InterStorage)
 - scope-21
|   |
|   
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp-1906398218:org.apache.pig.impl.io.InterStorage)
 - scope-16
|
|---a: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
 - scope-0--------
Global sort: false
----------------

MapReduce node scope-25
Map Plan
Local Rearrange[tuple]{tuple}(false) - scope-28
|   |
|   Constant(all) - scope-27
|
|---New For Each(false,true)[tuple] - scope-26
    |   |
    |   Project[bytearray][0] - scope-11
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-23
    |   |
    |   |---Project[tuple][*] - scope-22
    |
    
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1966124615:org.apache.pig.impl.builtin.PoissonSampleLoader('org.apache.pig.impl.io.InterStorage','100'))
 - scope-24--------
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1192647344:org.apache.pig.impl.io.InterStorage)
 - scope-37
|
|---New For Each(false)[tuple] - scope-36
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] - 
scope-35
    |   |
    |   |---Project[tuple][*] - scope-34
    |
    |---New For Each(false,false)[tuple] - scope-33
        |   |
        |   Constant(7) - scope-32
        |   |
        |   Project[bag][1] - scope-30
        |
        |---Package(Packager)[tuple]{chararray} - scope-29--------
Global sort: false
Secondary sort: true
----------------

MapReduce node scope-43
Map Plan
Union[tuple] - scope-44
|
|---Local Rearrange[tuple]{bytearray}(false) - scope-40
|   |   |
|   |   Project[bytearray][0] - scope-11
|   |
|   
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1966124615:org.apache.pig.impl.io.InterStorage)
 - scope-38
|
|---Partition rearrange [bag]{bytearray}(false) - scope-41
    |   |
    |   Project[bytearray][0] - scope-12
    |
    |---b: Filter[bag] - scope-6
        |   |
        |   Greater Than[boolean] - scope-10
        |   |
        |   |---Cast[int] - scope-8
        |   |   |
        |   |   |---Project[bytearray][1] - scope-7
        |   |
        |   |---Constant(25) - scope-9
        |
        
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp-1906398218:org.apache.pig.impl.io.InterStorage)
 - scope-19--------
Reduce Plan
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-14
|
|---Package(JoinPackager(true,true))[tuple]{bytearray} - scope-45--------
Global sort: false
----------------

{code}

> SkewedJoin_6 is failing on Spark
> --------------------------------
>
>                 Key: PIG-5212
>                 URL: https://issues.apache.org/jira/browse/PIG-5212
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>
> result are different:
> {code}
> diff <(head -20 SkewedJoin_6_benchmark.out/out_sorted) <(head -20 
> SkewedJoin_6.out/out_sorted)
> < alice allen 19      1.930   alice allen     27      1.950
> < alice allen 19      1.930   alice allen     34      1.230
> < alice allen 19      1.930   alice allen     36      2.270
> < alice allen 19      1.930   alice allen     38      0.810
> < alice allen 19      1.930   alice allen     38      1.800
> < alice allen 19      1.930   alice allen     42      2.460
> < alice allen 19      1.930   alice allen     43      0.880
> < alice allen 19      1.930   alice allen     45      2.800
> < alice allen 19      1.930   alice allen     46      3.970
> < alice allen 19      1.930   alice allen     51      1.080
> < alice allen 19      1.930   alice allen     68      3.390
> < alice allen 19      1.930   alice allen     68      3.510
> < alice allen 19      1.930   alice allen     72      1.750
> < alice allen 19      1.930   alice allen     72      3.630
> < alice allen 19      1.930   alice allen     74      0.020
> < alice allen 19      1.930   alice allen     74      2.400
> < alice allen 19      1.930   alice allen     77      2.520
> < alice allen 20      2.470   alice allen     27      1.950
> < alice allen 20      2.470   alice allen     34      1.230
> < alice allen 20      2.470   alice allen     36      2.270
> ---
> > alice allen 27      1.950   alice allen     19      1.930
> > alice allen 27      1.950   alice allen     20      2.470
> > alice allen 27      1.950   alice allen     27      1.950
> > alice allen 27      1.950   alice allen     34      1.230
> > alice allen 27      1.950   alice allen     36      2.270
> > alice allen 27      1.950   alice allen     38      0.810
> > alice allen 27      1.950   alice allen     38      1.800
> > alice allen 27      1.950   alice allen     42      2.460
> > alice allen 27      1.950   alice allen     43      0.880
> > alice allen 27      1.950   alice allen     45      2.800
> > alice allen 27      1.950   alice allen     46      3.970
> > alice allen 27      1.950   alice allen     51      1.080
> > alice allen 27      1.950   alice allen     68      3.390
> > alice allen 27      1.950   alice allen     68      3.510
> > alice allen 27      1.950   alice allen     72      1.750
> > alice allen 27      1.950   alice allen     72      3.630
> > alice allen 27      1.950   alice allen     74      0.020
> > alice allen 27      1.950   alice allen     74      2.400
> > alice allen 27      1.950   alice allen     77      2.520
> > alice allen 34      1.230   alice allen     19      1.930
> {code}
> It looks like the two tables are in wrong order, columns from 'a' should come 
> first, then columns from 'b'. In spark mode this is inverted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to