[
https://issues.apache.org/jira/browse/PIG-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963793#comment-15963793
]
liyunzhang_intel commented on PIG-5212:
---------------------------------------
script:
{code}
a = load './studenttab10k.mk';
b = filter a by $1 > 25;
c = join a by $0, b by $0 using 'skewed' parallel 7;
store c into './skewed.out';
{code}
physical plan
{code}
c:
Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage)
- scope-14
|
|---c: SkewedJoin[tuple] - scope-13
| |
| Project[bytearray][0] - scope-11
| |
| Project[bytearray][0] - scope-12
|
|---a: Filter[bag] - scope-2
| | |
| | Constant(true) - scope-3
| |
| |---a: Split - scope-1
| |
| |---a:
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
- scope-0
|
|---b: Filter[bag] - scope-6
| |
| Greater Than[boolean] - scope-10
| |
| |---Cast[int] - scope-8
| | |
| | |---Project[bytearray][1] - scope-7
| |
| |---Constant(25) - scope-9
|
|---a: Filter[bag] - scope-4
| |
| Constant(true) - scope-5
|
|---a: Split - scope-1
|
|---a:
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
- scope-0
{code}
spark plan
{code}
scope-15->scope-36
scope-21->scope-36
scope-36
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-21
BroadcastSpark - scope-35
|
|---New For Each(false)[tuple] - scope-34
| |
| POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] -
scope-33
| |
| |---Project[tuple][*] - scope-32
|
|---New For Each(false,false)[tuple] - scope-31
| |
| Constant(7) - scope-30
| |
| Project[bag][1] - scope-29
|
|---POSparkSort[tuple]() - scope-13
| |
| Project[bytearray][0] - scope-11
|
|---New For Each(false,true)[tuple] - scope-28
| |
| Project[bytearray][0] - scope-11
| |
|
POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-26
| |
| |---Project[tuple][*] - scope-25
|
|---PoissonSampleSpark - scope-27
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
- scope-24--------
Spark node scope-36
c:
Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage)
- scope-14
|
|---c: SkewedJoin[tuple] - scope-13
| |
| Project[bytearray][0] - scope-11
| |
| Project[bytearray][0] - scope-12
|
|---b: Filter[bag] - scope-6
| | |
| | Greater Than[boolean] - scope-10
| | |
| | |---Cast[int] - scope-8
| | | |
| | | |---Project[bytearray][1] - scope-7
| | |
| | |---Constant(25) - scope-9
| |
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
- scope-19
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
- scope-17--------
Spark node scope-15
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp202864860/tmp-625525596:org.apache.pig.impl.io.InterStorage)
- scope-16
|
|---a:
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
- scope-0--------
+
{code}
the reason why the result is inverted comparing with mr is because the
predecessor of scope-13(Skewedjoin) is scope-6(Filter) and scope-17(Load) after
POSplit splits the physical plan. While in previous physical plan the
predecessor of scope-13(Skewedjoin) is scope-2(Load) and scope-6(Filter).
also print the mr plan in mr mode, this problem is not reproduced in MR because
the predecessor of Union(scope-44) is LocalRearrange(scope-40) and Partition
Rearrange(scope-41).
{code}
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-15
Map Plan
Split - scope-51
| |
|
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1966124615:org.apache.pig.impl.io.InterStorage)
- scope-21
| |
|
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp-1906398218:org.apache.pig.impl.io.InterStorage)
- scope-16
|
|---a:
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
- scope-0--------
Global sort: false
----------------
MapReduce node scope-25
Map Plan
Local Rearrange[tuple]{tuple}(false) - scope-28
| |
| Constant(all) - scope-27
|
|---New For Each(false,true)[tuple] - scope-26
| |
| Project[bytearray][0] - scope-11
| |
| POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-23
| |
| |---Project[tuple][*] - scope-22
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1966124615:org.apache.pig.impl.builtin.PoissonSampleLoader('org.apache.pig.impl.io.InterStorage','100'))
- scope-24--------
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1192647344:org.apache.pig.impl.io.InterStorage)
- scope-37
|
|---New For Each(false)[tuple] - scope-36
| |
| POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] -
scope-35
| |
| |---Project[tuple][*] - scope-34
|
|---New For Each(false,false)[tuple] - scope-33
| |
| Constant(7) - scope-32
| |
| Project[bag][1] - scope-30
|
|---Package(Packager)[tuple]{chararray} - scope-29--------
Global sort: false
Secondary sort: true
----------------
MapReduce node scope-43
Map Plan
Union[tuple] - scope-44
|
|---Local Rearrange[tuple]{bytearray}(false) - scope-40
| | |
| | Project[bytearray][0] - scope-11
| |
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp1966124615:org.apache.pig.impl.io.InterStorage)
- scope-38
|
|---Partition rearrange [bag]{bytearray}(false) - scope-41
| |
| Project[bytearray][0] - scope-12
|
|---b: Filter[bag] - scope-6
| |
| Greater Than[boolean] - scope-10
| |
| |---Cast[int] - scope-8
| | |
| | |---Project[bytearray][1] - scope-7
| |
| |---Constant(25) - scope-9
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp838403890/tmp-1906398218:org.apache.pig.impl.io.InterStorage)
- scope-19--------
Reduce Plan
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-14
|
|---Package(JoinPackager(true,true))[tuple]{bytearray} - scope-45--------
Global sort: false
----------------
{code}
> SkewedJoin_6 is failing on Spark
> --------------------------------
>
> Key: PIG-5212
> URL: https://issues.apache.org/jira/browse/PIG-5212
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Nandor Kollar
> Assignee: Xianda Ke
> Fix For: spark-branch
>
>
> result are different:
> {code}
> diff <(head -20 SkewedJoin_6_benchmark.out/out_sorted) <(head -20
> SkewedJoin_6.out/out_sorted)
> < alice allen 19 1.930 alice allen 27 1.950
> < alice allen 19 1.930 alice allen 34 1.230
> < alice allen 19 1.930 alice allen 36 2.270
> < alice allen 19 1.930 alice allen 38 0.810
> < alice allen 19 1.930 alice allen 38 1.800
> < alice allen 19 1.930 alice allen 42 2.460
> < alice allen 19 1.930 alice allen 43 0.880
> < alice allen 19 1.930 alice allen 45 2.800
> < alice allen 19 1.930 alice allen 46 3.970
> < alice allen 19 1.930 alice allen 51 1.080
> < alice allen 19 1.930 alice allen 68 3.390
> < alice allen 19 1.930 alice allen 68 3.510
> < alice allen 19 1.930 alice allen 72 1.750
> < alice allen 19 1.930 alice allen 72 3.630
> < alice allen 19 1.930 alice allen 74 0.020
> < alice allen 19 1.930 alice allen 74 2.400
> < alice allen 19 1.930 alice allen 77 2.520
> < alice allen 20 2.470 alice allen 27 1.950
> < alice allen 20 2.470 alice allen 34 1.230
> < alice allen 20 2.470 alice allen 36 2.270
> ---
> > alice allen 27 1.950 alice allen 19 1.930
> > alice allen 27 1.950 alice allen 20 2.470
> > alice allen 27 1.950 alice allen 27 1.950
> > alice allen 27 1.950 alice allen 34 1.230
> > alice allen 27 1.950 alice allen 36 2.270
> > alice allen 27 1.950 alice allen 38 0.810
> > alice allen 27 1.950 alice allen 38 1.800
> > alice allen 27 1.950 alice allen 42 2.460
> > alice allen 27 1.950 alice allen 43 0.880
> > alice allen 27 1.950 alice allen 45 2.800
> > alice allen 27 1.950 alice allen 46 3.970
> > alice allen 27 1.950 alice allen 51 1.080
> > alice allen 27 1.950 alice allen 68 3.390
> > alice allen 27 1.950 alice allen 68 3.510
> > alice allen 27 1.950 alice allen 72 1.750
> > alice allen 27 1.950 alice allen 72 3.630
> > alice allen 27 1.950 alice allen 74 0.020
> > alice allen 27 1.950 alice allen 74 2.400
> > alice allen 27 1.950 alice allen 77 2.520
> > alice allen 34 1.230 alice allen 19 1.930
> {code}
> It looks like the two tables are in wrong order, columns from 'a' should come
> first, then columns from 'b'. In spark mode this is inverted.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)