[ 
https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236573#comment-15236573
 ] 

liyunzhang_intel commented on PIG-4797:
---------------------------------------

[~mohitsabharwal],[~pallavi.rao],[~kexianda]:
 PIG-4797 collapses POLocalRearrange,POGlobalRearrange and POPackage to 
POJoinSpark to reduce unnecessary
 map operations to optimize join/group. Let's show the spark plan after 
collapsing LR,GLR,PKG into POJoinSpark:
join.pig
{code}
daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
            date:chararray, open:float, high:float, low:float,
            close:float, volume:int, adj_close:float);
divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
            date:chararray, dividends:float);
jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
store jnd into './join.out';
{code}
before optimization:
{code}
jnd: 
Store(hdfs://zly1.sh.intel.com:8020/user/root/join.out:org.apache.pig.builtin.PigStorage)
 - scope-58
|
|---jnd: New For Each(true,true)[tuple] - scope-57
    |   |
    |   Project[bag][1] - scope-55
    |   |
    |   Project[bag][2] - scope-56
    |
    |---jnd: Package(Packager)[tuple]{tuple} - scope-48
        |
        |---jnd: Global Rearrange[tuple] - scope-47
            |
            |---jnd: Local Rearrange[tuple]{tuple}(false) - scope-49
            |   |   |
            |   |   Project[chararray][0] - scope-50
            |   |   |
            |   |   Project[chararray][1] - scope-51
            |   |
            |   |---daily: New For 
Each(false,false,false,false,false,false,false,false,false)[bag] - scope-28
            |       |   |
            |       |   Cast[chararray] - scope-2
            |       |   |
            |       |   |---Project[bytearray][0] - scope-1
            |       |   |
            |       |   Cast[chararray] - scope-5
            |       |   |
            |       |   |---Project[bytearray][1] - scope-4
            |       |   |
            |       |   Cast[chararray] - scope-8
            |       |   |
            |       |   |---Project[bytearray][2] - scope-7
            |       |   |
            |       |   Cast[float] - scope-11
            |       |   |
            |       |   |---Project[bytearray][3] - scope-10
            |       |   |
            |       |   Cast[float] - scope-14
            |       |   |
            |       |   |---Project[bytearray][4] - scope-13
            |       |   |
            |       |   Cast[float] - scope-17
            |       |   |
            |       |   |---Project[bytearray][5] - scope-16
            |       |   |
            |       |   Cast[float] - scope-20
            |       |   |
            |       |   |---Project[bytearray][6] - scope-19
            |       |   |
            |       |   Cast[int] - scope-23
            |       |   |
            |       |   |---Project[bytearray][7] - scope-22
            |       |   |
            |       |   Cast[float] - scope-26
            |       |   |
            |       |   |---Project[bytearray][8] - scope-25
            |       |
            |       |---daily: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_daily:org.apache.pig.builtin.PigStorage)
 - scope-0
            |
            |---jnd: Local Rearrange[tuple]{tuple}(false) - scope-52
                |   |
                |   Project[chararray][0] - scope-53
                |   |
                |   Project[chararray][1] - scope-54
                |
                |---divs: New For Each(false,false,false,false)[bag] - scope-42
                    |   |
                    |   Cast[chararray] - scope-31
                    |   |
                    |   |---Project[bytearray][0] - scope-30
                    |   |
                    |   Cast[chararray] - scope-34
                    |   |
                    |   |---Project[bytearray][1] - scope-33
                    |   |
                    |   Cast[chararray] - scope-37
                    |   |
                    |   |---Project[bytearray][2] - scope-36
                    |   |
                    |   Cast[float] - scope-40
                    |   |
                    |   |---Project[bytearray][3] - scope-39
                    |
                    |---divs: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_dividends:org.apache.pig.builtin.PigStorage)
 - scope-29----

{code}

After join optimization:
{code}
jnd: 
Store(hdfs://zly1.sh.intel.com:8020/user/root/join.out:org.apache.pig.builtin.PigStorage)
 - scope-58
|
|---jnd: New For Each(true,true)[tuple] - scope-57
    |   |
    |   Project[bag][1] - scope-55
    |   |
    |   Project[bag][2] - scope-56
    |
    |---POJoinSpark[tuple] - scope-47
        |
        |---daily: New For 
Each(false,false,false,false,false,false,false,false,false)[bag] - scope-28
        |   |   |
        |   |   Cast[chararray] - scope-2
        |   |   |
        |   |   |---Project[bytearray][0] - scope-1
        |   |   |
        |   |   Cast[chararray] - scope-5
        |   |   |
        |   |   |---Project[bytearray][1] - scope-4
        |   |   |
        |   |   Cast[chararray] - scope-8
        |   |   |
        |   |   |---Project[bytearray][2] - scope-7
        |   |   |
        |   |   Cast[float] - scope-11
        |   |   |
        |   |   |---Project[bytearray][3] - scope-10
        |   |   |
        |   |   Cast[float] - scope-14
        |   |   |
        |   |   |---Project[bytearray][4] - scope-13
        |   |   |
        |   |   Cast[float] - scope-17
        |   |   |
        |   |   |---Project[bytearray][5] - scope-16
        |   |   |
        |   |   Cast[float] - scope-20
        |   |   |
        |   |   |---Project[bytearray][6] - scope-19
        |   |   |
        |   |   Cast[int] - scope-23
        |   |   |
        |   |   |---Project[bytearray][7] - scope-22
        |   |   |
        |   |   Cast[float] - scope-26
        |   |   |
        |   |   |---Project[bytearray][8] - scope-25
        |   |
        |   |---daily: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_daily:org.apache.pig.builtin.PigStorage)
 - scope-0
        |
        |---divs: New For Each(false,false,false,false)[bag] - scope-42
            |   |
            |   Cast[chararray] - scope-31
            |   |
            |   |---Project[bytearray][0] - scope-30
            |   |
            |   Cast[chararray] - scope-34
            |   |
            |   |---Project[bytearray][1] - scope-33
            |   |
            |   Cast[chararray] - scope-37
            |   |
            |   |---Project[bytearray][2] - scope-36
            |   |
            |   Cast[float] - scope-40
            |   |
            |   |---Project[bytearray][3] - scope-39
            |
            |---divs: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_dividends:org.apache.pig.builtin.PigStorage)
 - scope-29-

{code}

In the PIG-4797.patch, changes are:
1. change the spark plan in JoinOptimizerSpark. when LRA+GLA+PKG is only 
encountered in the spark plan, LRA+GLA+PKA will be changed to POJoinSpark
2. In JoinSparkConverter, RDD executes LocalRearrangeFunction which converts 
(Tuple) to Tuple2<IndexedKey,Tuple>, CoGroup,GroupPkgFunction which combines 
the action of group and package.

Compare the performance:
before the patch: join.pig uses 60 secs.
After the patch:join.pig uses 46 secs.

> Analyze JOIN performance and improve the same.
> ----------------------------------------------
>
>                 Key: PIG-4797
>                 URL: https://issues.apache.org/jira/browse/PIG-4797
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: liyunzhang_intel
>              Labels: spork
>         Attachments: Join performance analysis.pdf, PIG-4797.patch
>
>
> There are a big  performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
>             date:chararray, open:float, high:float, low:float,
>             close:float, volume:int, adj_close:float);
> divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
>             date:chararray, dividends:float);
> jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x $mode  $PIG_HOME/bin/join.pig
> end=$(date +%s)
> execution_time=$(( $end - $start ))
> echo "execution_time:"$excution_time
> {code}
> The execution time:
> || |||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data NYSE_daily and NYSE_dividends in 
> https://github.com/alanfgates/programmingpig/blob/master/data/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to