[ 
https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152220#comment-15152220
 ] 

Pallavi Rao commented on PIG-4797:
----------------------------------

Solution Proposal:
Currently, the Spark plan that is generated and the corresponding set of Spark 
operations are as follows:
{noformat}
OUT: Store (map, saveAsNewAPIHadoopDataset)
|
|---OUT: New For Each (mapPartitions)
    |  
    |---OUT: Package (map)
        |
        |---OUT: Global Rearrange (map, cogroup, map)
            |
            |---OUT: Local Rearrange (map)
            |   |   
            |   |---CUST: New For Each (mapPartitions)
            |       |
            |       |---CUST: Load (newHadoopAPI, map)
            |
            |---OUT: Local Rearrange (map)
                |
                |---TRANS: New For Each (mapPartitions)
                    |
                    |---TRANS: Load (newHadoopAPI, map)
{noformat}
The number of operations, and with it the execution time, could be reduced if 
this plan were optimized as follows:
{noformat}
OUT: Store (map, saveAsNewAPIHadoopDataset)
|
|---OUT: New For Each (mapPartitions)
    |  
    |---OUT: join (join)
        |
        |---CUST: New For Each (mapPartitions)
        |   |
        |   |---CUST: Load (newHadoopAPI, map)
        |
        |---TRANS: New For Each (mapPartitions)
            |
            |---TRANS: Load (newHadoopAPI, map)
{noformat}
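
To illustrate why the two plans should produce the same output, here is a minimal sketch in plain Python. It is not Pig or Spark code: the function names and sample rows are hypothetical, and in-memory lists stand in for RDDs. The first function mimics the current shape (group both inputs by key, then flatten the bags); the second mimics the proposed shape (one direct inner join).

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key, like a cogroup step."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def join_via_cogroup(left, right):
    """Current plan shape: cogroup both inputs, then flatten the bags."""
    out = []
    for key, (lvals, rvals) in cogroup(left, right).items():
        # Flattening both bags yields the per-key cross product;
        # a key missing on either side yields nothing (inner join).
        out.extend((key, l, r) for l, r in product(lvals, rvals))
    return out


def join_direct(left, right):
    """Proposed plan shape: a single hash join without intermediate bags."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, l, r) for key, l in left for r in index.get(key, [])]


# Hypothetical sample rows keyed by (exchange, symbol).
daily = [(("NYSE", "A"), 1.0), (("NYSE", "A"), 2.0), (("NYSE", "B"), 3.0)]
divs = [(("NYSE", "A"), 0.1), (("NYSE", "C"), 0.2)]

# Both shapes return the same rows; only the number of steps differs.
assert sorted(join_via_cogroup(daily, divs)) == sorted(join_direct(daily, divs))
```

The sketch only shows that the results match for an inner join. Whether the direct join is actually faster in Spark would still need to be measured.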

> Analyze JOIN performance and improve the same.
> ----------------------------------------------
>
>                 Key: PIG-4797
>                 URL: https://issues.apache.org/jira/browse/PIG-4797
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>         Attachments: Join performance analysis.pdf
>
>
> There is a big performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
>             date:chararray, open:float, high:float, low:float,
>             close:float, volume:int, adj_close:float);
> divs  = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
>             date:chararray, dividends:float);
> jnd   = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x $mode  $PIG_HOME/bin/join.pig
> end=$(date +%s)
> execution_time=$(( $end - $start ))
> echo "execution_time:"$execution_time
> {code}
> The execution time:
> || ||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data NYSE_daily and NYSE_dividends from 
> https://github.com/alanfgates/programmingpig/blob/master/data/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
