[ https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152220#comment-15152220 ]
Pallavi Rao commented on PIG-4797:
----------------------------------

Solution Proposal:
Currently, the generated Spark plan and the corresponding set of Spark operations are as follows:
{noformat}
OUT: Store (map, saveAsNewAPIHadoopDataset)
|
|---OUT: New For Each (mapPartitions)
    |
    |---OUT: Package (map)
        |
        |---OUT: Global Rearrange (map, cogroup, map)
            |
            |---OUT: Local Rearrange (map)
            |   |
            |   |---CUST: New For Each (mapPartitions)
            |       |
            |       |---CUST: Load (newHadoopAPI, map)
            |
            |---OUT: Local Rearrange (map)
                |
                |---TRANS: New For Each (mapPartitions)
                    |
                    |---TRANS: Load (newHadoopAPI, map)
{noformat}

The number of operations could be reduced, and time saved, if this plan were optimized as follows:
{noformat}
OUT: Store (map, saveAsNewAPIHadoopDataset)
|
|---OUT: New For Each (mapPartitions)
    |
    |---OUT: join (join)
        |
        |---CUST: New For Each (mapPartitions)
        |   |
        |   |---CUST: Load (newHadoopAPI, map)
        |
        |---TRANS: New For Each (mapPartitions)
            |
            |---TRANS: Load (newHadoopAPI, map)
{noformat}

> Analyze JOIN performance and improve the same.
> ----------------------------------------------
>
>                 Key: PIG-4797
>                 URL: https://issues.apache.org/jira/browse/PIG-4797
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>         Attachments: Join performance analysis.pdf
>
>
> There is a big performance difference in join between Spark and MR mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
>             date:chararray, open:float, high:float, low:float,
>             close:float, volume:int, adj_close:float);
> divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
>             date:chararray, dividends:float);
> jnd = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x "$mode" "$PIG_HOME/bin/join.pig"
> end=$(date +%s)
> execution_time=$(( end - start ))
> echo "execution_time:$execution_time"
> {code}
> The execution time:
> || ||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data NYSE_daily and NYSE_dividends from
> https://github.com/alanfgates/programmingpig/blob/master/data/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
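
As a rough illustration of the solution proposal in the comment above, here is a plain-Python sketch with no Pig or Spark dependency. It contrasts the two plan shapes: a join done by cogrouping both inputs and then flattening each group, and a join done in a single step. The function names and the sample rows are hypothetical and are not taken from the Pig code base or the NYSE data. The sketch only shows that the two shapes return the same rows for an inner join; it does not model or measure Spark performance.

```python
from collections import defaultdict
from itertools import product


def cogroup_join(left, right):
    """Inner join in two steps: cogroup both inputs by key, then flatten."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    # Flatten step: cross product of the two bags collected for each key.
    return [(key, l, r)
            for key, (lefts, rights) in groups.items()
            for l, r in product(lefts, rights)]


def direct_join(left, right):
    """Inner join in one step: index one side by key, stream the other."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, l, r) for key, l in left for r in index.get(key, [])]


# Made-up rows keyed by (exchange, symbol), mirroring the join keys in join.pig.
daily = [(("NYSE", "AAA"), 12.5), (("NYSE", "BBB"), 7.0), (("NYSE", "AAA"), 12.9)]
divs = [(("NYSE", "AAA"), 0.04), (("NYSE", "CCC"), 0.10)]

# Both shapes yield the same joined rows; only the number of steps differs.
assert sorted(cogroup_join(daily, divs)) == sorted(direct_join(daily, divs))
```

Keys present on only one side ("BBB" and "CCC" in the made-up rows) drop out of both results, as expected for an inner join.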