[ https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236573#comment-15236573 ]
liyunzhang_intel commented on PIG-4797: --------------------------------------- [~mohitsabharwal],[~pallavi.rao],[~kexianda]: PIG-4797 collapses POLocalRearrange,POGlobalRearrange and POPackage to POJoinSpark to reduce unnecessary map operations to optimize join/group. Let's show the spark plan after collapsing LR,GLR,PKG into POJoinSpark: join.pig {code} daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); jnd = join daily by (exchange, symbol), divs by (exchange, symbol); store jnd into './join.out'; {code} before optimization: {code} jnd: Store(hdfs://zly1.sh.intel.com:8020/user/root/join.out:org.apache.pig.builtin.PigStorage) - scope-58 | |---jnd: New For Each(true,true)[tuple] - scope-57 | | | Project[bag][1] - scope-55 | | | Project[bag][2] - scope-56 | |---jnd: Package(Packager)[tuple]{tuple} - scope-48 | |---jnd: Global Rearrange[tuple] - scope-47 | |---jnd: Local Rearrange[tuple]{tuple}(false) - scope-49 | | | | | Project[chararray][0] - scope-50 | | | | | Project[chararray][1] - scope-51 | | | |---daily: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-28 | | | | | Cast[chararray] - scope-2 | | | | | |---Project[bytearray][0] - scope-1 | | | | | Cast[chararray] - scope-5 | | | | | |---Project[bytearray][1] - scope-4 | | | | | Cast[chararray] - scope-8 | | | | | |---Project[bytearray][2] - scope-7 | | | | | Cast[float] - scope-11 | | | | | |---Project[bytearray][3] - scope-10 | | | | | Cast[float] - scope-14 | | | | | |---Project[bytearray][4] - scope-13 | | | | | Cast[float] - scope-17 | | | | | |---Project[bytearray][5] - scope-16 | | | | | Cast[float] - scope-20 | | | | | |---Project[bytearray][6] - scope-19 | | | | | Cast[int] - scope-23 | | | | | |---Project[bytearray][7] - scope-22 | | | | | Cast[float] - scope-26 | | | | | |---Project[bytearray][8] - scope-25 | | | |---daily: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_daily:org.apache.pig.builtin.PigStorage) - scope-0 | |---jnd: Local Rearrange[tuple]{tuple}(false) - scope-52 | | | Project[chararray][0] - scope-53 | | | Project[chararray][1] - scope-54 | |---divs: New For Each(false,false,false,false)[bag] - scope-42 | | | Cast[chararray] - scope-31 | | | |---Project[bytearray][0] - scope-30 | | | Cast[chararray] - scope-34 | | | |---Project[bytearray][1] - scope-33 | | | Cast[chararray] - scope-37 | | | |---Project[bytearray][2] - scope-36 | | | Cast[float] - scope-40 | | | |---Project[bytearray][3] - scope-39 | |---divs: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_dividends:org.apache.pig.builtin.PigStorage) - scope-29---- {code} After join optimization: {code} jnd: Store(hdfs://zly1.sh.intel.com:8020/user/root/join.out:org.apache.pig.builtin.PigStorage) - scope-58 | |---jnd: New For Each(true,true)[tuple] - scope-57 | | | Project[bag][1] - scope-55 | | | Project[bag][2] - scope-56 | |---POJoinSpark[tuple] - scope-47 | |---daily: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-28 | | | | | Cast[chararray] - scope-2 | | | | | |---Project[bytearray][0] - scope-1 | | | | | Cast[chararray] - scope-5 | | | | | |---Project[bytearray][1] - scope-4 | | | | | Cast[chararray] - scope-8 | | | | | |---Project[bytearray][2] - scope-7 | | | | | Cast[float] - scope-11 | | | | | |---Project[bytearray][3] - scope-10 | | | | | Cast[float] - scope-14 | | | | | |---Project[bytearray][4] - scope-13 | | | | | Cast[float] - scope-17 | | | | | |---Project[bytearray][5] - scope-16 | | | | | Cast[float] - scope-20 | | | | | |---Project[bytearray][6] - scope-19 | | | | | Cast[int] - scope-23 | | | | | |---Project[bytearray][7] - scope-22 | | | | | Cast[float] - scope-26 | | | | | |---Project[bytearray][8] - scope-25 | | | |---daily: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_daily:org.apache.pig.builtin.PigStorage) - scope-0 | |---divs: New For Each(false,false,false,false)[bag] - scope-42 | | | Cast[chararray] - scope-31 | | | |---Project[bytearray][0] - scope-30 | | | Cast[chararray] - scope-34 | | | |---Project[bytearray][1] - scope-33 | | | Cast[chararray] - scope-37 | | | |---Project[bytearray][2] - scope-36 | | | Cast[float] - scope-40 | | | |---Project[bytearray][3] - scope-39 | |---divs: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_dividends:org.apache.pig.builtin.PigStorage) - scope-29- {code} In the PIG-4797.patch, changes are: 1. change the spark plan in JoinOptimizerSpark. when LRA+GLA+PKG is only encountered in the spark plan, LRA+GLA+PKA will be changed to POJoinSpark 2. In JoinSparkConverter, RDD executes LocalRearrangeFunction which converts (Tuple) to Tuple2<IndexedKey,Tuple>, CoGroup,GroupPkgFunction which combines the action of group and package. Compare the performance: before the patch: join.pig uses 60 secs. After the patch:join.pig uses 46 secs. > Analyze JOIN performance and improve the same. > ---------------------------------------------- > > Key: PIG-4797 > URL: https://issues.apache.org/jira/browse/PIG-4797 > Project: Pig > Issue Type: Sub-task > Components: spark > Reporter: Pallavi Rao > Assignee: liyunzhang_intel > Labels: spork > Attachments: Join performance analysis.pdf, PIG-4797.patch > > > There are a big performance difference in join between spark and mr mode. > {code} > daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray, > date:chararray, open:float, high:float, low:float, > close:float, volume:int, adj_close:float); > divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray, > date:chararray, dividends:float); > jnd = join daily by (exchange, symbol), divs by (exchange, symbol); > store jnd into './join.out'; > {code} > join.sh > {code} > mode=$1 > start=$(date +%s) > ./pig -x $mode $PIG_HOME/bin/join.pig > end=$(date +%s) > execution_time=$(( $end - $start )) > echo "execution_time:"$excution_time > {code} > The execution time: > || |||mr||spark|| > |join|20 sec|79 sec| > You can download the test data NYSE_daily and NYSE_dividends in > https://github.com/alanfgates/programmingpig/blob/master/data/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)