[jira] [Commented] (PIG-5052) Initialize MRConfiguration.JOB_ID in spark mode correctly
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15686194#comment-15686194 ]

Adam Szita commented on PIG-5052:
---------------------------------

Thanks for the review and for committing, [~kellyzly] and [~xuefuz]. I'll mark this as resolved then.

> Initialize MRConfiguration.JOB_ID in spark mode correctly
> ---------------------------------------------------------
>
>                 Key: PIG-5052
>                 URL: https://issues.apache.org/jira/browse/PIG-5052
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Adam Szita
>             Fix For: spark-branch
>
>         Attachments: PIG-5052.2.patch, PIG-5052.3-incrementalToPatch1.patch, PIG-5052.3.patch, PIG-5052.patch
>
> Currently we initialize MRConfiguration.JOB_ID in SparkUtil#newJobConf and just set the value to a random string:
> {code}
> jobConf.set(MRConfiguration.JOB_ID, UUID.randomUUID().toString());
> {code}
> We need to find a Spark API to initialize it correctly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685739#comment-15685739 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~xuefuz]: thanks for committing; please close this JIRA.
[~szita]: thanks for the patch.
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685729#comment-15685729 ]

Xuefu Zhang commented on PIG-5052:
----------------------------------

PIG-5052.3-incrementalToPatch1.patch is committed. Shall we close this ticket?
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685542#comment-15685542 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~szita]: PIG-5052.3-incrementalToPatch1.patch looks good to me, +1.
[~xuefuz]: please check in PIG-5052.3-incrementalToPatch1.patch. How to use it:
1. Check out the latest code (the latest commit is 815a0f2).
2. patch -p1 < PIG-5052.3-incrementalToPatch1.patch
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683043#comment-15683043 ]

Adam Szita commented on PIG-5052:
---------------------------------

Attached new patches with adjusted code style:
- [^PIG-5052.3.patch] - can be used if we revert the first patch submitted ([^PIG-5052.patch])
- [^PIG-5052.3-incrementalToPatch1.patch] - is the incremental diff to [^PIG-5052.patch], in case we don't want to revert

My suggestion is to use [^PIG-5052.3.patch].
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682488#comment-15682488 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~kexianda], [~mohitsabharwal] and [~xuefuz]: [jenkins|https://builds.apache.org/job/Pig-spark/lastUnsuccessfulBuild/] currently fails. After I tried the patch from [~szita], all unit tests pass in my local jenkins.
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676015#comment-15676015 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~szita]: +1, but please regenerate the patch according to the latest code style.
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645119#comment-15645119 ]

Adam Szita commented on PIG-5052:
---------------------------------

You can try the following:
{code}
./pig -x spark_local
A = LOAD '../test/org/apache/pig/test/data/passwd' using PigStorage();
dump A
dump A
{code}
The second dump will hang for me. The reason is that jobs 0 and 1 are both returned (because they use the same job group id) in JobGraphBuilder#225:
{code}
sparkContext.statusTracker().getJobIdsForGroup(jobGroupID)
{code}
..but JobMetricsListener will only have job 1 here in finishedJobIds:
{code}
public synchronized boolean waitForJobToEnd(int jobId) throws InterruptedException {
    if (finishedJobIds.contains(jobId)) {
        finishedJobIds.remove(jobId);
        return true;
    }
    wait();
    return false;
}
{code}
so we will never see job 0 finish after the second dump, yet we expect to. On top of this, I think it's a clearer approach to use different job group IDs for different jobs.
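To make the failure mode concrete, here is a minimal plain-Java sketch (no Spark dependency) of why the second dump hangs. The names are illustrative: {{trackerJobIds}} stands in for the result of {{sparkContext.statusTracker().getJobIdsForGroup(jobGroupID)}}, and {{finishedJobIds}} for JobMetricsListener's bookkeeping.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class JobGroupMismatch {

    // True if waitForJobToEnd(jobId) would block on a tracker-reported job
    // that the listener no longer remembers as finished.
    static boolean wouldHang(int[] trackerJobIds, Set<Integer> finishedJobIds) {
        for (int jobId : trackerJobIds) {
            if (!finishedJobIds.contains(jobId)) {
                return true; // waitForJobToEnd(jobId) would wait() here forever
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Both dumps share one job group, so the tracker reports jobs 0 and 1,
        // but job 0's "finished" entry was already consumed by the first dump.
        int[] trackerJobIds = {0, 1};
        Set<Integer> finishedJobIds = new HashSet<>(Collections.singleton(1));
        System.out.println(wouldHang(trackerJobIds, finishedJobIds)); // prints true
    }
}
```

With a distinct job group id per Pig job, the tracker would report only job 1 for the second dump and no wait on job 0 would occur.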
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644944#comment-15644944 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~szita]: sorry for the slightly late reply.
{quote}
This can be seen by just repeating the same pig query (e.g. load, foreach, dump, dump) - the second job will hang in SparkStatsUtil#waitForJobAddStats. The reason is that JobGraphBuilder#getJobIDs will return all jobs associated with the same groupID, in the case above 0 and 1. Then it will wait for job 0 to finish but that's no longer in sparkContext, it was the previous job.
{quote}
I don't quite understand: the following script with a repeated query (load) runs successfully:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput1.txt' as (id,name);
store A into './duplicate.out.A';
store B into './duplicate.out.B';
explain A;
{code}
Can you give a script that shows the failure you mentioned above?
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636461#comment-15636461 ]

Adam Szita commented on PIG-5052:
---------------------------------

[~kellyzly] I found that it is indeed problematic to have the same ID set as jobGroupId across multiple Pig on Spark jobs; this commit has actually introduced a bug because of it.
This can be seen by just repeating the same pig query (e.g. load, foreach, dump, dump) - the second job will hang in SparkStatsUtil#waitForJobAddStats. The reason is that JobGraphBuilder#getJobIDs will return all jobs associated with the same groupID, in the case above 0 and 1. Then it will wait for job 0 to finish, but that's no longer in sparkContext - it was the previous job.
So I think we should do something like in [^PIG-5052.2.patch]: we can combine the appId provided by sparkContext with a random UUID.
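A minimal sketch of the direction described here: combine the (stable) application id with a fresh UUID so that two jobs submitted through one SparkContext never share an id. The class and method names are illustrative, and the appId literal is made up; in Pig the real value would come from {{sparkContext.getConf().getAppId()}}.

```java
import java.util.UUID;

public class JobIdBuilder {

    // Derive a per-job id: stable application-id prefix + unique random suffix.
    static String newJobId(String appId) {
        return appId + "_" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String appId = "app-20161104-0001"; // made-up; real value comes from Spark
        String a = newJobId(appId);
        String b = newJobId(appId);
        System.out.println(a.startsWith(appId)); // prints true
        System.out.println(a.equals(b));         // prints false: each job gets a fresh id
    }
}
```

This keeps the application id recoverable from the job id while still distinguishing successive jobs in the same context.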
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635000#comment-15635000 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~xuefuz]: thanks for your commit. Please also commit PIG-5051.
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631391#comment-15631391 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~szita]: thanks for the review.
{quote}
I think sparkContext.getConf().getAppId() will return the same value for the same spark context. That means (since we're not creating a new spark context every time we run a job) that more jobs will get the same ID. Would that still be fine for our use cases (e.g. org.apache.pig.builtin.RANDOM#exec)?
{quote}
Currently in Pig on Spark, in most cases one physical plan is converted to one Spark job, except in a multiquery case like:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = foreach A generate id,name,RANDOM();
C = foreach A generate name,n,RANDOM();
store B into './multiQ.1.out';
store C into './multiQ.2.out';
explain B;
{code}
{code}
Spark node scope-36
Split - scope-42
|   |
|   B: Store(hdfs://zly1.sh.intel.com:8020/user/root/multiQ.1.out:org.apache.pig.builtin.PigStorage) - scope-26
|   |
|   |---B: New For Each(false,false,false)[bag] - scope-25
|   |   |
|   |   Project[bytearray][0] - scope-20
|   |   |
|   |   Project[bytearray][1] - scope-22
|   |   |
|   |   POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-24
|   |
|   C: Store(hdfs://zly1.sh.intel.com:8020/user/root/multiQ.2.out:org.apache.pig.builtin.PigStorage) - scope-35
|   |
|   |---C: New For Each(false,false,false)[bag] - scope-34
|   |   |
|   |   Project[bytearray][1] - scope-29
|   |   |
|   |   Project[bytearray][2] - scope-31
|   |   |
|   |   POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-33
|
|---A: New For Each(false,false,false)[bag] - scope-16
    |   |
    |   Project[bytearray][0] - scope-10
    |   |
    |   Project[bytearray][1] - scope-12
    |   |
    |   Project[bytearray][2] - scope-14
    |
    |---A: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-9
{code}
This multiquery case generates two Spark jobs, but they have the same application id.
What you pointed out is really a good catch, but I think it will *not* influence the output of RANDOM#exec: in a multiquery case the MR job id is closer to the application id in Spark, because the multiquery case above would generate only one MR job.
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630611#comment-15630611 ]

Adam Szita commented on PIG-5052:
---------------------------------

Just one remark: I think sparkContext.getConf().getAppId() will return the same value for the same Spark context. That means (since we're not creating a new Spark context every time we run a job) that multiple jobs will get the same ID. Would that still be fine for our use cases (e.g. org.apache.pig.builtin.RANDOM#exec)?
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627744#comment-15627744 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~xuefuz]: please check in PIG-5052.patch.
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627352#comment-15627352 ]

Xianda Ke commented on PIG-5052:
--------------------------------

LGTM. +1 (non-binding)
[ https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624409#comment-15624409 ]

liyunzhang_intel commented on PIG-5052:
---------------------------------------

[~szita]: thanks for your useful code.
[~kexianda]: please help review PIG-5052.patch. Modification:
1. Get the jobGroupId of the Spark application via the Spark API and store the value in the job configuration; we use this value in org.apache.pig.builtin.RANDOM#exec.