[ https://issues.apache.org/jira/browse/PIG-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988910#comment-12988910 ]

Daniel Dai commented on PIG-1831:
---------------------------------

This issue is caused by a race condition on the static variable PigMapReduce.sJobConf in local mode. In local mode, all MapReduce jobs share a single JVM, and Pig overwrites the static variable PigMapReduce.sJobConf each time it launches a new MapReduce job. When multiple MapReduce jobs launch simultaneously, one job may use the configuration of another, which causes nondeterministic behavior.

Options to fix this issue are:
1. Force local mode to run MapReduce jobs sequentially, if there is a way.
2. Make sJobConf an array keyed by MapReduce job ID. However, some UDFs use sJobConf, so this could break backward compatibility.

> Indeterministic behavior in local mode due to static variable PigMapReduce.sJobConf
> -----------------------------------------------------------------------------------
>
>                 Key: PIG-1831
>                 URL: https://issues.apache.org/jira/browse/PIG-1831
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Vivek Padmanabhan
>            Assignee: Daniel Dai
>
> The below script when run in local mode gives me a different output. It looks like in local mode I have to store a relation obtained through streaming in order to use it afterwards.
> For example consider the below script:
> DEFINE MySTREAMUDF `test.sh`;
> A = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3, data4);
> B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);
> --STORE B into 'output.B';
> C = JOIN B by wId LEFT OUTER, A by myId;
> D = FOREACH C GENERATE B::wId,B::num,data4 ;
> D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);
> --STORE D into 'output.D';
> E = foreach B GENERATE wId,num;
> F = DISTINCT E;
> G = GROUP F ALL;
> H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
> I = CROSS D,H;
> STORE I into 'output.I';
>
> test.sh
> ---------
> #!/bin/bash
> cut -f1,3
>
> And input is
> abcd label1 11 feature1
> acbd label2 22 feature2
> adbc label3 33 feature3
>
> Here if I store relations B and D, then every time I get the result:
> acbd 3
> abcd 3
> adbc 3
>
> But if I don't store relations B and D, then I get an empty output. Here again I have observed that this behaviour is random, i.e. sometimes, like 1 out of 5 runs, there will be output.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
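
The race described in the comment above, and the keyed alternative from option 2, can be illustrated with a minimal Java sketch. Everything here is hypothetical: JobConfRaceSketch, sharedConf, confByJobId, launch, readShared and readKeyed are illustration names, not Pig's actual classes or fields, and plain strings stand in for job configurations. The interleaving is simulated deterministically (both jobs launch before the first job's task reads) rather than reproduced with real threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are Pig's actual classes or fields.
public class JobConfRaceSketch {

    // The problem: one static slot shared by every job in the JVM,
    // standing in for PigMapReduce.sJobConf.
    static volatile String sharedConf;

    // Option 2 from the comment: one configuration per job, keyed by job ID.
    static final Map<String, String> confByJobId = new ConcurrentHashMap<>();

    // Launching a job publishes its configuration both ways.
    static void launch(String jobId, String conf) {
        sharedConf = conf;              // overwrites whatever the previous job stored
        confByJobId.put(jobId, conf);   // leaves other jobs' entries untouched
    }

    // A task that reads the shared static sees whichever job launched last.
    static String readShared() {
        return sharedConf;
    }

    // A task that looks up by its own job ID sees its own configuration.
    static String readKeyed(String jobId) {
        return confByJobId.get(jobId);
    }

    public static void main(String[] args) {
        // Simulated interleaving: both jobs launch before job_1's task reads.
        launch("job_1", "conf-for-job_1");
        launch("job_2", "conf-for-job_2");

        System.out.println("job_1 via shared static: " + readShared());
        System.out.println("job_1 via keyed lookup:  " + readKeyed("job_1"));
    }
}
```

In this sketch, job_1 reading the shared static gets job_2's configuration, while the keyed lookup returns job_1's own. The keyed form also shows where the compatibility concern comes from: any caller that reads the static directly, as some UDFs do, would need to know its job ID to switch over.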