[ 
https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989739#comment-13989739
 ] 

Cheolsoo Park commented on PIG-3902:
------------------------------------

{quote}Presumably this happens elsewhere but I only ran the multiquery tests 
with my patch.{quote}
Here are unit test failures-
{code}
>>> org.apache.pig.test.TestGrunt.testIllustrate        0.23 sec        1
>>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobSimple       1 min 
>>> 27 sec    1
>>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobMultiStoreOnPred     
>>> 1 min 26 sec    1
>>> org.apache.pig.test.TestPigStorage.testSchemaConversion2    6.6 sec 1
>>> org.apache.pig.test.TestPigStorage.testSchemaConversion 
{code}

> PigServer creates cycle
> -----------------------
>
>                 Key: PIG-3902
>                 URL: https://issues.apache.org/jira/browse/PIG-3902
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1
>            Reporter: Jacob Perkins
>            Assignee: Cheolsoo Park
>             Fix For: 0.11.1
>
>         Attachments: broken-plan.png, cycle.diff, multiquery-cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan. 
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an 
> input path, and, if there's no path from the input to the output, make the 
> input a dependency of the output. However, PigServer orders the loads and 
> stores arbitrarily during that logic. Sometimes, in the code above, C is 
> correctly wired as a dependency of B and, since that creates a path from A to 
> D, A won't be made a dependency of D and we're good. On occasion though, the 
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To 
> be fair, it's not actually a cycle, since when A is wired to D, there's a 
> path between C and B so the cycle won't actually get created. But it's still 
> a broken plan.
> The offending PigServer code: 
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice 
> I had to use a store function that wouldn't check the output. If you're just 
> using PigStorage this won't be reproducible since you can't write to the same 
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using 
> org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 
> 'dbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in 
> mind for the specific use case that brought this up, but the general problem 
> is more difficult. 
> As an aside, there's other issues with the PigServer code referenced above. 
> For example, it should almost certainly be using the full path (after 
> LoadFunc/StoreFunc.relativeToAbsolutePath) no? Try storing to a relative path 
> then loading from the absolute representation of that path in the same 
> script... Also, why isn't it checking the FuncSpec as well as the location? 
> Just trying to open up the discussion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to