[ 
https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3902:
-------------------------------

    Status: Open  (was: Patch Available)

[~thedatachef], your patch breaks the following tests. The area of code that 
you're modifying seems to affect multi query optimization. Can you take a look?
{code}
org.apache.pig.test.TestMultiQueryBasic.testMultiQueryWithCoGroup_2                   3.7 sec   1
org.apache.pig.test.TestMultiQueryCompiler.testMultiQueryWithCoGroup                  0.13 sec  1
org.apache.pig.test.TestMultiQueryCompiler.testMultiQueryWithIntermediateStores       0.1 sec   1
org.apache.pig.test.TestMultiQueryCompiler.testStoreOrder                             0.18 sec  1
org.apache.pig.test.TestMultiQueryCompiler.testUnnecessaryStoreRemoval                0.13 sec  1
org.apache.pig.test.TestMultiQueryCompiler.testUnnecessaryStoreRemovalCollapseSplit   0.14 sec  1
org.apache.pig.test.TestMultiQueryLocal.testStoreOrder
{code}
Canceling the patch for now.

> PigServer creates cycle
> -----------------------
>
>                 Key: PIG-3902
>                 URL: https://issues.apache.org/jira/browse/PIG-3902
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jacob Perkins
>         Attachments: broken-plan.png, cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan. 
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an 
> input path, and, if there's no path from the input to the output, make the 
> input a dependency of the output. However, PigServer orders the loads and 
> stores arbitrarily during that logic. Sometimes, in the code above, C is 
> correctly wired as a dependency of B and, since that creates a path from A to 
> D, A won't be made a dependency of D and we're good. On occasion though, the 
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To 
> be fair, it's not actually a cycle: once A is wired to D, there's a path 
> from C to B, so C is never made a dependency of B and the cycle doesn't get 
> created. But it's still a broken plan.
> The offending PigServer code: 
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice 
> I had to use a store function that wouldn't check the output. If you're just 
> using PigStorage this won't be reproducible since you can't write to the same 
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using 
> org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 
> 'jdbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in 
> mind for the specific use case that brought this up, but the general problem 
> is more difficult. 
> As an aside, there are other issues with the PigServer code referenced above. 
> For example, it should almost certainly be using the full path (after 
> LoadFunc/StoreFunc.relativeToAbsolutePath), no? Try storing to a relative path 
> then loading from the absolute representation of that path in the same 
> script... Also, why isn't it checking the FuncSpec as well as the location? 
> Just trying to open up the discussion.
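
The order dependence described above can be illustrated with a small simulation. This is only a sketch of the rule as the description states it (wire a load to a store when their locations match and no path exists from the load to the store); it is not the actual PigServer code. The node names `load_A`, `store_B`, `load_C`, `store_D` and the helpers `path_exists` and `wire` are hypothetical, chosen to mirror the pseudocode in the description.

```python
# Simplified model of the load/store wiring rule from the issue description.
# All names here are illustrative assumptions, not PigServer identifiers.

# Data-flow edges that come from the script itself (load -> ... -> store).
SCRIPT_EDGES = {"load_A": ["store_B"], "load_C": ["store_D"]}

# Location each load reads from or each store writes to.
LOCATION = {"load_A": "A", "store_B": "B", "load_C": "B", "store_D": "A"}


def path_exists(edges, src, dst):
    """Return True if dst is reachable from src along directed edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


def wire(stores, loads):
    """Apply the wiring rule in the given iteration order.

    Returns the (store, load) dependency edges that were added.
    """
    edges = {node: list(targets) for node, targets in SCRIPT_EDGES.items()}
    added = []
    for store in stores:
        for load in loads:
            same_location = LOCATION[store] == LOCATION[load]
            if same_location and not path_exists(edges, load, store):
                # The load must run after the store that writes its input.
                edges.setdefault(store, []).append(load)
                added.append((store, load))
    return added


loads = ["load_A", "load_C"]
good = wire(["store_B", "store_D"], loads)  # B -> C is wired first
bad = wire(["store_D", "store_B"], loads)   # D -> A is wired first
print("store_B visited first:", good)
print("store_D visited first:", bad)
```

In this model, visiting `store_B` first yields the intended chain A, B, C, D, while visiting `store_D` first wires `load_A` after `store_D` and then skips the B-to-C edge because a path from `load_C` to `store_B` already exists. Neither ordering produces a cycle, but the second one runs the two halves of the script in the wrong order.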



--
This message was sent by Atlassian JIRA
(v6.2#6252)
