[ https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993831#comment-13993831 ]
Cheolsoo Park commented on PIG-3902:
------------------------------------

Great! I'll run unit tests with the new patch and commit it. Thank you Jacob!

> PigServer creates cycle
> -----------------------
>
>                 Key: PIG-3902
>                 URL: https://issues.apache.org/jira/browse/PIG-3902
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Jacob Perkins
>            Assignee: Cheolsoo Park
>             Fix For: 0.13.0
>
>         Attachments: broken-plan.png, cycle.diff, multiquery-20140509.diff,
> multiquery-cycle.diff, multiquery-cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan.
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an
> input path, and, if there's no path from the input to the output, make the
> input a dependency of the output. However, PigServer orders the loads and
> stores arbitrarily during that logic. Sometimes, in the code above, C is
> correctly wired as a dependency of B and, since that creates a path from A to
> D, A won't be made a dependency of D and we're good. On occasion though, the
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To
> be fair, it's not actually a cycle, since when A is wired to D, there's a
> path between C and B so the cycle won't actually get created. But it's still
> a broken plan.
> The offending PigServer code:
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice
> I had to use a store function that wouldn't check the output. If you're just
> using PigStorage this won't be reproducible since you can't write to the same
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using
>     org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver',
>     'jdbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in
> mind for the specific use case that brought this up, but the general problem
> is more difficult.
> As an aside, there are other issues with the PigServer code referenced above.
> For example, it should almost certainly be using the full path (after
> LoadFunc/StoreFunc.relativeToAbsolutePath), no? Try storing to a relative path
> and then loading from the absolute representation of that path in the same
> script... Also, why isn't it checking the FuncSpec as well as the location?
> Just trying to open up the discussion.
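
The ordering problem described in the report can be reproduced in miniature. The Python sketch below is not PigServer's code. It only assumes the wiring rule as the report states it (a store is connected to a load of the same path unless the load already reaches the store), and the names `has_path` and `wire` are hypothetical helpers made up for this illustration.

```python
from collections import defaultdict


def has_path(edges, src, dst):
    """Return True if dst is reachable from src by following edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges[node])
    return False


def wire(pairs):
    """Apply the assumed wiring rule to (store, load) pairs in the given order."""
    # Data-flow edges from the pseudocode: load A feeds store B, load C feeds store D.
    edges = defaultdict(list, {"A": ["B"], "C": ["D"]})
    for store, load in pairs:
        # Assumed rule: connect store -> load unless the load already reaches the store.
        if not has_path(edges, load, store):
            edges[store].append(load)
    # Drop nodes without outgoing edges so the result is easy to compare.
    return {node: targets for node, targets in edges.items() if targets}


# Store B and load C share path 'B'; store D and load A share path 'A'.
good = wire([("B", "C"), ("D", "A")])  # pair on path 'B' is examined first
bad = wire([("D", "A"), ("B", "C")])   # pair on path 'A' is examined first
print(good)
print(bad)
```

With the first ordering the plan keeps the script order A -> B -> C -> D. With the second ordering, D is wired in front of A, the B -> C edge is then skipped because C already reaches B, and the plan becomes C -> D -> A -> B: there is no cycle, but the load from 'B' runs before the store into 'B'.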