[
https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacob Perkins updated PIG-3902:
-------------------------------
Fix Version/s: 0.11.1
Assignee: Cheolsoo Park
Affects Version/s: 0.11.1
Status: Patch Available (was: Open)
This goes deep. I updated the LogicalPlanBuilder such that, when building a
load op, it checks the plan that's been generated "so far" for stores that
match the location the load references. If it finds any, it links the load to
them.
Some of the multiquery compilation code exploited the _bug_ that
plan.getSinks() always returned the stores. You could get away with that before
the "postProcess" method on PigServer got called because loads were not yet
linked to the stores they depended on. Hence, all the stores were leaves of the
plan. However, with this patch that bug exploitation will no longer work. In
order to get the stores you'll have to get the operators of the plan explicitly
and create a list of stores. Presumably other code relies on the same
assumption, but I only ran the multiquery tests with my patch.
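To illustrate the point about getSinks(), here is a minimal sketch on a toy plan. It is not Pig's LogicalPlan or LogicalPlanBuilder API: ToyPlan, Op, Kind, add, connect and getStores are hypothetical names made up for this example, and only the name getSinks is borrowed from the comment above. It assumes a load is linked to any earlier store with the same location at build time, as the patch is described as doing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a logical plan (hypothetical, NOT Pig's real classes).
class ToyPlan {
    enum Kind { LOAD, STORE, OTHER }

    static final class Op {
        final String name;
        final Kind kind;
        final String location; // only meaningful for LOAD and STORE
        Op(String name, Kind kind, String location) {
            this.name = name;
            this.kind = kind;
            this.location = location;
        }
        @Override public String toString() { return name; }
    }

    // Edge op -> successor means "successor depends on op".
    private final Map<Op, List<Op>> successors = new LinkedHashMap<>();

    // Adds an operator. A load is linked to every store already in the
    // plan that writes to the same location (assumed patch behaviour).
    Op add(String name, Kind kind, String location) {
        Op op = new Op(name, kind, location);
        if (kind == Kind.LOAD) {
            for (Op other : successors.keySet()) {
                if (other.kind == Kind.STORE && other.location.equals(location)) {
                    successors.get(other).add(op);
                }
            }
        }
        successors.put(op, new ArrayList<>());
        return op;
    }

    void connect(Op from, Op to) { successors.get(from).add(to); }

    // Leaves of the plan: operators with no successors.
    List<Op> getSinks() {
        List<Op> sinks = new ArrayList<>();
        for (Map.Entry<Op, List<Op>> e : successors.entrySet()) {
            if (e.getValue().isEmpty()) sinks.add(e.getKey());
        }
        return sinks;
    }

    // Explicit scan over all operators; does not assume stores are leaves.
    List<Op> getStores() {
        List<Op> stores = new ArrayList<>();
        for (Op op : successors.keySet()) {
            if (op.kind == Kind.STORE) stores.add(op);
        }
        return stores;
    }

    public static void main(String[] args) {
        ToyPlan plan = new ToyPlan();
        Op a = plan.add("A", Kind.LOAD, "in");
        Op b = plan.add("B", Kind.STORE, "tmp");
        plan.connect(a, b);
        Op c = plan.add("C", Kind.LOAD, "tmp"); // linked to store B on creation
        Op d = plan.add("D", Kind.STORE, "out");
        plan.connect(c, d);
        System.out.println(plan.getSinks());  // [D]
        System.out.println(plan.getStores()); // [B, D]
    }
}
```

Once store B has a dependent load, it stops being a leaf, so code that treats the sinks as "all the stores" silently loses it. Scanning every operator for stores does not depend on when the linking happens.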
> PigServer creates cycle
> -----------------------
>
> Key: PIG-3902
> URL: https://issues.apache.org/jira/browse/PIG-3902
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11.1
> Reporter: Jacob Perkins
> Assignee: Cheolsoo Park
> Fix For: 0.11.1
>
> Attachments: broken-plan.png, cycle.diff, multiquery-cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan.
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an
> input path, and, if there's no path from the input to the output, make the
> input a dependency of the output. However, PigServer orders the loads and
> stores arbitrarily during that logic. Sometimes, in the code above, C is
> correctly wired as a dependency of B and, since that creates a path from A to
> D, A won't be made a dependency of D and we're good. On occasion though, the
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To
> be fair, it's not actually a cycle, since when A is wired to D, there's a
> path between C and B so the cycle won't actually get created. But it's still
> a broken plan.
> The offending PigServer code:
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice
> I had to use a store function that wouldn't check the output. If you're just
> using PigStorage this won't be reproducible since you can't write to the same
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using
> org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver',
> 'jdbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in
> mind for the specific use case that brought this up, but the general problem
> is more difficult.
> As an aside, there are other issues with the PigServer code referenced above.
> For example, it should almost certainly be using the full path (after
> LoadFunc/StoreFunc.relativeToAbsolutePath), no? Try storing to a relative path
> then loading from the absolute representation of that path in the same
> script... Also, why isn't it checking the FuncSpec as well as the location?
> Just trying to open up the discussion.
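The order dependence described in the issue can be reproduced on a toy graph. The sketch below is not the PigServer code: WiringOrderDemo, basePlan, pathExists and wire are hypothetical names, and the rule it applies (link a store to a load with the same location unless a path already runs from the load to the store) is only my reading of the description above. Nodes A, B, C and D are the operators from the pseudocode.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the post-processing step (hypothetical, NOT Pig's code).
class WiringOrderDemo {
    // Plan before wiring: A -> B and C -> D. Edge u -> v means v depends on u.
    static Map<String, Set<String>> basePlan() {
        Map<String, Set<String>> g = new LinkedHashMap<>();
        for (String n : Arrays.asList("A", "B", "C", "D")) {
            g.put(n, new LinkedHashSet<>());
        }
        g.get("A").add("B"); // load A feeds store B
        g.get("C").add("D"); // load C feeds store D
        return g;
    }

    // Depth-first search for a directed path from 'from' to 'to'.
    static boolean pathExists(Map<String, Set<String>> g, String from, String to) {
        Deque<String> stack = new ArrayDeque<>(Arrays.asList(from));
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (n.equals(to)) return true;
            if (seen.add(n)) stack.addAll(g.get(n));
        }
        return false;
    }

    // Each pair is {store, load} sharing a location. The load is made a
    // dependency of the store unless a path from load to store exists.
    static Map<String, Set<String>> wire(List<String[]> pairs) {
        Map<String, Set<String>> g = basePlan();
        for (String[] p : pairs) {
            String store = p[0];
            String load = p[1];
            if (!pathExists(g, load, store)) {
                g.get(store).add(load);
            }
        }
        return g;
    }

    public static void main(String[] args) {
        String[] bc = {"B", "C"}; // store B and load C share location 'B'
        String[] da = {"D", "A"}; // store D and load A share location 'A'
        System.out.println(wire(Arrays.asList(bc, da))); // {A=[B], B=[C], C=[D], D=[]}
        System.out.println(wire(Arrays.asList(da, bc))); // {A=[B], B=[], C=[D], D=[A]}
    }
}
```

The first ordering gives the intended chain A, B, C, D. The second gives C, D, A, B: no cycle, but load C now runs before the store B that is supposed to produce its input.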