Josh Wills created CRUNCH-237:
---------------------------------
Summary: Improper job dependencies for certain types of long pipelines
Key: CRUNCH-237
URL: https://issues.apache.org/jira/browse/CRUNCH-237
Project: Crunch
Issue Type: Bug
Components: Core
Affects Versions: 0.6.0
Reporter: Josh Wills
Assignee: Josh Wills
Fix For: 0.7.0
The Crunch planner analyzes the dependencies between different phases of a
MapReduce pipeline and uses those dependencies to ensure that the MapReduce
jobs in the pipeline are executed in the correct sequence. For certain kinds of
long pipelines, the planner can miss a necessary dependency, as in the
following case:
Pipeline spec: [Input] -> GBK -> [Out1] -> (GBK) -> (Out2) -> GBK -> [Out3]
This pipeline has two explicit outputs (Out1 and Out3) and one implicit output
(Out2). Additionally, assume that the map stage of the job that creates Out2
performs a map-side join against Out1. For this pipeline, the
planner will mark a dependency between the job that creates Out1 and the job
that creates Out3, but NOT between the job that creates Out1 and the job that
creates Out2. This makes it possible for the Out2 job to run before Out1 is
created, causing a failure.
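To make the gap concrete, here is a minimal toy model of the scenario, not the
planner's actual code. The job names (job_out1, job_out2, job_out3) and both
helper functions are hypothetical, and only the two edges mentioned above are
modelled. It compares the dependencies the planner records against the ones
the data requires and reports any that are not enforced.

```python
# Toy model only: job names and helpers are hypothetical, not Crunch APIs.

def transitive_deps(job, deps):
    """Return every job that `job` depends on, directly or indirectly."""
    seen = set()
    stack = list(deps.get(job, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(deps.get(dep, ()))
    return seen


def missing_dependencies(required, planned):
    """List (job, prerequisite) pairs the data needs but the plan does not enforce."""
    missing = []
    for job, prereqs in sorted(required.items()):
        enforced = transitive_deps(job, planned)
        for prereq in sorted(prereqs):
            if prereq not in enforced:
                missing.append((job, prereq))
    return missing


# Dependencies the planner records, as described in this report.
planned = {"job_out3": {"job_out1"}}

# Dependencies the data requires: the Out2 job reads Out1 in its map-side join.
required = {"job_out2": {"job_out1"}, "job_out3": {"job_out1"}}

gaps = missing_dependencies(required, planned)
print(gaps)  # [('job_out2', 'job_out1')]
```

The single reported pair is the unenforced edge: nothing in the plan stops the
Out2 job from starting before the Out1 job has finished.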
The easiest fix I could see was to add a step to the dependency chain so that
every job created in a later stage of pipeline creation depends on all of the
jobs from the earlier stages having run. The attached patch and example
integration test demonstrate this approach.
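The idea can be sketched in a few lines. This is an illustration under stated
assumptions, not the contents of the attached patch: the function name, the
job names, and the assignment of one job per stage are all hypothetical.

```python
# Illustrative sketch only: names and stage assignment are assumptions,
# not taken from the Crunch planner or the attached patch.

def add_stage_dependencies(stages):
    """Make each job depend on every job from all earlier stages.

    `stages` is a list of lists of job names, in order of creation.
    Returns a dict mapping each job to the set of jobs it must wait for.
    """
    deps = {}
    earlier = []  # all jobs seen in stages before the current one
    for stage in stages:
        for job in stage:
            deps[job] = set(earlier)
        earlier.extend(stage)
    return deps


# Assumed staging for the pipeline above: one job per stage.
stages = [["job_out1"], ["job_out2"], ["job_out3"]]
deps = add_stage_dependencies(stages)

for job in sorted(deps):
    print(job, sorted(deps[job]))
```

In this model the Out2 job now waits for the Out1 job, which closes the gap.
The rule is deliberately conservative: in the sketch, jobs from different
stages can never run concurrently, even when no data flows between them.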