[
https://issues.apache.org/jira/browse/CRUNCH-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534727#comment-13534727
]
Matthias Friedrich commented on CRUNCH-128:
-------------------------------------------
A bit of reasoning so that my opinion doesn't sound too arbitrary (it's still
a matter of taste to some degree): I'm always reluctant to add new
responsibilities to a class or interface because that increases complexity,
both for users and in the implementation. In this case, we don't add new
responsibilities, we just improve something that already exists. The number of
methods on a class alone doesn't increase complexity (see our rather large
static factories), so it's not my primary concern; the number of
responsibilities is a better metric. I hope that makes sense :-)
As for the ParallelDoOperation class: How about renaming it to
ParallelDoOptions and having method signatures like this: parallelDo(String,
DoFn, PType, ParallelDoOptions)? This would make it clear that it's the most
general parallelDo and it could soak up additional options so we don't need any
new parallelDo() methods in the future.
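
To make the suggestion concrete, here is a minimal sketch of what such an
options object could look like. Everything in it is an assumption for
illustration: the builder, the requires() method, and the use of plain path
strings in place of SourceTarget instances are hypothetical, and the DoFn and
PType arguments are left out so the example compiles on its own. It is not the
patch attached to this issue.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class ParallelDoOptionsSketch {

    /** Hypothetical options holder; field and method names are illustrative only. */
    static final class ParallelDoOptions {
        private final Set<String> requiredTargets;

        private ParallelDoOptions(Set<String> requiredTargets) {
            // Defensive copy so later builder changes cannot leak into built options.
            this.requiredTargets =
                Collections.unmodifiableSet(new LinkedHashSet<>(requiredTargets));
        }

        Set<String> getRequiredTargets() {
            return requiredTargets;
        }

        static Builder builder() {
            return new Builder();
        }

        static final class Builder {
            private final Set<String> requiredTargets = new LinkedHashSet<>();

            /** Records a target (a path string here) that must exist before the stage runs. */
            Builder requires(String targetPath) {
                requiredTargets.add(targetPath);
                return this;
            }

            ParallelDoOptions build() {
                return new ParallelDoOptions(requiredTargets);
            }
        }
    }

    /** Stand-in for the most general parallelDo overload; it only describes the call. */
    static String parallelDo(String name, ParallelDoOptions options) {
        return name + " requires " + options.getRequiredTargets();
    }

    public static void main(String[] args) {
        ParallelDoOptions options = ParallelDoOptions.builder()
            .requires("/tmp/checkpoint")
            .build();
        System.out.println(parallelDo("join", options));
    }
}
```

A new option would then become another builder method rather than another
parallelDo() overload, which is the point of the proposal.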
> Allow one stage of an MR pipeline to depend on another target being created
> ---------------------------------------------------------------------------
>
> Key: CRUNCH-128
> URL: https://issues.apache.org/jira/browse/CRUNCH-128
> Project: Crunch
> Issue Type: Improvement
> Reporter: Josh Wills
> Attachments: CheckpointingIT.java, CRUNCH-128.patch,
> CRUNCH-128v2.patch, CRUNCH-128-with-op.patch
>
>
> There are a couple of problems (e.g., mapside-joins, total orderings, etc.)
> where we need to guarantee that one PCollection has been written to the
> FileSystem before another MapReduce pipeline that depends on that file is
> allowed to run. This doesn't fit cleanly into the current set of abstractions
> for Crunch, which is why we force pipelines to execute via the run command to
> guarantee that the files have been created before the second stage is run.
> We should add the ability for a particular PCollection to require that a
> SourceTarget instance has been created before it can be executed, and the
> planner should incorporate this information into the MR pipeline planning
> process.
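
The ordering constraint described above can be sketched as a small scheduling
loop. This is only an illustration under stated assumptions, not how the Crunch
planner works: the Stage class, the plan() method, and the path strings are
hypothetical, and every required target is assumed to be produced by some stage
in the list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TargetDependencySketch {

    /** Hypothetical stage: writes one target and may require others to exist first. */
    static final class Stage {
        final String name;
        final String produces;
        final Set<String> requires;

        Stage(String name, String produces, Set<String> requires) {
            this.name = name;
            this.produces = produces;
            this.requires = requires;
        }
    }

    /** Returns stage names in an order where every required target is written first. */
    static List<String> plan(List<Stage> stages) {
        List<Stage> pending = new ArrayList<>(stages);
        Set<String> written = new HashSet<>();
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            // Pick the first pending stage whose required targets already exist.
            Stage next = null;
            for (Stage s : pending) {
                if (written.containsAll(s.requires)) {
                    next = s;
                    break;
                }
            }
            if (next == null) {
                throw new IllegalStateException("unsatisfiable or cyclic target dependencies");
            }
            pending.remove(next);
            written.add(next.produces);
            order.add(next.name);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Stage> stages = new ArrayList<>();
        stages.add(new Stage("join", "/out/joined", Collections.singleton("/tmp/small-side")));
        stages.add(new Stage("materialize", "/tmp/small-side", Collections.<String>emptySet()));
        System.out.println(plan(stages)); // prints [materialize, join]
    }
}
```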