[ https://issues.apache.org/jira/browse/CRUNCH-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800667#comment-13800667 ]
Gabriel Reid commented on CRUNCH-284:
-------------------------------------
Yep, definitely a valid use case, although I do think that the onlyOnce in the
context of a write is pretty unintuitive. Another option would be to add
another WriteMode.CHECKPOINT alternative that specifies the read-once
semantics, but this would probably be at least equally unintuitive.
I'll try to think of another option, but I do think it's pretty important that
we allow reading checkpointed files with read-once semantics.
> Optimize for minimal disk i/o rather than the number of stages?
> ---------------------------------------------------------------
>
> Key: CRUNCH-284
> URL: https://issues.apache.org/jira/browse/CRUNCH-284
> Project: Crunch
> Issue Type: Bug
> Reporter: Chao Shi
>
> I have a pipeline as follows:
> PCollection in = pipeline.read(...)
> PCollection part1 = f1(in)
> PCollection part2 = f2(in)
> pipeline.write(part1.groupByKey...)
> pipeline.write(part2.groupByKey...)
> where f1 extracts a small portion from "in" and f2 returns the rest. Crunch
> optimizes the pipeline into two independent MR jobs, both of which fully read
> the input.
> I think the ideal plan would be a map-only job that reads the input and splits
> it into two outputs, followed by two MR jobs that each read one of them.
> The problem is that Crunch minimizes the number of MR stages, which is
> optimal for most cases, but not optimal in this case.
> What do you think of this, folks?
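
To make the two plans in the description concrete, here is a minimal sketch in plain Java. It is not Crunch API and not how the planner is implemented: the class SplitOnceSketch, the scan/group helpers, and the first-character grouping are all hypothetical stand-ins, and the sketch only counts how many records are pulled from the original input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Plain-Java model of the two plans; none of this is Crunch API.
final class SplitOnceSketch {
    // Number of records pulled from the simulated input so far.
    static int recordsRead = 0;

    // Simulates one full scan of the input: every call re-reads all records.
    static List<String> scan(List<String> input) {
        recordsRead += input.size();
        return new ArrayList<>(input);
    }

    // Hypothetical stand-in for groupByKey: groups records by first character.
    static Map<Character, List<String>> group(List<String> records) {
        Map<Character, List<String>> groups = new TreeMap<>();
        for (String r : records) {
            groups.computeIfAbsent(r.charAt(0), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    // Plan A: two independent jobs, each scanning the whole input.
    static List<Map<Character, List<String>>> twoJobs(List<String> input, Predicate<String> f1) {
        List<String> part1 = new ArrayList<>();
        for (String r : scan(input)) {
            if (f1.test(r)) part1.add(r);
        }
        List<String> part2 = new ArrayList<>();
        for (String r : scan(input)) {
            if (!f1.test(r)) part2.add(r);
        }
        return Arrays.asList(group(part1), group(part2));
    }

    // Plan B: one map-only pass splits the input, then each part is grouped.
    static List<Map<Character, List<String>>> splitThenGroup(List<String> input, Predicate<String> f1) {
        List<String> part1 = new ArrayList<>();
        List<String> part2 = new ArrayList<>();
        for (String r : scan(input)) {
            (f1.test(r) ? part1 : part2).add(r);
        }
        return Arrays.asList(group(part1), group(part2));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("apple", "avocado", "banana", "blueberry", "cherry");
        Predicate<String> f1 = r -> r.startsWith("a"); // the "small portion"

        recordsRead = 0;
        twoJobs(input, f1);
        System.out.println("two independent jobs read " + recordsRead + " input records");

        recordsRead = 0;
        splitThenGroup(input, f1);
        System.out.println("split-then-group read " + recordsRead + " input records");
    }
}
```

On the five-record sample, the first plan pulls 10 records from the input and the second pulls 5, with identical grouped results. The sketch deliberately ignores the cost of writing the two intermediate outputs and reading them back, which a real map-only split job would pay, so whether the second plan wins overall depends on the workload.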