[ https://issues.apache.org/jira/browse/CRUNCH-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575523#comment-13575523 ]
Gabriel Reid commented on CRUNCH-132:
-------------------------------------
The behavior you described (needing to use the APPEND strategy on the second
call to Pipeline#write) makes a lot of sense to me, but I think it should be
applied consistently: the second call to Pipeline#write in your last example
should also fail unless APPEND is used, even though there's no call to
Pipeline#run in between.
Of course, this means that the second call to Pipeline#write could also use the
OVERWRITE strategy. That doesn't make sense, and it's hard to decide what the
correct behavior would be, because the situation isn't easy to detect with the
eager evaluation approach. I'm not sure how to get around that at the moment,
but I do think the behavior should be the same regardless of whether or not
Pipeline#run is called between calls to Pipeline#write.
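To make the consistency point concrete, here is a rough sketch of one way the
check could work. It is not taken from the patch: PendingWriteCheck, Mode, and
the write signature are hypothetical names, and rejecting OVERWRITE on a path
that is already targeted is only one possible policy, not something we've
agreed on. The idea is to remember every path passed to a write call, so the
check gives the same answer whether or not a run happened in between.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Crunch API.
final class PendingWriteCheck {
    enum Mode { DEFAULT, APPEND, OVERWRITE }

    // Paths already targeted by an earlier write call on this pipeline,
    // whether or not run has been called since.
    private final Set<String> targetedPaths = new HashSet<>();

    // existsOnDisk is passed in so the sketch needs no file system access.
    void write(String path, Mode mode, boolean existsOnDisk) {
        boolean alreadyTargeted = targetedPaths.contains(path);
        if (mode == Mode.DEFAULT && (existsOnDisk || alreadyTargeted)) {
            throw new IllegalStateException(
                "Output exists or is already targeted: " + path);
        }
        if (mode == Mode.OVERWRITE && alreadyTargeted) {
            // Assumed policy: overwriting output that this same pipeline
            // is about to produce is treated as an error.
            throw new IllegalStateException(
                "OVERWRITE would discard an earlier write to: " + path);
        }
        targetedPaths.add(path);
    }
}
```

With this, a second write to the same path only goes through when APPEND is
used, independent of any call to run.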
As far as the API itself goes, what do you think of calling the mode
enumeration "WriteMode" (with entries DEFAULT, APPEND, and OVERWRITE) instead
of calling it ExistingOutputStrategy? My gut feeling is that WriteMode is a bit
clearer for the public API, while ExistingOutputStrategy is better suited as
an internal name. What do you think?
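For illustration, the enumeration could look roughly like the sketch below.
Only the name WriteMode and its three entries come from the proposal above.
The Action enum, the resolve helper, and the exact meaning given to each entry
are assumptions on my part; in particular, having DEFAULT fail when the output
exists is my reading of the request to preserve standard MapReduce behavior.

```java
// Sketch only: WriteModeSketch, Action and resolve are hypothetical names.
final class WriteModeSketch {
    enum WriteMode { DEFAULT, APPEND, OVERWRITE }

    // What the pipeline would do with the output directory.
    enum Action { CREATE, FAIL, ADD_FILES, DELETE_THEN_CREATE }

    static Action resolve(WriteMode mode, boolean outputExists) {
        if (!outputExists) {
            // Nothing to conflict with, so every mode behaves the same.
            return Action.CREATE;
        }
        switch (mode) {
            case APPEND:
                return Action.ADD_FILES;           // keep existing part files
            case OVERWRITE:
                return Action.DELETE_THEN_CREATE;  // discard existing output
            default:
                return Action.FAIL;                // DEFAULT: refuse to start
        }
    }
}
```

A caller would pick the mode per write call, and the existence check would
decide whether the job starts, adds files, or replaces the directory.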
> Add configurable behavior for when a pipeline output directory already exists
> -----------------------------------------------------------------------------
>
> Key: CRUNCH-132
> URL: https://issues.apache.org/jira/browse/CRUNCH-132
> Project: Crunch
> Issue Type: Improvement
> Affects Versions: 0.4.0
> Reporter: Dave Beech
> Assignee: Josh Wills
> Attachments: CRUNCH-132.patch, CRUNCH-132-proto.patch
>
>
> Usually when you run a mapreduce job and the output directory already exists,
> the job fails (won't start). A Crunch job does run, but results in the output
> data being duplicated in the output directory with numbered files that follow
> on from the previous run.
> Example
> Run 1, single reducer /output -> /output/part-r-00000
> Run 2, single reducer /output -> /output/part-r-00000, /output/part-r-00001
> I didn't realise I'd run my job twice, so when I looked in the directory it
> seemed that there had been 2 reducers and somehow the output had been
> generated twice, which was confusing.
> I realise this may be by design, but it feels wrong to me. I'd prefer if the
> behaviour of a standard mapreduce job was preserved.