[ https://issues.apache.org/jira/browse/CRUNCH-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575523#comment-13575523 ]

Gabriel Reid commented on CRUNCH-132:
-------------------------------------

I think the behavior you described (needing to use the APPEND strategy on the
second call to Pipeline#write) actually makes a lot of sense, although I think
it would be better to be consistent about it: the second call to Pipeline#write
in your last example should fail unless APPEND is used, even though there is no
call to Pipeline#run in between.

Of course, this means that the second call to Pipeline#write could also use the
OVERWRITE strategy. That combination doesn't make sense, and it is hard to
decide what the correct behavior should be, because the situation is not easy
to detect with the eager evaluation approach. I'm not sure how to get around
that at the moment, but I do think it would be good to behave consistently
regardless of whether Pipeline#run is called between calls to Pipeline#write.

As far as the API itself goes, what do you think of calling the mode
enumeration "WriteMode" (with entries DEFAULT, APPEND, and OVERWRITE) instead
of ExistingOutputStrategy? My gut feeling is that WriteMode is a bit clearer
for the public API, while ExistingOutputStrategy is better suited as an
internal name.
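
To make the proposal concrete, here is a rough, self-contained sketch of the
enum and of the "consistent" check described above. It is only an illustration
under assumptions: SketchPipeline, its fields, and the String-based write
method are hypothetical stand-ins, not the Crunch Pipeline API or the attached
patch. Only the name WriteMode and its three entries come from this comment.
The sketch rejects DEFAULT whenever the path is already occupied, whether or
not a run happened in between, and deliberately leaves open what a second
write with OVERWRITE should do.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class WriteModeSketch {

    /** Proposed public name for the mode enumeration. */
    enum WriteMode { DEFAULT, APPEND, OVERWRITE }

    /** Hypothetical stand-in for a pipeline; NOT the Crunch Pipeline class. */
    static class SketchPipeline {
        // Simulated set of output paths that exist before the pipeline starts.
        private final Set<String> existingOnDisk;
        // Paths targeted by earlier write() calls, whether or not run() was called.
        private final Set<String> alreadyTargeted = new HashSet<>();

        SketchPipeline(Set<String> existingOnDisk) {
            this.existingOnDisk = new HashSet<>(existingOnDisk);
        }

        void write(String path, WriteMode mode) {
            boolean occupied = existingOnDisk.contains(path)
                    || alreadyTargeted.contains(path);
            // DEFAULT keeps the standard MapReduce behavior: fail if output exists.
            if (occupied && mode == WriteMode.DEFAULT) {
                throw new IllegalStateException(
                        "Output already exists or is already targeted: " + path);
            }
            // APPEND and OVERWRITE are accepted here; how a second OVERWRITE
            // should behave is the open question and is not decided here.
            alreadyTargeted.add(path);
        }
    }

    public static void main(String[] args) {
        SketchPipeline pipeline = new SketchPipeline(Collections.emptySet());
        pipeline.write("/output", WriteMode.DEFAULT);  // first write: accepted
        pipeline.write("/output", WriteMode.APPEND);   // second write: accepted
        try {
            pipeline.write("/output", WriteMode.DEFAULT); // no run() in between
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Tracking targeted paths inside the pipeline object is just one way to make the
check independent of Pipeline#run; it does not address detecting the OVERWRITE
case under eager evaluation.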
                
> Add configurable behavior for when a pipeline output directory already exists
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-132
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-132
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-132.patch, CRUNCH-132-proto.patch
>
>
> Usually when you run a mapreduce job and the output directory already exists,
> the job fails (won't start). A Crunch job does run, but results in the output
> data being duplicated in the output directory with numbered files that follow
> on from the previous run.
>
> Example:
> Run 1, single reducer /output -> /output/part-r-00000
> Run 2, single reducer /output -> /output/part-r-00000, /output/part-r-00001
>
> I didn't realise I'd run my job twice, so when I looked in the directory it
> seemed that there had been 2 reducers and somehow the output had been
> generated twice, which was confusing.
>
> I realise this may be by design, but it feels wrong to me. I'd prefer if the
> behaviour of a standard mapreduce job was preserved.

