[jira] [Commented] (CRUNCH-218) Add new Target.WriteMode to skip the write and continue pipeline if an output target exists

Dave Beech (JIRA) Thu, 13 Jun 2013 05:29:56 -0700

    [ 
https://issues.apache.org/jira/browse/CRUNCH-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682178#comment-13682178
 ]


Dave Beech commented on CRUNCH-218:
-----------------------------------

Josh - I've found a problem. If the Crunch job fails, the output directories 
will already have been created, but they will be empty. When you restart the 
job, the pipeline sees these directories and skips the processing. Empty 
directories aren't the problem in themselves (a job may legitimately produce no 
data), but either way I want to checkpoint only on a successful pipeline run. 

This made me realise that Crunch output directories don't contain a "_SUCCESS" 
flag file the way traditional MapReduce job outputs do. Maybe this should be a 
separate JIRA. A success flag like this would solve the problem, because you'd 
then restart from a checkpoint path only if it exists and contains a file named 
"_SUCCESS". 
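
To make the proposed check concrete, here is a minimal sketch of the "exists and contains _SUCCESS" test. It is not Crunch code: the class and method names (CheckpointCheck, isCompleteCheckpoint) are made up for illustration, and it uses java.nio against a local directory so it runs standalone. Inside Crunch the same test would presumably go through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointCheck {
    // Marker file name, following the MapReduce convention mentioned above.
    static final String SUCCESS_MARKER = "_SUCCESS";

    /**
     * Hypothetical helper: a checkpoint counts as complete only if the
     * directory exists AND contains the marker file. An empty directory
     * left behind by a failed run is therefore not treated as a checkpoint.
     */
    static boolean isCompleteCheckpoint(Path outputDir) {
        return Files.isDirectory(outputDir)
            && Files.isRegularFile(outputDir.resolve(SUCCESS_MARKER));
    }

    public static void main(String[] args) throws IOException {
        // Simulate a failed run: directory created, nothing written.
        Path dir = Files.createTempDirectory("checkpoint");
        System.out.println(isCompleteCheckpoint(dir)); // prints "false"

        // Simulate a successful run: the marker is written last.
        Files.createFile(dir.resolve(SUCCESS_MARKER));
        System.out.println(isCompleteCheckpoint(dir)); // prints "true"
    }
}
```

The important part is the ordering: the marker would have to be written only after the pipeline run has succeeded, so that a directory without it is always treated as incomplete and regenerated.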
                
> Add new Target.WriteMode to skip the write and continue pipeline if an output 
> target exists
> -------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-218
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-218
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH-218b.patch, CRUNCH-218.patch
>
>
> Quite often I write pipelines which persist data to the filesystem midway 
> through the process, and then carry on doing further work. 
> If this intermediate data is already present, I think it would be good if I 
> could set a write mode which skips over this first half of processing. This 
> way I'd avoid running jobs unnecessarily and wasting cluster resources 
> regenerating data I already have. 
> Example:
> PCollection<B> inter =
>     pipeline.read(source).parallelDo(something).parallelDo(somethingElse);
> inter.write(At.sequenceFile("output"), WriteMode.SKIP_IF_EXISTS);
> PCollection<C> result = inter.parallelDo(moreWork);
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
