[ 
https://issues.apache.org/jira/browse/OOZIE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Roling updated OOZIE-2223:
------------------------------
    Attachment: OOZIE-2223-2.patch

Uploading new patch

> Improve documentation with regard to Java action retries
> --------------------------------------------------------
>
>                 Key: OOZIE-2223
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2223
>             Project: Oozie
>          Issue Type: Improvement
>          Components: docs
>    Affects Versions: 3.3.2, 4.1.0, 4.0.1
>            Reporter: Ben Roling
>         Attachments: OOZIE-2223-1.patch, OOZIE-2223-2.patch
>
>
> My organization has been bitten by a mistake in the way we have written Java 
> action applications.  I would like to introduce a documentation change that 
> might reduce the likelihood that others new to Oozie make the same mistake.
> The mistake is not accounting for the possibility that launcher tasks will 
> fail due to reasons such as cluster maintenance.  We have a number of jobs 
> that take input and output paths as arguments.  Our code had been 
> specifically written such that if the output path already exists the job 
> fails to avoid inadvertently deleting an output that may have been consumed 
> by a downstream job.
> This has bitten us during cluster maintenance that requires TaskTracker 
> restarts.  During such an event any launcher running on a TaskTracker at the 
> time of the TaskTracker restart fails and is retried on another TaskTracker.  
> The new attempt of the launcher task fails due to the output directory 
> already existing.  This in turn fails the whole workflow.  Maintenance that 
> requires restarting all TaskTrackers can end up causing a lot of workflow 
> failures.
> The current documentation does hint at such issues via mention of the 
> “prepare” block, but I don’t think the explanation of this block is clear 
> enough for newcomers to understand its use.  Furthermore, I’m not sure the 
> prepare block is the best answer for how to handle the specific types of 
> issues I am referring to.  A “delete” action in a prepare block will delete 
> content regardless of state, which provides the possibility that a previously 
> completed good output could be deleted.  This can lead to issues such as 
> corrupted traceability when there is a need to trace an output back to the 
> inputs that produced it.
> I believe a more appropriate implementation to address the possibility of 
> launcher task failure is to write the action such that it uses a previous 
> complete output without deleting or reprocessing.  Only if it detects an 
> incomplete output does it delete the output and re-run the processing to 
> produce the output.  This protects from the possibility of accidental output 
> destruction.
> Furthermore, some types of actions spawn activity that runs asynchronously 
> outside the context of the launcher task itself.  In such cases the action 
> author must take care to clean up any stray activity spawned prior to the 
> failure of the initial launcher task to ensure it does not collide with 
> activity produced by the new attempt of the launcher.  In the case of my 
> organization, such activity includes child M/R jobs spawned from the Apache 
> Crunch pipelines we invoke from our Java actions.  Depending on the design of 
> the action, it can be required to find and kill such child jobs before 
> invoking the new pipeline and spawning new child jobs.
> I will attach a patch that demonstrates one possible documentation 
> improvement to shed light on these issues but I appreciate feedback and any 
> other ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to