[ https://issues.apache.org/jira/browse/OOZIE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shwetha G S updated OOZIE-2223: ------------------------------- Fix Version/s: (was: trunk) 4.2 > Improve documentation with regard to Java action retries > -------------------------------------------------------- > > Key: OOZIE-2223 > URL: https://issues.apache.org/jira/browse/OOZIE-2223 > Project: Oozie > Issue Type: Improvement > Components: docs > Affects Versions: 3.3.2, 4.1.0, 4.0.1 > Reporter: Ben Roling > Fix For: 4.2 > > Attachments: OOZIE-2223-2.patch > > > My organization has been bitten by a mistake in the way we have written Java > action applications. I would like to introduce a documentation change that > might reduce the likelihood that others new to Oozie make the same mistake. > The mistake is not accounting for the possibility that launcher tasks will > fail due to reasons such as cluster maintenance. We have a number of jobs > that take input and output paths as arguments. Our code had been > specifically written such that if the output path already exists the job > fails to avoid inadvertently deleting an output that may have been consumed > by a downstream job. > This has bitten us during cluster maintenance that requires TaskTracker > restarts. During such an event any launcher running on a TaskTracker at the > time of the TaskTracker restart fails and is retried on another TaskTracker. > The new attempt of the launcher task fails due to the output directory > already existing. This in turn fails the whole workflow. Maintenance that > requires restarting all TaskTrackers can end up causing a lot of workflow > failures. > The current documentation does hint at such issues via mention of the > “prepare” block, but I don’t think the explanation of this block is clear > enough for newcomers to understand its use. Furthermore, I’m not sure the > prepare block is the best answer for how to handle the specific types of > issues I am referring to. A “delete” action in a prepare block will delete > content regardless of state, which provides the possibility that a previously > completed good output could be deleted. This can lead to issues such as > corrupted traceability when there is a need to trace an output back to the > inputs that produced it. > I believe a more appropriate implementation to address the possibility of > launcher task failure is to write the action such that it uses a previous > complete output without deleting or reprocessing. Only if it detects an > incomplete output does it delete the output and re-run the processing to > produce the output. This protects from the possibility of accidental output > destruction. > Furthermore, some types of actions spawn activity that runs asynchronously > outside the context of the launcher task itself. In such cases the action > author must take care to clean up any stray activity spawned prior to the > failure of the initial launcher task to ensure it does not collide with > activity produced by the new attempt of the launcher. In the case of my > organization, such activity includes child M/R jobs spawned from the Apache > Crunch pipelines we invoke from our Java actions. Depending on the design of > the action, it can be required to find and kill such child jobs before > invoking the new pipeline and spawning new child jobs. > I will attach a patch that demonstrates one possible documentation > improvement to shed light on these issues but I appreciate feedback and any > other ideas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)