[
https://issues.apache.org/jira/browse/HIVE-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034698#comment-14034698
]
Ivan Mitic commented on HIVE-7190:
----------------------------------
Thanks Thejas and Eugene for the review and commit!
> WebHCat launcher task failure can cause two concurrent user jobs to run
> ----------------------------------------------------------------------
>
> Key: HIVE-7190
> URL: https://issues.apache.org/jira/browse/HIVE-7190
> Project: Hive
> Issue Type: Bug
> Components: WebHCat
> Affects Versions: 0.13.0
> Reporter: Ivan Mitic
> Assignee: Ivan Mitic
> Fix For: 0.14.0
>
> Attachments: HIVE-7190.2.patch, HIVE-7190.3.patch, HIVE-7190.patch
>
>
> Templeton uses launcher jobs to launch the actual user jobs. Launcher jobs
> are 1-map (single-task) jobs that kick off the actual user job and monitor
> it until it finishes. Because the launcher runs as a task, it has a retry
> policy like any other MR task in case it fails (due to a task crash, a
> tasktracker/nodemanager crash, a machine-level outage, etc.). When the
> launcher task is retried, it launches the same user job again, *however*
> the user job from the previous attempt is already running. This means that
> two identical user jobs can end up running in parallel.
> With MRv2, both the MRAppMaster and the launcher task are subject to
> failure. If either of them fails, another instance of the user job will be
> launched in parallel.
> The above situation is already a bug.
> Going further to RM HA: on failover/restart, the RM kills all containers
> and restarts all applications. This means that if a customer had 10 jobs on
> the cluster (that is, 10 launcher jobs and 10 user jobs), all 20 jobs will
> be restarted on RM failover, and the launcher jobs will queue the user jobs
> again. There are two issues with this design:
> 1. Job outputs could *possibly* be corrupted (it would be useful to analyze
> this scenario further and confirm this statement).
> 2. Cluster resources are spent redundantly on duplicate jobs.
> To address the issue, at least on YARN (Hadoop 2.0) clusters, WebHCat should
> do the same thing Oozie does in this scenario: tag all of its child jobs
> with an id, and on task restart kill the tagged jobs before they are kicked
> off again, as sketched below.
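> A rough illustration of that approach (not the committed patch; the class
> and method names here are made up, and it assumes Hadoop 2.4+ application
> tags, the standard "mapreduce.job.tags" property, and the YarnClient API):
> {code:java}
> import java.io.IOException;
> import java.util.EnumSet;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.yarn.api.records.ApplicationReport;
> import org.apache.hadoop.yarn.api.records.YarnApplicationState;
> import org.apache.hadoop.yarn.client.api.YarnClient;
> import org.apache.hadoop.yarn.exceptions.YarnException;
>
> public class LauncherChildJobCleanup {
>
>   /** Tag the child job so YARN records the launcher's id on it. */
>   public static void tagChildJob(Configuration childConf, String launcherJobId) {
>     // "mapreduce.job.tags" becomes a YARN application tag on submission.
>     childConf.set("mapreduce.job.tags", launcherJobId);
>   }
>
>   /** Kill any not-yet-finished applications tagged with the launcher's id. */
>   public static void killTaggedChildJobs(Configuration conf, String launcherJobId)
>       throws IOException, YarnException {
>     YarnClient yarnClient = YarnClient.createYarnClient();
>     yarnClient.init(conf);
>     yarnClient.start();
>     try {
>       EnumSet<YarnApplicationState> liveStates = EnumSet.of(
>           YarnApplicationState.NEW, YarnApplicationState.NEW_SAVING,
>           YarnApplicationState.SUBMITTED, YarnApplicationState.ACCEPTED,
>           YarnApplicationState.RUNNING);
>       for (ApplicationReport app : yarnClient.getApplications(liveStates)) {
>         if (app.getApplicationTags().contains(launcherJobId)) {
>           // A previous launcher attempt already started this child job;
>           // kill it so the retried attempt does not run a duplicate copy.
>           yarnClient.killApplication(app.getApplicationId());
>         }
>       }
>     } finally {
>       yarnClient.stop();
>     }
>   }
> }
> {code}
> The launcher would call tagChildJob before submitting the user job, and
> killTaggedChildJobs at the start of every (re)attempt, so that at most one
> copy of the user job survives.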
--
This message was sent by Atlassian JIRA
(v6.2#6252)