[jira] [Created] (HIVE-7190) WebHCat launcher task failure can cause two concurrent user jobs to run

Ivan Mitic (JIRA) Fri, 06 Jun 2014 13:57:18 -0700

Ivan Mitic created HIVE-7190:
--------------------------------

             Summary: WebHCat launcher task failure can cause two concurrent 
user jobs to run
                 Key: HIVE-7190
                 URL: https://issues.apache.org/jira/browse/HIVE-7190
             Project: Hive
          Issue Type: Bug
          Components: WebHCat
            Reporter: Ivan Mitic



Templeton uses launcher jobs to launch the actual user jobs. Launcher jobs are 
1-map jobs (single-task jobs) which kick off the actual user job and monitor 
it until it finishes. Given that the launcher is a task, like any other MR 
task, it has a retry policy in case it fails (due to a task crash, 
tasktracker/nodemanager crash, machine-level outage, etc.). Further, when the 
launcher task is retried, it will launch the same user job again, *however* the 
user job started by the previous attempt is still running. This means that we 
can have two identical user jobs running in parallel. 

In the case of MRv2, there is an MRAppMaster and the launcher task, both of 
which are subject to failure. If either of the two fails, another instance of 
the user job will be launched in parallel. 

The above situation is already a bug.

Going further to RM HA: on failover/restart, the RM kills all containers and 
restarts all applications. This means that if our customer had 10 jobs on the 
cluster (that is, 10 launcher jobs and 10 user jobs), on RM failover all 20 
jobs will be restarted, and the launcher jobs will queue the user jobs again. 
There are two issues with this design:
1. Job outputs may *possibly* be corrupted (it would be useful to analyze this 
scenario further and confirm this statement).
2. Cluster resources are spent on redundant jobs.

To address the issue at least on YARN (Hadoop 2.0) clusters, WebHCat should do 
the same thing Oozie does in this scenario: tag all of its child jobs with an 
id, and on task restart kill the jobs carrying that id before they are kicked 
off again.
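
The proposed tag-and-kill approach can be illustrated with a small in-memory 
simulation. This is only a sketch under stated assumptions: FakeCluster, 
launcher_attempt, simulate, and the "launcher:<id>" tag format are hypothetical 
names made up for illustration, and none of this is WebHCat, Oozie, or YARN 
code. It only shows why killing jobs that carry the launcher's tag before 
resubmitting leaves a single running user job after a retry.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Job:
    job_id: int
    name: str
    tags: frozenset
    state: str = "RUNNING"


class FakeCluster:
    """In-memory stand-in for a resource manager (hypothetical, not a YARN API)."""

    def __init__(self):
        self._ids = count(1)
        self.jobs = []

    def submit(self, name, tags):
        # Every submitted job starts in the RUNNING state in this toy model.
        job = Job(next(self._ids), name, frozenset(tags))
        self.jobs.append(job)
        return job

    def running_with_tag(self, tag):
        return [j for j in self.jobs if j.state == "RUNNING" and tag in j.tags]

    def kill(self, job):
        job.state = "KILLED"


def launcher_attempt(cluster, launcher_id, user_job_name, kill_stale=True):
    """One attempt of a launcher task: optionally clean up, then launch."""
    tag = f"launcher:{launcher_id}"  # assumed tag format, for illustration only
    if kill_stale:
        # Kill any child job left behind by a previous attempt of this launcher.
        for stale in cluster.running_with_tag(tag):
            cluster.kill(stale)
    return cluster.submit(user_job_name, {tag})


def simulate(kill_stale):
    """Run a first attempt plus one retry; return the count of running user jobs."""
    cluster = FakeCluster()
    launcher_attempt(cluster, "job_0001", "user-query", kill_stale)  # first attempt
    launcher_attempt(cluster, "job_0001", "user-query", kill_stale)  # retry
    return len(cluster.running_with_tag("launcher:job_0001"))
```

Calling simulate(False) models the current behavior, where the retry leaves two 
identical user jobs running, while simulate(True) models the proposed behavior, 
where only the job from the latest attempt remains.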



--
This message was sent by Atlassian JIRA
(v6.2#6252)

