Kevin W Monroe created BIGTOP-2836:
--------------------------------------

             Summary: charm metric collector race condition
                 Key: BIGTOP-2836
                 URL: https://issues.apache.org/jira/browse/BIGTOP-2836
             Project: Bigtop
          Issue Type: Bug
          Components: deployment
    Affects Versions: 1.2.0, 1.2.1
            Reporter: Kevin W Monroe
            Assignee: Kevin W Monroe
            Priority: Minor
             Fix For: 1.3.0


Although this was initially thought to be fixed in BIGTOP-2801, it seems the 
charm metric collector can still cause a failed deployment.  As a refresher, 
metrics give users the ability to see things like how many datanodes or 
zookeeper peers are deployed in an environment.

The first attempt at fixing this was to include a precondition before 
collecting metrics, for example, ensure the namenode is "ready" before running 
"hdfs getconf".

However, in this example, there can be a window between the charm telling the 
NN to start (at which point the "ready" state is set) and the NN actually 
being up, because the NN takes a while to format HDFS.  If the metric 
collector runs during this window, 'hdfs getconf' will fail, which means the 
metric hook fails, which means the deployment fails.

There are a variety of ways to mitigate this:

1. Don't set "ready" until the NN is all the way up.
2. Don't let a metric hook fail the entire deployment.
3. Alter the collector so it handles a failed 'hdfs getconf' gracefully.

#1: added to our todo, but will take more time to implement.
#2: opened an issue against the metric layer to see if this is possible.

This JIRA will focus on fixing the problem with option #3.
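
As a rough illustration of option #3 (not the actual charm code), the sketch 
below wraps the 'hdfs getconf' call so that a failure yields "no value" rather 
than an error.  The function name count_namenodes, the 30 second timeout, and 
the choice of the -namenodes flag are assumptions made for this example only; 
the real collector may run a different command.

```python
import subprocess


def count_namenodes(cmd=("hdfs", "getconf", "-namenodes")):
    """Return the number of namenodes reported by `cmd`, or None on failure.

    Hypothetical helper: a non-zero exit, a timeout, or a missing binary is
    treated as "metric not available yet" so the caller can skip the metric
    instead of failing the hook.
    """
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            timeout=30,   # assumed upper bound; tune for the environment
            check=True,   # raise CalledProcessError on a non-zero exit
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing executable, a non-zero exit, and a timeout.
        return None
    # The command prints whitespace-separated hostnames.
    return len(result.stdout.split())


if __name__ == "__main__":
    value = count_namenodes()
    if value is None:
        print("namenode metric unavailable; skipping this collection run")
    else:
        print("namenodes:", value)
```

With this shape, a collector run that lands while the NN is still formatting 
HDFS simply reports nothing and exits cleanly, and the next run can pick up 
the value once the NN is fully up.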



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
