-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/
-----------------------------------------------------------
(Updated Aug. 11, 2015, 3:30 p.m.)
Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, Robert
Levas, and Sumit Mohanty.
Bugs: AMBARI-12720
https://issues.apache.org/jira/browse/AMBARI-12720
Repository: ambari
Description
-------
This patch provides a fix for AMBARI-12720.
Blueprint Cluster deployments are failing under load, with clusters of 50+
nodes. One or more hosts in the cluster fail to be registered, even though the
ambari-agent registration occurs properly for these hosts. This problem seems
to occur when certain hosts are registered at roughly the same time as the call
to TopologyManager.processRequest().
Upon further analysis, it appears that the TopologyManager.outstandingRequests
object is not fully sychronized.
There are two paths involved in reading/writing to this data structure:
1. TopologyManager.processRequest() (occurs on the main request thread handling
the cluster creation request). When all known hosts have been processed, but
some requests are pending, the LogicalRequest instance is added to
TopologyManager.outstandingRequests. Note: The call to add this request is
not synchronized on the outstandingRequests object, but is only synchronized on
TopologyManager.availableHosts.
2. TopologyManager.onHostRegistered() (occurs on a separate thread that handles
ambari-agent registrations for new hosts joining the cluster). This method
does synchronize on TopologyManager.outstandingRequests, and attempts to
iterate over these requests to determine if the newly registered host can
satisfy the request.
The incorrect synchronization on TopologyManager.outstandingRequests appears to
be the root of the intermittent cluster deployment problems described in
AMBARI-12720. Since this datastructure is read/updated by both threads, and
since this object is not protected by the same locks, there appears to be a
race condition in this code.
To fix this problem, this patch implements the following:
1. Updates the TopologyManager.onHostRegistered() method to lock on
--> availableHosts
--> outstandingRequests
prior to reading TopologyManager.outstandingRequests
2. Updates the TopologyManager.processRequest() method to lock on
--> availableHosts
--> outstandingRequests
prior to adding a LogicalRequest instance to this collection
The locking order (availableHosts, outstandingRequests) must be kept in place
in order to avoid any potential deadlocks from making a change like this.
Diffs
-----
ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
8e7fb7f
Diff: https://reviews.apache.org/r/37356/diff/
Testing (updated)
-------
1. Ran the ambari-server unit test suite ("mvn clean test"). All tests are
passing.
2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this change
does not break existing deployments.
3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make
sure this change does not break deployments of larger clusters.
Thanks,
Robert Nettleton