----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/37356/#review94948 -----------------------------------------------------------
ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java (line 169) <https://reviews.apache.org/r/37356/#comment149676> Isn't this just asking for a deadlock? All it takes is a change in a different method that causes the locks to be taken out of order. Why not just use a single lock to control access to this manager? The contention would only be there during automated blueprint deployment, so it won't affect user experience at all. - Jonathan Hurley On Aug. 11, 2015, 11:30 a.m., Robert Nettleton wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/37356/ > ----------------------------------------------------------- > > (Updated Aug. 11, 2015, 11:30 a.m.) > > > Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, > Robert Levas, and Sumit Mohanty. > > > Bugs: AMBARI-12720 > https://issues.apache.org/jira/browse/AMBARI-12720 > > > Repository: ambari > > > Description > ------- > > This patch provides a fix for AMBARI-12720. > > Blueprint Cluster deployments are failing under load, with clusters of 50+ > nodes. One or more hosts in the cluster fail to be registered, even though > the ambari-agent registration occurs properly for these hosts. This problem > seems to occur when certain hosts are registered at roughly the same time as > the call to TopologyManager.processRequest(). > > Upon further analysis, it appears that the > TopologyManager.outstandingRequests object is not fully sychronized. > > There are two paths involved in reading/writing to this data structure: > > 1. TopologyManager.processRequest() (occurs on the main request thread > handling the cluster creation request). When all known hosts have been > processed, but some requests are pending, the LogicalRequest instance is > added to TopologyManager.outstandingRequests. Note: The call to add this > request is not synchronized on the outstandingRequests object, but is only > synchronized on TopologyManager.availableHosts. > 2. TopologyManager.onHostRegistered() (occurs on a separate thread that > handles ambari-agent registrations for new hosts joining the cluster). This > method does synchronize on TopologyManager.outstandingRequests, and attempts > to iterate over these requests to determine if the newly registered host can > satisfy the request. > > The incorrect synchronization on TopologyManager.outstandingRequests appears > to be the root of the intermittent cluster deployment problems described in > AMBARI-12720. Since this datastructure is read/updated by both threads, and > since this object is not protected by the same locks, there appears to be a > race condition in this code. > > To fix this problem, this patch implements the following: > > 1. Updates the TopologyManager.onHostRegistered() method to lock on > --> availableHosts > --> outstandingRequests > prior to reading TopologyManager.outstandingRequests > > 2. Updates the TopologyManager.processRequest() method to lock on > --> availableHosts > --> outstandingRequests > prior to adding a LogicalRequest instance to this collection > > > The locking order (availableHosts, outstandingRequests) must be kept in place > in order to avoid any potential deadlocks from making a change like this. > > > Diffs > ----- > > > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java > 8e7fb7f > > Diff: https://reviews.apache.org/r/37356/diff/ > > > Testing > ------- > > 1. Ran the ambari-server unit test suite ("mvn clean test"). All tests are > passing. > 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this > change does not break existing deployments. > 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make > sure this change does not break deployments of larger clusters. > > > Thanks, > > Robert Nettleton > >
