> On Aug. 11, 2015, 8:24 p.m., John Speidel wrote: > > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java, > > line 169 > > <https://reviews.apache.org/r/37356/diff/1/?file=1037718#file1037718line169> > > > > Just got done discussing with Bob on a call. > > > > This synchronized block should be moved up to line 153 and end at the > > bottom of the method. The existing availbleHosts synchronization block on > > line 195 should be removed due to the existence of the new availableHosts > > syncronization described above. > > > > The important thing here is that we synchronize on a common mutual > > exclusion monitor in both onHostRegistered and processRequest. With this > > patch it will be the availableHosts monitor. > > > > I agree with the other comments that this is dangerous any likely > > inefficient and also agree that we should change as little as possible > > initially while still fixing the issue. If this patch causes a significant > > performance degradation then we can then look at optimizing the locking in > > this class in another patch.
Version 2 of the patch addresses this issue. - Robert ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/37356/#review94991 ----------------------------------------------------------- On Aug. 12, 2015, 9:54 p.m., Robert Nettleton wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/37356/ > ----------------------------------------------------------- > > (Updated Aug. 12, 2015, 9:54 p.m.) > > > Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, > Robert Levas, and Sumit Mohanty. > > > Bugs: AMBARI-12720 > https://issues.apache.org/jira/browse/AMBARI-12720 > > > Repository: ambari > > > Description > ------- > > This patch provides a fix for AMBARI-12720. > > Blueprint Cluster deployments are failing under load, with clusters of 50+ > nodes. One or more hosts in the cluster fail to be registered, even though > the ambari-agent registration occurs properly for these hosts. This problem > seems to occur when certain hosts are registered at roughly the same time as > the call to TopologyManager.processRequest(). > > Upon further analysis, it appears that the > TopologyManager.outstandingRequests object is not fully sychronized. > > There are two paths involved in reading/writing to this data structure: > > 1. TopologyManager.processRequest() (occurs on the main request thread > handling the cluster creation request). When all known hosts have been > processed, but some requests are pending, the LogicalRequest instance is > added to TopologyManager.outstandingRequests. Note: The call to add this > request is not synchronized on the outstandingRequests object, but is only > synchronized on TopologyManager.availableHosts. > 2. TopologyManager.onHostRegistered() (occurs on a separate thread that > handles ambari-agent registrations for new hosts joining the cluster). This > method does synchronize on TopologyManager.outstandingRequests, and attempts > to iterate over these requests to determine if the newly registered host can > satisfy the request. > > The incorrect synchronization on TopologyManager.outstandingRequests appears > to be the root of the intermittent cluster deployment problems described in > AMBARI-12720. Since this datastructure is read/updated by both threads, and > since this object is not protected by the same locks, there appears to be a > race condition in this code. > > To fix this problem, this patch implements the following: > > 1. Updates the TopologyManager.onHostRegistered() method to lock on > --> availableHosts > --> outstandingRequests > prior to reading TopologyManager.outstandingRequests > > 2. Updates the TopologyManager.processRequest() method to lock on > --> availableHosts > --> outstandingRequests > prior to adding a LogicalRequest instance to this collection > > > The locking order (availableHosts, outstandingRequests) must be kept in place > in order to avoid any potential deadlocks from making a change like this. > > > Diffs > ----- > > > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java > 4f51d47 > > Diff: https://reviews.apache.org/r/37356/diff/ > > > Testing > ------- > > 1. Ran the ambari-server unit test suite ("mvn clean test"). All tests are > passing. > 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this > change does not break existing deployments. > 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make > sure this change does not break deployments of larger clusters. > > > Thanks, > > Robert Nettleton > >
