> On Aug. 11, 2015, 5:46 p.m., Jonathan Hurley wrote: > > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java, > > line 169 > > <https://reviews.apache.org/r/37356/diff/1/?file=1037718#file1037718line169> > > > > Isn't this just asking for a deadlock? All it takes is a change in a > > different method that causes the locks to be taken out of order. > > > > Why not just use a single lock to control access to this manager? The > > contention would only be there during automated blueprint deployment, so it > > won't affect user experience at all.
Thanks for the review. While using nested locks incorrectly can result in a deadlock, I believe the ordering of the two locks in question is correct. That being said, it is certainly an issue that any future change in this class could result in a deadlock. My goal with this patch was just to resolve a particular bug. I am hestitant to introduce a new locking mechanism into the TopologyManager as a bug fix. It may be wiser in the future to move to a model where we synchronize access at the level of the TopologyManager, but that is not currently the case, and would likely involve a larger/riskier change than I'd like to make for this bug fix. There are at least 4 separate locks that are used in this code, sometimes in nested fashion, and I don't think this should be re-written in a bug fix, but likely tackled in a future release. If you'd like, I can file a JIRA to track an investigation on this issue in Ambari 2.2. Sound ok? - Robert ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/37356/#review94948 ----------------------------------------------------------- On Aug. 11, 2015, 3:30 p.m., Robert Nettleton wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/37356/ > ----------------------------------------------------------- > > (Updated Aug. 11, 2015, 3:30 p.m.) > > > Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, > Robert Levas, and Sumit Mohanty. > > > Bugs: AMBARI-12720 > https://issues.apache.org/jira/browse/AMBARI-12720 > > > Repository: ambari > > > Description > ------- > > This patch provides a fix for AMBARI-12720. > > Blueprint Cluster deployments are failing under load, with clusters of 50+ > nodes. One or more hosts in the cluster fail to be registered, even though > the ambari-agent registration occurs properly for these hosts. This problem > seems to occur when certain hosts are registered at roughly the same time as > the call to TopologyManager.processRequest(). > > Upon further analysis, it appears that the > TopologyManager.outstandingRequests object is not fully sychronized. > > There are two paths involved in reading/writing to this data structure: > > 1. TopologyManager.processRequest() (occurs on the main request thread > handling the cluster creation request). When all known hosts have been > processed, but some requests are pending, the LogicalRequest instance is > added to TopologyManager.outstandingRequests. Note: The call to add this > request is not synchronized on the outstandingRequests object, but is only > synchronized on TopologyManager.availableHosts. > 2. TopologyManager.onHostRegistered() (occurs on a separate thread that > handles ambari-agent registrations for new hosts joining the cluster). This > method does synchronize on TopologyManager.outstandingRequests, and attempts > to iterate over these requests to determine if the newly registered host can > satisfy the request. > > The incorrect synchronization on TopologyManager.outstandingRequests appears > to be the root of the intermittent cluster deployment problems described in > AMBARI-12720. Since this datastructure is read/updated by both threads, and > since this object is not protected by the same locks, there appears to be a > race condition in this code. > > To fix this problem, this patch implements the following: > > 1. Updates the TopologyManager.onHostRegistered() method to lock on > --> availableHosts > --> outstandingRequests > prior to reading TopologyManager.outstandingRequests > > 2. Updates the TopologyManager.processRequest() method to lock on > --> availableHosts > --> outstandingRequests > prior to adding a LogicalRequest instance to this collection > > > The locking order (availableHosts, outstandingRequests) must be kept in place > in order to avoid any potential deadlocks from making a change like this. > > > Diffs > ----- > > > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java > 8e7fb7f > > Diff: https://reviews.apache.org/r/37356/diff/ > > > Testing > ------- > > 1. Ran the ambari-server unit test suite ("mvn clean test"). All tests are > passing. > 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this > change does not break existing deployments. > 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make > sure this change does not break deployments of larger clusters. > > > Thanks, > > Robert Nettleton > >
