Re: Review Request 37356: Fix synchronization on TopologyManager's outstanding requests

Robert Nettleton Tue, 11 Aug 2015 12:02:28 -0700


> On Aug. 11, 2015, 5:46 p.m., Jonathan Hurley wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java,
> >  line 169
> > <https://reviews.apache.org/r/37356/diff/1/?file=1037718#file1037718line169>
> >
> >     Isn't this just asking for a deadlock? All it takes is a change in a 
> > different method that causes the locks to be taken out of order.
> >     
> >     Why not just use a single lock to control access to this manager? The 
> > contention would only be there during automated blueprint deployment, so it 
> > won't affect user experience at all.


Thanks for the review.  While using nested locks incorrectly can result in a 
deadlock, I believe the ordering of the two locks in question is correct.

That being said, it is certainly an issue that any future change in this class 
could result in a deadlock.

My goal with this patch was just to resolve a particular bug.  I am hestitant 
to introduce a new locking mechanism into the TopologyManager as a bug fix.  

It may be wiser in the future to move to a model where we synchronize access at 
the level of the TopologyManager, but that is not currently the case, and would 
likely involve a larger/riskier change than I'd like to make for this bug fix.  
There are at least 4 separate locks that are used in this code, sometimes in 
nested fashion, and I don't think this should be re-written in a bug fix, but 
likely tackled in a future release.  

If you'd like, I can file a JIRA to track an investigation on this issue in 
Ambari 2.2.  Sound ok?


- Robert


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/#review94948
-----------------------------------------------------------


On Aug. 11, 2015, 3:30 p.m., Robert Nettleton wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37356/
> -----------------------------------------------------------
> 
> (Updated Aug. 11, 2015, 3:30 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, 
> Robert Levas, and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-12720
>     https://issues.apache.org/jira/browse/AMBARI-12720
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> This patch provides a fix for AMBARI-12720. 
> 
> Blueprint Cluster deployments are failing under load, with clusters of 50+ 
> nodes.  One or more hosts in the cluster fail to be registered, even though 
> the ambari-agent registration occurs properly for these hosts.  This problem 
> seems to occur when certain hosts are registered at roughly the same time as 
> the call to TopologyManager.processRequest().  
> 
> Upon further analysis, it appears that the 
> TopologyManager.outstandingRequests object is not fully sychronized.  
> 
> There are two paths involved in reading/writing to this data structure:
> 
> 1. TopologyManager.processRequest() (occurs on the main request thread 
> handling the cluster creation request).  When all known hosts have been 
> processed, but some requests are pending, the LogicalRequest instance is 
> added to TopologyManager.outstandingRequests.  Note:  The call to add this 
> request is not synchronized on the outstandingRequests object, but is only 
> synchronized on TopologyManager.availableHosts. 
> 2. TopologyManager.onHostRegistered() (occurs on a separate thread that 
> handles ambari-agent registrations for new hosts joining the cluster).  This 
> method does synchronize on TopologyManager.outstandingRequests, and attempts 
> to iterate over these requests to determine if the newly registered host can 
> satisfy the request.  
> 
> The incorrect synchronization on TopologyManager.outstandingRequests appears 
> to be the root of the intermittent cluster deployment problems described in 
> AMBARI-12720.  Since this datastructure is read/updated by both threads, and 
> since this object is not protected by the same locks, there appears to be a 
> race condition in this code.  
> 
> To fix this problem, this patch implements the following:
> 
> 1. Updates the TopologyManager.onHostRegistered() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to reading TopologyManager.outstandingRequests
>    
> 2. Updates the TopologyManager.processRequest() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to adding a LogicalRequest instance to this collection
>    
> 
> The locking order (availableHosts, outstandingRequests) must be kept in place 
> in order to avoid any potential deadlocks from making a change like this.
> 
> 
> Diffs
> -----
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
>  8e7fb7f 
> 
> Diff: https://reviews.apache.org/r/37356/diff/
> 
> 
> Testing
> -------
> 
> 1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are 
> passing.
> 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this 
> change does not break existing deployments.
> 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make 
> sure this change does not break deployments of larger clusters.
> 
> 
> Thanks,
> 
> Robert Nettleton
> 
>

Re: Review Request 37356: Fix synchronization on TopologyManager's outstanding requests

Reply via email to