-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/#review95190
-----------------------------------------------------------

Ship it!


Ship It!

- John Speidel


On Aug. 12, 2015, 9:54 p.m., Robert Nettleton wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37356/
> -----------------------------------------------------------
> 
> (Updated Aug. 12, 2015, 9:54 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, 
> Robert Levas, and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-12720
>     https://issues.apache.org/jira/browse/AMBARI-12720
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> This patch provides a fix for AMBARI-12720. 
> 
> Blueprint Cluster deployments are failing under load, with clusters of 50+ 
> nodes.  One or more hosts in the cluster fail to be registered, even though 
> the ambari-agent registration occurs properly for these hosts.  This problem 
> seems to occur when certain hosts are registered at roughly the same time as 
> the call to TopologyManager.processRequest().  
> 
> Upon further analysis, it appears that the 
> TopologyManager.outstandingRequests object is not fully sychronized.  
> 
> There are two paths involved in reading/writing to this data structure:
> 
> 1. TopologyManager.processRequest() (occurs on the main request thread 
> handling the cluster creation request).  When all known hosts have been 
> processed, but some requests are pending, the LogicalRequest instance is 
> added to TopologyManager.outstandingRequests.  Note:  The call to add this 
> request is not synchronized on the outstandingRequests object, but is only 
> synchronized on TopologyManager.availableHosts. 
> 2. TopologyManager.onHostRegistered() (occurs on a separate thread that 
> handles ambari-agent registrations for new hosts joining the cluster).  This 
> method does synchronize on TopologyManager.outstandingRequests, and attempts 
> to iterate over these requests to determine if the newly registered host can 
> satisfy the request.  
> 
> The incorrect synchronization on TopologyManager.outstandingRequests appears 
> to be the root of the intermittent cluster deployment problems described in 
> AMBARI-12720.  Since this datastructure is read/updated by both threads, and 
> since this object is not protected by the same locks, there appears to be a 
> race condition in this code.  
> 
> To fix this problem, this patch implements the following:
> 
> 1. Updates the TopologyManager.onHostRegistered() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to reading TopologyManager.outstandingRequests
>    
> 2. Updates the TopologyManager.processRequest() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to adding a LogicalRequest instance to this collection
>    
> 
> The locking order (availableHosts, outstandingRequests) must be kept in place 
> in order to avoid any potential deadlocks from making a change like this.
> 
> 
> Diffs
> -----
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
>  4f51d47 
> 
> Diff: https://reviews.apache.org/r/37356/diff/
> 
> 
> Testing
> -------
> 
> 1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are 
> passing.
> 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this 
> change does not break existing deployments.
> 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make 
> sure this change does not break deployments of larger clusters.
> 
> 
> Thanks,
> 
> Robert Nettleton
> 
>

Reply via email to