[jira] [Commented] (KAFKA-4959) remove controller concurrent access to non-threadsafe NetworkClient, Selector, and SSLEngine

ASF GitHub Bot (JIRA) Mon, 27 Mar 2017 15:07:37 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944120#comment-15944120
 ]


ASF GitHub Bot commented on KAFKA-4959:
---------------------------------------

GitHub user onurkaraman opened a pull request:

    https://github.com/apache/kafka/pull/2746

    KAFKA-4959: remove controller concurrent access to non-threadsafe 
NetworkClient, Selector, and SSLEngine

    This brought down a cluster by causing continuous controller moves.
    
    ZkClient's ZkEventThread and a RequestSendThread can concurrently use 
objects that aren't thread-safe:
    * Selector
    * NetworkClient
    * SSLEngine (this was the big one for us. We turn on SSL for interbroker 
communication).
    
    As per the "Concurrency Notes" section from 
https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html:
    > two threads must not attempt to call the same method (either wrap() or 
unwrap()) concurrently
    
    SSLEngine.wrap gets called in:
    * SslTransportLayer.write
    * SslTransportLayer.handshake
    * SslTransportLayer.close
    
    It turns out that the ZkEventThread and RequestSendThread can concurrently 
call SSLEngine.wrap:
    * ZkEventThread calls SslTransportLayer.close from 
ControllerChannelManager.removeExistingBroker
    * RequestSendThread can call SslTransportLayer.write or 
SslTransportLayer.handshake from NetworkClient.poll
    
    Suppose the controller moves for whatever reason. The former controller 
could have had a RequestSendThread who was in the middle of sending out 
messages to the cluster while the ZkEventThread began executing 
KafkaController.onControllerResignation, which calls 
ControllerChannelManager.shutdown, which sequentially cleans up the 
controller-to-broker queue and connection for every broker in the cluster. This 
cleanup includes the call to ControllerChannelManager.removeExistingBroker as 
mentioned earlier, causing the concurrent call to SSLEngine.wrap. This 
concurrent call throws a BufferOverflowException which 
ControllerChannelManager.removeExistingBroker catches so the 
ControllerChannelManager.shutdown moves onto cleaning up the next 
controller-to-broker queue and connection, skipping the cleanup steps such as 
clearing the queue, stopping the RequestSendThread, and removing the entry from 
its brokerStateInfo map.
    
    By failing out of the Selector.close, the sensors corresponding to the 
broker connection has not been cleaned up. Any later attempt at initializing an 
identical Selector will result in a sensor collision and therefore cause 
Selector initialization to throw an exception. In other words, any later 
attempts by this broker to become controller again will fail on initialization. 
When controller initialization fails, the controller deletes the /controller 
znode and lets another broker take over.
    
    Now suppose the controller moves enough times such that every broker hits 
the BufferOverflowException concurrency issue. We're now guaranteed to fail 
controller initialization due to the sensor collision on every controller 
transition, so the controller will move across brokers continuously.
    
    This patch avoids the concurrent use of non-threadsafe classes in 
ControllerChannelManager.removeExistingBroker by shutting down the 
RequestSendThreaad before closing the NetworkClient.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/onurkaraman/kafka KAFKA-4959

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2746
    
----
commit 7c956de3a95fd8080a682a48d0900ca39dee19f3
Author: Onur Karaman <[email protected]>
Date:   2017-03-27T21:10:53Z

    KAFKA-4959: remove controller concurrent access to non-threadsafe 
NetworkClient, Selector, and SSLEngine
    
    This brought down a cluster by causing continuous controller moves.
    
    ZkClient's ZkEventThread and a RequestSendThread can concurrently use 
objects that aren't thread-safe:
    * Selector
    * NetworkClient
    * SSLEngine (this was the big one for us. We turn on SSL for interbroker 
communication).
    
    As per the "Concurrency Notes" section from 
https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html:
    two threads must not attempt to call the same method (either wrap() or 
unwrap()) concurrently
    
    SSLEngine.wrap gets called in:
    * SslTransportLayer.write
    * SslTransportLayer.handshake
    * SslTransportLayer.close
    
    It turns out that the ZkEventThread and RequestSendThread can concurrently 
call SSLEngine.wrap:
    * ZkEventThread calls SslTransportLayer.close from 
ControllerChannelManager.removeExistingBroker
    * RequestSendThread can call SslTransportLayer.write or 
SslTransportLayer.handshake from NetworkClient.poll
    
    Suppose the controller moves for whatever reason. The former controller 
could have had a RequestSendThread who was in the middle of sending out 
messages to the cluster while the ZkEventThread began executing 
KafkaController.onControllerResignation, which calls 
ControllerChannelManager.shutdown, which sequentially cleans up the 
controller-to-broker queue and connection for every broker in the cluster. This 
cleanup includes the call to ControllerChannelManager.removeExistingBroker as 
mentioned earlier, causing the concurrent call to SSLEngine.wrap. This 
concurrent call throws a BufferOverflowException which 
ControllerChannelManager.removeExistingBroker catches so the 
ControllerChannelManager.shutdown moves onto cleaning up the next 
controller-to-broker queue and connection, skipping the cleanup steps such as 
clearing the queue, stopping the RequestSendThread, and removing the entry from 
its brokerStateInfo map.
    
    By failing out of the Selector.close, the sensors corresponding to the 
broker connection has not been cleaned up. Any later attempt at initializing an 
identical Selector will result in a sensor collision and therefore cause 
Selector initialization to throw an exception. In other words, any later 
attempts by this broker to become controller again will fail on initialization. 
When controller initialization fails, the controller deletes the /controller 
znode and lets another broker take over.
    
    Now suppose the controller moves enough times such that every broker hits 
the BufferOverflowException concurrency issue. We're now guaranteed to fail 
controller initialization due to the sensor collision on every controller 
transition, so the controller will move across brokers continuously.
    
    This patch avoids the concurrent use of non-threadsafe classes in 
ControllerChannelManager.removeExistingBroker by shutting down the 
RequestSendThreaad before closing the NetworkClient.

----


> remove controller concurrent access to non-threadsafe NetworkClient, 
> Selector, and SSLEngine
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4959
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4959
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>
> This brought down a cluster by causing continuous controller moves.
> ZkClient's ZkEventThread and a RequestSendThread can concurrently use objects 
> that aren't thread-safe:
> * Selector
> * NetworkClient
> * SSLEngine (this was the big one for us. We turn on SSL for interbroker 
> communication).
> As per the "Concurrency Notes" section from the [SSLEngine 
> javadoc|https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html]:
> bq. two threads must not attempt to call the same method (either wrap() or 
> unwrap()) concurrently
> SSLEngine.wrap gets called in:
> * SslTransportLayer.write
> * SslTransportLayer.handshake
> * SslTransportLayer.close
> It turns out that the ZkEventThread and RequestSendThread can concurrently 
> call SSLEngine.wrap:
> * ZkEventThread calls SslTransportLayer.close from 
> ControllerChannelManager.removeExistingBroker
> * RequestSendThread can call SslTransportLayer.write or 
> SslTransportLayer.handshake from NetworkClient.poll
> Suppose the controller moves for whatever reason. The former controller could 
> have had a RequestSendThread who was in the middle of sending out messages to 
> the cluster while the ZkEventThread began executing 
> KafkaController.onControllerResignation, which calls 
> ControllerChannelManager.shutdown, which sequentially cleans up the 
> controller-to-broker queue and connection for every broker in the cluster. 
> This cleanup includes the call to 
> ControllerChannelManager.removeExistingBroker as mentioned earlier, causing 
> the concurrent call to SSLEngine.wrap. This concurrent call throws a 
> BufferOverflowException which ControllerChannelManager.removeExistingBroker 
> catches so the ControllerChannelManager.shutdown moves onto cleaning up the 
> next controller-to-broker queue and connection, skipping the cleanup steps 
> such as clearing the queue, stopping the RequestSendThread, and removing the 
> entry from its brokerStateInfo map.
> By failing out of the Selector.close, the sensors corresponding to the broker 
> connection has not been cleaned up. Any later attempt at initializing an 
> identical Selector will result in a sensor collision and therefore cause 
> Selector initialization to throw an exception. In other words, any later 
> attempts by this broker to become controller again will fail on 
> initialization. When controller initialization fails, the controller deletes 
> the /controller znode and lets another broker take over.
> Now suppose the controller moves enough times such that every broker hits the 
> BufferOverflowException concurrency issue. We're now guaranteed to fail 
> controller initialization due to the sensor collision on every controller 
> transition, so the controller will move across brokers continuously.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (KAFKA-4959) remove controller concurrent access to non-threadsafe NetworkClient, Selector, and SSLEngine

Reply via email to