[ 
https://issues.apache.org/jira/browse/KAFKA-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15733842#comment-15733842
 ] 

ASF GitHub Bot commented on KAFKA-4509:
---------------------------------------

GitHub user mjsax opened a pull request:

    https://github.com/apache/kafka/pull/2233

    KAFKA-4509: Task reusage on rebalance fails for threads on same host

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mjsax/kafka kafka-4509-task-reusage-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2233
    
----
commit e93a96773b806096cc907a25e61a933bb4b8903e
Author: Matthias J. Sax <matth...@confluent.io>
Date:   2016-12-08T01:08:50Z

    KAFKA-4509: Task reusage on rebalance fails for threads on same host

----


> Task reusage on rebalance fails for threads on same host
> --------------------------------------------------------
>
>                 Key: KAFKA-4509
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4509
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Matthias J. Sax
>
> In https://issues.apache.org/jira/browse/KAFKA-3559 task reusage on rebalance 
> was introduced as a performance optimization. Instead of closing a task on 
> rebalance (i.e., in {{onPartitionsRevoked()}}), it only gets suspended for a 
> potential reuse in {{onPartitionsAssigned()}}. Only if a task cannot be 
> reused will it eventually get closed in {{onPartitionsAssigned()}}.
> This mechanism can fail if multiple {{StreamThreads}} run on the same host 
> (same or different JVM). The scenario is as follows:
>  - assume 2 running threads A and B
>  - assume 3 tasks t1, t2, t3
>  - assignment: A-(t1,t2) and B-(t3)
>  - on the same host, a new single-threaded Streams application (same app-id) 
> gets started (thread C)
>  - on rebalance, t2 (could also be t1 -- does not matter) will be moved from 
> A to C
>  - as assignment is only sticky based on a heuristic, t1 can sometimes be 
> assigned to B, too -- and t3 gets assigned to A (there is a race condition 
> that determines whether this "task flipping" happens or not)
>  - on revoke, A will suspend tasks t1 and t2 (not releasing any locks)
>  - on assign
>     - A tries to create t3 but as B did not release it yet, A dies with a 
> "cannot get lock" exception
>     - B tries to create t1 but as A did not release it yet, B dies with a 
> "cannot get lock" exception
>     - as A and B each try to create the new task before releasing their 
> suspended one, this will always fail if task flipping happened
>     - C tries to create t2 but A did not release the t2 lock yet (race 
> condition) and C dies with an exception (this could even happen without 
> "task flipping" between A and B)
> We want to fix this by:
>   # first releasing unassigned suspended tasks in {{onPartitionsAssigned()}}, 
> and creating new tasks afterward (this fixes the "task flipping" issue)
>   # using a "backoff and retry mechanism" if a task cannot be created (to 
> handle the release-create race condition between different threads)
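
The two proposed fixes can be sketched roughly as below. This is a toy model under stated assumptions, not the code from the pull request: the class `TaskReuseSketch`, the in-memory `LOCKS` map standing in for the per-task locks shared by threads on one host, and the method names `releaseUnassigned` and `createWithBackoff` are all hypothetical. The retry uses a fixed backoff, which is also an assumption; the issue only asks for "backoff and retry".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative model only; none of these names are Kafka Streams classes. */
public class TaskReuseSketch {

    /** Hypothetical stand-in for per-task locks on one host: task id -> owning thread. */
    public static final Map<String, String> LOCKS = new ConcurrentHashMap<>();

    /** Returns true if the thread now holds (or already held) the lock for the task. */
    public static boolean tryLock(String task, String thread) {
        String owner = LOCKS.putIfAbsent(task, thread);
        return owner == null || owner.equals(thread);
    }

    /** Releases the lock only if the given thread is the current owner. */
    public static void unlock(String task, String thread) {
        LOCKS.remove(task, thread);
    }

    /** Fix 1: close suspended tasks that were not re-assigned before creating new ones. */
    public static void releaseUnassigned(String thread, Set<String> suspended, Set<String> assigned) {
        for (String task : new ArrayList<>(suspended)) {
            if (!assigned.contains(task)) {
                unlock(task, thread);
                suspended.remove(task);
            }
        }
    }

    /** Fix 2: retry with a fixed backoff instead of dying on the first failed attempt. */
    public static boolean createWithBackoff(String thread, String task, int maxAttempts, long backoffMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock(task, thread)) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Before the rebalance: A-(t1,t2) and B-(t3); the tasks are suspended, locks still held.
        tryLock("t1", "A");
        tryLock("t2", "A");
        tryLock("t3", "B");
        Set<String> suspendedA = new HashSet<>(Arrays.asList("t1", "t2"));
        Set<String> suspendedB = new HashSet<>(Collections.singletonList("t3"));

        // New assignment with "task flipping": A-(t3), B-(t1), C-(t2).
        // Each thread first releases what it no longer owns, then creates its new tasks.
        releaseUnassigned("A", suspendedA, Collections.singleton("t3"));
        releaseUnassigned("B", suspendedB, Collections.singleton("t1"));

        System.out.println("A creates t3: " + createWithBackoff("A", "t3", 3, 10L));
        System.out.println("B creates t1: " + createWithBackoff("B", "t1", 3, 10L));
        System.out.println("C creates t2: " + createWithBackoff("C", "t2", 3, 10L));
    }
}
```

The `main` method runs the steps sequentially, so it only illustrates the ordering; with real threads the releases and creates interleave, which is the case the retry loop is meant to cover.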



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
