[ https://issues.apache.org/jira/browse/SAMZA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shanthoosh Venkataraman resolved SAMZA-1979.
--------------------------------------------
    Resolution: Fixed

> StreamPartitionCountMonitor bug.
> --------------------------------
>
>                 Key: SAMZA-1979
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1979
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Shanthoosh Venkataraman
>            Assignee: Shanthoosh Venkataraman
>            Priority: Major
>
> Samza relies on a daemon thread named StreamPartitionCountMonitor, which
> runs in the JobCoordinator, to detect any change in the partition count of
> input streams.
>  
> Here's the control flow of StreamPartitionCountMonitor:
>  * JobCoordinator initializes the StreamPartitionCountMonitor thread with the 
> partition count of the input streams.
>  * StreamPartitionCountMonitor periodically fetches the partition count of 
> the input streams and compares it against the initialized partition count.
>  * In case of stateful jobs, if there's a difference in partition count, then 
> it stops the JobCoordinator. 
>  
> However, for stateful jobs, this approach can fail in the following
> scenarios:
> A. Suppose the partition count of an input stream changes between two
> scheduled runs of the StreamPartitionCountMonitor, and the JobCoordinator
> process is killed (due to a NodeManager failure) before the monitor's next
> scheduled run. YARN will reschedule the JobCoordinator process to run on a
> different NodeManager machine. The StreamPartitionCountMonitor started in
> the new application attempt of the JobCoordinator is initialized with the
> already changed partition count, so it misses the signal that the input
> partition count changed.
> B. Suppose the user stops the Samza job when the input stream partition
> count changes. The daemon thread spawned in the subsequent run of the Samza
> job would not have adequate state to determine that the partition count of
> the input streams has changed.
>  
> The general pattern here is that if the partition count of the input
> streams changes and the JobCoordinator process is killed before the
> monitor's next scheduled run, then we fail to detect the partition count
> change.
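
The control flow and scenario A described above can be illustrated with a small, self-contained Java sketch. This is not Samza's actual StreamPartitionCountMonitor: the class name PartitionMonitorSketch, the nested Monitor class, the stream name "input-stream", and the in-memory map standing in for the stream metadata are all assumptions made for illustration. The point it tries to show is that a monitor which only keeps its baseline in memory loses that baseline when the JobCoordinator is restarted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; names and structure are illustrative, not Samza's API.
final class PartitionMonitorSketch {

    /** Stand-in for the monitor: keeps its baseline only in memory. */
    static final class Monitor {
        private final Map<String, Integer> initialCounts;
        private final Supplier<Map<String, Integer>> metadataFetcher;

        Monitor(Supplier<Map<String, Integer>> metadataFetcher) {
            this.metadataFetcher = metadataFetcher;
            // Baseline is a snapshot of whatever the counts are at start-up.
            this.initialCounts = new HashMap<>(metadataFetcher.get());
        }

        /** One scheduled run: true if the counts differ from the baseline. */
        boolean runOnce() {
            return !metadataFetcher.get().equals(initialCounts);
        }
    }

    /** Count changes between runs and the same monitor survives to the next run. */
    static boolean simulateNoRestart() {
        Map<String, Integer> cluster = new HashMap<>();
        cluster.put("input-stream", 4);
        Monitor monitor = new Monitor(() -> cluster);
        monitor.runOnce();               // nothing has changed yet
        cluster.put("input-stream", 8);  // partition count changes
        return monitor.runOnce();        // next scheduled run sees the change
    }

    /** Scenario A: the coordinator is killed before the next scheduled run. */
    static boolean simulateRestartBeforeNextRun() {
        Map<String, Integer> cluster = new HashMap<>();
        cluster.put("input-stream", 4);
        Monitor firstAttempt = new Monitor(() -> cluster);
        firstAttempt.runOnce();          // nothing has changed yet
        cluster.put("input-stream", 8);  // partition count changes
        // The first attempt dies here; a new attempt builds a fresh monitor
        // whose baseline already contains the changed count.
        Monitor secondAttempt = new Monitor(() -> cluster);
        return secondAttempt.runOnce();
    }

    public static void main(String[] args) {
        System.out.println("no restart, change detected: " + simulateNoRestart());
        System.out.println("restart, change detected: " + simulateRestartBeforeNextRun());
    }
}
```

In this sketch the first simulation reports the change and the second does not, because the second monitor's baseline is taken after the change. Scenario B follows the same shape, with a user-initiated stop in place of the process being killed.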



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
