[ https://issues.apache.org/jira/browse/SAMZA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shanthoosh Venkataraman resolved SAMZA-1979. -------------------------------------------- Resolution: Fixed > StreamPartitionCountMonitor bug. > -------------------------------- > > Key: SAMZA-1979 > URL: https://issues.apache.org/jira/browse/SAMZA-1979 > Project: Samza > Issue Type: Bug > Reporter: Shanthoosh Venkataraman > Assignee: Shanthoosh Venkataraman > Priority: Major > > Samza relies on a daemon thread named StreamPartitionCountMonitor which runs > in the JobCoordinator to determine any change in the partition count of input > streams. > > Here's the control flow of StreamPartitionCountMonitor: > * JobCoordinator initializes the StreamPartitionCountMonitor thread with the > partition count of the input streams. > * StreamPartitionCountMonitor periodically fetches the partition count of > the input streams and compares it against the initialized partition count. > * In case of stateful jobs, if there's a difference in partition count, then > it stops the JobCoordinator. > > However for stateful jobs, the above solution might not work in the following > scenarios: > A. In between the scheduling interval of the StreamPartitionCountMonitor, > let's say that there's a change in partition count of input stream. Before > the StreamPartitionCountMonitor's successive scheduled run, let's say the > JobCoordinator process is killed(due to nodemanager failure). Yarn will > reschedule the Jobcoordinator process to run in a different nodemanager > machine. When the StreamPartitionCountMonitor is run in a different > application attempt of Jobcoordinator, then it would have missed the signal > of input partition count change. > B. Let's say that when the input stream partition count changes, the user > stops the samza job. The daemon thread spawned in the subsequent run of the > samza job would not have the adequate state to determine that the partition > count of the input streams has changed. > > The general pattern here is that if there's a change in partition count of > input streams and JobCoordinator process is killed before the scheduled to > run then we would fail to detect the partition count change. -- This message was sent by Atlassian JIRA (v7.6.3#76005)