Hi Tom,

This sounds like a bug. ApplicationRunner should return the correct status when the processor has shut down. We fixed a similar standalone bug recently; are you already using Samza 1.0? If this is reproducible / happens again, a thread dump + logs would also be very helpful for debugging and for verifying whether the issue is already fixed.
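In the meantime, here is a rough sketch of the kind of check I'd put behind the liveness probe. This assumes the Samza 1.0 ApplicationRunner / ApplicationStatus API (status() with no arguments, and ApplicationStatus.getStatusCode()); the SamzaLivenessCheck class itself is just illustrative, not something that exists in Samza:

// Rough sketch only -- assumes the Samza 1.0 ApplicationRunner /
// ApplicationStatus API. SamzaLivenessCheck is an illustrative name.
import org.apache.samza.job.ApplicationStatus;
import org.apache.samza.runtime.ApplicationRunner;

public class SamzaLivenessCheck {
  private final ApplicationRunner runner;

  public SamzaLivenessCheck(ApplicationRunner runner) {
    this.runner = runner;
  }

  // Healthy only while the runner reports Running; New, SuccessfulFinish and
  // UnsuccessfulFinish should all fail the probe so the Pod gets restarted.
  public boolean isHealthy() {
    ApplicationStatus status = runner.status();
    return status != null
        && status.getStatusCode() == ApplicationStatus.StatusCode.Running;
  }
}

The idea is that once the runner reports status correctly, anything other than Running means the processor is no longer doing work, so failing the probe on it is safe.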
Thanks,
Prateek

On Fri, Mar 22, 2019 at 7:23 AM Tom Davis <t...@recursivedream.com> wrote:
>
> Prateek Maheshwari <prateek...@gmail.com> writes:
>
> > Hi Tom,
> >
> > This would depend on what your k8s container orchestration logic looks
> > like. For example, in YARN, 'status' returns 'not running' after 'start'
> > until all the containers requested from the AM are 'running'. We also
> > leverage YARN to restart containers/job automatically on failures (within
> > some bounds). Additionally, we set up a monitoring alert that goes off if
> > the number of running containers stays lower than the number of expected
> > containers for extended periods of time (~ 5 minutes).
> >
> > Are you saying that you noticed that the LocalApplicationRunner status
> > returns 'running' even if its stream processor / SamzaContainer has
> > stopped processing?
>
> Yeah, this is what I mean. We have a health check for the overall
> ApplicationStatus but if the containers enter a failed state that
> doesn't result in a shut down of the runner itself. An example from last
> night: Kafka became unavailable at some point and Samza failed to write
> checkpoints for a while, ultimately leading to container failures. The
> last log line is:
>
> o.a.s.c.SamzaContainer - Shutdown is no-op since the container is already
> in state: FAILED
>
> This doesn't cause the Pod to be killed, though, so we just silently
> stop processing events. How do you determine the number of expected
> containers? Or are you speaking of containers in terms of YARN and not
> Samza processors?
>
> > - Prateek
> >
> > On Fri, Mar 15, 2019 at 7:26 AM Tom Davis <t...@recursivedream.com> wrote:
> >
> >> I'm using the LocalApplicationRunner and had added a liveness check
> >> around the `status` method. The app is running in Kubernetes so, in
> >> theory, it could be restarted if exceptions happened during processing.
> >> However, it seems that "container failure" is divorced from "app
> >> failure" because the app continues to run even after all the task
> >> containers have shut down. Is there a better way to check for
> >> application health? Is there a way to shut down the application if all
> >> containers have failed? Should I simply ensure exceptions never escape
> >> operators? Thanks!
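P.S. If you want the Pod to actually be restarted when that check fails, one way to wire it up is a tiny HTTP endpoint that a Kubernetes httpGet livenessProbe can hit. Rough sketch below using only the JDK's built-in HttpServer, again assuming the 1.0 ApplicationRunner API; the class name, port, and path are arbitrary choices, not anything Samza provides:

// Sketch: expose the runner status to a k8s httpGet liveness probe.
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import org.apache.samza.job.ApplicationStatus;
import org.apache.samza.runtime.ApplicationRunner;

public final class HealthEndpoint {

  public static void start(ApplicationRunner runner) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/healthz", exchange -> {
      ApplicationStatus status = runner.status();
      boolean healthy = status != null
          && status.getStatusCode() == ApplicationStatus.StatusCode.Running;
      byte[] body = String.valueOf(status).getBytes(StandardCharsets.UTF_8);
      // 200 keeps the Pod alive; 503 fails the liveness probe so k8s restarts it.
      exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
      exchange.getResponseBody().write(body);
      exchange.close();
    });
    server.start();
  }
}

Point the livenessProbe at /healthz on port 8080 and Kubernetes will kill and recreate the Pod after the configured number of failed probes.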