[ https://issues.apache.org/jira/browse/SAMZA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Maes resolved SAMZA-1508.
------------------------------
    Resolution: Fixed

> JobRunner should not return success until the job is healthy
> ------------------------------------------------------------
>
>                 Key: SAMZA-1508
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1508
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>            Priority: Major
>             Fix For: 0.15.0
>
>
> It can be frustrating for users when run-app.sh returns success before the
> job is fully running.
> This happens because the JobRunner currently waits for JobStatus=RUNNING, but
> in YARN, for example, that status is reported when the AM is launched, not
> when all the containers are launched.
> What can go wrong?
> 1. The job could stay stuck waiting for containers that it can't get because
> of capacity issues or an outage.
> 2. The job containers may fail immediately due to a runtime error.
> In both cases, the user may go on their merry way because run-app.sh returned
> successfully, even though the job is already dead. They may not get alerted
> for some time.
> How do we fix it?
> There are a few ways to fix it, each one progressively harder but
> progressively better:
> 1. Make the JobRunner reach out to the AM and monitor the needed-containers
> metric until it reaches 0.
> 2. Expose a new healthy endpoint in the AM which is only set to true when a
> heartbeat has been received from each of the containers. Have the JobRunner
> wait on this (with a timeout).
> 3. Expose a hook where users can write custom logic to determine job health.
> I think #1 is the most bang for the buck, and the implementation for #1 can
> easily be extended for #2 later.
> Other notes:
> I don't think this is needed for standalone, since users are directly
> deploying the processors and can monitor the processes directly.
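
Option 1 in the issue description (poll the AM's needed-containers metric until it reaches 0) can be sketched roughly as below. This is only an illustration of the polling idea, not the code that was committed for SAMZA-1508. The class name HealthWaiter, the method waitUntilHealthy, and the IntSupplier standing in for "ask the AM how many containers are still needed" are all hypothetical. The timeout is borrowed from option 2, so that a job stuck waiting for capacity is eventually reported as unhealthy.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not part of Samza's API. */
final class HealthWaiter {
    private HealthWaiter() {}

    /**
     * Polls neededContainers until it reports 0 or the timeout elapses.
     *
     * @param neededContainers assumed stand-in for querying the AM's
     *                         needed-containers metric
     * @param timeout          how long to wait before giving up
     * @param pollInterval     pause between two polls
     * @return true if the metric reached 0 in time, false on timeout or
     *         if the waiting thread was interrupted
     */
    static boolean waitUntilHealthy(IntSupplier neededContainers,
                                    Duration timeout,
                                    Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            // No containers outstanding: treat the job as healthy.
            if (neededContainers.getAsInt() <= 0) {
                return true;
            }
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // still waiting for containers at the deadline
            }
            // Never sleep past the deadline.
            long sleepNanos = Math.min(remaining, pollInterval.toNanos());
            try {
                Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

A caller in the spirit of run-app.sh would return success only when waitUntilHealthy returns true, and a non-zero exit status otherwise. Note that this check alone does not cover containers that start and then fail immediately, unless the metric rises again when they die; the heartbeat-based endpoint from option 2 is aimed at that case.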