Abhishek,
Thanks for clarifying and updating the SEP.

Cheers!
Navina

On Wed, May 3, 2017 at 8:20 PM, Jagadish Venkatraman <jagadish1...@gmail.com
> wrote:

> Navina,
>
>
> >> The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> internal
> APIs and have a concrete implementations.
>
> More specifically, both of these are purely internal implementation classes
> (and have nothing to do with any pluggable public API that we expose)
>
> Best,
> Jagadish
>
> On Wed, May 3, 2017 at 7:34 PM, Abhishek Shivanna <abks...@gmail.com>
> wrote:
>
> > Hey Navina,
> >
> > Thank you for reviewing the SEP.
> >
> > > Are you planning on exposing this monitor class as a public api? What
> is
> > the significance of doing so?
> >
> > Sorry for the confusion of having implementation details under "public
> > interfaces".
> > The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> > internal APIs
> > and have a concrete implementations.
> >
> > > Is "Execution Container ID" the name of the environmental variable? I
> > don't
> > think environmental variables can contain whitespace??
> >
> > Again, confusion that stemmed from my initial draft. I have fixed the SEP
> > with the actual name in the implementation.
> >
> > > I think the first sentence corresponds to your design. The second one
> is
> > more of an implementation detail. You may want to split it up or just
> > discard one of them. I got confused reading them together because one
> talks
> > about adding to container and the other about the ContainerRunner.
> >
> > Fixed the SEP to make it more clear.
> >
> > Thanks,
> > Abhishek
> >
> >
> > On Wed, May 3, 2017 at 2:08 PM, Navina Ramesh (Apache) <
> nav...@apache.org>
> > wrote:
> >
> > > Hi Abhishek,
> > > I checked your latest proposal in SEP and it looks good to me.
> > >
> > > QQ:
> > > > A new ContainerHeartbeatMonitor class that accepts a
> > > ContainerHeartbeatClient (which has the business logic to make
> heartbeat
> > > checks on the JC endpoint) and a callback.
> > >
> > > Are you planning on exposing this monitor class as a public api? What
> is
> > > the significance of doing so?
> > >
> > > > set an environment variable with the "Execution Container ID" during
> > > container launch. This can be read from the container to make requests
> to
> > > the above endpoint.
> > >
> > > Is "Execution Container ID" the name of the environmental variable? I
> > don't
> > > think environmental variables can contain whitespace??
> > >
> > > > On the container side we start a new thread that periodically polls
> > this
> > > endpoint described above to check if the container is valid. If its
> not,
> > we
> > > shutdown the run loop and raise an error (so that the exit code is non
> 0
> > so
> > > that YARN reschedules the container)
> > > The plan is to setup a monitor in the LocalContainerRunner class that
> > > schedules a thread to check the above endpoint at regular intervals. On
> > > failure the thread modifies state on the LocalContainerRunner to denote
> > > that there was an error. This state is checked during exit in the
> > > LocalContainerRunner to exit with a non-zero code.
> > >
> > > I think the first sentence corresponds to your design. The second one
> is
> > > more of an implementation detail. You may want to split it up or just
> > > discard one of them. I got confused reading them together because one
> > talks
> > > about adding to container and the other about the ContainerRunner.
> > >
> > > Design looks pretty elegant and easily portable.
> > >
> > > Thanks!
> > > Navina
> > >
> > >
> > > On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <abks...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jagadish,
> > > >
> > > > Thank you for taking the time to review the design.
> > > > I agree with moving the heartbeat into the the LocalContainerRunner
> > > instead
> > > > of fitting it into the SamzaContainer. I will update the SEP with the
> > new
> > > > design changes.
> > > > Also agree with the changes to the configuration and choosing
> suitable
> > > > defaults should be good enough.
> > > >
> > > > Thanks,
> > > > Abhishek
> > > >
> > > >
> > > >
> > > > On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
> > > > jagadish1...@gmail.com> wrote:
> > > >
> > > > > Hi Abhishek,
> > > > >
> > > > > Heartbeat between the AM and container has been a long awaited
> Samza
> > > > > feature. It will go a long way in ensuring our reliability! +1 for
> > this
> > > > > SEP.
> > > > >
> > > > > *High level comments:*
> > > > >
> > > > > Currently, the only use-case for the heartbeat mechanism seems to
> be
> > > when
> > > > > running Samza on Yarn. IMHO, it makes sense to pull the heart beat
> > > logic
> > > > > into the *LocalContainerRunner* instead of baking it into the
> > > > > *SamzaContainer* class. Long term, we can re-visit this when we
> have
> > a
> > > > > pluggable liveness detection mechanism.
> > > > >
> > > > > I'm thinking of a flow like this:
> > > > >
> > > > > There is a separate component (or a thread) inside
> > LocalContainerRunner
> > > > > that periodically polls the coordinator, and determines if it
> should
> > > > > continue running. If the coordinator determines that the container
> > > should
> > > > > not run, the *LocalContainerRunner* cleanly shuts-down the
> container
> > > and
> > > > > the process exits with a non-zero exit status.
> > > > >
> > > > > The following nice properties fall out:
> > > > >
> > > > >    - We can remove the proposed config *job.container.validator.
> > > enabled.
> > > > *
> > > > >    - We can also remove the proposed *Killable* interface since
> > > > >    *SamzaContainer* (and runLoops) don't have to implement
> *Killable
> > *
> > > > >    anymore. The life-cycle is managed by the *LocalContainerRunner*
> > > that
> > > > >    started it.
> > > > >
> > > > > *On the proposed public interfaces:*
> > > > >
> > > > > *job.container.validator.enabled:  *I am not in favor of adding
> this
> > > as
> > > > a
> > > > > new public config. IIUC, When running Samza jobs on Yarn, we always
> > > want
> > > > > the validator/heartbeats to be enabled. OTOH, when running Samza
> jobs
> > > in
> > > > > standalone mode, we currently do not have a pluggable mechanism for
> > > > > heartbeat.
> > > > >
> > > > > *job.container.schedule.ms <http://job.container.schedule.ms>: *It
> > > does
> > > > > seem that we can pick a sensible default, and be done with it
> > (instead
> > > of
> > > > > adding a new config)? Is there a reason this needs to be
> > configurable?
> > > > >
> > > > > *On proposed Killable interface: *
> > > > >
> > > > > Not entirely sure we need this new "*Killable"* interface (esp.
> given
> > > > that
> > > > > there's currently only one implementation - *SamzaContainer*).
> > > > >
> > > > >    - The *LocalContainerRunner* can instead directly invoke
> shut-down
> > > on
> > > > >    the *SamzaContainer* when its heart-beat expires. The extra
> level
> > of
> > > > >    indirection (making *SamzaContainer* to implement *Killable*) is
> > > > >    probably unnecessary IMHO.
> > > > >
> > > > >
> > > > >    - Since, the *LocalContainerRunner* invokes *start/run* on the
> > > > >    *SamzaContainer*, it seems simpler also have it invoke
> *shutdown*
> > on
> > > > the
> > > > >    *SamzaContainer. *
> > > > >
> > > > > *Minor Comments:*
> > > > >
> > > > > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is
> to
> > > get
> > > > > requests from the Samza container periodically and respond back
> > weather
> > > > the
> > > > > container is in the Job Coordinator's current list of valid
> > containers.
> > > > >
> > > > > Wondering if it'd be slightly cleaner to rename this to
> > > > */heartBeatRequest*
> > > > > and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> > > > > *isContainerValid
> > > > > * and the definition of validity does seem slightly broad?
> > > > >
> > > > > Thanks again for taking the time to draft the SEP, and volunteering
> > to
> > > > > implement this. Nice work!
> > > > >
> > > > > Best,
> > > > > Jagadish
> > > > >
> > > > > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <
> > abks...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > In order to fix the issue of orphaned/leaky containers seen when
> > the
> > > > > > YARN Node Manager crashes, I have created a SEP discussing the
> > design
> > > > for
> > > > > > implementing a heartbeat between the containers and the job
> > > > coordinator:
> > > > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > > > > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > > > > > all+running+containers
> > > > > >
> > > > > > Please take a look and provide feedback. I would also really
> > > appreciate
> > > > > > help in designing a way to propagate the error up from
> > SamzaContainer
> > > > in
> > > > > > order to exit the container with a non-zero exit code.
> > > > > >
> > > > > > Thanks,
> > > > > > Abhishek
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jagadish V,
> > > > > Graduate Student,
> > > > > Department of Computer Science,
> > > > > Stanford University
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>



-- 
Navina R.

Reply via email to