Re: [DISCUSS] SEP-1: Semantics of ProcessorId in Samza

Prateek Maheshwari Thu, 23 Mar 2017 12:08:51 -0700

Hi Navina,

Thanks for SEP-1, looks pretty good to me. A few questions/comments:


Implementation/Interface related:
1. Do you have any examples of custom processor IDs? Wondering what
information/classes ProcessorIdGenerator would need to be able to generate
one.
2. The default "static" getProcessorIdGeneratorFromConfig should be on the
ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer
removing the fromConfig suffix from the method name and calling it create
instead of get? Not sure what the convention here is. @Xinyu, any
preferences?
3. +1 for removing the constructor parameter, but I think the
JobCoordinator should still only have a #getProcessorId method instead of
#getProcessorIdGenerator. Theoretically, a processor/container doesn't need
to generate multiple IDs, it only needs to know its own. @Jagadish, prefer
the more restrictive API unless you have a use case for the more general
one in mind.

Documentation/SEP related:
1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as
synonyms. Let's update to use ProcessorIdGenerator consistently.
2. Minor: 'processor.id' configuration: I'm assuming this still needs to be
unique for each processor in the job? If so, probably worth calling out in
the SEP or configuration docs. We can also document it as deprecated and a
candidate for removal in near future (maybe 0.14?).

Thanks,
Prateek


On Tue, Mar 21, 2017 at 2:34 PM, Navina Ramesh (Apache) <[email protected]>
wrote:

> Hi everyone,
> I have updated the SEP
> <https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> 1%3A+Semantics+of+ProcessorId+in+Samza>
> based
> on all the feedback. Feel free to comment.
>
> I will start the [vote] mail thread, if there are no further questions
> within the next 24 hours.
>
> Thanks!
> Navina
>
> On Tue, Mar 21, 2017 at 10:33 AM, Navina Ramesh (Apache) <
> [email protected]>
> wrote:
>
> > Hi Jagadish,
> > Thanks for the suggestion. You are right in that it should be the
> > responsibility of the JobCoordinator to assign identifiers.
> >
> > > 'm only wondering if this logic could instead reside inside the
> > Job Coordinator (which is internal to the StreamProcessor) instead of
> > relying on something external to it?
> >
> > I think this is a consequence of our initial StandaloneJobCoordinator,
> > which is pretty much a pass-through. I didn't see any usage for
> > getProcessorId() and was wondering why we put it in the JobCoordinator
> > interface. I think I should keep your design proposal from last year
> handy
> > :) Thanks for pitching in!
> >
> >
> > @All:
> > Yesterday, there was a discussion on naming of the configuration used in
> > this SEP - whether it should be within the "job" scope or "app" scope
> > (introduced by SAMZA-1041
> > <https://issues.apache.org/jira/browse/SAMZA-1041>).  Multi-stage
> feature
> > and fluent-api for Samza introduces the notion of "application". Since
> the
> > processorId generator config applies to all jobs within a Samza
> > application, we decided to add the config for generator under "app"
> scope.
> > Further details on config scope changes can be found in SAMZA-1120.
> > <https://issues.apache.org/jira/browse/SAMZA-1120>
> >
> > I will send out an update once I change the SEP based on yesterday's
> > meeting and Jagadish's idea.
> >
> > Thanks!
> > Navina
> >
> > On Mon, Mar 20, 2017 at 5:22 PM, Jagadish Venkatraman <
> > [email protected]> wrote:
> >
> >> Thanks for writing this SEP!
> >>
> >> Here's an alternate approach instead of taking the "String processorId"
> as
> >> a parameter in the constructor. In my view, the "processorId" could be
> >> generated by the StreamProcessor internally (instead of being generated
> >> up-stream and passed in). The Job Coordinator API could be as follows:
> >>
> >>
> >> public interface JobCoordinator {
> >>
> >>  ProcessorIdGenerator getProcessorIDGenerator();
> >>
> >> // could be String getProcessorID()
> >>
> >>  JobModel getJobModel();
> >>
> >> }
> >>
> >> public interface ProcessorIDGenerator {
> >>
> >>  String getProcessorID();
> >> }
> >>
> >>
> >> For instance, an Yarn job coordinator can merely parse the ID from
> config,
> >> and return it. A Zk backed implementation of the Job coordinator can
> agree
> >> on IDs using coordination leveraging Zk. One nice property with this
> >> approach is that it keeps all logic related to coordination, agreement
> on
> >> the Job Model, leader election (with potentially pluggable components
> for
> >> each) inside the JobCoordinator.
> >>
> >> To be clear, I'm all for pluggability for ID generation logic that this
> >> SEP
> >> advocates. I'm only wondering if this logic could instead reside inside
> >> the
> >> Job Coordinator (which is internal to the StreamProcessor) instead of
> >> relying on something external to it?
> >>
> >> Of course, there may be other considerations around the way the current
> >> code is structured that may prevent this. Let me know if you agree with
> >> this change.
> >>
> >> Thanks,
> >> Jag
> >>
> >>
> >> On Thu, Mar 16, 2017 at 5:21 PM, Navina Ramesh
> >> <[email protected]
> >> > wrote:
> >>
> >> > > I am working on the ApplicationRunner SEP right now. Will send out
> the
> >> > discussion email once I am done.
> >> >
> >> > Perfect! :)
> >> >
> >> > On Thu, Mar 16, 2017 at 5:13 PM, xinyu liu <[email protected]>
> >> wrote:
> >> >
> >> > > Right, the static factory is very simple as you said. It's pretty
> >> > > convenient for the client to use.
> >> > >
> >> > > I am working on the ApplicationRunner SEP right now. Will send out
> the
> >> > > discussion email once I am done.
> >> > >
> >> > > Thanks,
> >> > > Xinyu
> >> > >
> >> > > On Thu, Mar 16, 2017 at 4:50 PM, Navina Ramesh (Apache) <
> >> > [email protected]
> >> > > >
> >> > > wrote:
> >> > >
> >> > > > > One minor thing I found is that the name of the config is camel
> >> case
> >> > > > (*processor.idGenerator.class*). Seems Samza's practice is to use
> >> all
> >> > > > lower
> >> > > > case configs with "." delimiter. Do you think we should stick to
> >> this
> >> > > > convention?
> >> > > >
> >> > > > I am always torn between the "convention" we have and the better
> >> way of
> >> > > > doing things. But I don't have strong opinions about it. I can
> >> change
> >> > it.
> >> > > >
> >> > > > > One more suggestion is to have a static factory method in the
> >> > > > ProcessorIdGenerator (Like what we have in ApplicationRunner):
> >> > > >
> >> > > > I couldn't grasp these requirements from the ApplicationRunner
> >> design.
> >> > It
> >> > > > will be great if you can put it out in an SEP :)
> >> > > >
> >> > > > I can add the static factory method for it. Just to clarify, the
> >> static
> >> > > > method simply class loads the ProcessorIdGenerator ? It uses
> >> reflection
> >> > > to
> >> > > > create the instance ?
> >> > > >
> >> > > > Thanks!
> >> > > > Navina
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thu, Mar 16, 2017 at 4:31 PM, xinyu liu <[email protected]
> >
> >> > > wrote:
> >> > > >
> >> > > > > The proposal looks great to me! Changing the id type to string
> >> will
> >> > > make
> >> > > > > sure this can work with other types of cluster which doesn't
> >> support
> >> > > > > integer id. The interface and config provides a pluggable way to
> >> have
> >> > > > > different id generators for different use cases. One minor
> thing I
> >> > > found
> >> > > > is
> >> > > > > that the name of the config is camel case
> >> > > (*processor.idGenerator.class*
> >> > > > ).
> >> > > > > Seems Samza's practice is to use all lower case configs with "."
> >> > > > delimiter.
> >> > > > > Do you think we should stick to this convention?
> >> > > > >
> >> > > > > One more suggestion is to have a static factory method in
> >> > > > > the ProcessorIdGenerator (Like what we have in
> ApplicationRunner):
> >> > > > >
> >> > > > > static ProcessIdGenerator fromConfig(Config config) { ... }.
> >> > > > >
> >> > > > > With this, It will be more convenient for the ApplicationRunner
> to
> >> > > > > construct the generator. What do you think?
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Xinyu
> >> > > > >
> >> > > > > On Wed, Mar 15, 2017 at 10:59 PM, Navina Ramesh (Apache) <
> >> > > > > [email protected]>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi everyone,
> >> > > > > > I created a proposal for SAMZA-1126, which addresses the
> >> semantics
> >> > of
> >> > > > > > ProcessorId in Samza. For most purposes, ProcessorId is same
> as
> >> the
> >> > > > > logical
> >> > > > > > id that Samza assigns for each Yarn container. It is primarily
> >> used
> >> > > in
> >> > > > > > JobModel as a key for the corresponding ContainerModel and
> >> also, in
> >> > > > > > container-level metrics. We are expanding the applicability of
> >> > > > > processorId
> >> > > > > > to be beyond a fixed set of processors.
> >> > > > > >
> >> > > > > > Please review and comment on this SEP.
> >> > > > > >
> >> > > > > > For those who are not actively following the master branch,
> you
> >> may
> >> > > > have
> >> > > > > > more questions than others. Feel free to ask them here.
> >> > > > > >
> >> > > > > > @Xinyu: Since you are working on SAMZA-1067 and other related
> >> > > > integration
> >> > > > > > APIs, can you please add an SEP for SAMZA-1067 ? This will
> help
> >> > > others
> >> > > > > (adn
> >> > > > > > me as well) get on the same page with your design/code. Let me
> >> know
> >> > > if
> >> > > > > > SEP-1 will work per your design for ApplicationRunner.
> >> > > > > >
> >> > > > > > Thanks!
> >> > > > > > Navina
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Navina R.
> >> >
> >>
> >>
> >>
> >> --
> >> Jagadish V,
> >> Graduate Student,
> >> Department of Computer Science,
> >> Stanford University
> >>
> >
> >
>

Re: [DISCUSS] SEP-1: Semantics of ProcessorId in Samza

Reply via email to