Hi Navina,

1. Assuming the environment can put the processor ID in the config,
ProcessorIdGenerator#generateProcessorId(Config config) makes sense.
Passing all of Config is rather broad, but I don't think we have an
environment specific subset class for config yet, so should be OK.

2. I don't yet see the value of putting the class-loading helper default
methods in multiple public interfaces. Seems like they're (usually) going
to be used by the framework, are pretty simple to write, and could probably
be written as a common Util method if we find them repetitive. Maybe skip
this method for now and add this once we have some clarity on this new
pattern?

If we keep it, let's name it "ProcessorIdGenerator#fromConfig(Config
config)" to be consistent with "ApplicationRunner#fromConfig(Config
config)"?

3. I think we're in agreement that the Processor doesn't need to have
access to the ProcessorIdGenerator, only the JobCoordinator does. With
JobCoordinator only exposing #getProcessorId I think we're good.

+1 from me with these minor changes. Thanks for the proposal!

Best,
Prateek

On Thu, Mar 23, 2017 at 11:35 PM, Navina R <navi.trin...@gmail.com> wrote:

> Hi Prateek,
> > 1. Do you have any examples of custom processor IDs? Wondering what
> information/classes ProcessorIdGenerator would need to be able to generate
> one.
> Yeah. When I was trying to implement the proposal, I was wondering the
> same thing as well. However, it might end up being specific to the
> environment. In case of Yarn, I would expect the AppMaster to still specify
> the processorID as an environment variable. In case of Rain (which is a
> Linkedin specific deployment framework),it will probably be a combination
> of sliceId and instanceId. Given these variations, I am not sure what can
> be generic enough to encompass all these information. Maybe we can pass the
> config to the instance factory. But, let me know if you have other ideas.
>
>
>  > 2. The default "static" getProcessorIdGeneratorFromConfig should be on
> the ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer
> removing the fromConfig suffix from the method name and calling it create
> instead of get? Not sure what the convention here is.
> You are right. It should be in ProcessorIdGenerator. I can remove the
> fromConfig suffix. Should I just call it createInstance, similar to the
> Java apis ? I am not sure what the convention is either.
>
> > I think the JobCoordinator should still only have a #getProcessorId
> method instead of #getProcessorIdGenerator.
> JobCoordinator still has getProcessorId. If I move the
> getProcessorIdGenerator helper to ProcessorIdGenerator, it makes sense?
>
> > Theoretically, a processor/container doesn't need to generate multiple
> IDs, it only needs to know its own
> I believe the reasoning originated from SAMZA-881, where we identified the
> requirements for running Samza in distributed mode and the responsibilities
> of each component in a homogenous stream processing mode. Theoretically
> speaking, it is the job coordinator that will assign identifiers to its
> processor. Practically, since it is bound to the runtime environment, it
> seems appropriate for the job coordinator to generate the id. If you
> haven't read SAMZA-881, you should give it a read. If we go by the
> assumption that a leader processor (in simpler terms, a central authority)
> generates the JobModel, it needs to "know" the identifiers of all
> processors.
> An alternative model is be the one where the leader spawns the processors
> and also, "assigns" identifiers for all processors (for example, in Yarn
> today). The latter model is restrictive in that:
> 1. it expects the leader to know the number of processors required in the
> job
> 2. leader processor is different from other processors, thus, making a
> Samza job a more heterogenous set of processors
>
> Hope these points make sense. Jagadish can add more, and even correct me
> if I got anything wrong here :)
>
> > 1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as
> synonyms. Let's update to use ProcessorIdGenerator consistently.
> Will do
>
> > 2. Minor: 'processor.id' configuration: I'm assuming this still needs
> to be unique for each processor in the job? If so, probably worth calling
> out in the SEP or configuration docs. We can also document it as deprecated
> and a candidate for removal in near future (maybe 0.14?).
> Yes. That is still a requirement. I think I updated the document regarding
> deprecating and removing it. I will re-check.
>
> Thanks for your comments.
>
> Cheers!
> Navina
>
> On Thu, Mar 23, 2017 at 12:04 PM, Prateek Maheshwari <
> pmaheshw...@linkedin.com> wrote:
>
>> Hi Navina,
>>
>> Thanks for SEP-1, looks pretty good to me. A few questions/comments:
>>
>> Implementation/Interface related:
>> 1. Do you have any examples of custom processor IDs? Wondering what
>> information/classes ProcessorIdGenerator would need to be able to generate
>> one.
>> 2. The default "static" getProcessorIdGeneratorFromConfig should be on
>> the ProcessorIdGenerator interface, not the JobCoordinator. Also, prefer
>> removing the fromConfig suffix from the method name and calling it create
>> instead of get? Not sure what the convention here is. @Xinyu, any
>> preferences?
>> 3. +1 for removing the constructor parameter, but I think the
>> JobCoordinator should still only have a #getProcessorId method instead of
>> #getProcessorIdGenerator. Theoretically, a processor/container doesn't need
>> to generate multiple IDs, it only needs to know its own. @Jagadish, prefer
>> the more restrictive API unless you have a use case for the more general
>> one in mind.
>>
>> Documentation/SEP related:
>> 1. The SEP uses both ProcessorIdentifier and ProcessorIdGenerator as
>> synonyms. Let's update to use ProcessorIdGenerator consistently.
>> 2. Minor: 'processor.id' configuration: I'm assuming this still needs to
>> be unique for each processor in the job? If so, probably worth calling out
>> in the SEP or configuration docs. We can also document it as deprecated and
>> a candidate for removal in near future (maybe 0.14?).
>>
>> Thanks,
>> Prateek
>>
>>
>> On Tue, Mar 21, 2017 at 2:34 PM, Navina Ramesh (Apache) <
>> nav...@apache.org> wrote:
>>
>>> Hi everyone,
>>> I have updated the SEP
>>> <https://cwiki.apache.org/confluence/display/SAMZA/SEP-1%3A+
>>> Semantics+of+ProcessorId+in+Samza>
>>> based
>>> on all the feedback. Feel free to comment.
>>>
>>> I will start the [vote] mail thread, if there are no further questions
>>> within the next 24 hours.
>>>
>>> Thanks!
>>> Navina
>>>
>>> On Tue, Mar 21, 2017 at 10:33 AM, Navina Ramesh (Apache) <
>>> nav...@apache.org>
>>> wrote:
>>>
>>> > Hi Jagadish,
>>> > Thanks for the suggestion. You are right in that it should be the
>>> > responsibility of the JobCoordinator to assign identifiers.
>>> >
>>> > > 'm only wondering if this logic could instead reside inside the
>>> > Job Coordinator (which is internal to the StreamProcessor) instead of
>>> > relying on something external to it?
>>> >
>>> > I think this is a consequence of our initial StandaloneJobCoordinator,
>>> > which is pretty much a pass-through. I didn't see any usage for
>>> > getProcessorId() and was wondering why we put it in the JobCoordinator
>>> > interface. I think I should keep your design proposal from last year
>>> handy
>>> > :) Thanks for pitching in!
>>> >
>>> >
>>> > @All:
>>> > Yesterday, there was a discussion on naming of the configuration used
>>> in
>>> > this SEP - whether it should be within the "job" scope or "app" scope
>>> > (introduced by SAMZA-1041
>>> > <https://issues.apache.org/jira/browse/SAMZA-1041>).  Multi-stage
>>> feature
>>> > and fluent-api for Samza introduces the notion of "application". Since
>>> the
>>> > processorId generator config applies to all jobs within a Samza
>>> > application, we decided to add the config for generator under "app"
>>> scope.
>>> > Further details on config scope changes can be found in SAMZA-1120.
>>> > <https://issues.apache.org/jira/browse/SAMZA-1120>
>>> >
>>> > I will send out an update once I change the SEP based on yesterday's
>>> > meeting and Jagadish's idea.
>>> >
>>> > Thanks!
>>> > Navina
>>> >
>>> > On Mon, Mar 20, 2017 at 5:22 PM, Jagadish Venkatraman <
>>> > jagadish1...@gmail.com> wrote:
>>> >
>>> >> Thanks for writing this SEP!
>>> >>
>>> >> Here's an alternate approach instead of taking the "String
>>> processorId" as
>>> >> a parameter in the constructor. In my view, the "processorId" could be
>>> >> generated by the StreamProcessor internally (instead of being
>>> generated
>>> >> up-stream and passed in). The Job Coordinator API could be as follows:
>>> >>
>>> >>
>>> >> public interface JobCoordinator {
>>> >>
>>> >>  ProcessorIdGenerator getProcessorIDGenerator();
>>> >>
>>> >> // could be String getProcessorID()
>>> >>
>>> >>  JobModel getJobModel();
>>> >>
>>> >> }
>>> >>
>>> >> public interface ProcessorIDGenerator {
>>> >>
>>> >>  String getProcessorID();
>>> >> }
>>> >>
>>> >>
>>> >> For instance, an Yarn job coordinator can merely parse the ID from
>>> config,
>>> >> and return it. A Zk backed implementation of the Job coordinator can
>>> agree
>>> >> on IDs using coordination leveraging Zk. One nice property with this
>>> >> approach is that it keeps all logic related to coordination,
>>> agreement on
>>> >> the Job Model, leader election (with potentially pluggable components
>>> for
>>> >> each) inside the JobCoordinator.
>>> >>
>>> >> To be clear, I'm all for pluggability for ID generation logic that
>>> this
>>> >> SEP
>>> >> advocates. I'm only wondering if this logic could instead reside
>>> inside
>>> >> the
>>> >> Job Coordinator (which is internal to the StreamProcessor) instead of
>>> >> relying on something external to it?
>>> >>
>>> >> Of course, there may be other considerations around the way the
>>> current
>>> >> code is structured that may prevent this. Let me know if you agree
>>> with
>>> >> this change.
>>> >>
>>> >> Thanks,
>>> >> Jag
>>> >>
>>> >>
>>> >> On Thu, Mar 16, 2017 at 5:21 PM, Navina Ramesh
>>> >> <nram...@linkedin.com.invalid
>>> >> > wrote:
>>> >>
>>> >> > > I am working on the ApplicationRunner SEP right now. Will send
>>> out the
>>> >> > discussion email once I am done.
>>> >> >
>>> >> > Perfect! :)
>>> >> >
>>> >> > On Thu, Mar 16, 2017 at 5:13 PM, xinyu liu <xinyuliu...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> > > Right, the static factory is very simple as you said. It's pretty
>>> >> > > convenient for the client to use.
>>> >> > >
>>> >> > > I am working on the ApplicationRunner SEP right now. Will send
>>> out the
>>> >> > > discussion email once I am done.
>>> >> > >
>>> >> > > Thanks,
>>> >> > > Xinyu
>>> >> > >
>>> >> > > On Thu, Mar 16, 2017 at 4:50 PM, Navina Ramesh (Apache) <
>>> >> > nav...@apache.org
>>> >> > > >
>>> >> > > wrote:
>>> >> > >
>>> >> > > > > One minor thing I found is that the name of the config is
>>> camel
>>> >> case
>>> >> > > > (*processor.idGenerator.class*). Seems Samza's practice is to
>>> use
>>> >> all
>>> >> > > > lower
>>> >> > > > case configs with "." delimiter. Do you think we should stick to
>>> >> this
>>> >> > > > convention?
>>> >> > > >
>>> >> > > > I am always torn between the "convention" we have and the better
>>> >> way of
>>> >> > > > doing things. But I don't have strong opinions about it. I can
>>> >> change
>>> >> > it.
>>> >> > > >
>>> >> > > > > One more suggestion is to have a static factory method in the
>>> >> > > > ProcessorIdGenerator (Like what we have in ApplicationRunner):
>>> >> > > >
>>> >> > > > I couldn't grasp these requirements from the ApplicationRunner
>>> >> design.
>>> >> > It
>>> >> > > > will be great if you can put it out in an SEP :)
>>> >> > > >
>>> >> > > > I can add the static factory method for it. Just to clarify, the
>>> >> static
>>> >> > > > method simply class loads the ProcessorIdGenerator ? It uses
>>> >> reflection
>>> >> > > to
>>> >> > > > create the instance ?
>>> >> > > >
>>> >> > > > Thanks!
>>> >> > > > Navina
>>> >> > > >
>>> >> > > >
>>> >> > > >
>>> >> > > > On Thu, Mar 16, 2017 at 4:31 PM, xinyu liu <
>>> xinyuliu...@gmail.com>
>>> >> > > wrote:
>>> >> > > >
>>> >> > > > > The proposal looks great to me! Changing the id type to string
>>> >> will
>>> >> > > make
>>> >> > > > > sure this can work with other types of cluster which doesn't
>>> >> support
>>> >> > > > > integer id. The interface and config provides a pluggable way
>>> to
>>> >> have
>>> >> > > > > different id generators for different use cases. One minor
>>> thing I
>>> >> > > found
>>> >> > > > is
>>> >> > > > > that the name of the config is camel case
>>> >> > > (*processor.idGenerator.class*
>>> >> > > > ).
>>> >> > > > > Seems Samza's practice is to use all lower case configs with
>>> "."
>>> >> > > > delimiter.
>>> >> > > > > Do you think we should stick to this convention?
>>> >> > > > >
>>> >> > > > > One more suggestion is to have a static factory method in
>>> >> > > > > the ProcessorIdGenerator (Like what we have in
>>> ApplicationRunner):
>>> >> > > > >
>>> >> > > > > static ProcessIdGenerator fromConfig(Config config) { ... }.
>>> >> > > > >
>>> >> > > > > With this, It will be more convenient for the
>>> ApplicationRunner to
>>> >> > > > > construct the generator. What do you think?
>>> >> > > > >
>>> >> > > > > Thanks,
>>> >> > > > > Xinyu
>>> >> > > > >
>>> >> > > > > On Wed, Mar 15, 2017 at 10:59 PM, Navina Ramesh (Apache) <
>>> >> > > > > nav...@apache.org>
>>> >> > > > > wrote:
>>> >> > > > >
>>> >> > > > > > Hi everyone,
>>> >> > > > > > I created a proposal for SAMZA-1126, which addresses the
>>> >> semantics
>>> >> > of
>>> >> > > > > > ProcessorId in Samza. For most purposes, ProcessorId is
>>> same as
>>> >> the
>>> >> > > > > logical
>>> >> > > > > > id that Samza assigns for each Yarn container. It is
>>> primarily
>>> >> used
>>> >> > > in
>>> >> > > > > > JobModel as a key for the corresponding ContainerModel and
>>> >> also, in
>>> >> > > > > > container-level metrics. We are expanding the applicability
>>> of
>>> >> > > > > processorId
>>> >> > > > > > to be beyond a fixed set of processors.
>>> >> > > > > >
>>> >> > > > > > Please review and comment on this SEP.
>>> >> > > > > >
>>> >> > > > > > For those who are not actively following the master branch,
>>> you
>>> >> may
>>> >> > > > have
>>> >> > > > > > more questions than others. Feel free to ask them here.
>>> >> > > > > >
>>> >> > > > > > @Xinyu: Since you are working on SAMZA-1067 and other
>>> related
>>> >> > > > integration
>>> >> > > > > > APIs, can you please add an SEP for SAMZA-1067 ? This will
>>> help
>>> >> > > others
>>> >> > > > > (adn
>>> >> > > > > > me as well) get on the same page with your design/code. Let
>>> me
>>> >> know
>>> >> > > if
>>> >> > > > > > SEP-1 will work per your design for ApplicationRunner.
>>> >> > > > > >
>>> >> > > > > > Thanks!
>>> >> > > > > > Navina
>>> >> > > > > >
>>> >> > > > >
>>> >> > > >
>>> >> > >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Navina R.
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jagadish V,
>>> >> Graduate Student,
>>> >> Department of Computer Science,
>>> >> Stanford University
>>> >>
>>> >
>>> >
>>>
>>
>>
>

Reply via email to