Re: First class support for node roles

Noble Paul Thu, 02 Dec 2021 15:52:20 -0800

Yes. Negative roles are not a bad idea. If I start a node for
machine learning purposes, I wouldn't want that node to ever participate in
overseer election.
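
To make the negative-role idea concrete, here is a minimal sketch of how an election could honor it. Everything in it is an assumption for illustration: the marker strings, the `pick_overseer` helper, and the alphabetical tie-break are hypothetical and are not part of the SIP or of Solr's actual election code.

```python
from typing import Dict, Optional

# Hypothetical role markers for this sketch; the thread does not fix a syntax.
PREFERRED = "overseer"    # node would like to host the overseer
DISALLOWED = "!overseer"  # negative role: node must never host the overseer


def pick_overseer(live_nodes: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick an overseer from live nodes (node name -> role marker or None).

    Preferred nodes win, untagged nodes are the fallback, and nodes tagged
    with the negative role are never chosen. Returns None when only
    disallowed nodes are left, i.e. the cluster has no overseer.
    """
    preferred = sorted(n for n, r in live_nodes.items() if r == PREFERRED)
    if preferred:
        return preferred[0]  # alphabetical tie-break, chosen only for determinism
    neutral = sorted(n for n, r in live_nodes.items() if r is None)
    if neutral:
        return neutral[0]
    return None  # only "!overseer" nodes are live


if __name__ == "__main__":
    print(pick_overseer({"ml1": DISALLOWED, "data1": None, "ovr1": PREFERRED}))
    print(pick_overseer({"ml1": DISALLOWED}))
```

With this shape, a machine learning node tagged with the negative role is skipped even when it is the last node standing, which matches the "let the cluster fail" behaviour described in the quoted message below.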


On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]> wrote:

> If we have non strict roles (like overseer), then it does make sense
> to have negative roles.
> That way I can define which are the two nodes that I'd prefer the
> overseer to run on, and a few other nodes on which it should
> definitely never run for various reasons. And in case these
> "!overseer" are the only nodes left in the cluster, let the cluster
> fail the same way it would if there were no data nodes available.
>
> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <[email protected]>
> wrote:
> >>>
> >>> With the Strict/Loose option and sensible defaults, users cannot trip
> themselves up by default, but the option is there for people to tinker and
> have an iron grip over their cluster.
> >>
> >>
> >> +1 to sensible defaults so users don't trip themselves. The option to
> tinker for tighter grip can be tackled later, either on a per role basis or
> as a generic concept later.
> >
> >
> > +1 - Can definitely be added later if we so desire, not needed for this
> SIP
> >
> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >>
> >>
> >>
> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]> wrote:
> >>>
> >>> I think the key is to let the roles have full control of the
> implications of having/not having that role. No need for even a
> strict/loose designation. The question of do you have the role is yes/no
> with no logic to guess if the role is implied or not. The question of will
> it come up with the role is "have_explicit ? use_explicit : use_defaults".
> >>>
> >>> Once you figure out who has a role (or not) what that means is up to
> the role code.
> >>>
> >>> Corollary: we don't have to change the way overseer works in this SIP.
> We can rework it or not as we see fit separately.
> >>
> >>
> >> +1
> >>
> >>>
> >>>
> >>> Only thing we need to do is find a wording that makes the above clear
> on first read through the SIP :)
> >>>
> >>> -Gus
> >>>
> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <[email protected]>
> wrote:
> >>>>>
> >>>>> This doesn't really address my concern around what happens if all of
> our existing OVERSEER candidates are down. When at least one of them is up,
> the overseer will go there, and that is good and expected. But what happens
> if all of the overseer-eligible nodes are down? Your comment, and the old
> system, would imply that the overseer election goes to some other
> unrelated, untagged node. I disagree with this implementation choice. This
> sounds like something role specific to determine, but I would like to see
> us be more strict about it. I don't want cores leaking out of my data
> roles, I don't want query processing to leak out of my "query" nodes or
> whatever. Overseer shouldn't be special in this regard.
> >>>>
> >>>>
> >>>> I'm very strongly in favor of not letting users design a system in
> which the cluster can be "live" without an overseer. I understand that the
> overseer can be taxing to the cluster, but honestly what is the point of
> having an untaxed cluster that doesn't have an overseer? I can see
> arguments for the other roles to be stricter about this, but there are also
> a lot of users who wouldn't want those to be strict either (like "query"
> nodes).
> >>>>
> >>>> Maybe we just put in stronger guarantees that if a non-overseer role
> node HAS to be selected to become overseer, it will try to migrate the
> overseer job to a node with the overseer role whenever one becomes live.
> >>>>
> >>>> So maybe we don't have special rules per role, but instead roles can
> either be defined as "Strict" or "Loose" (better names likely exist), and
> the roles come with a default (Overseer -> Loose, Data -> Strict, Query ->
> Loose, etc.). And it is up to each role to define how to behave when
> running in LOOSE mode when a non-role node is in use and a role node then
> comes online (like the overseer example given above).
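
The Strict/Loose proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Mode` enum, the `DEFAULT_MODES` table, and the `candidates` helper are hypothetical names, and only the three defaults (Overseer -> Loose, Data -> Strict, Query -> Loose) come from the message itself.

```python
from enum import Enum
from typing import Dict, List, Optional, Set


class Mode(Enum):
    STRICT = "strict"  # only nodes that carry the role may do the work
    LOOSE = "loose"    # fall back to any live node if no role node is live


# Defaults as listed in the message; the table itself is a hypothetical shape.
DEFAULT_MODES: Dict[str, Mode] = {
    "overseer": Mode.LOOSE,
    "data": Mode.STRICT,
    "query": Mode.LOOSE,
}


def candidates(
    role: str,
    node_roles: Dict[str, Set[str]],
    overrides: Optional[Dict[str, Mode]] = None,
) -> List[str]:
    """Return the live nodes eligible to serve `role`.

    node_roles maps each live node name to the set of roles it carries.
    overrides lets an operator change the mode of a role.
    """
    mode = {**DEFAULT_MODES, **(overrides or {})}[role]
    with_role = sorted(n for n, roles in node_roles.items() if role in roles)
    if with_role or mode is Mode.STRICT:
        return with_role
    return sorted(node_roles)  # LOOSE fallback: any live node will do


if __name__ == "__main__":
    live = {"n1": {"data"}, "n2": {"data"}}
    print(candidates("overseer", live))  # loose: falls back to data nodes
    print(candidates("overseer", live, {"overseer": Mode.STRICT}))  # strict: none
```

The migration step mentioned above (moving the overseer back once a role node comes online) is left out of the sketch; it would be role-specific logic layered on top of this lookup.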
> >>>>
> >>>> With the Strict/Loose option and sensible defaults, users cannot trip
> themselves up by default, but the option is there for people to tinker and
> have an iron grip over their cluster.
> >>>>
> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> wrote:
> >>>>>
> >>>>> Noble wrote:
> >>>>> > We are not modifying the way the "overseer role" works today. We
> are just changing the definition and standardizing the configuration &
> discoverability
> >>>>> Ishan wrote:
> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role
> (which currently stands for preferred overseer). We can take a stab at
> refactoring it later.
> >>>>>
> >>>>> Grouping these two comments together, since I think they are saying
> the same thing. I think this is part of my confusion. We have an old system
> that doesn't work the way we want the new system to work. There may be
> people already using the old system. What path do we offer for folks using
> the old system to migrate to the new system? What happens if somebody
> accidentally tries to use both systems at the same time?
> >>>>>
> >>>>> Ishan wrote:
> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are
> live, Solr guarantees that one of those nodes becomes the overseer.", I
> meant to somewhat capture the current behaviour as the OVERSEER role
> performs today. Do you see any inconsistency with this statement vs. what
> it does today?
> >>>>>
> >>>>> This doesn't really address my concern around what happens if all of
> our existing OVERSEER candidates are down. When at least one of them is up,
> the overseer will go there, and that is good and expected. But what happens
> if all of the overseer-eligible nodes are down? Your comment, and the old
> system, would imply that the overseer election goes to some other
> unrelated, untagged node. I disagree with this implementation choice. This
> sounds like something role specific to determine, but I would like to see
> us be more strict about it. I don't want cores leaking out of my data
> roles, I don't want query processing to leak out of my "query" nodes or
> whatever. Overseer shouldn't be special in this regard.
> >>>>>
> >>>>> Noble wrote:
> >>>>> > If we do that how do we know if xyz is a role or a node in the
> following request?
> >>>>>
> >>>>> You're absolutely correct, thanks for pointing this out. Let's leave
> it as is.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Replying to the top post in this thread because there has been a
> lot of discussion and I don't want to look like I'm continuing any of those
> particular threads.
> >>>>>>>
> >>>>>>> I finally had time to sit down and think about this with the
> attention it deserves and am generally happy with how the conversation has
> shaped the current proposal.
> >>>>>>>
> >>>>>>> GOOD: I think using system properties to define node roles is fine
> and I like that data is the default role when not defined. I think it is
> important to hold on to the guarantee that an active overseer will land on
> an overseer node role.
> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks
> using the current OVERSEER role. I am not sure that something can be done
> automatically since they need to now specify new properties at startup.
> Maybe we need to include loud warnings or support both approaches for a
> time?
> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes
> fail, then it is implied the overseer will go to one of the data nodes. The
> specific wording in the SIP - "When one or more such nodes are live, Solr
> guarantees that one of those nodes become the overseer." implies to me that
> failover could go from overseer1 to overseer2 to overseerN to random node.
> I feel like we need to have some recording that there were dedicated
> overseer nodes and stop the cascading failure instead of churning through
> our data nodes.
> >>>>>>>
> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of
> "coordinator" roles from a split query/indexing standpoint. I understand
> that these are used as examples, but would like stronger language that new
> roles should also go through their own SIP discussions.
> >>>>>>>
> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness in
> two different places now. We have the live nodes and we have the node roles
> stored in two different places in zookeeper and it feels like this would
> lead to race conditions or split brain or other hard to diagnose bugs when
> those two lists don't agree with each other. This also feels like it
> contradicts the "single source of truth" idea later stated in the proposal.
> I see Gus's arguments for decoupling these and am not strongly opposed, I
> just get a lurking feeling about it. Even if we don't do this, I would like
> this called out explicitly in the alternative approaches section as
> something that we considered and rejected, with details why.
> >>>>>>>
> >>>>>>> GOOD: The API looks pretty clear. I would like an additional call
> out here that all operations are GET because nodes cannot be changed at
> runtime.
> >>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER
> preference role?
> >>>>>>> CHANGE REQUEST: An additional API to get the list of available
> roles for a cluster. I _think_ this could be based on the version that the
> cluster is running? Would be useful to be able to interrogate a cluster in
> the future... we're seeing OOM issues on queries, can we add some query
> nodes? When were they introduced? I don't know what path this API should
> exist at.
> >>>>>>
> >>>>>>
> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP
> document. Not sure if there's a better path that we could go for.
> >>>>>>
> >>>>>>>
> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts
> are string literals and which parts are meant to be substituted by the
> operator? GET /api/cluster/roles/data would become GET
> /api/cluster/roles/${rolename} in our SIP/documentation.
> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1 should
> be GET /api/cluster/roles/${nodename} dropping the intermediate "nodes"
> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that
> intermediate "nodes" node.
> >>>>>>>
> >>>>>>> CLARIFICATION: Should listing roles require some permissions?
> Maybe this requirement is too fundamental to the operation of a cluster and
> everybody would have to be able to do it.
> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat
> roles? Implementation detail that the servers will figure out? Or strict
> guidance where the client needs to check where specific roles are before
> sending any further communication to the server?
> >>>>>>> CLARIFICATION: What happens when a node gets a request that it
> can't fulfil? An overseer node gets a query or an update. A data node gets
> a collection creation request. Do they forward it on to an appropriate
> node, or do they reject it? Should this be configurable? If not, then it
> seems like lazy or poorly configured clients will defeat this isolation
> system quite easily.
> >>>>>>>
> >>>>>>> GOOD: Testing the API is very important, yes.
> >>>>>>> CLARIFICATION: What does testing for how nodes behave when roles
> are added mean? I thought we established that they are not dynamic.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Mike
> >>>>>>>
> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Here's an SIP for introducing the concept of node roles:
> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
> >>>>>>>>
> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
> >>>>>>>>
> >>>>>>>> We also wish to add first class support for Query nodes that are
> used to process user queries by forwarding them to data nodes,
> merging/aggregating the results and presenting them to users. This concept
> exists as a first-class citizen in most other search engines. This is a chance for
> Solr to catch up.
> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Ishan / Noble / Hitesh
> >>>
> >>>
> >>>
> >>> --
> >>> http://www.needhamsoftware.com (work)
> >>> http://www.the111shift.com (play)
>
