Re: First class support for node roles

Mike Drob Sun, 05 Dec 2021 17:44:58 -0800

Ilan,

Can you provide a more detailed concrete example? I’m having a lot of
trouble understanding what you are proposing, beyond that it is somehow
at odds with what Ishan/Noble suggest.


Apologies for my failure to understand.

Thanks,
Mike

On Sun, Dec 5, 2021 at 5:21 PM Ilan Ginzburg <[email protected]> wrote:

> If we go with optional role params, we need two defaults:
> 1. the param value to use when the role is specified without a parameter,
> and
> 2. the param value to use for the role on a node for which the role is
> not specified at all.
>
> I don't know how to sensibly name these defaults, but the actual
> values would be:
> overseer: default1=preferred, default2=allowed
> data: default1=on, default2=on
> coordinator: default1=on, default2=off
>
> If we do not allow specifying a role without a parameter, then
> default1 does not exist and the example Noble posted earlier covers
> us. But simple roles will be easier to use without parameters (and the
> transition from existing overseer role would be trivial).
>
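A minimal sketch of the two defaults described above, assuming a comma-separated spec in which a role may appear bare (`overseer`) or with a parameter (`data:off`). The class name, map names, and `resolve` method are hypothetical and do not come from Solr or the SIP; only the role names and values are taken from the message.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Solr.
class OptionalRoleParams {
    // "default1": value used when the role is listed without a parameter.
    static final Map<String, String> WHEN_LISTED_BARE = Map.of(
            "overseer", "preferred",
            "data", "on",
            "coordinator", "on");

    // "default2": value used when the role is not listed at all.
    static final Map<String, String> WHEN_ABSENT = Map.of(
            "overseer", "allowed",
            "data", "on",
            "coordinator", "off");

    // Resolves a spec such as "overseer,data:off" to one value per known role.
    static Map<String, String> resolve(String spec) {
        // Start from default2 for every role, sorted by name for stable output.
        Map<String, String> resolved = new TreeMap<>(WHEN_ABSENT);
        if (spec == null || spec.isBlank()) {
            return resolved;
        }
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split(":", 2);
            String role = parts[0].trim();
            if (!WHEN_ABSENT.containsKey(role)) {
                throw new IllegalArgumentException("Unknown role: " + role);
            }
            // An explicit parameter wins; a bare role falls back to default1.
            String value = parts.length == 2 ? parts[1].trim() : WHEN_LISTED_BARE.get(role);
            resolved.put(role, value);
        }
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(resolve(""));                     // {coordinator=off, data=on, overseer=allowed}
        System.out.println(resolve("overseer"));             // {coordinator=off, data=on, overseer=preferred}
        System.out.println(resolve("coordinator,data:off")); // {coordinator=on, data=off, overseer=allowed}
    }
}
```

With this shape, a bare `overseer` behaves like the existing overseer role, while a node started with no roles at all still takes data and may become overseer.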
> On Sun, Dec 5, 2021 at 7:17 AM Ishan Chattopadhyaya
> <[email protected]> wrote:
> >
> > I'm +1 on this. It "looks" complicated at first, but simplifies all
> headaches going forward.
> >
> > On Sun, Dec 5, 2021 at 11:46 AM Noble Paul <[email protected]> wrote:
> >>
> >> I shall update the SIP proposal if we have a consensus on this
> configuration
> >>
> >> On Sun, Dec 5, 2021 at 4:58 PM Noble Paul <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> On Sun, Dec 5, 2021 at 4:47 PM Gus Heck <[email protected]> wrote:
> >>>>
> >>>> I like this in that it's an example of how the overseer might be
> extended without creating a new role :)
> >>>>
> >>>> Not entirely sure if I'm for or against an enum implementation here,
> but it makes me a bit nervous. Enums with complexity can quickly get into
> difficulty for unit tests (especially if one wanted to write a mock object
> based test, something I think we maybe should use a bit more than we do).
> >>>>
> >>>>
> >>>>
> >>>> I would tend to think of a class to represent and collect role
> related functionality, one that perhaps has methods that receive the
> request, or other key objects and thus could be tested without standing up
> an entire server. (Not against also having them exercised in a few
> integrated tests, but the more we can avoid interleaving logic directly
> within DispatchFilter and HttpSolrCall etc. the better.)
> >>>>
> >>>>
> >>>> So I guess I'm somewhat biased against any enum with more than a
> couple properties, and definitely don't want to wind up hanging lots of
> methods off of one. Better to use them to consume a configuration value and
> then instantiate a class that really holds the logic and data. I like them
> for constraining values and easy string value conversion but the more they
> look like classes the more I'd rather have a class.
> >>>
> >>>
> >>>  I just meant it is a set of values. Please let us not discuss the
> actual impl here. We should stick to discussing the high-level design here,
> and specifics should be dealt with in a PR.
> >>>>
> >>>>
> >>>> -Gus
> >>>>
> >>>> On Sat, Dec 4, 2021 at 10:37 PM Noble Paul <[email protected]>
> wrote:
> >>>>>
> >>>>> I recommend the following format for the role spec
> >>>>>
> >>>>> roles=<role-name>:<role-value>
> >>>>>
> >>>>> each role will have an enum of allowed values and a default value
> >>>>>
> >>>>> role name: data
> >>>>>
> >>>>> values: [on, off]
> >>>>> default: on
> >>>>>
> >>>>> role name: overseer
> >>>>>
> >>>>> values: [allowed, disallowed, preferred]
> >>>>> default : allowed
> >>>>>
> >>>>> role name: coordinator
> >>>>>
> >>>>> values : [on, off]
> >>>>> default: off
> >>>>>
> >>>>>
> >>>>> examples
> >>>>> roles=data:on,overseer:allowed (This is redundant because it uses
> all the default values. If a node is started without any roles value this
> is the default behavior)
> >>>>> roles=data:off,overseer:preferred ( do not allow data, join overseer
> election at head)
> >>>>> roles=coordinator:on,data:on (role as coordinator, but allow data,
> it's same as roles=coordinator:on)
> >>>>> roles=coordinator:on,data:off (role as coordinator, disallow data)
> >>>>>
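The `roles=<role-name>:<role-value>` format above could be parsed along these lines. This is a sketch under stated assumptions, not the SIP's implementation: the class and method names are hypothetical, the input is taken to be the text after `roles=`, and the default for `data` is assumed to be `on`, as the first example in the message implies.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Solr.
class RoleSpecParser {
    // Allowed values per role, as listed in the message.
    static final Map<String, List<String>> ALLOWED = Map.of(
            "data", List.of("on", "off"),
            "overseer", List.of("allowed", "disallowed", "preferred"),
            "coordinator", List.of("on", "off"));

    // Value applied when a role is not mentioned (data=on is an assumption).
    static final Map<String, String> DEFAULTS = Map.of(
            "data", "on",
            "overseer", "allowed",
            "coordinator", "off");

    // Parses e.g. "data:off,overseer:preferred" and fills in defaults.
    static Map<String, String> parse(String roles) {
        Map<String, String> result = new TreeMap<>(DEFAULTS);
        if (roles == null || roles.isBlank()) {
            return result;
        }
        for (String pair : roles.split(",")) {
            String[] kv = pair.trim().split(":", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Expected <role-name>:<role-value>, got: " + pair);
            }
            List<String> allowed = ALLOWED.get(kv[0]);
            if (allowed == null) {
                throw new IllegalArgumentException("Unknown role: " + kv[0]);
            }
            if (!allowed.contains(kv[1])) {
                throw new IllegalArgumentException("Value " + kv[1] + " not allowed for role " + kv[0]);
            }
            result.put(kv[0], kv[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("data:off,overseer:preferred")); // {coordinator=off, data=off, overseer=preferred}
        System.out.println(parse("coordinator:on"));              // {coordinator=on, data=on, overseer=allowed}
    }
}
```

Under these assumptions `coordinator:on` and `coordinator:on,data:on` resolve to the same map, matching the third example in the message.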
> >>>>>
> >>>>> On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <[email protected]>
> wrote:
> >>>>>>
> >>>>>> If we go with no negative node roles and overseer node role is not
> strict (i.e. it’s a "preferred overseer"), then one would need to define a
> second node role "no_overseer" to explicitly exclude a node from ever
> becoming overseer (which I think is a useful feature until we switch the
> cluster default to not using the overseer), plus the implementation of
> these two node roles will obviously be coupled (and what if a node has both
> defined?).
> >>>>>>
> >>>>>> I prefer strict node roles.
> >>>>>> Maybe we could have node roles with [optional] parameters to let
> the node role implementation decide ?
> >>>>>> The overseer node role for example could have one of 3 values
> defined for each node: “preferred” (default, equivalent to the existing
> overseer role), "accepted" (equivalent to currently not defining the
> overseer role) and "no_way" (does not exist today).
> >>>>>>
> >>>>>> This could be useful in other contexts. A node role “data” could be
> “fast” or “slow” depending on type of local persistent storage for example…
> >>>>>>
> >>>>>> Ilan
> >>>>>>
> >>>>>> On Fri 3 Dec 2021 at 16:10, Gus Heck <[email protected]> wrote:
> >>>>>>>
> >>>>>>> I really don't think we should have types of roles. Not
> negative/positive and not strict/non-strict. You have a role or you don't.
> What that means is up to the code implementing the role.
> >>>>>>>
> >>>>>>> Roles should be free to configure a preference order (binary, or
> n-ary or whatever, strict or loose), prohibit behavior, or enable behavior.
> In this SIP I feel we should focus on How to identify what node has what
> role, How to designate what roles a node has via config/params, and the
> API's for interacting with roles.
> >>>>>>>
> >>>>>>> We should for example be able to support roles such as
> >>>>>>>
> >>>>>>> PREFERRED_OVERSEER
> >>>>>>> DATA
> >>>>>>> NO_ROUTED_ALIAS  (just an example, not something I mean to suggest)
> >>>>>>>
> >>>>>>> Details about role implementation should probably be discussed in
> a thread about that role.  Obviously we should think about the name
> carefully to leave options open should we want to enhance things later so
> maybe
> >>>>>>>
> >>>>>>> OVERSEER_PREF  or just  OVERSEER
> >>>>>>>
> >>>>>>> would be better since it merely reads that the node implements
> some sort of preference or config regarding overseer... but all this can be
> decided on a per role basis
> >>>>>>>
> >>>>>>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <[email protected]>
> wrote:
> >>>>>>>>
> >>>>>>>> Negative roles have a place
> >>>>>>>>
> >>>>>>>> Example is overseer
> >>>>>>>>
> >>>>>>>> There are 3 possible choices for that role
> >>>>>>>>
> >>>>>>>> a) preferred: always be in front of the election queue
> >>>>>>>> b) on: not preferred, but can be an overseer if no preferred
> overseer nodes are available
> >>>>>>>> c) off: never become an overseer
> >>>>>>>>
> >>>>>>>> Today we only have options 'a' and 'b'. In a future ticket, we
> may implement 'c'.
> >>>>>>>>
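A minimal sketch of how the three choices above might rank live nodes when picking an overseer. It only illustrates the ordering (preferred first, then on, never off); the enum, class, and `pick` method are hypothetical, and the real election runs through a ZooKeeper election queue rather than a method like this.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Solr.
class OverseerCandidateSketch {
    // Declared in priority order, so a lower ordinal wins.
    enum OverseerPref { PREFERRED, ON, OFF }

    // Returns the node that should become overseer, or empty if none may.
    static Optional<String> pick(Map<String, OverseerPref> liveNodes) {
        String best = null;
        OverseerPref bestPref = null;
        // Iterate in node-name order so ties break deterministically.
        for (Map.Entry<String, OverseerPref> e : new TreeMap<>(liveNodes).entrySet()) {
            OverseerPref pref = e.getValue();
            if (pref == OverseerPref.OFF) {
                continue; // c) off: never become an overseer
            }
            if (bestPref == null || pref.compareTo(bestPref) < 0) {
                best = e.getKey();
                bestPref = pref;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(pick(Map.of(
                "node1", OverseerPref.ON,
                "node2", OverseerPref.PREFERRED,
                "node3", OverseerPref.OFF))); // Optional[node2]
        System.out.println(pick(Map.of("node3", OverseerPref.OFF))); // Optional.empty
    }
}
```

When only `off` nodes remain, the sketch returns an empty result, leaving the cluster without an overseer instead of falling back to an excluded node.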
> >>>>>>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Negative roles add a lot of complexity, I would really want to
> stay away from them. That’s why I want strict roles up front. It’s maybe ok
> to push this decision out, but it also seems like the sort of thing we
> should consider at the start.
> >>>>>>>>>
> >>>>>>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <[email protected]>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Yes. Negative roles are not a bad idea. If I start a node for
> machine learning purposes, I wouldn't want that node to ever participate in
> overseer election
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]>
> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> If we have non strict roles (like overseer), then it does make
> sense
> >>>>>>>>>>> to have negative roles.
> >>>>>>>>>>> That way I can define which are the two nodes that I'd prefer
> the
> >>>>>>>>>>> overseer to run on, and a few other nodes on which it should
> >>>>>>>>>>> definitely never run for various reasons. And in case these
> >>>>>>>>>>> "!overseer" are the only nodes left in the cluster, let the
> cluster
> >>>>>>>>>>> fail the same way it would if there were no data nodes
> available.
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <
> [email protected]> wrote:
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> With the Strict/Loose option and sensible defaults, users
> cannot trip themselves up by default, but the option is there for people to
> tinker and have an iron grip over their cluster.
> >>>>>>>>>>> >>
> >>>>>>>>>>> >>
> >>>>>>>>>>> >> +1 to sensible defaults so users don't trip themselves. The
> option to tinker for tighter grip can be tackled later, either on a per
> role basis or as a generic concept later.
> >>>>>>>>>>> >
> >>>>>>>>>>> >
> >>>>>>>>>>> > +1 - Can definitely be added later if we so desire, not
> needed for this SIP
> >>>>>>>>>>> >
> >>>>>>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >>>>>>>>>>> >>
> >>>>>>>>>>> >>
> >>>>>>>>>>> >>
> >>>>>>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]>
> wrote:
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> I think the key  is to let the roles have full control of
> the implications of having/not having that role. No need for even a
> strict/loose designation. The question of do you have the role is yes/no
> with no logic to guess if the role is implied or not. The question of will
> it come up with the role is "have_explicit ? use_explicit : use_defaults".
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> Once you figure out who has a role (or not) what that
> means is up to the role code.
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> Corollary: we don't have to change the way overseer works
> in this SIP. We can rework it or not as we see fit separately.
> >>>>>>>>>>> >>
> >>>>>>>>>>> >>
> >>>>>>>>>>> >> +1
> >>>>>>>>>>> >>
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> Only thing we need to do is find a wording that makes the
> above clear on first read through the SIP :)
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> -Gus
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <
> [email protected]> wrote:
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> This doesn't really address my concern around what
> happens if all of our existing OVERSEER candidates are down. When at least
> one of them is up, the overseer will go there, and that is good and
> expected. But what happens if all of the overseer eligible nodes are down?
> Your comment, and the old system, would imply that the overseer election
> goes to some other unrelated, untagged node. I disagree with this
> implementation choice. This sounds like something role specific to
> determine, but I would like to see us be more strict about it. I don't want
> cores leaking out of my data roles, I don't want query processing to leak
> out of my "query" nodes or whatever. Overseer shouldn't be special in this
> regard.
> >>>>>>>>>>> >>>>
> >>>>>>>>>>> >>>>
> >>>>>>>>>>> >>>> I'm very strongly in favor of not letting users design a
> system in which the cluster can be "live" without an overseer. I understand
> that the overseer can be taxing to the cluster, but honestly what is the
> point of having an untaxed cluster that doesn't have an overseer? I can see
> arguments for the other roles to be stricter about this, but there are also
> a lot of users who wouldn't want those to be strict either (like "query"
> nodes).
> >>>>>>>>>>> >>>>
> >>>>>>>>>>> >>>> Maybe we just put in stronger guarantees that if a
> non-overseer role node HAS to be selected to become overseer, it will try
> to migrate the overseer job to a node with the overseer role whenever one
> becomes live.
> >>>>>>>>>>> >>>>
> >>>>>>>>>>> >>>> So maybe we don't have special rules per role, but
> instead roles can either be defined as "Strict" or "Loose" (better names
> likely exist), and the roles come with a default (Overseer -> Loose, Data
> -> Strict, Query -> Loose, etc.). And it is up to each role to define how
> to behave when running in LOOSE mode and a non-role node is used then a
> role node comes online (like the overseer example given above).
> >>>>>>>>>>> >>>>
> >>>>>>>>>>> >>>> With the Strict/Loose option and sensible defaults, users
> cannot trip themselves up by default, but the option is there for people to
> tinker and have an iron grip over their cluster.
> >>>>>>>>>>> >>>>
> >>>>>>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]>
> wrote:
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> Noble wrote:
> >>>>>>>>>>> >>>>> > We are not modifying the way the "overseer role" works
> today. We are just changing the definition and standardizing the
> configuration & discoverability
> >>>>>>>>>>> >>>>> Ishan wrote:
> >>>>>>>>>>> >>>>> > As of this SIP, we're not planning to modify the
> OVERSEER role (which currently stands for preferred overseer). We can take
> a stab at refactoring it later.
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> Grouping these two comments together, since I think they
> are saying the same thing. I think this is part of my confusion. We have an
> old system that doesn't work the way we want the new system to work. There
> may be people already using the old system. What path do we offer for folks
> using the old system to migrate to the new system? What happens if somebody
> accidentally tries to use both systems at the same time?
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> Ishan wrote:
> >>>>>>>>>>> >>>>> > When I wrote "When one or more such nodes [with
> OVERSEER role] are live, Solr guarantees that one of those nodes becomes
> the overseer.", I meant to somewhat capture the current behaviour as the
> OVERSEER role performs today. Do you see any inconsistency with this
> statement vs. what it does today?
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> This doesn't really address my concern around what
> happens if all of our existing OVERSEER candidates are down. When at least
> one of them is up, the overseer will go there, and that is good and
> expected. But what happens if all of the overseer eligible nodes are down?
> Your comment, and the old system, would imply that the overseer election
> goes to some other unrelated, untagged node. I disagree with this
> implementation choice. This sounds like something role specific to
> determine, but I would like to see us be more strict about it. I don't want
> cores leaking out of my data roles, I don't want query processing to leak
> out of my "query" nodes or whatever. Overseer shouldn't be special in this
> regard.
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> Noble wrote:
> >>>>>>>>>>> >>>>> > If we do that how do we know if xyz is a role or a
> node in the following request?
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> You're absolutely correct, thanks for pointing this out.
> Let's leave it as is.
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>>
> >>>>>>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >>>>>>>>>>> >>>>>>
> >>>>>>>>>>> >>>>>>
> >>>>>>>>>>> >>>>>>
> >>>>>>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <
> [email protected]> wrote:
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> Replying to the top post in this thread because there
> has been a lot of discussion and I don't want to look like I'm continuing
> any of those particular threads.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> I finally had time to sit down and think about this
> with the attention it deserves and am generally happy with how the
> conversation has shaped the current proposal.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> GOOD: I think using system properties to define node
> roles is fine and I like that data is the default role when not defined. I
> think it is important to hold on to the guarantee that an active overseer
> will land on an overseer node role.
> >>>>>>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path
> for folks using the current OVERSEER role. I am not sure that something can
> be done automatically since they need to now specify new properties at
> startup. Maybe we need to include loud warnings or support both approaches
> for a time?
> >>>>>>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the
> overseer nodes fail, then it is implied the overseer will go to one of the
> data nodes. The specific wording in the SIP - "When one or more such nodes
> are live, Solr guarantees that one of those nodes become the overseer."
> implies to me that failover could go from overseer1 to overseer2 to
> overseerN to random node. I feel like we need to have some recording that
> there were dedicated overseer nodes and stop the cascading failure instead
> of churning through our data nodes.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed
> scope of "coordinator" roles from a split query/indexing standpoint. I
> understand that these are used as examples, but would like stronger
> language that new roles should also go through their own SIP discussions.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node
> liveness in two different places now. We have the live nodes and we have
> the node roles stored in two different places in zookeeper and it feels
> like this would lead to race conditions or split brain or other hard to
> diagnose bugs when those two lists don't agree with each other. This also
> feels like it contradicts the "single source of truth" idea later stated in
> the proposal. I see Gus's arguments for decoupling these and am not
> strongly opposed, I just get a lurking feeling about it. Even if we don't
> do this, I would like this called out explicitly in the alternative
> approaches section as something that we considered and rejected, with
> details why.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an
> additional call out here that all operations are GET because nodes cannot
> be changed at runtime.
> >>>>>>>>>>> >>>>>>> CLARIFICATION: How does this interact with the
> previous OVERSEER preference role?
> >>>>>>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of
> available roles for a cluster. I _think_ this could be based on the version
> that the cluster is running? Would be useful to be able to interrogate a
> cluster in the future... we're seeing OOM issues on queries, can we add
> some query nodes? When were they introduced? I don't know what path this
> API should exist at.
> >>>>>>>>>>> >>>>>>
> >>>>>>>>>>> >>>>>>
> >>>>>>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated
> the SIP document. Not sure if there's a better path that we could go for.
> >>>>>>>>>>> >>>>>>
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show
> which parts are string literals and which parts are meant to be substituted
> by the operator? GET /api/cluster/roles/data would become GET
> /api/cluster/roles/${rolename} in our SIP/documentation.
> >>>>>>>>>>> >>>>>>> CHANGE REQUEST: I think GET
> /api/cluster/roles/nodes/node1 should be GET /api/cluster/roles/${nodename}
> dropping the intermediate "nodes"
> >>>>>>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need
> that intermediate "nodes" node.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some
> permissions? Maybe this requirement is too fundamental to the operation of
> a cluster and everybody would have to be able to do it.
> >>>>>>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other
> clients) to treat roles? Implementation detail that the servers will figure
> out? Or strict guidance where the client needs to check where specific
> roles are before sending any further communication to the server?
> >>>>>>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request
> that it can't fulfil? An overseer node gets a query or an update. A data
> node gets a collection creation request. Do they forward it on to an
> appropriate node, or do they reject it? Should this be configurable? If
> not, then it seems like lazy or poorly configured clients will defeat this
> isolation system quite easily.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
> >>>>>>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave
> when roles are added mean? I thought we established that they are not
> dynamic.
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> Thanks,
> >>>>>>>>>>> >>>>>>> Mike
> >>>>>>>>>>> >>>>>>>
> >>>>>>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >>>>>>>>>>> >>>>>>>>
> >>>>>>>>>>> >>>>>>>> Hi,
> >>>>>>>>>>> >>>>>>>>
> >>>>>>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node
> roles:
> >>>>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
> >>>>>>>>>>> >>>>>>>>
> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
> >>>>>>>>>>> >>>>>>>>
> >>>>>>>>>>> >>>>>>>> We also wish to add first class support for Query
> nodes that are used to process user queries by forwarding to data nodes,
> merging/aggregating them and presenting to users. This concept exists as
> first class citizens in most other search engines. This is a chance for
> Solr to catch up.
> >>>>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
> >>>>>>>>>>> >>>>>>>>
> >>>>>>>>>>> >>>>>>>> Regards,
> >>>>>>>>>>> >>>>>>>> Ishan / Noble / Hitesh
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>>
> >>>>>>>>>>> >>> --
> >>>>>>>>>>> >>> http://www.needhamsoftware.com (work)
> >>>>>>>>>>> >>> http://www.the111shift.com (play)
> >>>>>>>>>>>
> >>>>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>>>> For additional commands, e-mail: [email protected]
> >>>>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> http://www.needhamsoftware.com (work)
> >>>>>>> http://www.the111shift.com (play)
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> -----------------------------------------------------
> >>>>> Noble Paul
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://www.needhamsoftware.com (work)
> >>>> http://www.the111shift.com (play)
> >>>
> >>>
> >>>
> >>> --
> >>> -----------------------------------------------------
> >>> Noble Paul
> >>
> >>
> >>
> >> --
> >> -----------------------------------------------------
> >> Noble Paul
>
