Re: First class support for node roles

Noble Paul Sat, 04 Dec 2021 21:58:48 -0800

On Sun, Dec 5, 2021 at 4:47 PM Gus Heck <[email protected]> wrote:

> I like this in that it's an example of how the overseer might be extended
> without creating a new role :)
>
> Not entirely sure if I'm for or against an enum implementation here, but
> it makes me a bit nervous. Enums with complexity can quickly get into
> difficulty for unit tests (especially if one wanted to write a mock object
> based test, something I think we maybe should use a bit more than we do).
>
> I would tend to think of a class to represent and collect role related
> functionality, one that perhaps has methods that receive the request, or
> other key objects and thus could be tested without standing up an entire
> server. (Not against also having them exercised in a few integrated tests,
> but the more we can avoid interleaving logic directly within DispatchFilter
> and HttpSolrCall etc., the better.)
>
> So I guess I'm somewhat biased against any enum with more than a couple
> properties, and definitely don't want to wind up hanging lots of methods
> off of one. Better to use them to consume a configuration value and then
> instantiate a class that really holds the logic and data. I like them for
> constraining values and easy string value conversion but the more they look
> like classes the more I'd rather have a class.
>

I just meant it is a set of values. Please let us not discuss the actual
implementation here. We should stick to the high-level design in this
thread; specifics should be dealt with in a PR.

>
> -Gus
>
> On Sat, Dec 4, 2021 at 10:37 PM Noble Paul <[email protected]> wrote:
>
>> I recommend the following format for the role spec
>>
>> roles=<role-name>:<role-value>
>>
>> each role will have an enum of allowed values and a default value
>>
>>
>>    - role name: *data*
>>       - values: [*on*, *off*]
>>       - default: *on*
>>    - role name: *overseer*
>>       - values: [*allowed*, *disallowed*, *preferred*]
>>       - default: *allowed*
>>    - role name: *coordinator*
>>       - values: [*on*, *off*]
>>       - default: *off*
>>
>>
>> Examples:
>> roles=data:on,overseer:allowed (This is redundant because it uses all
>> the default values. If a node is started without any roles value, this is
>> the default behavior.)
>> roles=data:off,overseer:preferred (do not allow data; join the overseer
>> election at the head)
>> roles=coordinator:on,data:on (act as coordinator but allow data; this is
>> the same as roles=coordinator:on)
>> roles=coordinator:on,data:off (act as coordinator, disallow data)
>>
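A minimal sketch of how the roles value proposed above might be parsed, with each role falling back to its default when it is not specified. The class name RoleSpec and its parse method are hypothetical and not part of Solr. The default for data is taken to be "on", following the first example in the message.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the proposed roles value; not actual Solr code. */
final class RoleSpec {
  // Allowed values per role. By convention in this sketch only, the first
  // entry of each list is treated as the default.
  private static final Map<String, List<String>> ALLOWED = new LinkedHashMap<>();

  static {
    ALLOWED.put("data", List.of("on", "off"));
    ALLOWED.put("overseer", List.of("allowed", "disallowed", "preferred"));
    ALLOWED.put("coordinator", List.of("off", "on"));
  }

  /** Parses e.g. "data:off,overseer:preferred", filling in defaults for omitted roles. */
  static Map<String, String> parse(String spec) {
    Map<String, String> roles = new LinkedHashMap<>();
    ALLOWED.forEach((name, values) -> roles.put(name, values.get(0)));
    if (spec == null || spec.trim().isEmpty()) {
      return roles; // no roles value given: every role keeps its default
    }
    for (String entry : spec.split(",")) {
      String[] parts = entry.trim().split(":", 2);
      if (parts.length != 2) {
        throw new IllegalArgumentException("Expected <role-name>:<role-value>, got: " + entry);
      }
      String name = parts[0].trim();
      String value = parts[1].trim();
      List<String> allowed = ALLOWED.get(name);
      if (allowed == null) {
        throw new IllegalArgumentException("Unknown role: " + name);
      }
      if (!allowed.contains(value)) {
        throw new IllegalArgumentException("Value '" + value + "' not allowed for role " + name);
      }
      roles.put(name, value);
    }
    return roles;
  }
}
```

Under these assumptions, RoleSpec.parse("coordinator:on") and RoleSpec.parse("coordinator:on,data:on") produce the same map, which mirrors the equivalence noted in the examples.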
>>
>> On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <[email protected]> wrote:
>>
>>> If we go with no negative node roles and overseer node role is not
>>> strict (i.e. it’s a "preferred overseer"), then one would need to define a
>>> second node role "no_overseer" to explicitly exclude a node from ever
>>> becoming overseer (which I think is a useful feature until we switch the
>>> cluster default to not using the overseer), plus the implementation of
>>> these two node roles will obviously be coupled (and what if a node has both
>>> defined?).
>>>
>>> I prefer strict node roles.
>>> Maybe we could have node roles with [optional] parameters to let the
>>> node role implementation decide?
>>> The overseer node role for example could have one of 3 values defined
>>> for each node: “preferred” (default, equivalent to the existing overseer
>>> role), "accepted" (equivalent to currently not defining the overseer role)
>>> and "no_way" (does not exist today).
>>>
>>> This could be useful in other contexts. A node role “data” could be
>>> “fast” or “slow” depending on type of local persistent storage for example…
>>>
>>> Ilan
>>>
>>> On Fri 3 Dec 2021 at 16:10, Gus Heck <[email protected]> wrote:
>>>
>>>> I really don't think we should have types of roles. Not
>>>> negative/positive and not strict/non-strict. You have a role or you don't.
>>>> What that means is up to the code implementing the role.
>>>>
>>>> Roles should be free to configure a preference order (binary, or n-ary
>>>> or whatever, strict or loose), prohibit behavior, or enable behavior. In
>>>> this SIP I feel we should focus on how to identify what node has what role,
>>>> how to designate what roles a node has via config/params, and the APIs for
>>>> interacting with roles.
>>>>
>>>> We should for example be able to support roles such as
>>>>
>>>> PREFERRED_OVERSEER
>>>> DATA
>>>> NO_ROUTED_ALIAS  (just an example, not something I mean to suggest)
>>>>
>>>> Details about role implementation should probably be discussed in a
>>>> thread about that role.  Obviously we should think about the name carefully
>>>> to leave options open should we want to enhance things later so maybe
>>>>
>>>> OVERSEER_PREF  or just  OVERSEER
>>>>
>>>> would be better since it merely reads that the node implements some
>>>> sort of preference or config regarding overseer... but all this can be
>>>> decided on a per role basis
>>>>
>>>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <[email protected]>
>>>> wrote:
>>>>
>>>>> Negative roles have a place
>>>>>
>>>>> Example is overseer
>>>>>
>>>>> There are 3 possible choices for that role
>>>>>
>>>>> a) preferred: always be in front of the election queue
>>>>> b) on: not preferred, but can be an overseer if no preferred overseer
>>>>> nodes are available
>>>>> c) off: never become an overseer
>>>>>
>>>>> Today we only have options 'a' and 'b'. In a future ticket, we may
>>>>> implement 'c'.
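A minimal sketch of how the three overseer choices listed above could translate into an election order: preferred nodes first, then "on" nodes, with "off" nodes never considered. The class OverseerCandidates and its Mode enum are hypothetical. The real election runs through the election queue, which this sketch does not model, and sorting by node name within a tier is only an assumption made to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the three overseer choices; not actual Solr code. */
final class OverseerCandidates {
  enum Mode { PREFERRED, ON, OFF }

  /**
   * Returns the live nodes in the order they would be considered:
   * PREFERRED nodes first, then ON nodes. OFF nodes are excluded entirely.
   */
  static List<String> electionOrder(Map<String, Mode> liveNodes) {
    List<String> order = new ArrayList<>();
    for (Mode tier : List.of(Mode.PREFERRED, Mode.ON)) {
      liveNodes.entrySet().stream()
          .filter(e -> e.getValue() == tier)
          .map(Map.Entry::getKey)
          .sorted() // arbitrary tie-break within a tier (assumption)
          .forEach(order::add);
    }
    return order;
  }
}
```

If the returned list is empty, no live node is willing to be overseer. What the cluster should do in that case is the open question in the surrounding discussion.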
>>>>>
>>>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <[email protected]> wrote:
>>>>>
>>>>>> Negative roles add a lot of complexity, I would really want to stay
>>>>>> away from them. That’s why I want strict roles up front. It’s maybe ok to
>>>>>> push this decision out, but it also seems like the sort of thing we
>>>>>> should consider at the start.
>>>>>>
>>>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes. Negative roles are not a bad idea. If I start a node for
>>>>>>> machine learning purposes, I wouldn't want that node to ever
>>>>>>> participate in overseer election.
>>>>>>>
>>>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If we have non strict roles (like overseer), then it does make sense
>>>>>>>> to have negative roles.
>>>>>>>> That way I can define which are the two nodes that I'd prefer the
>>>>>>>> overseer to run on, and a few other nodes on which it should
>>>>>>>> definitely never run for various reasons. And in case these
>>>>>>>> "!overseer" are the only nodes left in the cluster, let the cluster
>>>>>>>> fail the same way it would if there were no data nodes available.
>>>>>>>>
>>>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> With the Strict/Loose option and sensible defaults, users
>>>>>>>> cannot trip themselves up by default, but the option is there for
>>>>>>>> people to tinker and have an iron grip over their cluster.
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> +1 to sensible defaults so users don't trip themselves. The
>>>>>>>> option to tinker for tighter grip can be tackled later, either on a per
>>>>>>>> role basis or as a generic concept later.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > +1 - Can definitely be added later if we so desire, not needed
>>>>>>>> for this SIP
>>>>>>>> >
>>>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]>
>>>>>>>> wrote:
>>>>>>>> >>>
>>>>>>>> >>> I think the key is to let the roles have full control of the
>>>>>>>> implications of having/not having that role. No need for even a
>>>>>>>> strict/loose designation. The question of "do you have the role" is
>>>>>>>> yes/no, with no logic to guess if the role is implied or not. The
>>>>>>>> question of "will it come up with the role" is
>>>>>>>> "have_explicit ? use_explicit : use_defaults".
>>>>>>>> >>>
>>>>>>>> >>> Once you figure out who has a role (or not) what that means is
>>>>>>>> up to the role code.
>>>>>>>> >>>
>>>>>>>> >>> Corollary: we don't have to change the way overseer works in
>>>>>>>> this SIP. We can rework it or not as we see fit separately.
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> +1
>>>>>>>> >>
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> Only thing we need to do is find a wording that makes the above
>>>>>>>> clear on first read through the SIP :)
>>>>>>>> >>>
>>>>>>>> >>> -Gus
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>>>>> them is up, the overseer will go there, and that is good and expected.
>>>>>>>> But what happens if all of the overseer eligible nodes are down? Your
>>>>>>>> comment, and the old system, would imply that the overseer election
>>>>>>>> goes to some other unrelated, untagged node. I disagree with this
>>>>>>>> implementation choice. This sounds like something role specific to
>>>>>>>> determine, but I would like to see us be more strict about it. I don't
>>>>>>>> want cores leaking out of my data roles, I don't want query processing
>>>>>>>> to leak out of my "query" nodes or whatever. Overseer shouldn't be
>>>>>>>> special in this regard.
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> I'm very strongly in favor of not letting users design a
>>>>>>>> system in which the cluster can be "live" without an overseer. I
>>>>>>>> understand that the overseer can be taxing to the cluster, but honestly
>>>>>>>> what is the point of having an untaxed cluster that doesn't have an
>>>>>>>> overseer? I can see arguments for the other roles to be stricter about
>>>>>>>> this, but there are also a lot of users who wouldn't want those to be
>>>>>>>> strict either (like "query" nodes).
>>>>>>>> >>>>
>>>>>>>> >>>> Maybe we just put in stronger guarantees that if a
>>>>>>>> non-overseer role node HAS to be selected to become overseer, it will
>>>>>>>> try to migrate the overseer job to a node with the overseer role
>>>>>>>> whenever one becomes live.
>>>>>>>> >>>>
>>>>>>>> >>>> So maybe we don't have special rules per role, but instead
>>>>>>>> roles can either be defined as "Strict" or "Loose" (better names likely
>>>>>>>> exist), and the roles come with a default (Overseer -> Loose, Data ->
>>>>>>>> Strict, Query -> Loose, etc.). And it is up to each role to define how
>>>>>>>> to behave when running in LOOSE mode and a non-role node is used then a
>>>>>>>> role node comes online (like the overseer example given above).
>>>>>>>> >>>>
>>>>>>>> >>>> With the Strict/Loose option and sensible defaults, users
>>>>>>>> cannot trip themselves up by default, but the option is there for
>>>>>>>> people to tinker and have an iron grip over their cluster.
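A minimal sketch of the Strict/Loose idea described above: each role carries a default mode that an operator may override. The class RoleStrictness, the lowercase role names, and both method names are hypothetical. Only the three defaults (Overseer -> Loose, Data -> Strict, Query -> Loose) come from the message.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of per-role Strict/Loose defaults; not actual Solr code. */
final class RoleStrictness {
  enum Mode { STRICT, LOOSE }

  // Defaults as listed in the message; lowercase role names are an assumption.
  private static final Map<String, Mode> DEFAULTS =
      Map.of("overseer", Mode.LOOSE, "data", Mode.STRICT, "query", Mode.LOOSE);

  /** Returns the operator's override when present, otherwise the role's default. */
  static Mode resolve(String role, Optional<Mode> override) {
    Mode fallback = DEFAULTS.get(role);
    if (fallback == null) {
      throw new IllegalArgumentException("Unknown role: " + role);
    }
    return override.orElse(fallback);
  }

  /** True when a node without the role may take over its work if no role node is live. */
  static boolean mayFallBackToNonRoleNode(String role, Optional<Mode> override) {
    return resolve(role, override) == Mode.LOOSE;
  }
}
```

With the defaults alone, an overseer could still land on an untagged node while cores would never leave data nodes. Passing Optional.of(Mode.STRICT) for the overseer role models the "iron grip" option.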
>>>>>>>> >>>>
>>>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]>
>>>>>>>> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>> Noble wrote:
>>>>>>>> >>>>> > We are not modifying the way the "overseer role" works
>>>>>>>> today. We are just changing the definition and standardizing the
>>>>>>>> configuration & discoverability
>>>>>>>> >>>>> Ishan wrote:
>>>>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER
>>>>>>>> role (which currently stands for preferred overseer). We can take a
>>>>>>>> stab at refactoring it later.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Grouping these two comments together, since I think they are
>>>>>>>> saying the same thing. I think this is part of my confusion. We have an
>>>>>>>> old system that doesn't work the way we want the new system to work.
>>>>>>>> There may be people already using the old system. What path do we offer
>>>>>>>> for folks using the old system to migrate to the new system? What
>>>>>>>> happens if somebody accidentally tries to use both systems at the same
>>>>>>>> time?
>>>>>>>> >>>>>
>>>>>>>> >>>>> Ishan wrote:
>>>>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER
>>>>>>>> role] are live, Solr guarantees that one of those nodes becomes the
>>>>>>>> overseer.", I meant to somewhat capture the current behaviour as the
>>>>>>>> OVERSEER role performs today. Do you see any inconsistency with this
>>>>>>>> statement vs. what it does today?
>>>>>>>> >>>>>
>>>>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>>>>> them is up, the overseer will go there, and that is good and expected.
>>>>>>>> But what happens if all of the overseer eligible nodes are down? Your
>>>>>>>> comment, and the old system, would imply that the overseer election
>>>>>>>> goes to some other unrelated, untagged node. I disagree with this
>>>>>>>> implementation choice. This sounds like something role specific to
>>>>>>>> determine, but I would like to see us be more strict about it. I don't
>>>>>>>> want cores leaking out of my data roles, I don't want query processing
>>>>>>>> to leak out of my "query" nodes or whatever. Overseer shouldn't be
>>>>>>>> special in this regard.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Noble wrote:
>>>>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in
>>>>>>>> the following request?
>>>>>>>> >>>>>
>>>>>>>> >>>>> You're absolutely correct, thanks for pointing this out.
>>>>>>>> Let's leave it as is.
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]>
>>>>>>>> wrote:
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Replying to the top post in this thread because there has
>>>>>>>> been a lot of discussion and I don't want to look like I'm continuing
>>>>>>>> any of those particular threads.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> I finally had time to sit down and think about this with
>>>>>>>> the attention it deserves and am generally happy with how the
>>>>>>>> conversation has shaped the current proposal.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> GOOD: I think using system properties to define node roles
>>>>>>>> is fine and I like that data is the default role when not defined. I
>>>>>>>> think it is important to hold on to the guarantee that an active
>>>>>>>> overseer will land on an overseer node role.
>>>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for
>>>>>>>> folks using the current OVERSEER role. I am not sure that something can
>>>>>>>> be done automatically since they need to now specify new properties at
>>>>>>>> startup. Maybe we need to include loud warnings or support both
>>>>>>>> approaches for a time?
>>>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer
>>>>>>>> nodes fail, then it is implied the overseer will go to one of the data
>>>>>>>> nodes. The specific wording in the SIP - "When one or more such nodes
>>>>>>>> are live, Solr guarantees that one of those nodes become the overseer."
>>>>>>>> implies to me that failover could go from overseer1 to overseer2 to
>>>>>>>> overseerN to random node. I feel like we need to have some recording
>>>>>>>> that there were dedicated overseer nodes and stop the cascading failure
>>>>>>>> instead of churning through our data nodes.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope
>>>>>>>> of "coordinator" roles from a split query/indexing standpoint. I
>>>>>>>> understand that these are used as examples, but would like stronger
>>>>>>>> language that new roles should also go through their own SIP
>>>>>>>> discussions.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node
>>>>>>>> liveness in two different places now. We have the live nodes and we
>>>>>>>> have the node roles stored in two different places in zookeeper and it
>>>>>>>> feels like this would lead to race conditions or split brain or other
>>>>>>>> hard to diagnose bugs when those two lists don't agree with each other.
>>>>>>>> This also feels like it contradicts the "single source of truth" idea
>>>>>>>> later stated in the proposal. I see Gus's arguments for decoupling
>>>>>>>> these and am not strongly opposed, I just get a lurking feeling about
>>>>>>>> it. Even if we don't do this, I would like this called out explicitly
>>>>>>>> in the alternative approaches section as something that we considered
>>>>>>>> and rejected, with details why.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an
>>>>>>>> additional call out here that all operations are GET because nodes
>>>>>>>> cannot be changed at runtime.
>>>>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous
>>>>>>>> OVERSEER preference role?
>>>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of
>>>>>>>> available roles for a cluster. I _think_ this could be based on the
>>>>>>>> version that the cluster is running? Would be useful to be able to
>>>>>>>> interrogate a cluster in the future... we're seeing OOM issues on
>>>>>>>> queries, can we add some query nodes? When were they introduced? I
>>>>>>>> don't know what path this API should exist at.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the
>>>>>>>> SIP document. Not sure if there's a better path that we could go for.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which
>>>>>>>> parts are string literals and which parts are meant to be substituted
>>>>>>>> by the operator? GET /api/cluster/roles/data would become GET
>>>>>>>> /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1
>>>>>>>> should be GET /api/cluster/roles/${nodename} dropping the intermediate
>>>>>>>> "nodes"
>>>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>>>>>> intermediate "nodes" node.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some
>>>>>>>> permissions? Maybe this requirement is too fundamental to the operation
>>>>>>>> of a cluster and everybody would have to be able to do it.
>>>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients)
>>>>>>>> to treat roles? Implementation detail that the servers will figure out?
>>>>>>>> Or strict guidance where the client needs to check where specific roles
>>>>>>>> are before sending any further communication to the server?
>>>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that
>>>>>>>> it can't fulfil? An overseer node gets a query or an update. A data
>>>>>>>> node gets a collection creation request. Do they forward it on to an
>>>>>>>> appropriate node, or do they reject it? Should this be configurable? If
>>>>>>>> not, then it seems like lazy or poorly configured clients will defeat
>>>>>>>> this isolation system quite easily.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when
>>>>>>>> roles are added mean? I thought we established that they are not
>>>>>>>> dynamic.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Thanks,
>>>>>>>> >>>>>>> Mike
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>> Hi,
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>>> >>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>> We also wish to add first class support for Query nodes
>>>>>>>> that are used to process user queries by forwarding to data nodes,
>>>>>>>> merging/aggregating them and presenting to users. This concept exists
>>>>>>>> as first class citizens in most other search engines. This is a chance
>>>>>>>> for Solr to catch up.
>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>>> >>>>>>>>
>>>>>>>> >>>>>>>> Regards,
>>>>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> --
>>>>>>>> >>> http://www.needhamsoftware.com (work)
>>>>>>>> >>> http://www.the111shift.com (play)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul
>>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>


-- 
-----------------------------------------------------
Noble Paul
