On Sun, Dec 5, 2021 at 4:47 PM Gus Heck <[email protected]> wrote: > I like this in that it's an example of how the overseer might be extended > without creating a new role :) > > Not entirely sure if I'm for or against an enum implementation here, but > it makes me a bit nervous. Enums with complexity can quickly get into > difficulty for unit tests (especially if one wanted to write a mock object > based test, something I think we maybe should use a bit more than we do). >
> > I would tend to think of a class to represent and collect role related > functionality, one that perhaps has methods that receive the request, or > other key objects and thus could be tested without standing up an entire > server. (Not against also having them exercised in a few integrated tests, > but the more we can avoid interleaving logic directly within DispatchFilter > and HttpSolrCall etc., the better.) > > So I guess I'm somewhat biased against any enum with more than a couple > properties, and definitely don't want to wind up hanging lots of methods > off of one. Better to use them to consume a configuration value and then > instantiate a class that really holds the logic and data. I like them for > constraining values and easy string value conversion but the more they look > like classes the more I'd rather have a class. > I just meant it is a set of values. Please let us not discuss the actual implementation here. We should stick to discussing the high level design here, and specifics should be dealt with in a PR. > > -Gus > > On Sat, Dec 4, 2021 at 10:37 PM Noble Paul <[email protected]> wrote: > >> I recommend the following format for the role spec >> >> roles=<role-name>:<role-value> >> >> each role will have an enum of allowed values and a default value >> >> >> - role name: *data* >> - values: [*on*, *off*] >> - default: *on* >> - role name: *overseer* >> - values: [*allowed*, *disallowed*, *preferred*] >> - default: *allowed* >> - role name: *coordinator* >> - values: [*on*, *off*] >> - default: *off* >> >> >> examples >> roles=data:on,overseer:allowed (This is redundant because it uses all >> the default values. 
If a node is started without any roles value this is >> the default behavior) >> roles=data:off,overseer:preferred ( do not allow data, join overseer >> election at head) >> roles=coordinator:on,data:on (role as coordinator, but allow data, it's >> same as roles=coordinator:on) >> roles=coordinator:on,data:off (role as coordinator, disallow data) >> >> >> On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <[email protected]> wrote: >> >>> If we go with no negative node roles and overseer node role is not >>> strict (i.e. it’s a "preferred overseer"), then one would need to define a >>> second node role "no_overseer" to explicitly exclude a node from ever >>> becoming overseer (which I think is a useful feature until we switch the >>> cluster default to not using the overseer), plus the implementation of >>> these two node roles will obviously be coupled (and what if a node has both >>> defined?). >>> >>> I prefer strict node roles. >>> Maybe we could have node roles with [optional] parameters to let the >>> node role implementation decide ? >>> The overseer node role for example could have one of 3 values defined >>> for each node: “preferred” (default, equivalent to the existing overseer >>> role), "accepted" (equivalent to currently not defining the overseer role) >>> and "no_way" (does not exist today). >>> >>> This could be useful in other contexts. A node role “data” could be >>> “fast” or “slow” depending on type of local persistent storage for example… >>> >>> Ilan >>> >>> On Fri 3 Dec 2021 at 16:10, Gus Heck <[email protected]> wrote: >>> >>>> I really don't think we should have types of roles. Not >>>> negative/positive and not strict/non-strict. You have a role or you don't. >>>> What that means is up to the code implementing the role. >>>> >>>> Roles should be free to configure a preference order (binary, or n-ary >>>> or whatever, strict or loose), prohibit behavior, or enable behavior. 
In >>>> this SIP I feel we should focus on How to identify what node has what role, >>>> How to designate what roles a node has via config/params, and the APIs for >>>> interacting with roles. >>>> >>>> We should for example be able to support roles such as >>>> >>>> PREFERRED_OVERSEER >>>> DATA >>>> NO_ROUTED_ALIAS (just an example, not something I mean to suggest) >>>> >>>> Details about role implementation should probably be discussed in a >>>> thread about that role. Obviously we should think about the name carefully >>>> to leave options open should we want to enhance things later so maybe >>>> >>>> OVERSEER_PREF or just OVERSEER >>>> >>>> would be better since it merely reads that the node implements some >>>> sort of preference or config regarding overseer... but all this can be >>>> decided on a per role basis >>>> >>>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <[email protected]> >>>> wrote: >>>> >>>>> Negative roles have a place >>>>> >>>>> Example is overseer >>>>> >>>>> There are 3 possible choices for that role >>>>> >>>>> a) preferred: always be in front of the election queue >>>>> b) on: not preferred, but can be an overseer if no preferred overseer >>>>> nodes are available >>>>> c) off: never become an overseer >>>>> >>>>> Today we only have options 'a' and 'b'. In a future ticket, we may >>>>> implement 'c'. >>>>> >>>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <[email protected]> wrote: >>>>> >>>>>> Negative roles add a lot of complexity, I would really want to stay >>>>>> away from them. That’s why I want strict roles up front. It’s maybe ok to >>>>>> push this decision out, but it also seems like the sort of thing we >>>>>> should >>>>>> consider at the start. >>>>>> >>>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Yes. Negative roles is not a bad idea. 
If I start a node for >>>>>>> machine learning purposes, I wouldn't want that node to ever >>>>>>> participate in >>>>>>> overseer election >>>>>>> >>>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> If we have non strict roles (like overseer), then it does make sense >>>>>>>> to have negative roles. >>>>>>>> That way I can define which are the two nodes that I'd prefer the >>>>>>>> overseer to run on, and a few other nodes on which it should >>>>>>>> definitely never run for various reasons. And in case these >>>>>>>> "!overseer" are the only nodes left in the cluster, let the cluster >>>>>>>> fail the same way it would if there were no data nodes available. >>>>>>>> >>>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman < >>>>>>>> [email protected]> wrote: >>>>>>>> >>> >>>>>>>> >>> With the Strict/Loose option and sensible defaults, users >>>>>>>> cannot trip themselves up by default, but the option is there for >>>>>>>> people to >>>>>>>> tinker and have an iron grip over their cluster. >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> +1 to sensible defaults so users don't trip themselves. The >>>>>>>> option to tinker for tighter grip can be tackled later, either on a per >>>>>>>> role basis or as a generic concept later. >>>>>>>> > >>>>>>>> > >>>>>>>> > +1 - Can definitely be added later if we so desire, not needed >>>>>>>> for this SIP >>>>>>>> > >>>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya < >>>>>>>> [email protected]> wrote: >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]> >>>>>>>> wrote: >>>>>>>> >>> >>>>>>>> >>> I think the key is to let the roles have full control of the >>>>>>>> implications of having/not having that role. No need for even a >>>>>>>> strict/loose designation. 
The question of do you have the role is >>>>>>>> yes/no >>>>>>>> with no logic to guess if the role is implied or not, The question of >>>>>>>> will >>>>>>>> it come up with the role is "have_explicit ? use_defaults : >>>>>>>> use_defaults. >>>>>>>> >>> >>>>>>>> >>> Once you figure out who has a role (or not) what that means is >>>>>>>> up to the role code. >>>>>>>> >>> >>>>>>>> >>> Corollary: we don't have to change the way overseer works in >>>>>>>> this SIP. We can rework it or not as we see fit separately. >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> +1 >>>>>>>> >> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> Only thing we need to do is find a wording that makes the above >>>>>>>> clear on first read through the SIP :) >>>>>>>> >>> >>>>>>>> >>> -Gus >>>>>>>> >>> >>>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>> >>>>>>>> >>>>> This doesn't really address my concern around what happens if >>>>>>>> all of our existing OVERSEER candidates are down. When at least one of >>>>>>>> them >>>>>>>> is up, the overseer will go there, and that is good and expected. But >>>>>>>> what >>>>>>>> happens if all of the overseer eligible nodes are down. Your comment, >>>>>>>> and >>>>>>>> the old system, would imply that the overseer election goes to some >>>>>>>> other >>>>>>>> unrelated, untagged node. I disagree with this implementation choice. >>>>>>>> This >>>>>>>> sounds like something role specific to determine, but I would like to >>>>>>>> see >>>>>>>> us be more strict about it. I don't want cores leaking out of my data >>>>>>>> roles, I don't want query processing to leak out of my "query" nodes or >>>>>>>> whatever. Overseer shouldn't be special in this regard. >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> I'm very strongly in favor of not letting users design a >>>>>>>> system in which the cluster can be "live" without an overseer. 
I >>>>>>>> understand >>>>>>>> that the overseer can be taxing to the cluster, but honestly what is >>>>>>>> the >>>>>>>> point of having an untaxed cluster that doesn't have an overseer? I >>>>>>>> can see >>>>>>>> arguments for the other roles to be stricter about this, but there are >>>>>>>> also >>>>>>>> a lot of users who wouldn't want those to be strict either (like >>>>>>>> "query" >>>>>>>> nodes). >>>>>>>> >>>> >>>>>>>> >>>> Maybe we just put in stronger guarantees that if a >>>>>>>> non-overseer role node HAS to be selected to become overseer, it will >>>>>>>> try >>>>>>>> to migrate the overseer job to a node with the overseer role whenever >>>>>>>> one >>>>>>>> becomes live. >>>>>>>> >>>> >>>>>>>> >>>> So maybe we don't have special rules per role, but instead >>>>>>>> roles can either be defined as "Strict" or "Loose" (better names likely >>>>>>>> exist), and the roles come with a default (Overseer -> Loose, Data -> >>>>>>>> Strict, Query -> Loose, etc.). And it is up to each role to define how >>>>>>>> to >>>>>>>> behave when running in LOOSE mode and a non-role node is used then a >>>>>>>> role >>>>>>>> node comes online (like the overseer example given above). >>>>>>>> >>>> >>>>>>>> >>>> With the Strict/Loose option and sensible defaults, users >>>>>>>> cannot trip themselves up by default, but the option is there for >>>>>>>> people to >>>>>>>> tinker and have an iron grip over their cluster. >>>>>>>> >>>> >>>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>> >>>>>>>> >>>>> Noble wrote: >>>>>>>> >>>>> > We are not modifying the way the "overseer role" works >>>>>>>> today. We are just changing the definition and standardizing the >>>>>>>> configuration & discoverability >>>>>>>> >>>>> Ishan wrote: >>>>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER >>>>>>>> role (which currently stands for preferred overseer). We can take a >>>>>>>> stab at >>>>>>>> refactoring it later. 
>>>>>>>> >>>>> >>>>>>>> >>>>> Grouping these two comments together, since I think they are >>>>>>>> saying the same thing. I think this is part of my confusion. We have >>>>>>>> an old >>>>>>>> system that doesn't work the way we want the new system to work. There >>>>>>>> may >>>>>>>> be people already using the old system. What path do we offer for folks >>>>>>>> using the old system to migrate to the new system? What happens if >>>>>>>> somebody >>>>>>>> accidentally tries to use both systems at the same time? >>>>>>>> >>>>> >>>>>>>> >>>>> Ishan wrote: >>>>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER >>>>>>>> role] are live, Solr guarantees that one of those nodes becomes the >>>>>>>> overseer.", I meant to somewhat capture the current behaviour as the >>>>>>>> OVERSEER role performs today. Do you see any inconsistency with this >>>>>>>> statement vs. what it does today? >>>>>>>> >>>>> >>>>>>>> >>>>> This doesn't really address my concern around what happens if >>>>>>>> all of our existing OVERSEER candidates are down. When at least one of >>>>>>>> them >>>>>>>> is up, the overseer will go there, and that is good and expected. But >>>>>>>> what >>>>>>>> happens if all of the overseer eligible nodes are down. Your comment, >>>>>>>> and >>>>>>>> the old system, would imply that the overseer election goes to some >>>>>>>> other >>>>>>>> unrelated, untagged node. I disagree with this implementation choice. >>>>>>>> This >>>>>>>> sounds like something role specific to determine, but I would like to >>>>>>>> see >>>>>>>> us be more strict about it. I don't want cores leaking out of my data >>>>>>>> roles, I don't want query processing to leak out of my "query" nodes or >>>>>>>> whatever. Overseer shouldn't be special in this regard. >>>>>>>> >>>>> >>>>>>>> >>>>> Noble wrote: >>>>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in >>>>>>>> the following request? 
>>>>>>>> >>>>> >>>>>>>> >>>>> You're absolutely correct, thanks for pointing this out. >>>>>>>> Let's leave it as is. >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> Replying to the top post in this thread because there has >>>>>>>> been a lot of discussion and I don't want to look like I'm continuing >>>>>>>> any >>>>>>>> of those particular threads. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> I finally had time to sit down and think about this with >>>>>>>> the attention it deserves and am generally happy with how the >>>>>>>> conversation >>>>>>>> has shaped the current proposal. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> GOOD: I think using system properties to define node roles >>>>>>>> is fine and I like that data is the default role when not defined. I >>>>>>>> think >>>>>>>> it is important to hold on to the guarantee that an active overseer >>>>>>>> will >>>>>>>> land on an overseer node role. >>>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for >>>>>>>> folks using the current OVERSEER role. I am not sure that something >>>>>>>> can be >>>>>>>> done automatically since they need to now specify new properties at >>>>>>>> startup. Maybe we need to include loud warnings or support both >>>>>>>> approaches >>>>>>>> for a time? >>>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer >>>>>>>> nodes fail, then it is implied the overseer will go to one of the data >>>>>>>> nodes. The specific wording in the SIP - "When one or more such nodes >>>>>>>> are >>>>>>>> live, Solr guarantees that one of those nodes become the overseer." >>>>>>>> implies >>>>>>>> to me that failover could go from overseer1 to overseer2 to overseerN >>>>>>>> to >>>>>>>> random node. 
I feel like we need to have some recording that there were >>>>>>>> dedicated overseer nodes and stop the cascading failure instead of >>>>>>>> churning >>>>>>>> through our data nodes. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope >>>>>>>> of "coordinator" roles from a split query/indexing standpoint. I >>>>>>>> understand >>>>>>>> that these are used as examples, but would like stronger language that >>>>>>>> new >>>>>>>> roles should also go through their own SIP discussions. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node >>>>>>>> liveness in two different places now. We have the live nodes and we >>>>>>>> have >>>>>>>> the node roles stored in two different places in zookeeper and it feels >>>>>>>> like this would lead to race conditions or split brain or other hard to >>>>>>>> diagnose bugs when those two lists don't agree with each other. This >>>>>>>> also >>>>>>>> feels like it contradicts the "single source of truth" idea later >>>>>>>> stated in >>>>>>>> the proposal. I see Gus's arguments for decoupling these and am not >>>>>>>> strongly opposed, I just get a lurking feeling about it. Even if we >>>>>>>> don't >>>>>>>> do this, I would like this called out explicitly in the alternative >>>>>>>> approaches section as something that we considered and rejected, with >>>>>>>> details why, >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an >>>>>>>> additional call out here that all operations are GET because nodes >>>>>>>> cannot >>>>>>>> be changed at runtime. >>>>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous >>>>>>>> OVERSEER preference role? >>>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of >>>>>>>> available roles for a cluster. I _think_ this could be based on the >>>>>>>> version >>>>>>>> that the cluster is running? 
Would be useful to be able to interrogate >>>>>>>> a >>>>>>>> cluster in the future... we're seeing OOM issues on queries, can we add >>>>>>>> some query nodes? When were they introduced? I don't know what path >>>>>>>> this >>>>>>>> API should exist at. >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the >>>>>>>> SIP document. Not sure if there's a better path that we could go for. >>>>>>>> >>>>>> >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which >>>>>>>> parts are string literals and which parts are meant to be substituted >>>>>>>> by >>>>>>>> the operator? GET /api/cluster/roles/data would become GET >>>>>>>> /api/cluster/roles/${rolename} in our SIP/documentation. >>>>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1 >>>>>>>> should be GET /api/cluster/roles/${nodename} dropping the intermediate >>>>>>>> "nodes" >>>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that >>>>>>>> intermediate "nodes" node. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some >>>>>>>> permissions? Maybe this requirement is too fundamental to the >>>>>>>> operation of >>>>>>>> a cluster and everybody would have to be able to do it. >>>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) >>>>>>>> to treat roles? Implementation detail that the servers will figure >>>>>>>> out? Or >>>>>>>> strict guidance where the client needs to check where specific roles >>>>>>>> are >>>>>>>> before sending any further communication to the server? >>>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that >>>>>>>> it can't fulfil? An overseer node gets a query or an update. A data >>>>>>>> node >>>>>>>> gets a collection creation request. Do they forward it on to an >>>>>>>> appropriate >>>>>>>> node, or do they reject it? Should this be configurable? 
If not, then >>>>>>>> it >>>>>>>> seems like lazy or poorly configured clients will defeat this isolation >>>>>>>> system quite easily. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> GOOD: Testing the API is very important, yes. >>>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when >>>>>>>> roles are added mean? I thought we established that they are not >>>>>>>> dynamic. >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> Thanks, >>>>>>>> >>>>>>> Mike >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles: >>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694 >>>>>>>> >>>>>>>> >>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> We also wish to add first class support for Query nodes >>>>>>>> that are used to process user queries by forwarding to data nodes, >>>>>>>> merging/aggregating them and presenting to users. This concept exists >>>>>>>> as >>>>>>>> first class citizens in most other search engines. This is a chance for >>>>>>>> Solr to catch up. 
>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Ishan / Noble / Hitesh >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> -- >>>>>>>> >>> http://www.needhamsoftware.com (work) >>>>>>>> >>> http://www.the111shift.com (play) >>>>>>>> >>>>>>>> >>>> >>>> -- >>>> http://www.needhamsoftware.com (work) >>>> http://www.the111shift.com (play) >>>> >>> >> >> -- >> ----------------------------------------------------- >> Noble Paul >> > > > -- > http://www.needhamsoftware.com (work) > http://www.the111shift.com (play) > -- ----------------------------------------------------- Noble Paul
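
P.S. Purely to illustrate the roles=<role-name>:<role-value> format quoted above, and not as a proposal for the actual implementation (that belongs in a PR), here is a minimal sketch of how such a value could be read. The class name RoleSpec and the method parse are hypothetical and are not existing Solr APIs. The sketch assumes the default for "data" is "on", as the first example in the thread implies, and it takes only the part after "roles=" as input.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the proposed role spec; not Solr code.
class RoleSpec {
    // Allowed values per role. By convention in this sketch, the first
    // entry of each list is the default value for that role.
    static final Map<String, List<String>> ALLOWED = new LinkedHashMap<>();
    static {
        ALLOWED.put("data", List.of("on", "off"));
        ALLOWED.put("overseer", List.of("allowed", "disallowed", "preferred"));
        ALLOWED.put("coordinator", List.of("off", "on"));
    }

    // Parses e.g. "data:off,overseer:preferred" into role -> value,
    // filling in the default for every role that is not mentioned.
    static Map<String, String> parse(String spec) {
        Map<String, String> roles = new LinkedHashMap<>();
        ALLOWED.forEach((name, values) -> roles.put(name, values.get(0)));
        if (spec == null || spec.trim().isEmpty()) {
            return roles; // no roles given: all defaults
        }
        for (String entry : spec.split(",")) {
            String[] kv = entry.trim().split(":", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException(
                    "Expected <role-name>:<role-value> but got: " + entry);
            }
            List<String> values = ALLOWED.get(kv[0]);
            if (values == null) {
                throw new IllegalArgumentException("Unknown role: " + kv[0]);
            }
            if (!values.contains(kv[1])) {
                throw new IllegalArgumentException(
                    "Value '" + kv[1] + "' not allowed for role " + kv[0]);
            }
            roles.put(kv[0], kv[1]);
        }
        return roles;
    }

    public static void main(String[] args) {
        // prints {data=off, overseer=preferred, coordinator=off}
        System.out.println(parse("data:off,overseer:preferred"));
    }
}
```

With this reading, roles=coordinator:on and roles=coordinator:on,data:on resolve to the same map, which matches the third example in the thread.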
