I shall update the SIP proposal if we have a consensus on this configuration.
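
To make the configuration concrete, here is a rough sketch of how the proposed roles=<role-name>:<role-value> string from the quoted thread could be parsed and validated. It is only an illustration, not the intended implementation: the names ROLE_SPECS and parse_roles are hypothetical, and the default for data is assumed to be "on", as the first example in the quoted proposal implies.

```python
from typing import Dict, Tuple

# Hypothetical table: role name -> (allowed values, default value).
# Role names and values come from the proposal quoted in this thread;
# the "on" default for data is an assumption based on its first example.
ROLE_SPECS: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "data": (("on", "off"), "on"),
    "overseer": (("allowed", "disallowed", "preferred"), "allowed"),
    "coordinator": (("on", "off"), "off"),
}


def parse_roles(spec: str = "") -> Dict[str, str]:
    """Parse 'name:value,name:value' and fill in defaults for omitted roles."""
    # Start from the defaults so an empty spec yields the default behavior.
    roles = {name: default for name, (_allowed, default) in ROLE_SPECS.items()}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty spec or stray commas
        name, sep, value = entry.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or name not in ROLE_SPECS:
            raise ValueError(f"unknown or malformed role entry: {entry!r}")
        allowed, _default = ROLE_SPECS[name]
        if value not in allowed:
            raise ValueError(f"role {name!r} does not allow value {value!r}")
        roles[name] = value
    return roles


if __name__ == "__main__":
    print(parse_roles("data:off,overseer:preferred"))
```

With this sketch, an empty spec and roles=data:on,overseer:allowed resolve to the same mapping, which matches the "redundant" example in the proposal. What each value actually means is left to the code implementing the role.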

On Sun, Dec 5, 2021 at 4:58 PM Noble Paul <[email protected]> wrote:

> On Sun, Dec 5, 2021 at 4:47 PM Gus Heck <[email protected]> wrote:
>
>> I like this in that it's an example of how the overseer might be extended without creating a new role :)
>>
>> Not entirely sure if I'm for or against an enum implementation here, but it makes me a bit nervous. Enums with complexity can quickly get into difficulty for unit tests (especially if one wanted to write a mock object based test, something I think we maybe should use a bit more than we do).
>>
>> I would tend to think of a class to represent and collect role related functionality, one that perhaps has methods that receive the request, or other key objects and thus could be tested without standing up an entire server. (Not against also having them exercised in a few integrated tests, but the more we can avoid interleaving logic directly within DispatchFilter and HttpSolrCall etc. the better.)
>>
>> So I guess I'm somewhat biased against any enum with more than a couple properties, and definitely don't want to wind up hanging lots of methods off of one. Better to use them to consume a configuration value and then instantiate a class that really holds the logic and data. I like them for constraining values and easy string value conversion but the more they look like classes the more I'd rather have a class.
>
> I just meant it is a set of values. Please let us not discuss the actual impl here. We should stick to discussing the high level design here and specifics should be dealt with in a PR.
>
>> -Gus
>>
>> On Sat, Dec 4, 2021 at 10:37 PM Noble Paul <[email protected]> wrote:
>>
>>> I recommend the following format for the role spec
>>>
>>> roles=<role-name>:<role-value>
>>>
>>> each role will have an enum of allowed values and a default value
>>>
>>> - role name: *data*
>>>   - values: [*on*, *off*]
>>>   - default: *on*
>>> - role name: *overseer*
>>>   - values: [*allowed*, *disallowed*, *preferred*]
>>>   - default: *allowed*
>>> - role name: *coordinator*
>>>   - values: [*on*, *off*]
>>>   - default: *off*
>>>
>>> examples
>>> roles=data:on,overseer:allowed (This is redundant because it uses all the default values. If a node is started without any roles value this is the default behavior)
>>> roles=data:off,overseer:preferred (do not allow data, join overseer election at head)
>>> roles=coordinator:on,data:on (role as coordinator, but allow data, it's same as roles=coordinator:on)
>>> roles=coordinator:on,data:off (role as coordinator, disallow data)
>>>
>>> On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <[email protected]> wrote:
>>>
>>>> If we go with no negative node roles and overseer node role is not strict (i.e. it’s a "preferred overseer"), then one would need to define a second node role "no_overseer" to explicitly exclude a node from ever becoming overseer (which I think is a useful feature until we switch the cluster default to not using the overseer), plus the implementation of these two node roles will obviously be coupled (and what if a node has both defined?).
>>>>
>>>> I prefer strict node roles.
>>>> Maybe we could have node roles with [optional] parameters to let the node role implementation decide?
>>>> The overseer node role for example could have one of 3 values defined for each node: “preferred” (default, equivalent to the existing overseer role), "accepted" (equivalent to currently not defining the overseer role) and "no_way" (does not exist today).
>>>>
>>>> This could be useful in other contexts. A node role “data” could be “fast” or “slow” depending on type of local persistent storage for example…
>>>>
>>>> Ilan
>>>>
>>>> On Fri 3 Dec 2021 at 16:10, Gus Heck <[email protected]> wrote:
>>>>
>>>>> I really don't think we should have types of roles. Not negative/positive and not strict/non-strict. You have a role or you don't. What that means is up to the code implementing the role.
>>>>>
>>>>> Roles should be free to configure a preference order (binary, or n-ary or whatever, strict or loose), prohibit behavior, or enable behavior. In this SIP I feel we should focus on how to identify what node has what role, how to designate what roles a node has via config/params, and the APIs for interacting with roles.
>>>>>
>>>>> We should for example be able to support roles such as
>>>>>
>>>>> PREFERRED_OVERSEER
>>>>> DATA
>>>>> NO_ROUTED_ALIAS (just an example, not something I mean to suggest)
>>>>>
>>>>> Details about role implementation should probably be discussed in a thread about that role. Obviously we should think about the name carefully to leave options open should we want to enhance things later so maybe
>>>>>
>>>>> OVERSEER_PREF or just OVERSEER
>>>>>
>>>>> would be better since it merely reads that the node implements some sort of preference or config regarding overseer... but all this can be decided on a per role basis
>>>>>
>>>>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <[email protected]> wrote:
>>>>>
>>>>>> Negative roles have a place
>>>>>>
>>>>>> Example is overseer
>>>>>>
>>>>>> There are 3 possible choices for that role
>>>>>>
>>>>>> a) preferred: always be in front of the election queue
>>>>>> b) on: not preferred, but can be an overseer if no preferred overseer nodes are available
>>>>>> c) off: never become an overseer
>>>>>>
>>>>>> Today we only have options 'a' and 'b'. In a future ticket, we may implement 'c'
>>>>>>
>>>>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <[email protected]> wrote:
>>>>>>
>>>>>>> Negative roles add a lot of complexity, I would really want to stay away from them. That’s why I want strict roles up front. It’s maybe ok to push this decision out, but it also seems like the sort of thing we should consider at the start.
>>>>>>>
>>>>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes. Negative roles is not a bad idea. If I start a node for machine learning purposes, I wouldn't want that node to ever participate in overseer election
>>>>>>>>
>>>>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> If we have non strict roles (like overseer), then it does make sense to have negative roles.
>>>>>>>>> That way I can define which are the two nodes that I'd prefer the overseer to run on, and a few other nodes on which it should definitely never run for various reasons. And in case these "!overseer" are the only nodes left in the cluster, let the cluster fail the same way it would if there were no data nodes available.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <[email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>>>>>>> >>
>>>>>>>>> >> +1 to sensible defaults so users don't trip themselves. The option to tinker for tighter grip can be tackled later, either on a per role basis or as a generic concept later.
>>>>>>>>> >
>>>>>>>>> > +1 - Can definitely be added later if we so desire, not needed for this SIP
>>>>>>>>> >
>>>>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>>>>>> >>
>>>>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> I think the key is to let the roles have full control of the implications of having/not having that role. No need for even a strict/loose designation. The question of do you have the role is yes/no with no logic to guess if the role is implied or not. The question of will it come up with the role is "have_explicit ? use_explicit : use_defaults".
>>>>>>>>> >>>
>>>>>>>>> >>> Once you figure out who has a role (or not) what that means is up to the role code.
>>>>>>>>> >>>
>>>>>>>>> >>> Corollary: we don't have to change the way overseer works in this SIP. We can rework it or not as we see fit separately.
>>>>>>>>> >>
>>>>>>>>> >> +1
>>>>>>>>> >>
>>>>>>>>> >>>
>>>>>>>>> >>> Only thing we need to do is find a wording that makes the above clear on first read through the SIP :)
>>>>>>>>> >>>
>>>>>>>>> >>> -Gus
>>>>>>>>> >>>
>>>>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <[email protected]> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down. Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>>>>>>> >>>>
>>>>>>>>> >>>> I'm very strongly in favor of not letting users design a system in which the cluster can be "live" without an overseer. I understand that the overseer can be taxing to the cluster, but honestly what is the point of having an untaxed cluster that doesn't have an overseer? I can see arguments for the other roles to be stricter about this, but there are also a lot of users who wouldn't want those to be strict either (like "query" nodes).
>>>>>>>>> >>>>
>>>>>>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer role node HAS to be selected to become overseer, it will try to migrate the overseer job to a node with the overseer role whenever one becomes live.
>>>>>>>>> >>>>
>>>>>>>>> >>>> So maybe we don't have special rules per role, but instead roles can either be defined as "Strict" or "Loose" (better names likely exist), and the roles come with a default (Overseer -> Loose, Data -> Strict, Query -> Loose, etc.). And it is up to each role to define how to behave when running in LOOSE mode and a non-role node is used then a role node comes online (like the overseer example given above).
>>>>>>>>> >>>>
>>>>>>>>> >>>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>>>>>>> >>>>
>>>>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Noble wrote:
>>>>>>>>> >>>>> > We are not modifying the way the "overseer role" works today. We are just changing the definition and standardizing the configuration & discoverability
>>>>>>>>> >>>>> Ishan wrote:
>>>>>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role (which currently stands for preferred overseer). We can take a stab at refactoring it later.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Grouping these two comments together, since I think they are saying the same thing. I think this is part of my confusion. We have an old system that doesn't work the way we want the new system to work. There may be people already using the old system. What path do we offer for folks using the old system to migrate to the new system? What happens if somebody accidentally tries to use both systems at the same time?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Ishan wrote:
>>>>>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are live, Solr guarantees that one of those nodes becomes the overseer.", I meant to somewhat capture the current behaviour as the OVERSEER role performs today. Do you see any inconsistency with this statement vs. what it does today?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down. Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Noble wrote:
>>>>>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in the following request?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's leave it as is.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote:
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Replying to the top post in this thread because there has been a lot of discussion and I don't want to look like I'm continuing any of those particular threads.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> I finally had time to sit down and think about this with the attention it deserves and am generally happy with how the conversation has shaped the current proposal.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> GOOD: I think using system properties to define node roles is fine and I like that data is the default role when not defined. I think it is important to hold on to the guarantee that an active overseer will land on an overseer node role.
>>>>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks using the current OVERSEER role. I am not sure that something can be done automatically since they need to now specify new properties at startup. Maybe we need to include loud warnings or support both approaches for a time?
>>>>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail, then it is implied the overseer will go to one of the data nodes. The specific wording in the SIP - "When one or more such nodes are live, Solr guarantees that one of those nodes become the overseer." implies to me that failover could go from overseer1 to overseer2 to overseerN to random node. I feel like we need to have some recording that there were dedicated overseer nodes and stop the cascading failure instead of churning through our data nodes.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of "coordinator" roles from a split query/indexing standpoint. I understand that these are used as examples, but would like stronger language that new roles should also go through their own SIP discussions.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness in two different places now. We have the live nodes and we have the node roles stored in two different places in zookeeper and it feels like this would lead to race conditions or split brain or other hard to diagnose bugs when those two lists don't agree with each other. This also feels like it contradicts the "single source of truth" idea later stated in the proposal. I see Gus's arguments for decoupling these and am not strongly opposed, I just get a lurking feeling about it. Even if we don't do this, I would like this called out explicitly in the alternative approaches section as something that we considered and rejected, with details why.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional call out here that all operations are GET because nodes cannot be changed at runtime.
>>>>>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER preference role?
>>>>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of available roles for a cluster. I _think_ this could be based on the version that the cluster is running? Would be useful to be able to interrogate a cluster in the future... we're seeing OOM issues on queries, can we add some query nodes? When were they introduced? I don't know what path this API should exist at.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP document. Not sure if there's a better path that we could go for.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are string literals and which parts are meant to be substituted by the operator? GET /api/cluster/roles/data would become GET /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1 should be GET /api/cluster/roles/${nodename} dropping the intermediate "nodes"
>>>>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that intermediate "nodes" node.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some permissions? Maybe this requirement is too fundamental to the operation of a cluster and everybody would have to be able to do it.
>>>>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat roles? Implementation detail that the servers will figure out? Or strict guidance where the client needs to check where specific roles are before sending any further communication to the server?
>>>>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that it can't fulfil? An overseer node gets a query or an update. A data node gets a collection creation request. Do they forward it on to an appropriate node, or do they reject it? Should this be configurable? If not, then it seems like lazy or poorly configured clients will defeat this isolation system quite easily.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when roles are added mean? I thought we established that they are not dynamic.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Thanks,
>>>>>>>>> >>>>>>> Mike
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Hi,
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>>>> >>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> We also wish to add first class support for Query nodes that are used to process user queries by forwarding to data nodes, merging/aggregating them and presenting to users. This concept exists as first class citizens in most other search engines. This is a chance for Solr to catch up.
>>>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Regards,
>>>>>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>>>>>> >>>
>>>>>>>>> >>> --
>>>>>>>>> >>> http://www.needhamsoftware.com (work)
>>>>>>>>> >>> http://www.the111shift.com (play)

--
-----------------------------------------------------
Noble Paul
