>> With the Strict/Loose option and sensible defaults, users cannot trip
>> themselves up by default, but the option is there for people to tinker and
>> have an iron grip over their cluster.
>
> +1 to sensible defaults so users don't trip themselves. The option to
> tinker for tighter grip can be tackled later, either on a per role basis or
> as a generic concept later.

+1 - Can definitely be added later if we so desire, not needed for this SIP

On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <[email protected]> wrote:

> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]> wrote:
>
>> I think the key is to let the roles have full control of the
>> implications of having/not having that role. No need for even a
>> strict/loose designation. The question of do you have the role is yes/no
>> with no logic to guess if the role is implied or not, The question of will
>> it come up with the role is "have_explicit ? use_defaults : use_defaults.
>>
>> Once you figure out who has a role (or not) what that means is up to the
>> role code.
>>
>> Corollary: we don't have to change the way overseer works in this SIP. We
>> can rework it or not as we see fit separately.
>
> +1
>
>> Only thing we need to do is find a wording that makes the above clear on
>> first read through the SIP :)
>>
>> -Gus
>>
>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <[email protected]> wrote:
>>
>>>> This doesn't really address my concern around what happens if all of our
>>>> existing OVERSEER candidates are down. When at least one of them is up, the
>>>> overseer will go there, and that is good and expected. But what happens if
>>>> all of the overseer eligible nodes are down. Your comment, and the old
>>>> system, would imply that the overseer election goes to some other
>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>> sounds like something role specific to determine, but I would like to see
>>>> us be more strict about it. I don't want cores leaking out of my data
>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>> whatever. Overseer shouldn't be special in this regard.
>>>
>>> I'm very strongly in favor of not letting users design a system in which
>>> the cluster can be "live" without an overseer.
>>> I understand that the
>>> overseer can be taxing to the cluster, but honestly what is the point of
>>> having an untaxed cluster that doesn't have an overseer? I can see
>>> arguments for the other roles to be stricter about this, but there are also
>>> a lot of users who wouldn't want those to be strict either (like "query"
>>> nodes).
>>>
>>> Maybe we just put in stronger guarantees that if a non-overseer role
>>> node HAS to be selected to become overseer, it will try to migrate the
>>> overseer job to a node with the overseer role whenever one becomes live.
>>>
>>> So maybe we don't have special rules per role, but instead roles can
>>> either be defined as "Strict" or "Loose" (better names likely exist), and
>>> the roles come with a default (Overseer -> Loose, Data -> Strict, Query ->
>>> Loose, etc.). And it is up to each role to define how to behave when
>>> running in LOOSE mode and a non-role node is used then a role node comes
>>> online (like the overseer example given above).
>>>
>>> With the Strict/Loose option and sensible defaults, users cannot trip
>>> themselves up by default, but the option is there for people to tinker and
>>> have an iron grip over their cluster.
>>>
>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> wrote:
>>>
>>>> Noble wrote:
>>>> > We are not modifying the way the "overseer role" works today. We are
>>>> > just changing the definition and standardizing the configuration &
>>>> > discoverability
>>>> Ishan wrote:
>>>> > As of this SIP, we're not planning to modify the OVERSEER role (which
>>>> > currently stands for preferred overseer). We can take a stab at
>>>> > refactoring it later.
>>>>
>>>> Grouping these two comments together, since I think they are saying
>>>> the same thing. I think this is part of my confusion. We have an old system
>>>> that doesn't work the way we want the new system to work. There may be
>>>> people already using the old system.
>>>> What path do we offer for folks using
>>>> the old system to migrate to the new system? What happens if somebody
>>>> accidentally tries to use both systems at the same time?
>>>>
>>>> Ishan wrote:
>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are
>>>> > live, Solr guarantees that one of those nodes becomes the overseer.",
>>>> > I meant to somewhat capture the current behaviour as the OVERSEER role
>>>> > performs today. Do you see any inconsistency with this statement vs.
>>>> > what it does today?
>>>>
>>>> This doesn't really address my concern around what happens if all of
>>>> our existing OVERSEER candidates are down. When at least one of them is up,
>>>> the overseer will go there, and that is good and expected. But what happens
>>>> if all of the overseer eligible nodes are down. Your comment, and the old
>>>> system, would imply that the overseer election goes to some other
>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>> sounds like something role specific to determine, but I would like to see
>>>> us be more strict about it. I don't want cores leaking out of my data
>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>> whatever. Overseer shouldn't be special in this regard.
>>>>
>>>> Noble wrote:
>>>> > If we do that how do we know if xyz is a role or a node in the
>>>> > following request?
>>>>
>>>> You're absolutely correct, thanks for pointing this out. Let's leave it
>>>> as is.
>>>>
>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>
>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote:
>>>>>
>>>>>> Replying to the top post in this thread because there has been a lot
>>>>>> of discussion and I don't want to look like I'm continuing any of those
>>>>>> particular threads.
>>>>>>
>>>>>> I finally had time to sit down and think about this with the
>>>>>> attention it deserves and am generally happy with how the conversation
>>>>>> has shaped the current proposal.
>>>>>>
>>>>>> GOOD: I think using system properties to define node roles is fine
>>>>>> and I like that data is the default role when not defined. I think it is
>>>>>> important to hold on to the guarantee that an active overseer will land
>>>>>> on an overseer node role.
>>>>>> CHANGE REQUEST: I would like to see a migration path for folks using
>>>>>> the current OVERSEER role. I am not sure that something can be done
>>>>>> automatically since they need to now specify new properties at startup.
>>>>>> Maybe we need to include loud warnings or support both approaches for a
>>>>>> time?
>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail,
>>>>>> then it is implied the overseer will go to one of the data nodes. The
>>>>>> specific wording in the SIP - "When one or more such nodes are live, Solr
>>>>>> guarantees that one of those nodes become the overseer." implies to me
>>>>>> that failover could go from overseer1 to overseer2 to overseerN to random
>>>>>> node. I feel like we need to have some recording that there were dedicated
>>>>>> overseer nodes and stop the cascading failure instead of churning through
>>>>>> our data nodes.
>>>>>>
>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of
>>>>>> "coordinator" roles from a split query/indexing standpoint. I understand
>>>>>> that these are used as examples, but would like stronger language that
>>>>>> new roles should also go through their own SIP discussions.
>>>>>>
>>>>>> CLARIFICATION: I do not like that we are storing node liveness in two
>>>>>> different places now.
>>>>>> We have the live nodes and we have the node roles
>>>>>> stored in two different places in zookeeper and it feels like this would
>>>>>> lead to race conditions or split brain or other hard to diagnose bugs
>>>>>> when those two lists don't agree with each other. This also feels like it
>>>>>> contradicts the "single source of truth" idea later stated in the
>>>>>> proposal. I see Gus's arguments for decoupling these and am not strongly
>>>>>> opposed, I just get a lurking feeling about it. Even if we don't do this,
>>>>>> I would like this called out explicitly in the alternative approaches
>>>>>> section as something that we considered and rejected, with details why,
>>>>>>
>>>>>> GOOD: The API looks pretty clear. I would like an additional call out
>>>>>> here that all operations are GET because nodes cannot be changed at
>>>>>> runtime.
>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER
>>>>>> preference role?
>>>>>> CHANGE REQUEST: An additional API to get the list of available roles
>>>>>> for a cluster. I _think_ this could be based on the version that the
>>>>>> cluster is running? Would be useful to be able to interrogate a cluster
>>>>>> in the future... we're seeing OOM issues on queries, can we add some query
>>>>>> nodes? When were they introduced? I don't know what path this API should
>>>>>> exist at.
>>>>>
>>>>> Added a *GET /api/cluster/roles/supported* API, updated the SIP
>>>>> document. Not sure if there's a better path that we could go for.
>>>>>
>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are
>>>>>> string literals and which parts are meant to be substituted by the
>>>>>> operator? *GET /api/cluster/roles/data* would become
>>>>>> *GET /api/cluster/roles/${rolename}* in our SIP/documentation.
>>>>>> CHANGE REQUEST: I think *GET /api/cluster/roles/nodes/node1* should
>>>>>> be *GET /api/cluster/roles/${nodename}* dropping the intermediate
>>>>>> "nodes"
>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>>>> intermediate "nodes" node.
>>>>>>
>>>>>> CLARIFICATION: Should listing roles require some permissions? Maybe
>>>>>> this requirement is too fundamental to the operation of a cluster and
>>>>>> everybody would have to be able to do it.
>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat
>>>>>> roles? Implementation detail that the servers will figure out? Or strict
>>>>>> guidance where the client needs to check where specific roles are before
>>>>>> sending any further communication to the server?
>>>>>> CLARIFICATION: What happens when a node gets a request that it can't
>>>>>> fulfil? An overseer node gets a query or an update. A data node gets a
>>>>>> collection creation request. Do they forward it on to an appropriate
>>>>>> node, or do they reject it? Should this be configurable? If not, then it
>>>>>> seems like lazy or poorly configured clients will defeat this isolation
>>>>>> system quite easily.
>>>>>>
>>>>>> GOOD: Testing the API is very important, yes.
>>>>>> CLARIFICATION: What does testing for how nodes behave when roles are
>>>>>> added mean? I thought we established that they are not dynamic.
>>>>>>
>>>>>> Thanks,
>>>>>> Mike
>>>>>>
>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>>
>>>>>>> We also wish to add first class support for Query nodes that are
>>>>>>> used to process user queries by forwarding to data nodes,
>>>>>>> merging/aggregating them and presenting to users.
>>>>>>> This concept exists as
>>>>>>> first class citizens in most other search engines. This is a chance for
>>>>>>> Solr to catch up.
>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ishan / Noble / Hitesh
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
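
For readers following the Strict/Loose idea from Houston's message, here is a minimal sketch of how it might play out for node selection. It is not Solr code: the names `Mode`, `DEFAULT_MODES`, `pick_node` and `should_migrate` are hypothetical, the defaults are only the ones floated in the thread (Overseer -> Loose, Data -> Strict, Query -> Loose), and choosing the alphabetically first node stands in for whatever election Solr actually runs.

```python
from enum import Enum


class Mode(Enum):
    STRICT = "strict"  # the job may only run on nodes that carry the role
    LOOSE = "loose"    # the job may fall back to any live node


# Defaults as floated in the thread; purely illustrative.
DEFAULT_MODES = {"overseer": Mode.LOOSE, "data": Mode.STRICT, "query": Mode.LOOSE}


def pick_node(role, live_nodes, roles_by_node, modes=DEFAULT_MODES):
    """Return the node that should take on `role`, or None if none is eligible."""
    tagged = sorted(n for n in live_nodes if role in roles_by_node.get(n, set()))
    if tagged:
        return tagged[0]  # a live node with the role always wins
    if modes[role] is Mode.LOOSE and live_nodes:
        return sorted(live_nodes)[0]  # loose: fall back to an untagged node
    return None  # strict: leave the job unassigned


def should_migrate(role, current, live_nodes, roles_by_node):
    """True when the job sits on a non-role node and a role node is now live."""
    if current is not None and role in roles_by_node.get(current, set()):
        return False
    return any(role in roles_by_node.get(n, set()) for n in live_nodes)
```

With the overseer left in LOOSE mode the cluster always ends up with an overseer while any node is live, and `should_migrate` captures the "move it back when a role node comes online" guarantee; switching a role to STRICT gives the "no leaking" behaviour Mike asks for, at the cost of the job being unassigned when no tagged node is up.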

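
Noble's objection to dropping the intermediate "nodes" path segment can be illustrated with a small path classifier. This is only a sketch under assumptions: the three path shapes are the ones quoted in the thread, but the function `classify`, its return labels, and the reading that `/api/cluster/roles/${rolename}` lists the nodes holding that role are hypothetical and not taken from the SIP.

```python
PREFIX = "/api/cluster/roles/"


def classify(path):
    """Map a GET path to a (kind, argument) pair; labels are illustrative only."""
    if not path.startswith(PREFIX):
        raise ValueError(f"not a roles path: {path}")
    parts = path[len(PREFIX):].strip("/").split("/")
    if parts == ["supported"]:
        return ("supported-roles", None)
    if len(parts) == 2 and parts[0] == "nodes" and parts[1]:
        return ("roles-of-node", parts[1])
    if len(parts) == 1 and parts[0] and parts[0] != "nodes":
        # A single segment can only be read as a role name.
        return ("nodes-with-role", parts[0])
    raise ValueError(f"unrecognised roles path: {path}")
```

Here `classify("/api/cluster/roles/nodes/node1")` is read as a node lookup, while `classify("/api/cluster/roles/node1")` can only be read as a lookup for a role called "node1", which is the ambiguity that led Mike to withdraw the request. A side effect of this layout is that "supported" and "nodes" become reserved words that cannot be used as role names.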