I think the key is to let the roles have full control of the implications of having/not having that role. No need for even a strict/loose designation. The question of "do you have the role" is yes/no, with no logic to guess whether the role is implied or not. The question of "will it come up with the role" is "have_explicit ? use_explicit : use_defaults".
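To make the two questions above concrete, here is a minimal sketch. The function names and the comma-separated role string are hypothetical and not part of the SIP; the only thing taken from the thread is that "data" is the default role when nothing is defined.

```python
from typing import FrozenSet, Optional

# Assumption for illustration: "data" is the only default role (per the thread,
# data is the default when no roles are defined).
DEFAULT_ROLES: FrozenSet[str] = frozenset({"data"})


def resolve_roles(explicit: Optional[str]) -> FrozenSet[str]:
    """Startup rule: have_explicit ? use_explicit : use_defaults.

    `explicit` is a hypothetical comma-separated role list, or None if unset.
    """
    if explicit is None:
        return DEFAULT_ROLES
    parsed = frozenset(r.strip().lower() for r in explicit.split(",") if r.strip())
    # An empty or whitespace-only list counts as "nothing explicit".
    return parsed if parsed else DEFAULT_ROLES


def has_role(node_roles: FrozenSet[str], role: str) -> bool:
    """Plain yes/no membership: no implied roles, no strict/loose guessing."""
    return role.strip().lower() in node_roles


if __name__ == "__main__":
    print(has_role(resolve_roles(None), "data"))        # True
    print(has_role(resolve_roles("overseer"), "data"))  # False
```

What a role then does with a "yes" or a "no" is deliberately left out of this sketch, since that would be up to each role's own code.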
Once you figure out who has a role (or not) what that means is up to the role code. Corollary: we don't have to change the way overseer works in this SIP. We can rework it or not as we see fit separately. Only thing we need to do is find a wording that makes the above clear on first read through the SIP :)

-Gus

On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <[email protected]> wrote:

>> This doesn't really address my concern around what happens if all of our
>> existing OVERSEER candidates are down. When at least one of them is up, the
>> overseer will go there, and that is good and expected. But what happens if
>> all of the overseer eligible nodes are down. Your comment, and the old
>> system, would imply that the overseer election goes to some other
>> unrelated, untagged node. I disagree with this implementation choice. This
>> sounds like something role specific to determine, but I would like to see
>> us be more strict about it. I don't want cores leaking out of my data
>> roles, I don't want query processing to leak out of my "query" nodes or
>> whatever. Overseer shouldn't be special in this regard.
>
> I'm very strongly in favor of not letting users design a system in which
> the cluster can be "live" without an overseer. I understand that the
> overseer can be taxing to the cluster, but honestly what is the point of
> having an untaxed cluster that doesn't have an overseer? I can see
> arguments for the other roles to be stricter about this, but there are also
> a lot of users who wouldn't want those to be strict either (like "query"
> nodes).
>
> Maybe we just put in stronger guarantees that if a non-overseer role node
> HAS to be selected to become overseer, it will try to migrate the overseer
> job to a node with the overseer role whenever one becomes live.
>
> So maybe we don't have special rules per role, but instead roles can
> either be defined as "Strict" or "Loose" (better names likely exist), and
> the roles come with a default (Overseer -> Loose, Data -> Strict, Query ->
> Loose, etc.). And it is up to each role to define how to behave when
> running in LOOSE mode and a non-role node is used then a role node comes
> online (like the overseer example given above).
>
> With the Strict/Loose option and sensible defaults, users cannot trip
> themselves up by default, but the option is there for people to tinker and
> have an iron grip over their cluster.
>
> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> wrote:
>
>> Noble wrote:
>> > We are not modifying the way the "overseer role" works today. We are
>> just changing the definition and standardizing the configuration &
>> discoverability
>>
>> Ishan wrote:
>> > As of this SIP, we're not planning to modify the OVERSEER role (which
>> currently stands for preferred overseer). We can take a stab at refactoring
>> it later.
>>
>> Grouping these two comments together, since I think they are saying
>> the same thing. I think this is part of my confusion. We have an old system
>> that doesn't work the way we want the new system to work. There may be
>> people already using the old system. What path do we offer for folks using
>> the old system to migrate to the new system? What happens if somebody
>> accidentally tries to use both systems at the same time?
>>
>> Ishan wrote:
>> > When I wrote "When one or more such nodes [with OVERSEER role] are
>> live, Solr guarantees that one of those nodes becomes the overseer.", I
>> meant to somewhat capture the current behaviour as the OVERSEER role performs
>> today. Do you see any inconsistency with this statement vs. what it does
>> today?
>>
>> This doesn't really address my concern around what happens if all of our
>> existing OVERSEER candidates are down.
>> When at least one of them is up, the
>> overseer will go there, and that is good and expected. But what happens if
>> all of the overseer eligible nodes are down. Your comment, and the old
>> system, would imply that the overseer election goes to some other
>> unrelated, untagged node. I disagree with this implementation choice. This
>> sounds like something role specific to determine, but I would like to see
>> us be more strict about it. I don't want cores leaking out of my data
>> roles, I don't want query processing to leak out of my "query" nodes or
>> whatever. Overseer shouldn't be special in this regard.
>>
>> Noble wrote:
>> > If we do that how do we know if xyz is a role or a node in the
>> following request?
>>
>> You're absolutely correct, thanks for pointing this out. Let's leave it
>> as is.
>>
>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <[email protected]> wrote:
>>
>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote:
>>>
>>>> Replying to the top post in this thread because there has been a lot of
>>>> discussion and I don't want to look like I'm continuing any of those
>>>> particular threads.
>>>>
>>>> I finally had time to sit down and think about this with the attention
>>>> it deserves and am generally happy with how the conversation has shaped the
>>>> current proposal.
>>>>
>>>> GOOD: I think using system properties to define node roles is fine and
>>>> I like that data is the default role when not defined. I think it is
>>>> important to hold on to the guarantee that an active overseer will land on
>>>> an overseer node role.
>>>> CHANGE REQUEST: I would like to see a migration path for folks using
>>>> the current OVERSEER role. I am not sure that something can be done
>>>> automatically since they need to now specify new properties at startup.
>>>> Maybe we need to include loud warnings or support both approaches for a
>>>> time?
>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail,
>>>> then it is implied the overseer will go to one of the data nodes. The
>>>> specific wording in the SIP - "When one or more such nodes are live, Solr
>>>> guarantees that one of those nodes become the overseer." implies to me that
>>>> failover could go from overseer1 to overseer2 to overseerN to random node.
>>>> I feel like we need to have some recording that there were dedicated
>>>> overseer nodes and stop the cascading failure instead of churning through
>>>> our data nodes.
>>>>
>>>> CLARIFICATION: I am slightly confused by the proposed scope of
>>>> "coordinator" roles from a split query/indexing standpoint. I understand
>>>> that these are used as examples, but would like stronger language that new
>>>> roles should also go through their own SIP discussions.
>>>>
>>>> CLARIFICATION: I do not like that we are storing node liveness in two
>>>> different places now. We have the live nodes and we have the node roles
>>>> stored in two different places in zookeeper and it feels like this would
>>>> lead to race conditions or split brain or other hard to diagnose bugs when
>>>> those two lists don't agree with each other. This also feels like it
>>>> contradicts the "single source of truth" idea later stated in the proposal.
>>>> I see Gus's arguments for decoupling these and am not strongly opposed, I
>>>> just get a lurking feeling about it. Even if we don't do this, I would like
>>>> this called out explicitly in the alternative approaches section as
>>>> something that we considered and rejected, with details why,
>>>>
>>>> GOOD: The API looks pretty clear. I would like an additional call out
>>>> here that all operations are GET because nodes cannot be changed at
>>>> runtime.
>>>> CLARIFICATION: How does this interact with the previous OVERSEER
>>>> preference role?
>>>> CHANGE REQUEST: An additional API to get the list of available roles
>>>> for a cluster.
>>>> I _think_ this could be based on the version that the
>>>> cluster is running? Would be useful to be able to interrogate a cluster in
>>>> the future... we're seeing OOM issues on queries, can we add some query
>>>> nodes? When were they introduced? I don't know what path this API should
>>>> exist at.
>>>
>>> Added a *GET /api/cluster/roles/supported* API, updated the SIP
>>> document. Not sure if there's a better path that we could go for.
>>>
>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are
>>>> string literals and which parts are meant to be substituted by the
>>>> operator? *GET /api/cluster/roles/data* would become
>>>> *GET /api/cluster/roles/${rolename}* in our SIP/documentation.
>>>> CHANGE REQUEST: I think *GET /api/cluster/roles/nodes/node1* should be
>>>> *GET /api/cluster/roles/${nodename}* dropping the intermediate "nodes"
>>>> CHANGE REQUEST: The ZK structure also might not need that intermediate
>>>> "nodes" node.
>>>>
>>>> CLARIFICATION: Should listing roles require some permissions? Maybe
>>>> this requirement is too fundamental to the operation of a cluster and
>>>> everybody would have to be able to do it.
>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat
>>>> roles? Implementation detail that the servers will figure out? Or strict
>>>> guidance where the client needs to check where specific roles are before
>>>> sending any further communication to the server?
>>>> CLARIFICATION: What happens when a node gets a request that it can't
>>>> fulfil? An overseer node gets a query or an update. A data node gets a
>>>> collection creation request. Do they forward it on to an appropriate node,
>>>> or do they reject it? Should this be configurable? If not, then it seems
>>>> like lazy or poorly configured clients will defeat this isolation system
>>>> quite easily.
>>>>
>>>> GOOD: Testing the API is very important, yes.
>>>> CLARIFICATION: What does testing for how nodes behave when roles are
>>>> added mean? I thought we established that they are not dynamic.
>>>>
>>>> Thanks,
>>>> Mike
>>>>
>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Here's an SIP for introducing the concept of node roles:
>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>
>>>>> We also wish to add first class support for Query nodes that are used
>>>>> to process user queries by forwarding to data nodes, merging/aggregating
>>>>> them and presenting to users. This concept exists as first class citizens
>>>>> in most other search engines. This is a chance for Solr to catch up.
>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>
>>>>> Regards,
>>>>> Ishan / Noble / Hitesh
>>>>>
>>>>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
