Re: First class support for node roles

Houston Putman Thu, 02 Dec 2021 08:11:15 -0800

>
>> With the Strict/Loose option and sensible defaults, users cannot trip
>> themselves up by default, but the option is there for people to tinker and
>> have an iron grip over their cluster.
>>
>
> +1 to sensible defaults so users don't trip themselves. The option to
> tinker for tighter grip can be tackled later, either on a per role basis or
> as a generic concept later.
>


+1 - Can definitely be added later if we so desire, not needed for this SIP

On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
[email protected]> wrote:

>
>
> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]> wrote:
>
>> I think the key is to let the roles have full control of the
>> implications of having/not having that role. No need for even a
>> strict/loose designation. The question of do you have the role is yes/no
>> with no logic to guess if the role is implied or not. The question of will
>> it come up with the role is "have_explicit ? use_explicit : use_defaults".
>>
>> Once you figure out who has a role (or not) what that means is up to the
>> role code.
>>
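One reading of the explicit-versus-default rule above, as a minimal sketch. The comma-separated role string and the function names are assumptions made for illustration, not the SIP's actual configuration format; the only detail taken from the thread is that "data" is the default role when nothing is defined.

```python
from typing import Optional, Set

# Assumption: "data" is the role a node gets when none is configured,
# as described elsewhere in this thread.
DEFAULT_ROLES = frozenset({"data"})


def resolve_roles(explicit: Optional[str]) -> Set[str]:
    """Return the explicitly configured roles, or the defaults if none were given.

    `explicit` is a hypothetical comma-separated string such as "overseer,query".
    """
    parsed = {r.strip().lower() for r in (explicit or "").split(",") if r.strip()}
    return parsed if parsed else set(DEFAULT_ROLES)


def has_role(roles: Set[str], role: str) -> bool:
    """Plain yes/no membership check: no guessing whether a role is implied."""
    return role.lower() in roles


if __name__ == "__main__":
    print(resolve_roles(None))              # defaults apply
    print(resolve_roles("overseer, query")) # explicit roles win
```

What a node then does with (or without) a role is left to the role's own code; this sketch only answers "who has the role".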
>> Corollary: we don't have to change the way overseer works in this SIP. We
>> can rework it or not as we see fit separately.
>>
>
> +1
>
>
>>
>> Only thing we need to do is find a wording that makes the above clear on
>> first read through the SIP :)
>>
>> -Gus
>>
>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <[email protected]>
>> wrote:
>>
>>>> This doesn't really address my concern around what happens if all of our
>>>> existing OVERSEER candidates are down. When at least one of them is up, the
>>>> overseer will go there, and that is good and expected. But what happens if
>>>> all of the overseer eligible nodes are down? Your comment, and the old
>>>> system, would imply that the overseer election goes to some other
>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>> sounds like something role specific to determine, but I would like to see
>>>> us be more strict about it. I don't want cores leaking out of my data
>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>> whatever. Overseer shouldn't be special in this regard.
>>>>
>>>
>>> I'm very strongly in favor of not letting users design a system in which
>>> the cluster can be "live" without an overseer. I understand that the
>>> overseer can be taxing to the cluster, but honestly what is the point of
>>> having an untaxed cluster that doesn't have an overseer? I can see
>>> arguments for the other roles to be stricter about this, but there are also
>>> a lot of users who wouldn't want those to be strict either (like "query"
>>> nodes).
>>>
>>> Maybe we just put in stronger guarantees that if a non-overseer role
>>> node HAS to be selected to become overseer, it will try to migrate the
>>> overseer job to a node with the overseer role whenever one becomes live.
>>>
>>> So maybe we don't have special rules per role, but instead roles can
>>> either be defined as "Strict" or "Loose" (better names likely exist), and
>>> the roles come with a default (Overseer -> Loose, Data -> Strict, Query ->
>>> Loose, etc.). And it is up to each role to define how to behave when
>>> running in LOOSE mode and a non-role node is used then a role node comes
>>> online (like the overseer example given above).
>>>
>>> With the Strict/Loose option and sensible defaults, users cannot trip
>>> themselves up by default, but the option is there for people to tinker and
>>> have an iron grip over their cluster.
>>>
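The Strict/Loose idea and the "migrate back when a role node becomes live" guarantee could be sketched roughly as follows. Everything here is hypothetical: the names `Mode`, `pick_node` and `should_migrate`, the node-to-roles mapping, and the choice to treat unknown roles as strict are assumptions for illustration only. The per-role defaults are the ones floated above (Overseer -> Loose, Data -> Strict, Query -> Loose).

```python
from enum import Enum
from typing import Mapping, Optional, Set


class Mode(Enum):
    STRICT = "strict"  # only nodes that carry the role may do the work
    LOOSE = "loose"    # fall back to any live node if no role node is live


# Defaults as proposed in the thread; a user could override these.
DEFAULT_MODES = {"overseer": Mode.LOOSE, "data": Mode.STRICT, "query": Mode.LOOSE}


def pick_node(role: str,
              live_nodes: Mapping[str, Set[str]],
              modes: Mapping[str, Mode] = DEFAULT_MODES) -> Optional[str]:
    """Pick a live node for `role`; `live_nodes` maps node name -> its roles."""
    with_role = sorted(n for n, roles in live_nodes.items() if role in roles)
    if with_role:
        return with_role[0]
    # Assumption: a role without a configured mode is treated as strict.
    if modes.get(role, Mode.STRICT) is Mode.LOOSE and live_nodes:
        return sorted(live_nodes)[0]
    return None


def should_migrate(role: str, current: str,
                   live_nodes: Mapping[str, Set[str]]) -> bool:
    """True when the job runs on a non-role node and a role node is now live."""
    if role in live_nodes.get(current, set()):
        return False
    return any(role in roles for roles in live_nodes.values())


if __name__ == "__main__":
    live = {"n1": {"data"}, "n2": {"data"}}
    print(pick_node("overseer", live))             # loose: falls back to a data node
    live["n0"] = {"overseer"}
    print(should_migrate("overseer", "n1", live))  # a role node came online
```

With these defaults a cluster is never left without an overseer while any node is live, and a strict role such as data simply gets no node when none carries the role.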
>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> wrote:
>>>
>>>> Noble wrote:
>>>> > We are not modifying the way the "overseer role" works today. We are
>>>> just changing the definition and standardizing the configuration &
>>>> discoverability
>>>> Ishan wrote:
>>>> > As of this SIP, we're not planning to modify the OVERSEER role (which
>>>> currently stands for preferred overseer). We can take a stab at refactoring
>>>> it later.
>>>>
>>>> Grouping these two comments together, since I think they are saying
>>>> the same thing. I think this is part of my confusion. We have an old system
>>>> that doesn't work the way we want the new system to work. There may be
>>>> people already using the old system. What path do we offer for folks using
>>>> the old system to migrate to the new system? What happens if somebody
>>>> accidentally tries to use both systems at the same time?
>>>>
>>>> Ishan wrote:
>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are
>>>> live, Solr guarantees that one of those nodes becomes the overseer.",
>>>> I meant to somewhat capture the current behaviour as the OVERSEER role
>>>> performs today. Do you see any inconsistency with this statement vs. what
>>>> it does today?
>>>>
>>>> This doesn't really address my concern around what happens if all of
>>>> our existing OVERSEER candidates are down. When at least one of them is up,
>>>> the overseer will go there, and that is good and expected. But what happens
>>>> if all of the overseer eligible nodes are down? Your comment, and the old
>>>> system, would imply that the overseer election goes to some other
>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>> sounds like something role specific to determine, but I would like to see
>>>> us be more strict about it. I don't want cores leaking out of my data
>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>> whatever. Overseer shouldn't be special in this regard.
>>>>
>>>> Noble wrote:
>>>> > If we do that how do we know if xyz is a role or a node in the
>>>> following request?
>>>>
>>>> You're absolutely correct, thanks for pointing this out. Let's leave it
>>>> as is.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote:
>>>>>
>>>>>> Replying to the top post in this thread because there has been a lot
>>>>>> of discussion and I don't want to look like I'm continuing any of those
>>>>>> particular threads.
>>>>>>
>>>>>> I finally had time to sit down and think about this with the
>>>>>> attention it deserves and am generally happy with how the conversation
>>>>>> has shaped the current proposal.
>>>>>>
>>>>>> GOOD: I think using system properties to define node roles is fine
>>>>>> and I like that data is the default role when not defined. I think it is
>>>>>> important to hold on to the guarantee that an active overseer will land
>>>>>> on an overseer node role.
>>>>>> CHANGE REQUEST: I would like to see a migration path for folks using
>>>>>> the current OVERSEER role. I am not sure that something can be done
>>>>>> automatically since they need to now specify new properties at startup.
>>>>>> Maybe we need to include loud warnings or support both approaches for a
>>>>>> time?
>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail,
>>>>>> then it is implied the overseer will go to one of the data nodes. The
>>>>>> specific wording in the SIP - "When one or more such nodes are live, Solr
>>>>>> guarantees that one of those nodes become the overseer." implies to me
>>>>>> that failover could go from overseer1 to overseer2 to overseerN to random
>>>>>> node. I feel like we need to have some recording that there were dedicated
>>>>>> overseer nodes and stop the cascading failure instead of churning through
>>>>>> our data nodes.
>>>>>>
>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of
>>>>>> "coordinator" roles from a split query/indexing standpoint. I understand
>>>>>> that these are used as examples, but would like stronger language that
>>>>>> new roles should also go through their own SIP discussions.
>>>>>>
>>>>>> CLARIFICATION: I do not like that we are storing node liveness in two
>>>>>> different places now. We have the live nodes and we have the node roles
>>>>>> stored in two different places in zookeeper and it feels like this would
>>>>>> lead to race conditions or split brain or other hard to diagnose bugs
>>>>>> when those two lists don't agree with each other. This also feels like it
>>>>>> contradicts the "single source of truth" idea later stated in the
>>>>>> proposal. I see Gus's arguments for decoupling these and am not strongly
>>>>>> opposed, I just get a lurking feeling about it. Even if we don't do this,
>>>>>> I would like this called out explicitly in the alternative approaches
>>>>>> section as something that we considered and rejected, with details why.
>>>>>>
>>>>>> GOOD: The API looks pretty clear. I would like an additional call out
>>>>>> here that all operations are GET because nodes cannot be changed at
>>>>>> runtime.
>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER
>>>>>> preference role?
>>>>>> CHANGE REQUEST: An additional API to get the list of available roles
>>>>>> for a cluster. I _think_ this could be based on the version that the
>>>>>> cluster is running? Would be useful to be able to interrogate a cluster
>>>>>> in the future... we're seeing OOM issues on queries, can we add some query
>>>>>> nodes? When were they introduced? I don't know what path this API should
>>>>>> exist at.
>>>>>>
>>>>>
>>>>> Added a *GET /api/cluster/roles/supported* API, updated the SIP
>>>>> document. Not sure if there's a better path that we could go for.
>>>>>
>>>>>
>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are
>>>>>> string literals and which parts are meant to be substituted by the
>>>>>> operator? *GET /api/cluster/roles/data* would become
>>>>>> *GET /api/cluster/roles/${rolename}* in our SIP/documentation.
>>>>>> CHANGE REQUEST: I think *GET /api/cluster/roles/nodes/node1* should
>>>>>> be *GET /api/cluster/roles/${nodename}* dropping the intermediate
>>>>>> "nodes"
>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>>>> intermediate "nodes" node.
>>>>>>
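To make the literal-versus-substituted distinction concrete, here is a small sketch that builds the three read-only paths mentioned in this thread. The helper names are hypothetical; only the path shapes come from the messages above. It keeps the intermediate "nodes" segment, since elsewhere in the thread it was agreed to leave that as is, and the last lines show why: without the segment, a role name and a node name would produce the same path.

```python
from urllib.parse import quote

# Literal prefix shared by all role endpoints quoted in the thread.
BASE = "/api/cluster/roles"


def supported_roles_path() -> str:
    """GET path listing the roles a cluster supports (all literals)."""
    return f"{BASE}/supported"


def role_path(rolename: str) -> str:
    """GET path for one role: /api/cluster/roles/${rolename}."""
    return f"{BASE}/{quote(rolename, safe='')}"


def node_path(nodename: str) -> str:
    """GET path for one node: /api/cluster/roles/nodes/${nodename}."""
    return f"{BASE}/nodes/{quote(nodename, safe='')}"


if __name__ == "__main__":
    print(role_path("data"))
    print(node_path("node1"))
    # Dropping the "nodes" segment would make these two indistinguishable:
    print(role_path("xyz") == f"{BASE}/xyz")
```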
>>>>>> CLARIFICATION: Should listing roles require some permissions? Maybe
>>>>>> this requirement is too fundamental to the operation of a cluster and
>>>>>> everybody would have to be able to do it.
>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat
>>>>>> roles? Implementation detail that the servers will figure out? Or strict
>>>>>> guidance where the client needs to check where specific roles are before
>>>>>> sending any further communication to the server?
>>>>>> CLARIFICATION: What happens when a node gets a request that it can't
>>>>>> fulfil? An overseer node gets a query or an update. A data node gets a
>>>>>> collection creation request. Do they forward it on to an appropriate
>>>>>> node, or do they reject it? Should this be configurable? If not, then it seems
>>>>>> like lazy or poorly configured clients will defeat this isolation system
>>>>>> quite easily.
>>>>>>
>>>>>> GOOD: Testing the API is very important, yes.
>>>>>> CLARIFICATION: What does testing for how nodes behave when roles are
>>>>>> added mean? I thought we established that they are not dynamic.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Mike
>>>>>>
>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>>
>>>>>>> We also wish to add first class support for Query nodes that are
>>>>>>> used to process user queries by forwarding to data nodes,
>>>>>>> merging/aggregating them and presenting to users. This concept exists as a
>>>>>>> first class citizen in most other search engines. This is a chance for
>>>>>>> Solr to catch up.
>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ishan / Noble / Hitesh
>>>>>>>
>>>>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
