Re: First class support for node roles

Ishan Chattopadhyaya Wed, 01 Dec 2021 18:06:55 -0800

On Thu, Dec 2, 2021 at 12:54 AM Mike Drob <[email protected]> wrote:

> Noble wrote:
> > We are not modifying the way the "overseer role" works today. We are
> just changing the definition and standardizing the configuration &
> discoverability
> Ishan wrote:
> > As of this SIP, we're not planning to modify the OVERSEER role (which
> currently stands for preferred overseer). We can take a stab at refactoring
> it later.
>
> Grouping these two comments together, since I think they are saying
> the same thing. I think this is part of my confusion. We have an old system
> that doesn't work the way we want the new system to work. There may be
> people already using the old system. What path do we offer for folks using
> the old system to migrate to the new system?
>


The old system supported only the OVERSEER role. Users can continue to use
it (the ADDROLE/REMOVEROLE commands), but those commands are deprecated.
Alternatively, they can start their nodes with the sysprop defined by the
new roles implementation.


> What happens if somebody accidentally tries to use both systems at the
> same time?
>

When a node starts up with -Dsolr.node.roles=overseer,<..> defined, it is
registered as a preferred overseer, exactly as ADDROLE does today. Someone
can use the REMOVEROLE API to remove the overseer role at runtime (not
recommended), but when the node restarts, the sysprop will make it a
preferred overseer again.
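
A minimal sketch of that startup behaviour, for illustration only. It models
the outcome, not Solr's code: the `Node` class, `parse_node_roles`, and
`remove_overseer_role` are hypothetical names, and the sketch assumes the
sysprop value is a plain comma-separated list of role names and that a node
with no roles defined gets the data role.

```python
DEFAULT_ROLES = frozenset({"data"})  # assumption: data is the default role


def parse_node_roles(sysprop_value):
    """Parse a comma-separated solr.node.roles value into a set of role names."""
    if sysprop_value is None:
        return set(DEFAULT_ROLES)
    roles = {part.strip().lower() for part in sysprop_value.split(",") if part.strip()}
    return roles or set(DEFAULT_ROLES)


class Node:
    """Hypothetical stand-in for a Solr node; tracks only the overseer preference."""

    def __init__(self, sysprop_value=None):
        self.sysprop_value = sysprop_value
        self.preferred_overseer = False
        self.start()

    def start(self):
        # On every (re)start the sysprop is applied again, like an ADDROLE call.
        if "overseer" in parse_node_roles(self.sysprop_value):
            self.preferred_overseer = True

    def remove_overseer_role(self):
        # Models a REMOVEROLE call at runtime; it does not touch the sysprop.
        self.preferred_overseer = False


node = Node("overseer,data")
node.remove_overseer_role()     # role removed at runtime
node.start()                    # restart: the sysprop re-registers the role
print(node.preferred_overseer)  # True
```

The point of the sketch is that the sysprop acts as the persistent
declaration, so a runtime REMOVEROLE only lasts until the next restart.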


>
> Ishan wrote:
> > When I wrote "When one or more such nodes [with OVERSEER role] are
> live, Solr guarantees that one of those nodes becomes the overseer.", I
> meant to somewhat capture the current behaviour as the OVERSEER role performs
> today. Do you see any inconsistency with this statement vs. what it does
> today?
>
> This doesn't really address my concern around what happens if all of our
> existing OVERSEER candidates are down.
>

If all preferred overseer nodes are down, some other node becomes the
overseer. This is exactly as the OVERSEER role works today; we aren't
changing that behaviour at all.
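
The fallback described above can be sketched as follows. This models only the
outcome, not the ZooKeeper-based election itself; `pick_overseer` and its
arguments are hypothetical, and the "first live node" tie-break is an
assumption made to keep the example deterministic.

```python
def pick_overseer(live_nodes, preferred_overseers):
    """Return the node that ends up as overseer, or None if no node is live.

    live_nodes: list of live node names, in some election order (assumed).
    preferred_overseers: set of node names carrying the overseer role.
    """
    # A live preferred overseer always wins.
    for node in live_nodes:
        if node in preferred_overseers:
            return node
    # No preferred overseer is live: some other live node takes over.
    return live_nodes[0] if live_nodes else None


preferred = {"overseer1", "overseer2"}
print(pick_overseer(["data1", "overseer2", "data2"], preferred))  # overseer2
print(pick_overseer(["data1", "data2"], preferred))               # data1
```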


> When at least one of them is up, the overseer will go there, and that is
> good and expected. But what happens if all of the overseer eligible nodes
> are down?
>

One of the other nodes will become the overseer, exactly as one would
expect the system to work today.


> Your comment, and the old system, would imply that the overseer election
> goes to some other unrelated, untagged node. I disagree with this
> implementation choice.
>

This choice has already been made, and we're not attempting to change that
behaviour in this SIP. We can discuss an overhaul of the OVERSEER role in a
separate SIP/JIRA/thread.


> This sounds like something role specific to determine, but I would like to
> see us be more strict about it. I don't want cores leaking out of my data
> roles, I don't want query processing to leak out of my "query" nodes or
> whatever. Overseer shouldn't be special in this regard.
>

I think it is very difficult to define such a concept upfront. Different
roles will interpret these aspects differently. For the OVERSEER role, one
might want non-OVERSEER nodes to be able to perform the function too. For a
future QUERY role, one might want data nodes to serve that role as well. For
the DATA role, one might not want any other node to host data. I think it
would be better to use the ref guide documentation for each new or existing
role to specify clearly how the system will behave in such circumstances.


>
> Noble wrote:
> > If we do that how do we know if xyz is a role or a node in the
> following request?
>
> You're absolutely correct, thanks for pointing this out. Let's leave it as
> is.
>
>
>
> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
>
>>
>>
>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote:
>>
>>> Replying to the top post in this thread because there has been a lot of
>>> discussion and I don't want to look like I'm continuing any of those
>>> particular threads.
>>>
>>> I finally had time to sit down and think about this with the attention
>>> it deserves and am generally happy with how the conversation has shaped the
>>> current proposal.
>>>
>>> GOOD: I think using system properties to define node roles is fine and I
>>> like that data is the default role when not defined. I think it is
>>> important to hold on to the guarantee that an active overseer will land on
>>> an overseer node role.
>>> CHANGE REQUEST: I would like to see a migration path for folks using the
>>> current OVERSEER role. I am not sure that something can be done
>>> automatically since they need to now specify new properties at startup.
>>> Maybe we need to include loud warnings or support both approaches for a
>>> time?
>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail,
>>> then it is implied the overseer will go to one of the data nodes. The
>>> specific wording in the SIP - "When one or more such nodes are live, Solr
>>> guarantees that one of those nodes become the overseer." implies to me that
>>> failover could go from overseer1 to overseer2 to overseerN to random node.
>>> I feel like we need to have some recording that there were dedicated
>>> overseer nodes and stop the cascading failure instead of churning through
>>> our data nodes.
>>>
>>> CLARIFICATION: I am slightly confused by the proposed scope of
>>> "coordinator" roles from a split query/indexing standpoint. I understand
>>> that these are used as examples, but would like stronger language that new
>>> roles should also go through their own SIP discussions.
>>>
>>> CLARIFICATION: I do not like that we are storing node liveness in two
>>> different places now. We have the live nodes and we have the node roles
>>> stored in two different places in zookeeper and it feels like this would
>>> lead to race conditions or split brain or other hard to diagnose bugs when
>>> those two lists don't agree with each other. This also feels like it
>>> contradicts the "single source of truth" idea later stated in the proposal.
>>> I see Gus's arguments for decoupling these and am not strongly opposed, I
>>> just get a lurking feeling about it. Even if we don't do this, I would like
>>> this called out explicitly in the alternative approaches section as
>>> something that we considered and rejected, with details why.
>>>
>>> GOOD: The API looks pretty clear. I would like an additional call out
>>> here that all operations are GET because nodes cannot be changed at runtime.
>>> CLARIFICATION: How does this interact with the previous OVERSEER
>>> preference role?
>>> CHANGE REQUEST: An additional API to get the list of available roles for
>>> a cluster. I _think_ this could be based on the version that the cluster is
>>> running? Would be useful to be able to interrogate a cluster in the
>>> future... we're seeing OOM issues on queries, can we add some query nodes?
>>> When were they introduced? I don't know what path this API should exist at.
>>>
>>
>> Added a *GET /api/cluster/roles/supported* API, updated the SIP
>> document. Not sure if there's a better path that we could go for.
>>
>>
>>> CLARIFICATION: Can we list the APIs to clearly show which parts are
>>> string literals and which parts are meant to be substituted by the
>>> operator? *GET /api/cluster/roles/data* would become
>>> *GET /api/cluster/roles/${rolename}* in our SIP/documentation.
>>> CHANGE REQUEST: I think *GET /api/cluster/roles/nodes/node1* should be
>>> *GET /api/cluster/roles/${nodename}*, dropping the intermediate "nodes".
>>> CHANGE REQUEST: The ZK structure also might not need that intermediate
>>> "nodes" node.
>>>
>>> CLARIFICATION: Should listing roles require some permissions? Maybe this
>>> requirement is too fundamental to the operation of a cluster and everybody
>>> would have to be able to do it.
>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat
>>> roles? Implementation detail that the servers will figure out? Or strict
>>> guidance where the client needs to check where specific roles are before
>>> sending any further communication to the server?
>>> CLARIFICATION: What happens when a node gets a request that it can't
>>> fulfil? An overseer node gets a query or an update. A data node gets a
>>> collection creation request. Do they forward it on to an appropriate node,
>>> or do they reject it? Should this be configurable? If not, then it seems
>>> like lazy or poorly configured clients will defeat this isolation system
>>> quite easily.
>>>
>>> GOOD: Testing the API is very important, yes.
>>> CLARIFICATION: What does testing for how nodes behave when roles are
>>> added mean? I thought we established that they are not dynamic.
>>>
>>>
>>> Thanks,
>>> Mike
>>>
>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Here's an SIP for introducing the concept of node roles:
>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>
>>>> We also wish to add first class support for Query nodes that are used
>>>> to process user queries by forwarding to data nodes, merging/aggregating
>>>> them and presenting to users. This concept exists as first class citizens
>>>> in most other search engines. This is a chance for Solr to catch up.
>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>
>>>> Regards,
>>>> Ishan / Noble / Hitesh
>>>>
>>>
