Re: First class support for node roles

Mike Drob Mon, 29 Nov 2021 11:24:03 -0800

Replying to the top post in this thread because there has been a lot of
discussion and I don't want to look like I'm continuing any of those
particular threads.

I finally had time to sit down and think about this with the attention it
deserves and am generally happy with how the conversation has shaped the
current proposal.

GOOD: I think using system properties to define node roles is fine and I
like that data is the default role when not defined. I think it is
important to hold on to the guarantee that an active overseer will land on
an overseer node role.
CHANGE REQUEST: I would like to see a migration path for folks using the
current OVERSEER role. I am not sure that something can be done
automatically since they need to now specify new properties at startup.
Maybe we need to include loud warnings or support both approaches for a
time?
CHANGE REQUEST: I do not like that if all of the overseer nodes fail, then
it is implied the overseer will go to one of the data nodes. The specific
wording in the SIP - "When one or more such nodes are live, Solr guarantees
that one of those nodes become the overseer." implies to me that failover
could go from overseer1 to overseer2 to overseerN to random node. I feel
like we need to have some recording that there were dedicated overseer
nodes and stop the cascading failure instead of churning through our data
nodes.
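
To make the failover concern concrete, here is a minimal sketch of the selection rule being asked for. It is not Solr's election code: the function name, the `allow_data_fallback` flag, and the idea of passing the configured overseer set in as an argument are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Set


def pick_overseer(live_nodes: Iterable[str],
                  overseer_role_nodes: Set[str],
                  allow_data_fallback: bool = False) -> Optional[str]:
    """Hypothetical selection rule, not Solr's actual election logic."""
    live = sorted(live_nodes)
    # Prefer any live node that was started with the overseer role.
    for node in live:
        if node in overseer_role_nodes:
            return node
    # Dedicated overseer nodes were configured but none are live:
    # stop here instead of cascading onto the data nodes.
    if overseer_role_nodes and not allow_data_fallback:
        return None
    # No dedicated overseer nodes were ever configured (or fallback was
    # explicitly allowed), so any live node may take the job.
    return live[0] if live else None
```

With this rule, a cluster that has recorded dedicated overseer nodes ends up with no overseer when they are all down, rather than churning through data nodes.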

CLARIFICATION: I am slightly confused by the proposed scope of
"coordinator" roles from a split query/indexing standpoint. I understand
that these are used as examples, but would like stronger language that new
roles should also go through their own SIP discussions.

CLARIFICATION: I do not like that we are storing node liveness in two
different places now. We have the live nodes and we have the node roles
stored in two different places in ZooKeeper and it feels like this would
lead to race conditions or split brain or other hard to diagnose bugs when
those two lists don't agree with each other. This also feels like it
contradicts the "single source of truth" idea later stated in the proposal.
I see Gus's arguments for decoupling these and am not strongly opposed; I
just get a lurking feeling about it. Even if we don't do this, I would like
this called out explicitly in the alternative approaches section as
something that we considered and rejected, with details why.
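
As an illustration of how the two lists can disagree, the following sketch compares a set of live nodes against a role-to-nodes mapping. The data shapes and the function name are assumptions for the example, not the ZooKeeper layout from the SIP.

```python
from typing import Dict, Iterable, List, Set


def role_liveness_mismatch(live_nodes: Iterable[str],
                           nodes_by_role: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Report nodes that appear in only one of the two (hypothetical) lists."""
    live = set(live_nodes)
    # Union of every node that is registered under at least one role.
    with_roles = set().union(*nodes_by_role.values())
    return {
        # Live according to one list, but no role entry exists.
        "live_without_role": sorted(live - with_roles),
        # Role entry exists, but the node is not in the live list.
        "role_without_live": sorted(with_roles - live),
    }
```

Any non-empty result is a window in which a reader of one list and a reader of the other would reach different conclusions about the cluster.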

GOOD: The API looks pretty clear. I would like an additional call out here
that all operations are GET because nodes cannot be changed at runtime.
CLARIFICATION: How does this interact with the previous OVERSEER preference
role?
CHANGE REQUEST: An additional API to get the list of available roles for a
cluster. I _think_ this could be based on the version that the cluster is
running? Would be useful to be able to interrogate a cluster in the
future... we're seeing OOM issues on queries, can we add some query nodes?
When were they introduced? I don't know what path this API should exist at.
CLARIFICATION: Can we list the APIs to clearly show which parts are string
literals and which parts are meant to be substituted by the operator?
*GET /api/cluster/roles/data* would become
*GET /api/cluster/roles/${rolename}* in our SIP/documentation.
CHANGE REQUEST: I think *GET /api/cluster/roles/nodes/node1* should be *GET
/api/cluster/roles/${nodename}*, dropping the intermediate "nodes".
CHANGE REQUEST: The ZK structure also might not need that intermediate
"nodes" node.

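The literal-versus-substituted distinction for the API paths above can be sketched with Python's standard `string.Template`, which uses the same `${name}` placeholder syntax. The `render` helper is hypothetical; only the paths come from this thread.

```python
from string import Template


def render(path_template: str, **values: str) -> str:
    """Fill ${placeholders}; every other path segment is a string literal."""
    # substitute() raises KeyError if a placeholder has no value supplied.
    return Template(path_template).substitute(**values)


# Documented form, then what an operator would actually send.
role_path = render("/api/cluster/roles/${rolename}", rolename="data")
node_path = render("/api/cluster/roles/${nodename}", nodename="node1")
```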
CLARIFICATION: Should listing roles require some permissions? Maybe this
requirement is too fundamental to the operation of a cluster and everybody
would have to be able to do it.
CLARIFICATION: How do we expect SolrJ (and other clients) to treat roles?
Implementation detail that the servers will figure out? Or strict guidance
where the client needs to check where specific roles are before sending any
further communication to the server?
CLARIFICATION: What happens when a node gets a request that it can't
fulfil? An overseer node gets a query or an update. A data node gets a
collection creation request. Do they forward it on to an appropriate node,
or do they reject it? Should this be configurable? If not, then it seems
like lazy or poorly configured clients will defeat this isolation system
quite easily.
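
The forward-or-reject question can be written down as a small decision function. This is a sketch under stated assumptions: the request kinds, the mapping from request kind to required role, and the `MismatchPolicy` setting are all hypothetical and are not taken from the SIP.

```python
from enum import Enum
from typing import Dict, Optional, Set, Tuple


class MismatchPolicy(Enum):
    FORWARD = "forward"
    REJECT = "reject"


# Assumed mapping from request kind to the role able to serve it.
REQUIRED_ROLE = {
    "query": "data",
    "update": "data",
    "create_collection": "overseer",
}


def route(kind: str,
          my_roles: Set[str],
          live_by_role: Dict[str, Set[str]],
          policy: MismatchPolicy) -> Tuple[str, Optional[str]]:
    """Decide what a node does with a request: serve, forward, or reject."""
    needed = REQUIRED_ROLE[kind]
    if needed in my_roles:
        return ("serve", None)
    # This node lacks the role; look for a live node that has it.
    candidates = sorted(live_by_role.get(needed, set()))
    if policy is MismatchPolicy.FORWARD and candidates:
        return ("forward", candidates[0])
    return ("reject", None)
```

Under `FORWARD`, a misconfigured client still gets an answer but keeps loading the wrong node with proxy work; under `REJECT`, the isolation holds and the client has to be fixed.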

GOOD: Testing the API is very important, yes.
CLARIFICATION: What does testing for how nodes behave when roles are added
mean? I thought we established that they are not dynamic.

Thanks,
Mike

On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
[email protected]> wrote:

> Hi,
>
> Here's an SIP for introducing the concept of node roles:
> https://issues.apache.org/jira/browse/SOLR-15694
> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>
> We also wish to add first class support for Query nodes that are used to
> process user queries by forwarding to data nodes, merging/aggregating them
> and presenting to users. This concept exists as first class citizens in
> most other search engines. This is a chance for Solr to catch up.
> https://issues.apache.org/jira/browse/SOLR-15715
>
> Regards,
> Ishan / Noble / Hitesh
>
