Re: First class support for node roles

Ilan Ginzburg Wed, 27 Oct 2021 09:17:55 -0700

In other words, roles are all "positive", but their consequences are only
negative (rejecting when the matching positive role is not present).


We can also consider no role defined = all roles allowed. Will make things
simpler.
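
A minimal sketch of that reading, in case it helps make the use case concrete. All names here (NodeRolesSketch, allows, requireRole, candidates) are hypothetical and are not taken from Solr or from SIP-15. It assumes roles are positive only, that an empty role set means "all roles allowed", and that callers such as a placement plugin filter candidates up front while the node itself simply rejects:

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names exist in Solr or in SIP-15.
final class NodeRolesSketch {

    // Illustrative role names borrowed from the thread.
    enum Role { DATA, QUERY, OVERSEER }

    // Positive-only check. An empty set means "no role defined", which this
    // sketch treats as "all roles allowed" (an assumption, per the suggestion above).
    static boolean allows(Set<Role> declared, Role role) {
        return declared.isEmpty() || declared.contains(role);
    }

    // Node-side enforcement: the only consequence of roles is a rejection.
    static void requireRole(String nodeName, Set<Role> declared, Role role) {
        if (!allows(declared, role)) {
            throw new IllegalStateException(
                "Node " + nodeName + " does not have role " + role);
        }
    }

    // Caller-side filtering, e.g. what a replica placement plugin would do
    // before choosing a node for a new replica.
    static List<String> candidates(Map<String, Set<Role>> rolesByNode, Role role) {
        return rolesByNode.entrySet().stream()
            .filter(e -> allows(e.getValue(), role))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<Role>> cluster = new LinkedHashMap<>();
        cluster.put("node1", EnumSet.of(Role.DATA));
        cluster.put("node2", EnumSet.of(Role.QUERY, Role.OVERSEER));
        cluster.put("node3", EnumSet.noneOf(Role.class)); // no role defined

        // Nodes eligible to host a new replica.
        System.out.println(candidates(cluster, Role.DATA));

        // node2 would refuse to create a core; node3 would accept.
        requireRole("node3", cluster.get("node3"), Role.DATA);
    }
}
```

The same candidates() filter would apply to the Overseer election with Role.OVERSEER, so the election code never adds a non-qualifying node in the first place.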

On Wed, Oct 27, 2021 at 6:14 PM Ilan Ginzburg <[email protected]> wrote:

> How do we expect the roles to be used?
> One way I see is a node refusing to do anything related to a role it
> doesn't have.
> For example if a node does not have role "data", any attempt to create a
> core on it would fail.
> A node not having the role "query", will refuse to have anything to do
> with handling a query etc.
> Then it would be up to other code to make sure only the appropriate nodes
> are requested to do any type of action.
> So for example any replica placement code plugin would have to restrict
> the set of candidate nodes for a new replica placement to those having
> "data". Otherwise the call would fail, and there should be nothing the
> replica placement code can do about it.
>
> Similarly, the "overseer" role would limit the nodes that participate in
> the Overseer election. The Overseer election code would have to remove (or
> not add) all non qualifying nodes from the election, and we should expect a
> node without role "overseer" to refuse to start the Overseer machinery if
> asked to...
>
> Trying to make the use case clear regarding how roles are used.
> Ilan
>
> On Wed, Oct 27, 2021 at 5:47 PM Gus Heck <[email protected]> wrote:
>
>>
>>
>> On Wed, Oct 27, 2021 at 9:55 AM Ishan Chattopadhyaya <
>> [email protected]> wrote:
>>
>>> Hi Gus,
>>>
>>> > I think that we should expand/edit your list of roles to be
>>>
>>> The list can be expanded as and when more isolation and features are
>>> needed. I only listed those roles that we already have functionality for
>>> or that are under development.
>>>
>>
>> Well all of those roles (except zookeeper) are things nodes do today. As
>> it stands they are all doing all of them. What we add support for as we
>> move forward is starting without a role, and add the zookeeper role when
>> that feature is ready.
>>
>>
>>> > I would like to recommend that the roles be all positive ("Can do
>>> this") and nodes with no role at all are ineligible for all activities.
>>>
>>> It comes down to the defaults and backcompat. If we want all Solr nodes
>>> to be able to host data replicas by default (without user explicitly
>>> specifying role=data), then we need a way to unset this role. The most
>>> reasonable way sounded like a "!data". We can do away with !data if we
>>> mandate each and every data node have the role "data" explicitly defined
>>> for it, which breaks backcompat and also is cumbersome to use for those who
>>> don't want to use these special roles.
>>>
>>>
>> Not sure I understand, which of the roles I mentioned (other than
>> zookeeper, which I expect is intended as different from our current
>> embedded zk) is NOT currently supported by a single cloud node brought up
>> as shown in our tutorials/docs? I'm certainly not proposing that the
>> default change to nothing. The default is all roles, unless you specify
>> roles at startup.
>>
>>
>>> > I also suggest that these roles each have a node in zookeeper listing
>>> the current member nodes (as child nodes) so that code that wants to find a
>>> node with an appropriate role does not need to scan the list of all nodes
>>> parsing something to discover which nodes apply and also does not have to
>>> parse json to do it.
>>>
>>> /roles.json exists today, it has role as key and list of nodes as value.
>>> In the next major version, we can change the format of that file and use
>>> key as node, value as list of roles. Or, maybe we can go for adding the
>>> roles to the data for each item in the list of live_nodes.
>>>
>>>
>> I'm not finding anything in our documentation about roles.json so I think
>> it's an internal implementation detail, which reduces back compat concerns.
>> ADDROLE/REMOVEROLE don't accept json or anything like that and could be
>> made to work with zk nodes too.
>>
>> The fact that some precursor work was done without a SIP (or before SIPs
>> existed) should not hamstring our design once a SIP that clearly covers the
>> same topic is under consideration. By their nature SIPs are non-trivial
>> and often will include compatibility breaks. Good news is I don't think I
>> see one here, just a code change to transition to a different zk backend. I
>> think that it's probably a mistake to consider our zookeeper data a public
>> API and we should be moving away from that or at the very least segregating
>> clearly what in zk is long term reliable. Ideally our v1/v2 APIs should be
>> the public api through which information about the cluster is obtained.
>> Programming directly against zk is kind of like a custom build of solr.
>> Sometimes useful and appropriate, but maintenance is your concern. For code
>> plugging into solr, it should in theory be against an internal information
>> java api, and zookeeper should not be touched directly. (I know this is not
>> in a good state or at least wasn't last time I looked closely, but it
>> should be where we are heading).
>>
>>> > any code seeking to transition a node
>>>
>>> We considered this situation and realized that it is very risky to have
>>> nodes change roles while they are up and running. Better to assign fixed
>>> roles upon startup.
>>>
>>
>> I agree that concurrency is hard. I definitely think startup time
>> assignments should be involved here. I'm not thinking that every transition
>> must be supported. As a starting point it would be fine if none were.
>> Having something suddenly become zookeeper is probably tricky to support
>> (see discussion in that thread regarding nodes not actually participating
>> until they have a partner to join with them to avoid even numbered
>> clusters), but I think the design should not preclude the possibility of
>> nodes becoming eligible for some roles or withdrawing from some roles, and
>> treatment of roles should be consistent. In some cases someone may decide
>> it's worth the work of handling the concurrency concerns, best if they
>> don't have to break back compat or hack their code around the assumption it
>> wouldn't happen to do it.
>>
>> Taking the zookeeper case as an example, it very much might be desirable
>> to have the possibility to heal the zk cluster by promoting another node
>> (configured as eligible for zk) to active zk duty if one of the current zk
>> nodes has been down long enough (say on prem hardware, motherboard pops a
>> capacitor, server gone for a week while new hardware is purchased, built
>> and configured). Especially if the down node didn't hold data or other
>> nodes had sufficient replicas and the cluster is still answering queries
>> just fine.
>>
>>
>>>
>>> > I know of a case that would benefit from having separate Query/Update
>>> nodes that handle a heavy analysis process which would be deployed to a
>>> number of CPU heavy boxes (which might add more in prep for bulk indexing,
>>> and remove them when bulk was done), data could then be hosted on cheaper
>>> nodes....
>>>
>>> This is the main motivation behind this work. SOLR-15715 needs this, and
>>> hence it would be good to get this in as soon as possible.
>>>
>>
>> I think we can incrementally work towards configurability for all of
>> these roles. The current default state is that a node has all roles and the
>> incremental progress is to enable removing a role from a node. This I think
>> is why it might be good to:
>>
>> A) Determine the set of roles our current solr nodes are performing (that
>> might be removed in some scenario) and document this by assigning these
>> roles as defaults when this SIP goes live.
>> B) Figure out what the process of adding something entirely new that we
>> haven't yet thought of with its own role would look like.
>>
>> I think it would be great if we not only satisfied the current need but
>> determined how we expect this to change over time.
>>
>>
>>> Regards,
>>> Ishan
>>>
>>> On Wed, Oct 27, 2021 at 6:32 PM Gus Heck <[email protected]> wrote:
>>>
>>>> The SIP looks like a good start, and I was already thinking of
>>>> something very similar to this as a follow on to my attempts to split the
>>>> uber filter (SolrDispatchFilter) into servlets such that roles determine
>>>> what servlets are deployed, but I would like to recommend that the roles be
>>>> all positive ("Can do this") and nodes with no role at all are ineligible
>>>> for all activities. (just like standard role permissioning systems). This
>>>> will make it much more familiar and easy to think about. Therefore there
>>>> would be no need for a role such as !data which I presume was meant to mean
>>>> "no data on this node"... rather just don't give the "data" role to the
>>>> node.
>>>>
>>>> Additional node roles I think should exist:
>>>>
>>>> I think that we should expand/edit your list of roles to be
>>>>
>>>>    - QUERY - accepts and analyzes queries up to the point of actually
>>>>    consulting the lucene index (useful if you have a very heavy analysis
>>>>    phase)
>>>>    - UPDATE - accepts update requests, and performs update
>>>>    functionality prior to and including DistributedUpdateProcessorFactory
>>>>    (useful if you have a very heavy analysis phase)
>>>>    - ADMIN - accepts admin/management commands
>>>>    - UI - hosts an admin ui
>>>>    - ZOOKEEPER - hosts embedded zookeeper
>>>>    - OVERSEER - performs overseer related functionality (though IIRC
>>>>    there's a proposal to eliminate overseer that might eliminate this)
>>>>    - DATA - nodes where there is a lucene index and matching against
>>>>    the analyzed results of a query may be conducted to generate a response,
>>>>    also performs update steps that come after
>>>>    DistributedUpdateProcessorFactory
>>>>
>>>> I also suggest that these roles each have a node in zookeeper listing
>>>> the current member nodes (as child nodes) so that code that wants to find a
>>>> node with an appropriate role does not need to scan the list of all nodes
>>>> parsing something to discover which nodes apply and also does not have to
>>>> parse json to do it. I think this will be particularly key for zookeeper
>>>> nodes which might be 3 out of 100 or more nodes. Similar to how we track
>>>> live nodes. I think we should have a nodes.json too that tracks what roles
>>>> a node is ALLOWED to take (as opposed to which roles it is currently
>>>> servicing).
>>>>
>>>> So running code consults the zookeeper role list of nodes, and any code
>>>> seeking to transition a node (an admin operation with much lower
>>>> performance requirements) consults the json data in the nodes.json node,
>>>> parses it, finds the node in question and checks what it's eligible for
>>>> (this will correspond to which servlets/apps have been loaded).
>>>>
>>>> I know of a case that would benefit from having separate Query/Update
>>>> nodes that handle a heavy analysis process which would be deployed to a
>>>> number of CPU heavy boxes (which might add more in prep for bulk indexing,
>>>> and remove them when bulk was done), data could then be hosted on cheaper
>>>> nodes....
>>>>
>>>> Also maybe think about how this relates to NRT/TLOG/PULL, which are also
>>>> maybe role-like.
>>>>
>>>> WDYT?
>>>>
>>>> -Gus
>>>>
>>>>
>>>> On Wed, Oct 27, 2021 at 3:17 AM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Here's an SIP for introducing the concept of node roles:
>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>
>>>>> We also wish to add first class support for Query nodes that are used
>>>>> to process user queries by forwarding them to data nodes, merging/aggregating
>>>>> the results and presenting them to users. This concept is a first class
>>>>> citizen in most other search engines. This is a chance for Solr to catch up.
>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>
>>>>> Regards,
>>>>> Ishan / Noble / Hitesh
>>>>>
>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
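
For reference, the two layouts for role data discussed in the quoted thread (role as key with a list of nodes, versus node as key with a list of roles) carry the same information, so moving between them is a simple inversion. The sketch below is only an illustration under stated assumptions: it works on in-memory maps rather than on the actual roles.json content or on ZooKeeper, and the class and method names (RolesIndexSketch, invert) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not Solr code and not the real roles.json handling.
final class RolesIndexSketch {

    // Turns "role -> nodes" into "node -> roles".
    static Map<String, List<String>> invert(Map<String, List<String>> nodesByRole) {
        Map<String, List<String>> rolesByNode = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : nodesByRole.entrySet()) {
            String role = entry.getKey();
            for (String node : entry.getValue()) {
                // Create the per-node list on first sight, then record the role.
                rolesByNode.computeIfAbsent(node, k -> new ArrayList<>()).add(role);
            }
        }
        return rolesByNode;
    }

    public static void main(String[] args) {
        // Made-up node names, for illustration only.
        Map<String, List<String>> nodesByRole = new TreeMap<>();
        nodesByRole.put("data", List.of("node1", "node2"));
        nodesByRole.put("overseer", List.of("node2"));

        System.out.println(invert(nodesByRole));
    }
}
```

Either layout can be derived from the other, so the choice mostly comes down to which lookup (nodes for a role, or roles for a node) the running code needs to be cheap.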
