Re: First class support for node roles

Gus Heck Wed, 27 Oct 2021 09:24:10 -0700

> In other words, roles are all "positive", but their consequences are only
> negative (rejecting when the matching positive role is not present).
>
> Yeah, right. To do something, the machine needs the role.



> We can also consider no role defined = all roles allowed. Will make things
> simpler.
>

In terms of the startup command, yes. Internally we should have all roles
explicitly assigned when no roles are specified at startup, so that the code
doesn't have a million if checks for the empty case.
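A rough sketch of what I mean, in Java. Everything here is hypothetical and
only for illustration: the class name, the comma-separated startup value, and
the resolve/require helpers are assumptions, not existing Solr code. The role
names come from the list proposed further down the thread (ZOOKEEPER is left
out because nodes don't perform it today).

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

public class NodeRolesSketch {

    /** Roles a node performs today, per the list proposed in this thread. */
    public enum Role { QUERY, UPDATE, ADMIN, UI, OVERSEER, DATA }

    /**
     * Hypothetical startup parsing: a null or blank value means "no roles
     * specified", which is expanded to the full explicit set so that callers
     * never have to special-case an empty set.
     * Unknown role names make Role.valueOf throw IllegalArgumentException.
     */
    public static Set<Role> resolve(String startupValue) {
        if (startupValue == null || startupValue.trim().isEmpty()) {
            return Collections.unmodifiableSet(EnumSet.allOf(Role.class));
        }
        EnumSet<Role> roles = EnumSet.noneOf(Role.class);
        for (String name : startupValue.split(",")) {
            roles.add(Role.valueOf(name.trim().toUpperCase(Locale.ROOT)));
        }
        return Collections.unmodifiableSet(roles);
    }

    /** Roles are positive; the only consequence is refusing when one is missing. */
    public static void require(Set<Role> nodeRoles, Role needed) {
        if (!nodeRoles.contains(needed)) {
            throw new IllegalStateException("This node does not have role " + needed);
        }
    }
}
```

With something like this, core creation would call require(roles, Role.DATA)
and nothing else needs to know whether roles were given at startup or
defaulted.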


>
> On Wed, Oct 27, 2021 at 6:14 PM Ilan Ginzburg <[email protected]> wrote:
>
>> How do we expect the roles to be used?
>> One way I see is a node refusing to do anything related to a role it
>> doesn't have.
>> For example if a node does not have role "data", any attempt to create a
>> core on it would fail.
>> A node not having the role "query", will refuse to have anything to do
>> with handling a query etc.
>> Then it would be up to other code to make sure only the appropriate nodes
>> are requested to do any type of action.
>> So for example any replica placement code plugin would have to restrict
>> the set of candidate nodes for a new replica placement to those having
>> "data". Otherwise the call would fail, and there should be nothing the
>> replica placement code can do about it.
>>
>> Similarly, the "overseer" role would limit the nodes that participate in
>> the Overseer election. The Overseer election code would have to remove (or
>> not add) all non qualifying nodes from the election, and we should expect a
>> node without role "overseer" to refuse to start the Overseer machinery if
>> asked to...
>>
>> Trying to make the use case clear regarding how roles are used.
>> Ilan
>>
>> On Wed, Oct 27, 2021 at 5:47 PM Gus Heck <[email protected]> wrote:
>>
>>>
>>>
>>> On Wed, Oct 27, 2021 at 9:55 AM Ishan Chattopadhyaya <
>>> [email protected]> wrote:
>>>
>>>> Hi Gus,
>>>>
>>>> > I think that we should expand/edit your list of roles to be
>>>>
>>>> The list can be expanded as and when more isolation and features are
>>>> needed. I only listed those roles for which we already have functionality
>>>> or for which functionality is under development.
>>>>
>>>
>>> Well, all of those roles (except zookeeper) are things nodes do today. As
>>> it stands they are all doing all of them. What we add as we move forward
>>> is support for starting without a given role, plus the zookeeper role
>>> when that feature is ready.
>>>
>>>
>>>> > I would like to recommend that the roles be all positive ("Can do
>>>> this") and nodes with no role at all are ineligible for all activities.
>>>>
>>>> It comes down to the defaults and backcompat. If we want all Solr nodes
>>>> to be able to host data replicas by default (without user explicitly
>>>> specifying role=data), then we need a way to unset this role. The most
>>>> reasonable way sounded like a "!data". We can do away with !data if we
>>>> mandate each and every data node have the role "data" explicitly defined
>>>> for it, which breaks backcompat and also is cumbersome to use for those who
>>>> don't want to use these special roles.
>>>>
>>>>
>>> Not sure I understand, which of the roles I mentioned (other than
>>> zookeeper, which I expect is intended as different from our current
>>> embedded zk) is NOT currently supported by a single cloud node brought up
>>> as shown in our tutorials/docs? I'm certainly not proposing that the
>>> default change to nothing. The default is all roles, unless you specify
>>> roles at startup.
>>>
>>>
>>>> > I also suggest that these roles each have a node in zookeeper listing
>>>> the current member nodes (as child nodes) so that code that wants to find a
>>>> node with an appropriate role does not need to scan the list of all nodes
>>>> parsing something to discover which nodes apply and also does not have to
>>>> parse json to do it.
>>>>
>>>> /roles.json exists today, it has role as key and list of nodes as
>>>> value. In the next major version, we can change the format of that file and
>>>> use key as node, value as list of roles. Or, maybe we can go for adding the
>>>> roles to the data for each item in the list of live_nodes.
>>>>
>>>>
>>> I'm not finding anything in our documentation about roles.json so I
>>> think it's an internal implementation detail, which reduces back compat
>>> concerns. ADDROLE/REMOVEROLE don't accept json or anything like that and
>>> could be made to work with zk nodes too.
>>>
>>> The fact that some precursor work was done without a SIP (or before SIPs
>>> existed) should not hamstring our design once a SIP that clearly covers the
>>> same topic is under consideration. By their nature SIPs are non-trivial
>>> and often will include compatibility breaks. Good news is I don't think I
>>> see one here, just a code change to transition to a different zk backend. I
>>> think that it's probably a mistake to consider our zookeeper data a public
>>> API and we should be moving away from that or at the very least segregating
>>> clearly what in zk is long term reliable. Ideally our v1/v2 APIs should be
>>> the public API through which information about the cluster is obtained.
>>> Programming directly against zk is kind of like a custom build of solr.
>>> Sometimes useful and appropriate, but maintenance is your concern. For code
>>> plugging into solr, it should in theory be against an internal information
>>> java api, and zookeeper should not be touched directly. (I know this is not
>>> in a good state or at least wasn't last time I looked closely, but it
>>> should be where we are heading).
>>>
>>> > any code seeking to transition a node
>>>>
>>>> We considered this situation and realized that it is very risky to have
>>>> nodes change roles while they are up and running. Better to assign fixed
>>>> roles upon startup.
>>>>
>>>
>>> I agree that concurrency is hard. I definitely think startup time
>>> assignments should be involved here. I'm not thinking that every transition
>>> must be supported. As a starting point it would be fine if none were.
>>> Having something suddenly become zookeeper is probably tricky to support
>>> (see discussion in that thread regarding nodes not actually participating
>>> until they have a partner to join with them to avoid even numbered
>>> clusters), but I think the design should not preclude the possibility of
>>> nodes becoming eligible for some roles or withdrawing from some roles, and
>>> treatment of roles should be consistent. In some cases someone may decide
>>> it's worth the work of handling the concurrency concerns; it's best if
>>> they don't have to break back compat, or hack their code around an
>>> assumption that it wouldn't happen, in order to do it.
>>>
>>> Taking the zookeeper case as an example, it very much might be desirable
>>> to have the possibility to heal the zk cluster by promoting another node
>>> (configured as eligible for zk) to active zk duty if one of the current zk
>>> nodes has been down long enough (say on prem hardware, motherboard pops a
>>> capacitor, server gone for a week while new hardware is purchased, built
>>> and configured). Especially if the down node didn't hold data or other
>>> nodes had sufficient replicas and the cluster is still answering queries
>>> just fine.
>>>
>>>
>>>>
>>>> > I know of a case that would benefit from having separate Query/Update
>>>> nodes that handle a heavy analysis process which would be deployed to a
>>>> number of CPU heavy boxes (which might add more in prep for bulk indexing,
>>>> and remove them when bulk was done), data could then be hosted on cheaper
>>>> nodes....
>>>>
>>>> This is the main motivation behind this work. SOLR-15715 needs this,
>>>> and hence it would be good to get this in as soon as possible.
>>>>
>>>
>>> I think we can incrementally work towards configurability for all of
>>> these roles. The current default state is that a node has all roles and the
>>> incremental progress is to enable removing a role from a node. This, I think,
>>> is why it might be good to:
>>>
>>> A) Determine the set of roles our current solr nodes are performing
>>> (that might be removed in some scenario) and document this via assigning
>>> these roles as default on as this SIP goes live.
>>> B) Figure out what the process of adding something entirely new that we
>>> haven't yet thought of with its own role would look like.
>>>
>>> I think it would be great if we not only satisfied the current need but
>>> determined how we expect this to change over time.
>>>
>>>
>>>> Regards,
>>>> Ishan
>>>>
>>>> On Wed, Oct 27, 2021 at 6:32 PM Gus Heck <[email protected]> wrote:
>>>>
>>>>> The SIP looks like a good start, and I was already thinking of
>>>>> something very similar to this as a follow on to my attempts to split the
>>>>> uber filter (SolrDispatchFilter) into servlets such that roles determine
>>>>> what servlets are deployed, but I would like to recommend that the roles
>>>>> be all positive ("Can do this") and nodes with no role at all are ineligible
>>>>> for all activities. (just like standard role permissioning systems). This
>>>>> will make it much more familiar and easy to think about. Therefore there
>>>>> would be no need for a role such as !data which I presume was meant to
>>>>> mean "no data on this node"... rather just don't give the "data" role to
>>>>> the node.
>>>>>
>>>>> Additional node roles I think should exist:
>>>>>
>>>>> I think that we should expand/edit your list of roles to be
>>>>>
>>>>>    - QUERY - accepts and analyzes queries up to the point of actually
>>>>>    consulting the lucene index (useful if you have a very heavy analysis
>>>>>    phase)
>>>>>    - UPDATE - accepts update requests, and performs update
>>>>>    functionality prior to and including DistributedUpdateProcessorFactory
>>>>>    (useful if you have a very heavy analysis phase)
>>>>>    - ADMIN - accepts admin/management commands
>>>>>    - UI - hosts an admin ui
>>>>>    - ZOOKEEPER - hosts embedded zookeeper
>>>>>    - OVERSEER - performs overseer related functionality (though IIRC
>>>>>    there's a proposal to eliminate overseer that might eliminate this)
>>>>>    - DATA - nodes where there is a lucene index and matching against
>>>>>    the analyzed results of a query may be conducted to generate a
>>>>>    response, also performs update steps that come after
>>>>>    DistributedUpdateProcessorFactory
>>>>>
>>>>> I also suggest that these roles each have a node in zookeeper listing
>>>>> the current member nodes (as child nodes) so that code that wants to find a
>>>>> node with an appropriate role does not need to scan the list of all nodes
>>>>> parsing something to discover which nodes apply and also does not have to
>>>>> parse json to do it. I think this will be particularly key for zookeeper
>>>>> nodes which might be 3 out of 100 or more nodes. Similar to how we track
>>>>> live nodes. I think we should have a nodes.json too that tracks what roles
>>>>> a node is ALLOWED to take (as opposed to which roles it is currently
>>>>> servicing).
>>>>>
>>>>> So running code consults the zookeeper role list of nodes, and any
>>>>> code seeking to transition a node (an admin operation with much lower
>>>>> performance requirements) consults the json data in the nodes.json node,
>>>>> parses it, finds the node in question and checks what it's eligible for
>>>>> (this will correspond to which servlets/apps have been loaded).
>>>>>
>>>>> I know of a case that would benefit from having separate Query/Update
>>>>> nodes that handle a heavy analysis process which would be deployed to a
>>>>> number of CPU heavy boxes (which might add more in prep for bulk indexing,
>>>>> and remove them when bulk was done), data could then be hosted on cheaper
>>>>> nodes....
>>>>>
>>>>> Also maybe think about how this relates to the NRT/TLOG/PULL replica
>>>>> types, which are also maybe role-like.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> -Gus
>>>>>
>>>>>
>>>>> On Wed, Oct 27, 2021 at 3:17 AM Ishan Chattopadhyaya <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>
>>>>>> We also wish to add first class support for Query nodes that are used
>>>>>> to process user queries by forwarding to data nodes, merging/aggregating
>>>>>> them and presenting to users. This concept exists as first class citizens
>>>>>> in most other search engines. This is a chance for Solr to catch up.
>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>
>>>>>> Regards,
>>>>>> Ishan / Noble / Hitesh
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://www.needhamsoftware.com (work)
>>>>> http://www.the111shift.com (play)
>>>>>
>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
