Re: First class support for node roles

Ishan Chattopadhyaya Mon, 01 Nov 2021 05:02:15 -0700

If the changes and the scope seem acceptable, should we proceed to a vote?


On Mon, Nov 1, 2021 at 5:22 PM Ishan Chattopadhyaya <
[email protected]> wrote:

> Hi Gus,
>
> Thanks for the summary.
>
> > (+Gus, +Houston,+Ilan) Positive roles, the existence of which implies
> functionality: a node that can provide the functionality always has the
> role, and a node that doesn't have the role can't provide the
> functionality.
>
> I've removed the concept of "!data" from the SIP proposal. A node started
> without the -Dnode.roles parameter will be assumed to have
> -Dnode.roles=data. If a node is started with a node.roles param, that param
> must include "data" if the node is to host data.
>
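A minimal sketch of the defaulting rule described above, assuming the roles
arrive as a comma-separated string. The names NodeRolesSketch and parse are
hypothetical and are not taken from the SIP or from Solr's code. Treating a
blank value like a missing one, and rejecting "!"-prefixed values outright,
are also assumptions made for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Solr: resolves the roles of a node. */
final class NodeRolesSketch {
    static final String DEFAULT_ROLE = "data";

    /**
     * Parses a comma-separated role list. A null or blank value falls back
     * to the single default role "data"; otherwise only the listed roles apply.
     */
    static Set<String> parse(String nodeRolesProperty) {
        Set<String> roles = new LinkedHashSet<>();
        if (nodeRolesProperty != null) {
            for (String token : nodeRolesProperty.split(",")) {
                String role = token.trim().toLowerCase(Locale.ROOT);
                if (role.isEmpty()) {
                    continue; // ignore empty entries such as "a,,b"
                }
                if (role.startsWith("!")) {
                    // Assumption: negative roles are no longer accepted.
                    throw new IllegalArgumentException("Negative role not supported: " + role);
                }
                roles.add(role);
            }
        }
        if (roles.isEmpty()) {
            roles.add(DEFAULT_ROLE);
        }
        return Collections.unmodifiableSet(roles);
    }

    public static void main(String[] args) {
        // Run with e.g. -Dnode.roles=overseer to see the explicit case.
        System.out.println(parse(System.getProperty("node.roles")));
    }
}
```

With this shape, a lookup is just parse(...).contains("data"), with no
implied-role logic at the call site.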
> > (+Houston,+Ishan,+Gus - below) Rename query role
>
> Coordinator role should be better now.
>
>
>    - > (+Gus) We should include a plan for the overall set of roles to
>    work towards and then build them out as time allows us to.
>    - > (+Gus) We have a distinction between "capable" and "currently
>    providing"
>    - > (+Gus) Capable be evidenced by a config/startup designation that
>    adds a list of roles to a json file in zk where the nodes are all
>    listed
>    - > (+Gus) Providing be evidenced by the node adding a list of
>    ephemeral nodes (similar to live_nodes) for each role
>
>
> From an overall conceptual point of view, there doesn't need to be any
> specialization for a role. When a new role is introduced, such details on
> behaviour and implementation can be documented and defined then. As for
> OVERSEER role today, it can be documented as a role that marks a node to be
> a "preferred" overseer (or eligible/capable etc.), and "currently
> providing" can be determined by the OVERSEERSTATUS api call or the overseer
> leader election queue.
>
> > (+Ilan, +Gus) Making collections role aware
>
> Seems to me that this is something that can be introduced as a follow up,
> and we don't want to complicate the proposed design early on.
>
> > Ishan, specifics on how your coordinator node would work would be
> interesting to know if it really is distinct from my concept of a "query"
> node. I agree that that term is probably confusing; I used it to mean
> "query parsing", you meant it as "query aggregator".
>
> As of now, the coordinator node would be capable of servicing query (or,
> at a later point in time, indexing) requests by handling the queries on
> the coordinator node itself and making shard requests to data nodes. If we
> want to have the coordinator nodes do even more work, i.e. do query parsing
> on behalf of the shards, the capability can be further enhanced.
>
> Regards,
> Ishan
>
> On Fri, Oct 29, 2021 at 7:21 PM Gus Heck <[email protected]> wrote:
>
>> edit:
>> 6. (+Gus) Providing be evidenced by the node *adding itself to a list*
>> of ephemeral nodes (similar to live_nodes) for each role
>>
>> On Fri, Oct 29, 2021 at 9:40 AM Gus Heck <[email protected]> wrote:
>>
>>> I've heard a number of folks agree that we should not have negative
>>> (role removal) values for roles (!data in the sip).
>>>
>>> I also don't like the idea of the "coordinator" creating assumptions
>>> about other roles. I think the point of avoiding "!data" is to make it
>>> programmatically and logically easy to tell what role a node has. If we
>>> have to have a method called figureOutImpliedRoles() with a lot of logic
>>> in it, that's bad. It should just be getRoles().contains(role), trivially
>>> returning the roles that are already declared in config/zk/whatever.
>>>
>>> We don't have to support every possible role all at once. We can have
>>> "basic functionality" that all nodes provide regardless of roles (right now
>>> that's everything), and then lop off chunks of basic functionality and
>>> assign them to roles. That should be easy and backward compatible if we
>>> then give the new role to every node by default on upgrade.
>>>
>>> However we should carefully think about what should and shouldn't be
>>> part of any role, because moving functionality out of a role back to basic
>>> functionality or between roles will create backwards compatibility issues.
>>> This is why I think we should have a concept of what roles we will have in
>>> the future, so we don't inadvertently move functionality into a role that
>>> later needs to go in some other role (mistakes/bugs may happen of course,
>>> but best effort).
>>>
>>> So boiling it down I've seen suggestions for the following
>>> additions/edits to the SIP:
>>>
>>>    1. (+Gus, +Houston,+Ilan) Positive roles, the existence of which
>>>    implies functionality: a node that can provide the functionality
>>>    always has the role, and a node that doesn't have the role can't
>>>    provide the functionality.
>>>    2. (+Houston,+Ishan,+Gus - below) Rename query role
>>>    3. (+Gus) We should include a plan for the overall set of roles to
>>>    work towards and then build them out as time allows us to.
>>>    4. (+Gus) We have a distinction between "capable" and "currently
>>>    providing"
>>>    5. (+Gus) Capable be evidenced by a config/startup designation that
>>>    adds a list of roles to a json file in zk where the nodes are all listed
>>>    6. (+Gus) Providing be evidenced by the node adding a list of
>>>    ephemeral nodes (similar to live_nodes) for each role
>>>    7. (+Ilan, +Gus) Making collections role aware
>>>
>>> Ilan suggested that we make collections role-aware, which would make some
>>> sense since the collection might want to have a minimum of 2
>>> query-aggregator nodes available, might want to avoid zk nodes, etc. I
>>> think that this is a good next feature and the intention should be added to
>>> the SIP, but it need not be in the initial implementation, since by default
>>> everything can have all roles (roles implemented to date) and initially
>>> removing roles from nodes will be an advanced/manual feature mostly
>>> applicable to static clusters that don't add collections regularly. Support
>>> for role-aware collections can then be added to make the feature useful
>>> for a wider audience (it should be its own ticket anyway, and it interacts
>>> with replica placement).
>>>
>>> I've heard several agree with #1, and it seems 3-6 were either not yet
>>> clear or folks are still deliberating as I haven't noticed positive or
>>> negative opinions there, just some discussion of the definition of
>>> candidate roles. I'm fond of 3-5 because it allows for things like knowing
>>> what the capabilities of a down node are, and finding a provider without
>>> having to cross-coordinate with live_nodes. (keeps code simple, avoids
>>> racing between the check for liveness and the check for the capability)
>>> Also, a node joining as live and able to serve queries can be decoupled
>>> from when it's ready to provide a service (thinking at least zk here,
>>> waiting for a 2nd node capable of zk before expanding the zk cluster to
>>> avoid even numbered clusters).
>>>
>>> Ishan, specifics on how your coordinator node would work would be
>>> interesting to know if it really is distinct from my concept of a "query"
>>> node. I agree that that term is probably confusing; I used it to mean
>>> "query parsing", you meant it as "query aggregator".
>>>
>>> As a side note, with positive only roles and all roles added unless
>>> specified otherwise, Ishan's use case might be as simple as just removing
>>> the DATA role from a few nodes and restricting the aggregation queries
>>> concerned to those nodes. To get solr to enforce the restriction for you,
>>> a "query/compute/coordinator" role must be removed from the remainder
>>> of the nodes.
>>>
>>> -Gus
>>>
>>> On Fri, Oct 29, 2021 at 5:49 AM Ishan Chattopadhyaya <
>>> [email protected]> wrote:
>>>
>>>> > I'll introduce a change to the SIP document, unless there are
>>>> objections/questions/concerns. WDYT?
>>>> I've made the change to the document. Feedback on this welcome.
>>>>
>>>> On Fri, Oct 29, 2021 at 2:52 PM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>>
>>>>> It seems to me, after going through this thread, that the role "query"
>>>>> is misleading for what we're planning to introduce in SOLR-15715. We're
>>>>> essentially introducing a functionality for performing what we initially
>>>>> called "query aggregations". The actual queries run on data nodes anyway;
>>>>> it's just that the first point of entry for such distributed queries will
>>>>> be a separate node with this extra functionality. Towards that, I feel we
>>>>> should call a node with such a capability a "coordinator" node (instead
>>>>> of "query node"). It fits nicely with the mental model of Elasticsearch
>>>>> as well:
>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#coordinating-node
>>>>> .
>>>>>
>>>>> Proposing that if a node has a role "coordinator", then that node is
>>>>> already assumed to have no data replicas on it. If a node starts with
>>>>> both roles, "coordinator,data", then the startup should fail with a
>>>>> message saying a coordinator node should not host data on it. A
>>>>> coordinator node, though, can have other roles like "zookeeper" or
>>>>> "overseer" etc.
>>>>>
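The startup check proposed in this message could be sketched roughly as
follows. This is only an illustration under assumptions: the class and method
names are hypothetical, and the exact exception type and message are not
specified anywhere in the thread.

```java
import java.util.Set;

/** Hypothetical startup check, not taken from Solr's code base. */
final class RoleValidationSketch {

    /** Fails if a node declares both "coordinator" and "data". */
    static void validate(Set<String> roles) {
        if (roles.contains("coordinator") && roles.contains("data")) {
            throw new IllegalStateException(
                "A coordinator node should not host data on it; roles were: " + roles);
        }
    }

    public static void main(String[] args) {
        validate(Set.of("coordinator", "overseer")); // accepted: other roles are fine
        try {
            validate(Set.of("coordinator", "data"));
        } catch (IllegalStateException e) {
            System.out.println("Startup refused: " + e.getMessage());
        }
    }
}
```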
>>>>> I'll introduce a change to the SIP document, unless there are
>>>>> objections/questions/concerns. WDYT?
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 29, 2021 at 1:54 PM Ilan Ginzburg <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> If we make collections role-aware for example (replicas of that
>>>>>> collection can only be placed on nodes with a specific role, in addition
>>>>>> to the other role based constraints), the set of roles should be user
>>>>>> extensible and not fixed.
>>>>>>
>>>>>> If collections are not role aware, the constraints introduced by
>>>>>> roles apply to all collections equally which might be insufficient if a
>>>>>> user needs for example a heavily used collection to only be placed on
>>>>>> more powerful nodes.
>>>>>>
>>>>>> Ilan
>>>>>>
>>>>>> On Thu 28 Oct 2021 at 07:59, Gus Heck <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 27, 2021 at 3:34 PM Houston Putman <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>> I don't think it's unreasonable to want to have nodes that don't
>>>>>>>>> accept queries. This is just Ishan's use case.
>>>>>>>>
>>>>>>>>
>>>>>>>> Maybe I am misunderstanding, and it does deal with your last point
>>>>>>>> about inter-node communication, but Peer-sync uses queries when doing
>>>>>>>> replication between replicas. If a node doesn't have queries enabled,
>>>>>>>> then it's possible to break peer sync. There are other options to make
>>>>>>>> sure certain replicas aren't queried (shards.preference).
>>>>>>>> For the separation of update/query traffic, I understand having
>>>>>>>> compute nodes that deal with pre-replica commands, such as managing
>>>>>>>> distributed queries or pre-processing documents in the URP chain. But
>>>>>>>> for actual non-distrib queries and final update requests, the only way
>>>>>>>> to actually separate this traffic is using PULL/TLOG replicas, because
>>>>>>>> otherwise (with NRT) all update requests are still going to the query
>>>>>>>> nodes, just the same as non-query nodes that are hosting those
>>>>>>>> replicas. (And shard leadership could go to any "data" node, since I
>>>>>>>> imagine we wouldn't filter out the "query" nodes...) The
>>>>>>>> shards.preference option makes it easy to send queries to only PULL
>>>>>>>> replicas in this scenario. That's why I saw this more as a "compute"
>>>>>>>> role rather than "query".
>>>>>>>>
>>>>>>>
>>>>>>> Yeah for internal stuff we still need the ability to query so we
>>>>>>> might need to accommodate that, but you may not have noticed the bit
>>>>>>> where I mentioned Query nodes doing the parsing/analysis of the query
>>>>>>> and then sending a fully parsed (possibly serialized lucene objects)
>>>>>>> query to the data node. So data nodes would only speak a single lucene
>>>>>>> level query language and not parse queries or analyze text. In theory,
>>>>>>> with that, one could even have something like elastic reduce a request
>>>>>>> to lucene objects and then get an answer from a solr data node (for
>>>>>>> simple cases without need to find shards via zookeeper etc). Certainly
>>>>>>> there are many details and corner cases to think about there.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Definitely not what I would like. If I'm going to try to segregate
>>>>>>>>> data nodes out to certain nodes, I don't want them appearing
>>>>>>>>> elsewhere just because something went down or filled up. Nor would I
>>>>>>>>> want updates to suddenly start bogging down my query nodes....
>>>>>>>>>
>>>>>>>>
>>>>>>>> I guess it depends on what you are optimizing for. Maybe there can
>>>>>>>> be an option for this, like -DnonLenientRoles=data,update or something
>>>>>>>> like that.
>>>>>>>>
>>>>>>>
>>>>>>> A possibility
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Oct 27, 2021 at 3:03 PM Gus Heck <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Oct 27, 2021 at 2:44 PM Houston Putman <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> As for the "query" role, let's name it something better like
>>>>>>>>>> "compute", since data nodes are always going to be "querying".
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don't think it's unreasonable to want to have nodes that don't
>>>>>>>>> accept queries. This is just Ishan's use case.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> if no live nodes have roles=overseer (or roles=all), then we
>>>>>>>>>> should just select any node to be overseer. This should be the same
>>>>>>>>>> for compute, data, etc.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Definitely not what I would like. If I'm going to try to segregate
>>>>>>>>> data nodes out to certain nodes, I don't want them appearing
>>>>>>>>> elsewhere just because something went down or filled up. Nor would I
>>>>>>>>> want updates to suddenly start bogging down my query nodes....
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> So, for the proposal, let's say "data" is a special role which is
>>>>>>>>>>> assumed by default, and is enabled on all nodes unless there's a
>>>>>>>>>>> !data.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Instead of this, maybe we have role groups. Such as
>>>>>>>>>> admin~=overseer,zk or worker~=compute,data,updateProcessing
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Role groups sound like a next level feature to be built on top
>>>>>>>>> once we figure out how roles work independently.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As for the suggested Roles, I'm not sure ADMIN or UI really fit,
>>>>>>>>>> since there is another option to disable the UI for a solr node, and
>>>>>>>>>> various ADMIN commands have to be accepted across other node roles.
>>>>>>>>>> (Data nodes require the Collections API, same with the overseer.)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I admit I'm angling towards a world in which enabling and
>>>>>>>>> disabling broad features is done in one way in one place... As for
>>>>>>>>> admin there might be a distinction between commands issued between
>>>>>>>>> nodes and from the outside world... I have this other idea about
>>>>>>>>> inter-node communication that's even less baked that I won't distract
>>>>>>>>> with here ;)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> - Houston
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 27, 2021 at 1:34 PM Ishan Chattopadhyaya <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> bq. In other words, roles are all "positive", but their
>>>>>>>>>>> consequences are only negative (rejecting when the matching
>>>>>>>>>>> positive role is not present).
>>>>>>>>>>>
>>>>>>>>>>> Essentially, yes. A node that doesn't specify any role should be
>>>>>>>>>>> able to do everything.
>>>>>>>>>>>
>>>>>>>>>>> Let me just take a brief detour and mention our thoughts on the
>>>>>>>>>>> "query" role. While all data nodes can also be used for querying,
>>>>>>>>>>> our idea was to create a layer of nodes that have some special
>>>>>>>>>>> mechanism to be able to proxy/forward queries to data nodes (let's
>>>>>>>>>>> call it "pseudo cores" or "synthetic cores" or "proxy cores"). Our
>>>>>>>>>>> thought was that any node that has "query,!data" role would enable
>>>>>>>>>>> this special mode on startup (whereby requests are served by these
>>>>>>>>>>> special pseudo cores). We'll discuss this in detail in that issue.
>>>>>>>>>>>
>>>>>>>>>>> Back to the main subject here.
>>>>>>>>>>>
>>>>>>>>>>> Let's take a practical scenario:
>>>>>>>>>>> * Layer1: Organization has about 100 nodes, each node has many
>>>>>>>>>>> data replicas
>>>>>>>>>>> * Layer2: To manage such a large cluster reliably, they keep
>>>>>>>>>>> aside 4-5 dedicated overseer nodes.
>>>>>>>>>>> * Layer3: Since query aggregations/coordination can potentially
>>>>>>>>>>> be expensive, they keep aside 5-10 query nodes.
>>>>>>>>>>>
>>>>>>>>>>> My preference would be as follows:
>>>>>>>>>>> * I'd like to refer to Layer1 nodes as the "data nodes" and
>>>>>>>>>>> hence get either no role defined for them or -Dnode.roles=data.
>>>>>>>>>>> * I'd like to refer to Layer2 nodes as "overseer nodes" (even
>>>>>>>>>>> though I understand, only one of them can be an overseer at a
>>>>>>>>>>> time). I'd like to have -Dnode.roles=!data,overseer
>>>>>>>>>>> * I'd like to refer to Layer3 nodes as "query nodes", with
>>>>>>>>>>> -Dnode.roles=!data,query
>>>>>>>>>>>
>>>>>>>>>>> ^ This seems very practical from an operational standpoint.
>>>>>>>>>>>
>>>>>>>>>>> So, for the proposal, let's say "data" is a special role which is
>>>>>>>>>>> assumed by default, and is enabled on all nodes unless there's a
>>>>>>>>>>> !data. It is presumed that data nodes can also serve queries
>>>>>>>>>>> directly, so adding a "query" to those nodes is meaningless (also
>>>>>>>>>>> because there's no practical benefit to stopping a data node from
>>>>>>>>>>> receiving a query for "!query" role to be useful).
>>>>>>>>>>>
>>>>>>>>>>> "query" role on nodes that don't host data really refers to a
>>>>>>>>>>> special capability for lightweight, stateless nodes. I don't want
>>>>>>>>>>> to add a "!query" on dedicated overseer nodes, and hence I don't
>>>>>>>>>>> want to assume that "query" is implicitly available on any node
>>>>>>>>>>> even if the role isn't specified.
>>>>>>>>>>>
>>>>>>>>>>> "overseer" role is complicated, since it is already defined and
>>>>>>>>>>> we don't have the opportunity to define it the right way. I'd hate
>>>>>>>>>>> having to put a "!overseer" on every data node on startup in order
>>>>>>>>>>> to have a few dedicated overseers.
>>>>>>>>>>>
>>>>>>>>>>> In short, in this SIP, I just wish to implement the concept of
>>>>>>>>>>> node roles and its handling. How individual roles are leveraged can
>>>>>>>>>>> be up to every new role's implementation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 27, 2021 at 9:54 PM Gus Heck <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> In other words, roles are all "positive", but their
>>>>>>>>>>>>> consequences are only negative (rejecting when the matching
>>>>>>>>>>>>> positive role is not present).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, right. To do something, the machine needs the role.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> We can also consider no role defined = all roles allowed. Will
>>>>>>>>>>>>> make things simpler.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In terms of the startup command, yes. Internally we should have
>>>>>>>>>>>> all roles explicitly assigned when no roles are specified at
>>>>>>>>>>>> startup so that the code doesn't have a million if checks for the
>>>>>>>>>>>> empty case.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 6:14 PM Ilan Ginzburg <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> How do we expect the roles to be used?
>>>>>>>>>>>>>> One way I see is a node refusing to do anything related to a
>>>>>>>>>>>>>> role it doesn't have.
>>>>>>>>>>>>>> For example if a node does not have role "data", any attempt
>>>>>>>>>>>>>> to create a core on it would fail.
>>>>>>>>>>>>>> A node not having the role "query", will refuse to have
>>>>>>>>>>>>>> anything to do with handling a query etc.
>>>>>>>>>>>>>> Then it would be up to other code to make sure only the
>>>>>>>>>>>>>> appropriate nodes are requested to do any type of action.
>>>>>>>>>>>>>> So for example any replica placement code plugin would have
>>>>>>>>>>>>>> to restrict the set of candidate nodes for a new replica
>>>>>>>>>>>>>> placement to those having "data". Otherwise the call would fail,
>>>>>>>>>>>>>> and there should be nothing the replica placement code can do
>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly, the "overseer" role would limit the nodes that
>>>>>>>>>>>>>> participate in the Overseer election. The Overseer election code
>>>>>>>>>>>>>> would have to remove (or not add) all non qualifying nodes from
>>>>>>>>>>>>>> the election, and we should expect a node without role "overseer"
>>>>>>>>>>>>>> to refuse to start the Overseer machinery if asked to...
>>>>>>>>>>>>>>
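The caller-side filtering described in this message (for replica placement
and for the Overseer election) might look roughly like the sketch below. It
assumes a plain map from node name to declared roles; the class name, method
name, and node names are hypothetical and do not correspond to Solr's
placement plugin API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not Solr's placement API. */
final class RoleFilterSketch {

    /** Returns the nodes whose declared roles include the required role. */
    static List<String> nodesWithRole(Map<String, Set<String>> rolesByNode, String requiredRole) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : rolesByNode.entrySet()) {
            if (entry.getValue().contains(requiredRole)) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        // Illustrative node names only.
        Map<String, Set<String>> rolesByNode = new LinkedHashMap<>();
        rolesByNode.put("node1:8983_solr", Set.of("data"));
        rolesByNode.put("node2:8983_solr", Set.of("overseer"));
        rolesByNode.put("node3:8983_solr", Set.of("data", "overseer"));

        // Candidates for a new replica: only nodes that have "data".
        System.out.println(nodesWithRole(rolesByNode, "data"));
        // Participants in the Overseer election: only nodes that have "overseer".
        System.out.println(nodesWithRole(rolesByNode, "overseer"));
    }
}
```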
>>>>>>>>>>>>>> Trying to make the use case clear regarding how roles are
>>>>>>>>>>>>>> used.
>>>>>>>>>>>>>> Ilan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 5:47 PM Gus Heck <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 9:55 AM Ishan Chattopadhyaya <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Gus,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I think that we should expand/edit your list of roles to
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The list can be expanded as and when more isolation and
>>>>>>>>>>>>>>>> features are needed. I only listed those roles for which the
>>>>>>>>>>>>>>>> functionality already exists or is under development.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Well all of those roles (except zookeeper) are things nodes
>>>>>>>>>>>>>>> do today. As it stands they are all doing all of them. What we
>>>>>>>>>>>>>>> add support for as we move forward is starting without a role,
>>>>>>>>>>>>>>> and add the zookeeper role when that feature is ready.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I would like to recommend that the roles be all positive
>>>>>>>>>>>>>>>> ("Can do this") and nodes with no role at all are ineligible
>>>>>>>>>>>>>>>> for all activities.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It comes down to the defaults and backcompat. If we want
>>>>>>>>>>>>>>>> all Solr nodes to be able to host data replicas by default
>>>>>>>>>>>>>>>> (without user explicitly specifying role=data), then we need a
>>>>>>>>>>>>>>>> way to unset this role. The most reasonable way sounded like a
>>>>>>>>>>>>>>>> "!data". We can do away with !data if we mandate each and every
>>>>>>>>>>>>>>>> data node have the role "data" explicitly defined for it, which
>>>>>>>>>>>>>>>> breaks backcompat and also is cumbersome to use for those who
>>>>>>>>>>>>>>>> don't want to use these special roles.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure I understand, which of the roles I mentioned (other
>>>>>>>>>>>>>>> than zookeeper, which I expect is intended as different from
>>>>>>>>>>>>>>> our current embedded zk) is NOT currently supported by a single
>>>>>>>>>>>>>>> cloud node brought up as shown in our tutorials/docs? I'm
>>>>>>>>>>>>>>> certainly not proposing that the default change to nothing. The
>>>>>>>>>>>>>>> default is all roles, unless you specify roles at startup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I also suggest that these roles each have a node in
>>>>>>>>>>>>>>>> zookeeper listing the current member nodes (as child nodes) so
>>>>>>>>>>>>>>>> that code that wants to find a node with an appropriate role
>>>>>>>>>>>>>>>> does not need to scan the list of all nodes parsing something to
>>>>>>>>>>>>>>>> discover which nodes apply and also does not have to parse json
>>>>>>>>>>>>>>>> to do it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /roles.json exists today, it has role as key and list of
>>>>>>>>>>>>>>>> nodes as value. In the next major version, we can change the
>>>>>>>>>>>>>>>> format of that file and use key as node, value as list of roles.
>>>>>>>>>>>>>>>> Or, maybe we can go for adding the roles to the data for each
>>>>>>>>>>>>>>>> item in the list of live_nodes.
>>>>>>>>>>>>>>>>
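To make the format change suggested here concrete, the sketch below inverts
a role-to-nodes mapping into a node-to-roles mapping. It works on in-memory
maps only and assumes nothing about the actual JSON layout of /roles.json
beyond "role as key and list of nodes as value"; the class name, method
name, and node names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of switching from role -> nodes to node -> roles. */
final class RolesInversionSketch {

    /** Builds a node -> roles view from a role -> nodes mapping. */
    static Map<String, Set<String>> invert(Map<String, List<String>> nodesByRole) {
        // Sorted collections keep the result deterministic.
        Map<String, Set<String>> rolesByNode = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : nodesByRole.entrySet()) {
            for (String node : entry.getValue()) {
                rolesByNode.computeIfAbsent(node, n -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return rolesByNode;
    }

    public static void main(String[] args) {
        // Illustrative node names only.
        Map<String, List<String>> nodesByRole = new TreeMap<>();
        nodesByRole.put("overseer", List.of("nodeA:8983_solr", "nodeB:8983_solr"));
        nodesByRole.put("data", List.of("nodeB:8983_solr"));
        System.out.println(invert(nodesByRole));
    }
}
```

With the node as the key, answering "what roles does this node have" becomes
a single lookup instead of a scan over every role entry.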
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not finding anything in our documentation about
>>>>>>>>>>>>>>> roles.json so I think it's an internal implementation detail,
>>>>>>>>>>>>>>> which reduces back compat concerns. ADDROLE/REMOVEROLE don't
>>>>>>>>>>>>>>> accept json or anything like that and could be made to work with
>>>>>>>>>>>>>>> zk nodes too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The fact that some precursor work was done without a SIP (or
>>>>>>>>>>>>>>> before SIPs existed) should not hamstring our design once a SIP
>>>>>>>>>>>>>>> that clearly covers the same topic is under consideration. By
>>>>>>>>>>>>>>> their nature SIPs are non-trivial and often will include
>>>>>>>>>>>>>>> compatibility breaks. Good news is I don't think I see one here,
>>>>>>>>>>>>>>> just a code change to transition to a different zk backend. I
>>>>>>>>>>>>>>> think that it's probably a mistake to consider our zookeeper
>>>>>>>>>>>>>>> data a public API and we should be moving away from that or at
>>>>>>>>>>>>>>> the very least segregating clearly what in zk is long term
>>>>>>>>>>>>>>> reliable. Ideally our v1/v2 APIs should be the public API
>>>>>>>>>>>>>>> through which information about the cluster is obtained.
>>>>>>>>>>>>>>> Programming directly against zk is kind of like a custom build
>>>>>>>>>>>>>>> of solr. Sometimes useful and appropriate, but maintenance is
>>>>>>>>>>>>>>> your concern. For code plugging into solr, it should in theory be
>>>>>>>>>>>>>>> against an internal information java api, and zookeeper should
>>>>>>>>>>>>>>> not be touched directly. (I know this is not in a good state or
>>>>>>>>>>>>>>> at least wasn't last time I looked closely, but it should be
>>>>>>>>>>>>>>> where we are heading).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > any code seeking to transition a node
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We considered this situation and realized that it is very
>>>>>>>>>>>>>>>> risky to have nodes change roles while they are up and running.
>>>>>>>>>>>>>>>> Better to assign fixed roles upon startup.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree that concurrency is hard. I definitely think startup
>>>>>>>>>>>>>>> time assignments should be involved here. I'm not thinking that
>>>>>>>>>>>>>>> every transition must be supported. As a starting point it would
>>>>>>>>>>>>>>> be fine if none were. Having something suddenly become zookeeper
>>>>>>>>>>>>>>> is probably tricky to support (see discussion in that thread
>>>>>>>>>>>>>>> regarding nodes not actually participating until they have a
>>>>>>>>>>>>>>> partner to join with them to avoid even numbered clusters), but
>>>>>>>>>>>>>>> I think the design should not preclude the possibility of nodes
>>>>>>>>>>>>>>> becoming eligible for some roles or withdrawing from some roles,
>>>>>>>>>>>>>>> and treatment of roles should be consistent. In some cases
>>>>>>>>>>>>>>> someone may decide it's worth the work of handling the
>>>>>>>>>>>>>>> concurrency concerns, best if they don't have to break back
>>>>>>>>>>>>>>> compat or hack their code around the assumption it wouldn't
>>>>>>>>>>>>>>> happen to do it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Taking the zookeeper case as an example, it very much might
>>>>>>>>>>>>>>> be desirable to have the possibility to heal the zk cluster by
>>>>>>>>>>>>>>> promoting another node (configured as eligible for zk) to active
>>>>>>>>>>>>>>> zk duty if one of the current zk nodes has been down long enough
>>>>>>>>>>>>>>> (say on prem hardware, motherboard pops a capacitor, server gone
>>>>>>>>>>>>>>> for a week while new hardware is purchased, built and
>>>>>>>>>>>>>>> configured). Especially if the down node didn't hold data or
>>>>>>>>>>>>>>> other nodes had sufficient replicas and the cluster is still
>>>>>>>>>>>>>>> answering queries just fine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I know of a case that would benefit from having separate
>>>>>>>>>>>>>>>> Query/Update nodes that handle a heavy analysis process which
>>>>>>>>>>>>>>>> would be deployed to a number of CPU heavy boxes (which might
>>>>>>>>>>>>>>>> add more in prep for bulk indexing, and remove them when bulk
>>>>>>>>>>>>>>>> was done), data could then be hosted on cheaper nodes....
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is the main motivation behind this work. SOLR-15715
>>>>>>>>>>>>>>>> needs this, and hence it would be good to get this in as soon
>>>>>>>>>>>>>>>> as possible.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we can incrementally work towards configurability
>>>>>>>>>>>>>>> for all of these roles. The current default state is that a
>>>>>>>>>>>>>>> node has all roles and the incremental progress is to enable
>>>>>>>>>>>>>>> removing a role from a node. This I think is why it might be
>>>>>>>>>>>>>>> good to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A) Determine the set of roles our current solr nodes are
>>>>>>>>>>>>>>> performing (that might be removed in some scenario) and
>>>>>>>>>>>>>>> document this via assigning these roles as default on as this
>>>>>>>>>>>>>>> SIP goes live.
>>>>>>>>>>>>>>> B) Figure out what the process of adding something entirely
>>>>>>>>>>>>>>> new that we haven't yet thought of with its own role would look
>>>>>>>>>>>>>>> like.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it would be great if we not only satisfied the
>>>>>>>>>>>>>>> current need but determined how we expect this to change over
>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Ishan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 6:32 PM Gus Heck <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The SIP looks like a good start, and I was already
>>>>>>>>>>>>>>>>> thinking of something very similar to this as a follow on to
>>>>>>>>>>>>>>>>> my attempts to split the uber filter (SolrDispatchFilter) into
>>>>>>>>>>>>>>>>> servlets such that roles determine what servlets are deployed,
>>>>>>>>>>>>>>>>> but I would like to recommend that the roles be all positive
>>>>>>>>>>>>>>>>> ("Can do this") and nodes with no role at all are ineligible
>>>>>>>>>>>>>>>>> for all activities (just like standard role permissioning
>>>>>>>>>>>>>>>>> systems). This will make it much more familiar and easy to
>>>>>>>>>>>>>>>>> think about. Therefore there would be no need for a role such
>>>>>>>>>>>>>>>>> as !data which I presume was meant to mean "no data on this
>>>>>>>>>>>>>>>>> node"... rather just don't give the "data" role to the node.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Additional node roles I think should exist:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think that we should expand/edit your list of roles to be
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - QUERY - accepts and analyzes queries up to the point of
>>>>>>>>>>>>>>>>>    actually consulting the Lucene index (useful if you have a
>>>>>>>>>>>>>>>>>    very heavy analysis phase)
>>>>>>>>>>>>>>>>>    - UPDATE - accepts update requests, and performs update
>>>>>>>>>>>>>>>>>    functionality prior to and including
>>>>>>>>>>>>>>>>>    DistributedUpdateProcessorFactory (useful if you have a very
>>>>>>>>>>>>>>>>>    heavy analysis phase)
>>>>>>>>>>>>>>>>>    - ADMIN - accepts admin/management commands
>>>>>>>>>>>>>>>>>    - UI - hosts an admin UI
>>>>>>>>>>>>>>>>>    - ZOOKEEPER - hosts embedded ZooKeeper
>>>>>>>>>>>>>>>>>    - OVERSEER - performs overseer-related functionality (though
>>>>>>>>>>>>>>>>>    IIRC there's a proposal to eliminate overseer that might
>>>>>>>>>>>>>>>>>    eliminate this)
>>>>>>>>>>>>>>>>>    - DATA - nodes where there is a Lucene index and matching
>>>>>>>>>>>>>>>>>    against the analyzed results of a query may be conducted to
>>>>>>>>>>>>>>>>>    generate a response; also performs update steps that come
>>>>>>>>>>>>>>>>>    after DistributedUpdateProcessorFactory
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also suggest that these roles each have a node in
>>>>>>>>>>>>>>>>> ZooKeeper listing the current member nodes (as child nodes),
>>>>>>>>>>>>>>>>> so that code that wants to find a node with an appropriate
>>>>>>>>>>>>>>>>> role does not need to scan the list of all nodes, parsing
>>>>>>>>>>>>>>>>> something to discover which nodes apply, and also does not
>>>>>>>>>>>>>>>>> have to parse JSON to do it. I think this will be
>>>>>>>>>>>>>>>>> particularly key for ZooKeeper nodes, which might be 3 out of
>>>>>>>>>>>>>>>>> 100 or more nodes. This is similar to how we track live
>>>>>>>>>>>>>>>>> nodes. I think we should have a nodes.json too that tracks
>>>>>>>>>>>>>>>>> what roles a node is ALLOWED to take (as opposed to which
>>>>>>>>>>>>>>>>> roles it is currently servicing).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So running code consults the ZooKeeper role list of nodes,
>>>>>>>>>>>>>>>>> and any code seeking to transition a node (an admin operation
>>>>>>>>>>>>>>>>> with much lower performance requirements) consults the JSON
>>>>>>>>>>>>>>>>> data in the nodes.json node, parses it, finds the node in
>>>>>>>>>>>>>>>>> question and checks what it's eligible for (this will
>>>>>>>>>>>>>>>>> correspond to which servlets/apps have been loaded).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I know of a case that would benefit from having separate
>>>>>>>>>>>>>>>>> Query/Update nodes that handle a heavy analysis process.
>>>>>>>>>>>>>>>>> These would be deployed to a number of CPU-heavy boxes (more
>>>>>>>>>>>>>>>>> of which might be added in prep for bulk indexing and removed
>>>>>>>>>>>>>>>>> when the bulk load was done); data could then be hosted on
>>>>>>>>>>>>>>>>> cheaper nodes....
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also maybe think about how this relates to NRT/TLOG/PULL,
>>>>>>>>>>>>>>>>> which are also maybe role-like.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Gus
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 3:17 AM Ishan Chattopadhyaya <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We also wish to add first-class support for Query nodes
>>>>>>>>>>>>>>>>>> that are used to process user queries by forwarding them to
>>>>>>>>>>>>>>>>>> data nodes, merging/aggregating the results and presenting
>>>>>>>>>>>>>>>>>> them to users. This concept exists as a first-class citizen
>>>>>>>>>>>>>>>>>> in most other search engines. This is a chance for Solr to
>>>>>>>>>>>>>>>>>> catch up.
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Ishan / Noble / Hitesh
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> http://www.needhamsoftware.com (work)
>>>>>>>>>>>>>>>>> http://www.the111shift.com (play)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> http://www.needhamsoftware.com (work)
>>>>>>>>>>>>>>> http://www.the111shift.com (play)
>>>>>>>>>>>>>>>
