If the changes and the scope seem acceptable, should we proceed to a vote?

On Mon, Nov 1, 2021 at 5:22 PM Ishan Chattopadhyaya <[email protected]> wrote:

> Hi Gus,
>
> Thanks for the summary.
>
> > (+Gus, +Houston, +Ilan) Positive roles, the existence of which implies
> > functionality: a node always has the role if it can provide the
> > functionality, and if it doesn't have the role it can't provide the
> > functionality.
>
> I've removed the concept of "!data" from the SIP proposal. A node that
> doesn't have the -Dnode.roles parameter will be assumed to have
> -Dnode.roles=data. If a node is started with a node.roles param, the param
> must include "data" if the node is to host data.
>
> > (+Houston, +Ishan, +Gus - below) Rename query role
>
> Coordinator role should be better now.
>
> > - (+Gus) We should include a plan for the overall set of roles to work
> >   towards and then build them out as time allows us to.
> > - (+Gus) We have a distinction between "capable" and "currently
> >   providing"
> > - (+Gus) Capable be evidenced by a config/startup designation that adds
> >   a list of roles to a json file in zk where the nodes are all listed
> > - (+Gus) Providing be evidenced by the node adding a list of ephemeral
> >   nodes (similar to live_nodes) for each role
>
> From an overall conceptual point of view, there doesn't need to be any
> specialization for a role. When a new role is introduced, such details on
> behaviour and implementation can be documented and defined then. As for
> the OVERSEER role today, it can be documented as a role that marks a node
> as a "preferred" overseer (or eligible/capable etc.), and "currently
> providing" can be determined by the OVERSEERSTATUS API call or the
> overseer leader election queue.
>
> > (+Ilan, +Gus) Making collections role aware
>
> Seems to me that this is something that can be introduced as a follow up,
> and we don't want to complicate the proposed design early on.
>
> > Ishan, specifics on how your coordinator node would work would be
> > interesting to know if it really is distinct from my concept of a
> > "query" node. I agree that that term is probably confusing, I used it to
> > mean "query parsing" you meant it as "query aggregator".
>
> As of now, the coordinator node would be capable of servicing query (or,
> at a later point in time, indexing) requests by handling the queries on
> the coordinator node itself and making shard requests to data nodes. If we
> want to have the coordinator nodes do even more work, i.e. do query
> parsing on behalf of the shards, the capability can be further enhanced.
>
> Regards,
> Ishan
>
> On Fri, Oct 29, 2021 at 7:21 PM Gus Heck <[email protected]> wrote:
>
>> edit:
>> 6. (+Gus) Providing be evidenced by the node *adding itself to a list*
>> of ephemeral nodes (similar to live_nodes) for each role
>>
>> On Fri, Oct 29, 2021 at 9:40 AM Gus Heck <[email protected]> wrote:
>>
>>> I've heard a number of folks agree that we should not have negative
>>> (role removal) values for roles (!data in the SIP).
>>>
>>> I also don't like the idea of the "coordinator" creating assumptions
>>> about other roles. I think the point of avoiding "!data" is to make it
>>> programmatically and logically easy to tell what role a node has. If we
>>> have to have a method called figureOutImpliedRoles() with a lot of logic
>>> in it, that's bad. It should just be getRoles().contains(role),
>>> trivially returning the roles that are already declared in
>>> config/zk/whatever.
>>>
>>> We don't have to support every possible role all at once. We can have
>>> "basic functionality" that all nodes provide regardless of roles (right
>>> now that's everything), and then lop off chunks of basic functionality
>>> and assign them to roles. That should be easy and backward compatible if
>>> we then give the new role to every node by default on upgrade.
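
[Editorial sketch] The "positive roles only" idea above, where a role check is just getRoles().contains(role), could look roughly like the following. This is a minimal illustration, not Solr code: the `Role` enum, the `NodeRoles` class, and the `parse` method are hypothetical names, and the "no roles given means all roles" default follows Gus's suggestion in this message (the latest SIP revision quoted at the top of the thread instead assumes "data" only).

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical role names for illustration; the real set is still under discussion.
enum Role { DATA, OVERSEER, COORDINATOR }

final class NodeRoles {
    private final Set<Role> roles;

    private NodeRoles(Set<Role> roles) {
        this.roles = Collections.unmodifiableSet(roles);
    }

    /** Parses a comma-separated roles value such as "data,overseer". */
    static NodeRoles parse(String property) {
        if (property == null || property.trim().isEmpty()) {
            // Assumption: with nothing specified, every role is assigned explicitly,
            // so callers never need a special case for "no roles".
            return new NodeRoles(EnumSet.allOf(Role.class));
        }
        EnumSet<Role> parsed = EnumSet.noneOf(Role.class);
        for (String token : property.split(",")) {
            String name = token.trim();
            if (name.startsWith("!")) {
                // Negative (role removal) values are rejected outright.
                throw new IllegalArgumentException("Negative roles are not supported: " + name);
            }
            // Role.valueOf throws IllegalArgumentException for unknown names.
            parsed.add(Role.valueOf(name.toUpperCase(Locale.ROOT)));
        }
        return new NodeRoles(parsed);
    }

    /** Returns exactly the declared roles; nothing is implied or derived. */
    Set<Role> getRoles() {
        return roles;
    }
}
```

With this shape, a caller that needs to know whether a node may host replicas only asks `nodeRoles.getRoles().contains(Role.DATA)`, and there is no figureOutImpliedRoles() step.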
>>>
>>> However we should carefully think about what should and shouldn't be
>>> part of any role, because moving functionality out of a role back to
>>> basic functionality or between roles will create backwards compatibility
>>> issues. This is why I think we should have a concept of what roles we
>>> will have in the future, so we don't inadvertently move functionality
>>> into a role that later needs to go in some other role (mistakes/bugs may
>>> happen of course, but best effort).
>>>
>>> So boiling it down I've seen suggestions for the following
>>> additions/edits to the SIP:
>>>
>>> 1. (+Gus, +Houston, +Ilan) Positive roles, the existence of which
>>>    implies functionality: a node always has the role if it can provide
>>>    the functionality, and if it doesn't have the role it can't provide
>>>    the functionality.
>>> 2. (+Houston, +Ishan, +Gus - below) Rename query role
>>> 3. (+Gus) We should include a plan for the overall set of roles to work
>>>    towards and then build them out as time allows us to.
>>> 4. (+Gus) We have a distinction between "capable" and "currently
>>>    providing"
>>> 5. (+Gus) Capable be evidenced by a config/startup designation that adds
>>>    a list of roles to a json file in zk where the nodes are all listed
>>> 6. (+Gus) Providing be evidenced by the node adding a list of ephemeral
>>>    nodes (similar to live_nodes) for each role
>>> 7. (+Ilan, +Gus) Making collections role aware
>>>
>>> Ilan suggested that we make collections role-aware, which would make
>>> some sense since the collection might want to have a minimum of 2
>>> query-aggregator nodes available, might want to avoid zk nodes, etc. I
>>> think that this is a good next feature and the intention should be added
>>> to the SIP, but it need not be in the initial implementation, since by
>>> default everything can have all roles (roles implemented to date).
>>> Initially, removing roles from nodes will be an advanced/manual feature
>>> mostly applicable to static clusters that don't add collections
>>> regularly. Support for role-aware collections can then be added to make
>>> the feature useful for a wider audience (should be its own ticket
>>> anyway, and it interacts with replica placement).
>>>
>>> I've heard several agree with #1, and it seems 3-6 were either not yet
>>> clear or folks are still deliberating, as I haven't noticed positive or
>>> negative opinions there, just some discussion of the definition of
>>> candidate roles. I'm fond of 3-5 because it allows for things like
>>> knowing what the capabilities of a down node are, and finding a provider
>>> without having to cross-coordinate with live_nodes (keeps code simple,
>>> avoids racing between the check for liveness and the check for the
>>> capability). Also, a node joining as live and able to serve queries can
>>> be decoupled from when it's ready to provide a service (thinking at
>>> least zk here, waiting for a 2nd node capable of zk before expanding the
>>> zk cluster to avoid even numbered clusters).
>>>
>>> Ishan, specifics on how your coordinator node would work would be
>>> interesting to know if it really is distinct from my concept of a
>>> "query" node. I agree that that term is probably confusing, I used it to
>>> mean "query parsing" you meant it as "query aggregator".
>>>
>>> As a side note, with positive only roles and all roles added unless
>>> specified otherwise, Ishan's use case might be as simple as just
>>> removing the DATA role from a few nodes and restricting the aggregation
>>> queries concerned to those nodes. To get solr to enforce the restriction
>>> for you, a "query/compute/coordinator" role must then be removed from
>>> the remainder of the nodes.
>>>
>>> -Gus
>>>
>>> On Fri, Oct 29, 2021 at 5:49 AM Ishan Chattopadhyaya
>>> <[email protected]> wrote:
>>>
>>>> > I'll introduce a change to the SIP document, unless there are
>>>> > objections/questions/concerns. WDYT?
>>>>
>>>> I've made the change to the document. Feedback on this welcome.
>>>>
>>>> On Fri, Oct 29, 2021 at 2:52 PM Ishan Chattopadhyaya
>>>> <[email protected]> wrote:
>>>>
>>>>> It seems to me, after going through this thread, that the role
>>>>> "query" is misleading for what we're planning to introduce in
>>>>> SOLR-15715. We're essentially introducing a functionality for
>>>>> performing what we initially called "query aggregations". The actual
>>>>> queries run on data nodes anyway, just that the first point of entry
>>>>> for such distributed queries will be a separate node with this extra
>>>>> functionality. Towards that, I feel we should call a node with such
>>>>> capability a "coordinator" node (instead of "query node"). It fits
>>>>> nicely with the mental model of ElasticSearch as well:
>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#coordinating-node
>>>>>
>>>>> Proposing that if a node has a role "coordinator", then that node is
>>>>> already assumed to have no data replicas on it. If a node starts with
>>>>> both roles, "coordinator,data", then the startup should fail with a
>>>>> message saying a coordinator node should not host data on it. A
>>>>> coordinator node, though, can have other roles like "zookeeper" or
>>>>> "overseer" etc.
>>>>>
>>>>> I'll introduce a change to the SIP document, unless there are
>>>>> objections/questions/concerns. WDYT?
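
[Editorial sketch] The startup rule proposed in this message (a node declaring both "coordinator" and "data" fails to start) might be checked as below. This is only an illustration under stated assumptions: `RoleStartupCheck`, `parseRoles`, and `validate` are hypothetical names, the roles are modelled as plain strings, and the error message paraphrases the one suggested above. Later messages in the thread question whether "coordinator" should imply anything about other roles, so this reflects the proposal only as it stood on Oct 29.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class RoleStartupCheck {

    /** Splits a comma-separated roles value into trimmed, lower-cased names. */
    static Set<String> parseRoles(String property) {
        Set<String> roles = new LinkedHashSet<>();
        if (property == null) {
            return roles;
        }
        for (String token : property.split(",")) {
            String name = token.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                roles.add(name);
            }
        }
        return roles;
    }

    /** Fails fast when the declared roles contradict each other. */
    static void validate(Set<String> roles) {
        if (roles.contains("coordinator") && roles.contains("data")) {
            throw new IllegalStateException("A coordinator node should not host data on it");
        }
    }
}
```

Under this sketch, "coordinator,overseer" passes validation, while "coordinator,data" aborts startup with the message above.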
>>>>>
>>>>> On Fri, Oct 29, 2021 at 1:54 PM Ilan Ginzburg <[email protected]> wrote:
>>>>>
>>>>>> If we make collections role-aware for example (replicas of that
>>>>>> collection can only be placed on nodes with a specific role, in
>>>>>> addition to the other role based constraints), the set of roles
>>>>>> should be user extensible and not fixed.
>>>>>>
>>>>>> If collections are not role aware, the constraints introduced by
>>>>>> roles apply to all collections equally, which might be insufficient
>>>>>> if a user needs for example a heavily used collection to only be
>>>>>> placed on more powerful nodes.
>>>>>>
>>>>>> Ilan
>>>>>>
>>>>>> On Thu 28 Oct 2021 at 07:59, Gus Heck <[email protected]> wrote:
>>>>>>
>>>>>>> On Wed, Oct 27, 2021 at 3:34 PM Houston Putman
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>>> I don't think it's unreasonable to want to have nodes that don't
>>>>>>>>> accept queries. This is just Ishan's use case.
>>>>>>>>
>>>>>>>> Maybe I am misunderstanding, and it does deal with your last point
>>>>>>>> about inter-node communication, but Peer-sync uses queries when
>>>>>>>> doing replication between replicas. If a node doesn't have queries
>>>>>>>> enabled, then it's possible to break peer sync. There are other
>>>>>>>> options to make sure certain replicas aren't queried
>>>>>>>> (shards.preference).
>>>>>>>>
>>>>>>>> For the separation of update/query traffic, I understand having
>>>>>>>> compute nodes that deal with pre-replica commands, such as managing
>>>>>>>> distributed queries or pre-processing documents in the URP chain.
>>>>>>>> But for actual non-distrib queries and final update requests, the
>>>>>>>> only way to actually separate this traffic is using PULL/TLOG
>>>>>>>> replicas, because otherwise (with NRT) all update requests are
>>>>>>>> still going to the query nodes, just the same as non-query nodes
>>>>>>>> that are hosting those replicas. (And shard leadership could go to
>>>>>>>> any "data" node, since I imagine we wouldn't filter out the "query"
>>>>>>>> nodes...) The shards.preference option makes it easy to send
>>>>>>>> queries to only PULL replicas in this scenario. That's why I saw
>>>>>>>> this more as a "compute" role rather than "query".
>>>>>>>
>>>>>>> Yeah, for internal stuff we still need the ability to query so we
>>>>>>> might need to accommodate that, but you may not have noticed the bit
>>>>>>> where I mentioned Query nodes doing the parsing/analysis of the
>>>>>>> query and then sending a fully parsed (possibly serialized lucene
>>>>>>> objects) query to the data node. So data nodes would only speak a
>>>>>>> single lucene level query language and not parse queries or analyze
>>>>>>> text. In theory, with that, one could even have something like
>>>>>>> elastic reduce a request to lucene objects and then get an answer
>>>>>>> from a solr data node (for simple cases without need to find shards
>>>>>>> via zookeeper etc). Certainly many details and corner cases to think
>>>>>>> about there.
>>>>>>>
>>>>>>>>> Definitely not what I would like. If I'm going to try to segregate
>>>>>>>>> data nodes out to certain nodes, I don't want them appearing
>>>>>>>>> elsewhere just cause something went down or filled up. Nor would I
>>>>>>>>> want updates to suddenly start bogging down my query nodes....
>>>>>>>>
>>>>>>>> I guess it depends on what you are optimizing for. Maybe there can
>>>>>>>> be an option for this, like -DnonLenientRoles=data,update or
>>>>>>>> something like that.
>>>>>>>
>>>>>>> A possibility
>>>>>>>
>>>>>>>> On Wed, Oct 27, 2021 at 3:03 PM Gus Heck <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Oct 27, 2021 at 2:44 PM Houston Putman
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> As for the "query" role, let's name it something better like
>>>>>>>>>> "compute", since data nodes are always going to be "querying".
>>>>>>>>>
>>>>>>>>> I don't think it's unreasonable to want to have nodes that don't
>>>>>>>>> accept queries. This is just Ishan's use case.
>>>>>>>>>
>>>>>>>>>> If no live nodes have roles=overseer (or roles=all), then we
>>>>>>>>>> should just select any node to be overseer. This should be the
>>>>>>>>>> same for compute, data, etc.
>>>>>>>>>
>>>>>>>>> Definitely not what I would like. If I'm going to try to segregate
>>>>>>>>> data nodes out to certain nodes, I don't want them appearing
>>>>>>>>> elsewhere just cause something went down or filled up. Nor would I
>>>>>>>>> want updates to suddenly start bogging down my query nodes....
>>>>>>>>>
>>>>>>>>>>> So, for the proposal, let's say "data" is a special role which
>>>>>>>>>>> is assumed by default, and is enabled on all nodes unless
>>>>>>>>>>> there's a !data.
>>>>>>>>>>
>>>>>>>>>> Instead of this, maybe we have role groups. Such as
>>>>>>>>>> admin~=overseer,zk or worker~=compute,data,updateProcessing
>>>>>>>>>
>>>>>>>>> Role groups sound like a next level feature to be built on top
>>>>>>>>> once we figure out how roles work independently.
>>>>>>>>>
>>>>>>>>>> As for the suggested Roles, I'm not sure ADMIN or UI really fit,
>>>>>>>>>> since there is another option to disable the UI for a solr node,
>>>>>>>>>> and various ADMIN commands have to be accepted across other node
>>>>>>>>>> roles. (Data nodes require the Collections API, same with the
>>>>>>>>>> overseer.)
>>>>>>>>>
>>>>>>>>> I admit I'm angling towards a world in which enabling and
>>>>>>>>> disabling broad features is done in one way in one place... As for
>>>>>>>>> admin, there might be a distinction between commands issued
>>>>>>>>> between nodes and from the outside world... I have this other idea
>>>>>>>>> about inter-node communication that's even less baked that I won't
>>>>>>>>> distract with here ;)
>>>>>>>>>
>>>>>>>>>> - Houston
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 27, 2021 at 1:34 PM Ishan Chattopadhyaya
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> bq. In other words, roles are all "positive", but their
>>>>>>>>>>> consequences are only negative (rejecting when the matching
>>>>>>>>>>> positive role is not present).
>>>>>>>>>>>
>>>>>>>>>>> Essentially, yes. A node that doesn't specify any role should be
>>>>>>>>>>> able to do everything.
>>>>>>>>>>>
>>>>>>>>>>> Let me just take a brief detour and mention our thoughts on the
>>>>>>>>>>> "query" role. While all data nodes can also be used for
>>>>>>>>>>> querying, our idea was to create a layer of nodes that have some
>>>>>>>>>>> special mechanism to be able to proxy/forward queries to data
>>>>>>>>>>> nodes (let's call it "pseudo cores" or "synthetic cores" or
>>>>>>>>>>> "proxy cores"). Our thought was that any node that has the
>>>>>>>>>>> "query,!data" role would enable this special mode on startup
>>>>>>>>>>> (whereby requests are served by these special pseudo cores).
>>>>>>>>>>> We'll discuss this in detail in that issue.
>>>>>>>>>>>
>>>>>>>>>>> Back to the main subject here.
>>>>>>>>>>>
>>>>>>>>>>> Let's take a practical scenario:
>>>>>>>>>>> * Layer1: Organization has about 100 nodes, each node has many
>>>>>>>>>>>   data replicas
>>>>>>>>>>> * Layer2: To manage such a large cluster reliably, they keep
>>>>>>>>>>>   aside 4-5 dedicated overseer nodes.
>>>>>>>>>>> * Layer3: Since query aggregations/coordination can potentially
>>>>>>>>>>>   be expensive, they keep aside 5-10 query nodes.
>>>>>>>>>>>
>>>>>>>>>>> My preference would be as follows:
>>>>>>>>>>> * I'd like to refer to Layer1 nodes as the "data nodes" and
>>>>>>>>>>>   hence get either no role defined for them or
>>>>>>>>>>>   -Dnode.roles=data.
>>>>>>>>>>> * I'd like to refer to Layer2 nodes as "overseer nodes" (even
>>>>>>>>>>>   though I understand only one of them can be an overseer at a
>>>>>>>>>>>   time). I'd like to have -Dnode.roles=!data,overseer
>>>>>>>>>>> * I'd like to refer to Layer3 nodes as "query nodes", with
>>>>>>>>>>>   -Dnode.roles=!data,query
>>>>>>>>>>>
>>>>>>>>>>> ^ This seems very practical from an operational standpoint.
>>>>>>>>>>>
>>>>>>>>>>> So, for the proposal, let's say "data" is a special role which
>>>>>>>>>>> is assumed by default, and is enabled on all nodes unless
>>>>>>>>>>> there's a !data. It is presumed that data nodes can also serve
>>>>>>>>>>> queries directly, so adding a "query" to those nodes is
>>>>>>>>>>> meaningless (also because there's no practical benefit to
>>>>>>>>>>> stopping a data node from receiving a query for "!query" role to
>>>>>>>>>>> be useful).
>>>>>>>>>>>
>>>>>>>>>>> "query" role on nodes that don't host data really refers to a
>>>>>>>>>>> special capability for lightweight, stateless nodes. I don't
>>>>>>>>>>> want to add a "!query" on dedicated overseer nodes, and hence I
>>>>>>>>>>> don't want to assume that "query" is implicitly available on any
>>>>>>>>>>> node even if the role isn't specified.
>>>>>>>>>>>
>>>>>>>>>>> "overseer" role is complicated, since it is already defined and
>>>>>>>>>>> we don't have the opportunity to define it the right way. I'd
>>>>>>>>>>> hate having to put a "!overseer" on every data node on startup
>>>>>>>>>>> in order to have a few dedicated overseers.
>>>>>>>>>>>
>>>>>>>>>>> In short, in this SIP, I just wish to implement the concept of
>>>>>>>>>>> nodes and its handling. How individual roles are leveraged can
>>>>>>>>>>> be up to every new role's implementation.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 27, 2021 at 9:54 PM Gus Heck <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> In other words, roles are all "positive", but their
>>>>>>>>>>>>> consequences are only negative (rejecting when the matching
>>>>>>>>>>>>> positive role is not present).
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah right. to do something the machine needs the role
>>>>>>>>>>>>
>>>>>>>>>>>>> We can also consider no role defined = all roles allowed. Will
>>>>>>>>>>>>> make things simpler.
>>>>>>>>>>>>
>>>>>>>>>>>> in terms of startup command yes. Internally we should have all
>>>>>>>>>>>> explicitly assigned when no roles are specified at startup so
>>>>>>>>>>>> that the code doesn't have a million if checks for the empty
>>>>>>>>>>>> case
>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 6:14 PM Ilan Ginzburg
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> How do we expect the roles to be used?
>>>>>>>>>>>>>> One way I see is a node refusing to do anything related to a
>>>>>>>>>>>>>> role it doesn't have.
>>>>>>>>>>>>>> For example if a node does not have role "data", any attempt
>>>>>>>>>>>>>> to create a core on it would fail.
>>>>>>>>>>>>>> A node not having the role "query", will refuse to have
>>>>>>>>>>>>>> anything to do with handling a query etc.
>>>>>>>>>>>>>> Then it would be up to other code to make sure only the
>>>>>>>>>>>>>> appropriate nodes are requested to do any type of action.
>>>>>>>>>>>>>> So for example any replica placement code plugin would have
>>>>>>>>>>>>>> to restrict the set of candidate nodes for a new replica
>>>>>>>>>>>>>> placement to those having "data". Otherwise the call would
>>>>>>>>>>>>>> fail, and there should be nothing the replica placement code
>>>>>>>>>>>>>> can do about it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly, the "overseer" role would limit the nodes that
>>>>>>>>>>>>>> participate in the Overseer election. The Overseer election
>>>>>>>>>>>>>> code would have to remove (or not add) all non qualifying
>>>>>>>>>>>>>> nodes from the election, and we should expect a node without
>>>>>>>>>>>>>> role "overseer" to refuse to start the Overseer machinery if
>>>>>>>>>>>>>> asked to...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Trying to make the use case clear regarding how roles are
>>>>>>>>>>>>>> used.
>>>>>>>>>>>>>> Ilan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 5:47 PM Gus Heck <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 9:55 AM Ishan Chattopadhyaya
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Gus,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I think that we should expand/edit your list of roles to
>>>>>>>>>>>>>>>> > be
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The list can be expanded as and when more isolation and
>>>>>>>>>>>>>>>> features are needed. I only listed those roles that we
>>>>>>>>>>>>>>>> already have a functionality for or is under development.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Well all of those roles (except zookeeper) are things nodes
>>>>>>>>>>>>>>> do today. As it stands they are all doing all of them. What
>>>>>>>>>>>>>>> we add support for as we move forward is starting without a
>>>>>>>>>>>>>>> role, and add the zookeeper role when that feature is ready.
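
[Editorial sketch] Ilan's description above, where replica placement and Overseer election each restrict their candidates to nodes holding the relevant role, amounts to a simple filter over declared roles. The following is a hypothetical illustration, not Solr's placement plugin API: `RoleFilter`, `nodesWithRole`, and the map of node name to role names are all assumed for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RoleFilter {

    /**
     * Returns the names of the nodes whose declared roles include the given role,
     * sorted so the result is deterministic.
     */
    static List<String> nodesWithRole(Map<String, Set<String>> rolesByNode, String role) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : rolesByNode.entrySet()) {
            if (entry.getValue().contains(role)) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

Placement code would call `nodesWithRole(roles, "data")` before choosing a target, and election code would call `nodesWithRole(roles, "overseer")`; a node outside the returned list would still be expected to refuse the request on its own.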
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I would like to recommend that the roles be all positive
>>>>>>>>>>>>>>>> > ("Can do this") and nodes with no role at all are
>>>>>>>>>>>>>>>> > ineligible for all activities.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It comes down to the defaults and backcompat. If we want
>>>>>>>>>>>>>>>> all Solr nodes to be able to host data replicas by default
>>>>>>>>>>>>>>>> (without user explicitly specifying role=data), then we
>>>>>>>>>>>>>>>> need a way to unset this role. The most reasonable way
>>>>>>>>>>>>>>>> sounded like a "!data". We can do away with !data if we
>>>>>>>>>>>>>>>> mandate each and every data node have the role "data"
>>>>>>>>>>>>>>>> explicitly defined for it, which breaks backcompat and also
>>>>>>>>>>>>>>>> is cumbersome to use for those who don't want to use these
>>>>>>>>>>>>>>>> special roles.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure I understand, which of the roles I mentioned (other
>>>>>>>>>>>>>>> than zookeeper, which I expect is intended as different from
>>>>>>>>>>>>>>> our current embedded zk) is NOT currently supported by a
>>>>>>>>>>>>>>> single cloud node brought up as shown in our tutorials/docs?
>>>>>>>>>>>>>>> I'm certainly not proposing that the default change to
>>>>>>>>>>>>>>> nothing. The default is all roles, unless you specify roles
>>>>>>>>>>>>>>> at startup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I also suggest that these roles each have a node in
>>>>>>>>>>>>>>>> > zookeeper listing the current member nodes (as child
>>>>>>>>>>>>>>>> > nodes) so that code that wants to find a node with an
>>>>>>>>>>>>>>>> > appropriate role does not need to scan the list of all
>>>>>>>>>>>>>>>> > nodes parsing something to discover which nodes apply and
>>>>>>>>>>>>>>>> > also does not have to parse json to do it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /roles.json exists today, it has role as key and list of
>>>>>>>>>>>>>>>> nodes as value. In the next major version, we can change
>>>>>>>>>>>>>>>> the format of that file and use key as node, value as list
>>>>>>>>>>>>>>>> of roles. Or, maybe we can go for adding the roles to the
>>>>>>>>>>>>>>>> data for each item in the list of live_nodes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not finding anything in our documentation about
>>>>>>>>>>>>>>> roles.json so I think it's an internal implementation
>>>>>>>>>>>>>>> detail, which reduces back compat concerns.
>>>>>>>>>>>>>>> ADDROLE/REMOVEROLE don't accept json or anything like that
>>>>>>>>>>>>>>> and could be made to work with zk nodes too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The fact that some precursor work was done without a SIP (or
>>>>>>>>>>>>>>> before SIPs existed) should not hamstring our design once a
>>>>>>>>>>>>>>> SIP that clearly covers the same topic is under
>>>>>>>>>>>>>>> consideration. By their nature SIPs are non-trivial and
>>>>>>>>>>>>>>> often will include compatibility breaks. Good news is I
>>>>>>>>>>>>>>> don't think I see one here, just a code change to transition
>>>>>>>>>>>>>>> to a different zk backend. I think that it's probably a
>>>>>>>>>>>>>>> mistake to consider our zookeeper data a public API and we
>>>>>>>>>>>>>>> should be moving away from that or at the very least
>>>>>>>>>>>>>>> segregating clearly what in zk is long term reliable.
>>>>>>>>>>>>>>> Ideally our v1/v2 APIs should be the public API through
>>>>>>>>>>>>>>> which information about the cluster is obtained. Programming
>>>>>>>>>>>>>>> directly against zk is kind of like a custom build of solr.
>>>>>>>>>>>>>>> Sometimes useful and appropriate, but maintenance is your
>>>>>>>>>>>>>>> concern. For code plugging into solr, it should in theory be
>>>>>>>>>>>>>>> against an internal information java api, and zookeeper
>>>>>>>>>>>>>>> should not be touched directly. (I know this is not in a
>>>>>>>>>>>>>>> good state or at least wasn't last time I looked closely,
>>>>>>>>>>>>>>> but it should be where we are heading).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > any code seeking to transition a node
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We considered this situation and realized that it is very
>>>>>>>>>>>>>>>> risky to have nodes change roles while they are up and
>>>>>>>>>>>>>>>> running. Better to assign fixed roles upon startup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree that concurrency is hard. I definitely think startup
>>>>>>>>>>>>>>> time assignments should be involved here. I'm not thinking
>>>>>>>>>>>>>>> that every transition must be supported. As a starting point
>>>>>>>>>>>>>>> it would be fine if none were. Having something suddenly
>>>>>>>>>>>>>>> become zookeeper is probably tricky to support (see
>>>>>>>>>>>>>>> discussion in that thread regarding nodes not actually
>>>>>>>>>>>>>>> participating until they have a partner to join with them to
>>>>>>>>>>>>>>> avoid even numbered clusters), but I think the design should
>>>>>>>>>>>>>>> not preclude the possibility of nodes becoming eligible for
>>>>>>>>>>>>>>> some roles or withdrawing from some roles, and treatment of
>>>>>>>>>>>>>>> roles should be consistent. In some cases someone may decide
>>>>>>>>>>>>>>> it's worth the work of handling the concurrency concerns;
>>>>>>>>>>>>>>> best if they don't have to break back compat or hack their
>>>>>>>>>>>>>>> code around the assumption it wouldn't happen to do it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Taking the zookeeper case as an example, it very much might
>>>>>>>>>>>>>>> be desirable to have the possibility to heal the zk cluster
>>>>>>>>>>>>>>> by promoting another node (configured as eligible for zk) to
>>>>>>>>>>>>>>> active zk duty if one of the current zk nodes has been down
>>>>>>>>>>>>>>> long enough (say on prem hardware, motherboard pops a
>>>>>>>>>>>>>>> capacitor, server gone for a week while new hardware is
>>>>>>>>>>>>>>> purchased, built and configured). Especially if the down
>>>>>>>>>>>>>>> node didn't hold data or other nodes had sufficient replicas
>>>>>>>>>>>>>>> and the cluster is still answering queries just fine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I know of a case that would benefit from having separate
>>>>>>>>>>>>>>>> > Query/Update nodes that handle a heavy analysis process
>>>>>>>>>>>>>>>> > which would be deployed to a number of CPU heavy boxes
>>>>>>>>>>>>>>>> > (which might add more in prep for bulk indexing, and
>>>>>>>>>>>>>>>> > remove them when bulk was done), data could then be
>>>>>>>>>>>>>>>> > hosted on cheaper nodes....
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is the main motivation behind this work. SOLR-15715
>>>>>>>>>>>>>>>> needs this, and hence it would be good to get this in as
>>>>>>>>>>>>>>>> soon as possible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we can incrementally work towards configurability
>>>>>>>>>>>>>>> for all of these roles. The current default state is that a
>>>>>>>>>>>>>>> node has all roles and the incremental progress is to enable
>>>>>>>>>>>>>>> removing a role from a node. This I think is why it might be
>>>>>>>>>>>>>>> good to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A) Determine the set of roles our current solr nodes are
>>>>>>>>>>>>>>> performing (that might be removed in some scenario) and
>>>>>>>>>>>>>>> document this via assigning these roles as default on as
>>>>>>>>>>>>>>> this SIP goes live.
>>>>>>>>>>>>>>> B) Figure out what the process of adding something entirely
>>>>>>>>>>>>>>> new that we haven't yet thought of with its own role would
>>>>>>>>>>>>>>> look like.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it would be great if we not only satisfied the
>>>>>>>>>>>>>>> current need but determined how we expect this to change
>>>>>>>>>>>>>>> over time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Ishan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 6:32 PM Gus Heck
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The SIP looks like a good start, and I was already
>>>>>>>>>>>>>>>>> thinking of something very similar to this as a follow on
>>>>>>>>>>>>>>>>> to my attempts to split the uber filter
>>>>>>>>>>>>>>>>> (SolrDispatchFilter) into servlets such that roles
>>>>>>>>>>>>>>>>> determine what servlets are deployed, but I would like to
>>>>>>>>>>>>>>>>> recommend that the roles be all positive ("Can do this")
>>>>>>>>>>>>>>>>> and nodes with no role at all are ineligible for all
>>>>>>>>>>>>>>>>> activities (just like standard role permissioning
>>>>>>>>>>>>>>>>> systems). This will make it much more familiar and easy to
>>>>>>>>>>>>>>>>> think about. Therefore there would be no need for a role
>>>>>>>>>>>>>>>>> such as !data which I presume was meant to mean "no data
>>>>>>>>>>>>>>>>> on this node"... rather just don't give the "data" role to
>>>>>>>>>>>>>>>>> the node.
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Additional node roles I think should exist: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think that we should expand/edit your list of roles to be >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> - QUERY - accepts and analyzes queries up to the point >>>>>>>>>>>>>>>>> of actually consulting the lucene index (useful if you >>>>>>>>>>>>>>>>> have a very heavy >>>>>>>>>>>>>>>>> analysis phase) >>>>>>>>>>>>>>>>> - UPDATE - accepts update requests, and performs >>>>>>>>>>>>>>>>> update functionality prior to and including >>>>>>>>>>>>>>>>> DistributedUpdateProcessorFactory (useful if you have a >>>>>>>>>>>>>>>>> very heavy analysis >>>>>>>>>>>>>>>>> phase) >>>>>>>>>>>>>>>>> - ADMIN - accepts admin/management commands >>>>>>>>>>>>>>>>> - UI - hosts an admin ui >>>>>>>>>>>>>>>>> - ZOOKEEPER - hosts embedded zookeeper >>>>>>>>>>>>>>>>> - OVERSEER - performs overseer related functionality >>>>>>>>>>>>>>>>> (though IIRC there's a proposal to eliminate overseer that >>>>>>>>>>>>>>>>> might eliminate >>>>>>>>>>>>>>>>> this) >>>>>>>>>>>>>>>>> - DATA - nodes where there is a lucene index and >>>>>>>>>>>>>>>>> matching against the analyzed results of a query may be >>>>>>>>>>>>>>>>> conducted to >>>>>>>>>>>>>>>>> generate a response, also performs update steps that come >>>>>>>>>>>>>>>>> after >>>>>>>>>>>>>>>>> DistributedUpdateProcesserFactory >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I also suggest that these roles each have a node in >>>>>>>>>>>>>>>>> zookeeper listing the current member nodes (as child nodes) >>>>>>>>>>>>>>>>> so that code >>>>>>>>>>>>>>>>> that wants to find a node with an appropriate role does not >>>>>>>>>>>>>>>>> need to scan >>>>>>>>>>>>>>>>> the list of all nodes parsing something to discover which >>>>>>>>>>>>>>>>> nodes apply and >>>>>>>>>>>>>>>>> also does not have to parse json to do it. 
>>>>>>>>>>>>>>>>> I think this will be particularly key for zookeeper nodes
>>>>>>>>>>>>>>>>> which might be 3 out of 100 or more nodes. Similar to how we
>>>>>>>>>>>>>>>>> track live nodes. I think we should have a nodes.json too that
>>>>>>>>>>>>>>>>> tracks what roles a node is ALLOWED to take (as opposed to
>>>>>>>>>>>>>>>>> which roles it is currently servicing).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So running code consults the zookeeper role list of nodes, and
>>>>>>>>>>>>>>>>> any code seeking to transition a node (an admin operation with
>>>>>>>>>>>>>>>>> much lower performance requirements) consults the json data in
>>>>>>>>>>>>>>>>> the nodes.json node, parses it, finds the node in question and
>>>>>>>>>>>>>>>>> checks what it's eligible for (this will correspond to which
>>>>>>>>>>>>>>>>> servlets/apps have been loaded).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I know of a case that would benefit from having separate
>>>>>>>>>>>>>>>>> Query/Update nodes that handle a heavy analysis process which
>>>>>>>>>>>>>>>>> would be deployed to a number of CPU heavy boxes (which might
>>>>>>>>>>>>>>>>> add more in prep for bulk indexing, and remove them when bulk
>>>>>>>>>>>>>>>>> was done), data could then be hosted on cheaper nodes....
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also maybe think about how this relates to NRT/TLOG/PULL which
>>>>>>>>>>>>>>>>> are also maybe role-like.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Gus
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 3:17 AM Ishan Chattopadhyaya < [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We also wish to add first class support for Query nodes that
>>>>>>>>>>>>>>>>>> are used to process user queries by forwarding to data nodes,
>>>>>>>>>>>>>>>>>> merging/aggregating the results and presenting them to users.
>>>>>>>>>>>>>>>>>> This concept exists as a first class citizen in most other
>>>>>>>>>>>>>>>>>> search engines. This is a chance for Solr to catch up.
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Ishan / Noble / Hitesh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> http://www.needhamsoftware.com (work)
>>>>>>>>>>>>>>>>> http://www.the111shift.com (play)
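
Gus's Oct 27 message above separates two lookups: a per-role list in zookeeper of the nodes currently serving that role (similar to live nodes), and a nodes.json recording which roles each node is ALLOWED to take. A minimal in-memory Java sketch of that split might look like the following. Nothing here is Solr code: the class RoleRegistrySketch, its methods and the node names are hypothetical, and plain maps stand in for the zookeeper nodes. Only the role names come from the thread.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these types exist in Solr.
final class RoleRegistrySketch {

  // Role names as proposed in the thread.
  enum NodeRole { QUERY, UPDATE, ADMIN, UI, ZOOKEEPER, OVERSEER, DATA }

  // Roles each node is ALLOWED to take (stands in for the proposed nodes.json).
  private final Map<String, EnumSet<NodeRole>> allowed = new HashMap<>();

  // Nodes currently serving each role (stands in for per-role child nodes in zookeeper).
  private final Map<NodeRole, Set<String>> providing = new EnumMap<>(NodeRole.class);

  // Record the positive roles a node was started with; no roles means eligible for nothing.
  public void declare(String node, NodeRole... roles) {
    EnumSet<NodeRole> set = EnumSet.noneOf(NodeRole.class);
    Collections.addAll(set, roles);
    allowed.put(node, set);
  }

  // Admin-path check: is this node allowed to take the role at all?
  public boolean isEligible(String node, NodeRole role) {
    EnumSet<NodeRole> roles = allowed.get(node);
    return roles != null && roles.contains(role);
  }

  // Promote a node into a role; refused unless the node was declared eligible.
  public boolean startProviding(String node, NodeRole role) {
    if (!isEligible(node, role)) {
      return false;
    }
    providing.computeIfAbsent(role, r -> new TreeSet<>()).add(node);
    return true;
  }

  // Hot-path lookup: direct read of the role's member list, no scan over all nodes.
  public Set<String> nodesProviding(NodeRole role) {
    Set<String> nodes = providing.get(role);
    return nodes == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(nodes);
  }

  public static void main(String[] args) {
    RoleRegistrySketch registry = new RoleRegistrySketch();
    registry.declare("node1:8983_solr", NodeRole.DATA, NodeRole.ZOOKEEPER);
    registry.declare("node2:8983_solr", NodeRole.QUERY);

    registry.startProviding("node1:8983_solr", NodeRole.ZOOKEEPER);
    System.out.println(registry.nodesProviding(NodeRole.ZOOKEEPER));
    // node2 was never given the DATA role, so the promotion is refused.
    System.out.println(registry.startProviding("node2:8983_solr", NodeRole.DATA));
  }
}
```

Keeping the two maps separate mirrors the "allowed" versus "currently servicing" distinction: running code only reads nodesProviding(role), while the rarer transition path checks eligibility first.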
