Re: Approach for a new Autoscaling framework

Varun Thacker Sun, 26 Jul 2020 15:11:12 -0700

On Sun, Jul 26, 2020 at 1:05 AM Ilan Ginzburg <[email protected]> wrote:


> Varun, you're correct.
> This PR was built based on what's needed for creation (easiest starting
> point for me and likely most urgent need). It's still totally WIP and
> following steps include building the API required for move and other
> placement based needs, then also everything related to triggers (see the
> Jira).
>
> Collection API commands (Solr provided implementation, not a plug-in) will
> build the requests they need, then call the plug-in (custom one or a defaut
> one), and use the returned "work items" (more types of work items will be
> introduced of course) to do the job (know where to place or where to move
> or what to remove or add etc.)
>

This sounds perfect!

I'd be interested to see how can we use SamplePluginMinimizeCores for say
create collection but use FooPluginMinimizeLoad for add-replica

>
> Ilan
>
> Le dim. 26 juil. 2020 à 04:13, Varun Thacker <[email protected]> a écrit :
>
>> Hi Ilan,
>>
>> I like where we're going with
>> https://github.com/apache/lucene-solr/pull/1684 . Correct me if I am
>> wrong, but my understanding of this PR is we're defining the interfaces for
>> creating policies
>>
>> What's not clear to me is how will existing collection APIs like
>> create-collections/add-replica etc make use of it? Is that something that
>> has been discussed somewhere that I could read up on?
>>
>>
>>
>> On Sat, Jul 25, 2020 at 2:03 PM Ilan Ginzburg <[email protected]> wrote:
>>
>>> Thanks Gus!
>>> This makes a lot of sense but significantly increases IMO the scope and
>>> effort to define an "Autoscaling" framework interface.
>>>
>>> I'd be happy to try to see what concepts could be shared and how a
>>> generic plugin facade could be defined.
>>>
>>> What are the other types of plugins that would share such a unified
>>> approach? Do they already exist under another form or are just projects at
>>> this stage, like Autoscaling plugins?
>>>
>>> But... Assuming this is the first "facade" layer to be defined between
>>> Solr and external code, it might be hard to make it generic and get it
>>> right. There's value in starting simple, understanding the tradeoffs and
>>> generalizing later.
>>>
>>> Also I'd like to make sure we're not paying a performance "genericity
>>> tax" in Autoscaling for unneeded features.
>>>
>>> Ilan
>>>
>>> Le sam. 25 juil. 2020 à 16:02, Gus Heck <[email protected]> a écrit :
>>>
>>>> Scanned through the PR and read some of this thread. I likely have
>>>> missed much other discussion, so forgive me if I'm dredging up somethings
>>>> that are already discussed elsewhere.
>>>>
>>>> The idea of designing the interfaces defining what information is
>>>> available seems good here, but I worry that it's too auto-scaling focused.
>>>> In my imagination, I would see solr having a standard informational
>>>> interface that is useful to any plugin of any sort. Autoscaling should be
>>>> leveraging that and we should be enhancing that to enable autoscaling. The
>>>> current state of  the system is one key type of information, but another
>>>> type of information that should exist within solr and be exposed to plugins
>>>> (including autoscaling) is events. When a new node joins there should be an
>>>> event for example so that plugins can listen for that rather than
>>>> incessantly polling and comparing the list of 100 nodes to a cached list of
>>>> 100 nodes.
>>>>
>>>> In the PR I see a bunch of classes all off in a separate package, which
>>>> looks like an autoscaling fiefdom which will be tempted if not forced to
>>>> duplicate lots of stuff relative to other plugins and/or core.
>>>>
>>>> As a side note I would think the metrics system could be a plugin that
>>>> leverages the same set of informational interfaces....
>>>>
>>>> So there should be 3 parts to this as I imagine it.
>>>>
>>>> 1) Enhancements to the **plugin system** that make information about
>>>> the cluster available solr to ALL plugins
>>>> 2) Enhancements to the **plugin system** API's provided to ALL plugins
>>>> that allow them to mutate solr safely.
>>>> 3) A plugin that we intend to support for our users currently using
>>>> auto scaling utilizes the enhanced information to provide a similar level
>>>> of functionality as is *promised* by our current documentation of
>>>> autoscaling, there might be some gaps or differences but we should be
>>>> discussing what they are and providing recommended workarounds for users
>>>> that relied on those promises to the users. Even if there were cases where
>>>> we failed to deliver, if there were at least some conditions under which we
>>>> could deliver the promised functionality those should be supported. Only if
>>>> we never were able to deliver and it never worked under any circumstance
>>>> should we rip stuff out entirely.
>>>>
>>>> Implicit in the above is the concept that there should be a facade
>>>> between plugins and the core of solr.
>>>>
>>>> WRT #1 which will necessarily involve information collected from remote
>>>> nodes, we need to be designing that thinking about what informational
>>>> guarantees it provides. Latency, consistency, delivery, etc. We also need
>>>> to think about what is exposed in a read-only fashion vs what plugins might
>>>> write back to solr. Certainly there will be a lot of information that most
>>>> plugins ignore, and we might consider having groupings of information and
>>>> interfaces or annotations that indicate what info is provided, but the
>>>> simplest default state is to just give plugins a reference to a class that
>>>> they can use to drill into information about the cluster as needed.
>>>> (SolrInformationBooth? ... or less tongue in cheek...  enhance
>>>> SolrInfoBean? )
>>>>
>>>> Finally a fourth thing that occurs to me as I write is we need to
>>>> consider what information one plugin might make available to the rest of
>>>> the solr plugins. This might come later, and is hard because it's very hard
>>>> to anticipate what info might be generated by unknown plugins in the 
>>>> future.
>>>>
>>>> So some humorous, not seriously suggested but hopefully memorable class
>>>> names encapsulating the concepts:
>>>>
>>>> SolrInformationBooth (place to query)
>>>> SolrLoudspeaker (event announcements)
>>>> SolrControlLevers (mutate solr cluster)
>>>> SolrPluginFacebookPage (info published by the plugin that others can
>>>> watch)
>>>>
>>>> The "facade" provided to plugins by the plugin system should grow and
>>>> expand such that more and more plugins can rely on it. This effort should
>>>> grow it enough to move autoscaling onto it without dropping (much)
>>>> functionality that we've previously published.
>>>>
>>>> -Gus
>>>>
>>>> On Fri, Jul 24, 2020 at 4:40 PM Jan Høydahl <[email protected]>
>>>> wrote:
>>>>
>>>>> Not clear to me what type of "alternative proposal" you're thinking of
>>>>> Jan
>>>>>
>>>>>
>>>>> That would be the responsibility of Noble and others who have concerns
>>>>> to detail - and try convince other peers.
>>>>> It’s hard for me as a spectator to know whether to agree with Noble
>>>>> without a clear picture of what the alternative API or approach would look
>>>>> like.
>>>>> I’m often a fan of loosely typed APIs since they tend to cause less
>>>>> boilerplate code, but strong typing may indeed be a sound choice in this
>>>>> API.
>>>>>
>>>>> Jan Høydahl
>>>>>
>>>>> 24. jul. 2020 kl. 01:44 skrev Ilan Ginzburg <[email protected]>:
>>>>>
>>>>> 
>>>>> In my opinion we have to (and therefore will) ship at least a basic
>>>>> prod ready implementation on top of the API that does simple things (not
>>>>> sure about rack, but for example balance cores and disk size without co
>>>>> locating replicas of same shard on same node).
>>>>> Without such an implementation, I suspect adoption will be low.
>>>>> Moreover, it's always a lot more friendly to start coding from a working
>>>>> example than from scratch.
>>>>>
>>>>> Not clear to me what type of "alternative proposal" you're thinking of
>>>>> Jan. Alternative API proposal? Alternative approach to replace 
>>>>> Autoscaling?
>>>>>
>>>>> Ilan
>>>>>
>>>>> Ilan
>>>>>
>>>>> On Fri, Jul 24, 2020 at 12:11 AM Jan Høydahl <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Important discussion indeed.
>>>>>>
>>>>>> I don’t have time to dive deep into the PR or make up my mind whether
>>>>>> there is a simpler and more future proof way of designing these APIs. 
>>>>>> But I
>>>>>> understand that autoscaling is a complex beast and it is important we get
>>>>>> it right.
>>>>>>
>>>>>> One question regarding having to write code vs config. Is the plan to
>>>>>> ship some very simple light weight default placement rules ootb that 
>>>>>> gives
>>>>>> 80% of users what they need with simple config, or would every user need 
>>>>>> to
>>>>>> write code to e.g. spread replicas across hosts/racks? I’d be interested 
>>>>>> in
>>>>>> seeing an alternative proposal laid out, perhaps not in code but with a
>>>>>> design that can be compared and discussed.
>>>>>>
>>>>>> Jan Høydahl
>>>>>>
>>>>>> 23. jul. 2020 kl. 17:53 skrev Houston Putman <[email protected]
>>>>>> >:
>>>>>>
>>>>>> 
>>>>>> I think this is a valid thing to discuss on the dev list, since this
>>>>>> isn't just about code comments.
>>>>>> It seems to me that Ilan wants to discuss the philosophy around how
>>>>>> to design plugins and the interfaces in Solr which the plugins will talk 
>>>>>> to.
>>>>>> This is broad and affects much more than just the Autoscaling
>>>>>> framework.
>>>>>>
>>>>>> As a community & product, we have so far agreed that Solr should be
>>>>>> lighter weight and additional features should live in plugins that are
>>>>>> managed separately from Solr itself.
>>>>>> At that point we need to think about the lifetime and support of
>>>>>> these plugins. People love to refactor stuff in the solr core, which 
>>>>>> before
>>>>>> plugins wasn't a large issue.
>>>>>> However if we are now intending for many customers to rely on
>>>>>> plugins, then we need to come up with standards and guarantees so that
>>>>>> these plugins don't:
>>>>>>
>>>>>>    - Stall people from upgrading Solr (minor or major versions)
>>>>>>    - Hinder the development of Solr Core
>>>>>>    - Cause us more headaches trying to keep multiple repos of
>>>>>>    plugins up to date with recent versions of Solr
>>>>>>
>>>>>>
>>>>>> I am not completely sure where I stand right now, but this is
>>>>>> definitely something that we should be thinking about when migrating all 
>>>>>> of
>>>>>> this functionality to plugins.
>>>>>>
>>>>>> - Houston
>>>>>>
>>>>>> On Thu, Jul 23, 2020 at 9:27 AM Ishan Chattopadhyaya <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think we should move the discussion back to the PR because it has
>>>>>>> more context and inline comments are possible. Having this discussion 
>>>>>>> in 4
>>>>>>> places (jira, pr, slack and dev list is very hard to keep track of).
>>>>>>>
>>>>>>> On Thu, 23 Jul, 2020, 5:57 pm Ilan Ginzburg, <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> [I’m moving a discussion from the PR
>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684> for SOLR-14613
>>>>>>>> <https://issues.apache.org/jira/browse/SOLR-14613> to the dev list
>>>>>>>> for a wider audience. This is about replacing the now (in master) gone
>>>>>>>> Autoscaling framework with a way for clients to write their customized
>>>>>>>> placement code]
>>>>>>>>
>>>>>>>> It took me a long time to write this mail and it's quite long,
>>>>>>>> sorry.
>>>>>>>> Please anybody interested in the future of Autoscaling (not only
>>>>>>>> those I cc'ed) do read it and provide feedback. Very impacting 
>>>>>>>> decisions
>>>>>>>> have to be made now.
>>>>>>>>
>>>>>>>> Thanks Noble for your feedback.
>>>>>>>> I believe it is important that we are aligned on what we build
>>>>>>>> here, esp. at the early defining stages (now).
>>>>>>>>
>>>>>>>> Let me try to elaborate on your concerns and provide in general the
>>>>>>>> rationale behind the approach.
>>>>>>>>
>>>>>>>> *> Anyone who wishes to implement this should not require to learn
>>>>>>>> a lot before even getting started*
>>>>>>>> For somebody who knows Solr (what is a Node, Collection, Shard,
>>>>>>>> Replica) and basic notions related to Autoscaling (getting variables
>>>>>>>> representing current state to make decisions), there’s not much to 
>>>>>>>> learn.
>>>>>>>> The framework uses the same concepts, often with the same names.
>>>>>>>>
>>>>>>>> *> I don't believe we should have a set of interfaces that
>>>>>>>> duplicate existing classes just for this functionality.*
>>>>>>>> Where appropriate we can have existing classes be the
>>>>>>>> implementations for these interfaces and be passed to the plugins, that
>>>>>>>> would be perfectly ok. The proposal doesn’t include implementations at 
>>>>>>>> this
>>>>>>>> stage, therefore there’s no duplication, or not yet... (we must get the
>>>>>>>> interfaces right and agreed upon before implementation). If some 
>>>>>>>> interface
>>>>>>>> methods in the proposal have a different name from equivalent methods 
>>>>>>>> in
>>>>>>>> internal classes we plan to use, of course let's rename one or the 
>>>>>>>> other.
>>>>>>>>
>>>>>>>> Existing internal abstractions are most of the time concrete
>>>>>>>> classes and not interfaces (Replica, Slice, DocCollection,
>>>>>>>> ClusterState). Making these visible to contrib code living
>>>>>>>> elsewhere is making future refactoring hard and contrib code will most
>>>>>>>> likely end up reaching to methods it shouldn’t be using. If we define a
>>>>>>>> clean set of interfaces for plugins, I wouldn’t hesitate to break 
>>>>>>>> external
>>>>>>>> plugins that reach out to other internal Solr classes, but will make
>>>>>>>> everything possible to keep the API backward compatible so existing 
>>>>>>>> plugins
>>>>>>>> can be recompiled without change.
>>>>>>>>
>>>>>>>> *> 24 interfaces to do this is definitely over engineering*
>>>>>>>> I don’t consider the number of classes or interfaces a metric of
>>>>>>>> complexity or of engineering quality. There are sample
>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684/files#diff-ddbe185b5e7922b91b90dfabfc50df4c>
>>>>>>>> plugin implementations to serve as a base for plugin writers (and for 
>>>>>>>> us
>>>>>>>> defining this framework) and I believe the process is relatively 
>>>>>>>> simple.
>>>>>>>> Trying to do the same things with existing Solr classes might prove a 
>>>>>>>> lot
>>>>>>>> harder (but might be worth the effort for comparison purposes to make 
>>>>>>>> sure
>>>>>>>> we agree on the approach? For example, getting sister replicas of a 
>>>>>>>> given
>>>>>>>> replica in the proposed API is: replica.getShard()
>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684/files#diff-a2d49bd52fddde54bb7fd2e96238507eR27>
>>>>>>>> .getReplicas()
>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684/files#diff-9633f5e169fa3095062451599daac213R31>.
>>>>>>>> Doing so with the internal classes likely involves getting the
>>>>>>>> DocCollection and Slice name from the Replica, then get the
>>>>>>>> DocCollection from the cluster state, there get the Slice based on
>>>>>>>> its name and finally getReplicas() from the Slice). I consider the
>>>>>>>> role of this new framework is to make life as easy as possible for 
>>>>>>>> writing
>>>>>>>> placement code and the like, make life easy for us to maintain it, 
>>>>>>>> make it
>>>>>>>> easy to write a simulation engine (should be at least an order of 
>>>>>>>> magnitude
>>>>>>>> simpler than the previous one), etc.
>>>>>>>>
>>>>>>>> An example regarding readability and number of interfaces: rather
>>>>>>>> than defining an enum with runtime annotation for building its 
>>>>>>>> instances (
>>>>>>>> Variable.Type
>>>>>>>> <https://github.com/apache/lucene-solr/blob/branch_8_6/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/Variable.java#L98>)
>>>>>>>> and then very generic access methods, the proposal defines a specific
>>>>>>>> interface for each “variable type” (called properties
>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684/files#diff-4c0fa84354f93cb00e6643aefd00fd3c>).
>>>>>>>> Rather than concatenating strings to specify the data to return from a
>>>>>>>> remote node (based on snitches
>>>>>>>> <https://github.com/apache/lucene-solr/blame/branch_8_6/solr/core/src/java/org/apache/solr/cloud/rule/ImplicitSnitch.java#L60>,
>>>>>>>> see doc
>>>>>>>> <https://lucene.apache.org/solr/guide/8_1/solrcloud-autoscaling-policy-preferences.html#node-selector>),
>>>>>>>> the proposal is explicit and strongly typed (here
>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684/files#diff-4ec32958f54ec8e1f7e2d5ce8de331bb>
>>>>>>>>  example
>>>>>>>> to get a specific system property from a node). This definitely does
>>>>>>>> increase the number of interfaces, but reduces IMO the effort to code 
>>>>>>>> to
>>>>>>>> these abstractions and provides a lot more compile time and IDE 
>>>>>>>> assistance.
>>>>>>>>
>>>>>>>> Goal is to hide all the boilerplate code and machinery (and to a
>>>>>>>> point - complexity) in the implementations of these interfaces rather 
>>>>>>>> than
>>>>>>>> have each plugin writer deal with the same problems.
>>>>>>>>
>>>>>>>> We’re moving from something that was complex and hard to read and
>>>>>>>> debug yet functionally extremely rich, to something simpler for us, 
>>>>>>>> more
>>>>>>>> demanding for users (write code rather than policy config if there's a 
>>>>>>>> need
>>>>>>>> for new behavior) but that should not be less "expressive" in any
>>>>>>>> significant way. One could even imagine reimplementing the former
>>>>>>>> Autoscaling config Domain Specific Language on top of these API (maybe 
>>>>>>>> as a
>>>>>>>> summer internship project :)
>>>>>>>>
>>>>>>>> *> This is a common mistake that we all do. When we design a
>>>>>>>> feature we think that is the most important thing.*
>>>>>>>> If by *"most important thing"* you mean investing the best
>>>>>>>> reasonable effort to do things right then yes.
>>>>>>>> If you mean trying to make a minor feature look more important and
>>>>>>>> inflated than it is, I disagree.
>>>>>>>> As a personal note, replica placement is not the aspect of
>>>>>>>> SolrCloud I'm most interested in, but the first bottleneck we hit when
>>>>>>>> pushing the scale of SolrCloud. I approach this with a state of mind 
>>>>>>>> "let's
>>>>>>>> do it right and get it out of the way" to move to topics I really want 
>>>>>>>> to
>>>>>>>> work on (around distribution in SolrCloud and the role of Overseer).
>>>>>>>> Implementing Autoscaling in a way that simplifies future refactoring 
>>>>>>>> (or
>>>>>>>> that does not make them harder than they already are) is therefore 
>>>>>>>> *very
>>>>>>>> high* on my priority list, to support modest changes (Slice to
>>>>>>>> Shard renaming) and more ambitious ones (replacing Zookeeper,
>>>>>>>> removing Overseer, you name it).
>>>>>>>>
>>>>>>>> Thanks for reading, again sorry for the long email, but I hope this
>>>>>>>> helps (at least helps the discussion),
>>>>>>>> Ilan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu 23 Jul 2020 at 08:16, Noble Paul <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I don't believe we should have a set of interfaces that duplicate
>>>>>>>>> existing classes just for this functionality. This is a common 
>>>>>>>>> mistake that
>>>>>>>>> we all do. When we design a feature we think that is the most 
>>>>>>>>> important
>>>>>>>>> thing. We endup over designing and over engineering things. This 
>>>>>>>>> feature
>>>>>>>>> will remain a tiny part of Solr. Anyone who wishes to implement this 
>>>>>>>>> should
>>>>>>>>> not require to learn a lot before even getting started. Let's try to 
>>>>>>>>> have a
>>>>>>>>> minimal set of interfaces so that people who try to implement them do 
>>>>>>>>> not
>>>>>>>>> have a huge learning cure.
>>>>>>>>>
>>>>>>>>> Let's try to understand the requirement
>>>>>>>>>
>>>>>>>>>    - Solr wants a set of positions to place a few replicas
>>>>>>>>>    - The implementation wants to know what is the current state
>>>>>>>>>    of the cluster so that it can make those decisions
>>>>>>>>>
>>>>>>>>> 24 interfaces to do this is definitely over engineering
>>>>>>>>>
>>>>>>>>> —
>>>>>>>>> You are receiving this because you authored the thread.
>>>>>>>>> Reply to this email directly, view it on GitHub
>>>>>>>>> <https://github.com/apache/lucene-solr/pull/1684#issuecomment-662837142>,
>>>>>>>>> or unsubscribe
>>>>>>>>> <https://github.com/notifications/unsubscribe-auth/AKIOMCFT5GU2II347GZ4HTTR47IVTANCNFSM4PC3HDKQ>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>

Re: Approach for a new Autoscaling framework

Reply via email to