Re: Solr Alpha (EA) release of Reference Branch

Varun Thacker Tue, 06 Oct 2020 20:43:26 -0700

On Tue, Oct 6, 2020 at 7:45 PM Anshum Gupta <[email protected]> wrote:


> Thanks for initiating this discussion, Ishan.
>
> For the sake of making sure that we are all on the same page, let me
> summarize my understanding and take on this thread.
>
> The current situation
> Mark has a reference branch, which the folks who have looked at the
> branch, feel that it’s a much better, improved, reliable, and sustainable
> version of the current master i.e. take the same baseline and make it
> better. We would like to get those changes to the project, but aren’t sure
> about how to do so. Releasing the branch when it’s ready to go, as an alpha
> release will allow users to test it.
>
> 1. Is releasing the branch officially going to help us achieve the goal of
> having a well tested branch ?
> 2. Assuming #1 is true, do we as a community want to release the branch
> officially and assume responsibility?
>

I think so! We should all try to help out to the best we can.

3. What is our path forward after the release I.e. do we merge the branch
> into master or swap out current master.
>
> What do we plan to do (options).
> I feel there is a consensus on everyone wanting the best for the project
> and wanting Marks’ changes released.
>
> #1 - There are differing opinions, and I personally think we can have our
> test harnesses test the new branch, but I think most companies running Solr
> at scale would have concerns with taking up an alpha release and deploying
> it in production. The various tests that a bunch of folks are working on is
> our best bet at testing out the branch, in which case I’m not sure if we
> want an official release.
>
> #2 - I feel that having an official release and having artifacts show up
> in maven central will confuse people. The 4.0 alpha release was very
> different in the sense that it was the same branch, the code wasn’t
> replacing anything existing but introducing a completely new feature i.e.
> SolrCloud.
>
> #3 - I’m still unclear on how these changes will be released in terms of
> the community consensus. I’ve tried to merge parts of Marks’ effort from
> another time into master, but it’s very difficult, almost impossible to
> isolate and extract commits on the basis of coverage/features/etc. This is
> a lot of really great effort and after having spoken with Mark multiple
> times, I really feel we should figure out a way to absorb this but I do
> have concerns around replacing the master branch completely.
>
> While I do like the idea that Tomás proposed, I also feel that maintaining
> and managing cherry-picking across 9x, master, and ref branch will only
> make it difficult for people to work though the duration of 9x.
>
> I haven’t looked at the current ref branch recently, but the folks who
> have looked at it, if you think that this code can be merged into master
> even as big chunks, that’d be the most confidence building way forward.
>
>
>
>
> On Tue, Oct 6, 2020 at 11:37 AM Ilan Ginzburg <[email protected]> wrote:
>
>> Copying below Mark's posts from ASF Slack #solr-next-big-thing channel.
>>
>> The Solr Reference Branch.
>> Document 1, a quick intro.
>> You can think of the Solr Reference Branch as a remaster of Solr. It
>> is not an attempt to redesign Solr or make it more fancy. The goal of
>> the Solr Reference Branch is to be a better incarnation of the current
>> Apache Solr, which will provide a base for future development and
>> design.
>> There are a variety of problems with Solr today that make it difficult
>> to adopt and run. This is me being as honest and objective as I can
>> be, though no doubt, many will see it as an exaggeration or negative
>> focus. I just see it as the way it is and has been, it's just taken me
>> a real long time to actually get all the way under the rug to find the
>> really hardened nasty cockroaches burrowed in there.
>> 1. Resource usage and management is wasteful, inefficient, buggy, and
>> haphazard.
>> 2. SolrCloud is not long term reliable. Exceptional cases will
>> frequently flummox the system, and exceptional cases are supposed to
>> be our wheelhouse and primary focus. Leaders will be lost and not
>> recover, the Overseer will go away, GC storms will hit, tight loops in
>> a bad case will crank up resources, and retries will be abundant and
>> overaggressive.
>> 3. Our blocking and locking is generally not efficient, especially in key
>> paths.
>> 4. We get thread safety wrong (too often) in some important spots.
>> 5. Distributed updates have to be added locally before they are
>> distributed, and then that distribution is generally inefficient,
>> prone to blocking and/or timeouts, and hobbled by HTTP1.1 and our need
>> for packing updates into a single request to achieve any kind of
>> performance, losing proper error handling and eating the many rough
>> edges of the ConcurrentUpdateSolrClient.
>> 6. Our Zookeeper foundation code is often inefficient, buggy,
>> unreliable, and improperly used (we don’t always use async or multi
>> where we should, we force updates from zk instead of being notified,
>> we don’t handle session expiration as well as we should, our
>> algorithms are slow and buggy, we make a multitude more calls than we
>> should (especially on cluster startup), etc, etc)
>> 7. We have circular dependencies between major classes that can start
>> threads in their constructors that start interacting with the other
>> classes before construction is complete.
>> 8. Our XML handling is abysmally outdated and slow for multiple
>> reasons. Our heavy Xpath usage is incredibly wasteful and expensive.
>> 9. Our thread management is not understandable, not properly tunable,
>> not efficient, sometimes buggy, not always consistent, and difficult
>> to understand fundamentally.
>> 10. Our Jetty configuration is lacking in a variety of ways,
>> especially around shutdown and http2.
>> 11. The dynamic schema feature can be very expensive and not fully thread
>> safe.
>> 12. The Overseer is extremely inefficient, can be extremely slow to
>> stop, had a buggy leader election algorithm, doesn’t handle session
>> expiration as well as it should, can keep trying to come back from the
>> dead, and the list goes on.
>> 13. Our connection resuse is often very poor or non existent, when
>> it’s improved, it always reverts back to bad or worse.
>> 14. HTTP1.1 is not great for our type of application in a variety of
>> ways that HTTP2 solves – but we still use a lot of HTTP1.1 and HTTP2
>> is not configured well and the client needs some work.
>> 15. Lifecycle of important objects is often off, most things can and
>> will leak (SolrCores, SolrIndexSearchers, Directory’s, Solr clients),
>> some things will close objects more than once or that don’t belong to
>> them, or close things in a bad order.
>> 16. There is often sleeps and/or pulling that is a magnitude slower
>> than proper event driven waits.
>> 17. Our tests are actually pretty unstable and making them stable is
>> way, way more difficult than most people realize. I’m quite sure I’ve
>> spent much, much more time on this than anyone out there, and I can
>> tell you, the tests are not stable in a 1,000 shifting ways that have
>> and will continue to cause lots of damage.
>> 18. We don’t have good async update/search support for scaling and
>> better resource usage.
>> 19. We often duplicate resources or create new pools instead of sharing.
>> 20. We don’t do tons of parallelizable stuff in parallel, when we do
>> it’s inconsistent.
>> 21. Our Collections API can often not wait correctly for the proper
>> state for what it did to be ready before returning. Even if it gets it
>> right, a cloud client that made the request won’t necessarily have the
>> updated state local when the request returns. Things often still work,
>> but with a variety of interesting and slow results possible.
>> 22. We don’t often holistically look at what we have built and how it
>> fits together and so often there are silly things, bad fits, one off
>> bad patterns, lazy attempts at something, etc.
>> 24. Close and shutdown are inefficient and slow across a huge swatch
>> of our object tree. These issues tend to be growy and breed less
>> concern over time.
>> 25. There are a variety of ways and places that we can generate an
>> absurd amount of unnecessary garbage.
>> 26. SolrCore reload is not fully reliable and increasingly important and
>> used.
>> 27. The leader election code has a variety of ugly little bugs and is
>> based on a recursive implementation that will eventually exhaust stack
>> space – though it’s likely your cluster will be brought down by
>> something else before that is a problem (unless you hit the infinite
>> loop, no one can be leader, eat up the stack as fast as possible case
>> – which should be hard these days with the leader election throttle).
>> 28. The recovery processes, like almost everything you can imagine,
>> has a variety of issues and rarer bad cases and affects.
>> By and large, everything is inefficient and buggy and full of accepted
>> compromise regardless.
>> Interestingly, this does not make us an atypical open source Java
>> distributed project. But, I’m kind of a software snob, and I would not
>> run this thing and so I cannot work on it. What is there to do ...
>> The Solr Reference Branch is intended to tackle every one of those
>> issues. As well as about 1000+ more of varying and lesser importance.
>> As all of that comes together, cool stuff starts to unlock, and you
>> begin to see some phenomena that is together much greater than the sum
>> of it’s many, many parts.
>> 29. Our tests have been getting better and better are stamping out the
>> legit noise they create - every scream a breadcrumb towards badness -
>> but we have built a scream catching machine - though we will never be
>> able to catch them all for a huge variety of deep reasons.
>>
>> The Solr Reference Branch
>> Document 2
>> While the extent of the previously mentioned issues was not clear to
>> me, that is a deep rabbit hole, I’ve always, as have many others,
>> known the current state of things with Solr at a higher, broader
>> level.
>> So what about this effort is different? Is this not just a bunch of my
>> standard JIRA issues all crammed into one? Should we not break them
>> out proper and do things sensible?
>> Well, previously, as is probably common, I was both a bit lost on
>> where we were exactly and certainly on where to find firmer ground for
>> real, not just the mirage always just over the hill.
>> I love performance and efficiency though. I’ve always avoided it as a
>> focus with Solr and SolrCloud, thinking stability has to come first.
>> Having given up on stability and scale after a good 8 years or
>> something, completely tossed out as a pipe dream, I started work on
>> something new, something really just for me. I started plugging in
>> HTTP2. And the effort and work needed for that and the learning and
>> some of the results, completely opened my eyes. I also attacked very
>> different than I have in the past, doing something I like for me, I
>> drowned myself in it. Spent 2-3 weeks at a time here and there sitting
>> at the computer with intense focus for 16-20 hours a day. The more I
>> did, the more I found, the more I understood, the more I discovered.
>> I discovered a discovery processes. It was leading me to everything I
>> needed to do and I just had to follow the long, ever flowing path,
>> keeping my mental models strong, re etching, ruminating, obsessing.
>> I realized many test functions we have – most- should be taking on the
>> order of milliseconds instead of seconds to dozens of seconds. I
>> realized tons and tons of our issues and gremlins lived and prospered
>> in our slow and inefficient smog. I realized that if I just spent the
>> time to look where slowness and flakiness prevailed, really look, like
>> take hours just for some random side road - build a bridge, burn it,
>> and build one further down, etc, etc – that making huge improvement
>> after huge improvement was actually very low hanging fruit, just
>> hidden by some thorns and leaves and lack of any reasonable
>> introspection into the system we have created and continue to build.
>> Over time, I could see what had to be done and I could see what it
>> would achieve. I built different parts at different times, lost them
>> and rebuilt them a different way with different focus. I build and
>> expanded my introspection and measuring tools and classes.
>> That’s a sentence trying to cover a universe, but if you want to
>> really boil it down even further, I’d invoke the normally faulty
>> broken windows theory. There is magic in perfect windows that only
>> those that have them know. Can we get perfect? I like to dream and
>> there is no end to the introspection, experimentation, and
>> improvements to try. The perfect landing aside though, no doubt we can
>> move drastically from where we are.
>>
>> Another thing I learned is the crazy number of ways you can make all
>> the tests pass like champions, and roll into production unusable.
>> Which tells me that production users are a large part of our test
>> strategy, and that can’t be to make any real change in a satisfactory
>> way.
>>
>> The current goal is to have a mostly usable and testable system by
>> mid-late October. Not everything 100%, some known caveats and cleanup
>> and plenty to do, but it should be in good shape for a user to try out
>> given the caveats outlined
>> The biggest risk currently is the absorption of the search side async
>> work from master - I'm familiar with that, I've worked on it myself,
>> the code involved is derived from an old branch of mine, but async is
>> a whole different animal and trying to nail it without any downsides
>> to the old synchronous model is a tough nut
>> one that I was already battling on the dist update side, so it's good
>> stuff to work on and do, but its taking some effort to get in shape
>>
>> On Tue, Oct 6, 2020 at 8:00 PM Tomás Fernández Löbbe
>> <[email protected]> wrote:
>> >
>> > > Let's say we cut 9x and now there is a new master taken from the
>> reference branch.
>> > I never said “make a new master”, I said merge changes in ref branch
>> into master. If things are broken into pieces like Ishan is suggesting,
>> those changes can be merged into 9.x too. I only suggested this because you
>> felt unsure about merging to master now and I guess this is due to fear of
>> introducing bugs so close to a potential 9.0 release, is that not right?
>> >
>> >
>> > > We will never be able to reconcile these 2 branches
>> > Sorry, but how is that different if we do an alpha release from the
>> branch now? What would be the process after that? Let's say people don't
>> find issues and we want to merge those changes, what’s the plan then?
>> >
>> > > Choice 1:
>> > I’m fine with choice 1 if that’s what you want, as long as it’s not an
>> official release for the reasons stated above.
>> >
>> >
>> > > I promise to do code review & cleanup as much as possible. But I'm
>> hesitant to give a stamp of approval to make it THE official release
>> > What do you mean? I thought this is what you were suggesting, make an
>> official release from the reference_impl branch?
>> >
>> >
>> > I think Ilan’s last email is on spot, and I agree 100% with what he can
>> express much better than I can :)
>> >
>> > > Mark's descriptions in Slack go in the right way but are still too
>> high level
>> > Can someone share those here? or in Jira?
>> >
>> > On Tue, Oct 6, 2020 at 5:09 AM Noble Paul <[email protected]> wrote:
>> >>
>> >> > I think the danger is high to treat this branch as a black box (or
>> an "all or nothing").
>> >>
>> >> True Ilan.  Ideally, I would like a few of us to study the code &
>> >> start pulling in changes we are confident of (even to 8x branch, why
>> >> not). We cannot burden a single developer to do everything.
>> >>
>> >> This cannot be a task just for one or 2 devs. We all will have to work
>> >> together to decompose the changes and digest them into master. I can
>> >> do my bit.
>> >>
>> >> But, I'm sure we may hit a point where certain changes cannot be
>> >> isolated and absorbed. We will have to collectively make a call, how
>> >> to absorb them
>> >>
>> >> On Tue, Oct 6, 2020 at 9:00 PM Ishan Chattopadhyaya
>> >> <[email protected]> wrote:
>> >> >
>> >> >
>> >> > I'm willing to help and I believe others will too if the amount of
>> work for contributing is reasonable (i.e. not a three months effort).
>> >> >
>> >> > I looked into the possibility of doing so. To me, it seemed to be
>> that it is very hard to do so: possibly 1 year project for me. Problem is
>> that it is hard to pull out a particular class of improvements (say thread
>> management improvement) and have all tests pass with it (because tests have
>> gotten extensive improvements of their own) and also observe the effect of
>> the improvement. IIUC, every improvement to Solr seemed to require many
>> iterations to get the tests happy. I remember Mark telling me that it may
>> not even be possible for him to do something like that (i.e. bring all
>> changes into master as tiny pieces).
>> >> >
>> >> > What I volunteered to do, however, is to decompose roughly all the
>> general improvements into smaller, manageable commits. However, making sure
>> all tests pass at every commit point is beyond my capability.
>> >> >
>> >> > On Tue, 6 Oct, 2020, 3:10 pm Ilan Ginzburg, <[email protected]>
>> wrote:
>> >> >>
>> >> >> Another option to integrate this work into the main code line would
>> be to understand what changes have been made and where (Mark's descriptions
>> in Slack go in the right way but are still too high level), and then port
>> or even redo them in main, one by one.
>> >> >>
>> >> >> I think the danger is high to treat this branch as a black box (or
>> an "all or nothing"). Using the merging itself to change our understanding
>> and increase our knowledge of what was done can greatly reduce the risk.
>> >> >>
>> >> >> We do develop new features in Solr 9 without beta releasing them,
>> so if we port Mark's improvements by small chunks (and maybe in the process
>> decide that some should not be ported or not now) I don't see why this
>> can't integrate to become like other improvements done to the code. If
>> specific changes do require a beta release, do that release from master and
>> pick the right moment.
>> >> >>
>> >> >> I'm willing to help and I believe others will too if the amount of
>> work for contributing is reasonable (i.e. not a three months effort). This
>> requires documenting the changes done in that branch, pointing to where
>> these changes happened and then picking them up one by one and porting them
>> more or less independently of each other. We might only port a subset of
>> changes by the time 9.0 is released, that's fine we can continue in
>> following releases.
>> >> >>
>> >> >> My 2 cents...
>> >> >> Ilan
>> >> >>
>> >> >> Le mar. 6 oct. 2020 à 09:56, Noble Paul <[email protected]> a
>> écrit :
>> >> >>>
>> >> >>> Yes, A docker image will definitely help. I wasn't trying to
>> downplay that
>> >> >>>
>> >> >>> On Tue, Oct 6, 2020 at 6:55 PM Ishan Chattopadhyaya
>> >> >>> <[email protected]> wrote:
>> >> >>> >
>> >> >>> >
>> >> >>> > > Docker is not a big requirement for large scale installations.
>> Most of them already have their own install scripts. Availability of docker
>> is not important for them. If a user is only encouraged to install Solr
>> because of a docker image , most likely they are not running a large enough
>> cluster
>> >> >>> >
>> >> >>> > I disagree, Noble. Having a docker image us going to be useful
>> to some clients, with complex usecases. Great point, David!
>> >> >>> >
>> >> >>> > On Tue, 6 Oct, 2020, 1:09 pm Ishan Chattopadhyaya, <
>> [email protected]> wrote:
>> >> >>> >>
>> >> >>> >> As I said, I'm *personally* not confident in putting such a big
>> changeset into master that wasn't vetted in a real user environment widely.
>> I have, in the past, done enough bad things to Solr (directly or
>> indirectly), and I don't want to repeat the same. Also, I'll be very
>> uncomfortable if someone else did so.
>> >> >>> >>
>> >> >>> >> Having said this, if someone else wants to port the changes
>> over to master *without first getting enough real world testing*, feel free
>> to do so, and I can focus my efforts elsewhere.
>> >> >>> >>
>> >> >>> >> On Tue, 6 Oct, 2020, 9:22 am Tomás Fernández Löbbe, <
>> [email protected]> wrote:
>> >> >>> >>>
>> >> >>> >>> I was thinking (and I haven’t flushed it out completely but
>> will throw the idea) that an alternative approach with this timeline could
>> be to cut 9x branch around November/December? And then you could merge into
>> master, it would have the latest  changes from master plus the ref branch
>> changes. From there any nightly build could be use to help test/debug.
>> >> >>> >>>
>> >> >>> >>> That said I don’t know for sure what are the changes in the
>> branch that do not belong in 9. The problem with them being 10x only is
>> that backports would potentially be more difficult for all the life of 9.
>> >> >>> >>>
>> >> >>> >>> On Mon, Oct 5, 2020 at 4:54 PM Noble Paul <
>> [email protected]> wrote:
>> >> >>> >>>>
>> >> >>> >>>> >I don't think it can be said what committers do and don't do
>> with regards to running Solr.  All of us would answer this differently and
>> at different points in time.
>> >> >>> >>>>
>> >> >>> >>>> " I have run it in one large cluster, so it is certified to
>> be bug free/stable" I don't think it's a reasonable approach. We need as
>> much feedback from our users because each of them stress Solr in a
>> different way. This is not to suggest that committers are not doing testing
>> or their tests are not valid. When I talk to the committers out here they
>> say they do not see any performance stability issues at all. But, my client
>> reports issues on a day to day basis.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> > Definitely publish a Docker image BTW -- it's the best way
>> to try out any software.
>> >> >>> >>>>
>> >> >>> >>>> Docker is not a big requirement for large scale
>> installations. Most of them already have their own install scripts.
>> Availability of docker is not important for them. If a user is only
>> encouraged to install Solr because of a docker image , most likely they are
>> not running a large enough cluster
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> On Tue, Oct 6, 2020, 6:30 AM David Smiley <[email protected]>
>> wrote:
>> >> >>> >>>>>
>> >> >>> >>>>> Thanks so much for your responses Ishan... I'm getting much
>> more information in this thread than my attempts to get questions answered
>> on the JIRA issue months ago.  And especially,  thank you for volunteering
>> for the difficult porting efforts!
>> >> >>> >>>>>
>> >> >>> >>>>> Tomas said:
>> >> >>> >>>>>>
>> >> >>> >>>>>>  I do agree with the previous comments that calling it
>> "Solr 10" (even with the "-alpha") would confuse users, maybe use
>> "reference"? or maybe something in reference to SOLR-14788?
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> I have the opposite opinion.  This word "reference" is
>> baffling to me despite whatever Mark's explanation is.  I like the
>> justification Ishan gave for 10-alpha and I don't think I could re-phrase
>> his justification any better.  *If* the release was _not_ official (thus
>> wouldn't show up in the usual places anyone would look for a release), I
>> think it would alleviate that confusion concern even more, although I think
>> "alpha" ought to be enough of a signal not to use it without digging deeper
>> on what's going on.
>> >> >>> >>>>>
>> >> >>> >>>>> Alex then Ishan said:
>> >> >>> >>>>>>
>> >> >>> >>>>>> > Maybe we could release it to
>> >> >>> >>>>>> > committers community first and dogfood it "internally"?
>> >> >>> >>>>>>
>> >> >>> >>>>>> Alex: It is meaningless. Committers don't run large scale
>> installations. We barely even have time to take care of running unit tests
>> before destabilizing our builds. We are not the right audience. However, we
>> all can anyway check out the branch and start playing with it, even without
>> a release. There are orgs that don't want to install any code that wasn't
>> officially released; this release is geared towards them (to help us test
>> this at their scale).
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> I don't think it can be said what committers do and don't do
>> with regards to running Solr.  All of us would answer this differently and
>> at different points in time.  From time to time, though not at present,
>> I've been well positioned to try out a new version of Solr in a stage/test
>> environment to see how it goes.  (Putting on my Salesforce metaphorical
>> hat...) Even though I'm not able to deploy it in a realistic way today, I'm
>> able to run a battery of tests to see if one of the features we depend on
>> have changed or is broken.  That's useful feedback to an alpha release!
>> And even though I'm saying I'm not well positioned to try out some new Solr
>> release in a production-ish setting now, it's something I could make a good
>> case for internally since upgrades take a lot of effort where I work.  It's
>> in our interest for SolrCloud to be very stable (of course).
>> >> >>> >>>>>
>> >> >>> >>>>> Regardless, I think what you're driving at Ishan is that you
>> want an "official" release -- one that goes through the whole ceremony.
>> You believe that people would be more likely to use it.  I think all we
>> need to do is announce (similar to a real release) that there is some
>> unofficial alpha distribution and that we want to solicit your feedback --
>> basically, help us find bugs.  Definitely publish a Docker image BTW --
>> it's the best way to try out any software.  I'm -0 on doing an official
>> release for alpha software because it's unnecessary to achieve the goals
>> and somewhat confusing.  I think the Solr 4 alpha/beta situation was
>> different -- it was not some fork a committer was maintaining; it was the
>> master branch of its time, and it was destined to be the very next release,
>> not some possible future release.
>> >> >>> >>>>>
>> >> >>> >>>>> ~ David Smiley
>> >> >>> >>>>> Apache Lucene/Solr Search Developer
>> >> >>> >>>>> http://www.linkedin.com/in/davidwsmiley
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> -----------------------------------------------------
>> >> >>> Noble Paul
>> >> >>>
>> >> >>>
>> ---------------------------------------------------------------------
>> >> >>> To unsubscribe, e-mail: [email protected]
>> >> >>> For additional commands, e-mail: [email protected]
>> >> >>>
>> >>
>> >>
>> >> --
>> >> -----------------------------------------------------
>> >> Noble Paul
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
> --
> Anshum Gupta
>

Re: Solr Alpha (EA) release of Reference Branch

Reply via email to