On Tue, Oct 6, 2020 at 7:45 PM Anshum Gupta <[email protected]> wrote:
> Thanks for initiating this discussion, Ishan.
>
> For the sake of making sure that we are all on the same page, let me summarize my understanding and take on this thread.
>
> The current situation: Mark has a reference branch which the folks who have looked at it feel is a much better, improved, reliable, and sustainable version of the current master, i.e. take the same baseline and make it better. We would like to get those changes into the project, but aren't sure about how to do so. Releasing the branch when it's ready to go, as an alpha release, will allow users to test it.
>
> 1. Is releasing the branch officially going to help us achieve the goal of having a well-tested branch?
> 2. Assuming #1 is true, do we as a community want to release the branch officially and assume responsibility? I think so! We should all try to help out the best we can.
> 3. What is our path forward after the release, i.e. do we merge the branch into master or swap out the current master?
>
> What do we plan to do (options)? I feel there is a consensus on everyone wanting the best for the project and wanting Mark's changes released.
>
> #1 - There are differing opinions, and I personally think we can have our test harnesses test the new branch, but I think most companies running Solr at scale would have concerns with taking up an alpha release and deploying it in production. The various tests that a bunch of folks are working on are our best bet at testing out the branch, in which case I'm not sure we want an official release.
>
> #2 - I feel that having an official release and having artifacts show up in Maven Central will confuse people. The 4.0 alpha release was very different in the sense that it was the same branch; the code wasn't replacing anything existing but introducing a completely new feature, i.e. SolrCloud.
>
> #3 - I'm still unclear on how these changes will be released in terms of the community consensus.
> I've tried to merge parts of Mark's effort from another time into master, but it's very difficult, almost impossible, to isolate and extract commits on the basis of coverage/features/etc. This is a lot of really great effort, and after having spoken with Mark multiple times I really feel we should figure out a way to absorb it, but I do have concerns around replacing the master branch completely.
>
> While I do like the idea that Tomás proposed, I also feel that maintaining and managing cherry-picking across 9x, master, and the ref branch will only make it difficult for people to work through the duration of 9x.
>
> I haven't looked at the current ref branch recently, but for the folks who have: if you think this code can be merged into master even as big chunks, that'd be the most confidence-building way forward.
>
> On Tue, Oct 6, 2020 at 11:37 AM Ilan Ginzburg <[email protected]> wrote:
>
>> Copying below Mark's posts from the ASF Slack #solr-next-big-thing channel.
>>
>> The Solr Reference Branch.
>> Document 1, a quick intro.
>>
>> You can think of the Solr Reference Branch as a remaster of Solr. It is not an attempt to redesign Solr or make it more fancy. The goal of the Solr Reference Branch is to be a better incarnation of the current Apache Solr, which will provide a base for future development and design.
>>
>> There are a variety of problems with Solr today that make it difficult to adopt and run. This is me being as honest and objective as I can be, though no doubt many will see it as an exaggeration or negative focus. I just see it as the way it is and has been; it's just taken me a real long time to actually get all the way under the rug to find the really hardened, nasty cockroaches burrowed in there.
>>
>> 1. Resource usage and management is wasteful, inefficient, buggy, and haphazard.
>> 2. SolrCloud is not long-term reliable. Exceptional cases will frequently flummox the system, and exceptional cases are supposed to be our wheelhouse and primary focus. Leaders will be lost and not recover, the Overseer will go away, GC storms will hit, tight loops in a bad case will crank up resources, and retries will be abundant and overaggressive.
>> 3. Our blocking and locking is generally not efficient, especially in key paths.
>> 4. We get thread safety wrong (too often) in some important spots.
>> 5. Distributed updates have to be added locally before they are distributed, and then that distribution is generally inefficient, prone to blocking and/or timeouts, and hobbled by HTTP/1.1 and our need to pack updates into a single request to achieve any kind of performance, losing proper error handling and eating the many rough edges of ConcurrentUpdateSolrClient.
>> 6. Our ZooKeeper foundation code is often inefficient, buggy, unreliable, and improperly used (we don't always use async or multi where we should, we force updates from ZK instead of being notified, we don't handle session expiration as well as we should, our algorithms are slow and buggy, we make a multitude more calls than we should (especially on cluster startup), etc., etc.).
>> 7. We have circular dependencies between major classes that can start threads in their constructors, which start interacting with the other classes before construction is complete.
>> 8. Our XML handling is abysmally outdated and slow for multiple reasons. Our heavy XPath usage is incredibly wasteful and expensive.
>> 9. Our thread management is not understandable, not properly tunable, not efficient, sometimes buggy, not always consistent, and difficult to understand fundamentally.
>> 10. Our Jetty configuration is lacking in a variety of ways, especially around shutdown and HTTP/2.
>> 11. The dynamic schema feature can be very expensive and is not fully thread safe.
>> 12. The Overseer is extremely inefficient, can be extremely slow to stop, had a buggy leader election algorithm, doesn't handle session expiration as well as it should, can keep trying to come back from the dead, and the list goes on.
>> 13. Our connection reuse is often very poor or nonexistent; when it's improved, it always reverts back to bad or worse.
>> 14. HTTP/1.1 is not great for our type of application in a variety of ways that HTTP/2 solves - but we still use a lot of HTTP/1.1, HTTP/2 is not configured well, and the client needs some work.
>> 15. The lifecycle of important objects is often off; most things can and will leak (SolrCores, SolrIndexSearchers, Directory instances, Solr clients), and some things will close objects more than once, close objects that don't belong to them, or close things in a bad order.
>> 16. There are often sleeps and/or polling that are an order of magnitude slower than proper event-driven waits.
>> 17. Our tests are actually pretty unstable, and making them stable is way, way more difficult than most people realize. I'm quite sure I've spent much, much more time on this than anyone out there, and I can tell you the tests are not stable in 1,000 shifting ways that have caused, and will continue to cause, lots of damage.
>> 18. We don't have good async update/search support for scaling and better resource usage.
>> 19. We often duplicate resources or create new pools instead of sharing.
>> 20. We don't do tons of parallelizable stuff in parallel, and when we do, it's inconsistent.
>> 21. Our Collections API often cannot wait correctly for the proper state to be ready before returning. Even if it gets it right, a cloud client that made the request won't necessarily have the updated state locally when the request returns. Things often still work, but with a variety of interesting and slow results possible.
>> 22. We don't often holistically look at what we have built and how it fits together, and so often there are silly things, bad fits, one-off bad patterns, lazy attempts at something, etc.
>> 24. Close and shutdown are inefficient and slow across a huge swath of our object tree. These issues tend to be growy and breed less concern over time.
>> 25. There are a variety of ways and places that we can generate an absurd amount of unnecessary garbage.
>> 26. SolrCore reload is not fully reliable, yet it is increasingly important and used.
>> 27. The leader election code has a variety of ugly little bugs and is based on a recursive implementation that will eventually exhaust stack space - though it's likely your cluster will be brought down by something else before that is a problem (unless you hit the infinite-loop, no-one-can-be-leader, eat-up-the-stack-as-fast-as-possible case - which should be hard these days with the leader election throttle).
>> 28. The recovery processes, like almost everything you can imagine, have a variety of issues and rarer bad cases and effects.
>>
>> By and large, everything is inefficient and buggy and full of accepted compromise regardless. Interestingly, this does not make us an atypical open source Java distributed project. But I'm kind of a software snob, and I would not run this thing, and so I cannot work on it. What is there to do...
>>
>> The Solr Reference Branch is intended to tackle every one of those issues, as well as about 1000+ more of varying and lesser importance. As all of that comes together, cool stuff starts to unlock, and you begin to see some phenomena that are together much greater than the sum of their many, many parts.
>>
>> 29. Our tests have been getting better and better at stamping out the legit noise they create - every scream a breadcrumb towards badness - but we have built a scream-catching machine, though we will never be able to catch them all for a huge variety of deep reasons.
>>
>> The Solr Reference Branch
>> Document 2
>>
>> While the extent of the previously mentioned issues was not clear to me - that is a deep rabbit hole - I've always, as have many others, known the current state of things with Solr at a higher, broader level.
>>
>> So what about this effort is different? Is this not just a bunch of my standard JIRA issues all crammed into one? Should we not break them out properly and do things sensibly? Well, previously, as is probably common, I was both a bit lost on where we were exactly and certainly on where to find firmer ground for real, not just the mirage always just over the hill.
>>
>> I love performance and efficiency, though. I've always avoided it as a focus with Solr and SolrCloud, thinking stability has to come first. Having given up on stability and scale after a good 8 years or something, completely tossed out as a pipe dream, I started work on something new, something really just for me. I started plugging in HTTP/2. And the effort and work needed for that, and the learning, and some of the results, completely opened my eyes. I also attacked it very differently than I have in the past, doing something I like for me: I drowned myself in it. I spent 2-3 weeks at a time here and there sitting at the computer with intense focus for 16-20 hours a day. The more I did, the more I found, the more I understood, the more I discovered.
>>
>> I discovered a discovery process. It was leading me to everything I needed to do, and I just had to follow the long, ever-flowing path, keeping my mental models strong, re-etching, ruminating, obsessing.
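As an illustration of item 16 in the list above, here is a minimal, hypothetical sketch (not Solr code; all class and method names are invented for the example) of the difference between a sleep/poll wait and a proper event-driven wait:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class WaitSketch {

  // Sleep/poll: wakes on a fixed interval regardless of when the state
  // actually changes, so each wait pays up to pollMs of needless latency.
  static boolean pollForReady(BooleanSupplier ready, long pollMs, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!ready.getAsBoolean()) {
      if (System.nanoTime() >= deadline) return false;
      Thread.sleep(pollMs);
    }
    return true;
  }

  // Event-driven: the thread that changes the state counts the latch down,
  // and the waiter resumes immediately instead of on the next poll tick.
  static boolean awaitReady(CountDownLatch ready, long timeoutMs)
      throws InterruptedException {
    return ready.await(timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws Exception {
    CountDownLatch latch = new CountDownLatch(1);
    new Thread(latch::countDown).start();
    System.out.println(awaitReady(latch, 1000));
  }
}
```

The same timeout semantics apply to both, but the event-driven wait's latency is bounded by the actual state change, not by the poll interval.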
>> I realized many test functions we have - most - should be taking on the order of milliseconds instead of seconds to dozens of seconds. I realized tons and tons of our issues and gremlins lived and prospered in our slow and inefficient smog. I realized that if I just spent the time to look where slowness and flakiness prevailed - really look, like take hours just for some random side road; build a bridge, burn it, and build one further down, etc., etc. - then making huge improvement after huge improvement was actually very low-hanging fruit, just hidden by some thorns and leaves and a lack of any reasonable introspection into the system we have created and continue to build. Over time, I could see what had to be done and I could see what it would achieve. I built different parts at different times, lost them, and rebuilt them a different way with different focus. I built and expanded my introspection and measuring tools and classes.
>>
>> That's a sentence trying to cover a universe, but if you want to really boil it down even further, I'd invoke the normally faulty broken windows theory. There is magic in perfect windows that only those that have them know. Can we get perfect? I like to dream, and there is no end to the introspection, experimentation, and improvements to try. The perfect landing aside, though, no doubt we can move drastically from where we are.
>>
>> Another thing I learned is the crazy number of ways you can make all the tests pass like champions and still roll into production unusable. Which tells me that production users are a large part of our test strategy, and that can't be how we make any real change in a satisfactory way.
>>
>> The current goal is to have a mostly usable and testable system by mid-late October.
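As a hypothetical sketch related to item 8 in the list above (invented XML and element names, not Solr's actual parsing code): a single StAX streaming pass can extract values in one forward traversal, without materializing the DOM tree that repeated XPath evaluation walks over and over.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {

  // Counts <field> elements in one forward pass over the input.
  // Unlike DOM + XPath, nothing is kept in memory beyond the current event.
  static int countFields(String xml) throws Exception {
    XMLStreamReader r =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    int fields = 0;
    while (r.hasNext()) {
      if (r.next() == XMLStreamConstants.START_ELEMENT
          && "field".equals(r.getLocalName())) {
        fields++;
      }
    }
    r.close();
    return fields;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countFields("<doc><field/><field/><other/></doc>")); // prints 2
  }
}
```

The streaming reader also avoids recompiling XPath expressions per query, which is one plausible source of the waste the list describes.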
>> Not everything will be 100% - some known caveats and cleanup and plenty to do - but it should be in good shape for a user to try out, given the caveats outlined. The biggest risk currently is the absorption of the search-side async work from master. I'm familiar with that - I've worked on it myself, and the code involved is derived from an old branch of mine - but async is a whole different animal, and trying to nail it without any downsides versus the old synchronous model is a tough nut; one that I was already battling on the dist update side, so it's good stuff to work on and do, but it's taking some effort to get in shape.
>>
>> On Tue, Oct 6, 2020 at 8:00 PM Tomás Fernández Löbbe <[email protected]> wrote:
>>
>> > > Let's say we cut 9x and now there is a new master taken from the reference branch.
>> >
>> > I never said "make a new master", I said merge changes in the ref branch into master. If things are broken into pieces like Ishan is suggesting, those changes can be merged into 9.x too. I only suggested this because you felt unsure about merging to master now, and I guess this is due to fear of introducing bugs so close to a potential 9.0 release - is that not right?
>> >
>> > > We will never be able to reconcile these 2 branches
>> >
>> > Sorry, but how is that different if we do an alpha release from the branch now? What would be the process after that? Let's say people don't find issues and we want to merge those changes - what's the plan then?
>> >
>> > > Choice 1:
>> >
>> > I'm fine with choice 1 if that's what you want, as long as it's not an official release, for the reasons stated above.
>> >
>> > > I promise to do code review & cleanup as much as possible. But I'm hesitant to give a stamp of approval to make it THE official release
>> >
>> > What do you mean? I thought this is what you were suggesting: make an official release from the reference_impl branch?
>> > I think Ilan's last email is spot on, and I agree 100% with what he can express much better than I can :)
>> >
>> > > Mark's descriptions in Slack go in the right way but are still too high level
>> >
>> > Can someone share those here? Or in Jira?
>> >
>> > On Tue, Oct 6, 2020 at 5:09 AM Noble Paul <[email protected]> wrote:
>> >
>> >> > I think the danger is high to treat this branch as a black box (or an "all or nothing").
>> >>
>> >> True, Ilan. Ideally, I would like a few of us to study the code and start pulling in changes we are confident of (even to the 8x branch, why not). We cannot burden a single developer to do everything.
>> >>
>> >> This cannot be a task for just one or two devs. We will all have to work together to decompose the changes and digest them into master. I can do my bit.
>> >>
>> >> But I'm sure we may hit a point where certain changes cannot be isolated and absorbed. We will have to collectively make a call on how to absorb them.
>> >>
>> >> On Tue, Oct 6, 2020 at 9:00 PM Ishan Chattopadhyaya <[email protected]> wrote:
>> >> >
>> >> > > I'm willing to help and I believe others will too if the amount of work for contributing is reasonable (i.e. not a three months effort).
>> >> >
>> >> > I looked into the possibility of doing so. To me, it seemed very hard to do: possibly a one-year project for me. The problem is that it is hard to pull out a particular class of improvements (say, thread management improvements) and have all tests pass with it (because the tests have gotten extensive improvements of their own) and also observe the effect of the improvement. IIUC, every improvement to Solr seemed to require many iterations to get the tests happy. I remember Mark telling me that it may not even be possible for him to do something like that (i.e. bring all changes into master as tiny pieces).
>> >> > What I volunteered to do, however, is to decompose roughly all the general improvements into smaller, manageable commits. However, making sure all tests pass at every commit point is beyond my capability.
>> >> >
>> >> > On Tue, 6 Oct, 2020, 3:10 pm Ilan Ginzburg, <[email protected]> wrote:
>> >> >>
>> >> >> Another option to integrate this work into the main code line would be to understand what changes have been made and where (Mark's descriptions in Slack go in the right way but are still too high level), and then port or even redo them in main, one by one.
>> >> >>
>> >> >> I think the danger is high to treat this branch as a black box (or an "all or nothing"). Using the merging itself to change our understanding and increase our knowledge of what was done can greatly reduce the risk.
>> >> >>
>> >> >> We do develop new features in Solr 9 without beta-releasing them, so if we port Mark's improvements in small chunks (and maybe in the process decide that some should not be ported, or not now), I don't see why this can't integrate to become like other improvements done to the code. If specific changes do require a beta release, do that release from master and pick the right moment.
>> >> >>
>> >> >> I'm willing to help, and I believe others will too if the amount of work for contributing is reasonable (i.e. not a three-month effort). This requires documenting the changes done in that branch, pointing to where these changes happened, and then picking them up one by one and porting them more or less independently of each other. We might only port a subset of changes by the time 9.0 is released; that's fine, we can continue in following releases.
>> >> >>
>> >> >> My 2 cents...
>> >> >> Ilan
>> >> >>
>> >> >> On Tue, Oct 6, 2020 at 09:56, Noble Paul <[email protected]> wrote:
>> >> >>>
>> >> >>> Yes, a Docker image will definitely help.
>> >>> I wasn't trying to downplay that.
>> >>>
>> >>> On Tue, Oct 6, 2020 at 6:55 PM Ishan Chattopadhyaya <[email protected]> wrote:
>> >>> >
>> >>> > > Docker is not a big requirement for large scale installations. Most of them already have their own install scripts. Availability of docker is not important for them. If a user is only encouraged to install Solr because of a docker image, most likely they are not running a large enough cluster
>> >>> >
>> >>> > I disagree, Noble. Having a Docker image is going to be useful to some clients with complex use cases. Great point, David!
>> >>> >
>> >>> > On Tue, 6 Oct, 2020, 1:09 pm Ishan Chattopadhyaya, <[email protected]> wrote:
>> >>> >>
>> >>> >> As I said, I'm *personally* not confident in putting such a big changeset into master that wasn't widely vetted in a real user environment. I have, in the past, done enough bad things to Solr (directly or indirectly), and I don't want to repeat the same. Also, I'd be very uncomfortable if someone else did so.
>> >>> >>
>> >>> >> Having said this, if someone else wants to port the changes over to master *without first getting enough real-world testing*, feel free to do so, and I can focus my efforts elsewhere.
>> >>> >>
>> >>> >> On Tue, 6 Oct, 2020, 9:22 am Tomás Fernández Löbbe, <[email protected]> wrote:
>> >>> >>>
>> >>> >>> I was thinking (and I haven't fleshed it out completely, but will throw the idea out) that an alternative approach with this timeline could be to cut the 9x branch around November/December. Then you could merge into master, and it would have the latest changes from master plus the ref branch changes. From there, any nightly build could be used to help test/debug.
>> >>> >>>
>> >>> >>> That said, I don't know for sure which changes in the branch do not belong in 9.
>> >>> >>> The problem with them being 10x-only is that backports would potentially be more difficult for all the life of 9.
>> >>> >>>
>> >>> >>> On Mon, Oct 5, 2020 at 4:54 PM Noble Paul <[email protected]> wrote:
>> >>> >>>>
>> >>> >>>> > I don't think it can be said what committers do and don't do with regards to running Solr. All of us would answer this differently and at different points in time.
>> >>> >>>>
>> >>> >>>> "I have run it in one large cluster, so it is certified to be bug free/stable" - I don't think that's a reasonable approach. We need as much feedback from our users as possible, because each of them stresses Solr in a different way. This is not to suggest that committers are not doing testing or that their tests are not valid. When I talk to the committers out here, they say they do not see any performance or stability issues at all. But my client reports issues on a day-to-day basis.
>> >>> >>>>
>> >>> >>>> > Definitely publish a Docker image BTW -- it's the best way to try out any software.
>> >>> >>>>
>> >>> >>>> Docker is not a big requirement for large-scale installations. Most of them already have their own install scripts. Availability of Docker is not important for them. If a user is only encouraged to install Solr because of a Docker image, most likely they are not running a large enough cluster.
>> >>> >>>>
>> >>> >>>> On Tue, Oct 6, 2020, 6:30 AM David Smiley <[email protected]> wrote:
>> >>> >>>>>
>> >>> >>>>> Thanks so much for your responses, Ishan... I'm getting much more information in this thread than in my attempts to get questions answered on the JIRA issue months ago. And especially, thank you for volunteering for the difficult porting efforts!
>> >> >>> >>>>> >> >> >>> >>>>> Tomas said: >> >> >>> >>>>>> >> >> >>> >>>>>> I do agree with the previous comments that calling it >> "Solr 10" (even with the "-alpha") would confuse users, maybe use >> "reference"? or maybe something in reference to SOLR-14788? >> >> >>> >>>>> >> >> >>> >>>>> >> >> >>> >>>>> I have the opposite opinion. This word "reference" is >> baffling to me despite whatever Mark's explanation is. I like the >> justification Ishan gave for 10-alpha and I don't think I could re-phrase >> his justification any better. *If* the release was _not_ official (thus >> wouldn't show up in the usual places anyone would look for a release), I >> think it would alleviate that confusion concern even more, although I think >> "alpha" ought to be enough of a signal not to use it without digging deeper >> on what's going on. >> >> >>> >>>>> >> >> >>> >>>>> Alex then Ishan said: >> >> >>> >>>>>> >> >> >>> >>>>>> > Maybe we could release it to >> >> >>> >>>>>> > committers community first and dogfood it "internally"? >> >> >>> >>>>>> >> >> >>> >>>>>> Alex: It is meaningless. Committers don't run large scale >> installations. We barely even have time to take care of running unit tests >> before destabilizing our builds. We are not the right audience. However, we >> all can anyway check out the branch and start playing with it, even without >> a release. There are orgs that don't want to install any code that wasn't >> officially released; this release is geared towards them (to help us test >> this at their scale). >> >> >>> >>>>> >> >> >>> >>>>> >> >> >>> >>>>> I don't think it can be said what committers do and don't do >> with regards to running Solr. All of us would answer this differently and >> at different points in time. From time to time, though not at present, >> I've been well positioned to try out a new version of Solr in a stage/test >> environment to see how it goes. (Putting on my Salesforce metaphorical >> hat...) 
>> >>> >>>>> Even though I'm not able to deploy it in a realistic way today, I'm able to run a battery of tests to see if any of the features we depend on has changed or is broken. That's useful feedback for an alpha release! And even though I'm saying I'm not well positioned to try out some new Solr release in a production-ish setting now, it's something I could make a good case for internally, since upgrades take a lot of effort where I work. It's in our interest for SolrCloud to be very stable (of course).
>> >>> >>>>>
>> >>> >>>>> Regardless, I think what you're driving at, Ishan, is that you want an "official" release - one that goes through the whole ceremony. You believe that people would be more likely to use it. I think all we need to do is announce (similar to a real release) that there is some unofficial alpha distribution and that we want to solicit feedback - basically, help us find bugs. Definitely publish a Docker image BTW - it's the best way to try out any software. I'm -0 on doing an official release for alpha software because it's unnecessary to achieve the goals and somewhat confusing. I think the Solr 4 alpha/beta situation was different - it was not some fork a committer was maintaining; it was the master branch of its time, and it was destined to be the very next release, not some possible future release.
>> >> >>> >>>>> >> >> >>> >>>>> ~ David Smiley >> >> >>> >>>>> Apache Lucene/Solr Search Developer >> >> >>> >>>>> http://www.linkedin.com/in/davidwsmiley >> >> >>> >> >> >>> >> >> >>> >> >> >>> -- >> >> >>> ----------------------------------------------------- >> >> >>> Noble Paul >> >> >>> >> >> >>> >> --------------------------------------------------------------------- >> >> >>> To unsubscribe, e-mail: [email protected] >> >> >>> For additional commands, e-mail: [email protected] >> >> >>> >> >> >> >> >> >> -- >> >> ----------------------------------------------------- >> >> Noble Paul >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: [email protected] >> >> For additional commands, e-mail: [email protected] >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >> > > -- > Anshum Gupta >
