Yeah, we do a bad job of the things you listed, Noble. :-( My colleagues want pointers to internal docs, but the sad reality is there aren't any. You may notice I'm a stickler in my code reviews for requiring javadocs on all top-level classes. I think more javadocs and code comments would be very helpful -- especially for the major classes. This might help us all, and others, a lot more. For example, I think Lucene does a rather fine job of this for its major classes -- IndexWriter being a good example.
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <noble.p...@gmail.com> wrote:

> Hi,
>
> I believe there is a consensus on what is wrong with the way we have built
> the cluster state and overseer. We need to focus a bit more on the design
> aspect. Design, according to me, has the following elements:
>
> * How does it work?
>
> * What are the performance characteristics? Can it be done more
> efficiently?
>
> * What are the public touch points?
>
> ** Which are the files we store in ZK? Are they expected to be watched
> always?
>
> ** Or are they read on demand?
>
> ** The public APIs. Does it make sense to the user? Can it be further
> simplified? How does it compare to the other APIs in the system?
>
> We, as a community, do a bad job in dealing with these. While we focus on
> internal things, these are not discussed until it is too late. We usually
> do coding, tests, code review (sometimes) and commit. This leads to huge
> technical debt.
>
> This is not to put blame on one person or a group of people. (I
> occasionally see people discussing design issues upfront, I just hope that
> is the norm.)
>
> Now, why am I discussing this in this thread?
>
> While we agree there are problems, we are trying to solve the problem
> using the same process we used to create these problems. Again, I'm not
> questioning the intent or competence of anyone. Unless we set the process
> right, we are doomed to make the same mistakes again.
>
> I wholeheartedly endorse any effort to improve SolrCloud/overseer. At the
> same time I fail to see us leveraging the collective experience of our
> community through meaningful discussion.
>
> I hope we don't resort to personal attacks and instead use this as an
> opportunity to improve our processes.
> Thanks
>
> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dragonsi...@gmail.com> wrote:
>
>> Very much agreed.
>> I've been trying to figure out for a long time what the point is of
>> having a replica DOWN state that has to be toggled (DOWN and then UP!)
>> every time a node restarts, considering that we could just combine
>> ACTIVE and `live_nodes` to understand whether a replica is available.
>> It's not even foolproof, since kill -9 on a Solr node won't mark all the
>> replicas DOWN -- that doesn't happen until the node comes back up
>> (perversely).
>>
>> What would it take to get to a state where restarting a node would
>> require a minimal amount of ZK work in most cases?
>>
>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Give me a short bit to follow up and I will lay out my case and proposal.
>>>
>>> Everyone is then free to decide that we need to do something drastic or
>>> that I'm wrong and we should just continue down the same road. If that's
>>> the case, a lot of your work will get a lot easier and less impeded by me
>>> and we will still all be happier. Win win.
>>>
>>> If we can just not make drastic changes for just a brief week or so
>>> window, I'll say what I have to say, you guys can judge and do whatever
>>> you'd please.
>>>
>>> - mark
>>>
>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmil...@gmail.com>
>>> wrote:
>>>
>>>> Hey All Solr Devs,
>>>>
>>>> SolrCloud is sick right now. The way low level ZooKeeper is handled,
>>>> the Overseer, is a mix and mess of proper exception handling and super slow
>>>> startup and shutdown, adding new things all the time with no concern for
>>>> performance or proper ordering (which is harder to tell than you think).
>>>>
>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>> of. If the whole system doesn't block and choke its way to a start slow
>>>> enough, lots of things fail.
>>>>
>>>> This thing coughs up, you toss stuff into the storm, a good chunk of
>>>> the time, what you want eventually comes back without causing too much damage.
>>>>
>>>> There are so many things that are off or just plain wrong and the list
>>>> is growing and growing. No one is following this or if you are, please back
>>>> me up. This thing will collapse under its own weight.
>>>>
>>>> So if you want to add yet another state format cluster state or some
>>>> other optimization on this junk heap, you can expect me to push back.
>>>>
>>>> We should all be embarrassed by the state of things.
>>>>
>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>> don't keep optimizing a turd in non-backcompat, Overseer-loving ways. That
>>>> Overseer is an atrocity.
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>
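
For what it's worth, Scott's idea quoted above (derive availability from ACTIVE plus `live_nodes` rather than toggling a persisted DOWN state) could look roughly like the sketch below. This is only an illustration under stated assumptions: `ReplicaAvailability`, its nested `State` and `Replica` types, and `isAvailable` are hypothetical names invented here, not Solr's actual classes or API, and the live node set is just passed in as a plain `Set<String>`.

```java
import java.util.Set;

/**
 * Hypothetical sketch, not Solr code: a replica counts as available when its
 * stored state is ACTIVE and its hosting node is in the live node set.
 */
public class ReplicaAvailability {

    /** Simplified, made-up state enum; note there is no DOWN to toggle. */
    public enum State { ACTIVE, RECOVERING }

    /** Minimal stand-in for what would be persisted about a replica. */
    public static final class Replica {
        final String name;
        final String nodeName;
        final State state;

        public Replica(String name, String nodeName, State state) {
            this.name = name;
            this.nodeName = nodeName;
            this.state = state;
        }
    }

    /**
     * Availability is computed at read time. If the node dies (even via
     * kill -9), it drops out of liveNodes and the replica reads as
     * unavailable without any write to the stored replica state.
     */
    public static boolean isAvailable(Replica replica, Set<String> liveNodes) {
        return replica.state == State.ACTIVE && liveNodes.contains(replica.nodeName);
    }

    public static void main(String[] args) {
        Replica r = new Replica("core_node1", "host1:8983_solr", State.ACTIVE);
        System.out.println(isAvailable(r, Set.of("host1:8983_solr")));
        System.out.println(isAvailable(r, Set.of("host2:8983_solr")));
    }
}
```

With this shape, a node restart would not need to rewrite per-replica state at all in the common case; only the live node membership changes. Whether that holds up against recovery and leader election is exactly the kind of design question raised in the thread.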