Not much. Something you can understand. How about: tests that pass or fail in under 10 seconds; good logging, with good debug logging as a backup; docs on how things are designed to work; tracking of all important operations and how long they take, with tight cutoffs; a proper response to interruption 100% of the time; the idea of a cluster start and stop, and of an initial cluster install to ZK. Drop all legacyCloud support, stateFormat=1 support, and maybe a few other things.
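To make the interruption point concrete, this is the kind of discipline I mean anywhere we block or wait - a minimal sketch in plain Java (the class and names here are made up for illustration, not code from any branch): never swallow InterruptedException, restore the flag, and get out promptly.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    /** Illustration only: a worker loop that always honors interruption. */
    public class InterruptAwareWorker implements Runnable {

      private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();

      public void submit(Runnable task) {
        tasks.add(task);
      }

      @Override
      public void run() {
        // Check the flag on every pass so a stop request is never ignored.
        while (!Thread.currentThread().isInterrupted()) {
          try {
            // Bounded wait - never block forever waiting for work.
            Runnable task = tasks.poll(1, TimeUnit.SECONDS);
            if (task != null) {
              task.run();
            }
          } catch (InterruptedException e) {
            // Never swallow this: restore the flag and bail out right away.
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
    }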
I've got some stuff that I'm gonna pull out as fast as I sensibly can, given many setbacks and too little sleep for a long time. I'm not here to do all of the lifting for everyone, but unless I get sick in the next week or two, or my 10 backup methods and git pushes and backup branches fail, or I just burn the hell out, I have a solid refuge that we can knock out and then build on with confidence.

- Mark

On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dragonsi...@gmail.com> wrote:

> Very much agreed. I've been trying to figure out for a long time what the point is in having a replica DOWN state that has to be toggled (DOWN and then UP!) every time a node restarts, considering that we could just combine ACTIVE and `live_nodes` to understand whether a replica is available. It's not even foolproof, since kill -9 on a Solr node won't mark all the replicas DOWN - that doesn't happen until the node comes back up (perversely).
>
> What would it take to get to a state where restarting a node would require a minimal amount of ZK work in most cases?
>
> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmil...@gmail.com> wrote:
>
>> Give me a short bit to follow up and I will lay out my case and proposal.
>>
>> Everyone is then free to decide that we need to do something drastic, or that I'm wrong and we should just continue down the same road. If that's the case, a lot of your work will get a lot easier and less impeded by me, and we will all still be happier. Win win.
>>
>> If we can just not make drastic changes for a brief week-or-so window, I'll say what I have to say, and you guys can judge and do whatever you please.
>>
>> - mark
>>
>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Hey all Solr devs,
>>>
>>> SolrCloud is sick right now. The way low-level ZooKeeper is handled - the Overseer - is a mixed-up mess of exception handling and super slow startup and shutdown, with new things added all the time and no concern for performance or proper ordering (which is harder to tell than you think).
>>>
>>> Our class dependency graph doesn't even work - we just force it. Sort of. If the whole system doesn't block and choke its way to a slow enough start, lots of things fail.
>>>
>>> This thing coughs along: you toss stuff into the storm and, a good chunk of the time, what you want eventually comes back without causing too much damage.
>>>
>>> There are so many things that are off or just plain wrong, and the list keeps growing. No one is following this - or if you are, please back me up. This thing will collapse under its own weight.
>>>
>>> So if you want to add yet another cluster state format or some other optimization on this junk heap, you can expect me to push back.
>>>
>>> We should all be embarrassed by the state of things.
>>>
>>> I've got some ideas for addressing them that I'll share soon, but god, don't keep optimizing a turd in non-backcompat, Overseer-loving ways. That Overseer is an atrocity.
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>

--
- Mark

http://about.me/markrmiller
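For reference, the "combine ACTIVE and live_nodes" check Scott describes above looks roughly like this against SolrJ's cloud classes. This is a sketch only; the method names (Replica.getState(), Replica.getNodeName(), Replica.State.ACTIVE) should be verified against the SolrJ version in use, and Replica.isActive(liveNodes) may already do exactly this.

    import java.util.Set;

    import org.apache.solr.common.cloud.Replica;

    /**
     * Sketch of the "ACTIVE + live_nodes" availability check described above.
     * Not authoritative - verify against the SolrJ cloud classes actually in use.
     */
    public class ReplicaAvailability {

      /** A replica is usable only if its recorded state is ACTIVE *and* its node is live. */
      public static boolean isAvailable(Replica replica, Set<String> liveNodes) {
        // The recorded state alone is not enough: a kill -9 leaves the replica
        // marked ACTIVE, so live_nodes (ephemeral ZK entries, one per Solr node)
        // is the real liveness signal.
        return replica.getState() == Replica.State.ACTIVE
            && liveNodes.contains(replica.getNodeName());
      }
    }

That also matches the kill -9 point: the recorded replica state can lie, but an ephemeral live_nodes entry goes away as soon as the process dies.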