I mean the reality is - why do we not have just a single watcher per node pulling in state? Why are we not tracking and minimizing state transfers and changes? Why are we not measuring the time it takes to round trip a state.json and adjusting? Looking at load to adjust overseerish duties and leader election? A million other smart things?
Because it's too hard. It's too hard and we all gave up long ago on figuring out what to do about it. Because we are programming in assembly in an abyss when we should be doing Java in the clouds. Everyone knows the SolrCloud DNA one way or another. We all somehow made our peace with it or not. It's easy when you don't go deep. Hell, that's easy to forget even if you do. But I'm looping on it now, have to eject.

- Mark

On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <markrmil...@gmail.com> wrote:

> Not much. Something you can understand. How about tests < 10 seconds fail
> or not. Good logging and, as a backup, good debug logging. Docs on how things
> are designed to work? Tracking of all important operations and how long
> they take with tight cutoffs? Proper response to interruption 100% of the
> time? The idea of a cluster start and stop? Of a cluster install to ZK
> initially. Drop all legacyCloud support, stateformat=1 support, maybe a few
> other things.
>
> I've got some stuff, I'm gonna pull out as fast as I sensibly can given
> many setbacks and too little sleep for a long time.
>
> I'm not here to do all of the lift for everyone, but unless I get sick
> in the next week or two or my 10 backup methods and git pushes and backup
> branches fail or I just burn the hell out, I have a solid refuge that we
> can knock out and then build on with confidence.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dragonsi...@gmail.com> wrote:
>
>> Very much agreed. I've been trying to figure out for a long time what is
>> the point in having a replica DOWN state that has to be toggled (DOWN and
>> then UP!) every time a node restarts, considering that we could just
>> combine ACTIVE and `live_nodes` to understand whether a replica is
>> available. It's not even foolproof, since kill -9 on a Solr node won't mark
>> all the replicas DOWN; that doesn't happen until the node comes back up
>> (perversely).
>>
>> What would it take to get to a state where restarting a node would
>> require a minimal amount of ZK work in most cases?
>>
>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Give me a short bit to follow up and I will lay out my case and proposal.
>>>
>>> Everyone is then free to decide that we need to do something drastic or
>>> that I'm wrong and we should just continue down the same road. If that's
>>> the case, a lot of your work will get a lot easier and less impeded by me
>>> and we will still all be happier. Win win.
>>>
>>> If we can just not make drastic changes for just a brief week or so
>>> window, I'll say what I have to say, you guys can judge and do whatever
>>> you'd please.
>>>
>>> - mark
>>>
>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmil...@gmail.com>
>>> wrote:
>>>
>>>> Hey all Solr devs,
>>>>
>>>> SolrCloud is sick right now. The way low-level ZooKeeper is handled,
>>>> the Overseer, is a mix and mess of proper exception handling and super slow
>>>> startup and shutdown, adding new things all the time with no concern for
>>>> performance or proper ordering (which is harder to tell than you think).
>>>>
>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>> of. If the whole system doesn't block and choke its way to a start slow
>>>> enough, lots of things fail.
>>>>
>>>> This thing coughs up, you toss stuff into the storm, and a good chunk of
>>>> the time, what you want eventually comes back without causing too much damage.
>>>>
>>>> There are so many things that are off or just plain wrong and the list
>>>> is growing and growing. No one is following this, or if you are, please back
>>>> me up. This thing will collapse under its own weight.
>>>>
>>>> So if you want to add yet another state format cluster state or some
>>>> other optimization on this junk heap, you can expect me to push back.
>>>>
>>>> We should all be embarrassed by the state of things.
>>>>
>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>> don't keep optimizing a turd in non-backcompat, Overseer-loving ways. That
>>>> Overseer is an atrocity.
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
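
For readers following the thread: Scott's suggestion above, deriving availability from ACTIVE plus `live_nodes` instead of toggling a DOWN state on every restart, might look roughly like the sketch below. This is a hypothetical illustration only, not Solr code: the `ReplicaAvailability` class, its `State` enum, the `Replica` record, and the node names are all made up for the example, and the live-node set stands in for whatever a node would read from ZooKeeper.

```java
import java.util.List;
import java.util.Set;

public class ReplicaAvailability {

    // Hypothetical, simplified replica states (not Solr's actual enum).
    public enum State { ACTIVE, RECOVERING, DOWN }

    // Hypothetical view of one replica as recorded in cluster state.
    public record Replica(String name, String nodeName, State state) {}

    // A replica counts as available only if its recorded state is ACTIVE
    // AND the node hosting it is currently listed as live.
    public static boolean isAvailable(Replica replica, Set<String> liveNodes) {
        return replica.state() == State.ACTIVE
                && liveNodes.contains(replica.nodeName());
    }

    // Names of all replicas that pass the combined check.
    public static List<String> availableReplicas(List<Replica> replicas,
                                                 Set<String> liveNodes) {
        return replicas.stream()
                .filter(r -> isAvailable(r, liveNodes))
                .map(Replica::name)
                .toList();
    }

    public static void main(String[] args) {
        // node2 was killed with kill -9: its replica is still recorded as
        // ACTIVE, but the node is no longer in the live set.
        Set<String> liveNodes = Set.of("node1:8983_solr");
        List<Replica> replicas = List.of(
                new Replica("core_a", "node1:8983_solr", State.ACTIVE),
                new Replica("core_b", "node2:8983_solr", State.ACTIVE),
                new Replica("core_c", "node1:8983_solr", State.RECOVERING));

        System.out.println(availableReplicas(replicas, liveNodes)); // prints [core_a]
    }
}
```

Under this assumption the kill -9 case needs no state write at all: the stale ACTIVE entry for `core_b` is simply ignored because its node is absent from the live set.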