Re: SolrCloud is sick.

Noble Paul Sat, 02 Nov 2019 22:39:04 -0700

Solr has to do more than Lucene. A Lucene user is mostly a developer
who reads javadocs. A Solr user's touch points are


* Public API
* Ref guide
* Publicly visible files (in ZK as well as the file system)
* What to see/look for in the log files to debug issues

Then we have more nuanced touch points such as the knowledge base of
what happens internally in the system when 'X' API is invoked or when
'Y' behavior is observed in ZK data.

The problem with delaying the review process till code completion is
that any changes based on review comments will require a massive amount
of work.

I don't have an answer to how we achieve it. But, I clearly see this
as a major gap in our development process today.

This discussion may not be relevant in this thread, maybe because no
behavior is changed at all. We don't know yet.

What I want to believe is Mark is doing the right thing & it's gonna
help us all in dealing with our operational issues. I don't want to
interrupt his work with more discussions.

Thank you


On Sun, Nov 3, 2019 at 3:32 PM David Smiley <[email protected]> wrote:
>
> Yeah we do a bad job of the things you listed Noble.  :-(   My colleagues 
> want pointers to internal docs but the sad reality is there aren't any.  You
> may notice I'm a stickler in my code reviews for requiring javadocs on all 
> top level classes.  I think more javadocs and code comments would be very 
> helpful -- especially for the major classes.  This might help us all and 
> others a lot more.  For example I think Lucene does a rather fine job of this 
> for its major classes -- IndexWriter being a good example.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <[email protected]> wrote:
>>
>> Hi,
>>
>> I believe there is a consensus on what is wrong with the way we have built 
>> the cluster state and overseer. We need to focus a bit more on the design 
>> aspect. Design, according to me, has the following elements:
>>
>> * How does it work?
>>
>> * What are the performance characteristics? Can it be done more efficiently?
>>
>> * What are the public touch points?
>>
>> ** Which are the files we store in ZK? Are they expected to be watched 
>> always?
>>
>> ** Or are they read on demand?
>>
>> ** The public APIs. Do they make sense to the user? Can they be further
>> simplified? How do they compare to the other APIs in the system?
>>
>>
>> We, as a community, do a bad job in dealing with these. While we focus on 
>> internal things, these are not discussed until it is too late. We usually
>> do coding, tests, code review (sometimes) and commit. This leads to huge 
>> technical debt.
>>
>>
>> This is not to put blame on one person or a group of people. (I occasionally
>> see people discussing design issues upfront; I just hope that becomes the norm.)
>>
>>
>> Now, why am I discussing this in this thread?
>>
>>
>> While we agree there are problems, we are trying to solve the problem using 
>> the same process we used to create these problems. Again, I'm not 
>> questioning the intent or competence of anyone. Unless we set the process 
>> right, we are doomed to make the same mistakes again.
>>
>>
>> I wholeheartedly endorse any effort to improve SolrCloud/overseer. At the
>> same time I fail to see us leveraging the collective experience of our 
>> community through meaningful discussion.
>>
>>
>> I hope we don't resort to personal attacks and use this as an opportunity to 
>> improve our processes.
>> Thanks
>>
>> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <[email protected]> wrote:
>>>
>>> Very much agreed.  I've been trying to figure out for a long time what the
>>> point is of having a replica DOWN state that has to be toggled (DOWN and
>>> then UP!) every time a node restarts, considering that we could just
>>> combine ACTIVE and `live_nodes` to understand whether a replica is
>>> available.  It's not even foolproof, since kill -9 on a Solr node won't mark
>>> all the replicas DOWN -- that doesn't happen until the node comes back up
>>> (perversely).
>>>
>>> What would it take to get to a state where restarting a node would require 
>>> a minimal amount of ZK work in most cases?
>>>
>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <[email protected]> wrote:
>>>>
>>>> Give me a short bit to follow up and I will lay out my case and proposal.
>>>>
>>>> Everyone is then free to decide that we need to do something drastic or 
>>>> that I'm wrong and we should just continue down the same road. If that's 
>>>> the case, a lot of your work will get a lot easier and less impeded by me 
>>>> and we will still all be happier. Win win.
>>>>
>>>> If we can just not make drastic changes for just a brief window of a week
>>>> or so, I'll say what I have to say, and you guys can judge and do whatever
>>>> you please.
>>>>
>>>> - mark
>>>>
>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <[email protected]> wrote:
>>>>>
>>>>> Hey All Solr Devs,
>>>>>
>>>>> SolrCloud is sick right now. The way low-level ZooKeeper is handled, the
>>>>> Overseer, is a mix and mess of proper exception handling and super slow
>>>>> startup and shutdown, adding new things all the time with no concern for
>>>>> performance or proper ordering (which is harder to tell than you think).
>>>>>
>>>>> Our class dependency graph doesn't even work - we just force it. Sort of.
>>>>> If the whole system doesn't block and choke its way to a start slowly
>>>>> enough, lots of things fail.
>>>>>
>>>>> This thing coughs up, you toss stuff into the storm, and a good chunk of
>>>>> the time, what you want eventually comes back without causing too much damage.
>>>>>
>>>>> There are so many things that are off or just plain wrong and the list is
>>>>> growing and growing. No one is following this or if you are, please back
>>>>> me up. This thing will collapse under its own weight.
>>>>>
>>>>> So if you want to add yet another state format cluster state or some 
>>>>> other optimization on this junk heap, you can expect me to push back.
>>>>>
>>>>> We should all be embarrassed by the state of things.
>>>>>
>>>>> I've got some ideas for addressing them that I'll share soon, but god, 
>>>>> don't keep optimizing a turd in non backcompat Overseer loving ways. That 
>>>>> Overseer is an atrocity.
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>
>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller



-- 
-----------------------------------------------------
Noble Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
