Re: Solr and Jetty and Servlets

Mark Miller Wed, 18 Aug 2021 00:08:33 -0700

Agreement or not, I wouldn’t step in front of whatever you did.

I doubt anyone is going to take on actually pulling out of a web app
because it’s easier to lean on it for configuration and I doubt start up
speed is a very high priority for most.


It had never really occurred to me in the past that JettySolrRunner was so
much faster. I just know it because of driving the test down to
ridiculously times, starting up and stopping mini clusters of whatever size
in literally no time. With no QuickStart setup for embedded Jetty.

So than starting up standalone Solr, it became apparent how much slower it
was. So I started poking around. Cloud guys like those at google had
already been feeling pain - preferring to be able to spin up and down
relatively quickly - and so this QuickStart “precompile” web.xml / JSP’s
thing how they were dealing with it. And it’s simple, but it’s still easily
non comparable at the end of the day. Solr ships with like 10,000 classes
across a ton of jars, even with QuickStart, you have to follow the spec of
what to do and what might be done by any given webapp - though as
JettySolrRunner shows, we don’t actually need any of it. QuickStart is
trying to find every trick to speed up following the container and webapp
spec. Injecting the dispatch filter just tosses it all out :)

Any, may point was not that I disagree and you should make Solr inject a
dispatch filter - my point is just that it doesn’t really matter to
dispatch. You can do just as much via Jetty either way.

And anyone that doesn’t agree about “something should be done” is either
lying or not looking. So I wouldn’t worry about agreement. The problem is
that it’s quicksand that people are avoiding, not that anyone that looks
can’t see it a huge rats nest. Look at the V2 API. Someone actually wanted
to work towards getting out of that hole. But at the end of the day, it’s
now part of the hole. Made it deeper. Bermuda Triangle. But look at it -
there was hope there. Hope that one day, dispatching would be better.
Rather than twice as complicated. I don’t know that hope lives on. But no
one gonna stop you on reasonable technical sense with any proper license.

Mark



On Tue, Aug 17, 2021 at 10:12 AM Gus Heck <[email protected]> wrote:

> *ServletContextHandler not ServletContext
>
> On Tue, Aug 17, 2021 at 11:03 AM Gus Heck <[email protected]> wrote:
>
>> Ok nope, since that only happens with WebAppContext, not ServletContext
>> (the parent class). whew :)
>>
>> On Tue, Aug 17, 2021 at 10:33 AM Gus Heck <[email protected]> wrote:
>>
>>> WRT the JettySolrRunner tangent, I wonder if our lack of a web.xml and
>>> thus lack of metadata-complete=true in web.xml means that we wind up
>>> scanning for annotations?
>>>
>>> On Tue, Aug 17, 2021 at 10:06 AM Gus Heck <[email protected]> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/SOLR-15590
>>>>
>>>> On Tue, Aug 17, 2021 at 9:44 AM Gus Heck <[email protected]> wrote:
>>>>
>>>>> Ok well it would be interesting to compare quickstarting vs
>>>>> JettySolrRunner, and reading up on quickstart
>>>>> <https://webtide.com/jetty-9-quick-start/> gives ideas about
>>>>> pre-building the quickstart xml, but web.xml for JettySolrRunner is a
>>>>> tangent. Either way we are still in a container and I think I hear some
>>>>> agreement that something should be done about the dispatch here, and both
>>>>> of you seem to agree that an actual servlet would make more sense than a
>>>>> filter, so I'll make a ticket/pr to make it easier to track & easier to
>>>>> read the code.
>>>>>
>>>>> On Tue, Aug 17, 2021 at 1:41 AM Mark Miller <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The downside to respecting web.xml and making JettySolrServer serve a
>>>>>> webapp is that loading a webapp is very expensive and slow in comparison.
>>>>>> JettySolrServer actually starts up extremely quickly. It’s almost more
>>>>>> appealing to change the Server to use the JettySolrServer strategy. It’s 
>>>>>> so
>>>>>> slow to load a webapp because of all the things it needs to support and
>>>>>> scan jars for - in a kind of JavaEE situation, though not as bad. Jetty
>>>>>> QuickStart does improve the situation if used at least. But there is 
>>>>>> really
>>>>>> no need to eat the whole webapp standard for a non webapp app to have 
>>>>>> Jetty
>>>>>> manage more of the dispatching. I always wondering about the motivation /
>>>>>> upside of hacking in a straight servlet myself. But for a non webapp 
>>>>>> stuck
>>>>>> in a servlet container, it’s actually a beautiful move that side steps a
>>>>>> bunch of slow crap even better than the QuickStart pushes ever will, and
>>>>>> still allows for matching any support we would want.
>>>>>>
>>>>>> The HttpSolrCallV2 extending HttpSolrCall is horrendous. Personally,
>>>>>> I made a slim SolrCall base class that each extends. All that logic is
>>>>>> hairy enough to follow without them interleaving and sharing in a kind of
>>>>>> wild collaboration. It’s pretty unfortunate they both still exist.
>>>>>>
>>>>>> You are right, leaving that path in that state is a poor idea. If
>>>>>> there was anything that could be considered the “hot” path, it’s right
>>>>>> there - feverishly checking and ripping up streams for anyone and 
>>>>>> anything.
>>>>>> Is a core? A collection? A handler? Maybe a V2 handler? What I’d we cut 
>>>>>> the
>>>>>> path down? Do a dance? Treat the string like a date and see if that 
>>>>>> works.
>>>>>>
>>>>>> So yeah, agreed, nobody does or would dispatch this way, you have to
>>>>>> frog boil into it. It’s slow, almost incomprehensible, and incredibly 
>>>>>> good
>>>>>> at holding onto life, even building on it. But being a webapp and parsing
>>>>>> web.xml has little to do with dispatching and having sensible http api 
>>>>>> that
>>>>>> doesn’t work like it’s a perl compiler.
>>>>>>
>>>>>> Anyway, I can’t imagine trying to trace through that code with any
>>>>>> solid feeling about having a handle on it via inspection. But just like
>>>>>> Perl, if you debug / trace log the hell out of it, it is all actually
>>>>>> pretty simple to adjust and reign in.
>>>>>>
>>>>>>
>>>>>> MRM
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 16, 2021 at 10:39 PM Gus Heck <[email protected]> wrote:
>>>>>>
>>>>>>> So I'm not interested in *adhering* to anything here, just using
>>>>>>> stuff where it helps... I see tools sitting on the shelf, already bought
>>>>>>> and paid for and carried with us to every job but then they sit in the 
>>>>>>> back
>>>>>>> of the truck mostly unused... and (I think) they look useful.
>>>>>>>
>>>>>>> This should not be seen as in any way advocating a return to war
>>>>>>> file deployment or "choose your container". All of what I would suggest
>>>>>>> should likely be compatible with a jetty startup controlled by our code 
>>>>>>> (I
>>>>>>> assume we can convince it to read web.xml properly when doing so).
>>>>>>> Certainly point out anything that would interfere with that.
>>>>>>>
>>>>>>> Here we sit in a container that is a tool box containing:
>>>>>>>
>>>>>>>    - A nice mechanism for initializing stuff at the application
>>>>>>>    level, (ServletContextListener) and a well defined guarantee that it 
>>>>>>> runs
>>>>>>>    before any filter or servlet,
>>>>>>>    - An automatic shutdown mechanism that it will call for us on
>>>>>>>    graceful shutdown (ServletContextListner again).
>>>>>>>    - A nice layered filter chain mechanism which could have been
>>>>>>>    used to layer in things like tracing and authentication, and close 
>>>>>>> shields
>>>>>>>    etc as small succinct filters rather than weaving them into an ever 
>>>>>>> more
>>>>>>>    complex filter & call class.
>>>>>>>    - In more recent versions of the spec, for listeners defined in
>>>>>>>    web.xml the order is also guaranteed.
>>>>>>>    - Servlet classes that are *already* set up to distinguish http
>>>>>>>    verbs automatically when desired
>>>>>>>
>>>>>>> So why is this better? Because monster classes that do a hundred
>>>>>>> things are really hard to understand and maintain. Small methods, small
>>>>>>> classes whenever possible. I also suspect that there may be some gains 
>>>>>>> in
>>>>>>> performance to be had if we rely more on the container (which will 
>>>>>>> already
>>>>>>> be dispatching based on path) to choose our code paths (at least at a
>>>>>>> coarse level) and then have less custom dispatch logic executed on 
>>>>>>> *every*
>>>>>>> request
>>>>>>>
>>>>>>> Obviously I'm wrong and if the net result is less performant to any
>>>>>>> significant degree forget it that wouldn't be worth it. (wanted:
>>>>>>> standardized solr benchmarks)
>>>>>>>
>>>>>>> There WILL be complications with v2 because it is a subclass of
>>>>>>> HttpSolrCall, which will take a bit of teasing apart for sure. Ideally 
>>>>>>> it
>>>>>>> should be a separate servlet from v1, but we don't want to duplicate 
>>>>>>> code
>>>>>>> either... so work to do there...
>>>>>>>
>>>>>>> I think an incremental approach is necessary since very few of us
>>>>>>> have the bandwidth to more than that and certainly it becomes difficult 
>>>>>>> to
>>>>>>> find anyone with the time to review large changes. What I have thus far 
>>>>>>> is
>>>>>>> stable with respect to the tests, and simplifies some stuff already 
>>>>>>> which
>>>>>>> is why I chose this point to start the discussion
>>>>>>>
>>>>>>> But yeah, look at what I did and say what you like and what you
>>>>>>> don't. :).
>>>>>>>
>>>>>>> -Gus
>>>>>>>
>>>>>>> On Mon, Aug 16, 2021 at 9:42 PM David Smiley <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sounds like an interesting adventure last weekend.
>>>>>>>>
>>>>>>>> I'm unclear what the point of going this direction is; my instinct
>>>>>>>> is to go the opposite direction.  You seem to suggest there are some
>>>>>>>> simplification/organization benefits, which I love, so I'll need to 
>>>>>>>> look at
>>>>>>>> what you've done to judge that for myself.  Yes Jetty supports the 
>>>>>>>> Servlet
>>>>>>>> spec but we need not embrace it.  Adhering to that is useful if you 
>>>>>>>> have a
>>>>>>>> generic web app that can be deployed to a container of the user's
>>>>>>>> convenience/choosing.  No doubt this is why Solr started this way, and 
>>>>>>>> why
>>>>>>>> the apps I built in my early days adhered to that spec.  But since 
>>>>>>>> 6.0, we
>>>>>>>> view Solr as self-contained and more supportable if the project makes 
>>>>>>>> these
>>>>>>>> decisions and thus not needlessly constrain itself as well.
>>>>>>>>
>>>>>>>> It is super weird to me that SolrDispatchFilter is a Servlet
>>>>>>>> *Filter* and not a *Servlet* itself.
>>>>>>>>
>>>>>>>> Also, I suspect there may be complications in changes here relating
>>>>>>>> to Solr's v1 vs v2 API.  And most definitely also what you discovered 
>>>>>>>> --
>>>>>>>> JettySolrRunner.
>>>>>>>>
>>>>>>>> ~ David Smiley
>>>>>>>> Apache Lucene/Solr Search Developer
>>>>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 16, 2021 at 11:37 AM Gus Heck <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> *TLDR:* I've got a working branch
>>>>>>>>> <https://github.com/gus-asf/solr/tree/servlet-solr> where
>>>>>>>>> CoreContainer & our startup process is extracted from SolrDispatch 
>>>>>>>>> Filter.
>>>>>>>>> Do other folks think that is interesting enough that I should make a 
>>>>>>>>> JIRA
>>>>>>>>> and/or PR with intention to make this change?
>>>>>>>>>
>>>>>>>>> *Details:*
>>>>>>>>>
>>>>>>>>> Jetty is a servlet container, yet we more or less ignore it by
>>>>>>>>> stuffing everything into a single filter (almost, admin UI is served
>>>>>>>>> separately). I'm sure there are lots of historical reasons for this,
>>>>>>>>> probably including servlet containers and their specs were much less 
>>>>>>>>> mature
>>>>>>>>> when solr was first started. Maybe also the early authors were more 
>>>>>>>>> focused
>>>>>>>>> on making search work than leveraging what the container could do for 
>>>>>>>>> them
>>>>>>>>> (but I wasn't there for that so that's just a guess).
>>>>>>>>>
>>>>>>>>> The result is that we have a couple of very large classes that are
>>>>>>>>> touched by almost every request, and they have a lot of conditional 
>>>>>>>>> logic
>>>>>>>>> trying to decide what the user is asking. Normally this sort of 
>>>>>>>>> dispatch is
>>>>>>>>> done by the container based on the request URL to direct it to the
>>>>>>>>> appropriate servlet. Solr does a LOT of different things so this code 
>>>>>>>>> is
>>>>>>>>> extremely tricky and complex to understand. Specifically, I'm 
>>>>>>>>> speaking of
>>>>>>>>> SolrDispatchFilter and HttpSolrCall, which are so inseparable that 
>>>>>>>>> being
>>>>>>>>> two classes probably makes them harder to understand.
>>>>>>>>>
>>>>>>>>> It seems to me (and this mail is asking if you agree) that these
>>>>>>>>> classes are long overdue for some subdivision. The most obvious thing 
>>>>>>>>> to
>>>>>>>>> pull out is all the admin calls. Admin code paths really have little 
>>>>>>>>> or
>>>>>>>>> nothing to do with query or update code paths since there are no 
>>>>>>>>> documents
>>>>>>>>> to route or sub-requests to some subset of nodes.
>>>>>>>>>
>>>>>>>>> The primary obstacle to any such separation and simplification is
>>>>>>>>> that most requests have some interaction with CoreContainer, and the 
>>>>>>>>> things
>>>>>>>>> it holds, and this is initialized and held by a field in
>>>>>>>>> SolrDispatchFilter. After spending a significant chunk of time 
>>>>>>>>> reading this
>>>>>>>>> code in the prior weeks and a timely and motivating conversation with 
>>>>>>>>> Eric
>>>>>>>>> Pugh, I dumped a chunk of my weekend into an experiment to see if I 
>>>>>>>>> could
>>>>>>>>> pull CoreContainer out of the dispatch filter, and leverage the 
>>>>>>>>> facilities
>>>>>>>>> of our servlet container.
>>>>>>>>>
>>>>>>>>> That wasn't too terribly hard, but keeping JettySolrRunner happy
>>>>>>>>> was very confusing, and worrisome since I've realized it's not 
>>>>>>>>> respecting
>>>>>>>>> our web.xml at all, and any configuration in web.xml needs to be 
>>>>>>>>> duplicated
>>>>>>>>> for our tests in JettySolrRunner (tangent alert)
>>>>>>>>>
>>>>>>>>> The result is that CoreContainer is now held by a class called
>>>>>>>>> CoreService (please help me name things if you don't like my names :) 
>>>>>>>>> ).
>>>>>>>>> CoreService is a ServletContextListener, appropriately configured in
>>>>>>>>> web.xml, and has a static method that can be used to get a reference 
>>>>>>>>> to the
>>>>>>>>> CoreContainer corresponding to the ServletContext in which code 
>>>>>>>>> wanting a
>>>>>>>>> core container is running (this supports having multiple 
>>>>>>>>> JettySolrRunners
>>>>>>>>> in tests, though probably never has more than one CoreContainer in the
>>>>>>>>> running application)
>>>>>>>>>
>>>>>>>>> I achieved this in 4 stages shown here:
>>>>>>>>> https://github.com/gus-asf/solr/tree/servlet-solr
>>>>>>>>>
>>>>>>>>> Ignore the AdminServlet class, it's a placeholder, and can be
>>>>>>>>> subtracted without harm.
>>>>>>>>>
>>>>>>>>> Since the current state of the code in that branch is apparently
>>>>>>>>> test-stable (4 runs of check in a row passing, none slower than any 
>>>>>>>>> run of
>>>>>>>>> 3 runs of main, both as low as 10.5 min if I don't continue working 
>>>>>>>>> on the
>>>>>>>>> machine)...
>>>>>>>>>
>>>>>>>>> Do we want to push this refactor in now to avoid making a huge
>>>>>>>>> ball of changes that gets harder and harder to merge? The next push 
>>>>>>>>> point
>>>>>>>>> would probably be when AdminServlet was functional (if/when that 
>>>>>>>>> happens)
>>>>>>>>> (and we could not push that class for now).
>>>>>>>>>
>>>>>>>>> If you read this far, thanks :) I wasn't sure how feasible this
>>>>>>>>> would be so I felt the need to prove it to my self in code before 
>>>>>>>>> wasting
>>>>>>>>> your time, but please don't hesitate to point to costs I might be 
>>>>>>>>> missing
>>>>>>>>> or anything that looks like a mistake, or say why this was a total 
>>>>>>>>> waste of
>>>>>>>>> time :)
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> -Gus
>>>>>>>>> --
>>>>>>>>> http://www.needhamsoftware.com (work)
>>>>>>>>> http://www.the111shift.com (play)
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> http://www.needhamsoftware.com (work)
>>>>>>> http://www.the111shift.com (play)
>>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://www.needhamsoftware.com (work)
>>>>> http://www.the111shift.com (play)
>>>>>
>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
-- 
- Mark

http://about.me/markrmiller

Re: Solr and Jetty and Servlets

Reply via email to