RE: Solr and Jetty and Servlets

Uwe Schindler Wed, 18 Aug 2021 03:00:53 -0700

Hi,


Thanks Mark, I fully agree with you on this: Booting up jetty with start.jar 
and using a web application is dramatically slower.

 

Nevertheless, I think that we should go away with the web application 
alltogether. We can still stay with the servlet API, but just – as in Gus patch 
– to split responsibilities and go away with a singleton ServletFilter, 
although from my own experience, I figured out that Servletapis’s URI handling 
is far away from ideal. In most cases at end I always landed in a separate 
“master” servlet (hooked to “/*”) that only parses the PathInfo with regular 
expressions (one thing that does not work with servlet api) and then delegates 
using RequestDispatcher to several more specific servlets or returns 404 not 
found. This setup would also help to split the huge DispatchFilter mess, but 
still the URI parsing logic would be centralized. But this servlet would only 
take care of the dispatching, so DispatchServlet would be fine. But it should 
only dispatch, nothing else.

 

My ideal Solr would be: Add some org.apache.solr.Bootstrap class to the main 
distribution and remove start.jar completely. Also remove jetty-webapp.jar and 
related dependencies. In Boostrap just start Jetty embedded, add listen ip, 
port numbers, etc. We can still use the config API of Jetty for this, so maybe 
put this into an XML file (although XML is a bad format for this stuff, goes 
back to web applications). Then Bootsrap starts the Corecontainer and the 
embedded Zookeeper (this is another horrible thing in Solr: the embedded 
zookeper starts somehow as part of the web application) and hooks the servlets 
into a (at first) single root ServletContextHandler. This is upside down. So 
Solr starts with a Bootstrap class, starts zookeeper, jetty and hands over to 
CoreContainer.

 

Some other ideas (not sure if they are already implemented in the patch): Maybe 
we should install a separate ServletContextHandler/Classloader for each Core. 
This would allow us to go away with our own classloader magic. Bootstrap could 
also do this: It starts the CoreContainer, Zookeeper, Jetty Installs a servlet 
for the admin UI and the Core/Collection management and then hands over to the 
CoreContainer to initialize all Cores. Each CoreContainer gets his own 
ServletContextHandler and can therefore easily be loaded unloaded from Jetty.

 

I wrote many other apps in my daily live like this, the startup time 
(especially for microservices) is great. A Jetty starting up in milliseconds 
and going to service is unbeatable.

 

Uwe

 

-----

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: [email protected]

 

From: Mark Miller <[email protected]> 
Sent: Wednesday, August 18, 2021 9:08 AM
To: Gus Heck <[email protected]>
Cc: [email protected]
Subject: Re: Solr and Jetty and Servlets

 

Agreement or not, I wouldn’t step in front of whatever you did.

 

I doubt anyone is going to take on actually pulling out of a web app because 
it’s easier to lean on it for configuration and I doubt start up speed is a 
very high priority for most.

 

It had never really occurred to me in the past that JettySolrRunner was so much 
faster. I just know it because of driving the test down to ridiculously times, 
starting up and stopping mini clusters of whatever size in literally no time. 
With no QuickStart setup for embedded Jetty.

 

So than starting up standalone Solr, it became apparent how much slower it was. 
So I started poking around. Cloud guys like those at google had already been 
feeling pain - preferring to be able to spin up and down relatively quickly - 
and so this QuickStart “precompile” web.xml / JSP’s thing how they were dealing 
with it. And it’s simple, but it’s still easily non comparable at the end of 
the day. Solr ships with like 10,000 classes across a ton of jars, even with 
QuickStart, you have to follow the spec of what to do and what might be done by 
any given webapp - though as JettySolrRunner shows, we don’t actually need any 
of it. QuickStart is trying to find every trick to speed up following the 
container and webapp spec. Injecting the dispatch filter just tosses it all out 
:)

 

Any, may point was not that I disagree and you should make Solr inject a 
dispatch filter - my point is just that it doesn’t really matter to dispatch. 
You can do just as much via Jetty either way.

 

And anyone that doesn’t agree about “something should be done” is either lying 
or not looking. So I wouldn’t worry about agreement. The problem is that it’s 
quicksand that people are avoiding, not that anyone that looks can’t see it a 
huge rats nest. Look at the V2 API. Someone actually wanted to work towards 
getting out of that hole. But at the end of the day, it’s now part of the hole. 
Made it deeper. Bermuda Triangle. But look at it - there was hope there. Hope 
that one day, dispatching would be better. Rather than twice as complicated. I 
don’t know that hope lives on. But no one gonna stop you on reasonable 
technical sense with any proper license.

 

Mark

 

 

 

On Tue, Aug 17, 2021 at 10:12 AM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

*ServletContextHandler not ServletContext 

 

On Tue, Aug 17, 2021 at 11:03 AM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

Ok nope, since that only happens with WebAppContext, not ServletContext (the 
parent class). whew :)

 

On Tue, Aug 17, 2021 at 10:33 AM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

WRT the JettySolrRunner tangent, I wonder if our lack of a web.xml and thus 
lack of metadata-complete=true in web.xml means that we wind up scanning for 
annotations? 

 

On Tue, Aug 17, 2021 at 10:06 AM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

https://issues.apache.org/jira/browse/SOLR-15590

 

On Tue, Aug 17, 2021 at 9:44 AM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

Ok well it would be interesting to compare quickstarting vs JettySolrRunner, 
and reading up on quickstart <https://webtide.com/jetty-9-quick-start/>  gives 
ideas about pre-building the quickstart xml, but web.xml for JettySolrRunner is 
a tangent. Either way we are still in a container and I think I hear some 
agreement that something should be done about the dispatch here, and both of 
you seem to agree that an actual servlet would make more sense than a filter, 
so I'll make a ticket/pr to make it easier to track & easier to read the code.

 

On Tue, Aug 17, 2021 at 1:41 AM Mark Miller <[email protected] 
<mailto:[email protected]> > wrote:

The downside to respecting web.xml and making JettySolrServer serve a webapp is 
that loading a webapp is very expensive and slow in comparison.  
JettySolrServer actually starts up extremely quickly. It’s almost more 
appealing to change the Server to use the JettySolrServer strategy. It’s so 
slow to load a webapp because of all the things it needs to support and scan 
jars for - in a kind of JavaEE situation, though not as bad. Jetty QuickStart 
does improve the situation if used at least. But there is really no need to eat 
the whole webapp standard for a non webapp app to have Jetty manage more of the 
dispatching. I always wondering about the motivation / upside of hacking in a 
straight servlet myself. But for a non webapp stuck in a servlet container, 
it’s actually a beautiful move that side steps a bunch of slow crap even better 
than the QuickStart pushes ever will, and still allows for matching any support 
we would want.

 

The HttpSolrCallV2 extending HttpSolrCall is horrendous. Personally, I made a 
slim SolrCall base class that each extends. All that logic is hairy enough to 
follow without them interleaving and sharing in a kind of wild collaboration. 
It’s pretty unfortunate they both still exist.

 

You are right, leaving that path in that state is a poor idea. If there was 
anything that could be considered the “hot” path, it’s right there - feverishly 
checking and ripping up streams for anyone and anything. Is a core? A 
collection? A handler? Maybe a V2 handler? What I’d we cut the path down? Do a 
dance? Treat the string like a date and see if that works.

 

So yeah, agreed, nobody does or would dispatch this way, you have to frog boil 
into it. It’s slow, almost incomprehensible, and incredibly good at holding 
onto life, even building on it. But being a webapp and parsing web.xml has 
little to do with dispatching and having sensible http api that doesn’t work 
like it’s a perl compiler.

 

Anyway, I can’t imagine trying to trace through that code with any solid 
feeling about having a handle on it via inspection. But just like Perl, if you 
debug / trace log the hell out of it, it is all actually pretty simple to 
adjust and reign in. 

 

 

MRM

 

 

 

On Mon, Aug 16, 2021 at 10:39 PM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

So I'm not interested in *adhering* to anything here, just using stuff where it 
helps... I see tools sitting on the shelf, already bought and paid for and 
carried with us to every job but then they sit in the back of the truck mostly 
unused... and (I think) they look useful. 

 

This should not be seen as in any way advocating a return to war file 
deployment or "choose your container". All of what I would suggest should 
likely be compatible with a jetty startup controlled by our code (I assume we 
can convince it to read web.xml properly when doing so). Certainly point out 
anything that would interfere with that.

 

Here we sit in a container that is a tool box containing:

*       A nice mechanism for initializing stuff at the application level, 
(ServletContextListener) and a well defined guarantee that it runs before any 
filter or servlet, 
*       An automatic shutdown mechanism that it will call for us on graceful 
shutdown (ServletContextListner again). 
*       A nice layered filter chain mechanism which could have been used to 
layer in things like tracing and authentication, and close shields etc as small 
succinct filters rather than weaving them into an ever more complex filter & 
call class.
*       In more recent versions of the spec, for listeners defined in web.xml 
the order is also guaranteed.
*       Servlet classes that are *already* set up to distinguish http verbs 
automatically when desired

So why is this better? Because monster classes that do a hundred things are 
really hard to understand and maintain. Small methods, small classes whenever 
possible. I also suspect that there may be some gains in performance to be had 
if we rely more on the container (which will already be dispatching based on 
path) to choose our code paths (at least at a coarse level) and then have less 
custom dispatch logic executed on *every* request 

 

Obviously I'm wrong and if the net result is less performant to any significant 
degree forget it that wouldn't be worth it. (wanted: standardized solr 
benchmarks)

 

There WILL be complications with v2 because it is a subclass of HttpSolrCall, 
which will take a bit of teasing apart for sure. Ideally it should be a 
separate servlet from v1, but we don't want to duplicate code either... so work 
to do there...  

 

I think an incremental approach is necessary since very few of us have the 
bandwidth to more than that and certainly it becomes difficult to find anyone 
with the time to review large changes. What I have thus far is stable with 
respect to the tests, and simplifies some stuff already which is why I chose 
this point to start the discussion

 

But yeah, look at what I did and say what you like and what you don't. :). 

 

-Gus

 

On Mon, Aug 16, 2021 at 9:42 PM David Smiley <[email protected] 
<mailto:[email protected]> > wrote:

Sounds like an interesting adventure last weekend.

 

I'm unclear what the point of going this direction is; my instinct is to go the 
opposite direction.  You seem to suggest there are some 
simplification/organization benefits, which I love, so I'll need to look at 
what you've done to judge that for myself.  Yes Jetty supports the Servlet spec 
but we need not embrace it.  Adhering to that is useful if you have a generic 
web app that can be deployed to a container of the user's convenience/choosing. 
 No doubt this is why Solr started this way, and why the apps I built in my 
early days adhered to that spec.  But since 6.0, we view Solr as self-contained 
and more supportable if the project makes these decisions and thus not 
needlessly constrain itself as well.

 

It is super weird to me that SolrDispatchFilter is a Servlet *Filter* and not a 
*Servlet* itself.

 

Also, I suspect there may be complications in changes here relating to Solr's 
v1 vs v2 API.  And most definitely also what you discovered -- JettySolrRunner.




~ David Smiley

Apache Lucene/Solr Search Developer

http://www.linkedin.com/in/davidwsmiley

 

 

On Mon, Aug 16, 2021 at 11:37 AM Gus Heck <[email protected] 
<mailto:[email protected]> > wrote:

TLDR: I've got a working branch 
<https://github.com/gus-asf/solr/tree/servlet-solr>  where CoreContainer & our 
startup process is extracted from SolrDispatch Filter. Do other folks think 
that is interesting enough that I should make a JIRA and/or PR with intention 
to make this change?

 

Details:

 

Jetty is a servlet container, yet we more or less ignore it by stuffing 
everything into a single filter (almost, admin UI is served separately). I'm 
sure there are lots of historical reasons for this, probably including servlet 
containers and their specs were much less mature when solr was first started. 
Maybe also the early authors were more focused on making search work than 
leveraging what the container could do for them (but I wasn't there for that so 
that's just a guess).

 

The result is that we have a couple of very large classes that are touched by 
almost every request, and they have a lot of conditional logic trying to decide 
what the user is asking. Normally this sort of dispatch is done by the 
container based on the request URL to direct it to the appropriate servlet. 
Solr does a LOT of different things so this code is extremely tricky and 
complex to understand. Specifically, I'm speaking of SolrDispatchFilter and 
HttpSolrCall, which are so inseparable that being two classes probably makes 
them harder to understand.

 

It seems to me (and this mail is asking if you agree) that these classes are 
long overdue for some subdivision. The most obvious thing to pull out is all 
the admin calls. Admin code paths really have little or nothing to do with 
query or update code paths since there are no documents to route or 
sub-requests to some subset of nodes. 

 

The primary obstacle to any such separation and simplification is that most 
requests have some interaction with CoreContainer, and the things it holds, and 
this is initialized and held by a field in SolrDispatchFilter. After spending a 
significant chunk of time reading this code in the prior weeks and a timely and 
motivating conversation with Eric Pugh, I dumped a chunk of my weekend into an 
experiment to see if I could pull CoreContainer out of the dispatch filter, and 
leverage the facilities of our servlet container. 

 

That wasn't too terribly hard, but keeping JettySolrRunner happy was very 
confusing, and worrisome since I've realized it's not respecting our web.xml at 
all, and any configuration in web.xml needs to be duplicated for our tests in 
JettySolrRunner (tangent alert)

 

The result is that CoreContainer is now held by a class called CoreService 
(please help me name things if you don't like my names :) ). CoreService is a 
ServletContextListener, appropriately configured in web.xml, and has a static 
method that can be used to get a reference to the CoreContainer corresponding 
to the ServletContext in which code wanting a core container is running (this 
supports having multiple JettySolrRunners in tests, though probably never has 
more than one CoreContainer in the running application)

 

I achieved this in 4 stages shown here: 
https://github.com/gus-asf/solr/tree/servlet-solr

 

Ignore the AdminServlet class, it's a placeholder, and can be subtracted 
without harm.

 

Since the current state of the code in that branch is apparently test-stable (4 
runs of check in a row passing, none slower than any run of 3 runs of main, 
both as low as 10.5 min if I don't continue working on the machine)...

 

Do we want to push this refactor in now to avoid making a huge ball of changes 
that gets harder and harder to merge? The next push point would probably be 
when AdminServlet was functional (if/when that happens) (and we could not push 
that class for now).

 

If you read this far, thanks :) I wasn't sure how feasible this would be so I 
felt the need to prove it to my self in code before wasting your time, but 
please don't hesitate to point to costs I might be missing or anything that 
looks like a mistake, or say why this was a total waste of time :)

 

Thoughts?

 

-Gus

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)




 

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)

-- 

- Mark

 

http://about.me/markrmiller




 

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)




 

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)




 

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)




 

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)




 

-- 

http://www.needhamsoftware.com (work)

http://www.the111shift.com (play)

-- 

- Mark

 

http://about.me/markrmiller

RE: Solr and Jetty and Servlets

Reply via email to