Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > > 3) there's a comment in RequestHandlerBase.init about "indexOf" that
: > > comes from the existing impl in DismaxRequestHandler -- but doesn't match
: > > the new code ... i also wasn't certain that the change you made matches
: > I just copied the code from DismaxRequestHandler and made sure it
: > passes the tests. I don't totally understand what that case is doing.
:
: The first iteration of dismax (before we did generic defaults,
: invariants, etc for request handlers) took defaults directly from the
: init params, and that is what that case is checking for

and bingo .. the reason it jumped out at me in your patch is that the comment still referred to indexOf, but the code didn't ... it might be functionally equivalent, i just wasn't sure when i did my quick read.

there's mention in the comment that indexOf is used so that you can indicate that you don't want all the init params as defaults, but you don't actually want defaults either -- but there doesn't seem to be a test for that case.

you can see support for the legacy defaults syntax in src/test/test-files/solr/conf/solrconfig.xml if you grep for dismaxOldStyleDefaults

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > throw new SolrException( 400, "missing parameter: "+p );
: >
: > This will return 400 with a message "missing parameter: " + p.
: >
: > Exceptions or SolrExceptions with code=500 || code<100 are sent to
: > client with status code 500 and a full stack trace.
:
: That all seems ideal to me, but there had been talk in the past about
: formatted responses on errors. Given that even update handlers can
: return full responses, I don't see the point of formatted (XML,etc)
: response bodies when an exception is thrown.

I can't find the thread at the moment, but as I recall, there was once some consensus that while errors should definitely be returned with appropriate HTTP status codes, and the exception message should be included in the status line, the QueryResponseWriter should be given an opportunity to format the Exception -- the rationale being that all clients should check the HTTP status code, and if it's not 2xx, then they should use the status message for simple error reporting, but if they want more details they can check the Content-Type of the response and if it matches what they were expecting, they can get the detailed error info from it.

So if you are writing a python client and expecting python back, the stack trace will be formatted in python so you can easily parse it ... if you are expecting XML back, the stack trace will be formatted in XML, etc...

i think the only time the dispatcher should return an html (or plain text) error page is if it encounters an exception before it can extract the writer to use from the request params, or if the exception is in the ResponseWriter itself.

This would be one reason to leave getException() in the SolrQueryResponse interface ... it lets us keep the API the same for ResponseWriters (no need to add a new writeErrorPage(Exception) method) ...

another advantage to keeping that encapsulation is it gives the ResponseWriters the ability to generate pages which contain the partial results from the RequestHandler (prior to encountering an exception) as well as the Exception itself.

-Hoss
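The flow Hoss describes can be sketched in plain Java. This is a hedged illustration, not the actual Solr dispatcher: the class names (ErrorDispatchSketch), the stand-in QueryResponse/ResponseWriter types, and the xmlish writer are all invented; the real API uses SolrQueryResponse and QueryResponseWriter with different signatures. The point is only the shape of the proposal -- status code and message go on the HTTP status line, the writer still formats the exception (and any partial results), and plain-text output is a last resort when the writer itself fails.

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorDispatchSketch {

    // Stand-ins for SolrQueryResponse / QueryResponseWriter (names invented).
    static class QueryResponse {
        Exception exception;                       // kept via getException(), per Hoss
        List<String> partialResults = new ArrayList<>();
    }

    interface ResponseWriter {
        String write(QueryResponse rsp);           // may render results AND the exception
    }

    // A writer that emits partial results plus the error in its own format.
    static ResponseWriter xmlish = rsp -> {
        StringBuilder sb = new StringBuilder("<response>");
        for (String r : rsp.partialResults) sb.append("<doc>").append(r).append("</doc>");
        if (rsp.exception != null)
            sb.append("<error>").append(rsp.exception.getMessage()).append("</error>");
        return sb.append("</response>").toString();
    };

    // Dispatcher: HTTP status carries the error code; the body stays writer-formatted.
    static String dispatch(QueryResponse rsp, ResponseWriter writer, int[] statusOut) {
        statusOut[0] = (rsp.exception == null) ? 200 : 400;
        try {
            return writer.write(rsp);
        } catch (Exception inWriter) {
            // Only fall back to plain text if the ResponseWriter itself blows up.
            statusOut[0] = 500;
            return "error: " + inWriter.getMessage();
        }
    }

    public static void main(String[] args) {
        QueryResponse rsp = new QueryResponse();
        rsp.partialResults.add("doc1");
        rsp.exception = new Exception("missing parameter: q");
        int[] status = new int[1];
        System.out.println(dispatch(rsp, xmlish, status) + "  status=" + status[0]);
    }
}
```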
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Jan 21, 2007, at 2:39 PM, Yonik Seeley wrote: On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > So is everyone happy with the way that errors are currently reported? > If not, now (or right after this is committed), is the time to change > that. /solr/select/qt="myhandler" should be backward compatible, but > /solr/myhandler doesn't need to be. Same for the update stuff. > In SOLR-104, all exceptions are passed to the client as HTTP Status codes with the message. If you write: throw new SolrException( 400, "missing parameter: "+p ); This will return 400 with a message "missing parameter: " + p. Exceptions or SolrExceptions with code=500 || code<100 are sent to client with status code 500 and a full stack trace. That all seems ideal to me, but there had been talk in the past about formatted responses on errors. Given that even update handlers can return full responses, I don't see the point of formatted (XML,etc) response bodies when an exception is thrown. Just making sure there's a consensus. Being able to check the HTTP status code to determine if there is an error, rather than having to parse XML and get a Solr-specific status code seems best for the Ruby work we're doing. I'll confer with the others working on it and report back if they have any suggestions for improvement also. Erik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > I don't think i'll have time to look at your new patch today, design wise
> > i think you are right, but there was still stuff that needed to be
> > refactored out of core.update and into the UpdateHandler wasn't there?
>
> Yes, I avoided doing that in an effort to minimize refactoring and focus
> just on adding ContentStreams to RequestHandlers.

Sounds like a good idea. It's easier to review and process in smaller steps if practical.

> I just posted (yet another) update to SOLR-104. This one moves the
> core.update logic into UpdateRequestHandler, and adds some glue to make
> old requests behave as they used to.

Cool!

> I also deprecated the exception in SolrQueryResponse. Handlers should
> throw the exception, not put it in the response. (If you want error
> messages, put that in the response, not the exception)

Agreed. I can't for the life of me remember *why* I did that. I think it was because I thought ResponseHandlers might format the exception.

> > 3) there's a comment in RequestHandlerBase.init about "indexOf" that
> > comes from the existing impl in DismaxRequestHandler -- but doesn't match
> > the new code ... i also wasn't certain that the change you made matches
> > the old semantics for dismax (i don't think we have a unit test for that
> > case)
>
> When you get a chance to look at the patch, can you investigate this?
> I just copied the code from DismaxRequestHandler and made sure it
> passes the tests. I don't totally understand what that case is doing.

The first iteration of dismax (before we did generic defaults, invariants, etc for request handlers) took defaults directly from the init params, and that is what that case is checking for and replicating: if there isn't a "defaults" in the list, it assumes the entire list is defaults. It's only needed for dismax since other handlers didn't support "defaults" until later.

-Yonik
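The legacy-defaults check Yonik explains can be sketched as a small pure function. This is a hedged illustration only: the real code lives in RequestHandlerBase.init and operates on Solr's NamedList, not java.util.Map, and the class/method names here (LegacyDefaultsSketch, resolveDefaults) are invented.

```java
import java.util.HashMap;
import java.util.Map;

public class LegacyDefaultsSketch {
    // If the init params contain an explicit "defaults" block, use it (the new
    // style); otherwise treat the entire init param list as the defaults, which
    // is the old dismax style being preserved for back compatibility.
    @SuppressWarnings("unchecked")
    static Map<String, String> resolveDefaults(Map<String, Object> initArgs) {
        Object d = initArgs.get("defaults");
        if (d instanceof Map) {
            return (Map<String, String>) d;          // new style: explicit block
        }
        Map<String, String> all = new HashMap<>();   // old style: everything is a default
        for (Map.Entry<String, Object> e : initArgs.entrySet()) {
            all.put(e.getKey(), String.valueOf(e.getValue()));
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, Object> oldStyle = new HashMap<>();
        oldStyle.put("qf", "text^2");
        System.out.println(resolveDefaults(oldStyle));  // whole list used as defaults
    }
}
```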
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> I don't think i'll have time to look at your new patch today, design wise
> i think you are right, but there was still stuff that needed to be
> refactored out of core.update and into the UpdateHandler wasn't there?

Yes, I avoided doing that in an effort to minimize refactoring and focus just on adding ContentStreams to RequestHandlers.

I just posted (yet another) update to SOLR-104. This one moves the core.update logic into UpdateRequestHandler, and adds some glue to make old requests behave as they used to.

I also deprecated the exception in SolrQueryResponse. Handlers should throw the exception, not put it in the response. (If you want error messages, put that in the response, not the exception)

It still needs some cleanup and some idea what data/messages should be returned in the SolrResponse. The bottom of http://localhost:8983/solr/test.html has a form calling /update2 with posted XML so you can see the output

> a couple of minor comments i had when i read the last patch (but didn't
> mention since i was focusing on design issues) ...
>
> 1) why rename the servlets "Legacy*" instead of just marking them
> deprecated?

In the new version, I got rid of both Servlets and am handling the 'legacy' cases explicitly in the dispatch filter. This minimizes the duplicated code and keeps things consistent.

> 2) getSourceId and getSource need to be left in the concrete Handlers so
> they get filled in with the correct file version info on checkout.

done.

> 3) there's a comment in RequestHandlerBase.init about "indexOf" that
> comes from the existing impl in DismaxRequestHandler -- but doesn't match
> the new code ... i also wasn't certain that the change you made matches
> the old semantics for dismax (i don't think we have a unit test for that
> case)

When you get a chance to look at the patch, can you investigate this? I just copied the code from DismaxRequestHandler and made sure it passes the tests. I don't totally understand what that case is doing.

> 4) ContentStream.getFieldName() would probably be more general as
> ContentStream.getSourceInfo() ...

done.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > So is everyone happy with the way that errors are currently reported? > If not, now (or right after this is committed), is the time to change > that. /solr/select/qt="myhandler" should be backward compatible, but > /solr/myhandler doesn't need to be. Same for the update stuff. > In SOLR-104, all exceptions are passed to the client as HTTP Status codes with the message. If you write: throw new SolrException( 400, "missing parameter: "+p ); This will return 400 with a message "missing parameter: " + p. Exceptions or SolrExceptions with code=500 || code<100 are sent to client with status code 500 and a full stack trace. That all seems ideal to me, but there had been talk in the past about formatted responses on errors. Given that even update handlers can return full responses, I don't see the point of formatted (XML,etc) response bodies when an exception is thrown. Just making sure there's a consensus. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
So is everyone happy with the way that errors are currently reported? If not, now (or right after this is committed), is the time to change that. /solr/select/qt="myhandler" should be backward compatible, but /solr/myhandler doesn't need to be. Same for the update stuff. In SOLR-104, all exceptions are passed to the client as HTTP Status codes with the message. If you write: throw new SolrException( 400, "missing parameter: "+p ); This will return 400 with a message "missing parameter: " + p. Exceptions or SolrExceptions with code=500 || code<100 are sent to client with status code 500 and a full stack trace.
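The status-code rule above reduces to a pure function, sketched here for clarity. Hedged: the class and method names (ErrorStatusSketch, httpStatus, includeStackTrace) are invented for illustration and are not the actual SOLR-104 code; only the rule itself -- a valid code is passed through, while code == 500 or code < 100 falls back to a 500 with a full stack trace -- comes from the message above.

```java
public class ErrorStatusSketch {
    // Map a SolrException code to the HTTP status actually sent to the client.
    static int httpStatus(int solrCode) {
        return (solrCode == 500 || solrCode < 100) ? 500 : solrCode;
    }

    // Per the rule above, only 500-class responses carry a full stack trace;
    // other codes put the exception message in the status line.
    static boolean includeStackTrace(int solrCode) {
        return httpStatus(solrCode) == 500;
    }

    public static void main(String[] args) {
        System.out.println(httpStatus(400));  // message in the status line
        System.out.println(httpStatus(50));   // invalid as HTTP -> 500 + stack trace
    }
}
```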
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > The bugaboo is if the POST data is NOT in fact
: > application/x-www-form-urlencoded but the user agent says it is -- as
: > both of you have indicated can be the case when using curl. Could that
: > be why Yonik thought POST params was broken?
:
: Correct. That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the stream "guessing" code in the Dispatcher/RequestBuilder very strict, and make its decision about how to treat the post body entirely based on the Content-Type ... meanwhile the existing (eventually known as "old") way of doing updates via "/update" to the UpdateServlet can be more lax, and assume everything is a raw POST of XML.

we can change post.sh to specify XML as the Content-Type by default, modify the example schema to have other update handlers registered with names like "/update/csv" and eventually add an "/update/xml" encouraging people to use it if they want to send updates as xml documents, regardless of whether they want to POST them raw, upload them, or identify them by filename -- as long as they are explicit about their content type.

I think I agree with all that. A long time ago in this thread, I remember saying that new URLs are an opportunity to change request/response formats w/o worrying about backward compatibility.

So is everyone happy with the way that errors are currently reported? If not, now (or right after this is committed), is the time to change that. /solr/select/qt="myhandler" should be backward compatible, but /solr/myhandler doesn't need to be. Same for the update stuff.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > The bugaboo is if the POST data is NOT in fact
: > application/x-www-form-urlencoded but the user agent says it is -- as
: > both of you have indicated can be the case when using curl. Could that
: > be why Yonik thought POST params was broken?
:
: Correct. That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the stream "guessing" code in the Dispatcher/RequestBuilder very strict, and make its decision about how to treat the post body entirely based on the Content-Type ... meanwhile the existing (eventually known as "old") way of doing updates via "/update" to the UpdateServlet can be more lax, and assume everything is a raw POST of XML.

we can change post.sh to specify XML as the Content-Type by default, modify the example schema to have other update handlers registered with names like "/update/csv" and eventually add an "/update/xml" encouraging people to use it if they want to send updates as xml documents, regardless of whether they want to POST them raw, upload them, or identify them by filename -- as long as they are explicit about their content type.

-Hoss
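The strict-vs-lax split Hoss proposes can be sketched as two pure functions. Hedged illustration only: the enum, class, and method names (StreamGuessSketch, BodyHandling, forNewDispatcher, forLegacyUpdate) are invented, and the real dispatcher logic would live in the Dispatcher/RequestBuilder; the sketch just shows "decide entirely from Content-Type" versus "always treat the body as a raw XML stream".

```java
public class StreamGuessSketch {
    enum BodyHandling { URLENCODED_PARAMS, MULTIPART_STREAMS, RAW_STREAM, REJECT }

    // New dispatcher: strict -- trust the declared Content-Type, never guess.
    static BodyHandling forNewDispatcher(String contentType) {
        if (contentType == null) return BodyHandling.REJECT;
        String ct = contentType.toLowerCase();
        if (ct.startsWith("application/x-www-form-urlencoded")) return BodyHandling.URLENCODED_PARAMS;
        if (ct.startsWith("multipart/form-data")) return BodyHandling.MULTIPART_STREAMS;
        return BodyHandling.RAW_STREAM;   // e.g. text/xml posted by a fixed post.sh
    }

    // Legacy /update servlet: lax -- ignore the header, assume a raw POST of XML.
    static BodyHandling forLegacyUpdate(String contentType) {
        return BodyHandling.RAW_STREAM;
    }

    public static void main(String[] args) {
        System.out.println(forNewDispatcher("text/xml; charset=utf-8"));
        System.out.println(forLegacyUpdate("application/x-www-form-urlencoded"));
    }
}
```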
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: Great! I just posted an update to SOLR-104 that I hope will make you happy. Dude ... i can *not* keep up with you. : If i'm following our discussion correctly, I *think* this takes care : of all the major issues we have. I don't think i'll have time to look at your new patch today, design wise i think you are right, but there was still stuff that needed to be refactored out of core.update and into the UpdateHandler wasn't there? a couple of minor comments i had when i read the last patch (but didn't mention since i was focusing on design issues) ... 1) why rename the servlets "Legacy*" instead of just marking them deprecated? 2) getSourceId and getSoure need to be left in the concrete Handlers so they get illed in with the correct file version info on checkout. 3) there's a comment in RequestHandlerBase.init about "indexOf" that comes form the existing impl in DismaxRequestHandler -- but doesn't match the new code ... i also wasn't certain that the change you made matches the old semantics for dismax (i don't think we have a unit test for that case) 4) ContentStream.getFieldName() would proabably be more general as ContentStream.getSourceInfo() ... it could stay as it is for files/urls, but raw posts and multipart posts could have a usefull debuging description as well. -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> At the bottom of this email is a quick and dirty servlet i just tried
> to prove to myself that posting with params in the URL and the body
> worked fine ...

I tried that by simply posting to the Solr standard request handler (it echoes params in the example config), and yes, it worked fine. The problem is if the body should be the stream, and the content-type is wrong (and we currently send it wrong with curl).

> The nut shell being: i'm totally on board with Ryan's simple URL scheme,
> having a single RequestParser/SolrRequestBuilder, going with an entirely
> "inspection" based approach for deciding where the streams come from, and
> leaving all mention of parsers or "stream.type" out of the URL. (because
> i have a good idea of how to support it in a backwards compatible way
> *later*)

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, J.J. Larrea <[EMAIL PROTECTED]> wrote: The bugaboo is if the POST data is NOT in fact application/x-www-form-urlencoded but the user agent says it is -- as both of you have indicated can be the case when using curl. Could that be why Yonik thought POST params was broken? Correct. That's the format that post.sh in the example sends (application/x-www-form-urlencoded) and we ignore it in the update handler and always treat the body as binary. Now if you wanted to add some query args to what we already have, you can't use getParameterMap(). -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> The nut shell being: i'm totally on board with Ryan's simple URL scheme,
> having a single RequestParser/SolrRequestBuilder, going with an entirely
> "inspection" based approach for deciding where the streams come from, and
> leaving all mention of parsers or "stream.type" out of the URL. (because
> i have a good idea of how to support it in a backwards compatible way
> *later*)

Great! I just posted an update to SOLR-104 that I hope will make you happy.

It moved the various request parsing methods into distinct classes that could easily be pluggable if that is necessary. As written, it supports stream.type="raw|multipart|simple|standard" -- we can comment that out and use 'standard' for everything as a first pass. I added configuration to solrconfig.xml:

I removed LegacySelectServlet and added an explicit check in the DispatchFilter for paths starting with "/select" This seems like a better idea as the logic and expected results are identical.

If i'm following our discussion correctly, I *think* this takes care of all the major issues we have.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
At 1:20 AM -0800 1/21/07, Chris Hostetter wrote:
>: We need code to do that anyway since getParameterMap() doesn't support
>: getting params from the URL if it's a POST (I believe I tried this in
>: the past and it didn't work).
>
>Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you
>are *definitely* mistaken.
>
>getParameterMap will in fact pull out params from both the URL and the
>body if it's a POST -- but only if you have not already accessed either
>getReader or getInputStream -- this was at the heart of my cumbersome
>preProcess/process API that we all agree now was way too complicated.

The rules are very explicitly laid out in the Servlet 2.4 specification:

- SRV.4.1.1 When Parameters Are Available

The following are the conditions that must be met before post form data will be populated to the parameter set:

1. The request is an HTTP or HTTPS request.
2. The HTTP method is POST.
3. The content type is application/x-www-form-urlencoded.
4. The servlet has made an initial call of any of the getParameter family of methods on the request object.

If the conditions are not met and the post form data is not included in the parameter set, the post data must still be available to the servlet via the request object's input stream. If the conditions are met, post form data will no longer be available for reading directly from the request object's input stream.

As Hoss notes, a POST request can still have GET-style parameters in the URL query string, and getParameterMap will return both sets intermixed for a POST meeting the above conditions. And calling getParameterMap won't impede the ability to subsequently read the input stream if the conditions are not met: "the post data must still be available to the servlet".

So it's theoretically valid to simply call getParameterMap and then blindly call getInputStream (possibly catching an Exception), or else use the results of getParameterMap to decide whether and how to process the input stream.
The bugaboo is if the POST data is NOT in fact application/x-www-form-urlencoded but the user agent says it is -- as both of you have indicated can be the case when using curl. Could that be why Yonik thought POST params was broken? - J.J.
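The first three SRV.4.1.1 conditions quoted above are a pure predicate, sketched here. Hedged: the class and method names (PostParamsSketch, postBodyJoinsParams) are invented, and condition 4 -- that the servlet actually calls one of the getParameter methods first -- is about call ordering and is deliberately not modeled.

```java
public class PostParamsSketch {
    // True when the container will merge POST form data into the parameter set
    // (conditions 1-3 of SRV.4.1.1), consuming the body in the process.
    static boolean postBodyJoinsParams(String scheme, String method, String contentType) {
        boolean httpish = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        boolean isPost = "POST".equalsIgnoreCase(method);
        boolean formEncoded = contentType != null
                && contentType.toLowerCase().startsWith("application/x-www-form-urlencoded");
        return httpish && isPost && formEncoded;
    }

    public static void main(String[] args) {
        // The curl bugaboo: XML body, but the header claims urlencoded -- the
        // container treats the body as parameters and the raw stream is lost.
        System.out.println(postBodyJoinsParams("http", "POST", "application/x-www-form-urlencoded"));
        System.out.println(postBodyJoinsParams("http", "POST", "text/xml"));
    }
}
```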
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > ...i was trying to avoid keeping the parser name out of the query string,
: > so we don't have to do any hack parsing of
: > HttpServletRequest.getQueryString() to get it.
:
: We need code to do that anyway since getParameterMap() doesn't support
: getting params from the URL if it's a POST (I believe I tried this in
: the past and it didn't work).

Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you are *definitely* mistaken.

getParameterMap will in fact pull out params from both the URL and the body if it's a POST -- but only if you have not already accessed either getReader or getInputStream -- this was at the heart of my cumbersome preProcess/process API that we all agree now was way too complicated.

At the bottom of this email is a quick and dirty servlet i just tried to prove to myself that posting with params in the URL and the body worked fine ... i do remember reading up on this a few years back and verifying that it's documented somewhere in the servlet spec, a quick google search points to this article implying it was solidified in 2.2...

http://java.sun.com/developer/technicalArticles/Servlets/servletapi/
(grep for "Nit-picky on Parameters")

: Pluggable request parsers seems needlessly complex, and it gets harder
: to explain it all to someone new.
: Can't we start simple and defer anything like that until there is a real need?

Alas ... i appear to be getting worse at explaining myself in my old age. What i was trying to say is that this idea i had for expressing requestParsers as an optional prefix in front of the requestHandler would allow us to worry about the things i'm worried about *later* -- if/when they become a problem (or when i have time to stop whining, and actually write the code)

The nut shell being: i'm totally on board with Ryan's simple URL scheme, having a single RequestParser/SolrRequestBuilder, going with an entirely "inspection" based approach for deciding where the streams come from, and leaving all mention of parsers or "stream.type" out of the URL. (because i have a good idea of how to support it in a backwards compatible way *later*)

import java.io.IOException;
import java.util.Map;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class TestServlet extends HttpServlet {
  public void doPost(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    response.setContentType("text/plain");
    Map params = request.getParameterMap();
    for (Object k : params.keySet()) {
      Object v = params.get(k);
      if (v instanceof Object[]) {
        for (Object vv : (Object[]) v) {
          response.getWriter().println(k.toString() + ":" + vv);
        }
      } else {
        response.getWriter().println(k.toString() + ":" + v);
      }
    }
  }
}
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: I'm on board as long as the URL structure is:
: ${path/from/solr/config}?stream.type=raw

actually the URL i was suggesting was...

${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val

...i was trying to avoid keeping the parser name out of the query string, so we don't have to do any hack parsing of HttpServletRequest.getQueryString() to get it.

We need code to do that anyway since getParameterMap() doesn't support getting params from the URL if it's a POST (I believe I tried this in the past and it didn't work). Aesthetically, having an optional parser in the query string seems nicer than in the path.

basically if you have this...

Pluggable request parsers seems needlessly complex, and it gets harder to explain it all to someone new. Can't we start simple and defer anything like that until there is a real need?

if they really had a reason to want to force one type of parsing, they could register it with a different prefix.

That is a point. I'm not sure of the use cases though... it's not safe to let untrusted people update solr at all, so I don't understand prohibiting certain types of streams.

* default URLs stay clean
* no need for an extra "stream.type" param
* urls only get ugly if people want them to get ugly because they don't want to make their clients set the mime type correctly.

The first and last points are also true for a stream.type type of thing. After all, we will need other parameters for specifying local files, right? Or is opening local files up to the RequestHandler again?

Anyway, I'm not too unhappy either way, as long as I can leave out any explicit "parser" and just get the right thing to happen.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Sat, 20 Jan 2007, Ryan McKinley wrote:
: Date: Sat, 20 Jan 2007 19:17:16 -0800
: From: Ryan McKinley <[EMAIL PROTECTED]>
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Re: Update Plugins (was Re: Handling disparate data sources in
:     Solr)
:
: > ...what if we bring that idea back, and let people configure it in the
: > solrconfig.xml, using path like names...
: >
: > ...but don't make it a *public* interface ... make it package protected,
: > or maybe even a private static interface of the Dispatch Filter .. either
: > way, don't instantiate instances of it using the plugin-lib ClassLoader,
: > make sure it comes from the WAR so we only use the ones provided out of
: > the box.
:
: I'm on board as long as the URL structure is:
: ${path/from/solr/config}?stream.type=raw

actually the URL i was suggesting was...

${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val

...i was trying to avoid keeping the parser name out of the query string, so we don't have to do any hack parsing of HttpServletRequest.getQueryString() to get it.

basically if you have this...

...then these urls are all valid...

http://localhost:/solr/raw/update?param=val
    ..uses raw post body for update
http://localhost:/solr/multi/update?param=val
    ..uses multipart mime for update
http://localhost:/solr/update?param=val
    ..no requestParser matched the path prefix, so the default is chosen and Content-Type is used to decide where streams come from.

but if instead my config looks like this...

...then these URLs would fail...

http://localhost:/solr/raw/update?param=val
http://localhost:/solr/multi/update?param=val

...because the empty string would match as a parser, but "/raw/update" and "/multi/update" wouldn't match as requestHandlers (the registration of "/raw" as a parser would be useless)

this URL would work however...

http://localhost:/solr/update?param=val
    ..treat all requests as if they have multi-part mime streams

...i use this only as an example of what i'm describing ... not as an example of something we should recommend.

The key to all of this being that we'd check parser names against the URL prefix in order from shortest to longest, then check the rest of the path as a requestHandler ... if either of those fail, then the filter would skip the request.

What we would probably recommend is that people map the "guess" request parser to "/" so that they could put in all of the options they want on buffer sizes and such, then map their requestHandlers without a "/" prefix, and use content types correctly. if they really had a reason to want to force one type of parsing, they could register it with a different prefix.

* default URLs stay clean
* no need for an extra "stream.type" param
* urls only get ugly if people want them to get ugly because they don't want to make their clients set the mime type correctly.

-Hoss
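The shortest-to-longest prefix matching Hoss describes can be sketched as a pure function. Hedged: the class, method, and data structures here (ParserPrefixSketch, split, a List of parser names and a Set of handler names) are invented for illustration; the real dispatch filter would work against the solrconfig.xml registrations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ParserPrefixSketch {
    // Try each registered parser name as a URL prefix, shortest first; the rest
    // of the path must name a requestHandler. If no (parser, handler) pair
    // matches, return null and the filter skips the request.
    static String[] split(String path, List<String> parserNames, Set<String> handlers) {
        List<String> sorted = new ArrayList<>(parserNames);
        sorted.sort((a, b) -> Integer.compare(a.length(), b.length()));
        for (String parser : sorted) {
            if (path.startsWith(parser)) {
                String rest = path.substring(parser.length());
                if (handlers.contains(rest)) {
                    return new String[] { parser, rest };
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> parsers = Arrays.asList("", "/raw", "/multi");
        Set<String> handlers = new HashSet<>(Arrays.asList("/update", "/select"));
        System.out.println(Arrays.toString(split("/raw/update", parsers, handlers)));
        System.out.println(Arrays.toString(split("/update", parsers, handlers)));
    }
}
```

Note the empty-string parser: it matches every path first, so "/update" resolves to the default parser plus the "/update" handler, while "/raw/update" falls through to the explicit "/raw" registration -- matching Hoss's examples above.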
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> ...what if we bring that idea back, and let people configure it in the
> solrconfig.xml, using path like names...
>
> ...but don't make it a *public* interface ... make it package protected,
> or maybe even a private static interface of the Dispatch Filter .. either
> way, don't instantiate instances of it using the plugin-lib ClassLoader,
> make sure it comes from the WAR so we only use the ones provided out of
> the box.

I'm on board as long as the URL structure is:

${path/from/solr/config}?stream.type=raw

and if you are missing the parameter it chooses a good option. (stream.type can change, just that the parser is configured in the query string, not the path)

I like it! Also, this would give us a natural place to configure the max size etc for multi-part upload
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
(the three of us are online way too much ... for crying out loud it's a saturday night folks!)

: In my opinion, I don't think we need to worry about it for the
: *default* handler. That is not a very difficult constraint and, there
: is no one out there expecting to be able to post parameters in the URL
: and the body. I'm not sure it is worth complicating anything if this
: is the only thing we are trying to avoid.

you'd be surprised the number of people i've run into who expect that to work.

: I think the *default* should handle all the cases mentioned without
: the client worrying about different URLs for the various methods.
:
: The next question is which (if any) of the explicit parsers you think
: are worth including in web.xml?

holy crap, i think i have a solution that will make all of us really happy...

remember that idea we all really detested of a public plugin interface, configured in the solrconfig.xml, that looked like this...

public interface RequestParser {
  SolrRequest parse(HttpServletRequest req);
}

...what if we bring that idea back, and let people configure it in the solrconfig.xml, using path like names...

...but don't make it a *public* interface ... make it package protected, or maybe even a private static interface of the Dispatch Filter .. either way, don't instantiate instances of it using the plugin-lib ClassLoader, make sure it comes from the WAR so we only use the ones provided out of the box.

then make the dispatcher check each URL first by seeing if it starts with the name of any registered requestParser ... if it doesn't then use the default "UseContentTypeRequestParser" .. *then* it does what the rest of ryan's current Dispatcher does, taking the rest of the path to pick a request handler.

the beauty of this approach is that if no tags appear in the solrconfig.xml, then the URLs look exactly like you guys want, and the request parsing / stream building semantics are exactly the same as they are today ...

if/when we (or maybe just "i") write those other RequestParsers, people can choose to turn them on (and change their URLs) if they want, but if they don't they can keep having the really simple URLs ... OR they could register something like this...

...and have really simple URLs, but be guaranteed that they always got their streams from raw POST bodies.

This would also solve Ryan's concern about allowing people to turn off fetching streams from remote URLs (or from local files, a small concern i had but hadn't mentioned yet since we had bigger fish to fry)

Thoughts?

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > It would be: > http://${context}/${path}?stream.type=post Yes! Feels like a much more natural place to me than as part of the path of the URL. Just need to hash out meaningful param names/values? Oh, and I'm more interested in the semantics of those param/values, and not what request parser it happens to get mapped to. I'd vote for different request parsers being an implementation detail, and keeping those details (plugability) out of solrconfig.xml for now. We could always add it later, but it's a lot tougher to remove things. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > - plus everyone understands how to put something in a URL. if nothing
> > else, think of putting the "parsetype" in the URL as a checksum that the
> > RequestParser can use to validate its assumptions -- if it's not there,
> > then it can do all of the intelligent things you think it should do, but
> > if it is there that dictates what it should do.
>
> If it's optional in the args, I could be on board with that.
> If its optional in the req.getQueryString() I'm in.
>
> Ignore my previous post about ${context}/multipart/asdgadsga
> It would be:
> http://${context}/${path}?stream.type=post

Yes! Feels like a much more natural place to me than as part of the path of the URL. Just need to hash out meaningful param names/values?

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> -- but everyone understands how to put something in a URL. if nothing
> else, think of putting the "parsetype" in the URL as a checksum that the
> RequestParser can use to validate its assumptions -- if it's not there,
> then it can do all of the intelligent things you think it should do, but
> if it is there that dictates what it should do.

If it's optional in the args, I could be on board with that.
If it's optional in the req.getQueryString() I'm in.

Ignore my previous post about ${context}/multipart/asdgadsga

It would be:
http://${context}/${path}?stream.type=post
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> consider the example you've got on your test.html page: "POST - with query
> string" ... that doesn't obey the typical semantics of a POST with a query
> string ... if you used the methods on HttpServletRequest to get the params
> it would give you all the params it found both in the query strings *and*
> in the post body.
>
> Blech. I was wondering about that. Sounds like bad form, but perhaps
> could be supported via something like /solr/foo?postbody=args

In my opinion, I don't think we need to worry about it for the *default* handler. That is not a very difficult constraint and there is no one out there expecting to be able to post parameters in the URL and the body. I'm not sure it is worth complicating anything if this is the only thing we are trying to avoid.

I think the *default* should handle all the cases mentioned without the client worrying about different URLs for the various methods.

The next question is which (if any) of the explicit parsers you think are worth including in web.xml?

http://${host}/${context}/${path/from/config}            (default)
http://${host}/${context}/params/${path/from/config}     (uses getParameterMap() to fill args)
http://${host}/${context}/multipart/${path/from/config}  (force multipart request)
http://${host}/${context}/stream/${path/from/config}     (params from URL, body as stream)
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> but the HTTP Client libraries in various languages don't always make it
> easy to set Content-Type -- and even if they do that doesn't mean the
> person using that library knows how to use it properly

I think we have to go with common usages. We neither rely on, nor discard, content-type in all cases:
- When it has a charset, believe it.
- When it says form-encoded, only believe it if there aren't args on the URL (because many clients like curl default to "application/x-www-form-urlencoded" for a post).

> -- but everyone understands how to put something in a URL. if nothing
> else, think of putting the "parsetype" in the URL as a checksum that the
> RequestParser can use to validate its assumptions -- if it's not there,
> then it can do all of the intelligent things you think it should do, but
> if it is there that dictates what it should do.

If it's optional in the args, I could be on board with that.

> (aren't you the one that convinced me a few years back that it was better
> to trust a URL than to trust HTTP Headers? ... because people understand
> URLs and put things in them, but they don't always know what headers to
> send .. curl being the great example, it always sends a Content-Type even
> if the user doesn't ask it to right?)

Well, for the update server, we do ignore the form-data stuff, but we don't ignore the charset.

> : Multi-part posts will have the content-type set correctly, or it won't work.
> : The big use-case I see is browser file upload, and they will set it correctly.
>
> right, but my point is what if i want the multi-part POST body left alone
> so my RequestHandler can deal with it as a single stream -- if i set every
> header correctly, the "smart" parsing code will parse it -- which is why
> something in the URL telling it *not* to parse it is important.

That sounds like a pretty rare corner case.

> : We should not preclude wacky handlers from doing things for
> : themselves, calling our stuff as utility methods.
>
> how? ... if there is one and only one RequestParser which makes the
> SolrRequest before the RequestHandler ever sees it, and parses the post
> body because the content-type is multipart/mixed how can a wacky handler
> ever get access to the raw post body?

I wasn't thinking *that* whacky :-) There are always other options, such as using your own servlet though. I don't think we should try to solve every case (the whole 80/20 thing).

-Yonik
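Yonik's content-type rules above ("believe the charset; believe form-encoding only when the URL itself carries no args, because curl sends that content type by default") can be written down as a tiny predicate. This is a hypothetical sketch of the heuristic as described in the thread, not actual Solr code:

```java
public class ContentTypeHeuristic {

    // Should the POST body be parsed as form-encoded parameters?
    // Believe "application/x-www-form-urlencoded" only when there are
    // no args on the URL, since many clients send it unconditionally.
    public static boolean treatBodyAsParams(String contentType, String queryString) {
        if (contentType == null) {
            return false;
        }
        boolean formEncoded = contentType.startsWith("application/x-www-form-urlencoded");
        boolean hasUrlArgs  = queryString != null && !queryString.isEmpty();
        return formEncoded && !hasUrlArgs;
    }
}
```

Under this rule a `curl -d '<add/>' 'http://host/solr/update?commit=true'` style request (form-encoded header, args on the URL) keeps its body as a raw stream rather than having it swallowed as params.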
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> Ryan: this patch truly does kick ass ... we can probably simplify a lot
> of the Legacy stuff by leveraging your new StandardRequestBuilder -- but
> that can be done later.

Much is already done by the looks of it.

> i'm still really not liking the way there is a single SolrRequestBuilder
> with a big complicated build method that "guesses" what streams the user
> wants.

But I don't need a separate URL to do GET vs POST in HTTP. It seems like having a different URL for where you put the args would be hard to explain to people.

> i really feel strongly that even if all the parsing logic is in the core,
> even if it's all in one class: a piece of the path should be used to
> determine where the streams come from.

If there's a ? in the URL, then it's args, so that could always safely be parsed. Perhaps a special arg, if present, could override the default method of getting input streams?

> consider the example you've got on your test.html page: "POST - with query
> string" ... that doesn't obey the typical semantics of a POST with a query
> string ... if you used the methods on HttpServletRequest to get the params
> it would give you all the params it found both in the query strings *and*
> in the post body.

Blech. I was wondering about that. Sounds like bad form, but perhaps could be supported via something like /solr/foo?postbody=args

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: To be clear, (with the current implementation in SOLR-104) you would
: have to put this in your solrconfig.xml
:
: Notice the preceding '/'. I think this is a strong indication that
: someone *wants* /select to behave distinctly.

crap ... i totally misread that ... so if people have a requestHandler registered with a name that doesn't start with a slash, they can't use the new URL structure and they have to use the old one. DAMN! ... that is slick dude ... okay, i agree with you, the odds of that causing problems are pretty fucking low.

I'm still hung up on this "parse" logic thing ... i really think it needs to be in the path .. or at the very least, there needs to be a way to specify it in the path to force one behavior or another, and if it's not in the path then we can guess based on the Content-Type. Putting it in a query arg would make getting it without contaminating the POST body kludgy, putting it at the start of the path doesn't work well for supporting a default if it isn't there, and putting it at the end of the path messes up the nice work you've done letting RequestHandlers have extra path info for encoding info they want.

Hmmm... What if we did something like this...

/exec/handler/name:extra/path?param1=val1
/raw/handler/name:extra/path?param1=val1
/url/handler/name:extra/path?param1=val1&url=...&url=...
/file/handler/name:extra/path?param1=val1&file=...&file=...

where "exec" means guess based on the Content-Type, "raw" means use the POST body as a single stream regardless of Content-Type, etc...

thoughts?

-Hoss
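The URL scheme Hoss proposes above splits cleanly into three pieces: a parse-mode prefix, a handler name (which may itself contain slashes, like update/xml), and optional ":extra/path" info. A hypothetical helper sketching that split -- the method and class names are invented, and real code would need to validate the mode against a known set:

```java
public class ParseModeUrl {

    // Split a path like "/raw/update/xml:extra/path" into
    // { mode, handlerName, extraPathInfo }.
    public static String[] split(String path) {
        int slash = path.indexOf('/', 1);          // end of the mode segment
        String mode = path.substring(1, slash);
        String rest = path.substring(slash + 1);   // handler name + optional extra
        int colon = rest.indexOf(':');
        String handler = (colon < 0) ? rest : rest.substring(0, colon);
        String extra   = (colon < 0) ? ""   : rest.substring(colon + 1);
        return new String[] { mode, handler, extra };
    }
}
```

Because the handler name runs up to the ':' (not the next '/'), handler names with slashes still work, which is the property the thread cares about.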
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I just posted a new patch on SOLR-104. I think it addresses most of
: the issues we have discussed. (Its a little difficult to know as it
: has been somewhat circular) I was going to reply to your points one
: by one, but i think that would just make the discussion more confusing
: than it already is!

Ryan: this patch truly does kick ass ... we can probably simplify a lot of the Legacy stuff by leveraging your new StandardRequestBuilder -- but that can be done later.

i'm still really not liking the way there is a single SolrRequestBuilder with a big complicated build method that "guesses" what streams the user wants. i really feel strongly that even if all the parsing logic is in the core, even if it's all in one class: a piece of the path should be used to determine where the streams come from.

consider the example you've got on your test.html page: "POST - with query string" ... that doesn't obey the typical semantics of a POST with a query string ... if you used the methods on HttpServletRequest to get the params it would give you all the params it found both in the query strings *and* in the post body. This is a great example of what i was talking about: if i have no intention of sending a stream, it should be possible for me to send params in both the URL and in the POST body -- but in other cases i should be able to POST some raw XML and still have params in the URL.

arguably: we could look at the Content-Type of the request and make the assumption based on that -- but as i mentioned before, people don't always set the Content-Type perfectly. if we used a URL fragment to determine where the streams should come from we could be a lot more confident that we know where the stream should come from -- and let the RequestHandler decide if it wants to trust the ContentType

the multipart/mixed example i gave previously is another example -- your code here assumes that should be given to the RequestHandler as multiple streams -- which is a great assumption to make for file uploads, but which gives me no way to POST multipart/mixed mime data that i want given to the RequestHandler as a single ContentStream (so it can have access to all of the mime headers for each part)

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> easy thing to deal with just by scoping the URLs .. put something,
> ANYTHING, in front of these urls, that isn't "select" or "update"

I'll let you and Yonik decide this one. I'm fine either way, but I really don't see a problem letting people easily override URLs. I actually think it is a good thing.

> consider the case where a user today has this in his solrconfig...

To be clear, (with the current implementation in SOLR-104) you would have to put this in your solrconfig.xml

Notice the preceding '/'. I think this is a strong indication that someone *wants* /select to behave distinctly.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > that scares me ... not only does it rely on the client code sending the
: > correct content-type
:
: Not really... that would perhaps be the default, but the parser (or a
: handler) can make intelligent decisions about that.
:
: If you put the parser in the URL, then there's *that* to be messed up
: by the client.

but the HTTP Client libraries in various languages don't always make it easy to set Content-Type -- and even if they do that doesn't mean the person using that library knows how to use it properly -- but everyone understands how to put something in a URL. if nothing else, think of putting the "parsetype" in the URL as a checksum that the RequestParser can use to validate its assumptions -- if it's not there, then it can do all of the intelligent things you think it should do, but if it is there that dictates what it should do.

(aren't you the one that convinced me a few years back that it was better to trust a URL than to trust HTTP Headers? ... because people understand URLs and put things in them, but they don't always know what headers to send .. curl being the great example, it always sends a Content-Type even if the user doesn't ask it to right?)

: Multi-part posts will have the content-type set correctly, or it won't work.
: The big use-case I see is browser file upload, and they will set it correctly.

right, but my point is what if i want the multi-part POST body left alone so my RequestHandler can deal with it as a single stream -- if i set every header correctly, the "smart" parsing code will parse it -- which is why something in the URL telling it *not* to parse it is important.

: We should not preclude wacky handlers from doing things for
: themselves, calling our stuff as utility methods.

how? ... if there is one and only one RequestParser which makes the SolrRequest before the RequestHandler ever sees it, and parses the post body because the content-type is multipart/mixed how can a wacky handler ever get access to the raw post body?

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > A user should be confident that they can pick any name they possibly want
: > for their plugin, and it won't collide with any future addition we might
: > add to Solr.
:
: But that doesn't seem possible unless we make user plugins
: second-class citizens by scoping them differently. In the event there
: is a collision in the future, the user could rename one of the
: plugins.

when it comes to URLs, our plugins currently are second class citizens -- plugin names appear in the "qt" or "wt" params -- users can pick any names they want and they are totally legal, they don't have to worry about any possibility that a name they pick will collide with a path we have mapped to a servlet.

Users shouldn't have to change the names of requestHandlers just because Solr adds a new feature with the same name -- changing a requestHandler name could be a heavy burden for a Solr user depending on how many clients *they* have using that requestHandler with that name.

i wouldn't make a big deal out of this if it was unavoidable -- but it is such an easy thing to deal with just by scoping the URLs .. put something, ANYTHING, in front of these urls, that isn't "select" or "update" and then put the requestHandler name and we've now protected ourselves and our users.

consider the case where a user today has this in his solrconfig...

..with the URL structure you guys are talking about, with the DispatchFilter matching on /* and interpreting the first part of the path as a possible requestHandler name, that user can't upgrade Solr because he's relying on the old "/select?qt=select" style URLs to work ... he has to change the name of his requestHandler and all of his clients, then upgrade, then change all of his clients again to take advantage of the new URL structure (and the new features it provides for updates)

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I just posted a new patch on SOLR-104. I think it addresses most of the issues we have discussed. (It's a little difficult to know as it has been somewhat circular.) I was going to reply to your points one by one, but i think that would just make the discussion more confusing than it already is!

> (i don't trust HTTP Client code -- but for the sake
> of argument let's assume all clients are perfect) what happens when a
> person wants to send a mime multi-part message *AS* the raw post body -- so
> the RequestHandler gets it as a single ContentStream (ie: single input
> stream, mime type of multipart/mixed) ?

Multi-part posts will have the content-type set correctly, or it won't work. The big use-case I see is browser file upload, and they will set it correctly.

I don't see it as a big problem because we don't have to deal with legacy streams yet. No one is expecting their existing stream code to work. The only header value the SOLR-104 code relies on is 'multipart'. I think that is a reasonable constraint since it has to be implemented properly for commons-file-upload to work.

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> > I'm not sure what "it" is in the above sentence ... i believe from the
> > context of the rest of the message you are referring to using a
> > ServletFilter instead of a Servlet -- i honestly have no opinion about
> > that either way.
>
> I thought a filter required you to open up the WAR file and change
> web.xml, or am I misunderstanding?

If your question is: do you need to edit web.xml to change the URL it will apply to? My suggestion is to map /* to the DispatchFilter and have it decide whether or not to handle the requests. With a filter, you can handle the request directly or pass it up the chain. This would allow us to have the URL structures defined by solrconfig.xml (without a need to edit web.xml).

If your question is about configuring the RequestParser: Yes, you would need to edit web.xml.

My (our?) reasons for suggesting this are:

1) I think we only have one RequestParser that will handle all normal requests. Unless you have extremely specialized needs, this is not something you would change.

2) Since the RequestParser is tied so closely to HttpServletRequest and your desired URL structure, it seems appropriate to configure it in web.xml. A RequestParser is just a utility class for servlets/filters.

3) We don't want to add RequestParser to 'core' unless it really needs to be a pluggable interface. I don't see the need for it just yet.

ryan
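The "map /* to the DispatchFilter and let it decide" idea reduces to one decision: does the request path match a handler registered in solrconfig.xml? If yes, handle it; if no, pass the request up the filter chain. A plain-Java stand-in for that decision (the real thing would live in a javax.servlet.Filter; the registry shape here is hypothetical):

```java
import java.util.Set;

public class DispatchDecision {

    // true  -> the filter handles the request itself
    // false -> the filter passes the request up the chain
    public static boolean shouldHandle(String path, Set<String> registeredHandlers) {
        for (String name : registeredHandlers) {
            // only handlers registered with a leading '/' claim a URL path,
            // matching the "Notice the preceding '/'" convention in SOLR-104
            if (name.startsWith("/") && path.startsWith(name)) {
                return true;
            }
        }
        return false;
    }
}
```

This is what lets the URL structure live in solrconfig.xml: adding a handler named "/update" claims that path without touching web.xml, while unmatched paths (the admin JSPs, static files) fall through untouched.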
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote:
> : 1) I think it should be a ServletFilter applied to all requests that
> : will only process requests with a registered handler.
>
> I'm not sure what "it" is in the above sentence ... i believe from the
> context of the rest of the message you are referring to using a
> ServletFilter instead of a Servlet -- i honestly have no opinion about
> that either way.

I thought a filter required you to open up the WAR file and change web.xml, or am I misunderstanding?

--
Alan Burlison
--
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : I have imagined the single default parser handles *all* the cases you
> : just mentioned.
>
> Ahhh ... a lot of confusing things make more sense now. .. but some
> things are more confusing: If there is only one parser, and it decides
> what to do based entirely on param names and HTTP headers, then what's
> the point of having the parser name be part of the path in your URL
> design?

I didn't think it would be part of the URL anymore.

> : POST: depending on headers/content type etc you parse the body as a
> : single stream, multi-part files or read the params.
> :
> : It will take some careful design, but I think all the standard cases
> : can be handled by a single parser.
>
> that scares me ... not only does it rely on the client code sending the
> correct content-type

Not really... that would perhaps be the default, but the parser (or a handler) can make intelligent decisions about that. If you put the parser in the URL, then there's *that* to be messed up by the client.

> (i don't trust HTTP Client code -- but for the sake of argument let's
> assume all clients are perfect) what happens when a person wants to send
> a mime multi-part message *AS* the raw post body -- so the RequestHandler
> gets it as a single ContentStream (ie: single input stream, mime type of
> multipart/mixed) ?

Multi-part posts will have the content-type set correctly, or it won't work. The big use-case I see is browser file upload, and they will set it correctly.

> This may sound like a completely ridiculous idea, but consider the
> situation where someone is indexing email ... they've written a
> RequestHandler that knows how to parse multipart mime emails and convert
> them to documents, they want to POST them directly to Solr and let their
> RequestHandler deal with them as a single entity.

We should not preclude wacky handlers from doing things for themselves, calling our stuff as utility methods.

> ..i think life would be a lot simpler if we kept the RequestParser name
> as part of the URL, completely determined by the client (since the client
> knows what it's trying to send) ... even if there are only 2 or 3 types
> of RequestParsing being done.

Having to do different types of posts to different URLs doesn't seem optimal, esp if we can do it in one.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> the thing about Solr, is there really aren't a lot of "defaults" in the
> sense you mean ... there is just an example -- people might copy the
> example, but if they don't have something in their solrconfig, most
> things just aren't there

I expect that most users will fall into that category though. A minority use custom request handlers and I expect a vast minority to use custom update handlers.

> A user should be confident that they can pick any name they possibly want
> for their plugin, and it won't collide with any future addition we might
> add to Solr.

But that doesn't seem possible unless we make user plugins second-class citizens by scoping them differently. In the event there is a collision in the future, the user could rename one of the plugins.

The same type of collision can happen today with our current request handler framework, but I don't think it's worth uglifying URLs over. It will be very rare and there are ways to easily work around it.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> i would really feel a lot happier with something like these that you
> mentioned...

If it will make you happier, then I think it's a good idea! (even if i don't see it as a Problem)

: /solr/dispatch/update/xml
: /solr/cmd/update/xml
: /solr/handle/update/xml
: /solr/do/update/xml

> http://${host}:${port}/${context}/do/${parser}/${handler/with/optional/slashes}?${params}

(assuming the number of parsers is <3 and solr.war would only have 1) How about:

http://${host}:${port}/${context}/${parser}/${handler/with/optional/slashes}?${params}

Thoughts for the default parser name? 'do' gives me the struts heebie-jeebies :)

> we can still handle...
>
> http://${host}:${port}/${context}/select/?qt=${handler}&${params}
>
> ..with a really simple ServletFilter (that has no risk of collision with
> the new URL structure one, so it can go anywhere in the FilterChain)

yes. likewise with /update
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I have imagined the single default parser handles *all* the cases you
: just mentioned.

Ahhh ... a lot of confusing things make more sense now. .. but some things are more confusing: If there is only one parser, and it decides what to do based entirely on param names and HTTP headers, then what's the point of having the parser name be part of the path in your URL design?

: POST: depending on headers/content type etc you parse the body as a
: single stream, multi-part files or read the params.
:
: It will take some careful design, but I think all the standard cases
: can be handled by a single parser.

that scares me ... not only does it rely on the client code sending the correct content-type (i don't trust HTTP Client code -- but for the sake of argument let's assume all clients are perfect) what happens when a person wants to send a mime multi-part message *AS* the raw post body -- so the RequestHandler gets it as a single ContentStream (ie: single input stream, mime type of multipart/mixed) ?

This may sound like a completely ridiculous idea, but consider the situation where someone is indexing email ... they've written a RequestHandler that knows how to parse multipart mime emails and convert them to documents, they want to POST them directly to Solr and let their RequestHandler deal with them as a single entity.

..i think life would be a lot simpler if we kept the RequestParser name as part of the URL, completely determined by the client (since the client knows what it's trying to send) ... even if there are only 2 or 3 types of RequestParsing being done.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: : This would drop the ':' from my proposed URL and change the scheme to look like:
: /parser/path/the/parser/knows/how/to/extract/?params

i was totally okay with the ":" syntax (although we should double check if ":" is actually a legal unescaped URL character) .. but i'm confused by this new suggestion ... is "parser" the name of the parser in that example and "path/the/parser/knows/how/to/extract" data that the parser may use to build the SolrRequest with? (ie: perhaps the RequestHandler) would parser names be required to not have slashes in them in that case?

(working with the assumption that most cases can be defined by a single request parser) I am/was suggesting that a dispatch servlet/filter has a single request parser. The default request parser will choose the handler based on names defined in solrconfig.xml. If someone needs a custom RequestParser, it would be linked to a new servlet/filter (possibly) mapped to a distinct prefix. If it is not possible to handle most standard stream cases with a single request parser, i will go back to the /path:parser format.

I suggest it is configured in web.xml because that is a configurable place that is not solrconfig.xml. I don't think it is or should be a highly configurable component.

: : Thank goodness you didn't! I'm confident you won't let me (or anyone)
: : talk you into something like that! You guys made a lot of good

the point i was trying to make is that if we make a RequestParser interface with a "parseRequest(HttpServletRequest req)" method, it amounts to just as much badness -- the key is we can make that interface as long as all the implementations are in the Solr code base where we can keep an eye on them, and people have to go way, WAY, *WAY* into solr to start changing them.

Yes, implementing a RequestParser is more like writing a custom Servlet than adding a Tokenizer.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > then all is fine and dandy ... but what happens if someone tries to
: > configure a plugin with the name "admin" ... now all of the existing admin

: that is exactly what you would expect to happen if you map a handler
: to /admin. The person configuring solrconfig.xml is saying "Hey, use
: this instead of the default /admin. I want mine to make sure you are
: logged in using my custom authentication method." In addition, It may
: be reasonable (sometime in the future) to implement /admin as a
: RequestHandler. This could be a clean way to address SOLR-58 (xml
: with stylesheets, or JSON, etc...)

yeah i guess that wouldn't be too horrible ... i think what i was trying to point out was that if we'd rolled out these super simple urls containing just the plugin name and someone did register a plugin overriding the admin pages, we'd screw them over later when we did get around to replacing the admin pages with a plugin if we added it as a special override ServletFilter mapping

: > also: what happens a year from now when we add some completely new
: > Servlet/ServletFilter to Solr, and want to give it a unique URL...
: >
: > http://host:/solr/bar/

: obviously, I think the default solr settings should be prudent about
: selecting URLs. The standard configuration should probably map most
: things to /select/xxx or /update/xxx.

the thing about Solr, is there really aren't a lot of "defaults" in the sense you mean ... there is just an example -- people might copy the example, but if they don't have something in their solrconfig, most things just aren't there

: > ...we could put it earlier in the processing chain before the existing
: > ServletFilter, but then we break any users that have registered a plugin
: > with the name "bar".

: Even if we move this to have a prefix path, we run into the exact same
: issue when sometime down the line solr has a default handler mapped to
: 'bar'

the point i was trying to make is that the "namespaces" that Solr uses should be unique -- the piece of the URL path that is used to pick the Servlet or Filter for dispatching the request should be uniquely distinguishable from the piece of the URL that is used to look up a plugin. A user should be confident that they can pick any name they possibly want for their plugin, and it won't collide with any future addition we might add to Solr.

if the new and improved solr URLs (minus host:port/context) are just /${plugin}/... with a dispatcher that matches on any URL and checks that path for a plugin matching that name, then we have no way of ever adding any other URL for a new feature in the future without running the risk that whatever base path we pick for that new feature's URLs, we might screw over a user who just so happened to pick that feature's name when registering a plugin -- either because we put the new feature earlier in the FilterChain and it circumvents requests the user expects to go to that plugin, or because we put that feature later in the FilterChain and that user doesn't get to take advantage of it unless he changes the name he registered the plugin with (and changes all of his clients)

i would really feel a lot happier with something like these that you mentioned...

: /solr/dispatch/update/xml
: /solr/cmd/update/xml
: /solr/handle/update/xml
: /solr/do/update/xml

http://${host}:${port}/${context}/do/${parser}/${handler/with/optional/slashes}?${params}

sounds great to me... just as long as we have some constant prefix in there so that later on we can use something else.

we can still handle...

http://${host}:${port}/${context}/select/?qt=${handler}&${params}

..with a really simple ServletFilter (that has no risk of collision with the new URL structure one, so it can go anywhere in the FilterChain)

-Hoss
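The two URL styles above (legacy /select?qt=name, and the new constant-prefix /do/${parser}/${handler}) can coexist precisely because the legacy filter only has to look at the path prefix and the qt param. A hypothetical sketch of that mapping -- "/do" as the constant prefix and "standard" as the fallback handler name are assumptions from this thread, not settled Solr behavior:

```java
public class LegacyUrlMapper {

    // Resolve the handler name for either URL style, or null if
    // this filter shouldn't handle the path at all.
    public static String handlerFor(String path, String qtParam) {
        if (path.startsWith("/select")) {
            // old style: handler comes from qt, defaulting to "standard"
            return (qtParam == null || qtParam.isEmpty()) ? "standard" : qtParam;
        }
        if (path.startsWith("/do/")) {
            // new style: strip the constant prefix, then the parser segment
            String rest = path.substring("/do/".length());
            return rest.substring(rest.indexOf('/') + 1);
        }
        return null; // not ours; pass up the FilterChain
    }
}
```

Because "/select" and "/do" are fixed namespaces, neither can ever collide with a user-chosen plugin name, which is the whole point Hoss is arguing for.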
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > what!? .. really? ... you don't think the ones i mentioned before are
> > things we should support out of the box?
> >
> > - no stream parser (needed for simple GETs)
> > - single stream from raw post body (needed for current updates)
> > - multiple streams from multipart mime in post body (needed for SOLR-85)
> > - multiple streams from files specified in params (needed for SOLR-66)
> > - multiple streams from remote URL specified in params
>
> I have imagined the single default parser handles *all* the cases you
> just mentioned.

Yes, this is what I had envisioned. And if we come up with another cool standard one, we can add it and all the current/older handlers get that additional behavior for free.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > This would give people a relativly easy way to implement 'restful' : > URLs if they need to. (but they would have to edit web.xml) : : A handler could alternately get the rest of the path (absent params), right? only if the RequestParser adds it to the SolrRequest as a SolrParam. : > Unit tests should be handled by execute( handler, req, res ) : : How does the unit test get the handler? i think ryans point is that when testing a handler, you should know which handler you are testing, so construct it and execute it directly. : > I am proposing we have a single interface to do this: : > SolrRequest r = RequestParser.parse( HttpServletRequest ) : : That's currently what new SolrServletRequest(HttpServletRequest) does. : We just need to figure out how to get InputStreams, Readers, etc. we start by adding "Iterable getStreams()" to the SolrRequest interface, with a setter on all of the Impls that's not part of the interface. then i suspect what we'll see is two classes that look like this.. public class NoStreamRequestParser implements RequestParser { public SolrRequest parse(HttpServletRequest req) { return new SolrServletRequest(HttpServletRequest); } } public class RawPostStreamRequestParser extends NoStreamRequestParser { public SolrRequest parse(HttpServletRequest req) { ContentStream c = makeContentStream(req.getInputStream()) SolrServletRequest s = super.parse(req); s.setStreams(new SinglItemCollection(c)) return s; } } : So, the hander needs to be able to get an InputStream, and HTTP headers. : Other plugins (CSV) will ask for a Reader and expect the details to be : ironed out for it. : : Method1: come up with ways to expose all this info through an : interface... a "headers" object could be added to the SolrRequest : context (see getContext()) this is why Ryan and i have been talking in terms of a "ContentStream" interface instead of just "InputStream" .. 
at some point we talked about the ContentStream having getters for mime type and charset that might be null if unknown. -Hoss
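A hypothetical sketch of the ContentStream idea being discussed: a stream plus optional type/charset metadata, with a Reader convenience for handlers like the CSV loader. All names here (`ContentStream`, `StringContentStream`, `getReader`) are illustrative readings of the thread, not the API as it existed at the time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// A stream plus optional metadata: both metadata getters may return null
// when the transport didn't say (e.g. a raw POST with no charset param).
interface ContentStream {
    String getContentType();                   // may be null if unknown
    String getCharset();                       // may be null if unknown
    InputStream getStream() throws IOException;

    // Convenience for Reader-oriented handlers (the CSV case): wrap the
    // raw bytes, falling back to UTF-8 when no charset was declared.
    default Reader getReader() throws IOException {
        String cs = getCharset();
        return new InputStreamReader(getStream(), cs == null ? "UTF-8" : cs);
    }
}

// Minimal in-memory impl, handy for unit-testing handlers.
class StringContentStream implements ContentStream {
    private final String body, contentType, charset;

    StringContentStream(String body, String contentType, String charset) {
        this.body = body;
        this.contentType = contentType;
        this.charset = charset;
    }

    public String getContentType() { return contentType; }
    public String getCharset()     { return charset; }

    public InputStream getStream() throws IOException {
        return new ByteArrayInputStream(
            body.getBytes(charset == null ? "UTF-8" : charset));
    }
}
```

Handlers that want the raw bytes (the woodstox XML case) call getStream(); handlers that just want characters call getReader() and let the charset details get ironed out for them.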
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
what!? .. really? ... you don't think the ones i mentioned before are things we should support out of the box?

- no stream parser (needed for simple GETs)
- single stream from raw post body (needed for current updates)
- multiple streams from multipart mime in post body (needed for SOLR-85)
- multiple streams from files specified in params (needed for SOLR-66)
- multiple streams from remote URL specified in params

I have imagined the single default parser handles *all* the cases you just mentioned.

GET: read params from paramMap(). Check those params for special params that send you to one or many remote streams.

POST: depending on headers/content type etc you parse the body as a single stream, multi-part files or read the params.

It will take some careful design, but I think all the standard cases can be handled by a single parser.
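The branching Ryan describes for a single default parser could be sketched roughly like this. Everything here is hypothetical, including the `stream.url`/`stream.file` parameter names used to point a GET at remote or local streams.

```java
import java.util.Map;

// One parser, several extraction strategies: GETs have no body, so
// streams can only come from params pointing at remote/local sources;
// POSTs branch on the Content-Type header.
class DefaultParserSketch {
    static String pickStrategy(String method, String contentType,
                               Map<String, String> params) {
        if ("GET".equals(method)) {
            if (params.containsKey("stream.url"))  return "remote-url";
            if (params.containsKey("stream.file")) return "local-file";
            return "no-stream";
        }
        // POST: the content type decides
        if (contentType != null && contentType.startsWith("multipart/"))
            return "multipart";                   // SOLR-85 style uploads
        if (contentType != null &&
            contentType.startsWith("application/x-www-form-urlencoded"))
            return "form-params";                 // query-via-POST
        return "raw-post-body";                   // current /update behavior
    }
}
```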
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: The RequestParser would not be part of the core API - It would be a
: helper function for Servlets and Filters that call the core API. It
: could be configured in web.xml rather than solrconfig.xml. A
: RequestDispatcher (Servlet or Filter) would be configured with a
: single RequestParser.
:
: The RequestParser would be in charge of taking HttpRequest and determining:
: 1) The RequestHandler
: 2) The SolrRequest (Params & Streams)

This sounds fine to me ... i was going to suggest that having a public API for RequestParser that people could extend and register instances of in the solrconfig would be better than no public API at all -- but if we do that we've let the genie out of the bottle, better to be more restrictive about the internal API, and if/when new use cases come up we can revisit the decision then.

If the RequestParser is going to pick the RequestHandler, we might as well stick with the current model where the RequestHandler is determined by the "qt" SolrParam (it just wouldn't necessarily come from the "qt" param of the URL, since the RequestParser can decide where everything comes from -- it could be from a URL param or it could be from the path) to keep the API simple right?

interface RequestParser {
  public SolrRequest makeSolrRequest(HttpServletRequest req);
}

I'm curious though why you think RequestParsers should be managed in the web.xml ... do you mean they would each be a Servlet Filter? ... 
if we assume there's going to be a fixed list and they aren't easily extended, then why not just:

- have a HashMap of them in a single ServletFilter dispatcher,
- lookup the one to use based on the appropriate part of the path
- let that RequestParser make the SolrRequest
- continue with common code for all requests regardless of format:
  - get RequestHandler from the core by name
  - execute RequestHandler
  - get output writer by name
  - write out response

: It would not be the most 'pluggable' of plugins, but I am still having
: trouble imagining anything beyond a single default RequestParser.

what!? .. really? ... you don't think the ones i mentioned before are things we should support out of the box?

- no stream parser (needed for simple GETs)
- single stream from raw post body (needed for current updates)
- multiple streams from multipart mime in post body (needed for SOLR-85)
- multiple streams from files specified in params (needed for SOLR-66)
- multiple streams from remote URL specified in params

: Assuming anything doing *really* complex ways of extracting
: ContentStreams will do it in the Handler not the request parser. For
: reference see my argument for a separate DocumentParser interface in:
: http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

agreed ... but that can easily be added later.

: In my view, the default one could be mapped to "/*" and a custom one
: could be mapped to "/mycustomparser/*"
:
: This would drop the ':' from my proposed URL and change the scheme to look like:
: /parser/path/the/parser/knows/how/to/extract/?params

i was totally okay with the ":" syntax (although we should double check if ":" is actually a legal unescaped URL character) .. but i'm confused by this new suggestion ... is "parser" the name of the parser in that example and "path/the/parser/knows/how/to/extract" data that the parser may use to build the SolrRequest with? 
(ie: perhaps the RequestHandler) would parser names be required to not have slashes in them in that case?

: > Imagine if 3 years ago, when Yonik and I were first hammering out the API
: > for SolrRequestHandlers, we had picked this...
: >
: >   public interface SolrRequestHandlers extends SolrInfoMBean {
: >     public void init(NamedList args);
: >     public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
: >   }
:
: Thank goodness you didn't! I'm confident you won't let me (or anyone)
: talk you into something like that! You guys made a lot of good

the point i was trying to make is that if we make a RequestParser interface with a "parseRequest(HttpServletRequest req)" method, it amounts to just as much badness -- the key is we can make that interface as long as all the implementations are in the Solr code base where we can keep an eye on them, and people have to go way, WAY, *WAY* into solr to start changing them.

-Hoss
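For the "/parser/rest/of/path" scheme under debate, the dispatcher's split is trivial exactly when parser names are forbidden from containing slashes, which is the constraint being asked about. A hypothetical sketch:

```java
// Split "/parsername/rest/of/path" into { parserName, remainingPath }.
// Only unambiguous if parser names can never contain '/'.
class PathSketch {
    static String[] splitParserPath(String pathInfo) {
        String p = pathInfo.startsWith("/") ? pathInfo.substring(1) : pathInfo;
        int slash = p.indexOf('/');
        if (slash < 0) return new String[] { p, "" };
        return new String[] { p.substring(0, slash), p.substring(slash + 1) };
    }
}
```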
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it occurred to me "I really hope Ryan realizes i like all of his ideas, i'm just wondering if they can be better" -- most people I work with don't have the stamina to deal with my design reviews :)

Thank you both! This is the first time I've taken the time and effort to contribute to an open source project. I'm learning the pace/etiquette etc as I go along :) Honestly your critique is refreshing - I'm used to working alone or directing others. I *think* we are close to something we will all be happy with.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it occurred to me "I really hope Ryan realizes i like all of his ideas, i'm just wondering if they can be better" -- most people I work with don't have the stamina to deal with my design reviews :)

What occurred to me as i was *getting* home was that since I seem to be the only one that's (overly) worried about the RequestParser/HTTP abstraction -- and since i haven't managed to convince Ryan after all of my badgering -- it's probably just me being paranoid.

I think in general, the approach you've outlined should work great -- i'll reply to some of your more recent comments directly.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
First Ryan, thank you for your patience on this *very* long hash session. Most wouldn't last that long unless it were a flame war ;-) And thanks to Hoss, who seems to have the highest read+response bandwidth of anyone I've ever seen (I'll admit I've only been selectively reading this thread, with good intentions of coming back to it). On 1/19/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: It would not be the most 'pluggable' of plugins, but I am still having trouble imagining anything beyond a single default RequestParser. Assuming anything doing *really* complex ways of extracting ContentStreams will do it in the Handler not the request parser. Agreed... a custom handler opening various streams not covered by the default will most easily be handled by the handler opening the streams themselves. This would give people a relatively easy way to implement 'restful' URLs if they need to. (but they would have to edit web.xml) A handler could alternately get the rest of the path (absent params), right? Correct, SolrCore should not care what the request path is. That is why I want to deprecate the execute( ) function that assumes the handler is defined by 'qt' Unit tests should be handled by execute( handler, req, res ) How does the unit test get the handler? If I had my druthers, It would be: res = handler.execute( req ) but that is too big of a leap for now :) Yep... esp since the response writers now need the request for parameters, for the searcher (streaming docs, etc). You guys made a lot of good choices and solr is an amazing platform for it. I just wish I had known Lucene when I *started* Sol(a)r ;-) I am proposing we have a single interface to do this: SolrRequest r = RequestParser.parse( HttpServletRequest ) That's currently what new SolrServletRequest(HttpServletRequest) does. We just need to figure out how to get InputStreams, Readers, etc. I agree. This is why i suggest the RequestParsers is not a core part of the API, just a helper class for Servlets and Filters. 
Sounds good as a practical starting point to me. If we need more in the future, we can add it then. USECASE: The XML update plugin using the woodstox XML parser: Woodstox docs say to give the parser an InputStream (with char encoding, if available) for best performance. This is also preferable since if the charset isn't specified, the parser can try to snoop it from the stream. So, the handler needs to be able to get an InputStream, and HTTP headers. Other plugins (CSV) will ask for a Reader and expect the details to be ironed out for it. Method1: come up with ways to expose all this info through an interface... a "headers" object could be added to the SolrRequest context (see getContext()) Method2: consider it a more special case, have an XML update servlet that puts that info into the SolrRequest (perhaps via the context again) -Yonik
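The "snoop it from the stream" point is why handlers want raw bytes: an XML parser can recover the encoding from the declaration when HTTP didn't supply one. A simplified illustration of the idea (real encoding detection also handles BOMs and UTF-16 declarations, which this skips; names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingSniff {
    // Pull the encoding pseudo-attribute out of an XML declaration.
    // The declaration itself is ASCII-compatible, so decoding the first
    // bytes as Latin-1 is safe for this narrow purpose.
    static String snoopXmlEncoding(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = Pattern.compile("encoding=[\"']([^\"']+)[\"']").matcher(s);
        return m.find() ? m.group(1) : null;
    }
}
```

A handler given only a pre-built Reader can no longer do this, which is the argument for handing it the InputStream plus whatever the HTTP headers said.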
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
(Note: this is different than what i have suggested before. Treat it as brainstorming on how to take what i have suggested and mesh it with your concerns)

What if:

The RequestParser would not be part of the core API - It would be a helper function for Servlets and Filters that call the core API. It could be configured in web.xml rather than solrconfig.xml. A RequestDispatcher (Servlet or Filter) would be configured with a single RequestParser.

The RequestParser would be in charge of taking HttpRequest and determining:
1) The RequestHandler
2) The SolrRequest (Params & Streams)

It would not be the most 'pluggable' of plugins, but I am still having trouble imagining anything beyond a single default RequestParser. Assuming anything doing *really* complex ways of extracting ContentStreams will do it in the Handler not the request parser. For reference see my argument for a separate DocumentParser interface in:
http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

In my view, the default one could be mapped to "/*" and a custom one could be mapped to "/mycustomparser/*"

This would drop the ':' from my proposed URL and change the scheme to look like:
/parser/path/the/parser/knows/how/to/extract/?params

This would give people a relatively easy way to implement 'restful' URLs if they need to. (but they would have to edit web.xml)

: Would that be configured in solrconfig.xml as

Correct, SolrCore should not care what the request path is. That is why I want to deprecate the execute( ) function that assumes the handler is defined by 'qt'

Unit tests should be handled by execute( handler, req, res )

If I had my druthers, It would be: res = handler.execute( req ) but that is too big of a leap for now :)

... A third use case of doing queries with POST might be that you want to use standard CGI form encoding/multi-part file upload semantics of HTTP to send an XML file (or files) to the above mentioned XmlQPRequestHandler ... 
so then we have "MultiPartMimeRequestParser" ...

I agree with all your use cases. It just seems like a LOT of complex overhead to extract the general aspects of translating a URL+Params+Streams => Handler+Request(Params+Streams)

Again, since the number of 'RequestParsers' is small, it seems overly complex to have a separate plugin to extract the URL, another to extract the Handler, and another to extract the streams. Particularly since the decisions on how you parse the URL can totally affect the other aspects.

...i really, really, REALLY don't like the idea that the RequestParser Impls -- classes users should be free to write on their own and plug in to Solr using the solrconfig.xml -- are responsible for the URL parsing and parameter extraction. Maybe calling them "RequestParser" in my suggested design is misleading, maybe a better name like "StreamExtractor" would be better ... but they shouldn't be in charge of doing anything with the URL.

What if it were configured in web.xml, would you feel more comfortable letting it determine how the URL is parsed and streams are extracted?

Imagine if 3 years ago, when Yonik and I were first hammering out the API for SolrRequestHandlers, we had picked this...

public interface SolrRequestHandlers extends SolrInfoMBean {
  public void init(NamedList args);
  public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
}

Thank goodness you didn't! I'm confident you won't let me (or anyone) talk you into something like that! You guys made a lot of good choices and solr is an amazing platform for it.

That said, the task at issue is: How do we convert an arbitrary HttpServletRequest into a SolrRequest.

I am proposing we have a single interface to do this:
SolrRequest r = RequestParser.parse( HttpServletRequest )

You are proposing this is broken down further. 
Something like:

Handler h = (the filter) getHandler( req.getPath() )
SolrParams = (the filter) do stuff to extract the params (using parser.preProcess())
ContentStreams = parser.parse( request )

While it is not great to have plugins manipulate the HttpRequest - someone needs to do it. In my opinion, the RequestParser's job is to isolate *everything* *else* from the HttpServletRequest. Again, since the number of RequestParsers is small, it seems ok (to me).

keeping HttpServletRequest out of the API for RequestParsers helps us future-proof against breaking plugins down the road.

I agree. This is why i suggest the RequestParsers are not a core part of the API, just a helper class for Servlets and Filters.

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: All that said, this could just as cleanly map everything to: /solr/dispatch/update/xml /solr/cmd/update/xml /solr/handle/update/xml /solr/do/update/xml thoughts? That was my original assumption (because I was thinking of using servlets, not a filter), but I see little advantage to scoping under additional path elements. I also agree with the other points you make. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
then all is fine and dandy ... but what happens if someone tries to configure a plugin with the name "admin" ... now all of the existing admin pages break. that is exactly what you would expect to happen if you map a handler to /admin. The person configuring solrconfig.xml is saying "Hey, use this instead of the default /admin. I want mine to make sure you are logged in using my custom authentication method." In addition, It may be reasonable (sometime in the future) to implement /admin as a RequestHandler. This could be a clean way to address SOLR-58 (xml with stylesheets, or JSON, etc...) also: what happens a year from now when we add some completely new Servlet/ServletFilter to Solr, and want to give it a unique URL... http://host:/solr/bar/ obviously, I think the default solr settings should be prudent about selecting URLs. The standard configuration should probably map most things to /select/xxx or /update/xxx. ...we could put it earlier in the processing chain before the existing ServletFilter, but then we break any users that have registered a plugin with the name "bar". Even if we move this to have a prefix path, we run into the exact same issue when sometime down the line solr has a default handler mapped to 'bar' /solr/dispatcher/bar But, if it ever becomes a problem, we can add an "excludes" pattern to the filter-config that would skip processing even if it maps to a known handler. more short term: if there is no prefix that the ServletFilter requires, then supporting the legacy "http://host:/solr/update" and "http://host:/solr/select" URLs becomes harder, I don't think /update or /select need to be legacy URLs. They can (and should) continue to work as they currently do using a new framework. The reason I was suggesting that the Handler interface adds support to ask for the default RequestParser and/or ResponseWriter is to support this exact issue. 
(However in the case of path="/select" the filter would need to get the handler from ?qt=xxx)

- - - - -

All that said, this could just as cleanly map everything to:
/solr/dispatch/update/xml
/solr/cmd/update/xml
/solr/handle/update/xml
/solr/do/update/xml

thoughts?
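The "/select needs ?qt=xxx, everything else maps by path" rule could be sketched like this (a hypothetical helper, not actual Solr code; "standard" as the fallback handler name is also an assumption):

```java
import java.util.Map;

class LegacyDispatchSketch {
    // "/select" keeps its legacy qt-based lookup; any other path maps
    // directly to a registered handler name.
    static String handlerNameFor(String path, Map<String, String> params) {
        if ("/select".equals(path)) {
            String qt = params.get("qt");
            return qt == null ? "standard" : qt;  // legacy default handler
        }
        return path.startsWith("/") ? path.substring(1) : path;
    }
}
```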
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > > whoa ... hold on a minute, even if we use a ServletFilter to do all of the
: > > dispatching instead of a Servlet we still need a base path right?
: > I thought that's what the filter gave you... the ability to filter any
: > URL to the /solr webapp, and Ryan was doing a lookup on the next
: > element for a request handler.
: yes, this is the beauty of a Filter. It *can* process the request
: and/or it can pass it along. There is no problem at all with mapping
: a filter to all requests and a servlet to some paths. The filter will
: only handle paths declared in solrconfig.xml everything else will be
: handled however it is defined in web.xml

sorry ... i know that a ServletFilter can look at a request, choose to process it, or choose to ignore it ... my point was that if we use a Filter, we still should put in that filter logic to only look at requests starting with a fixed prefix.

consider this URL... http://host:/solr/foo/ ...where "solr" is the webapp name as usual. if the filter matches on "/*" and then does a lookup in the solrconfig for "foo" to find the Plugin to use for that request, and ignores the request and passes it down the chain if one isn't configured with the name "foo" then all is fine and dandy ... but what happens if someone tries to configure a plugin with the name "admin" ... now all of the existing admin pages break.

also: what happens a year from now when we add some completely new Servlet/ServletFilter to Solr, and want to give it a unique URL... http://host:/solr/bar/ ...we could put it earlier in the processing chain before the existing ServletFilter, but then we break any users that have registered a plugin with the name "bar". 
more short term: if there is no prefix that the ServletFilter requires, then supporting the legacy "http://host:/solr/update" and "http://host:/solr/select" URLs becomes harder, because how do we safely tell if the remote client is expecting the legacy behavior of those URLs, or if we are trying to support some plugin configured using the names "select" and "update" ? -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: > whoa ... hold on a minute, even if we use a ServletFilter do do all of the > dispatching instead of a Servlet we still need a base path right? I thought that's what the filter gave you... the ability to filter any URL to the /solr webapp, and Ryan was doing a lookup on the next element for a request handler. yes, this is the beauty of a Filter. It *can* process the request and/or it can pass it along. There is no problem at all with mapping a filter to all requests and a servlet to some paths. The filter will only handle paths declared in solrconfig.xml everything else will be handled however it is defined in web.xml (As a sidenote, wicket 2.0 replaces their dispatch servlet with a filter - it makes it MUCH easier to have their app co-exist with other things in a shared URL structure.)
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: whoa ... hold on a minute, even if we use a ServletFilter to do all of the dispatching instead of a Servlet we still need a base path right? I thought that's what the filter gave you... the ability to filter any URL to the /solr webapp, and Ryan was doing a lookup on the next element for a request handler. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: 1) I think it should be a ServletFilter applied to all requests that
: will only process requests with a registered handler.

I'm not sure what "it" is in the above sentence ... i believe from the context of the rest of the message you are referring to using a ServletFilter instead of a Servlet -- i honestly have no opinion about that either way.

: 2) I think the RequestParser should take care of parsing
: ContentStreams *and* SolrParams - not just the streams. The dispatch
: servlet/filter should never call req.getParameter().

If that's the case, then the RequestParser is in control of the URL structure ... except that it's not in control of the path info since that's how we pick the RequestParser in the first place ... what if we decide later that we want to change the URL structure -- then every RequestParser would have to be changed.

: 3) I think the dispatcher picks the Handler and either calls it
: directly or passes it to SolrCore. It does not put "qt" in the
: SolrParams and have SolrCore extract it (again)

that's perfectly fine with me - i only had it that way because that's how RequestHandler execution currently works, i wanted to leave anything not directly related to what i was suggesting exactly the way it currently is in my pseudo code.

: == Arguments for a ServletFilter: ==
: If we implement the dispatcher as a Filter:
: * the URL is totally customizable from solrconfig.xml

can you explain this more ... why does a ServletFilter make the URL more customizable than an alternative (which i believe is just a Servlet)

: If we implement the dispatcher as a Servlet
: * we have to define a 'base' path for each servlet - this would make
: the names longer than they need to be and add potential confusion in
: the configuration.

whoa ... hold on a minute, even if we use a ServletFilter to do all of the dispatching instead of a Servlet we still need a base path right? ... 
even if we ignore the current admin pages and assume we're going to replace them all with new RequestHandlers when we do this, what happens a year from now when we decide we want to add some new piece of functionality that needs a different Servlet/ServletFilter ... if we've got a Filter matching on "/*" don't we burn every possible bridge we have for adding something else later.

: Consider the servlet 'update' and another servlet 'select' With our
: proposed changes, these could both be the same servlet class
: configured to distinct paths. Now lets say you want to call:
: http://localhost/solr/update/xml?params
: Would that be configured in solrconfig.xml as
: http://www.nabble.com/Using-HTTP-Post-for-Queries-tf3039973.html)
: It seems like we may need a few ways to parse params out of the
: request. The way one handles the parameters directly affects the
: streams. This logic should be contained in a single place.

The intent there is to use a regular CGI form encoded POST body to express more params than the client feels safe putting in a URL, under the API i was suggesting that would be solved with a "No-Op" RequestParser that has empty preProcess and process methods. when the Servlet (or ServletFilter) builds the SolrParams (in between calling parser.preProcess and parser.process) it gets *all* of the form encoded params from the HttpServletRequest (because no code has touched the input stream)

an alternative situation in which you might want to "Query using HTTP POST" is if you had an XmlQPRequestHandler that understood the xml-query-parser syntax from this contrib...
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/xml-query-parser/
...which expected to read the XML from the ContentStreams of the SolrRequest, and you wanted to put the XML in the raw POST body of the request (the same way our current update POSTs work) but there were other options XmlQPRequestHandler wanted to get out of the SolrRequest's SolrParams. 
that would be handled by a "RawPostRequestParser" whose process method would be a No-Op, but the preProcess method would make a ContentStream out of the InputStream from the HttpServletRequest -- then the Servlet/ServletFilter would parse the url using the HttpServletRequest.getParameter() methods (which are now safe to call without damaging the InputStream). (That RawPostRequestParser would be reused along with an XmlUpdateHandler that we refactor the existing update logic from the core to support the legacy /update URLs)

A third use case of doing queries with POST might be that you want to use standard CGI form encoding/multi-part file upload semantics of HTTP to send an XML file (or files) to the above mentioned XmlQPRequestHandler ... so then we have "MultiPartMimeRequestParser" that has a No-Op preProcess method, and uses the Commons FileUpload code with a org.apache.commons.fileupload.RequestContext it builds out of the header info passed to preProcess by the Servlet.

: == The Dispatcher should pick the handler ==
: There is no reason it would need to inject 'qt' int
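The pre/process split being described can be sketched without any servlet API. All names below are hypothetical, and plain InputStreams stand in for ContentStreams to keep the sketch self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Phase 1 (preProcess) runs before any getParameter() call, so it may
// consume the raw POST body. Phase 2 (process) runs after the params
// are parsed and may derive additional streams from them.
interface StreamExtractor {
    List<InputStream> preProcess(InputStream rawBody);
    List<InputStream> process(Map<String, String> params, List<InputStream> soFar);
}

// "No-Op" parser: leaves the body untouched so the container can parse
// form-encoded params out of it (the query-via-POST use case).
class NoOpExtractor implements StreamExtractor {
    public List<InputStream> preProcess(InputStream rawBody) {
        return Collections.emptyList();
    }
    public List<InputStream> process(Map<String, String> params, List<InputStream> soFar) {
        return soFar;
    }
}

// "RawPost" parser: claims the body as the single content stream in
// phase 1, before anything can damage it.
class RawPostExtractor implements StreamExtractor {
    public List<InputStream> preProcess(InputStream rawBody) {
        return Collections.singletonList(rawBody);
    }
    public List<InputStream> process(Map<String, String> params, List<InputStream> soFar) {
        return soFar;
    }
}
```

A "MultiPartMime" extractor would be the mirror image of RawPost: a No-Op phase 1, with phase 2 (or the raw-body hook plus header info) doing the multipart parsing.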
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Ok, now i think I get what you are suggesting. The differences are that:

1) I think it should be a ServletFilter applied to all requests that will only process requests with a registered handler.
2) I think the RequestParser should take care of parsing ContentStreams *and* SolrParams - not just the streams. The dispatch servlet/filter should never call req.getParameter().
3) I think the dispatcher picks the Handler and either calls it directly or passes it to SolrCore. It does not put "qt" in the SolrParams and have SolrCore extract it (again)

== Arguments for a ServletFilter: ==

If we implement the dispatcher as a Filter:
* the URL is totally customizable from solrconfig.xml
* we have a single Filter to handle all standard requests
* with this single Filter, we can easily handle the existing URL structures
* configured URLs can sit at the 'Top level' next to 'top level' servlets

If we implement the dispatcher as a Servlet
* we have to define a 'base' path for each servlet - this would make the names longer than they need to be and add potential confusion in the configuration.

Consider the servlet 'update' and another servlet 'select' With our proposed changes, these could both be the same servlet class configured to distinct paths. Now lets say you want to call:
http://localhost/solr/update/xml?params
Would that be configured in solrconfig.xml as http://www.nabble.com/Using-HTTP-Post-for-Queries-tf3039973.html)

It seems like we may need a few ways to parse params out of the request. The way one handles the parameters directly affects the streams. This logic should be contained in a single place.

== The Dispatcher should pick the handler ==

In the proposed url scheme: /path/to/handler:parser, the dispatcher has to decide what handler it is. If we use a filter, it will look for a registered handler - if it can't find one, it will not process the request. 
There is no reason it would need to inject 'qt' into the solr params just so it can be pulled out by SolrCore (using the @deprecated function: solrReq.getQueryType()!) If the dispatcher is required to put a parameter in SolrParams, we could not make the RequestParser in charge of filling the SolrParams. This would require something like your pre/process system.

== Pseudo-Java ==

The real version will do error handling and will need some special logic to make '/select' behave exactly as it does now.

class SolrFilter implements Filter {
  void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) {
    HttpServletRequest req = (HttpServletRequest) request;
    String path = req.getServletPath();
    SolrRequestHandler handler = getHandlerFromPath( path );
    if( handler != null ) {
      SolrRequestParser parser = getParserFromPath( path );
      SolrQueryResponse solrRes = new SolrQueryResponse();
      SolrQueryRequest solrReq = parser.parse( request );
      core.execute( handler, solrReq, solrRes );
      return;
    }
    chain.doFilter(request, response);
  }
}

Modify core to directly accept the 'handler':

class SolrCore {
  public void execute(SolrRequestHandler handler, SolrQueryRequest req, SolrQueryResponse rsp) {
    // setup response header and handle request
    final NamedList responseHeader = new NamedList();
    rsp.add("responseHeader", responseHeader);
    handler.handleRequest(req,rsp);
    setResponseHeaderValues(responseHeader,req,rsp);
    log.info(req.getParamString()+ " 0 "+ (int)(rsp.getEndTime() - req.getStartTime()));
  }

  @Deprecated
  public void execute(SolrQueryRequest req, SolrQueryResponse rsp) {
    SolrRequestHandler handler = getRequestHandler(req.getQueryType());
    if (handler==null) {
      log.warning("Unknown Request Handler ...
    }
    this.execute( handler, req, rsp );
  }
}

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > Ah ... this is the one problem with high volume on an involved thread ...
: > i'm sending replies to messages you write after you've already read other
: > replies to other messages you sent and changed your mind :)
: Should we start a new thread?

I don't think it would make a difference ... we just need to slow down :)

: Ok, now (I think) I see the difference between our ideas.
:
: >From your code, it looks like you want the RequestParser to extract
: 'qt' that defines the RequestHandler. In my proposal, the
: RequestHandler is selected independent of the RequestParser.

no, no, no ... i'm sorry if i gave that impression ... the RequestParser *only* worries about getting streams, it shouldn't have any way of even *guessing* what RequestHandler is going to be used.

for reference: http://www.nabble.com/Re%3A-p8438292.html

note that i never mention "qt" .. instead i refer to "core.execute(solrReq, solrRsp);" doing exactly what it does today ... core.execute will call getRequestHandler(solrReq.getQueryType()) to pick the RequestHandler to use. the Servlet is what creates the SolrRequest object, and puts whatever SolrParams it wants (including "qt") in that SolrRequest before asking the SolrCore to take care of it.

: What do you imagine happens in:
: >
: > String p = pickRequestParser(req);

let's use the URL syntax you've been talking about that people seem to have agreed looks good (assuming i understand correctly) ...

/servlet/${requesthandler}:${requestparser}?param1=val1&param2=val2

what i was suggesting was that then the servlet which uses that URL structure might have a utility method called pickRequestParser that would look like... 
private String pickRequestParser(HttpServletRequest req) {
  String[] pathParts = req.getPathInfo().split(":");
  if (pathParts.length < 2 || "".equals(pathParts[1]))
    return "default"; // or "standard", or null -- whatever
  return pathParts[1];
}

: If the RequestHandler is defined by the RequestParser, I would
: suggest something like:

again, i can't emphasize enough that that's not what i was proposing ... i am in no way shape or form trying to talk you out of the idea that it should be possible to specify the RequestParser, the RequestHandler, and the OutputWriter all as part of the URL, and completely independent of each other. the RequestHandler and the OutputWriter could be specified as regular SolrParams that come from any part of the HTTP request, but the RequestParser needs to come from some part of the URL that can be inspected without any risk of affecting the raw post stream (ie: no HttpServletRequest.getParameter() calls)

: I still don't see why:
:
: > // let the parser preprocess the streams if it wants...
: > Iterable<ContentStream> s = solrParser.preprocess(getStreamInfo(req),
: >     new Pointer<InputStream>() {
: >       InputStream get() {
: >         return req.getInputStream();
: >       }
: >     });
: >
: > SolrParams params = makeSolrRequest(req);
: >
: > // let the parser decide what to do with the existing streams,
: > // or provide new ones
: > s = solrParser.process(solrReq, s);
: >
: > // ServletSolrRequest is a basic impl of SolrRequest
: > SolrRequest solrReq = new ServletSolrRequest(params, s);
:
: can not be contained entirely in:
:
:   SolrRequest solrReq = parser.parse( req );

because then the RequestParser would be defining how the URL is getting parsed -- the makeSolrRequest utility placeholder i described had the wrong name, i should have called it makeSolrParams ... it would look something like this in the URL syntax i described above... 
private SolrParams makeSolrParams(HttpServletRequest req) { // this class is already in our code base, used as is SolrParams p = new ServletSolrParams(req); String[] pathParts = req.getPathInfo().split(":"); if ("".equals(pathParts[0])) return p; Map tmp = new HashMap(); tmp.put("qt", pathParts[0]); return new DefaultSolrParams(new MapSolrParams(tmp), p); }

the nutshell version of everything i'm trying to say is...

SolrRequest - models all info about a request to solr to do something:
 - the key=val params associated with that request
 - any streams of data associated with that request

RequestParser(s) - different instances for different sources of streams:
 - is given two chances to generate ContentStreams:
   - once using the raw stream from the HTTP request
   - once using the params for the SolrRequest

SolrServlet - the only thing with direct access to the HttpServletRequest, shields the other interface APIs from the mechanics of HTTP:
 - dictates the URL structure
 - determines the name of the RequestParser to use
 - lets the parser have the raw input stream
 - determines where SolrParams for the request come from
 - lets the parser have the params to make more streams if it wants to

SolrCore - does all of the name lookups for processing a SolrRequest: -
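To make the "/handler:parser" convention concrete, here is a minimal self-contained sketch of the path split; the class name and the "standard" fallback are illustrative assumptions -- only the ':' syntax comes from this thread:

```java
// Hypothetical helper illustrating the "/handler:parser" URL split
// discussed above. Only the ':' convention is from the proposal.
class HandlerParserPath {
    final String handler; // e.g. "/my/update/csv"
    final String parser;  // e.g. "remoteurls"

    HandlerParserPath(String handler, String parser) {
        this.handler = handler;
        this.parser = parser;
    }

    static HandlerParserPath parse(String pathInfo) {
        int colon = pathInfo.indexOf(':');
        if (colon < 0) {
            // no ':' in the URL -- fall back to the "standard" parser
            return new HandlerParserPath(pathInfo, "standard");
        }
        String p = pathInfo.substring(colon + 1);
        return new HandlerParserPath(pathInfo.substring(0, colon),
                                     p.isEmpty() ? "standard" : p);
    }
}
```

A servlet's pickRequestParser and makeSolrParams utilities could both delegate to a split like this, keeping the URL structure in one place.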
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I was... then you talked me out of it! You are correct, the client : should determine the RequestParser independent of the RequestHandler. Ah ... this is the one problem with high volume on an involved thread ... i'm sending replies to messages you write after you've already read other replies to other messages you sent and changed your mind :) Should we start a new thread? Here's a more fleshed out version of the pseudo-java i posted earlier, with all of my addenda inlined and a few simple method calls changed to try and make the purpose more clear... Ok, now (I think) I see the difference between our ideas. From your code, it looks like you want the RequestParser to extract 'qt' that defines the RequestHandler. In my proposal, the RequestHandler is selected independent of the RequestParser. What do you imagine happens in: String p = pickRequestParser(req); This looks like you would have to have a standard way (per servlet) of getting the RequestParser. How do you envision that? What would be the standard way to choose your request parser? If the RequestHandler is defined by the RequestParser, I would suggest something like: interface SolrRequest { RequestHandler getHandler(); Iterable getContentStreams(); SolrParams getParams(); } interface RequestParser { SolrRequest getRequest( HttpServletRequest req ); // perhaps remove getHandler() from SolrRequest and add: RequestHandler getHandler(); } And then configure a servlet or filter with the RequestParser SolrRequestFilter ... RequestParser org.apache.solr.parser.StandardRequestParser Given that the number of RequestParsers is realistically small (as Yonik mentioned), I think this could be a good solution. To update my current proposal: 1. Servlet/Filter defines the RequestParser 2. RequestParser parses handler & request from HttpServletRequest 3. 
handled essentially as before To update the example URLs, defined by the "StandardRequestParser" /path/to/handler/?param where /path/to/handler is the "name" defined in solrconfig.xml To use a different RequestParser, it would need to be configured in web.xml /customparser/whatever/path/i/like - - - - - - - - - - - - - - I still don't see why: // let the parser preprocess the streams if it wants... Iterable s = solrParser.preprocess (getStreamInfo(req), new Pointer() { InputStream get() { return req.getInputStream(); }); SolrParams params = makeSolrRequest(req); // let the parser decide what to do with the existing streams, // or provide new ones Iterable s2 = solrParser.process(solrReq, s); // ServletSolrRequest is a basic impl of SolrRequest SolrRequest solrReq = new ServletSolrRequest(params, s); cannot be contained entirely in: SolrRequest solrReq = parser.parse( req ); assuming the SolrRequest interface includes Iterable getContentStreams(); the parser can use req.getInputStream() however it likes - either to make params and/or to build ContentStreams - - - - - - - - good good ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Cool. I think i need more examples... concrete is good :-) I don't quite grok your format below... is it one line or two? /path/defined/in/solrconfig:parser?params /${handler}:${parser} Is that simply /${handler}:${parser}?params yes. the ${} is just to show what is extracted from the request URI, not a specific example Imagine you have a CsvUpdateHandler defined in solrconfig.xml with a "name"="my/update/csv". The standard RequestParser could extract the parameters and Iterable for each of the following requests: POST: /my/update/csv/?separator=,&fields=foo,bar,baz (body) "10,20,30" POST: /my/update/csv/ multipart post with 5 files and 6 form fields defining (unlike the previous example, the handler would get 5 input streams rather than 1) GET: /my/update/csv/?post.remoteURL=http://..&separator=,&fields=foo,bar,baz&... fill the stream with the content from a remote URL GET: /my/update/csv/?post.body=bodycontent,&fields=foo,bar,baz&... use 'bodycontent' as the input stream. (note, this does not make much sense for csv, but is a useful example) POST: /my/update/csv:remoteurls/?separator=,&fields=foo,bar,baz (body) http://url1,http://url2,http://url3... In this case we would use a custom RequestParser ("remoteurls") that would read the post body and convert it to a stream of content urls. - - - - - - - The URL path (everything before the ':') would be entirely defined and configured by solrconfig.xml A filter would see if the request path matches a registered handler - if not it will pass it up the filter chain. This would allow custom filters and servlets to co-exist in the top level URL path. 
Consider: solrconfig.xml web.xml: MyRestfulDelete /mydelete/* POST: /delete?id=AAA would be sent to DeleteHandler POST: /mydelete/AAA/ would be sent to MyRestfulDelete Alternatively, you could have: solrconfig.xml web.xml: MyRestfulDelete /delete/* POST: /standard/delete?id=AAA would be sent to DeleteHandler POST: /delete/AAA/ would be sent to MyRestfulDelete I am suggesting we do not try to have the default request servlet/filter support extracting parameters from the URL. I think this is a reasonable tradeoff to be able to have the request path easily user-configurable using the *existing* plugin configuration. - - - - - - - - In a previous email, you mentioned changing the URL structure. With this proposal, we would continue to support: /select?wt=XXX for the Csv example, you would also be able to call: GET: /select?qt=/my/update/csv/&post.remoteURL=http://..&sepa... ryan
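The fall-through dispatch described above (match the request path against registered handlers, otherwise pass the request up the filter chain) can be sketched without the servlet API; a plain Map stands in for the solrconfig.xml registry, and all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed dispatch rule: match the path against handlers
// registered in solrconfig.xml; anything unmatched falls through so
// custom filters and servlets can co-exist in the top-level URL path.
class HandlerRegistry {
    private final Map<String, String> handlers = new HashMap<>();

    void register(String path, String handlerName) {
        handlers.put(path, handlerName);
    }

    /** Returns the handler for this path, or null to pass up the chain. */
    String dispatch(String path) {
        // strip the optional ":parser" suffix before the registry lookup
        int colon = path.indexOf(':');
        String handlerPath = (colon < 0) ? path : path.substring(0, colon);
        return handlers.get(handlerPath);
    }
}
```

With "/delete" registered, "/delete?id=AAA" style paths resolve to the handler, while "/mydelete/AAA/" returns null and falls through to whatever web.xml maps there.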
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: However, I'm not yet convinced the benefits are worth the costs. If : the number of RequestParsers remain small, and within the scope of : being included in the core, that functionality could just be included : in a single non-pluggable RequestParser. : : I'm not convinced it's a bad idea either, but I'd like to hear about : use cases for new RequestParsers (new ways of generically getting an : input stream)? I don't really see it being a very high cost ... and even if we can't imagine any other potential user-written RequestParser, we already know of at least 4 use cases we want to support out of the box for getting streams: 1) raw post body (as a single stream) 2) multi-part post body (file upload, potentially several streams) 3) local file(s) specified by path (1 or more streams) 4) remote resource(s) specified by URL(s) (1 or more streams) ...we could put all that logic in a single class that looks at a SolrParam to pick what method to use, or we could extract each one into its own class using a common interface ... either way we can hardcode the list of viable options if we want to avoid the issue of letting the client configure them .. but i still think it's worth the effort to talk about what that common interface might be. I think my idea of having both a preProcess and a process method in RequestParser so it can do things before and after the Servlet has extracted SolrParams from the URL would work in all of the cases we've thought of. -Hoss
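As a sketch of the non-pluggable alternative (one class that inspects a param to pick the stream source), three of the four cases might collapse to something like the following; the param names echo the "post.body"/"post.remoteURL" examples from earlier in the thread, but everything else is an assumption, and the multi-part case is elided:

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical single-class stream selection: look at params to decide
// where the content stream comes from, defaulting to the raw POST body.
class StreamSource {
    static InputStream open(Map<String, String> params, InputStream rawPost)
            throws Exception {
        if (params.containsKey("post.body")) {
            // inline body passed as a query param
            return new ByteArrayInputStream(
                params.get("post.body").getBytes(StandardCharsets.UTF_8));
        }
        if (params.containsKey("post.file")) {
            // local file specified by path
            return new FileInputStream(params.get("post.file"));
        }
        if (params.containsKey("post.remoteURL")) {
            // remote resource specified by URL
            return new URL(params.get("post.remoteURL")).openStream();
        }
        return rawPost; // case 1: raw POST body as a single stream
    }
}
```

Extracting each branch into its own class behind a common interface is the pluggable variant; the trade-off is exactly the one Hoss describes.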
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I was... then you talked me out of it! You are correct, the client : should determine the RequestParser independent of the RequestHandler. Ah ... this is the one problem with high volume on an involved thread ... i'm sending replies to messages you write after you've already read other replies to other messages you sent and changed your mind :) : Are you suggesting there would be multiple servlets each with : different methods to get the SolrParams from the url? How does the : servlet know if it can touch req.getParameter()? I'm suggesting that there *could* be multiple Servlets with multiple URL structures ... my worry is not that we need multiple options now, it's that i don't want to come up with an API for writing plugins that then has to be thrown out down the road if we want/need to change the URL : How would the default servlet fill up SolrParams? prior to calling RequestParser.preProcess, it would only access very limited parts of the HttpServletRequest -- the bare minimum it needs to pick a RequestParser ... probably just the path, maybe the HTTP Headers -- but if we had a URL structure where we really wanted to specify the RequestParser in a URL param it could do it using getQueryString *after* calling RequestParser.preProcess, the Servlet can access any part of the HttpServletRequest (because if the RequestParser wanted to use the raw POST InputStream it would have, and if it doesn't then it's fair game to let HttpServletRequest pull data out of it when the Servlet calls HttpServletRequest.getParameterMap() -- or any of the other HttpServletRequest methods) to build up the SolrParams however it wants based on the URL structure it wants to use ... then RequestParser.process can use those SolrParams to get any other streams it may want and add them to the SolrRequest. Here's a more fleshed out version of the pseudo-java i posted earlier, with all of my addenda inlined and a few simple method calls changed to try and make the purpose more clear... 
// Simple interface for having a lazy reference to something interface Pointer { T get(); } interface RequestParser { public void init(NamedList nl); // the usual /** will be passed the raw input stream from the * HttpServletRequest, ... as well as whatever other HttpServletRequest * header info we decide is important for the RequestParser to know * about the stream, and is safe for Servlets to access and make * available to the RequestParser (ie: HTTP method, content-type, * content-length, etc...) * * I'm using a NamedList instance instead of passing the * HttpServletRequest to maintain a good abstraction -- only the Servlet * knows about HTTP, so if we ever want to write an RMI interface to Solr, * the same RequestParser plugins will still work ... in practice it * might be better to explicitly spell out every piece of info about * the stream we want to pass * * This is the method where a RequestParser which is going to use the * raw POST body to build up either a single stream, or several streams * from a multi-part request, has the info it needs to do so. */ public Iterable preProcess(NamedList streamInfo, Pointer s); /** guaranteed that the second arg will be the result from * a previous call to preProcess, and that the Iterable from * preProcess will not have been inspected or touched in any way, nor * will any references to it be maintained after this call. * * this is the method where a RequestParser which is going to use * request params to open streams from local files, or remote URLs, * can do so -- a particularly ambitious RequestParser could use * both the raw POST data *and* remote files specified in params * because it has the choice of what to do with the * Iterable it returned from the earlier preProcess call. 
*/ public Iterable process(SolrRequest request, Iterable i); } class SolrUberServlet extends HttpServlet { // servlet specific method which does minimal inspection of // req to determine the parser name based on the URL private String pickRequestParser(HttpServletRequest req) { ... } // extracts just the most crucial info about the HTTP stream from the // HttpServletRequest, so it can be passed to RequestParser.preProcess // must be careful not to use anything that might access the stream. private NamedList getStreamInfo(HttpServletRequest req) { ... } // builds the SolrParams for the request using servlet specific URL rules, // this method is free to use anything in the HttpServletRequest // because it won't be called until after preProcess private SolrParams makeSolrRequestParams(HttpServletRequest req) { ... } public void service(HttpServletRequest req, HttpServletResponse response) { SolrCore core = getCore(); Solr(Query)Response solrRsp = new Solr(Query)Response(); String p = pickRequestParser(req);
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: On 1/18/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > Yes, this proposal would fix the URL structure to be > > /path/defined/in/solrconfig:parser?params > > /${handler}:${parser} > > > > I *think* this cleanly handles most cases cleanly and simply. The > > only exception is where you want to extract variables from the URL > > path. > > But that's not a hypothetical case, extracting variables from the URL > path is something I need now (to add metadata about the data in the > raw post body, like the CSV separator). > > POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz > with a body of "10,20,30" > Sorry, by "in the URL" I mean "in the URL path." The RequestParser can extract whatever it likes from getQueryString() The url you list above could absolutely be handled with the proposed format. Cool. I think i need more examples... concrete is good :-) I don't quite grok your format below... is it one line or two? /path/defined/in/solrconfig:parser?params /${handler}:${parser} Is that simply /${handler}:${parser}?params Or is it all one line where you actually have params twice? -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/18/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > Yes, this proposal would fix the URL structure to be > /path/defined/in/solrconfig:parser?params > /${handler}:${parser} > > I *think* this cleanly handles most cases cleanly and simply. The > only exception is where you want to extract variables from the URL > path. But that's not a hypothetical case, extracting variables from the URL path is something I need now (to add metadata about the data in the raw post body, like the CSV separator). POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz with a body of "10,20,30" Sorry, by "in the URL" I mean "in the URL path." The RequestParser can extract whatever it likes from getQueryString() The url you list above could absolutely be handled with the proposed format. The thing that could not be handled is: http://localhost:8983/solr/csv/foo/bar/baz/ with body "10,20,30" > There are plenty of ways to rewrite RESTful urls into a > path+params structure. If someone absolutely needs RESTful urls, it > can easily be implemented with a new Filter/Servlet that picks the > 'handler' and directly creates a SolrRequest from the URL path. While being able to customize something is good, having really good defaults is better IMO :-) We should also be focused on exactly what we want our standard update URLs to look like in parallel with the design of how to support them. again, i totally agree. My point is that I don't think we need to make the dispatch filter handle *all* possible ways someone may want to structure their request. It should offer the best defaults possible. If that is not sufficient, someone can extend it.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: Yes, this proposal would fix the URL structure to be /path/defined/in/solrconfig:parser?params /${handler}:${parser} I *think* this cleanly handles most cases cleanly and simply. The only exception is where you want to extract variables from the URL path. But that's not a hypothetical case, extracting variables from the URL path is something I need now (to add metadata about the data in the raw post body, like the CSV separator). POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz with a body of "10,20,30" There are plenty of ways to rewrite RESTful urls into a path+params structure. If someone absolutely needs RESTful urls, it can easily be implemented with a new Filter/Servlet that picks the 'handler' and directly creates a SolrRequest from the URL path. While being able to customize something is good, having really good defaults is better IMO :-) We should also be focused on exactly what we want our standard update URLs to look like in parallel with the design of how to support them. As a side note, with a change of URLs, we get a "free" chance to change whatever we want about the parameters or response format... backward compatibility only applies to the original URLs IMO. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm confused by your sentence "A RequestParser converts a HttpServletRequest to a SolrRequest." .. i thought you were advocating that the servlet parse the URL to pick a RequestHandler, and then the RequestHandler dictates the RequestParser? I was... then you talked me out of it! You are correct, the client should determine the RequestParser independent of the RequestHandler. : /path/registered/in/solr/config:requestparser?params : : If no ':' is in the URL, use 'standard' parser : : 1. The URL path determines the RequestHandler : 2. The URL path determines the RequestParser : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response do you mean the path before the colon determines the RequestHandler and the path after the colon determines the RequestParser? yes, that is my proposal. fine too ... i was specifically trying to avoid making any design decisions that required a particular URL structure, in what you propose we are dictating more than just the "/handler/path:parser" piece of the URL, we are also dictating that the Parser decides how the rest of the path and all URL query string data will be interpreted ... Yes, this proposal would fix the URL structure to be /path/defined/in/solrconfig:parser?params /${handler}:${parser} I *think* this cleanly handles most cases cleanly and simply. The only exception is where you want to extract variables from the URL path. There are plenty of ways to rewrite RESTful urls into a path+params structure. If someone absolutely needs RESTful urls, it can easily be implemented with a new Filter/Servlet that picks the 'handler' and directly creates a SolrRequest from the URL path. In my opinion, for this level of customization it is reasonable that people edit web.xml and put in their own servlets and filters. 
what i'm proposing is that the Servlet decide how to get the SolrParams out of an HttpServletRequest, using whatever URL that servlet wants; I guess I'm not understanding this yet: Are you suggesting there would be multiple servlets each with different methods to get the SolrParams from the url? How does the servlet know if it can touch req.getParameter()? How would the default servlet fill up SolrParams? I think i'm getting confused ... i thought you were advocating that RequestParsers be implemented as ServletFilters (or Servlets) ... Originally I was... but again, you talked me out of it. (this time not totally) I think the /path:parser format is clear and allows for most everything off the shelf. If you want to do something different, that can easily be a custom filter (or servlet) Essentially, i think it is reasonable for people to skip 'RequestParsers' in a custom servlet and be able to build the SolrRequest directly. This level of customization is reasonable to handle directly with web.xml
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
OK, trying to catch up on this huge thread... I think I see why it's become more complicated than I originally envisioned. What I originally thought: 1) add a way to get a Reader or InputStream from SolrQueryRequest, and then reuse it for updates too 2) use the plugin name in the URL 3) write code that could handle multi-part post, or could grab args from the URL. 4) profit! I think the main additional complexity is the idea that the RequestParser (#3) be both pluggable and able to be specified in the actual request. I hadn't considered that, and it's an interesting idea. Without a pluggable RequestParser: - something like the CSV loader would have to check the params for a "file" param and if so, open the local file itself With a pluggable RequestParser: - the LocalFileRequestParser would be specified in the url (like /update/csv:local) and it will handle looking for the "file" param and opening the file. The CSV plugin can be a little simpler by just getting a Reader. - a new way of getting a stream could be developed (a new RequestParser) and most stream oriented plugins could just use it. However, I'm not yet convinced the benefits are worth the costs. If the number of RequestParsers remains small, and within the scope of being included in the core, that functionality could just be included in a single non-pluggable RequestParser. I'm not convinced it's a bad idea either, but I'd like to hear about use cases for new RequestParsers (new ways of generically getting an input stream)? -Yonik
RE: Update Plugins (was Re: Handling disparate data sources in Solr)
: > With all this talk about plugins, registries etc., /me can't help : > thinking that this would be a good time to introduce the Spring IoC : > container to manage this stuff. I don't have a lot of familiarity with spring except for the XML configuration file used for telling the spring context what objects you want it to create on startup, what constructor args to pass them, what methods to call, and so on -- with an easy ability to tell it to pass one object you had it construct as a param to another object you are having it construct. on the whole, it seems really nice, and eventually using it to replace a lot of the home-grown configuration in Solr would probably make a lot of sense ... but i don't think migrating to Spring is necessary as part of the current push to support more configurable plugins for updates ... Solr already has a pretty decent set of utilities for allowing class instances to be specified in the xml config file and have configuration arguments passed to them on initialization .. it's not as fancy as spring and it doesn't support as many features as spring, but it works well enough that it should be easy to use with the new plugins we start to add -- switching to spring right now would probably only complicate the issues, and probably wouldn't make adding Update plugins any easier. equally important: adding a few new types of plugins now probably won't make it any harder to switch to something like spring later ... which as i said, is something i definitely anticipate happening -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I think the confusion is that (in my view) the RequestParser is the : *only* object able to touch the stream. I don't think anything should : happen between preProcess() and process(); A RequestParser converts a : HttpServletRequest to a SolrRequest. Nothing else will touch the : servlet request. that makes it the RequestParser's responsibility to dictate the URL format (if it's the only one that can touch the HttpServletRequest) i was proposing a method by which the Servlet could determine the URL format -- there could in fact be multiple servlets supporting different URL formats if we had some need for it -- and the RequestParser could generate streams based on the raw POST data and/or any streams it wants to find based on the SolrParams generated from the URL (ie: local files, remote resources, etc) I'm confused by your sentence "A RequestParser converts a HttpServletRequest to a SolrRequest." .. i thought you were advocating that the servlet parse the URL to pick a RequestHandler, and then the RequestHandler dictates the RequestParser? : /path/registered/in/solr/config:requestparser?params : : If no ':' is in the URL, use 'standard' parser : : 1. The URL path determines the RequestHandler : 2. The URL path determines the RequestParser : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response do you mean the path before the colon determines the RequestHandler and the path after the colon determines the RequestParser? ... that would work fine too ... 
i was specifically trying to avoid making any design decisions that required a particular URL structure, in what you propose we are dictating more than just the "/handler/path:parser" piece of the URL, we are also dictating that the Parser decides how the rest of the path and all URL query string data will be interpreted -- which means if we have a PostBodyRequestParser and a LocalFileRequestParser and a RemoteUrlRequestParser which all use the query string params to get the SolrParams for the request (and in the case of the last two: to know what file/url to parse) and then we decide that we want to support a URL structure that is more REST like and uses the path for including information, now we have to write a new version of all of those RequestParsers (a subclass of each, probably) that knows what our new URL structure looks like ... even if that never comes up, every RequestParser (even custom ones written by users to use some crazy proprietary binary protocols we've never heard of to fetch streams of data) has to worry about extracting the SolrParams out of the URL. what i'm proposing is that the Servlet decide how to get the SolrParams out of an HttpServletRequest, using whatever URL that servlet wants; the RequestParser decides how to get the ContentStreams needed for that request -- in a way that can work regardless of whether the stream is actually part of the HttpServletRequest, or just referenced by a param in the request; the RequestHandler decides what to do with those params and streams; and the ResponseWriter decides how to format the results produced by the RequestHandler back to the client. : > : If anyone needs to customize this chain of events, they could easily : > : write their own Servlet/Filter : I don't *think* this would happen often, and people would only do : it if they are unhappy with the default URL structure -> behavior : mapping. I am not suggesting this would be the normal way to : configure solr. 
I think i'm getting confused ... i thought you were advocating that RequestParsers be implemented as ServletFilters (or Servlets) ... but if that were the case it wouldn't just be about changing the URL structure, it would be about picking new ways to get streams .. but that doesn't seem to be what you are suggesting, so i'm not sure what i was misunderstanding. -Hoss
RE: Update Plugins (was Re: Handling disparate data sources in Solr)
Sorry for the "flame", but I've used spring on 2 large projects and it worked out great.. you should check out some of the GUIs to help manage the XML configuration files, if the configuration is the reason your team thought it was a nightmare (we broke ours up to help).. Jeryl Cook -Original Message- From: Alan Burlison [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 16, 2007 10:52 AM To: solr-dev@lucene.apache.org Subject: Re: Update Plugins (was Re: Handling disparate data sources in Solr) Bertrand Delacretaz wrote: > With all this talk about plugins, registries etc., /me can't help > thinking that this would be a good time to introduce the Spring IoC > container to manage this stuff. > > More info at http://www.springframework.org/docs/reference/beans.html > for people who are not familiar with it. It's very easy to use for > simple cases like the ones we're talking about. Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : I'm not sure i understand preProcess( ) and what it gets us. it gets us the ability for a RequestParser to be able to pull out the raw InputStream from the HTTP POST body, and make it available to the RequestHandler as a ContentStream and/or it can wait until the servlet has parsed the URL to get the params and *then* it can generate ContentStreams based on those param values. - preProcess is necessary to write a RequestParser that can handle the current POST raw XML model, - process is necessary to write RequestParsers that can get file names or URLs out of escaped query params and fetch them as streams I think the confusion is that (in my view) the RequestParser is the *only* object able to touch the stream. I don't think anything should happen between preProcess() and process(); A RequestParser converts a HttpServletRequest to a SolrRequest. Nothing else will touch the servlet request. : 1. The URL path selects the RequestHandler : 2. RequestParser = RequestHandler.getRequestParser() (typically from : its default params) : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response the problem i see with that, is that the RequestHandler shouldn't have any say in what RequestParser is used -- ... got it. Then i vote we use a syntax like: /path/registered/in/solr/config:requestparser?params If no ':' is in the URL, use 'standard' parser 1. The URL path determines the RequestHandler 2. The URL path determines the RequestParser 3. SolrRequest = RequestParser.parse( HttpServletRequest ) 4. handler.handleRequest( req, res ); 5. 
write the response : If anyone needs to customize this chain of events, they could easily : write their own Servlet/Filter this is why i was confused about your Filter comment earlier: if the only way a user can customize behavior is by writing a Servlet, they can't specify that servlet in a solr config file -- they'd have to unpack the war and manually edit the web.xml ... which makes upgrading a pain. I don't *think* this would happen often, and people would only do it if they are unhappy with the default URL structure -> behavior mapping. I am not suggesting this would be the normal way to configure solr. The main case where I imagine someone would need to write their own servlet/filter is if they insist the parameters need to be in the URL. For example: /delete/id/ The URL structure I am proposing could not support this (unless you had a handler mapped to each id :) ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I'm not sure i understand preProcess( ) and what it gets us. it gets us the ability for a RequestParser to be able to pull out the raw InputStream from the HTTP POST body, and make it available to the RequestHandler as a ContentStream and/or it can wait until the servlet has parsed the URL to get the params and *then* it can generate ContentStreams based on those param values. - preProcess is necessary to write a RequestParser that can handle the current POST raw XML model, - process is necessary to write RequestParsers that can get file names or URLs out of escaped query params and fetch them as streams : 1. The URL path selects the RequestHandler : 2. RequestParser = RequestHandler.getRequestParser() (typically from : its default params) : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response the problem i see with that, is that the RequestHandler shouldn't have any say in what RequestParser is used -- the client is the only one that knows what type of data they are sending to Solr, they should put information in the URL that directly picks the RequestParser. If you think about it in terms of the current POSTing XML model, an XmlUpdateRequestHandler that reads in our "..." style info shouldn't know anywhere in its configuration where that stream of XML bytes came from -- when it gets asked to handle the request, all it should know is that it has some optional params, and an InputStream to work with ... the RequestParser's job is to decide where that input stream came from. : If anyone needs to customize this chain of events, they could easily : write their own Servlet/Filter this is why i was confused about your Filter comment earlier: if the only way a user can customize behavior is by writing a Servlet, they can't specify that servlet in a solr config file -- they'd have to unpack the war and manually edit the web.xml ... which makes upgrading a pain. -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm not sure i understand preProcess( ) and what it gets us.

I like the model that
1. The URL path selects the RequestHandler
2. RequestParser = RequestHandler.getRequestParser() (typically from its default params)
3. SolrRequest = RequestParser.parse( HttpServletRequest )
4. handler.handleRequest( req, res );
5. write the response

If anyone needs to customize this chain of events, they could easily write their own Servlet/Filter

On 1/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

Actually, i have to amend that ... it occurred to me in my sleep last night that calling HttpServletRequest.getInputStream() wasn't safe unless we *know* the RequestParser wants it, and will close it if it's non-null, so the API for preProcess would need to look more like this...

interface Pointer<T> { T get(); }

interface RequestParser {
  ...
  /** this will be passed a "Pointer" to the raw input stream from the
   * HttpServletRequest, ... if this method accesses the InputStream
   * from the pointer, it is required to close it if it is non-null.
   */
  public Iterable preProcess(SolrParam headers, Pointer<InputStream> s);
  ...
}

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Actually, i have to amend that ... it occurred to me in my sleep last night that calling HttpServletRequest.getInputStream() wasn't safe unless we *know* the RequestParser wants it, and will close it if it's non-null, so the API for preProcess would need to look more like this...

interface Pointer<T> { T get(); }

interface RequestParser {
  ...
  /** this will be passed a "Pointer" to the raw input stream from the
   * HttpServletRequest, ... if this method accesses the InputStream
   * from the pointer, it is required to close it if it is non-null.
   */
  public Iterable preProcess(SolrParam headers, Pointer<InputStream> s);
  ...
}

-Hoss
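The laziness Hoss is after can be shown with a small self-contained sketch (all names hypothetical, `java.util.function`-style lambdas standing in for the servlet stream): the servlet hands the parser a `Pointer` instead of calling `getInputStream()` eagerly, so the stream is opened only if the parser actually dereferences the pointer, and the parser that opens it is the one obligated to close it.

```java
import java.io.*;

// Sketch of the Pointer idea from the amended API above. The servlet
// stream is simulated with a ByteArrayInputStream; "opened" records
// whether getInputStream() was ever actually triggered.
public class PointerDemo {
    interface Pointer<T> { T get(); }

    static boolean opened = false;

    // Lazy handle: nothing happens until (and unless) get() is called.
    static Pointer<InputStream> lazyBody(byte[] body) {
        return () -> {
            opened = true;  // simulate opening the servlet's input stream
            return new ByteArrayInputStream(body);
        };
    }

    // A params-only parser never dereferences the pointer, so the
    // servlet stream is never opened and nothing needs closing.
    static void paramOnlyParser(Pointer<InputStream> p) { /* never calls p.get() */ }

    // A raw-POST parser reads the body, and per the contract must close
    // whatever it obtained (try-with-resources handles that here).
    static String rawPostParser(Pointer<InputStream> p) {
        try (InputStream in = p.get()) {
            return new String(in.readAllBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```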
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
At 11:48 PM -0800 1/16/07, Chris Hostetter wrote:
>yeah ... once we have a RequestHandler doing that work, and populating a
>SolrQueryResponse with its result info, it would probably be pretty
>trivial to make an extremely bare-bones LegacyUpdateOutputWriter that
>only expected that simple amount of response data and wrote it out in
>the current update response format .. so the current SolrUpdateServlet
>could be completely replaced with a simple url mapping...
>
>    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

Yah!  But in my vision it would be /update -> qt=update because pathInfo is "update".  There's no need to remap anything in the URL, the existing SolrServlet is ready for dispatch once it:
- Prepares request params into SolrParams
- Sets params("qt") to pathInfo
- Somehow (perhaps with StreamIterator) prepares streams for RequestParser use

I'm still trying to conceptually maintain a separation of concerns between handling the details of HTTP (servlet-layer) and handling different payload encodings (a different layer, one I believe can be invoked after config is read).  The following is "vision" more than "proposal" or "suggestion"...

legacyxml xml

So when incoming URL comes in:

    /update?rp=json

the pipeline which is established is:

    SolrServlet -> solr.JSONStreamRequestParser
      |
      |- request data carrier e.g. SolrQueryRequest
      |
    lets.write.this.UpdateRequestHandler
      |
      |- response data carrier e.g. SolrQueryResponse
      |
    do.we.really.need.LegacyUpdateOutputWriter

I expect this is all fairly straightforward, except for one sticky question: Is there a "universal" format which can efficiently (e.g. lazily, for stream input) convey all kinds of different request body encodings, such that the RequestHandler has no idea how it was dispatched?  Something to think about...

- J.J.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Ryan McKinley wrote:

In addition, consider the case where you want to index a SVN repository.  Yes, this could be done in a SolrRequestParser that logs in and returns the files as a stream iterator.  But this seems like more 'work' than the RequestParser is supposed to do.  Not to mention you would need to augment the Document with svn specific attributes.

This is indeed one of the things I'd like to do - use Solr as a back-end for OpenGrok (http://www.opensolaris.org/os/project/opengrok/)

--
Alan Burlison
--
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote: i'm totally on board now ... the RequestParser decides where the streams come from if any (post body, file upload, local file, remote url, etc...); the RequestHandler decides what it wants to do with those streams, and has a library of DocumentProcessors it can pick from to help it parse them if it wants to, then it takes whatever actions it wants, and puts the response information in the existing Solr(Query)Response class, which the core hands off to any of the various OutputWriters to format according to the users wishes. +1 -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
talking about the URL structure made me realize that the Servlet should dictate the URL structure and the param parsing, but it should do it after giving the RequestParser a crack at any streams it wants (actually i think that may be a direct quote from JJ ... can't remember now) ... *BUT* the RequestParser may not want to provide a list of streams until the params have been parsed (if for example, one of the params is the name of a file)

so what if the interface for RequestParser looked like this...

interface RequestParser {

  public init(NamedList nl); // the usual

  /** will be passed the raw input stream from the
   * HttpServletRequest, ... may need other HttpServletRequest info as
   * SolrParam (ie: method, content-type/content-length, ...) but we use
   * a SolrParam instance instead of the HttpServletRequest to
   * maintain an abstraction.
   */
  public Iterable preProcess(SolrParam headers, InputStream s);

  /** guaranteed that the second arg will be the result from
   * a previous call to preProcess, and that that Iterable from
   * preProcess will not have been inspected or touched in any way, nor
   * will any references to it be maintained after this call.
   * this method is responsible for calling
   * request.setContentStreams(Iterable i);
   */
  public void process(SolrRequest request, Iterable streams);
}

...the idea being that many RequestParsers will choose to implement one or both of those methods as a NOOP that just returns null but if they want to implement both, they have the choice of obliterating the Iterable returned by preProcess and completely replacing it once they see the SolrParams in the request

: specifically what i had in mind was something like this...
:
: class SolrUberServlet extends HttpServlet {
:   public service(HttpServletRequest req, HttpServletResponse response) {
:     SolrCore core = getCore();
:     Solr(Query)Response solrRsp = new Solr(Query)Response();
:
:     // servlet specific method which does minimal inspection of
:     // req to determine the parser name
:     String p = pickRequestParser(req);
:
:     // looks up a registered instance (from solrconfig.xml)
:     // matching that name
:     RequestParser solrParser = coreGetParserByName(p);

      // let the parser preprocess the streams if it wants...
      Iterable s = solrParser.preProcess(req.getInputStream())

      // build the request using servlet specific URL rules
      Solr(Query)Request solrReq = makeSolrRequest(req);

      // let the parser decide what to do with the existing streams,
      // or provide new ones
      solrParser.process(solrReq, s);

:     // does exactly what it does now: picks the RequestHandler to
:     // use based on the params, calls its handleRequest method
:     core.execute(solrReq, solrRsp)
:
:     // the rest of this is cut/paste from the current SolrServlet.
:     // use SolrParams to pick OutputWriter name, ask core for instance,
:     // have that writer write the results.
:     QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
:     response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
:     PrintWriter out = response.getWriter();
:     responseWriter.write(out, solrReq, solrRsp);
:   }
: }
:
: -Hoss

-Hoss
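The amended servlet flow (pick parser by name, preProcess the raw body, parse params, let the parser finish, then execute) can be exercised end-to-end with the servlet pieces stubbed out. Everything here is illustrative: streams are plain strings, params a plain map, and the handler step is a stand-in for `core.execute`.

```java
import java.util.*;

// End-to-end sketch of the dispatch flow in the pseudo-servlet above,
// with servlet types stubbed as strings and maps. Names are illustrative.
public class DispatchDemo {
    interface RequestParser {
        List<String> preProcess(String rawBody);  // before params are parsed
        List<String> process(Map<String, String> params, List<String> streams);
    }

    // A raw-POST parser: grabs the body early, leaves the streams
    // untouched once the params are known.
    static final RequestParser RAW = new RequestParser() {
        public List<String> preProcess(String rawBody) { return List.of(rawBody); }
        public List<String> process(Map<String, String> p, List<String> s) { return s; }
    };

    // Simulates the servlet: look up the parser by name (as from
    // solrconfig.xml), preProcess, parse params, let the parser finish,
    // then hand the streams off to the handler (core.execute stand-in).
    static String service(Map<String, RequestParser> registry, String parserName,
                          String rawBody, Map<String, String> params) {
        RequestParser parser = registry.get(parserName);
        List<String> streams = parser.preProcess(rawBody);
        streams = parser.process(params, streams);
        return "handled " + streams.size() + " stream(s)";
    }
}
```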
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
data and wrote it out in the current update response format .. so the current SolrUpdateServlet could be completely replaced with a simple url mapping...

    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

Using the filter method above, it could (and i think should) be mapped to:

    /update
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: >I left out "micro-plugins" because i don't quite have a good answer
: >yet :)  This may be a place where a custom dispatcher servlet/filter
: >defined in web.xml is the most appropriate solution.
:
: If the issue is munging HTTPServletRequest information, then a proper
: separation of concerns suggests responsibility should lie with a Servlet
: Filter, as Ryan suggests.

I'm not making sense of this ... i don't see how the micro-plugins (aka: RequestParsers) could be implemented as Filters and still be plugins that users could provide ... don't Filters have to be specified in the web.xml

Yes.  I'm suggesting we map a filter to intercept ALL requests, then see which ones it should handle.  Consider:

public void doFilter(ServletRequest request, ServletResponse response,
    FilterChain chain) throws IOException, ServletException {
  if(request instanceof HttpServletRequest) {
    HttpServletRequest req = (HttpServletRequest) request;
    String path = req.getServletPath();
    SolrRequestHandler handler = core.getRequestHandler( path );
    if( handler != null ) {
      // HANDLE THE REQUEST
      return;
    }
  }
  // Otherwise let the webapp handle the request
  chain.doFilter(request, response);
}

... is there some programmatic way a Servlet or Filter can register other Servlets/Filters dynamically when the application is initialized? ... if users have to extract the solr.war and modify the web.xml to add a RequestParser they've written, that doesn't seem like much of a plugin :)

You would not need to extract the war, just change the registered handler name.

ryan
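The core of the "map a filter to everything" trick is just a registry lookup: take the request, otherwise fall through to `chain.doFilter`. A minimal sketch with the servlet API replaced by plain strings (handler names and paths here are made up for illustration):

```java
import java.util.*;

// Sketch of the catch-all filter dispatch logic above, stripped of the
// servlet API. A null return means "not ours -- chain.doFilter(...)".
public class FilterDemo {
    // Stand-in for core.getRequestHandler's registry (names are made up).
    static final Map<String, String> handlers =
        Map.of("/update", "XmlUpdateHandler", "/select", "StandardHandler");

    // Returns the handler name if Solr should take the request, or null
    // to mean "let the webapp handle it".
    static String dispatch(String servletPath) {
        return handlers.get(servletPath);
    }
}
```

Because the registry lives in solrconfig.xml rather than web.xml, adding a new endpoint is a config change, not a war rebuild, which is the point Ryan makes about not needing to extract the war.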
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
kind of like a binary stream equivalent to the way analyzers can be customized -- is that kind of what you had in mind?

exactly.

interface SolrDocumentParser {
  public init(NamedList args);
  Document parse(SolrParams p, ContentStream content);
}

yes
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > - Revise the XML-based update code (broken out of SolrCore into a
: > RequestHandler) to use all the above.
:
: +++1, that's been needed forever.

yeah ... once we have a RequestHandler doing that work, and populating a SolrQueryResponse with its result info, it would probably be pretty trivial to make an extremely bare-bones LegacyUpdateOutputWriter that only expected that simple amount of response data and wrote it out in the current update response format .. so the current SolrUpdateServlet could be completely replaced with a simple url mapping...

    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Jan 17, 2007, at 1:41 AM, Chris Hostetter wrote:

: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs independently from the URL structure ... if we have a set of APIs, it's easy to come up with a URL structure that will map well (we could theoretically have several URL structures using different servlets) but if we worry too much about what the URL should look like, we may hamstring the model design.

+1

web.xml allows for servlets to be mapped however desired, and cleverly using servlet filters could add in some other URL mapping goodness, or in the extreme must-have-certain-URLs case there is always mod_rewrite.

I still think a microcontainer is a good way to go for solr.  It's exactly what microcontainers were designed for.  While not spring-savvy myself (but tinkered with HiveMind via Tapestry a while back), I know enough to reiterate that it's not heavy or horrible for basic IoC, which is what is being reinvented in a sense.

Erik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: >I left out "micro-plugins" because i don't quite have a good answer
: >yet :)  This may be a place where a custom dispatcher servlet/filter
: >defined in web.xml is the most appropriate solution.
:
: If the issue is munging HTTPServletRequest information, then a proper
: separation of concerns suggests responsibility should lie with a Servlet
: Filter, as Ryan suggests.

I'm not making sense of this ... i don't see how the micro-plugins (aka: RequestParsers) could be implemented as Filters and still be plugins that users could provide ... don't Filters have to be specified in the web.xml ... is there some programmatic way a Servlet or Filter can register other Servlets/Filters dynamically when the application is initialized? ... if users have to extract the solr.war and modify the web.xml to add a RequestParser they've written, that doesn't seem like much of a plugin :)

In general i'm not too worried about what the URL structure looks like ... i agree it makes the most sense for the RequestParser to be determined using the path, but beyond that i don't think it matters much -- the existing servlet could stay around as is with a hardcoded use of a "DefaultRequestParser" that doesn't provide any streams and gets the params from HttpServletRequest, while a new Servlet could get the qt and wt from the path info as well.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > : In addition to RequestProcessors, maybe there should be a general
: > : DocumentProcessor
: > : interface SolrDocumentParser
: > : {
: > :   Document parse(ContentStream content);
: > : }

: > what else would the RequestProcessor do if it was delegating all of the
: > parsing to something else?

: Parsing is just one task that a RequestProcessor may do. It is the
: entry point for all kinds of stuff: searching, admin tasks, augmenting
: search results with SQL queries, writing uploaded files to the file
: system.  This is where people will do whatever suits their fancy.

ah ... i see what you mean.  so DocumentProcessors would be reusable classes that RequestHandlers/RequestProcessors could use to parse streams -- but instead of needing to hardcode class dependencies in the RequestHandler on specific DocumentProcessors, the RequestHandler could do a "lookup" on the mime/type of the stream (or any other key it wanted to i suppose) to parse the stream ... so you could have a SimpleHtmlDocumentProcessor that you use, and then one day you replace it with a ComplexHtmlDocumentProcessor which you probably have to configure a bit differently but you don't have to recompile your RequestHandler ... kind of like a binary stream equivalent to the way analyzers can be customized -- is that kind of what you had in mind?

(i was confused and thinking that picking a DocumentProcessor would be done by the core independent of picking the RequestHandler --- just like the OutputWriter is)

: In addition, consider the case where you want to index a SVN
: repository.  Yes, this could be done in a SolrRequestParser that logs in
: and returns the files as a stream iterator.  But this seems like more
: 'work' than the RequestParser is supposed to do.  Not to mention you
: would need to augment the Document with svn specific attributes.
:
: Parsing a PDF file from svn should (be able to) use the same parser if
: it were uploaded via HTTP POST.

i'm totally on board now ...
the RequestParser decides where the streams come from if any (post body, file upload, local file, remote url, etc...); the RequestHandler decides what it wants to do with those streams, and has a library of DocumentProcessors it can pick from to help it parse them if it wants to, then it takes whatever actions it wants, and puts the response information in the existing Solr(Query)Response class, which the core hands off to any of the various OutputWriters to format according to the users wishes.

The DocumentProcessors are the ones that are really going to need a lot of configuration telling them how to map the chunks of data from the stream to fields in the schema -- but in the same way that OutputWriters get the request after the RequestHandler has had a chance to wrap the SolrParams, it probably makes sense to let the request handler override configuration for the DocumentProcessors as well (so i can say "normally i want the HtmlDocumentProcessor to map these HTML elements to these schema fields ... but i have one type of HTML doc that breaks the rules, so i'll use a separate RequestHandler to index them, and it will override some of those field mappings")

interface SolrDocumentParser {
  public init(NamedList args);
  Document parse(SolrParams p, ContentStream content);
}

-Hoss
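The registry-plus-overrides idea above can be sketched concretely: a content-type keyed registry of parsers, default field mappings from config, and a handler-supplied override layer merged on top. All names and the string-based "parsing" are hypothetical stand-ins for illustration.

```java
import java.util.*;

// Sketch of a content-type -> DocumentParser registry where a
// RequestHandler can override the configured field mappings. All names
// and the toy parse output are illustrative.
public class ParserRegistryDemo {
    interface DocumentParser {
        String parse(Map<String, String> fieldMap, String content);
    }

    // Toy "HTML" parser: just reports which schema field the title maps to.
    static final DocumentParser HTML = (map, content) ->
        "html doc, title->" + map.getOrDefault("title", "title");

    // Stand-in for the solrconfig registry ("text/html" -> HtmlDocumentParser).
    static final Map<String, DocumentParser> registry = Map.of("text/html", HTML);

    // Defaults come from config; a handler may layer its own overrides
    // on top (the "one type of HTML doc that breaks the rules" case).
    static String parse(String contentType, Map<String, String> defaults,
                        Map<String, String> overrides, String content) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(overrides);
        return registry.get(contentType).parse(merged, content);
    }
}
```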
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > So to understand better:
: >
: > user request -> micro-plugin -> RequestHandler -> ResponseHandler

: or:
:
: HttpServletRequest -> SolrRequestParser -> SolrRequestProcessor ->
: SolrResponse -> SolrResponseWriter

specifically what i had in mind was something like this...

class SolrUberServlet extends HttpServlet {
  public service(HttpServletRequest req, HttpServletResponse response) {
    SolrCore core = getCore();
    Solr(Query)Response solrRsp = new Solr(Query)Response();

    // servlet specific method which does minimal inspection of
    // req to determine the parser name
    String p = pickRequestParser(req);

    // looks up a registered instance (from solrconfig.xml)
    // matching that name
    RequestParser solrParser = coreGetParserByName(p);

    // RequestParser is the only plugin class that knows about
    // HttpServletRequest, it builds up the SolrRequest (aka
    // SolrQueryRequest) which contains the SolrParams and streams
    SolrRequest solrReq = solrParser.parse(req);

    // does exactly what it does now: picks the RequestHandler to
    // use based on the params, calls its handleRequest method
    core.execute(solrReq, solrRsp)

    // the rest of this is cut/paste from the current SolrServlet.
    // use SolrParams to pick OutputWriter name, ask core for instance,
    // have that writer write the results.
    QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
    response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
    PrintWriter out = response.getWriter();
    responseWriter.write(out, solrReq, solrRsp);
  }
}

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs independently from the URL structure ... if we have a set of APIs, it's easy to come up with a URL structure that will map well (we could theoretically have several URL structures using different servlets) but if we worry too much about what the URL should look like, we may hamstring the model design.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, J.J. Larrea <[EMAIL PROTECTED]> wrote: - Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above. +++1, that's been needed forever. If one has the time, I'd also advocate moving to StAX (via woodstox for Java5, but it's built into Java6). -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, J.J. Larrea <[EMAIL PROTECTED]> wrote:
>POST:
>  if( multipart ) {
>    read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be done automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

Unfortunately, curl puts application/x-www-form-urlencoded in there by default.  Our current implementation of updates always ignores that and treats the stream as binary.  An alternative for non-multipart posts could check the URL for args, and if they are there, treat the body as the input instead of params.

$ curl http://localhost:5000/a/b?foo=bar --data-binary "hi there"

$ nc -l -p 5000
POST /a/b?foo=bar HTTP/1.1
User-Agent: curl/7.15.4 (i686-pc-cygwin) libcurl/7.15.4 OpenSSL/0.9.8d zlib/1.2.3
Host: localhost:5000
Accept: */*
Content-Length: 8
Content-Type: application/x-www-form-urlencoded

hi there

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/15/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring out the "model" or how updates should be handled in a generic way, what all of the "Plugin" types are, and what their APIs should be is the most important issue -- once we have those issues settled we could always write a new "SolrServlet2" that made the URL structure work any way we want.

The number of people writing update plugins will be small compared to the number of users using the external HTTP API (the URL + query parameters, and the relationship URL-wise between different update formats).  My main concern is making *that* as nice and utilitarian as possible, and any plugin stuff is implementation and a secondary concern IMO.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm in frantic deadline mode so I'm just going to throw in some (hopefully) short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:
>>the one thing that still seems missing is those "micro-plugins" i was
>> [SNIP]
>>
>>  interface SolrRequestParser {
>>    SolrRequest process( HttpServletRequest req );
>>  }
>
>I left out "micro-plugins" because i don't quite have a good answer
>yet :)  This may be a place where a custom dispatcher servlet/filter
>defined in web.xml is the most appropriate solution.

If the issue is munging HTTPServletRequest information, then a proper separation of concerns suggests responsibility should lie with a Servlet Filter, as Ryan suggests.  For example, while the Servlet 2.4 spec doesn't have specifications for how the servlet container can/should "burst" a multipart-MIME payload into separate files or streams, there are a number of 3rd party Filters which do this.

The Iterator is a great idea because if each stream is read to completion before the next is opened it doesn't impose any limitation on individual stream length and doesn't require disk buffering.  (Of course some handlers may require access to more than one stream at a time; each time next() is called on the iterator before the current stream is closed, the remainder of that stream will have to be buffered in memory or on disk, depending on the part length.  Nonetheless that detail can be entirely hidden from the handler, as it should be.  I am not sure if any available ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets one immediately implement it with only one next() value: the entire POST stream.  That would answer the needs of the existing update request handling code, but establish an API to handle multi-part.  Whenever someone wants to write a multi-stream handler, they can write or find a better Iterator implementation, which would best be cast as a ServletFilter.
>I like the SolrRequestParser suggestion.

Me too.  It answers a hole in my vision for how this can all fit together.

>Consider:
>qt='RequestHandler'
>wt='ResponseWriter'
>rp='RequestParser' (rb='SolrBuilder'?)
>
>To avoid possible POST read-ahead stream mangling: qt, wt, and rp
>should be defined by the URL, not parameters.  (We can add special
>logic to allow /query?qt=xxx)
>
>For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
>define arbitrary path mapping for qt.
>
>We could append 'wt', 'rb', and arbitrary text to the
>registered path, something like
>  /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...
>
>(any other syntax ideas?)

No need for new syntax, I think.  The pathInfo or qt or other source resolves to a requestHandler CONFIG name.  The handler config is read to determine the handler class name.  It also can be consulted (with URL or form-POST params overriding if allowed by the config) to decide which RequestParser to invoke BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER.  Once those objects are set up, the request body gets executed.

Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly simplify, for example, creating a dozen query handlers which used a particular invariant combination of qt, wt, and rp

>The 'standard' RequestParser would:
>GET:
>  fill up SolrParams directly with req.getParameterMap()
>  if there is a 'post' parameter (post=XXX)
>    return a stream with XXX as its content
>  else
>    empty iterator.
>  Perhaps add a standard way to reference a remote URI stream.
>
>POST:
>  if( multipart ) {
>    read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be done automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

>    return an iterator over the collection of files

Collection of streams, per Hoss.
>  }
>  else {
>    no parameters?  parse parameters from the URL? /name:value/
>    return the body stream

As above, this introduces unneeded complexity and should be avoided.

>  }
>DEL:
>  throw unsupported exception?
>
>
>Maybe each RequestHandler could have a default RequestParser.  If we
>limited the 'arbitrary path' to one level, this could be used to
>generate more RESTful URLs.  Consider:
>
>/myadder////
>
>/myadder maps to MyCustomHandler and that gives you
>MyCustomRequestBuilder that maps /// to SolrParams

I think these are best left for an extra-SOLR layer, especially since SOLR URLs are meant for interprogram communication and not direct use by non-developer end users.  For example, for my org's website I have hundreds of Apache mod_rewrite rules which do URL munging such as /journals/abc/7/3/192a.pdf into /journalroot/index.cfm?journal=abc&volume=7&issue=3&page=192&seq=a&format=pdf

Or someone
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Bertrand Delacretaz wrote: With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Yonik Seeley wrote: Brainstorming: - for errors, use HTTP error codes instead of putting it in the XML as now. That doesn't work so well if there are multiple documents to be indexed in a single request. -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Jan 16, 2007, at 3:20 AM, Bertrand Delacretaz wrote: On 1/16/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: ...I think a DocumentParser registry is a good way to isolate this top level task... With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. +1 that, or HiveMind. It seems a lot of the wheel is being reinvented here, when solid plugin solutions already exist. Erik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
So to understand better: user request -> micro-plugin -> RequestHandler -> ResponseHandler Right? or: HttpServletRequest -> SolrRequestParser -> SolrRequestProcessor -> SolrResponse -> SolrResponseWriter
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Mon, 2007-01-15 at 12:23 -0800, Chris Hostetter wrote:
> : > Right, you're getting at issues of why I haven't committed my CSV handler yet.
> : > It currently handles reading a local file (this is more like an SQL
> : > update handler... only a reference to the data is passed).  But I also
> : > wanted to be able to handle a POST of the data, or even a file
> : > upload from a browser.  Then I realized that this should be generic...
> : > the same should also apply to XML updates, and potential future update
> : > formats like JSON.
> :
> : I do not see the problem here.  One just needs to add a couple of lines in
> : the upload servlet and change the csv plugin to input stream (not local
> : file).
>
> what Yonik and i are worried about is that we don't want the list of all
> possible ways for an Update Plugin to get a Stream to be hardcoded in the
> UpdateServlet or Solr Core or in the Plugins themselves ... we'd like the
> notion of indexing docs expressed as CSV records or XML records or JSON
> records to be independent of where the CSV, XML, or JSON data stream came
> from ... in the same way that the current RequestHandlers can execute
> specific search logic, without needing to worry about what format the
> results are going to be returned in.
>
> It's not writing code to get the stream from one of N known ways
> that's hard -- it's designing an API so we can get the stream from one of
> any number of *unknown* ways that can be specified at run time that's
> tricky :)

Ok, I am still trying to understand your concept of micro-plugin, but I understand the above and your comments later in this thread that you are looking for a generic stream resolver/producer (or solrSource).

On Mon, 2007-01-15 at 12:42 -0800, Chris Hostetter wrote:
> i disagree ...
it should be possible to create "micro-plugins" (I
> think i called them "UpdateSource" instances in my original suggestion)
> that know about getting streams in various ways, but don't care what
> format of data is found on those streams -- that would be left for the
> (Update)RequestHandler (which wouldn't need to know where the data
> came from)
>
> a JDBC/SQL updater would probably be a very special case -- where the
> format and the stream are inherently related -- in which case a No-Op
> UpdateSource could be used that didn't provide any stream, and the
> JdbcUpdateRequestHandler would manage its JDBC streams directly.

So to understand better:

user request -> micro-plugin -> RequestHandler -> ResponseHandler

Right?

salu2
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: ...I think a DocumentParser registry is a good way to isolate this top level task... With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. -Bertrand
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: In addition to RequestProcessors, maybe there should be a general
: DocumentProcessor
:
: interface SolrDocumentParser
: {
:   Document parse(ContentStream content);
: }
:
: solrconfig could register "text/html" -> HtmlDocumentParser, and
: RequestProcessors could share the same parser.

what else would the RequestProcessor do if it was delegating all of the parsing to something else?

Parsing is just one task that a RequestProcessor may do.  It is the entry point for all kinds of stuff: searching, admin tasks, augmenting search results with SQL queries, writing uploaded files to the file system.  This is where people will do whatever suits their fancy.  RequestHandler is probably a better name than RequestProcessor, but I think we should choose a name that can live peacefully with existing RequestHandler code.  I imagine there will be a standard 'Processor' that gets a list of streams and processes them into Documents.  Since the way these documents are parsed depends totally on the schema, we will need some way to make this user configurable.

In addition, consider the case where you want to index a SVN repository.  Yes, this could be done in a SolrRequestParser that logs in and returns the files as a stream iterator.  But this seems like more 'work' than the RequestParser is supposed to do.  Not to mention you would need to augment the Document with svn specific attributes.

Parsing a PDF file from svn should (be able to) use the same parser as if it were uploaded via HTTP POST.

I think a DocumentParser registry is a good way to isolate this top level task.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > (the trick being that the servlet would need to parse the "st" info out
: > of the URL (either from the path or from the QueryString) directly without
: > using any of the HttpServletRequest.getParameter*() methods...
:
: I haven't followed all of the discussion, but wouldn't it be easier to
: use the request path, instead of parameters, to select these
: RequestParsers?

absolutely (hence my comment "either from the path or from the QueryString") ... my point is just that if we go this route, any servlets Solr has (there's no reason we can't have several -- changing the URL structure can be orthogonal to adding update plugins) have to be careful about dealing with the request to determine the plugin to use.

-Hoss
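The path-based selection being discussed could look something like the sketch below: pull the parser name straight out of the request path (e.g. /solr/update/pdf-parser) so that none of the getParameter*() methods ever touch the POST body. The method name and mapping are illustrative, not real Solr APIs:

```java
public class PathDispatchSketch {

    // Given the path after the servlet prefix, return the parser name,
    // e.g. "/update/pdf-parser" -> "pdf-parser".
    static String parserName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(parserName("/update/pdf-parser"));
    }
}
```

A dispatcher would call something like this on HttpServletRequest.getPathInfo(), which is safe to read before the body, and only then hand the raw input stream to the selected RequestParser.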
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: the one thing that still seems missing is those "micro-plugins" i was
[SNIP]
: interface SolrRequestParser {
:   SolrRequest process( HttpServletRequest req );
: }

I left out "micro-plugins" because i don't quite have a good answer yet :) This may be a place where a custom dispatcher servlet/filter defined in web.xml is the most appropriate solution. I like the SolrRequestParser suggestion. Consider:

  qt='RequestHandler'
  wt='ResponseWriter'
  rp='RequestParser' (rb='SolrBuilder'?)

To avoid possible POST read-ahead stream mangling, qt, wt, and rp should be defined by the URL, not parameters. (We can add special logic to allow /query?qt=xxx)

For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people define arbitrary path mapping for qt. We could append 'wt', 'rb', and arbitrary text to the registered path, something like:

  /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...

(any other syntax ideas?)

The 'standard' RequestParser would:

GET: fill up SolrParams directly with req.getParameterMap(); if there is a 'post' parameter (post=XXX), return a stream with XXX as its content, else an empty iterator. Perhaps add a standard way to reference a remote URI stream.

POST:
  if( multipart ) {
    read all form fields into the parameter map.
    return an iterator over the collection of files
  } else {
    no parameters? parse parameters from the URL? /name:value/
    return the body stream
  }

DELETE: throw an unsupported exception?

Maybe each RequestHandler could have a default RequestParser. If we limited the 'arbitrary path' to one level, this could be used to generate more RESTful URLs. Consider:

  /myadder////

/myadder maps to MyCustomHandler and that gives you MyCustomRequestBuilder that maps /// to SolrParams

: Thoughts?
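The proposed /registered/path/wt:json/rb:standard syntax above could be parsed along these lines: split the extra path on '/' and treat each "name:value" segment as a parameter. The syntax was only floated in this thread, so this is an illustration of the idea, not of anything Solr actually shipped:

```java
import java.util.HashMap;
import java.util.Map;

public class PathParamSketch {

    static Map<String, String> parseExtraPath(String extraPath) {
        Map<String, String> params = new HashMap<>();
        for (String segment : extraPath.split("/")) {
            int colon = segment.indexOf(':');
            // segments without a colon ("more", "stuff") are left
            // for the handler to interpret
            if (colon > 0) {
                params.put(segment.substring(0, colon),
                           segment.substring(colon + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parseExtraPath("wt:json/rb:standard/more/stuff");
        System.out.println(p.get("wt") + " " + p.get("rb")); // json standard
    }
}
```

Since this only inspects the URL, it sidesteps the POST read-ahead problem entirely: the body stream is untouched until the RequestParser asks for it.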
: one last thought: while the interfaces you outlined would make a lot of
: sense if we were starting from scratch, there are probably several cases
: where not having those exact names/APIs doesn't really hurt, and would
: allow backwards compatibility with more of the current code (and current
: SolrRequestHandler plugins people have written) ... just something we
: should keep in mind: we don't want to go hog wild renaming a lot of stuff
: and alienating our existing "plugin" user base. (nor do we want to make a
: bunch of unnecessary config file format changes)

I totally understand and agree. Perhaps the best approach is to offer a SolrRequestProcessor framework that can sit next to the existing SolrRequestHandler without affecting it much (if at all). For what i have suggested, i *think* it could all be done with simple additions to the solrconfig.xml syntax that would still work on an unedited 1.1.0 solrconfig.xml.

If we use a servlet filter for the dispatcher, this can sit next to the current /query?xxx servlet without problem. When the SolrRequestProcessor framework is rock solid, we would @Deprecate SolrRequestHandler and change the default solrconfig.xml to map /query to the new framework.

The stuff I *DO* think should get refactored/deprecated ASAP is to extract the constants from the functionality in SolrParams. While we are at it, it may be good to restructure the code to something like: http://issues.apache.org/jira/browse/SOLR-20#action_12464648

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

  interface SolrRequestParser {
    SolrRequest process( HttpServletRequest req );
  }

  (the trick being that the servlet would need to parse the "st" info out
  of the URL (either from the path or from the QueryString) directly without
  using any of the HttpServletRequest.getParameter*() methods...

I haven't followed all of the discussion, but wouldn't it be easier to use the request path, instead of parameters, to select these RequestParsers? i.e. solr/update/pdf-parser, solr/update/hssf-parser, solr/update/my-custom-parser, etc.

-Bertrand
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: Iterator getContentStreams();
:
: Consider the case where you iterate through a local file system.

right, a fixed-size in-memory array can be iterated, but an unbounded stream of objects from an external source can't always be read into an array effectively -- so when in doubt go with the Iterator (or my favorite: Iterable)

: In addition to RequestProcessors, maybe there should be a general
: DocumentProcessor
:
: interface SolrDocumentParser
: {
:   Document parse(ContentStream content);
: }
:
: solrconfig could register "text/html" -> HtmlDocumentParser, and
: RequestProcessors could share the same parser.

what else would the RequestProcessor do if it was delegating all of the parsing to something else?

-Hoss
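The local-file-system case mentioned above is a good example of why Iterable wins: a directory tree can be walked one entry at a time instead of being read into a fixed array first. A purely illustrative sketch, not part of any Solr API:

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class FileWalker implements Iterable<File> {

    private final File root;

    public FileWalker(File root) {
        this.root = root;
    }

    public Iterator<File> iterator() {
        final Deque<File> pending = new ArrayDeque<>();
        pending.push(root);
        return new Iterator<File>() {
            public boolean hasNext() {
                return !pending.isEmpty();
            }

            public File next() {
                if (pending.isEmpty()) throw new NoSuchElementException();
                File current = pending.pop();
                // directories queue their children lazily, as they are visited
                File[] children = current.listFiles();
                if (children != null) {
                    for (File child : children) pending.push(child);
                }
                return current;
            }
        };
    }

    public static void main(String[] args) {
        for (File f : new FileWalker(new File("."))) {
            System.out.println(f);
            break; // iteration starts without buffering the whole tree
        }
    }
}
```

An SVN crawl or any other unbounded external source would look the same from the consumer's side: just a for-each loop over the request's content streams.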
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I hate to inundate you with more code, but it seems like the best way
: to describe a possible interface.
...

the one thing that still seems missing is those "micro-plugins" i was talking about that can act independent of the SolrRequestProcessor used to decide where the data streams come from. if you consider the current query request handling model, there's "qt" that picks a SolrRequestHandler (what you've called SolrRequestProcessor), and "wt" which independently determines the QueryResponseWriter (aka: SolrResponseWriter) ... i think we need an "st" (stream type) that the servlet uses to pick a "SolrRequestParser" to decide how to generate the SolrRequest and its underlying ContentStreams

  interface SolrRequestParser {
    SolrRequest process( HttpServletRequest req );
  }

(the trick being that the servlet would need to parse the "st" info out of the URL (either from the path or from the QueryString) directly without using any of the HttpServletRequest.getParameter*() methods which might "read ahead" into the ServletInputStream)

: interface SolrRequest
: {
:   SolrParams getParams();
:   ContentStream[] getContentStreams();  // Iterator?
:   long getStartTime();
: }

I'm not understanding why that wouldn't make sense as an Iterable ... then it could be an array if the SolrRequestParser wanted, or it could be something more lazy-loaded.

: Thoughts?

one last thought: while the interfaces you outlined would make a lot of sense if we were starting from scratch, there are probably several cases where not having those exact names/APIs doesn't really hurt, and would allow backwards compatibility with more of the current code (and current SolrRequestHandler plugins people have written) ... just something we should keep in mind: we don't want to go hog wild renaming a lot of stuff and alienating our existing "plugin" user base. (nor do we want to make a bunch of unnecessary config file format changes)

-Hoss
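The Iterable suggestion above could be sketched as follows: if getContentStreams() returns an Iterable, a RequestParser can hand back a plain array-backed list or something lazier, and handler code iterates either way. The names mirror the proposed (not final) interfaces from this thread, with SolrRequest and ContentStream reduced to stubs to keep the sketch self-contained:

```java
import java.util.Arrays;

public class IterableRequestSketch {

    interface ContentStream {
        String getName();
    }

    // reduced from the thread's proposal (getParams(), getStartTime()
    // omitted) to focus on the Iterable question
    interface SolrRequest {
        Iterable<ContentStream> getContentStreams();
    }

    // the simple case: a fixed array wrapped as an Iterable
    static SolrRequest fromArray(final ContentStream... streams) {
        return () -> Arrays.asList(streams);
    }

    // handler code doesn't care whether the Iterable is array-backed
    // or lazily produced from a file system / repository crawl
    static int countStreams(SolrRequest req) {
        int n = 0;
        for (ContentStream s : req.getContentStreams()) n++;
        return n;
    }

    public static void main(String[] args) {
        SolrRequest req = fromArray(() -> "a.pdf", () -> "b.pdf");
        System.out.println(countStreams(req));
    }
}
```

The array version of the interface would force every RequestParser to materialize its streams up front; the Iterable version makes the array just one possible implementation.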