Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > > 3) there's a comment in RequestHandlerBase.init about "indexOf" that
: > > comes from the existing impl in DismaxRequestHandler -- but doesn't match
: > > the new code ... i also wasn't certain that the change you made matches
: > I just copied the code from DismaxRequestHandler and made sure it
: > passes the tests. I don't totally understand what that case is doing.
:
: The first iteration of dismax (before we did generic defaults,
: invariants, etc for request handlers) took defaults directly from the
: init params, and that is what that case is checking for

and bingo .. the reason it jumped out at me in your patch is that the comment still referred to indexOf, but the code didn't ... it might be functionally equivalent, i just wasn't sure when i did my quick read.

there's mention in the comment that indexOf is used so that you can indicate that you don't want all the init params as defaults, but you don't actually want defaults either -- but there doesn't seem to be a test for that case.

you can see support for the legacy defaults syntax in src/test/test-files/solr/conf/solrconfig.xml if you grep for dismaxOldStyleDefaults

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > throw new SolrException( 400, "missing parameter: "+p );
: >
: > This will return 400 with a message "missing parameter: " + p.
: >
: > Exceptions or SolrExceptions with code=500 || code<100 are sent to
: > client with status code 500 and a full stack trace.
:
: That all seems ideal to me, but there had been talk in the past about
: formatted responses on errors. Given that even update handlers can
: return full responses, I don't see the point of formatted (XML,etc)
: response bodies when an exception is thrown.

I can't find the thread at the moment, but as I recall, there was once some consensus that while errors should definitely be returned with appropriate HTTP status codes, and the exception message should be included in the status line, the QueryResponseWriter should be given an opportunity to format the Exception -- the rationale being that all clients should check the HTTP status code, and if it's not 2xx, then they should use the status message for simple error reporting, but if they want more details they can check the Content-Type of the response and if it matches what they were expecting, they can get the detailed error info from it.

So if you are writing a python client and expecting python back, the stack trace will be formatted in python so you can easily parse it ... if you are expecting XML back, the stack trace will be formatted in XML, etc...

i think the only time the dispatcher should return an html (or plain text) error page is if it encounters an exception before it can extract the writer to use from the request params, or if the exception is in the ResponseWriter itself.

This would be one reason to leave getException() in the SolrQueryResponse interface ... it lets us keep the API the same for ResponseWriters (no need to add a new writeErrorPage(Exception) method) ...

another advantage to keeping that encapsulation is it gives the ResponseWriters the ability to generate pages which contain the partial results from the RequestHandler (prior to encountering an exception) as well as the Exception itself.

-Hoss
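The flow Hoss describes can be sketched in plain Java. This is a hedged illustration, not the actual Solr dispatcher: the class names (ErrorDispatchSketch), the stand-in QueryResponse/ResponseWriter types, and the xmlish writer are all invented; the real API uses SolrQueryResponse and QueryResponseWriter with different signatures. The point is only the shape of the proposal -- status code and message go on the HTTP status line, the writer still formats the exception (and any partial results), and plain-text output is a last resort when the writer itself fails.

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorDispatchSketch {

    // Stand-ins for SolrQueryResponse / QueryResponseWriter (names invented).
    static class QueryResponse {
        Exception exception;                       // kept via getException(), per Hoss
        List<String> partialResults = new ArrayList<>();
    }

    interface ResponseWriter {
        String write(QueryResponse rsp);           // may render results AND the exception
    }

    // A writer that emits partial results plus the error in its own format.
    static ResponseWriter xmlish = rsp -> {
        StringBuilder sb = new StringBuilder("<response>");
        for (String r : rsp.partialResults) sb.append("<doc>").append(r).append("</doc>");
        if (rsp.exception != null)
            sb.append("<error>").append(rsp.exception.getMessage()).append("</error>");
        return sb.append("</response>").toString();
    };

    // Dispatcher: HTTP status carries the error code; the body stays writer-formatted.
    static String dispatch(QueryResponse rsp, ResponseWriter writer, int[] statusOut) {
        statusOut[0] = (rsp.exception == null) ? 200 : 400;
        try {
            return writer.write(rsp);
        } catch (Exception inWriter) {
            // Only fall back to plain text if the ResponseWriter itself blows up.
            statusOut[0] = 500;
            return "error: " + inWriter.getMessage();
        }
    }

    public static void main(String[] args) {
        QueryResponse rsp = new QueryResponse();
        rsp.partialResults.add("doc1");
        rsp.exception = new Exception("missing parameter: q");
        int[] status = new int[1];
        System.out.println(dispatch(rsp, xmlish, status) + "  status=" + status[0]);
    }
}
```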
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Jan 21, 2007, at 2:39 PM, Yonik Seeley wrote: On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > So is everyone happy with the way that errors are currently reported? > If not, now (or right after this is committed), is the time to change > that. /solr/select/qt="myhandler" should be backward compatible, but > /solr/myhandler doesn't need to be. Same for the update stuff. > In SOLR-104, all exceptions are passed to the client as HTTP Status codes with the message. If you write: throw new SolrException( 400, "missing parameter: "+p ); This will return 400 with a message "missing parameter: " + p. Exceptions or SolrExceptions with code=500 || code<100 are sent to client with status code 500 and a full stack trace. That all seems ideal to me, but there had been talk in the past about formatted responses on errors. Given that even update handlers can return full responses, I don't see the point of formatted (XML,etc) response bodies when an exception is thrown. Just making sure there's a consensus. Being able to check the HTTP status code to determine if there is an error, rather than having to parse XML and get a Solr-specific status code seems best for the Ruby work we're doing. I'll confer with the others working on it and report back if they have any suggestions for improvement also. Erik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > I don't think i'll have time to look at your new patch today, design wise
> > i think you are right, but there was still stuff that needed to be
> > refactored out of core.update and into the UpdateHandler wasn't there?
>
> Yes, I avoided doing that in an effort to minimize refactoring and focus
> just on adding ContentStreams to RequestHandlers.

Sounds like a good idea. It's easier to review and process in smaller steps if practical.

> I just posted (yet another) update to SOLR-104. This one moves the
> core.update logic into UpdateRequestHandler, and adds some glue to make
> old requests behave as they used to.

Cool!

> I also deprecated the exception in SolrQueryResponse. Handlers should
> throw the exception, not put it in the response. (If you want error
> messages, put that in the response, not the exception)

Agreed. I can't for the life of me remember *why* I did that. I think it was because I thought ResponseHandlers might format the exception.

> > 3) there's a comment in RequestHandlerBase.init about "indexOf" that
> > comes from the existing impl in DismaxRequestHandler -- but doesn't match
> > the new code ... i also wasn't certain that the change you made matches
> > the old semantics for dismax (i don't think we have a unit test for that
> > case)
>
> When you get a chance to look at the patch, can you investigate this?
> I just copied the code from DismaxRequestHandler and made sure it
> passes the tests. I don't totally understand what that case is doing.

The first iteration of dismax (before we did generic defaults, invariants, etc for request handlers) took defaults directly from the init params, and that is what that case is checking for and replicating: if there isn't a "defaults" in the list, it assumes the entire list is defaults. It's only needed for dismax since other handlers didn't support "defaults" until later.

-Yonik
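The legacy-defaults check Yonik explains can be sketched as a small pure function. This is a hedged illustration only: the real code lives in RequestHandlerBase.init and operates on Solr's NamedList, not java.util.Map, and the class/method names here (LegacyDefaultsSketch, resolveDefaults) are invented.

```java
import java.util.HashMap;
import java.util.Map;

public class LegacyDefaultsSketch {
    // If the init params contain an explicit "defaults" block, use it (the new
    // style); otherwise treat the entire init param list as the defaults, which
    // is the old dismax style being preserved for back compatibility.
    @SuppressWarnings("unchecked")
    static Map<String, String> resolveDefaults(Map<String, Object> initArgs) {
        Object d = initArgs.get("defaults");
        if (d instanceof Map) {
            return (Map<String, String>) d;          // new style: explicit block
        }
        Map<String, String> all = new HashMap<>();   // old style: everything is a default
        for (Map.Entry<String, Object> e : initArgs.entrySet()) {
            all.put(e.getKey(), String.valueOf(e.getValue()));
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, Object> oldStyle = new HashMap<>();
        oldStyle.put("qf", "text^2");
        System.out.println(resolveDefaults(oldStyle));  // whole list used as defaults
    }
}
```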
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> I don't think i'll have time to look at your new patch today, design wise
> i think you are right, but there was still stuff that needed to be
> refactored out of core.update and into the UpdateHandler wasn't there?

Yes, I avoided doing that in an effort to minimize refactoring and focus just on adding ContentStreams to RequestHandlers.

I just posted (yet another) update to SOLR-104. This one moves the core.update logic into UpdateRequestHandler, and adds some glue to make old requests behave as they used to.

I also deprecated the exception in SolrQueryResponse. Handlers should throw the exception, not put it in the response. (If you want error messages, put that in the response, not the exception)

It still needs some cleanup and some idea what data/messages should be returned in the SolrResponse. The bottom of http://localhost:8983/solr/test.html has a form calling /update2 with posted XML so you can see the output

> a couple of minor comments i had when i read the last patch (but didn't
> mention since i was focusing on design issues) ...
>
> 1) why rename the servlets "Legacy*" instead of just marking them
> deprecated?

In the new version, I got rid of both Servlets and am handling the 'legacy' cases explicitly in the dispatch filter. This minimizes the duplicated code and keeps things consistent.

> 2) getSourceId and getSource need to be left in the concrete Handlers so
> they get filled in with the correct file version info on checkout.

done.

> 3) there's a comment in RequestHandlerBase.init about "indexOf" that
> comes from the existing impl in DismaxRequestHandler -- but doesn't match
> the new code ... i also wasn't certain that the change you made matches
> the old semantics for dismax (i don't think we have a unit test for that
> case)

When you get a chance to look at the patch, can you investigate this? I just copied the code from DismaxRequestHandler and made sure it passes the tests. I don't totally understand what that case is doing.

> 4) ContentStream.getFieldName() would probably be more general as
> ContentStream.getSourceInfo() ...

done.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > So is everyone happy with the way that errors are currently reported? > If not, now (or right after this is committed), is the time to change > that. /solr/select/qt="myhandler" should be backward compatible, but > /solr/myhandler doesn't need to be. Same for the update stuff. > In SOLR-104, all exceptions are passed to the client as HTTP Status codes with the message. If you write: throw new SolrException( 400, "missing parameter: "+p ); This will return 400 with a message "missing parameter: " + p. Exceptions or SolrExceptions with code=500 || code<100 are sent to client with status code 500 and a full stack trace. That all seems ideal to me, but there had been talk in the past about formatted responses on errors. Given that even update handlers can return full responses, I don't see the point of formatted (XML,etc) response bodies when an exception is thrown. Just making sure there's a consensus. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
So is everyone happy with the way that errors are currently reported? If not, now (or right after this is committed), is the time to change that. /solr/select/qt="myhandler" should be backward compatible, but /solr/myhandler doesn't need to be. Same for the update stuff. In SOLR-104, all exceptions are passed to the client as HTTP Status codes with the message. If you write: throw new SolrException( 400, "missing parameter: "+p ); This will return 400 with a message "missing parameter: " + p. Exceptions or SolrExceptions with code=500 || code<100 are sent to client with status code 500 and a full stack trace.
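The status-code rule above reduces to a pure function, sketched here for clarity. Hedged: the class and method names (ErrorStatusSketch, httpStatus, includeStackTrace) are invented for illustration and are not the actual SOLR-104 code; only the rule itself -- a valid code is passed through, while code == 500 or code < 100 falls back to a 500 with a full stack trace -- comes from the message above.

```java
public class ErrorStatusSketch {
    // Map a SolrException code to the HTTP status actually sent to the client.
    static int httpStatus(int solrCode) {
        return (solrCode == 500 || solrCode < 100) ? 500 : solrCode;
    }

    // Per the rule above, only 500-class responses carry a full stack trace;
    // other codes put the exception message in the status line.
    static boolean includeStackTrace(int solrCode) {
        return httpStatus(solrCode) == 500;
    }

    public static void main(String[] args) {
        System.out.println(httpStatus(400));  // message in the status line
        System.out.println(httpStatus(50));   // invalid as HTTP -> 500 + stack trace
    }
}
```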
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > The bugaboo is if the POST data is NOT in fact
: > application/x-www-form-urlencoded but the user agent says it is -- as
: > both of you have indicated can be the case when using curl. Could that
: > be why Yonik thought POST params was broken?
:
: Correct. That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the stream "guessing" code in the Dispatcher/RequestBuilder very strict, and make its decision about how to treat the post body entirely based on the Content-Type ... meanwhile the existing (eventually known as "old") way of doing updates via "/update" to the UpdateServlet can be more lax, and assume everything is a raw POST of XML.

we can change post.sh to specify XML as the Content-Type by default, modify the example schema to have other update handlers registered with names like "/update/csv" and eventually add an "/update/xml" encouraging people to use it if they want to send updates as xml documents, regardless of whether they want to POST them raw, upload them, or identify them by filename -- as long as they are explicit about their content type.

I think I agree with all that. A long time ago in this thread, I remember saying that new URLs are an opportunity to change request/response formats w/o worrying about backward compatibility.

So is everyone happy with the way that errors are currently reported? If not, now (or right after this is committed), is the time to change that. /solr/select/qt="myhandler" should be backward compatible, but /solr/myhandler doesn't need to be. Same for the update stuff.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > The bugaboo is if the POST data is NOT in fact
: > application/x-www-form-urlencoded but the user agent says it is -- as
: > both of you have indicated can be the case when using curl. Could that
: > be why Yonik thought POST params was broken?
:
: Correct. That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the stream "guessing" code in the Dispatcher/RequestBuilder very strict, and make its decision about how to treat the post body entirely based on the Content-Type ... meanwhile the existing (eventually known as "old") way of doing updates via "/update" to the UpdateServlet can be more lax, and assume everything is a raw POST of XML.

we can change post.sh to specify XML as the Content-Type by default, modify the example schema to have other update handlers registered with names like "/update/csv" and eventually add an "/update/xml" encouraging people to use it if they want to send updates as xml documents, regardless of whether they want to POST them raw, upload them, or identify them by filename -- as long as they are explicit about their content type.

-Hoss
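The strict-vs-lax split Hoss proposes can be sketched as two pure functions. Hedged illustration only: the enum, class, and method names (StreamGuessSketch, BodyHandling, forNewDispatcher, forLegacyUpdate) are invented, and the real dispatcher logic would live in the Dispatcher/RequestBuilder; the sketch just shows "decide entirely from Content-Type" versus "always treat the body as a raw XML stream".

```java
public class StreamGuessSketch {
    enum BodyHandling { URLENCODED_PARAMS, MULTIPART_STREAMS, RAW_STREAM, REJECT }

    // New dispatcher: strict -- trust the declared Content-Type, never guess.
    static BodyHandling forNewDispatcher(String contentType) {
        if (contentType == null) return BodyHandling.REJECT;
        String ct = contentType.toLowerCase();
        if (ct.startsWith("application/x-www-form-urlencoded")) return BodyHandling.URLENCODED_PARAMS;
        if (ct.startsWith("multipart/form-data")) return BodyHandling.MULTIPART_STREAMS;
        return BodyHandling.RAW_STREAM;   // e.g. text/xml posted by a fixed post.sh
    }

    // Legacy /update servlet: lax -- ignore the header, assume a raw POST of XML.
    static BodyHandling forLegacyUpdate(String contentType) {
        return BodyHandling.RAW_STREAM;
    }

    public static void main(String[] args) {
        System.out.println(forNewDispatcher("text/xml; charset=utf-8"));
        System.out.println(forLegacyUpdate("application/x-www-form-urlencoded"));
    }
}
```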
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: Great! I just posted an update to SOLR-104 that I hope will make you happy. Dude ... i can *not* keep up with you. : If i'm following our discussion correctly, I *think* this takes care : of all the major issues we have. I don't think i'll have time to look at your new patch today, design wise i think you are right, but there was still stuff that needed to be refactored out of core.update and into the UpdateHandler wasn't there? a couple of minor comments i had when i read the last patch (but didn't mention since i was focusing on design issues) ... 1) why rename the servlets "Legacy*" instead of just marking them deprecated? 2) getSourceId and getSoure need to be left in the concrete Handlers so they get illed in with the correct file version info on checkout. 3) there's a comment in RequestHandlerBase.init about "indexOf" that comes form the existing impl in DismaxRequestHandler -- but doesn't match the new code ... i also wasn't certain that the change you made matches the old semantics for dismax (i don't think we have a unit test for that case) 4) ContentStream.getFieldName() would proabably be more general as ContentStream.getSourceInfo() ... it could stay as it is for files/urls, but raw posts and multipart posts could have a usefull debuging description as well. -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> At the bottom of this email is a quick and dirty servlet i just tried
> to prove to myself that posting with params in the URL and the body
> worked fine ...

I tried that by simply posting to the Solr standard request handler (it echoes params in the example config), and yes, it worked fine. The problem is if the body should be the stream, and the content-type is wrong (and we currently send it wrong with curl).

> The nut shell being: i'm totally on board with Ryan's simple URL scheme,
> having a single RequestParser/SolrRequestBuilder, going with an entirely
> "inspection" based approach for deciding where the streams come from, and
> leaving all mention of parsers or "stream.type" out of the URL. (because
> i have a good idea of how to support it in a backwards compatible way
> *later*)

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/21/07, J.J. Larrea <[EMAIL PROTECTED]> wrote: The bugaboo is if the POST data is NOT in fact application/x-www-form-urlencoded but the user agent says it is -- as both of you have indicated can be the case when using curl. Could that be why Yonik thought POST params was broken? Correct. That's the format that post.sh in the example sends (application/x-www-form-urlencoded) and we ignore it in the update handler and always treat the body as binary. Now if you wanted to add some query args to what we already have, you can't use getParameterMap(). -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> The nut shell being: i'm totally on board with Ryan's simple URL scheme,
> having a single RequestParser/SolrRequestBuilder, going with an entirely
> "inspection" based approach for deciding where the streams come from, and
> leaving all mention of parsers or "stream.type" out of the URL. (because
> i have a good idea of how to support it in a backwards compatible way
> *later*)

Great! I just posted an update to SOLR-104 that I hope will make you happy.

It moved the various request parsing methods into distinct classes that could easily be pluggable if that is necessary. As written, it supports stream.type="raw|multipart|simple|standard" -- we can comment that out and use 'standard' for everything as a first pass. I added configuration to solrconfig.xml:

I removed LegacySelectServlet and added an explicit check in the DispatchFilter for paths starting with "/select" This seems like a better idea as the logic and expected results are identical.

If i'm following our discussion correctly, I *think* this takes care of all the major issues we have.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
At 1:20 AM -0800 1/21/07, Chris Hostetter wrote:
>: We need code to do that anyway since getParameterMap() doesn't support
>: getting params from the URL if it's a POST (I believe I tried this in
>: the past and it didn't work).
>
>Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you
>are *definitely* mistaken.
>
>getParameterMap will in fact pull out params from both the URL and the
>body if it's a POST -- but only if you have not already accessed either
>getReader or getInputStream -- this was at the heart of my cumbersome
>preProcess/process API that we all agree now was way too complicated.

The rules are very explicitly laid out in the Servlet 2.4 specification:

- SRV.4.1.1 When Parameters Are Available

The following are the conditions that must be met before post form data will be populated to the parameter set:

1. The request is an HTTP or HTTPS request.
2. The HTTP method is POST.
3. The content type is application/x-www-form-urlencoded.
4. The servlet has made an initial call of any of the getParameter family of methods on the request object.

If the conditions are not met and the post form data is not included in the parameter set, the post data must still be available to the servlet via the request object's input stream. If the conditions are met, post form data will no longer be available for reading directly from the request object's input stream.

As Hoss notes, a POST request can still have GET-style parameters in the URL query string, and getParameterMap will return both sets intermixed for a POST meeting the above conditions. And calling getParameterMap won't impede the ability to subsequently read the input stream if the conditions are not met: "the post data must still be available to the servlet".

So it's theoretically valid to simply call getParameterMap and then blindly call getInputStream (possibly catching an Exception), or else use the results of getParameterMap to decide whether and how to process the input stream.
The bugaboo is if the POST data is NOT in fact application/x-www-form-urlencoded but the user agent says it is -- as both of you have indicated can be the case when using curl. Could that be why Yonik thought POST params was broken? - J.J.
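The first three SRV.4.1.1 conditions quoted above are a pure predicate, sketched here. Hedged: the class and method names (PostParamsSketch, postBodyJoinsParams) are invented, and condition 4 -- that the servlet actually calls one of the getParameter methods first -- is about call ordering and is deliberately not modeled.

```java
public class PostParamsSketch {
    // True when the container will merge POST form data into the parameter set
    // (conditions 1-3 of SRV.4.1.1), consuming the body in the process.
    static boolean postBodyJoinsParams(String scheme, String method, String contentType) {
        boolean httpish = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        boolean isPost = "POST".equalsIgnoreCase(method);
        boolean formEncoded = contentType != null
                && contentType.toLowerCase().startsWith("application/x-www-form-urlencoded");
        return httpish && isPost && formEncoded;
    }

    public static void main(String[] args) {
        // The curl bugaboo: XML body, but the header claims urlencoded -- the
        // container treats the body as parameters and the raw stream is lost.
        System.out.println(postBodyJoinsParams("http", "POST", "application/x-www-form-urlencoded"));
        System.out.println(postBodyJoinsParams("http", "POST", "text/xml"));
    }
}
```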
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > ...i was trying to avoid keeping the parser name out of the query string,
: > so we don't have to do any hack parsing of
: > HttpServletRequest.getQueryString() to get it.
:
: We need code to do that anyway since getParameterMap() doesn't support
: getting params from the URL if it's a POST (I believe I tried this in
: the past and it didn't work).

Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you are *definitely* mistaken.

getParameterMap will in fact pull out params from both the URL and the body if it's a POST -- but only if you have not already accessed either getReader or getInputStream -- this was at the heart of my cumbersome preProcess/process API that we all agree now was way too complicated.

At the bottom of this email is a quick and dirty servlet i just tried to prove to myself that posting with params in the URL and the body worked fine ... i do remember reading up on this a few years back and verifying that it's documented somewhere in the servlet spec, a quick google search points to this article implying it was solidified in 2.2...

http://java.sun.com/developer/technicalArticles/Servlets/servletapi/
(grep for "Nit-picky on Parameters")

: Pluggable request parsers seems needlessly complex, and it gets harder
: to explain it all to someone new.
: Can't we start simple and defer anything like that until there is a real need?

Alas ... i appear to be getting worse at explaining myself in my old age. What i was trying to say is that this idea i had for expressing requestParsers as an optional prefix in front of the requestHandler would allow us to worry about the things i'm worried about *later* -- if/when they become a problem (or when i have time to stop whining, and actually write the code)

The nut shell being: i'm totally on board with Ryan's simple URL scheme, having a single RequestParser/SolrRequestBuilder, going with an entirely "inspection" based approach for deciding where the streams come from, and leaving all mention of parsers or "stream.type" out of the URL. (because i have a good idea of how to support it in a backwards compatible way *later*)

import java.io.IOException;
import java.util.Map;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class TestServlet extends HttpServlet {
  public void doPost(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    response.setContentType("text/plain");
    Map params = request.getParameterMap();
    for (Object k : params.keySet()) {
      Object v = params.get(k);
      if (v instanceof Object[]) {
        for (Object vv : (Object[]) v) {
          response.getWriter().println(k.toString() + ":" + vv);
        }
      } else {
        response.getWriter().println(k.toString() + ":" + v);
      }
    }
  }
}
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: I'm on board as long as the URL structure is:
: ${path/from/solr/config}?stream.type=raw

actually the URL i was suggesting was...

${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val

...i was trying to avoid keeping the parser name out of the query string, so we don't have to do any hack parsing of HttpServletRequest.getQueryString() to get it.

We need code to do that anyway since getParameterMap() doesn't support getting params from the URL if it's a POST (I believe I tried this in the past and it didn't work). Aesthetically, having an optional parser in the query string seems nicer than in the path.

basically if you have this...

Pluggable request parsers seems needlessly complex, and it gets harder to explain it all to someone new. Can't we start simple and defer anything like that until there is a real need?

if they really had a reason to want to force one type of parsing, they could register it with a different prefix.

That is a point. I'm not sure of the use cases though... it's not safe to let untrusted people update solr at all, so I don't understand prohibiting certain types of streams.

* default URLs stay clean
* no need for an extra "stream.type" param
* urls only get ugly if people want them to get ugly because they don't want to make their clients set the mime type correctly.

The first and last points are also true for a stream.type type of thing. After all, we will need other parameters for specifying local files, right? Or is opening local files up to the RequestHandler again?

Anyway, I'm not too unhappy either way, as long as I can leave out any explicit "parser" and just get the right thing to happen.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Sat, 20 Jan 2007, Ryan McKinley wrote:
: Date: Sat, 20 Jan 2007 19:17:16 -0800
: From: Ryan McKinley <[EMAIL PROTECTED]>
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Re: Update Plugins (was Re: Handling disparate data sources in
:     Solr)
:
: > ...what if we bring that idea back, and let people configure it in the
: > solrconfig.xml, using path like names...
: >
: > ...but don't make it a *public* interface ... make it package protected,
: > or maybe even a private static interface of the Dispatch Filter .. either
: > way, don't instantiate instances of it using the plugin-lib ClassLoader,
: > make sure it comes from the WAR so we only use the ones provided out of
: > the box.
:
: I'm on board as long as the URL structure is:
: ${path/from/solr/config}?stream.type=raw

actually the URL i was suggesting was...

${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val

...i was trying to avoid keeping the parser name out of the query string, so we don't have to do any hack parsing of HttpServletRequest.getQueryString() to get it.

basically if you have this...

...then these urls are all valid...

http://localhost:/solr/raw/update?param=val
    ..uses raw post body for update
http://localhost:/solr/multi/update?param=val
    ..uses multipart mime for update
http://localhost:/solr/update?param=val
    ..no requestParser matched the path prefix, so the default is chosen and Content-Type is used to decide where streams come from.

but if instead my config looks like this...

...then these URLs would fail...

http://localhost:/solr/raw/update?param=val
http://localhost:/solr/multi/update?param=val

...because the empty string would match as a parser, but "/raw/update" and "/multi/update" wouldn't match as requestHandlers (the registration of "/raw" as a parser would be useless)

this URL would work however...

http://localhost:/solr/update?param=val
    ..treat all requests as if they have multi-part mime streams

...i use this only as an example of what i'm describing ... not as an example of something we should recommend.

The key to all of this being that we'd check parser names against the URL prefix in order from shortest to longest, then check the rest of the path as a requestHandler ... if either of those fail, then the filter would skip the request.

What we would probably recommend is that people map the "guess" request parser to "/" so that they could put in all of the options they want on buffer sizes and such, then map their requestHandlers without a "/" prefix, and use content types correctly. if they really had a reason to want to force one type of parsing, they could register it with a different prefix.

* default URLs stay clean
* no need for an extra "stream.type" param
* urls only get ugly if people want them to get ugly because they don't want to make their clients set the mime type correctly.

-Hoss
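The shortest-to-longest prefix matching Hoss describes can be sketched as a pure function. Hedged: the class, method, and data structures here (ParserPrefixSketch, split, a List of parser names and a Set of handler names) are invented for illustration; the real dispatch filter would work against the solrconfig.xml registrations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ParserPrefixSketch {
    // Try each registered parser name as a URL prefix, shortest first; the rest
    // of the path must name a requestHandler. If no (parser, handler) pair
    // matches, return null and the filter skips the request.
    static String[] split(String path, List<String> parserNames, Set<String> handlers) {
        List<String> sorted = new ArrayList<>(parserNames);
        sorted.sort((a, b) -> Integer.compare(a.length(), b.length()));
        for (String parser : sorted) {
            if (path.startsWith(parser)) {
                String rest = path.substring(parser.length());
                if (handlers.contains(rest)) {
                    return new String[] { parser, rest };
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> parsers = Arrays.asList("", "/raw", "/multi");
        Set<String> handlers = new HashSet<>(Arrays.asList("/update", "/select"));
        System.out.println(Arrays.toString(split("/raw/update", parsers, handlers)));
        System.out.println(Arrays.toString(split("/update", parsers, handlers)));
    }
}
```

Note the empty-string parser: it matches every path first, so "/update" resolves to the default parser plus the "/update" handler, while "/raw/update" falls through to the explicit "/raw" registration -- matching Hoss's examples above.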
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> ...what if we bring that idea back, and let people configure it in the
> solrconfig.xml, using path like names...
>
> ...but don't make it a *public* interface ... make it package protected,
> or maybe even a private static interface of the Dispatch Filter .. either
> way, don't instantiate instances of it using the plugin-lib ClassLoader,
> make sure it comes from the WAR so we only use the ones provided out of
> the box.

I'm on board as long as the URL structure is:

${path/from/solr/config}?stream.type=raw

and if you are missing the parameter it chooses a good option. (stream.type can change, just that the parser is configured in the query string, not the path)

I like it! Also, this would give us a natural place to configure the max size etc for multi-part upload
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
(the three of us are online way too much ... for crying out loud it's a saturday night folks!)

: In my opinion, I don't think we need to worry about it for the
: *default* handler. That is not a very difficult constraint and, there
: is no one out there expecting to be able to post parameters in the URL
: and the body. I'm not sure it is worth complicating anything if this
: is the only thing we are trying to avoid.

you'd be surprised the number of people i've run into who expect that to work.

: I think the *default* should handle all the cases mentioned without
: the client worrying about different URLs for the various methods.
:
: The next question is which (if any) of the explicit parsers you think
: are worth including in web.xml?

holy crap, i think i have a solution that will make all of us really happy...

remember that idea we all really detested of a public plugin interface, configured in the solrconfig.xml, that looked like this...

public interface RequestParser {
  SolrRequest parse(HttpServletRequest req);
}

...what if we bring that idea back, and let people configure it in the solrconfig.xml, using path like names...

...but don't make it a *public* interface ... make it package protected, or maybe even a private static interface of the Dispatch Filter .. either way, don't instantiate instances of it using the plugin-lib ClassLoader, make sure it comes from the WAR so we only use the ones provided out of the box.

then make the dispatcher check each URL first by seeing if it starts with the name of any registered requestParser ... if it doesn't then use the default "UseContentTypeRequestParser" .. *then* it does what the rest of ryan's current Dispatcher does, taking the rest of the path to pick a request handler.

the beauty of this approach is that if no tags appear in the solrconfig.xml, then the URLs look exactly like you guys want, and the request parsing / stream building semantics are exactly the same as they are today ...

if/when we (or maybe just "i") write those other RequestParsers, people can choose to turn them on (and change their URLs) if they want, but if they don't they can keep having the really simple URLs ... OR they could register something like this...

...and have really simple URLs, but be guaranteed that they always got their streams from raw POST bodies.

This would also solve Ryan's concern about allowing people to turn off fetching streams from remote URLs (or from local files, a small concern i had but hadn't mentioned yet since we had bigger fish to fry)

Thoughts?

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > It would be: > http://${context}/${path}?stream.type=post Yes! Feels like a much more natural place to me than as part of the path of the URL. Just need to hash out meaningful param names/values? Oh, and I'm more interested in the semantics of those param/values, and not what request parser it happens to get mapped to. I'd vote for different request parsers being an implementation detail, and keeping those details (plugability) out of solrconfig.xml for now. We could always add it later, but it's a lot tougher to remove things. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > - plus everyone understands how to put something in a URL. if nothing
> > else, think of putting the "parsetype" in the URL as a checksum that the
> > RequestParser can use to validate its assumptions -- if it's not there,
> > then it can do all of the intelligent things you think it should do, but
> > if it is there that dictates what it should do.
>
> If it's optional in the args, I could be on board with that.
> If its optional in the req.getQueryString() I'm in.
>
> Ignore my previous post about ${context}/multipart/asdgadsga
> It would be:
> http://${context}/${path}?stream.type=post

Yes! Feels like a much more natural place to me than as part of the path of the URL. Just need to hash out meaningful param names/values?

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> -- but everyone understands how to put something in a URL. if nothing
> else, think of putting the "parsetype" in the URL as a checksum that the
> RequestParser can use to validate its assumptions -- if it's not there,
> then it can do all of the intelligent things you think it should do, but
> if it is there that dictates what it should do.

If it's optional in the args, I could be on board with that.
If it's optional in the req.getQueryString() I'm in.

Ignore my previous post about ${context}/multipart/asdgadsga

It would be:
http://${context}/${path}?stream.type=post
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> consider the example you've got on your test.html page: "POST - with query
> string" ... that doesn't obey the typical semantics of a POST with a query
> string ... if you used the methods on HttpServletRequest to get the params
> it would give you all the params it found both in the query strings *and*
> in the post body.
>
> Blech. I was wondering about that. Sounds like bad form, but perhaps
> could be supported via something like /solr/foo?postbody=args

In my opinion, I don't think we need to worry about it for the *default* handler. That is not a very difficult constraint and there is no one out there expecting to be able to post parameters in the URL and the body. I'm not sure it is worth complicating anything if this is the only thing we are trying to avoid.

I think the *default* should handle all the cases mentioned without the client worrying about different URLs for the various methods.

The next question is which (if any) of the explicit parsers you think are worth including in web.xml?

http://${host}/${context}/${path/from/config}            (default)
http://${host}/${context}/params/${path/from/config}     (uses getParameterMap() to fill args)
http://${host}/${context}/multipart/${path/from/config}  (force multipart request)
http://${host}/${context}/stream/${path/from/config}     (params from URL, body as stream)
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> but the HTTP Client libraries in various languages don't always make it
> easy to set Content-Type -- and even if they do that doesn't mean the
> person using that library knows how to use it properly

I think we have to go with common usages. We neither rely on, nor discard, content-type in all cases:
- When it has a charset, believe it.
- When it says form-encoded, only believe it if there aren't args on the URL (because many clients like curl default to "application/x-www-form-urlencoded" for a post).

> -- but everyone understands how to put something in a URL. if nothing
> else, think of putting the "parsetype" in the URL as a checksum that the
> RequestParser can use to validate its assumptions -- if it's not there,
> then it can do all of the intelligent things you think it should do, but
> if it is there that dictates what it should do.

If it's optional in the args, I could be on board with that.

> (aren't you the one that convinced me a few years back that it was better
> to trust a URL than to trust HTTP Headers? ... because people understand
> URLs and put things in them, but they don't always know what headers to
> send .. curl being the great example, it always sends a Content-Type even
> if the user doesn't ask it to right?)

Well, for the update server, we do ignore the form-data stuff, but we don't ignore the charset.

> : Multi-part posts will have the content-type set correctly, or it won't work.
> : The big use-case I see is browser file upload, and they will set it correctly.
>
> right, but my point is what if i want the multi-part POST body left alone
> so my RequestHandler can deal with it as a single stream -- if i set every
> header correctly, the "smart" parsing code will parse it -- which is why
> something in the URL telling it *not* to parse it is important.

That sounds like a pretty rare corner case.

> : We should not preclude wacky handlers from doing things for
> : themselves, calling our stuff as utility methods.
>
> how? ... if there is one and only one RequestParser which makes the
> SolrRequest before the RequestHandler ever sees it, and parses the post
> body because the content-type is multipart/mixed how can a wacky handler
> ever get access to the raw post body?

I wasn't thinking *that* whacky :-) There are always other options, such as using your own servlet though. I don't think we should try to solve every case (the whole 80/20 thing).

-Yonik
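Yonik's content-type rules above ("believe the charset; believe form-encoding only when the URL itself carries no args, because curl sends that content type by default") can be written down as a tiny predicate. This is a hypothetical sketch of the heuristic as described in the thread, not actual Solr code:

```java
public class ContentTypeHeuristic {

    // Should the POST body be parsed as form-encoded parameters?
    // Believe "application/x-www-form-urlencoded" only when there are
    // no args on the URL, since many clients send it unconditionally.
    public static boolean treatBodyAsParams(String contentType, String queryString) {
        if (contentType == null) {
            return false;
        }
        boolean formEncoded = contentType.startsWith("application/x-www-form-urlencoded");
        boolean hasUrlArgs  = queryString != null && !queryString.isEmpty();
        return formEncoded && !hasUrlArgs;
    }
}
```

Under this rule a `curl -d '<add/>' 'http://host/solr/update?commit=true'` style request (form-encoded header, args on the URL) keeps its body as a raw stream rather than having it swallowed as params.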
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> Ryan: this patch truly does kick ass ... we can probably simplify a lot
> of the Legacy stuff by leveraging your new StandardRequestBuilder -- but
> that can be done later.

Much is already done by the looks of it.

> i'm still really not liking the way there is a single SolrRequestBuilder
> with a big complicated build method that "guesses" what streams the user
> wants.

But I don't need a separate URL to do GET vs POST in HTTP. It seems like having a different URL for where you put the args would be hard to explain to people.

> i really feel strongly that even if all the parsing logic is in the core,
> even if it's all in one class: a piece of the path should be used to
> determine where the streams come from.

If there's a ? in the URL, then it's args, so that could always safely be parsed. Perhaps a special arg, if present, could override the default method of getting input streams?

> consider the example you've got on your test.html page: "POST - with query
> string" ... that doesn't obey the typical semantics of a POST with a query
> string ... if you used the methods on HttpServletRequest to get the params
> it would give you all the params it found both in the query strings *and*
> in the post body.

Blech. I was wondering about that. Sounds like bad form, but perhaps could be supported via something like /solr/foo?postbody=args

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: To be clear, (with the current implementation in SOLR-104) you would
: have to put this in your solrconfig.xml
:
: Notice the preceding '/'. I think this is a strong indication that
: someone *wants* /select to behave distinctly.

crap ... i totally misread that ... so if people have a requestHandler registered with a name that doesn't start with a slash, they can't use the new URL structure and they have to use the old one. DAMN! ... that is slick dude ... okay, i agree with you, the odds of that causing problems are pretty fucking low.

I'm still hung up on this "parse" logic thing ... i really think it needs to be in the path .. or at the very least, there needs to be a way to specify it in the path to force one behavior or another, and if it's not in the path then we can guess based on the Content-Type. Putting it in a query arg would make getting it without contaminating the POST body kludgy, putting it at the start of the path doesn't work well for supporting a default if it isn't there, and putting it at the end of the path messes up the nice work you've done letting RequestHandlers have extra path info for encoding info they want.

Hmmm... What if we did something like this...

/exec/handler/name:extra/path?param1=val1
/raw/handler/name:extra/path?param1=val1
/url/handler/name:extra/path?param1=val1&url=...&url=...
/file/handler/name:extra/path?param1=val1&file=...&file=...

where "exec" means guess based on the Content-Type, "raw" means use the POST body as a single stream regardless of Content-Type, etc...

thoughts?

-Hoss
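The URL scheme Hoss proposes above splits cleanly into three pieces: a parse-mode prefix, a handler name (which may itself contain slashes, like update/xml), and optional ":extra/path" info. A hypothetical helper sketching that split -- the method and class names are invented, and real code would need to validate the mode against a known set:

```java
public class ParseModeUrl {

    // Split a path like "/raw/update/xml:extra/path" into
    // { mode, handlerName, extraPathInfo }.
    public static String[] split(String path) {
        int slash = path.indexOf('/', 1);          // end of the mode segment
        String mode = path.substring(1, slash);
        String rest = path.substring(slash + 1);   // handler name + optional extra
        int colon = rest.indexOf(':');
        String handler = (colon < 0) ? rest : rest.substring(0, colon);
        String extra   = (colon < 0) ? ""   : rest.substring(colon + 1);
        return new String[] { mode, handler, extra };
    }
}
```

Because the handler name runs up to the ':' (not the next '/'), handler names with slashes still work, which is the property the thread cares about.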
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I just posted a new patch on SOLR-104. I think it addresses most of
: the issues we have discussed. (Its a little difficult to know as it
: has been somewhat circular) I was going to reply to your points one
: by one, but i think that would just make the discussion more confusing
: than it already is!

Ryan: this patch truly does kick ass ... we can probably simplify a lot of the Legacy stuff by leveraging your new StandardRequestBuilder -- but that can be done later.

i'm still really not liking the way there is a single SolrRequestBuilder with a big complicated build method that "guesses" what streams the user wants. i really feel strongly that even if all the parsing logic is in the core, even if it's all in one class: a piece of the path should be used to determine where the streams come from.

consider the example you've got on your test.html page: "POST - with query string" ... that doesn't obey the typical semantics of a POST with a query string ... if you used the methods on HttpServletRequest to get the params it would give you all the params it found both in the query strings *and* in the post body. This is a great example of what i was talking about: if i have no intention of sending a stream, it should be possible for me to send params in both the URL and in the POST body -- but in other cases i should be able to POST some raw XML and still have params in the URL.

arguably: we could look at the Content-Type of the request and make the assumption based on that -- but as i mentioned before, people don't always set the Content-Type perfectly. if we used a URL fragment to determine where the streams should come from we could be a lot more confident that we know where the stream should come from -- and let the RequestHandler decide if it wants to trust the ContentType

the multipart/mixed example i gave previously is another example -- your code here assumes that should be given to the RequestHandler as multiple streams -- which is a great assumption to make for file uploads, but which gives me no way to POST multipart/mixed mime data that i want given to the RequestHandler as a single ContentStream (so it can have access to all of the mime headers for each part)

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> easy thing to deal with just by scoping the URLs .. put something,
> ANYTHING, in front of these urls, that isn't "select" or "update"

I'll let you and Yonik decide this one. I'm fine either way, but I really don't see a problem letting people easily override URLs. I actually think it is a good thing.

> consider the case where a user today has this in his solrconfig...

To be clear, (with the current implementation in SOLR-104) you would have to put this in your solrconfig.xml

Notice the preceding '/'. I think this is a strong indication that someone *wants* /select to behave distinctly.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > that scares me ... not only does it rely on the client code sending the
: > correct content-type
:
: Not really... that would perhaps be the default, but the parser (or a
: handler) can make intelligent decisions about that.
:
: If you put the parser in the URL, then there's *that* to be messed up
: by the client.

but the HTTP Client libraries in various languages don't always make it easy to set Content-Type -- and even if they do that doesn't mean the person using that library knows how to use it properly -- but everyone understands how to put something in a URL. if nothing else, think of putting the "parsetype" in the URL as a checksum that the RequestParser can use to validate its assumptions -- if it's not there, then it can do all of the intelligent things you think it should do, but if it is there that dictates what it should do.

(aren't you the one that convinced me a few years back that it was better to trust a URL than to trust HTTP Headers? ... because people understand URLs and put things in them, but they don't always know what headers to send .. curl being the great example, it always sends a Content-Type even if the user doesn't ask it to right?)

: Multi-part posts will have the content-type set correctly, or it won't work.
: The big use-case I see is browser file upload, and they will set it correctly.

right, but my point is what if i want the multi-part POST body left alone so my RequestHandler can deal with it as a single stream -- if i set every header correctly, the "smart" parsing code will parse it -- which is why something in the URL telling it *not* to parse it is important.

: We should not preclude wacky handlers from doing things for
: themselves, calling our stuff as utility methods.

how? ... if there is one and only one RequestParser which makes the SolrRequest before the RequestHandler ever sees it, and parses the post body because the content-type is multipart/mixed how can a wacky handler ever get access to the raw post body?

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > A user should be confident that they can pick any name they possibly want
: > for their plugin, and it won't collide with any future addition we might
: > add to Solr.
:
: But that doesn't seem possible unless we make user plugins
: second-class citizens by scoping them differently. In the event there
: is a collision in the future, the user could rename one of the
: plugins.

when it comes to URLs, our plugins currently are second class citizens -- plugin names appear in the "qt" or "wt" params -- users can pick any names they want and they are totally legal, they don't have to worry about any possibility that a name they pick will collide with a path we have mapped to a servlet.

Users shouldn't have to change the names of requestHandlers just because Solr adds a new feature with the same name -- changing a requestHandler name could be a heavy burden for a Solr user depending on how many clients *they* have using that requestHandler with that name.

i wouldn't make a big deal out of this if it was unavoidable -- but it is such an easy thing to deal with just by scoping the URLs .. put something, ANYTHING, in front of these urls, that isn't "select" or "update" and then put the requestHandler name and we've now protected ourselves and our users.

consider the case where a user today has this in his solrconfig...

..with the URL structure you guys are talking about, with the DispatchFilter matching on /* and interpreting the first part of the path as a possible requestHandler name, that user can't upgrade Solr because he's relying on the old "/select?qt=select" style URLs to work ... he has to change the name of his requestHandler and all of his clients, then upgrade, then change all of his clients again to take advantage of the new URL structure (and the new features it provides for updates)

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I just posted a new patch on SOLR-104. I think it addresses most of the issues we have discussed. (It's a little difficult to know as it has been somewhat circular.) I was going to reply to your points one by one, but i think that would just make the discussion more confusing than it already is!

> (i don't trust HTTP Client code -- but for the sake
> of argument let's assume all clients are perfect) what happens when a
> person wants to send a mime multi-part message *AS* the raw post body -- so
> the RequestHandler gets it as a single ContentStream (ie: single input
> stream, mime type of multipart/mixed) ?

Multi-part posts will have the content-type set correctly, or it won't work. The big use-case I see is browser file upload, and they will set it correctly.

I don't see it as a big problem because we don't have to deal with legacy streams yet. No one is expecting their existing stream code to work. The only header value the SOLR-104 code relies on is 'multipart'. I think that is a reasonable constraint since it has to be implemented properly for commons-file-upload to work.

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> > I'm not sure what "it" is in the above sentence ... i believe from the
> > context of the rest of the message you are referring to using a
> > ServletFilter instead of a Servlet -- i honestly have no opinion about
> > that either way.
>
> I thought a filter required you to open up the WAR file and change
> web.xml, or am I misunderstanding?

If your question is: do you need to edit web.xml to change the URL it will apply to? My suggestion is to map /* to the DispatchFilter and have it decide whether or not to handle the requests. With a filter, you can handle the request directly or pass it up the chain. This would allow us to have the URL structures defined by solrconfig.xml (without a need to edit web.xml).

If your question is about configuring the RequestParser: Yes, you would need to edit web.xml.

My (our?) reasons for suggesting this are:

1) I think we only have one RequestParser that will handle all normal requests. Unless you have extremely specialized needs, this is not something you would change.

2) Since the RequestParser is tied so closely to HttpServletRequest and your desired URL structure, it seems appropriate to configure it in web.xml. A RequestParser is just a utility class for servlets/filters.

3) We don't want to add RequestParser to 'core' unless it really needs to be a pluggable interface. I don't see the need for it just yet.

ryan
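The "map /* to the DispatchFilter and let it decide" idea reduces to one decision: does the request path match a handler registered in solrconfig.xml? If yes, handle it; if no, pass the request up the filter chain. A plain-Java stand-in for that decision (the real thing would live in a javax.servlet.Filter; the registry shape here is hypothetical):

```java
import java.util.Set;

public class DispatchDecision {

    // true  -> the filter handles the request itself
    // false -> the filter passes the request up the chain
    public static boolean shouldHandle(String path, Set<String> registeredHandlers) {
        for (String name : registeredHandlers) {
            // only handlers registered with a leading '/' claim a URL path,
            // matching the "Notice the preceding '/'" convention in SOLR-104
            if (name.startsWith("/") && path.startsWith(name)) {
                return true;
            }
        }
        return false;
    }
}
```

This is what lets the URL structure live in solrconfig.xml: adding a handler named "/update" claims that path without touching web.xml, while unmatched paths (the admin JSPs, static files) fall through untouched.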
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote:
> : 1) I think it should be a ServletFilter applied to all requests that
> : will only process requests with a registered handler.
>
> I'm not sure what "it" is in the above sentence ... i believe from the
> context of the rest of the message you are referring to using a
> ServletFilter instead of a Servlet -- i honestly have no opinion about
> that either way.

I thought a filter required you to open up the WAR file and change web.xml, or am I misunderstanding?

--
Alan Burlison
--
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : I have imagined the single default parser handles *all* the cases you
> : just mentioned.
>
> Ahhh ... a lot of confusing things make more sense now. .. but some
> things are more confusing: If there is only one parser, and it decides
> what to do based entirely on param names and HTTP headers, then what's
> the point of having the parser name be part of the path in your URL
> design?

I didn't think it would be part of the URL anymore.

> : POST: depending on headers/content type etc you parse the body as a
> : single stream, multi-part files or read the params.
> :
> : It will take some careful design, but I think all the standard cases
> : can be handled by a single parser.
>
> that scares me ... not only does it rely on the client code sending the
> correct content-type

Not really... that would perhaps be the default, but the parser (or a handler) can make intelligent decisions about that. If you put the parser in the URL, then there's *that* to be messed up by the client.

> (i don't trust HTTP Client code -- but for the sake of argument let's
> assume all clients are perfect) what happens when a person wants to send
> a mime multi-part message *AS* the raw post body -- so the RequestHandler
> gets it as a single ContentStream (ie: single input stream, mime type of
> multipart/mixed) ?

Multi-part posts will have the content-type set correctly, or it won't work. The big use-case I see is browser file upload, and they will set it correctly.

> This may sound like a completely ridiculous idea, but consider the
> situation where someone is indexing email ... they've written a
> RequestHandler that knows how to parse multipart mime emails and convert
> them to documents, they want to POST them directly to Solr and let their
> RequestHandler deal with them as a single entity.

We should not preclude wacky handlers from doing things for themselves, calling our stuff as utility methods.

> ..i think life would be a lot simpler if we kept the RequestParser name
> as part of the URL, completely determined by the client (since the client
> knows what it's trying to send) ... even if there are only 2 or 3 types
> of RequestParsing being done.

Having to do different types of posts to different URLs doesn't seem optimal, esp if we can do it in one.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> the thing about Solr, is there really aren't a lot of "defaults" in the
> sense you mean ... there is just an example -- people might copy the
> example, but if they don't have something in their solrconfig, most
> things just aren't there

I expect that most users will fall into that category though. A minority use custom request handlers and I expect a vast minority to use custom update handlers.

> A user should be confident that they can pick any name they possibly want
> for their plugin, and it won't collide with any future addition we might
> add to Solr.

But that doesn't seem possible unless we make user plugins second-class citizens by scoping them differently. In the event there is a collision in the future, the user could rename one of the plugins.

The same type of collision can happen today with our current request handler framework, but I don't think it's worth uglifying URLs over. It will be very rare and there are ways to easily work around it.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
> i would really feel a lot happier with something like these that you
> mentioned...

If it will make you happier, then I think it's a good idea! (even if i don't see it as a Problem)

: /solr/dispatch/update/xml
: /solr/cmd/update/xml
: /solr/handle/update/xml
: /solr/do/update/xml

> http://${host}:${port}/${context}/do/${parser}/${handler/with/optional/slashes}?${params}

(assuming the number of parsers is <3 and solr.war would only have 1) How about:

http://${host}:${port}/${context}/${parser}/${handler/with/optional/slashes}?${params}

Thoughts for the default parser name? 'do' gives me the struts heebie-jeebies :)

> we can still handle...
>
> http://${host}:${port}/${context}/select/?qt=${handler}&${params}
>
> ..with a really simple ServletFilter (that has no risk of collision with
> the new URL structure one, so it can go anywhere in the FilterChain)

yes. likewise with /update
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I have imagined the single default parser handles *all* the cases you
: just mentioned.

Ahhh ... a lot of confusing things make more sense now. .. but some things are more confusing: If there is only one parser, and it decides what to do based entirely on param names and HTTP headers, then what's the point of having the parser name be part of the path in your URL design?

: POST: depending on headers/content type etc you parse the body as a
: single stream, multi-part files or read the params.
:
: It will take some careful design, but I think all the standard cases
: can be handled by a single parser.

that scares me ... not only does it rely on the client code sending the correct content-type (i don't trust HTTP Client code -- but for the sake of argument let's assume all clients are perfect) what happens when a person wants to send a mime multi-part message *AS* the raw post body -- so the RequestHandler gets it as a single ContentStream (ie: single input stream, mime type of multipart/mixed) ?

This may sound like a completely ridiculous idea, but consider the situation where someone is indexing email ... they've written a RequestHandler that knows how to parse multipart mime emails and convert them to documents, they want to POST them directly to Solr and let their RequestHandler deal with them as a single entity.

..i think life would be a lot simpler if we kept the RequestParser name as part of the URL, completely determined by the client (since the client knows what it's trying to send) ... even if there are only 2 or 3 types of RequestParsing being done.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: : This would drop the ':' from my proposed URL and change the scheme to look like:
: /parser/path/the/parser/knows/how/to/extract/?params

i was totally okay with the ":" syntax (although we should double check if ":" is actually a legal unescaped URL character) .. but i'm confused by this new suggestion ... is "parser" the name of the parser in that example and "path/the/parser/knows/how/to/extract" data that the parser may use to build the SolrRequest with? (ie: perhaps the RequestHandler) would parser names be required to not have slashes in them in that case?

(working with the assumption that most cases can be defined by a single request parser) I am/was suggesting that a dispatch servlet/filter has a single request parser. The default request parser will choose the handler based on names defined in solrconfig.xml. If someone needs a custom RequestParser, it would be linked to a new servlet/filter (possibly) mapped to a distinct prefix. If it is not possible to handle most standard stream cases with a single request parser, i will go back to the /path:parser format.

I suggest it is configured in web.xml because that is a configurable place that is not solrconfig.xml. I don't think it is or should be a highly configurable component.

: : Thank goodness you didn't! I'm confident you won't let me (or anyone)
: : talk you into something like that! You guys made a lot of good

the point i was trying to make is that if we make a RequestParser interface with a "parseRequest(HttpServletRequest req)" method, it amounts to just as much badness -- the key is we can make that interface as long as all the implementations are in the Solr code base where we can keep an eye on them, and people have to go way, WAY, *WAY* into solr to start changing them.

Yes, implementing a RequestParser is more like writing a custom Servlet than adding a Tokenizer.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > then all is fine and dandy ... but what happens if someone tries to
: > configure a plugin with the name "admin" ... now all of the existing admin

: that is exactly what you would expect to happen if you map a handler
: to /admin. The person configuring solrconfig.xml is saying "Hey, use
: this instead of the default /admin. I want mine to make sure you are
: logged in using my custom authentication method." In addition, It may
: be reasonable (sometime in the future) to implement /admin as a
: RequestHandler. This could be a clean way to address SOLR-58 (xml
: with stylesheets, or JSON, etc...)

yeah i guess that wouldn't be too horrible ... i think what i was trying to point out was that if we'd rolled out these super simple urls containing just the plugin name and someone did register a plugin overriding the admin pages, we'd screw them over later when we did get around to replacing the admin pages with a plugin if we added it as a special override ServletFilter mapping

: > also: what happens a year from now when we add some completely new
: > Servlet/ServletFilter to Solr, and want to give it a unique URL...
: >
: > http://host:/solr/bar/

: obviously, I think the default solr settings should be prudent about
: selecting URLs. The standard configuration should probably map most
: things to /select/xxx or /update/xxx.

the thing about Solr, is there really aren't a lot of "defaults" in the sense you mean ... there is just an example -- people might copy the example, but if they don't have something in their solrconfig, most things just aren't there

: > ...we could put it earlier in the processing chain before the existing
: > ServletFilter, but then we break any users that have registered a plugin
: > with the name "bar".

: Even if we move this to have a prefix path, we run into the exact same
: issue when sometime down the line solr has a default handler mapped to
: 'bar'

the point i was trying to make is that the "namespaces" that Solr uses should be unique -- the piece of the URL path that is used to pick the Servlet or Filter for dispatching the request should be uniquely distinguishable from the piece of the URL that is used to look up a plugin. A user should be confident that they can pick any name they possibly want for their plugin, and it won't collide with any future addition we might add to Solr.

if the new and improved solr URLs (minus host:port/context) are just /${plugin}/... with a dispatcher that matches on any URL and checks that path for a plugin matching that name, then we have no way of ever adding any other URL for a new feature in the future without running the risk that whatever base path we pick for that new feature's URLs, we might screw over a user who just so happened to pick that feature's name when registering a plugin -- either because we put the new feature earlier in the FilterChain and it circumvents requests the user expects to go to that plugin, or because we put that feature later in the FilterChain and that user doesn't get to take advantage of it unless he changes the name he registered the plugin with (and changes all of his clients)

i would really feel a lot happier with something like these that you mentioned...

: /solr/dispatch/update/xml
: /solr/cmd/update/xml
: /solr/handle/update/xml
: /solr/do/update/xml

http://${host}:${port}/${context}/do/${parser}/${handler/with/optional/slashes}?${params}

sounds great to me... just as long as we have some constant prefix in there so that later on we can use something else.

we can still handle...

http://${host}:${port}/${context}/select/?qt=${handler}&${params}

..with a really simple ServletFilter (that has no risk of collision with the new URL structure one, so it can go anywhere in the FilterChain)

-Hoss
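The two URL styles above (legacy /select?qt=name, and the new constant-prefix /do/${parser}/${handler}) can coexist precisely because the legacy filter only has to look at the path prefix and the qt param. A hypothetical sketch of that mapping -- "/do" as the constant prefix and "standard" as the fallback handler name are assumptions from this thread, not settled Solr behavior:

```java
public class LegacyUrlMapper {

    // Resolve the handler name for either URL style, or null if
    // this filter shouldn't handle the path at all.
    public static String handlerFor(String path, String qtParam) {
        if (path.startsWith("/select")) {
            // old style: handler comes from qt, defaulting to "standard"
            return (qtParam == null || qtParam.isEmpty()) ? "standard" : qtParam;
        }
        if (path.startsWith("/do/")) {
            // new style: strip the constant prefix, then the parser segment
            String rest = path.substring("/do/".length());
            return rest.substring(rest.indexOf('/') + 1);
        }
        return null; // not ours; pass up the FilterChain
    }
}
```

Because "/select" and "/do" are fixed namespaces, neither can ever collide with a user-chosen plugin name, which is the whole point Hoss is arguing for.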
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/20/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > what!? .. really? ... you don't think the ones i mentioned before are
> > things we should support out of the box?
> >
> > - no stream parser (needed for simple GETs)
> > - single stream from raw post body (needed for current updates)
> > - multiple streams from multipart mime in post body (needed for SOLR-85)
> > - multiple streams from files specified in params (needed for SOLR-66)
> > - multiple streams from remote URL specified in params
>
> I have imagined the single default parser handles *all* the cases you
> just mentioned.

Yes, this is what I had envisioned. And if we come up with another cool standard one, we can add it and all the current/older handlers get that additional behavior for free.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > This would give people a relativly easy way to implement 'restful' : > URLs if they need to. (but they would have to edit web.xml) : : A handler could alternately get the rest of the path (absent params), right? only if the RequestParser adds it to the SolrRequest as a SolrParam. : > Unit tests should be handled by execute( handler, req, res ) : : How does the unit test get the handler? i think ryans point is that when testing a handler, you should know which handler you are testing, so construct it and execute it directly. : > I am proposing we have a single interface to do this: : > SolrRequest r = RequestParser.parse( HttpServletRequest ) : : That's currently what new SolrServletRequest(HttpServletRequest) does. : We just need to figure out how to get InputStreams, Readers, etc. we start by adding "Iterable getStreams()" to the SolrRequest interface, with a setter on all of the Impls that's not part of the interface. then i suspect what we'll see is two classes that look like this.. public class NoStreamRequestParser implements RequestParser { public SolrRequest parse(HttpServletRequest req) { return new SolrServletRequest(HttpServletRequest); } } public class RawPostStreamRequestParser extends NoStreamRequestParser { public SolrRequest parse(HttpServletRequest req) { ContentStream c = makeContentStream(req.getInputStream()) SolrServletRequest s = super.parse(req); s.setStreams(new SinglItemCollection(c)) return s; } } : So, the hander needs to be able to get an InputStream, and HTTP headers. : Other plugins (CSV) will ask for a Reader and expect the details to be : ironed out for it. : : Method1: come up with ways to expose all this info through an : interface... a "headers" object could be added to the SolrRequest : context (see getContext()) this is why Ryan and i have been talking in terms of a "ContentStream" interface instead of just "InputStream" .. 
at some point we talked about the ContentStream having getters for mime type and charset that might be null if unknown. -Hoss
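A hypothetical sketch of the ContentStream idea being discussed: a stream plus optional type/charset metadata, with a Reader convenience for handlers like the CSV loader. All names here (`ContentStream`, `StringContentStream`, `getReader`) are illustrative readings of the thread, not the API as it existed at the time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// A stream plus optional metadata: both metadata getters may return null
// when the transport didn't say (e.g. a raw POST with no charset param).
interface ContentStream {
    String getContentType();                   // may be null if unknown
    String getCharset();                       // may be null if unknown
    InputStream getStream() throws IOException;

    // Convenience for Reader-oriented handlers (the CSV case): wrap the
    // raw bytes, falling back to UTF-8 when no charset was declared.
    default Reader getReader() throws IOException {
        String cs = getCharset();
        return new InputStreamReader(getStream(), cs == null ? "UTF-8" : cs);
    }
}

// Minimal in-memory impl, handy for unit-testing handlers.
class StringContentStream implements ContentStream {
    private final String body, contentType, charset;

    StringContentStream(String body, String contentType, String charset) {
        this.body = body;
        this.contentType = contentType;
        this.charset = charset;
    }

    public String getContentType() { return contentType; }
    public String getCharset()     { return charset; }

    public InputStream getStream() throws IOException {
        return new ByteArrayInputStream(
            body.getBytes(charset == null ? "UTF-8" : charset));
    }
}
```

Handlers that want the raw bytes (the woodstox XML case) call getStream(); handlers that just want characters call getReader() and let the charset details get ironed out for them.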
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
what!? .. really? ... you don't think the ones i mentioned before are things we should support out of the box?

- no stream parser (needed for simple GETs)
- single stream from raw post body (needed for current updates)
- multiple streams from multipart mime in post body (needed for SOLR-85)
- multiple streams from files specified in params (needed for SOLR-66)
- multiple streams from remote URL specified in params

I have imagined the single default parser handles *all* the cases you just mentioned.

GET: read params from paramMap(). Check those params for special params that send you to one or many remote streams.

POST: depending on headers/content type etc you parse the body as a single stream, multi-part files or read the params.

It will take some careful design, but I think all the standard cases can be handled by a single parser.
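The branching Ryan describes for a single default parser could be sketched roughly like this. Everything here is hypothetical, including the `stream.url`/`stream.file` parameter names used to point a GET at remote or local streams.

```java
import java.util.Map;

// One parser, several extraction strategies: GETs have no body, so
// streams can only come from params pointing at remote/local sources;
// POSTs branch on the Content-Type header.
class DefaultParserSketch {
    static String pickStrategy(String method, String contentType,
                               Map<String, String> params) {
        if ("GET".equals(method)) {
            if (params.containsKey("stream.url"))  return "remote-url";
            if (params.containsKey("stream.file")) return "local-file";
            return "no-stream";
        }
        // POST: the content type decides
        if (contentType != null && contentType.startsWith("multipart/"))
            return "multipart";                   // SOLR-85 style uploads
        if (contentType != null &&
            contentType.startsWith("application/x-www-form-urlencoded"))
            return "form-params";                 // query-via-POST
        return "raw-post-body";                   // current /update behavior
    }
}
```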
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: The RequestParser would not be part of the core API - It would be a
: helper function for Servlets and Filters that call the core API. It
: could be configured in web.xml rather than solrconfig.xml. A
: RequestDispatcher (Servlet or Filter) would be configured with a
: single RequestParser.
:
: The RequestParser would be in charge of taking HttpRequest and determining:
: 1) The RequestHandler
: 2) The SolrRequest (Params & Streams)

This sounds fine to me ... i was going to suggest that having a public API for RequestParser that people could extend and register instances of in the solrconfig would be better than no public API at all -- but if we do that we've let the genie out of the bottle, better to be more restrictive about the internal API, and if/when new use cases come up we can revisit the decision then.

If the RequestParser is going to pick the RequestHandler, we might as well stick with the current model where the RequestHandler is determined by the "qt" SolrParam (it just wouldn't necessarily come from the "qt" param of the URL, since the RequestParser can decide where everything comes from -- it could be from a URL param or it could be from the path) to keep the API simple right?

interface RequestParser {
  public SolrRequest makeSolrRequest(HttpServletRequest req);
}

I'm curious though why you think RequestParsers should be managed in the web.xml ... do you mean they would each be a Servlet Filter? ... 
if we assume there's going to be a fixed list and they aren't easily extended, then why not just:

- have a HashMap of them in a single ServletFilter dispatcher,
- lookup the one to use based on the appropriate part of the path
- let that RequestParser make the SolrRequest
- continue with common code for all requests regardless of format:
  - get RequestHandler from the core by name
  - execute RequestHandler
  - get output writer by name
  - write out response

: It would not be the most 'pluggable' of plugins, but I am still having
: trouble imagining anything beyond a single default RequestParser.

what!? .. really? ... you don't think the ones i mentioned before are things we should support out of the box?

- no stream parser (needed for simple GETs)
- single stream from raw post body (needed for current updates)
- multiple streams from multipart mime in post body (needed for SOLR-85)
- multiple streams from files specified in params (needed for SOLR-66)
- multiple streams from remote URL specified in params

: Assuming anything doing *really* complex ways of extracting
: ContentStreams will do it in the Handler not the request parser. For
: reference see my argument for a separate DocumentParser interface in:
: http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

agreed ... but that can easily be added later.

: In my view, the default one could be mapped to "/*" and a custom one
: could be mapped to "/mycustomparser/*"
:
: This would drop the ':' from my proposed URL and change the scheme to look like:
: /parser/path/the/parser/knows/how/to/extract/?params

i was totally okay with the ":" syntax (although we should double check if ":" is actually a legal unescaped URL character) .. but i'm confused by this new suggestion ... is "parser" the name of the parser in that example and "path/the/parser/knows/how/to/extract" data that the parser may use to build the SolrRequest with? 
(ie: perhaps the RequestHandler) would parser names be required to not have slashes in them in that case?

: > Imagine if 3 years ago, when Yonik and I were first hammering out the API
: > for SolrRequestHandlers, we had picked this...
: >
: >   public interface SolrRequestHandlers extends SolrInfoMBean {
: >     public void init(NamedList args);
: >     public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
: >   }
:
: Thank goodness you didn't! I'm confident you won't let me (or anyone)
: talk you into something like that! You guys made a lot of good

the point i was trying to make is that if we make a RequestParser interface with a "parseRequest(HttpServletRequest req)" method, it amounts to just as much badness -- the key is we can make that interface as long as all the implementations are in the Solr code base where we can keep an eye on them, and people have to go way, WAY, *WAY* into solr to start changing them.

-Hoss
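For the "/parser/rest/of/path" scheme under debate, the dispatcher's split is trivial exactly when parser names are forbidden from containing slashes, which is the constraint being asked about. A hypothetical sketch:

```java
// Split "/parsername/rest/of/path" into { parserName, remainingPath }.
// Only unambiguous if parser names can never contain '/'.
class PathSketch {
    static String[] splitParserPath(String pathInfo) {
        String p = pathInfo.startsWith("/") ? pathInfo.substring(1) : pathInfo;
        int slash = p.indexOf('/');
        if (slash < 0) return new String[] { p, "" };
        return new String[] { p.substring(0, slash), p.substring(slash + 1) };
    }
}
```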
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it occurred to me "I really hope Ryan realizes i like all of his ideas, i'm just wondering if they can be better" -- most people I work with don't have the stamina to deal with my design reviews :)

Thank you both! This is the first time I've taken the time and effort to contribute to an open source project. I'm learning the pace/etiquette etc as I go along :) Honestly your critique is refreshing - I'm used to working alone or directing others. I *think* we are close to something we will all be happy with.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it occurred to me "I really hope Ryan realizes i like all of his ideas, i'm just wondering if they can be better" -- most people I work with don't have the stamina to deal with my design reviews :)

What occurred to me as i was *getting* home was that since I seem to be the only one that's (overly) worried about the RequestParser/HTTP abstraction -- and since i haven't managed to convince Ryan after all of my badgering -- it's probably just me being paranoid.

I think in general, the approach you've outlined should work great -- i'll reply to some of your more recent comments directly.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
First Ryan, thank you for your patience on this *very* long hash session. Most wouldn't last that long unless it were a flame war ;-) And thanks to Hoss, who seems to have the highest read+response bandwidth of anyone I've ever seen (I'll admit I've only been selectively reading this thread, with good intentions of coming back to it). On 1/19/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: It would not be the most 'pluggable' of plugins, but I am still having trouble imagining anything beyond a single default RequestParser. Assuming anything doing *really* complex ways of extracting ContentStreams will do it in the Handler not the request parser. Agreed... a custom handler opening various streams not covered by the default will most easily be handled by the handler opening the streams themselves. This would give people a relatively easy way to implement 'restful' URLs if they need to. (but they would have to edit web.xml) A handler could alternately get the rest of the path (absent params), right? Correct, SolrCore should not care what the request path is. That is why I want to deprecate the execute( ) function that assumes the handler is defined by 'qt' Unit tests should be handled by execute( handler, req, res ) How does the unit test get the handler? If I had my druthers, It would be: res = handler.execute( req ) but that is too big of a leap for now :) Yep... esp since the response writers now need the request for parameters, for the searcher (streaming docs, etc). You guys made a lot of good choices and solr is an amazing platform for it. I just wish I had known Lucene when I *started* Sol(a)r ;-) I am proposing we have a single interface to do this: SolrRequest r = RequestParser.parse( HttpServletRequest ) That's currently what new SolrServletRequest(HttpServletRequest) does. We just need to figure out how to get InputStreams, Readers, etc. I agree. This is why i suggest the RequestParsers is not a core part of the API, just a helper class for Servlets and Filters. 
Sounds good as a practical starting point to me. If we need more in the future, we can add it then. USECASE: The XML update plugin using the woodstox XML parser: Woodstox docs say to give the parser an InputStream (with char encoding, if available) for best performance. This is also preferable since if the charset isn't specified, the parser can try to snoop it from the stream. So, the handler needs to be able to get an InputStream, and HTTP headers. Other plugins (CSV) will ask for a Reader and expect the details to be ironed out for it. Method1: come up with ways to expose all this info through an interface... a "headers" object could be added to the SolrRequest context (see getContext()) Method2: consider it a more special case, have an XML update servlet that puts that info into the SolrRequest (perhaps via the context again) -Yonik
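The "snoop it from the stream" point is why handlers want raw bytes: an XML parser can recover the encoding from the declaration when HTTP didn't supply one. A simplified illustration of the idea (real encoding detection also handles BOMs and UTF-16 declarations, which this skips; names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingSniff {
    // Pull the encoding pseudo-attribute out of an XML declaration.
    // The declaration itself is ASCII-compatible, so decoding the first
    // bytes as Latin-1 is safe for this narrow purpose.
    static String snoopXmlEncoding(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = Pattern.compile("encoding=[\"']([^\"']+)[\"']").matcher(s);
        return m.find() ? m.group(1) : null;
    }
}
```

A handler given only a pre-built Reader can no longer do this, which is the argument for handing it the InputStream plus whatever the HTTP headers said.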
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
(Note: this is different than what i have suggested before. Treat it as brainstorming on how to take what i have suggested and mesh it with your concerns)

What if:

The RequestParser would not be part of the core API - It would be a helper function for Servlets and Filters that call the core API. It could be configured in web.xml rather than solrconfig.xml. A RequestDispatcher (Servlet or Filter) would be configured with a single RequestParser.

The RequestParser would be in charge of taking HttpRequest and determining:
1) The RequestHandler
2) The SolrRequest (Params & Streams)

It would not be the most 'pluggable' of plugins, but I am still having trouble imagining anything beyond a single default RequestParser. Assuming anything doing *really* complex ways of extracting ContentStreams will do it in the Handler not the request parser. For reference see my argument for a separate DocumentParser interface in:
http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

In my view, the default one could be mapped to "/*" and a custom one could be mapped to "/mycustomparser/*"

This would drop the ':' from my proposed URL and change the scheme to look like:
/parser/path/the/parser/knows/how/to/extract/?params

This would give people a relatively easy way to implement 'restful' URLs if they need to. (but they would have to edit web.xml)

: Would that be configured in solrconfig.xml as

Correct, SolrCore should not care what the request path is. That is why I want to deprecate the execute( ) function that assumes the handler is defined by 'qt'

Unit tests should be handled by execute( handler, req, res )

If I had my druthers, It would be: res = handler.execute( req ) but that is too big of a leap for now :)

... A third use case of doing queries with POST might be that you want to use standard CGI form encoding/multi-part file upload semantics of HTTP to send an XML file (or files) to the above mentioned XmlQPRequestHandler ... 
so then we have "MultiPartMimeRequestParser" ...

I agree with all your use cases. It just seems like a LOT of complex overhead to extract the general aspects of translating a URL+Params+Streams => Handler+Request(Params+Streams)

Again, since the number of 'RequestParsers' is small, it seems overly complex to have a separate plugin to extract the URL, another to extract the Handler, and another to extract the streams. Particularly since the decisions on how you parse the URL can totally affect the other aspects.

...i really, really, REALLY don't like the idea that the RequestParser Impls -- classes users should be free to write on their own and plug in to Solr using the solrconfig.xml -- are responsible for the URL parsing and parameter extraction. Maybe calling them "RequestParser" in my suggested design is misleading, maybe a better name like "StreamExtractor" would be better ... but they shouldn't be in charge of doing anything with the URL.

What if it were configured in web.xml, would you feel more comfortable letting it determine how the URL is parsed and streams are extracted?

Imagine if 3 years ago, when Yonik and I were first hammering out the API for SolrRequestHandlers, we had picked this...

public interface SolrRequestHandlers extends SolrInfoMBean {
  public void init(NamedList args);
  public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
}

Thank goodness you didn't! I'm confident you won't let me (or anyone) talk you into something like that! You guys made a lot of good choices and solr is an amazing platform for it.

That said, the task at issue is: How do we convert an arbitrary HttpServletRequest into a SolrRequest.

I am proposing we have a single interface to do this:
SolrRequest r = RequestParser.parse( HttpServletRequest )

You are proposing this is broken down further. 
Something like:

Handler h = (the filter) getHandler( req.getPath() )
SolrParams = (the filter) do stuff to extract the params (using parser.preProcess())
ContentStreams = parser.parse( request )

While it is not great to have plugins manipulate the HttpRequest - someone needs to do it. In my opinion, the RequestParser's job is to isolate *everything* *else* from the HttpServletRequest. Again, since the number of RequestParsers is small, it seems ok (to me).

keeping HttpServletRequest out of the API for RequestParsers helps us future-proof against breaking plugins down the road.

I agree. This is why i suggest the RequestParsers are not a core part of the API, just a helper class for Servlets and Filters.

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: All that said, this could just as cleanly map everything to: /solr/dispatch/update/xml /solr/cmd/update/xml /solr/handle/update/xml /solr/do/update/xml thoughts? That was my original assumption (because I was thinking of using servlets, not a filter), but I see little advantage to scoping under additional path elements. I also agree with the other points you make. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
then all is fine and dandy ... but what happens if someone tries to configure a plugin with the name "admin" ... now all of the existing admin pages break. that is exactly what you would expect to happen if you map a handler to /admin. The person configuring solrconfig.xml is saying "Hey, use this instead of the default /admin. I want mine to make sure you are logged in using my custom authentication method." In addition, It may be reasonable (sometime in the future) to implement /admin as a RequestHandler. This could be a clean way to address SOLR-58 (xml with stylesheets, or JSON, etc...) also: what happens a year from now when we add some completely new Servlet/ServletFilter to Solr, and want to give it a unique URL... http://host:/solr/bar/ obviously, I think the default solr settings should be prudent about selecting URLs. The standard configuration should probably map most things to /select/xxx or /update/xxx. ...we could put it earlier in the processing chain before the existing ServletFilter, but then we break any users that have registered a plugin with the name "bar". Even if we move this to have a prefix path, we run into the exact same issue when sometime down the line solr has a default handler mapped to 'bar' /solr/dispatcher/bar But, if it ever becomes a problem, we can add an "excludes" pattern to the filter-config that would skip processing even if it maps to a known handler. more short term: if there is no prefix that the ServletFilter requires, then supporting the legacy "http://host:/solr/update" and "http://host:/solr/select" URLs becomes harder, I don't think /update or /select need to be legacy URLs. They can (and should) continue to work as they currently do using a new framework. The reason I was suggesting that the Handler interface adds support to ask for the default RequestParser and/or ResponseWriter is to support this exact issue. 
(However in the case of path="/select" the filter would need to get the handler from ?qt=xxx)

- - - - -

All that said, this could just as cleanly map everything to:
/solr/dispatch/update/xml
/solr/cmd/update/xml
/solr/handle/update/xml
/solr/do/update/xml

thoughts?
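The "/select needs ?qt=xxx, everything else maps by path" rule could be sketched like this (a hypothetical helper, not actual Solr code; "standard" as the fallback handler name is also an assumption):

```java
import java.util.Map;

class LegacyDispatchSketch {
    // "/select" keeps its legacy qt-based lookup; any other path maps
    // directly to a registered handler name.
    static String handlerNameFor(String path, Map<String, String> params) {
        if ("/select".equals(path)) {
            String qt = params.get("qt");
            return qt == null ? "standard" : qt;  // legacy default handler
        }
        return path.startsWith("/") ? path.substring(1) : path;
    }
}
```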
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > > whoa ... hold on a minute, even if we use a ServletFilter to do all of the
: > > dispatching instead of a Servlet we still need a base path right?
: > I thought that's what the filter gave you... the ability to filter any
: > URL to the /solr webapp, and Ryan was doing a lookup on the next
: > element for a request handler.
: yes, this is the beauty of a Filter. It *can* process the request
: and/or it can pass it along. There is no problem at all with mapping
: a filter to all requests and a servlet to some paths. The filter will
: only handle paths declared in solrconfig.xml everything else will be
: handled however it is defined in web.xml

sorry ... i know that a ServletFilter can look at a request, choose to process it, or choose to ignore it ... my point was that if we use a Filter, we still should put in that filter logic to only look at requests starting with a fixed prefix.

consider this URL... http://host:/solr/foo/ ...where "solr" is the webapp name as usual. if the filter matches on "/*" and then does a lookup in the solrconfig for "foo" to find the Plugin to use for that request, and ignores the request and passes it down the chain if one isn't configured with the name "foo" then all is fine and dandy ... but what happens if someone tries to configure a plugin with the name "admin" ... now all of the existing admin pages break.

also: what happens a year from now when we add some completely new Servlet/ServletFilter to Solr, and want to give it a unique URL... http://host:/solr/bar/ ...we could put it earlier in the processing chain before the existing ServletFilter, but then we break any users that have registered a plugin with the name "bar". 
more short term: if there is no prefix that the ServletFilter requires, then supporting the legacy "http://host:/solr/update" and "http://host:/solr/select" URLs becomes harder, because how do we safely tell if the remote client is expecting the legacy behavior of those URLs, or if we are trying to support some plugin configured using the names "select" and "update" ? -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: > whoa ... hold on a minute, even if we use a ServletFilter do do all of the > dispatching instead of a Servlet we still need a base path right? I thought that's what the filter gave you... the ability to filter any URL to the /solr webapp, and Ryan was doing a lookup on the next element for a request handler. yes, this is the beauty of a Filter. It *can* process the request and/or it can pass it along. There is no problem at all with mapping a filter to all requests and a servlet to some paths. The filter will only handle paths declared in solrconfig.xml everything else will be handled however it is defined in web.xml (As a sidenote, wicket 2.0 replaces their dispatch servlet with a filter - it makes it MUCH easier to have their app co-exist with other things in a shared URL structure.)
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/19/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: whoa ... hold on a minute, even if we use a ServletFilter to do all of the dispatching instead of a Servlet we still need a base path right? I thought that's what the filter gave you... the ability to filter any URL to the /solr webapp, and Ryan was doing a lookup on the next element for a request handler. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: 1) I think it should be a ServletFilter applied to all requests that
: will only process requests with a registered handler.

I'm not sure what "it" is in the above sentence ... i believe from the context of the rest of the message you are referring to using a ServletFilter instead of a Servlet -- i honestly have no opinion about that either way.

: 2) I think the RequestParser should take care of parsing
: ContentStreams *and* SolrParams - not just the streams. The dispatch
: servlet/filter should never call req.getParameter().

If that's the case, then the RequestParser is in control of the URL structure ... except that it's not in control of the path info since that's how we pick the RequestParser in the first place ... what if we decide later that we want to change the URL structure -- then every RequestParser would have to be changed.

: 3) I think the dispatcher picks the Handler and either calls it
: directly or passes it to SolrCore. It does not put "qt" in the
: SolrParams and have SolrCore extract it (again)

that's perfectly fine with me - i only had it that way because that's how RequestHandler execution currently works, i wanted to leave anything not directly related to what i was suggesting exactly the way it currently is in my pseudo code.

: == Arguments for a ServletFilter: ==
: If we implement the dispatcher as a Filter:
: * the URL is totally customizable from solrconfig.xml

can you explain this more ... why does a ServletFilter make the URL more customizable than an alternative (which i believe is just a Servlet)

: If we implement the dispatcher as a Servlet
: * we have to define a 'base' path for each servlet - this would make
: the names longer than they need to be and add potential confusion in
: the configuration.

whoa ... hold on a minute, even if we use a ServletFilter to do all of the dispatching instead of a Servlet we still need a base path right? ... 
even if we ignore the current admin pages and assume we're going to replace them all with new RequestHandlers when we do this, what happens a year from now when we decide we want to add some new piece of functionality that needs a different Servlet/ServletFilter ... if we've got a Filter matching on "/*" don't we burn every possible bridge we have for adding something else later.

: Consider the servlet 'update' and another servlet 'select' With our
: proposed changes, these could both be the same servlet class
: configured to distinct paths. Now lets say you want to call:
: http://localhost/solr/update/xml?params
: Would that be configured in solrconfig.xml as
: http://www.nabble.com/Using-HTTP-Post-for-Queries-tf3039973.html)
: It seems like we may need a few ways to parse params out of the
: request. The way one handles the parameters directly affects the
: streams. This logic should be contained in a single place.

The intent there is to use a regular CGI form encoded POST body to express more params than the client feels safe putting in a URL, under the API i was suggesting that would be solved with a "No-Op" RequestParser that has empty preProcess and process methods. when the Servlet (or ServletFilter) builds the SolrParams (in between calling parser.preProcess and parser.process) it gets *all* of the form encoded params from the HttpServletRequest (because no code has touched the input stream)

an alternative situation in which you might want to "Query using HTTP POST" is if you had an XmlQPRequestHandler that understood the xml-query-parser syntax from this contrib...
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/xml-query-parser/
...which expected to read the XML from the ContentStreams of the SolrRequest, and you wanted to put the XML in the raw POST body of the request (the same way our current update POSTs work) but there were other options XmlQPRequestHandler wanted to get out of the SolrRequest's SolrParams. 
that would be handled by a "RawPostRequestParser" whose process method would be a No-Op, but the preProcess method would make a ContentStream out of the InputStream from the HttpServletRequest -- then the Servlet/ServletFilter would parse the url using the HttpServletRequest.getParameter() methods (which are now safe to call without damaging the InputStream). (That RawPostRequestParser would be reused along with an XmlUpdateHandler that we refactor the existing update logic from the core to support the legacy /update URLs)

A third use case of doing queries with POST might be that you want to use standard CGI form encoding/multi-part file upload semantics of HTTP to send an XML file (or files) to the above mentioned XmlQPRequestHandler ... so then we have "MultiPartMimeRequestParser" that has a No-Op preProcess method, and uses the Commons FileUpload code with a org.apache.commons.fileupload.RequestContext it builds out of the header info passed to preProcess by the Servlet.

: == The Dispatcher should pick the handler ==
: There is no reason it would need to inject 'qt' int
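The pre/process split being described can be sketched without any servlet API. All names below are hypothetical, and plain InputStreams stand in for ContentStreams to keep the sketch self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Phase 1 (preProcess) runs before any getParameter() call, so it may
// consume the raw POST body. Phase 2 (process) runs after the params
// are parsed and may derive additional streams from them.
interface StreamExtractor {
    List<InputStream> preProcess(InputStream rawBody);
    List<InputStream> process(Map<String, String> params, List<InputStream> soFar);
}

// "No-Op" parser: leaves the body untouched so the container can parse
// form-encoded params out of it (the query-via-POST use case).
class NoOpExtractor implements StreamExtractor {
    public List<InputStream> preProcess(InputStream rawBody) {
        return Collections.emptyList();
    }
    public List<InputStream> process(Map<String, String> params, List<InputStream> soFar) {
        return soFar;
    }
}

// "RawPost" parser: claims the body as the single content stream in
// phase 1, before anything can damage it.
class RawPostExtractor implements StreamExtractor {
    public List<InputStream> preProcess(InputStream rawBody) {
        return Collections.singletonList(rawBody);
    }
    public List<InputStream> process(Map<String, String> params, List<InputStream> soFar) {
        return soFar;
    }
}
```

A "MultiPartMime" extractor would be the mirror image of RawPost: a No-Op phase 1, with phase 2 (or the raw-body hook plus header info) doing the multipart parsing.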
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Ok, now i think I get what you are suggesting. The differences are that:

1) I think it should be a ServletFilter applied to all requests that will only process requests with a registered handler.
2) I think the RequestParser should take care of parsing ContentStreams *and* SolrParams - not just the streams. The dispatch servlet/filter should never call req.getParameter().
3) I think the dispatcher picks the Handler and either calls it directly or passes it to SolrCore. It does not put "qt" in the SolrParams and have SolrCore extract it (again)

== Arguments for a ServletFilter: ==

If we implement the dispatcher as a Filter:
* the URL is totally customizable from solrconfig.xml
* we have a single Filter to handle all standard requests
* with this single Filter, we can easily handle the existing URL structures
* configured URLs can sit at the 'Top level' next to 'top level' servlets

If we implement the dispatcher as a Servlet
* we have to define a 'base' path for each servlet - this would make the names longer than they need to be and add potential confusion in the configuration.

Consider the servlet 'update' and another servlet 'select' With our proposed changes, these could both be the same servlet class configured to distinct paths. Now lets say you want to call:
http://localhost/solr/update/xml?params
Would that be configured in solrconfig.xml as http://www.nabble.com/Using-HTTP-Post-for-Queries-tf3039973.html)

It seems like we may need a few ways to parse params out of the request. The way one handles the parameters directly affects the streams. This logic should be contained in a single place.

== The Dispatcher should pick the handler ==

In the proposed url scheme: /path/to/handler:parser, the dispatcher has to decide what handler it is. If we use a filter, it will look for a registered handler - if it can't find one, it will not process the request. 
There is no reason it would need to inject 'qt' into the solr params just so it can be pulled out by SolrCore (using the @deprecated function: solrReq.getQueryType()!) If the dispatcher is required to put a parameter in SolrParams, we could not make the RequestParser in charge of filling the SolrParams. This would require something like your pre/process system.

== Pseudo-Java ==

The real version will do error handling and will need some special logic to make '/select' behave exactly as it does now.

class SolrFilter implements Filter {
  void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) {
    HttpServletRequest req = (HttpServletRequest) request;
    String path = req.getServletPath();
    SolrRequestHandler handler = getHandlerFromPath( path );
    if( handler != null ) {
      SolrRequestParser parser = getParserFromPath( path );
      SolrQueryResponse solrRes = new SolrQueryResponse();
      SolrQueryRequest solrReq = parser.parse( request );
      core.execute( handler, solrReq, solrRes );
      return;
    }
    chain.doFilter(request, response);
  }
}

Modify core to directly accept the 'handler':

class SolrCore {
  public void execute(SolrRequestHandler handler, SolrQueryRequest req, SolrQueryResponse rsp) {
    // setup response header and handle request
    final NamedList responseHeader = new NamedList();
    rsp.add("responseHeader", responseHeader);
    handler.handleRequest(req,rsp);
    setResponseHeaderValues(responseHeader,req,rsp);
    log.info(req.getParamString()+ " 0 "+ (int)(rsp.getEndTime() - req.getStartTime()));
  }

  @Deprecated
  public void execute(SolrQueryRequest req, SolrQueryResponse rsp) {
    SolrRequestHandler handler = getRequestHandler(req.getQueryType());
    if (handler==null) {
      log.warning("Unknown Request Handler ...
    }
    this.execute( handler, req, rsp );
  }
}

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > Ah ... this is the one problem with high volume on an involved thread ...
: > i'm sending replies to messages you write after you've already read other
: > replies to other messages you sent and changed your mind :)
: Should we start a new thread?

I don't think it would make a difference ... we just need to slow down :)

: Ok, now (I think) I see the difference between our ideas.
:
: >From your code, it looks like you want the RequestParser to extract
: 'qt' that defines the RequestHandler. In my proposal, the
: RequestHandler is selected independent of the RequestParser.

no, no, no ... i'm sorry if i gave that impression ... the RequestParser *only* worries about getting streams, it shouldn't have any way of even *guessing* what RequestHandler is going to be used.

for reference: http://www.nabble.com/Re%3A-p8438292.html

note that i never mention "qt" .. instead i refer to "core.execute(solrReq, solrRsp);" doing exactly what it does today ... core.execute will call getRequestHandler(solrReq.getQueryType()) to pick the RequestHandler to use. the Servlet is what creates the SolrRequest object, and puts whatever SolrParams it wants (including "qt") in that SolrRequest before asking the SolrCore to take care of it.

: What do you imagine happens in:
: >
: > String p = pickRequestParser(req);

let's use the URL syntax you've been talking about that people seem to have agreed looks good (assuming i understand correctly) ...

/servlet/${requesthandler}:${requestparser}?param1=val1&param2=val2

what i was suggesting was that then the servlet which uses that URL structure might have a utility method called pickRequestParser that would look like... 
private String pickRequestParser(HttpServletRequest req) {
  String[] pathParts = req.getPathInfo().split(":");
  if (pathParts.length < 2 || "".equals(pathParts[1]))
    return "default"; // or "standard", or null -- whatever
  return pathParts[1];
}

: If the RequestHandler is defined by the RequestParser, I would
: suggest something like:

again, i can't emphasize enough that that's not what i was proposing ... i am in no way shape or form trying to talk you out of the idea that it should be possible to specify the RequestParser, the RequestHandler, and the OutputWriter all as part of the URL, and completely independent of each other. the RequestHandler and the OutputWriter could be specified as regular SolrParams that come from any part of the HTTP request, but the RequestParser needs to come from some part of the URL that can be inspected without any risk of affecting the raw post stream (ie: no HttpServletRequest.getParameter() calls)

: I still don't see why:
:
: > // let the parser preprocess the streams if it wants...
: > Iterable<ContentStream> s = solrParser.preprocess(getStreamInfo(req),
: >     new Pointer<InputStream>() {
: >       InputStream get() {
: >         return req.getInputStream();
: >       }
: >     });
: >
: > SolrParams params = makeSolrRequest(req);
: >
: > // let the parser decide what to do with the existing streams,
: > // or provide new ones
: > s = solrParser.process(solrReq, s);
: >
: > // ServletSolrRequest is a basic impl of SolrRequest
: > SolrRequest solrReq = new ServletSolrRequest(params, s);
:
: can not be contained entirely in:
:
:   SolrRequest solrReq = parser.parse( req );

because then the RequestParser would be defining how the URL is getting parsed -- the makeSolrRequest utility placeholder i described had the wrong name, i should have called it makeSolrParams ... it would look something like this in the URL syntax i described above... 
private SolrParams makeSolrParams(HttpServletRequest req) { // this class is already in our code base, used as is SolrParams p = new ServletSolrParams(req); String[] pathParts = req.getPathInfo().split(":"); if ("".equals(pathParts[0])) return p; Map tmp = new HashMap(); tmp.put("qt", pathParts[0]); return new DefaultSolrParams(new MapSolrParams(tmp), p); }

the nutshell version of everything i'm trying to say is...

SolrRequest - models all info about a request to solr to do something:
 - the key=val params associated with that request
 - any streams of data associated with that request

RequestParser(s) - different instances for different sources of streams:
 - is given two chances to generate ContentStreams:
   - once using the raw stream from the HTTP request
   - once using the params for the SolrRequest

SolrServlet - the only thing with direct access to the HttpServletRequest, shields the other interface APIs from the mechanics of HTTP:
 - dictates the URL structure
 - determines the name of the RequestParser to use
 - lets the parser have the raw input stream
 - determines where SolrParams for the request come from
 - lets the parser have the params to make more streams if it wants to

SolrCore - does all of the name lookups for processing a SolrRequest: -
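To make the "/handler:parser" convention concrete, here is a minimal self-contained sketch of the path split; the class name and the "standard" fallback are illustrative assumptions -- only the ':' syntax comes from this thread:

```java
// Hypothetical helper illustrating the "/handler:parser" URL split
// discussed above. Only the ':' convention is from the proposal.
class HandlerParserPath {
    final String handler; // e.g. "/my/update/csv"
    final String parser;  // e.g. "remoteurls"

    HandlerParserPath(String handler, String parser) {
        this.handler = handler;
        this.parser = parser;
    }

    static HandlerParserPath parse(String pathInfo) {
        int colon = pathInfo.indexOf(':');
        if (colon < 0) {
            // no ':' in the URL -- fall back to the "standard" parser
            return new HandlerParserPath(pathInfo, "standard");
        }
        String p = pathInfo.substring(colon + 1);
        return new HandlerParserPath(pathInfo.substring(0, colon),
                                     p.isEmpty() ? "standard" : p);
    }
}
```

A servlet's pickRequestParser and makeSolrParams utilities could both delegate to a split like this, keeping the URL structure in one place.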
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I was... then you talked me out of it! You are correct, the client : should determine the RequestParser independent of the RequestHandler. Ah ... this is the one problem with high volume on an involved thread ... i'm sending replies to messages you write after you've already read other replies to other messages you sent and changed your mind :) Should we start a new thread? Here's a more fleshed out version of the pseudo-java i posted earlier, with all of my addenda inlined and a few simple method calls changed to try and make the purpose more clear... Ok, now (I think) I see the difference between our ideas. From your code, it looks like you want the RequestParser to extract 'qt' that defines the RequestHandler. In my proposal, the RequestHandler is selected independent of the RequestParser. What do you imagine happens in: String p = pickRequestParser(req); This looks like you would have to have a standard way (per servlet) of getting the RequestParser. How do you envision that? What would be the standard way to choose your request parser? If the RequestHandler is defined by the RequestParser, I would suggest something like: interface SolrRequest { RequestHandler getHandler(); Iterable getContentStreams(); SolrParams getParams(); } interface RequestParser { SolrRequest getRequest( HttpServletRequest req ); // perhaps remove getHandler() from SolrRequest and add: RequestHandler getHandler(); } And then configure a servlet or filter with the RequestParser SolrRequestFilter ... RequestParser org.apache.solr.parser.StandardRequestParser Given that the number of RequestParsers is realistically small (as Yonik mentioned), I think this could be a good solution. To update my current proposal: 1. Servlet/Filter defines the RequestParser 2. RequestParser parses handler & request from HttpServletRequest 3. 
handled essentially as before To update the example URLs, defined by the "StandardRequestParser" /path/to/handler/?param where /path/to/handler is the "name" defined in solrconfig.xml To use a different RequestParser, it would need to be configured in web.xml /customparser/whatever/path/i/like - - - - - - - - - - - - - - I still don't see why: // let the parser preprocess the streams if it wants... Iterable s = solrParser.preprocess (getStreamInfo(req), new Pointer() { InputStream get() { return req.getInputStream(); }); SolrParams params = makeSolrRequest(req); // let the parser decide what to do with the existing streams, // or provide new ones Iterable s2 = solrParser.process(solrReq, s); // ServletSolrRequest is a basic impl of SolrRequest SolrRequest solrReq = new ServletSolrRequest(params, s); cannot be contained entirely in: SolrRequest solrReq = parser.parse( req ); assuming the SolrRequest interface includes Iterable getContentStreams(); the parser can use req.getInputStream() however it likes - either to make params and/or to build ContentStreams - - - - - - - - good good ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Cool. I think i need more examples... concrete is good :-) I don't quite grok your format below... is it one line or two? /path/defined/in/solrconfig:parser?params /${handler}:${parser} Is that simply /${handler}:${parser}?params yes. the ${} is just to show what is extracted from the request URI, not a specific example Imagine you have a CsvUpdateHandler defined in solrconfig.xml with a "name"="my/update/csv". The standard RequestParser could extract the parameters and Iterable for each of the following requests: POST: /my/update/csv/?separator=,&fields=foo,bar,baz (body) "10,20,30" POST: /my/update/csv/ multipart post with 5 files and 6 form fields defining (unlike the previous example, the handler would get 5 input streams rather than 1) GET: /my/update/csv/?post.remoteURL=http://..&separator=,&fields=foo,bar,baz&... fill the stream with the content from a remote URL GET: /my/update/csv/?post.body=bodycontent,&fields=foo,bar,baz&... use 'bodycontent' as the input stream. (note, this does not make much sense for csv, but is a useful example) POST: /my/update/csv:remoteurls/?separator=,&fields=foo,bar,baz (body) http://url1,http://url2,http://url3... In this case we would use a custom RequestParser ("remoteurls") that would read the post body and convert it to a stream of content urls. - - - - - - - The URL path (everything before the ':') would be entirely defined and configured by solrconfig.xml A filter would see if the request path matches a registered handler - if not it will pass it up the filter chain. This would allow custom filters and servlets to co-exist in the top level URL path. 
Consider: solrconfig.xml web.xml: MyRestfulDelete /mydelete/* POST: /delete?id=AAA would be sent to DeleteHandler POST: /mydelete/AAA/ would be sent to MyRestfulDelete Alternatively, you could have: solrconfig.xml web.xml: MyRestfulDelete /delete/* POST: /standard/delete?id=AAA would be sent to DeleteHandler POST: /delete/AAA/ would be sent to MyRestfulDelete I am suggesting we do not try to have the default request servlet/filter support extracting parameters from the URL. I think this is a reasonable tradeoff to be able to have the request path easily user-configurable using the *existing* plugin configuration. - - - - - - - - In a previous email, you mentioned changing the URL structure. With this proposal, we would continue to support: /select?wt=XXX for the Csv example, you would also be able to call: GET: /select?qt=/my/update/csv/&post.remoteURL=http://..&sepa... ryan
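The fall-through dispatch described above (match the request path against registered handlers, otherwise pass the request up the filter chain) can be sketched without the servlet API; a plain Map stands in for the solrconfig.xml registry, and all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed dispatch rule: match the path against handlers
// registered in solrconfig.xml; anything unmatched falls through so
// custom filters and servlets can co-exist in the top-level URL path.
class HandlerRegistry {
    private final Map<String, String> handlers = new HashMap<>();

    void register(String path, String handlerName) {
        handlers.put(path, handlerName);
    }

    /** Returns the handler for this path, or null to pass up the chain. */
    String dispatch(String path) {
        // strip the optional ":parser" suffix before the registry lookup
        int colon = path.indexOf(':');
        String handlerPath = (colon < 0) ? path : path.substring(0, colon);
        return handlers.get(handlerPath);
    }
}
```

With "/delete" registered, "/delete?id=AAA" style paths resolve to the handler, while "/mydelete/AAA/" returns null and falls through to whatever web.xml maps there.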
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: However, I'm not yet convinced the benefits are worth the costs. If : the number of RequestParsers remain small, and within the scope of : being included in the core, that functionality could just be included : in a single non-pluggable RequestParser. : : I'm not convinced it's a bad idea either, but I'd like to hear about : use cases for new RequestParsers (new ways of generically getting an : input stream)? I don't really see it being a very high cost ... and even if we can't imagine any other potential user-written RequestParser, we already know of at least 4 use cases we want to support out of the box for getting streams: 1) raw post body (as a single stream) 2) multi-part post body (file upload, potentially several streams) 3) local file(s) specified by path (1 or more streams) 4) remote resource(s) specified by URL(s) (1 or more streams) ...we could put all that logic in a single class that looks at a SolrParam to pick what method to use, or we could extract each one into its own class using a common interface ... either way we can hardcode the list of viable options if we want to avoid the issue of letting the client configure them .. but i still think it's worth the effort to talk about what that common interface might be. I think my idea of having both a preProcess and a process method in RequestParser so it can do things before and after the Servlet has extracted SolrParams from the URL would work in all of the cases we've thought of. -Hoss
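As a sketch of the non-pluggable alternative (one class that inspects a param to pick the stream source), three of the four cases might collapse to something like the following; the param names echo the "post.body"/"post.remoteURL" examples from earlier in the thread, but everything else is an assumption, and the multi-part case is elided:

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical single-class stream selection: look at params to decide
// where the content stream comes from, defaulting to the raw POST body.
class StreamSource {
    static InputStream open(Map<String, String> params, InputStream rawPost)
            throws Exception {
        if (params.containsKey("post.body")) {
            // inline body passed as a query param
            return new ByteArrayInputStream(
                params.get("post.body").getBytes(StandardCharsets.UTF_8));
        }
        if (params.containsKey("post.file")) {
            // local file specified by path
            return new FileInputStream(params.get("post.file"));
        }
        if (params.containsKey("post.remoteURL")) {
            // remote resource specified by URL
            return new URL(params.get("post.remoteURL")).openStream();
        }
        return rawPost; // case 1: raw POST body as a single stream
    }
}
```

Extracting each branch into its own class behind a common interface is the pluggable variant; the trade-off is exactly the one Hoss describes.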
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I was... then you talked me out of it! You are correct, the client : should determine the RequestParser independent of the RequestHandler. Ah ... this is the one problem with high volume on an involved thread ... i'm sending replies to messages you write after you've already read other replies to other messages you sent and changed your mind :) : Are you suggesting there would be multiple servlets each with : different methods to get the SolrParams from the url? How does the : servlet know if it can touch req.getParameter()? I'm suggesting that there *could* be multiple Servlets with multiple URL structures ... my worry is not that we need multiple options now, it's that i don't want to come up with an API for writing plugins that then has to be thrown out down the road if we want/need to change the URL : How would the default servlet fill up SolrParams? prior to calling RequestParser.preProcess, it would only access very limited parts of the HttpServletRequest -- the bare minimum it needs to pick a RequestParser ... probably just the path, maybe the HTTP Headers -- but if we had a URL structure where we really wanted to specify the RequestParser in a URL param it could do it using getQueryString *after* calling RequestParser.preProcess, the Servlet can access any part of the HttpServletRequest (because if the RequestParser wanted to use the raw POST InputStream it would have, and if it doesn't then it's fair game to let HttpServletRequest pull data out of it when the Servlet calls HttpServletRequest.getParameterMap() -- or any of the other HttpServletRequest methods) to build up the SolrParams however it wants based on the URL structure it wants to use ... then RequestParser.process can use those SolrParams to get any other streams it may want and add them to the SolrRequest. Here's a more fleshed out version of the pseudo-java i posted earlier, with all of my addenda inlined and a few simple method calls changed to try and make the purpose more clear... 
// Simple interface for having a lazy reference to something interface Pointer { T get(); } interface RequestParser { public void init(NamedList nl); // the usual /** will be passed the raw input stream from the * HttpServletRequest, ... as well as whatever other HttpServletRequest * header info we decide is important for the RequestParser to know * about the stream, and is safe for Servlets to access and make * available to the RequestParser (ie: HTTP method, content-type, * content-length, etc...) * * I'm using a NamedList instance instead of passing the * HttpServletRequest to maintain a good abstraction -- only the Servlet * knows about HTTP, so if we ever want to write an RMI interface to Solr, * the same RequestParser plugins will still work ... in practice it * might be better to explicitly spell out every piece of info about * the stream we want to pass * * This is the method where a RequestParser which is going to use the * raw POST body to build up either a single stream, or several streams * from a multi-part request, has the info it needs to do so. */ public Iterable preProcess(NamedList streamInfo, Pointer s); /** guaranteed that the second arg will be the result from * a previous call to preProcess, and that the Iterable from * preProcess will not have been inspected or touched in any way, nor * will any references to it be maintained after this call. * * this is the method where a RequestParser which is going to use * request params to open streams from local files, or remote URLs, * can do so -- a particularly ambitious RequestParser could use * both the raw POST data *and* remote files specified in params * because it has the choice of what to do with the * Iterable it returned from the earlier preProcess call. 
*/ public Iterable process(SolrRequest request, Iterable i); } class SolrUberServlet extends HttpServlet { // servlet specific method which does minimal inspection of // req to determine the parser name based on the URL private String pickRequestParser(HttpServletRequest req) { ... } // extracts just the most crucial info about the HTTP stream from the // HttpServletRequest, so it can be passed to RequestParser.preProcess // must be careful not to use anything that might access the stream. private NamedList getStreamInfo(HttpServletRequest req) { ... } // builds the SolrParams for the request using servlet specific URL rules, // this method is free to use anything in the HttpServletRequest // because it won't be called until after preProcess private SolrParams makeSolrRequestParams(HttpServletRequest req) { ... } public void service(HttpServletRequest req, HttpServletResponse response) { SolrCore core = getCore(); Solr(Query)Response solrRsp = new Solr(Query)Response(); String p = pickRequestParser(req);
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: On 1/18/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > Yes, this proposal would fix the URL structure to be > > /path/defined/in/solrconfig:parser?params > > /${handler}:${parser} > > > > I *think* this cleanly handles most cases cleanly and simply. The > > only exception is where you want to extract variables from the URL > > path. > > But that's not a hypothetical case, extracting variables from the URL > path is something I need now (to add metadata about the data in the > raw post body, like the CSV separator). > > POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz > with a body of "10,20,30" > Sorry, by "in the URL" I mean "in the URL path." The RequestParser can extract whatever it likes from getQueryString() The url you list above could absolutely be handled with the proposed format. Cool. I think i need more examples... concrete is good :-) I don't quite grok your format below... is it one line or two? /path/defined/in/solrconfig:parser?params /${handler}:${parser} Is that simply /${handler}:${parser}?params Or is it all one line where you actually have params twice? -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/18/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > Yes, this proposal would fix the URL structure to be > /path/defined/in/solrconfig:parser?params > /${handler}:${parser} > > I *think* this cleanly handles most cases cleanly and simply. The > only exception is where you want to extract variables from the URL > path. But that's not a hypothetical case, extracting variables from the URL path is something I need now (to add metadata about the data in the raw post body, like the CSV separator). POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz with a body of "10,20,30" Sorry, by "in the URL" I mean "in the URL path." The RequestParser can extract whatever it likes from getQueryString() The url you list above could absolutely be handled with the proposed format. The thing that could not be handled is: http://localhost:8983/solr/csv/foo/bar/baz/ with body "10,20,30" > There are plenty of ways to rewrite RESTful urls into a > path+params structure. If someone absolutely needs RESTful urls, it > can easily be implemented with a new Filter/Servlet that picks the > 'handler' and directly creates a SolrRequest from the URL path. While being able to customize something is good, having really good defaults is better IMO :-) We should also be focused on exactly what we want our standard update URLs to look like in parallel with the design of how to support them. again, i totally agree. My point is that I don't think we need to make the dispatch filter handle *all* possible ways someone may want to structure their request. It should offer the best defaults possible. If that is not sufficient, someone can extend it.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/18/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: Yes, this proposal would fix the URL structure to be /path/defined/in/solrconfig:parser?params /${handler}:${parser} I *think* this cleanly handles most cases cleanly and simply. The only exception is where you want to extract variables from the URL path. But that's not a hypothetical case, extracting variables from the URL path is something I need now (to add metadata about the data in the raw post body, like the CSV separator). POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz with a body of "10,20,30" There are plenty of ways to rewrite RESTful urls into a path+params structure. If someone absolutely needs RESTful urls, it can easily be implemented with a new Filter/Servlet that picks the 'handler' and directly creates a SolrRequest from the URL path. While being able to customize something is good, having really good defaults is better IMO :-) We should also be focused on exactly what we want our standard update URLs to look like in parallel with the design of how to support them. As a side note, with a change of URLs, we get a "free" chance to change whatever we want about the parameters or response format... backward compatibility only applies to the original URLs IMO. -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm confused by your sentence "A RequestParser converts a HttpServletRequest to a SolrRequest." .. i thought you were advocating that the servlet parse the URL to pick a RequestHandler, and then the RequestHandler dictates the RequestParser? I was... then you talked me out of it! You are correct, the client should determine the RequestParser independent of the RequestHandler. : /path/registered/in/solr/config:requestparser?params : : If no ':' is in the URL, use 'standard' parser : : 1. The URL path determines the RequestHandler : 2. The URL path determines the RequestParser : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response do you mean the path before the colon determines the RequestHandler and the path after the colon determines the RequestParser? yes, that is my proposal. fine too ... i was specifically trying to avoid making any design decisions that required a particular URL structure, in what you propose we are dictating more than just the "/handler/path:parser" piece of the URL, we are also dictating that the Parser decides how the rest of the path and all URL query string data will be interpreted ... Yes, this proposal would fix the URL structure to be /path/defined/in/solrconfig:parser?params /${handler}:${parser} I *think* this cleanly handles most cases cleanly and simply. The only exception is where you want to extract variables from the URL path. There are plenty of ways to rewrite RESTful urls into a path+params structure. If someone absolutely needs RESTful urls, it can easily be implemented with a new Filter/Servlet that picks the 'handler' and directly creates a SolrRequest from the URL path. In my opinion, for this level of customization it is reasonable that people edit web.xml and put in their own servlets and filters. 
what i'm proposing is that the Servlet decide how to get the SolrParams out of an HttpServletRequest, using whatever URL that servlet wants; I guess I'm not understanding this yet: Are you suggesting there would be multiple servlets each with different methods to get the SolrParams from the url? How does the servlet know if it can touch req.getParameter()? How would the default servlet fill up SolrParams? I think i'm getting confused ... i thought you were advocating that RequestParsers be implemented as ServletFilters (or Servlets) ... Originally I was... but again, you talked me out of it. (this time not totally) I think the /path:parser format is clear and allows for most everything off the shelf. If you want to do something different, that can easily be a custom filter (or servlet) Essentially, i think it is reasonable for people to skip 'RequestParsers' in a custom servlet and be able to build the SolrRequest directly. This level of customization is reasonable to handle directly with web.xml
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
OK, trying to catch up on this huge thread... I think I see why it's become more complicated than I originally envisioned. What I originally thought: 1) add a way to get a Reader or InputStream from SolrQueryRequest, and then reuse it for updates too 2) use the plugin name in the URL 3) write code that could handle multi-part post, or could grab args from the URL. 4) profit! I think the main additional complexity is the idea that the RequestParser (#3) be both pluggable and able to be specified in the actual request. I hadn't considered that, and it's an interesting idea. Without a pluggable RequestParser: - something like the CSV loader would have to check the params for a "file" param and if so, open the local file itself With a pluggable RequestParser: - the LocalFileRequestParser would be specified in the url (like /update/csv:local) and it will handle looking for the "file" param and opening the file. The CSV plugin can be a little simpler by just getting a Reader. - a new way of getting a stream could be developed (a new RequestParser) and most stream oriented plugins could just use it. However, I'm not yet convinced the benefits are worth the costs. If the number of RequestParsers remains small, and within the scope of being included in the core, that functionality could just be included in a single non-pluggable RequestParser. I'm not convinced it's a bad idea either, but I'd like to hear about use cases for new RequestParsers (new ways of generically getting an input stream)? -Yonik
RE: Update Plugins (was Re: Handling disparate data sources in Solr)
: > With all this talk about plugins, registries etc., /me can't help : > thinking that this would be a good time to introduce the Spring IoC : > container to manage this stuff. I don't have a lot of familiarity with spring except for the XML configuration file used for telling the spring context what objects you want it to create on startup, what constructor args to pass them, what methods to call, and so on -- with an easy ability to tell it to pass one object you had it construct as a param to another object you are having it construct. on the whole, it seems really nice, and eventually using it to replace a lot of the home-grown configuration in Solr would probably make a lot of sense ... but i don't think migrating to Spring is necessary as part of the current push to support more configurable plugins for updates ... Solr already has a pretty decent set of utilities for allowing class instances to be specified in the xml config file and have configuration arguments passed to them on initialization .. it's not as fancy as spring and it doesn't support as many features as spring, but it works well enough that it should be easy to use with the new plugins we start to add -- switching to spring right now would probably only complicate the issues, and probably wouldn't make adding Update plugins any easier. equally important: adding a few new types of plugins now probably won't make it any harder to switch to something like spring later ... which as i said, is something i definitely anticipate happening -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I think the confusion is that (in my view) the RequestParser is the : *only* object able to touch the stream. I don't think anything should : happen between preProcess() and process(); A RequestParser converts a : HttpServletRequest to a SolrRequest. Nothing else will touch the : servlet request. that makes it the RequestParser's responsibility to dictate the URL format (if it's the only one that can touch the HttpServletRequest) i was proposing a method by which the Servlet could determine the URL format -- there could in fact be multiple servlets supporting different URL formats if we had some need for it -- and the RequestParser could generate streams based on the raw POST data and/or any streams it wants to find based on the SolrParams generated from the URL (ie: local files, remote resources, etc) I'm confused by your sentence "A RequestParser converts a HttpServletRequest to a SolrRequest." .. i thought you were advocating that the servlet parse the URL to pick a RequestHandler, and then the RequestHandler dictates the RequestParser? : /path/registered/in/solr/config:requestparser?params : : If no ':' is in the URL, use 'standard' parser : : 1. The URL path determines the RequestHandler : 2. The URL path determines the RequestParser : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response do you mean the path before the colon determines the RequestHandler and the path after the colon determines the RequestParser? ... that would work fine too ... 
i was specifically trying to avoid making any design decisions that required a particular URL structure, in what you propose we are dictating more than just the "/handler/path:parser" piece of the URL, we are also dictating that the Parser decides how the rest of the path and all URL query string data will be interpreted -- which means if we have a PostBodyRequestParser and a LocalFileRequestParser and a RemoteUrlRequestParser which all use the query string params to get the SolrParams for the request (and in the case of the last two: to know what file/url to parse) and then we decide that we want to support a URL structure that is more REST like and uses the path for including information, now we have to write a new version of all of those RequestParsers (a subclass of each, probably) that knows what our new URL structure looks like ... even if that never comes up, every RequestParser (even custom ones written by users to use some crazy proprietary binary protocols we've never heard of to fetch streams of data) has to worry about extracting the SolrParams out of the URL. what i'm proposing is that the Servlet decide how to get the SolrParams out of an HttpServletRequest, using whatever URL that servlet wants; the RequestParser decides how to get the ContentStreams needed for that request -- in a way that can work regardless of whether the stream is actually part of the HttpServletRequest, or just referenced by a param in the request; the RequestHandler decides what to do with those params and streams; and the ResponseWriter decides how to format the results produced by the RequestHandler back to the client. : > : If anyone needs to customize this chain of events, they could easily : > : write their own Servlet/Filter : I don't *think* this would happen often, and people would only do : it if they are unhappy with the default URL structure -> behavior : mapping. I am not suggesting this would be the normal way to : configure solr. 
I think i'm getting confused ... i thought you were advocating that RequestParsers be implemented as ServletFilters (or Servlets) ... but if that were the case it wouldn't just be about changing the URL structure, it would be about picking new ways to get streams .. but that doesn't seem to be what you are suggesting, so i'm not sure what i was misunderstanding. -Hoss
RE: Update Plugins (was Re: Handling disparate data sources in Solr)
Sorry for the "flame", but I've used spring on 2 large projects and it worked out great.. you should check out some of the GUIs to help manage the XML configuration files, if the configuration is the reason your team thought it was a nightmare (we broke ours up to help).. Jeryl Cook -Original Message- From: Alan Burlison [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 16, 2007 10:52 AM To: solr-dev@lucene.apache.org Subject: Re: Update Plugins (was Re: Handling disparate data sources in Solr) Bertrand Delacretaz wrote: > With all this talk about plugins, registries etc., /me can't help > thinking that this would be a good time to introduce the Spring IoC > container to manage this stuff. > > More info at http://www.springframework.org/docs/reference/beans.html > for people who are not familiar with it. It's very easy to use for > simple cases like the ones we're talking about. Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : I'm not sure i understand preProcess( ) and what it gets us. it gets us the ability for a RequestParser to be able to pull out the raw InputStream from the HTTP POST body, and make it available to the RequestHandler as a ContentStream and/or it can wait until the servlet has parsed the URL to get the params and *then* it can generate ContentStreams based on those param values. - preProcess is necessary to write a RequestParser that can handle the current POST raw XML model, - process is necessary to write RequestParsers that can get file names or URLs out of escaped query params and fetch them as streams I think the confusion is that (in my view) the RequestParser is the *only* object able to touch the stream. I don't think anything should happen between preProcess() and process(); A RequestParser converts a HttpServletRequest to a SolrRequest. Nothing else will touch the servlet request. : 1. The URL path selects the RequestHandler : 2. RequestParser = RequestHandler.getRequestParser() (typically from : its default params) : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response the problem i see with that, is that the RequestHandler shouldn't have any say in what RequestParser is used -- ... got it. Then i vote we use a syntax like: /path/registered/in/solr/config:requestparser?params If no ':' is in the URL, use 'standard' parser 1. The URL path determines the RequestHandler 2. The URL path determines the RequestParser 3. SolrRequest = RequestParser.parse( HttpServletRequest ) 4. handler.handleRequest( req, res ); 5. 
write the response : If anyone needs to customize this chain of events, they could easily : write their own Servlet/Filter this is why i was confused about your Filter comment earlier: if the only way a user can customize behavior is by writing a Servlet, they can't specify that servlet in a solr config file -- they'd have to unpack the war and manually edit the web.xml ... which makes upgrading a pain. I don't *think* this would happen often, and people would only do it if they are unhappy with the default URL structure -> behavior mapping. I am not suggesting this would be the normal way to configure solr. The main case where I imagine someone would need to write their own servlet/filter is if they insist the parameters need to be in the URL. For example: /delete/id/ The URL structure I am proposing could not support this (unless you had a handler mapped to each id :) ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I'm not sure i understand preProcess( ) and what it gets us. it gets us the ability for a RequestParser to be able to pull out the raw InputStream from the HTTP POST body, and make it available to the RequestHandler as a ContentStream and/or it can wait until the servlet has parsed the URL to get the params and *then* it can generate ContentStreams based on those param values. - preProcess is necessary to write a RequestParser that can handle the current POST raw XML model, - process is necessary to write RequestParsers that can get file names or URLs out of escaped query params and fetch them as streams : 1. The URL path selects the RequestHandler : 2. RequestParser = RequestHandler.getRequestParser() (typically from : its default params) : 3. SolrRequest = RequestParser.parse( HttpServletRequest ) : 4. handler.handleRequest( req, res ); : 5. write the response the problem i see with that, is that the RequestHandler shouldn't have any say in what RequestParser is used -- the client is the only one that knows what type of data they are sending to Solr, they should put information in the URL that directly picks the RequestParser. If you think about it in terms of the current POSTing XML model, an XmlUpdateRequestHandler that reads in our "..." style info shouldn't know anywhere in its configuration where that stream of XML bytes came from -- when it gets asked to handle the request, all it should know is that it has some optional params, and an InputStream to work with ... the RequestParser's job is to decide where that input stream came from. : If anyone needs to customize this chain of events, they could easily : write their own Servlet/Filter this is why i was confused about your Filter comment earlier: if the only way a user can customize behavior is by writing a Servlet, they can't specify that servlet in a solr config file -- they'd have to unpack the war and manually edit the web.xml ... which makes upgrading a pain. -Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm not sure i understand preProcess( ) and what it gets us.

I like the model that
1. The URL path selects the RequestHandler
2. RequestParser = RequestHandler.getRequestParser() (typically from its default params)
3. SolrRequest = RequestParser.parse( HttpServletRequest )
4. handler.handleRequest( req, res );
5. write the response

If anyone needs to customize this chain of events, they could easily write their own Servlet/Filter

On 1/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

Actually, i have to amend that ... it occurred to me in my sleep last night that calling HttpServletRequest.getInputStream() wasn't safe unless we *know* the RequestParser wants it, and will close it if it's non-null, so the API for preProcess would need to look more like this...

interface Pointer<T> { T get(); }

interface RequestParser {
  ...
  /** this will be passed a "Pointer" to the raw input stream from the
   * HttpServletRequest, ... if this method accesses the InputStream
   * from the pointer, it is required to close it if it is non-null.
   */
  public Iterable preProcess(SolrParam headers, Pointer<InputStream> s);
  ...
}

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Actually, i have to amend that ... it occurred to me in my sleep last night that calling HttpServletRequest.getInputStream() wasn't safe unless we *know* the RequestParser wants it, and will close it if it's non-null, so the API for preProcess would need to look more like this...

interface Pointer<T> { T get(); }

interface RequestParser {
  ...
  /** this will be passed a "Pointer" to the raw input stream from the
   * HttpServletRequest, ... if this method accesses the InputStream
   * from the pointer, it is required to close it if it is non-null.
   */
  public Iterable preProcess(SolrParam headers, Pointer<InputStream> s);
  ...
}

-Hoss
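The laziness Hoss is after can be shown with a small self-contained sketch (all names hypothetical, `java.util.function`-style lambdas standing in for the servlet stream): the servlet hands the parser a `Pointer` instead of calling `getInputStream()` eagerly, so the stream is opened only if the parser actually dereferences the pointer, and the parser that opens it is the one obligated to close it.

```java
import java.io.*;

// Sketch of the Pointer idea from the amended API above. The servlet
// stream is simulated with a ByteArrayInputStream; "opened" records
// whether getInputStream() was ever actually triggered.
public class PointerDemo {
    interface Pointer<T> { T get(); }

    static boolean opened = false;

    // Lazy handle: nothing happens until (and unless) get() is called.
    static Pointer<InputStream> lazyBody(byte[] body) {
        return () -> {
            opened = true;  // simulate opening the servlet's input stream
            return new ByteArrayInputStream(body);
        };
    }

    // A params-only parser never dereferences the pointer, so the
    // servlet stream is never opened and nothing needs closing.
    static void paramOnlyParser(Pointer<InputStream> p) { /* never calls p.get() */ }

    // A raw-POST parser reads the body, and per the contract must close
    // whatever it obtained (try-with-resources handles that here).
    static String rawPostParser(Pointer<InputStream> p) {
        try (InputStream in = p.get()) {
            return new String(in.readAllBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```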
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
At 11:48 PM -0800 1/16/07, Chris Hostetter wrote:
>yeah ... once we have a RequestHandler doing that work, and populating a
>SolrQueryResponse with its result info, it would probably be pretty
>trivial to make an extremely bare-bones LegacyUpdateOutputWriter that
>only expected that simple amount of response data and wrote it out in
>the current update response format .. so the current SolrUpdateServlet
>could be completely replaced with a simple url mapping...
>
>    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

Yah!  But in my vision it would be /update -> qt=update because pathInfo is "update".  There's no need to remap anything in the URL, the existing SolrServlet is ready for dispatch once it:
- Prepares request params into SolrParams
- Sets params("qt") to pathInfo
- Somehow (perhaps with StreamIterator) prepares streams for RequestParser use

I'm still trying to conceptually maintain a separation of concerns between handling the details of HTTP (servlet-layer) and handling different payload encodings (a different layer, one I believe can be invoked after config is read).  The following is "vision" more than "proposal" or "suggestion"...

legacyxml xml

So when incoming URL comes in:

    /update?rp=json

the pipeline which is established is:

    SolrServlet -> solr.JSONStreamRequestParser
      |
      |- request data carrier e.g. SolrQueryRequest
      |
    lets.write.this.UpdateRequestHandler
      |
      |- response data carrier e.g. SolrQueryResponse
      |
    do.we.really.need.LegacyUpdateOutputWriter

I expect this is all fairly straightforward, except for one sticky question: Is there a "universal" format which can efficiently (e.g. lazily, for stream input) convey all kinds of different request body encodings, such that the RequestHandler has no idea how it was dispatched?  Something to think about...

- J.J.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Ryan McKinley wrote:

In addition, consider the case where you want to index a SVN repository.  Yes, this could be done in a SolrRequestParser that logs in and returns the files as a stream iterator.  But this seems like more 'work' than the RequestParser is supposed to do.  Not to mention you would need to augment the Document with svn specific attributes.

This is indeed one of the things I'd like to do - use Solr as a back-end for OpenGrok (http://www.opensolaris.org/os/project/opengrok/)

--
Alan Burlison
--
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote: i'm totally on board now ... the RequestParser decides where the streams come from if any (post body, file upload, local file, remote url, etc...); the RequestHandler decides what it wants to do with those streams, and has a library of DocumentProcessors it can pick from to help it parse them if it wants to, then it takes whatever actions it wants, and puts the response information in the existing Solr(Query)Response class, which the core hands off to any of the various OutputWriters to format according to the users wishes. +1 -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
talking about the URL structure made me realize that the Servlet should dictate the URL structure and the param parsing, but it should do it after giving the RequestParser a crack at any streams it wants (actually i think that may be a direct quote from JJ ... can't remember now) ... *BUT* the RequestParser may not want to provide a list of streams until the params have been parsed (if for example, one of the params is the name of a file)

so what if the interface for RequestParser looked like this...

interface RequestParser {

  public init(NamedList nl); // the usual

  /** will be passed the raw input stream from the
   * HttpServletRequest, ... may need other HttpServletRequest info as
   * SolrParam (ie: method, content-type/content-length, ...) but we use
   * a SolrParam instance instead of the HttpServletRequest to
   * maintain an abstraction.
   */
  public Iterable preProcess(SolrParam headers, InputStream s);

  /** guaranteed that the second arg will be the result from
   * a previous call to preProcess, and that that Iterable from
   * preProcess will not have been inspected or touched in any way, nor
   * will any references to it be maintained after this call.
   * this method is responsible for calling
   * request.setContentStreams(Iterable i);
   */
  public void process(SolrRequest request, Iterable streams);
}

...the idea being that many RequestParsers will choose to implement one or both of those methods as a NOOP that just returns null but if they want to implement both, they have the choice of obliterating the Iterable returned by preProcess and completely replacing it once they see the SolrParams in the request

: specifically what i had in mind was something like this...
:
: class SolrUberServlet extends HttpServlet {
:   public service(HttpServletRequest req, HttpServletResponse response) {
:     SolrCore core = getCore();
:     Solr(Query)Response solrRsp = new Solr(Query)Response();
:
:     // servlet specific method which does minimal inspection of
:     // req to determine the parser name
:     String p = pickRequestParser(req);
:
:     // looks up a registered instance (from solrconfig.xml)
:     // matching that name
:     RequestParser solrParser = coreGetParserByName(p);

      // let the parser preprocess the streams if it wants...
      Iterable s = solrParser.preProcess(req.getInputStream())

      // build the request using servlet specific URL rules
      Solr(Query)Request solrReq = makeSolrRequest(req);

      // let the parser decide what to do with the existing streams,
      // or provide new ones
      solrParser.process(solrReq, s);

:     // does exactly what it does now: picks the RequestHandler to
:     // use based on the params, calls its handleRequest method
:     core.execute(solrReq, solrRsp)
:
:     // the rest of this is cut/paste from the current SolrServlet.
:     // use SolrParams to pick OutputWriter name, ask core for instance,
:     // have that writer write the results.
:     QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
:     response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
:     PrintWriter out = response.getWriter();
:     responseWriter.write(out, solrReq, solrRsp);
:   }
: }
:
: -Hoss

-Hoss
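The amended servlet flow (pick parser by name, preProcess the raw body, parse params, let the parser finish, then execute) can be exercised end-to-end with the servlet pieces stubbed out. Everything here is illustrative: streams are plain strings, params a plain map, and the handler step is a stand-in for `core.execute`.

```java
import java.util.*;

// End-to-end sketch of the dispatch flow in the pseudo-servlet above,
// with servlet types stubbed as strings and maps. Names are illustrative.
public class DispatchDemo {
    interface RequestParser {
        List<String> preProcess(String rawBody);  // before params are parsed
        List<String> process(Map<String, String> params, List<String> streams);
    }

    // A raw-POST parser: grabs the body early, leaves the streams
    // untouched once the params are known.
    static final RequestParser RAW = new RequestParser() {
        public List<String> preProcess(String rawBody) { return List.of(rawBody); }
        public List<String> process(Map<String, String> p, List<String> s) { return s; }
    };

    // Simulates the servlet: look up the parser by name (as from
    // solrconfig.xml), preProcess, parse params, let the parser finish,
    // then hand the streams off to the handler (core.execute stand-in).
    static String service(Map<String, RequestParser> registry, String parserName,
                          String rawBody, Map<String, String> params) {
        RequestParser parser = registry.get(parserName);
        List<String> streams = parser.preProcess(rawBody);
        streams = parser.process(params, streams);
        return "handled " + streams.size() + " stream(s)";
    }
}
```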
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
data and wrote it out in the current update response format .. so the current SolrUpdateServlet could be completely replaced with a simple url mapping...

    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

Using the filter method above, it could (and i think should) be mapped to:

    /update
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: >I left out "micro-plugins" because i don't quite have a good answer
: >yet :)  This may be a place where a custom dispatcher servlet/filter
: >defined in web.xml is the most appropriate solution.
:
: If the issue is munging HTTPServletRequest information, then a proper
: separation of concerns suggests responsibility should lie with a Servlet
: Filter, as Ryan suggests.

I'm not making sense of this ... i don't see how the micro-plugins (aka: RequestParsers) could be implemented as Filters and still be plugins that users could provide ... don't Filters have to be specified in the web.xml

Yes.  I'm suggesting we map a filter to intercept ALL requests, then see which ones it should handle.  Consider:

public void doFilter(ServletRequest request, ServletResponse response,
    FilterChain chain) throws IOException, ServletException {
  if(request instanceof HttpServletRequest) {
    HttpServletRequest req = (HttpServletRequest) request;
    String path = req.getServletPath();
    SolrRequestHandler handler = core.getRequestHandler( path );
    if( handler != null ) {
      // HANDLE THE REQUEST
      return;
    }
  }
  // Otherwise let the webapp handle the request
  chain.doFilter(request, response);
}

... is there some programmatic way a Servlet or Filter can register other Servlets/Filters dynamically when the application is initialized? ... if users have to extract the solr.war and modify the web.xml to add a RequestParser they've written, that doesn't seem like much of a plugin :)

You would not need to extract the war, just change the registered handler name.

ryan
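The core of the "map a filter to everything" trick is just a registry lookup: take the request, otherwise fall through to `chain.doFilter`. A minimal sketch with the servlet API replaced by plain strings (handler names and paths here are made up for illustration):

```java
import java.util.*;

// Sketch of the catch-all filter dispatch logic above, stripped of the
// servlet API. A null return means "not ours -- chain.doFilter(...)".
public class FilterDemo {
    // Stand-in for core.getRequestHandler's registry (names are made up).
    static final Map<String, String> handlers =
        Map.of("/update", "XmlUpdateHandler", "/select", "StandardHandler");

    // Returns the handler name if Solr should take the request, or null
    // to mean "let the webapp handle it".
    static String dispatch(String servletPath) {
        return handlers.get(servletPath);
    }
}
```

Because the registry lives in solrconfig.xml rather than web.xml, adding a new endpoint is a config change, not a war rebuild, which is the point Ryan makes about not needing to extract the war.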
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
kind of like a binary stream equivalent to the way analyzers can be customized -- is that kind of what you had in mind?

exactly.

interface SolrDocumentParser {
  public init(NamedList args);
  Document parse(SolrParams p, ContentStream content);
}

yes
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > - Revise the XML-based update code (broken out of SolrCore into a
: > RequestHandler) to use all the above.
:
: +++1, that's been needed forever.

yeah ... once we have a RequestHandler doing that work, and populating a SolrQueryResponse with its result info, it would probably be pretty trivial to make an extremely bare-bones LegacyUpdateOutputWriter that only expected that simple amount of response data and wrote it out in the current update response format .. so the current SolrUpdateServlet could be completely replaced with a simple url mapping...

    /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Jan 17, 2007, at 1:41 AM, Chris Hostetter wrote:

: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs independently from the URL structure ... if we have a set of APIs, it's easy to come up with a URL structure that will map well (we could theoretically have several URL structures using different servlets) but if we worry too much about what the URL should look like, we may hamstring the model design.

+1

web.xml allows for servlets to be mapped however desired, and cleverly using servlet filters could add in some other URL mapping goodness, or in the extreme must-have-certain-URLs case there is always mod_rewrite.

I still think a microcontainer is a good way to go for solr.  It's exactly what microcontainers were designed for.  While not spring-savvy myself (but tinkered with HiveMind via Tapestry a while back), I know enough to reiterate that it's not heavy or horrible for basic IoC, which is what is being reinvented in a sense.

Erik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: >I left out "micro-plugins" because i don't quite have a good answer
: >yet :)  This may be a place where a custom dispatcher servlet/filter
: >defined in web.xml is the most appropriate solution.
:
: If the issue is munging HTTPServletRequest information, then a proper
: separation of concerns suggests responsibility should lie with a Servlet
: Filter, as Ryan suggests.

I'm not making sense of this ... i don't see how the micro-plugins (aka: RequestParsers) could be implemented as Filters and still be plugins that users could provide ... don't Filters have to be specified in the web.xml ... is there some programmatic way a Servlet or Filter can register other Servlets/Filters dynamically when the application is initialized? ... if users have to extract the solr.war and modify the web.xml to add a RequestParser they've written, that doesn't seem like much of a plugin :)

In general i'm not too worried about what the URL structure looks like ... i agree it makes the most sense for the RequestParser to be determined using the path, but beyond that i don't think it matters much -- the existing servlet could stay around as is with a hardcoded use of a "DefaultRequestParser" that doesn't provide any streams and gets the params from HttpServletRequest, while a new Servlet could get the qt and wt from the path info as well.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > : In addition to RequestProcessors, maybe there should be a general
: > : DocumentProcessor
: > : interface SolrDocumentParser
: > : {
: > :   Document parse(ContentStream content);
: > : }

: > what else would the RequestProcessor do if it was delegating all of the
: > parsing to something else?

: Parsing is just one task that a RequestProcessor may do. It is the
: entry point for all kinds of stuff: searching, admin tasks, augmenting
: search results with SQL queries, writing uploaded files to the file
: system.  This is where people will do whatever suits their fancy.

ah ... i see what you mean.  so DocumentProcessors would be reusable classes that RequestHandlers/RequestProcessors could use to parse streams -- but instead of needing to hardcode class dependencies in the RequestHandler on specific DocumentProcessors, the RequestHandler could do a "lookup" on the mime/type of the stream (or any other key it wanted to i suppose) to parse the stream ... so you could have a SimpleHtmlDocumentProcessor that you use, and then one day you replace it with a ComplexHtmlDocumentProcessor which you probably have to configure a bit differently but you don't have to recompile your RequestHandler ... kind of like a binary stream equivalent to the way analyzers can be customized -- is that kind of what you had in mind?

(i was confused and thinking that picking a DocumentProcessor would be done by the core independent of picking the RequestHandler --- just like the OutputWriter is)

: In addition, consider the case where you want to index a SVN
: repository.  Yes, this could be done in a SolrRequestParser that logs in
: and returns the files as a stream iterator.  But this seems like more
: 'work' than the RequestParser is supposed to do.  Not to mention you
: would need to augment the Document with svn specific attributes.
:
: Parsing a PDF file from svn should (be able to) use the same parser if
: it were uploaded via HTTP POST.

i'm totally on board now ...
the RequestParser decides where the streams come from if any (post body, file upload, local file, remote url, etc...); the RequestHandler decides what it wants to do with those streams, and has a library of DocumentProcessors it can pick from to help it parse them if it wants to, then it takes whatever actions it wants, and puts the response information in the existing Solr(Query)Response class, which the core hands off to any of the various OutputWriters to format according to the users wishes.

The DocumentProcessors are the ones that are really going to need a lot of configuration telling them how to map the chunks of data from the stream to fields in the schema -- but in the same way that OutputWriters get the request after the RequestHandler has had a chance to wrap the SolrParams, it probably makes sense to let the request handler override configuration for the DocumentProcessors as well (so i can say "normally i want the HtmlDocumentProcessor to map these HTML elements to these schema fields ... but i have one type of HTML doc that breaks the rules, so i'll use a separate RequestHandler to index them, and it will override some of those field mappings")

interface SolrDocumentParser {
  public init(NamedList args);
  Document parse(SolrParams p, ContentStream content);
}

-Hoss
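The registry-plus-overrides idea above can be sketched concretely: a content-type keyed registry of parsers, default field mappings from config, and a handler-supplied override layer merged on top. All names and the string-based "parsing" are hypothetical stand-ins for illustration.

```java
import java.util.*;

// Sketch of a content-type -> DocumentParser registry where a
// RequestHandler can override the configured field mappings. All names
// and the toy parse output are illustrative.
public class ParserRegistryDemo {
    interface DocumentParser {
        String parse(Map<String, String> fieldMap, String content);
    }

    // Toy "HTML" parser: just reports which schema field the title maps to.
    static final DocumentParser HTML = (map, content) ->
        "html doc, title->" + map.getOrDefault("title", "title");

    // Stand-in for the solrconfig registry ("text/html" -> HtmlDocumentParser).
    static final Map<String, DocumentParser> registry = Map.of("text/html", HTML);

    // Defaults come from config; a handler may layer its own overrides
    // on top (the "one type of HTML doc that breaks the rules" case).
    static String parse(String contentType, Map<String, String> defaults,
                        Map<String, String> overrides, String content) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(overrides);
        return registry.get(contentType).parse(merged, content);
    }
}
```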
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > So to understand better:
: >
: > user request -> micro-plugin -> RequestHandler -> ResponseHandler

: or:
:
: HttpServletRequest -> SolrRequestParser -> SolrRequestProcessor ->
: SolrResponse -> SolrResponseWriter

specifically what i had in mind was something like this...

class SolrUberServlet extends HttpServlet {
  public service(HttpServletRequest req, HttpServletResponse response) {
    SolrCore core = getCore();
    Solr(Query)Response solrRsp = new Solr(Query)Response();

    // servlet specific method which does minimal inspection of
    // req to determine the parser name
    String p = pickRequestParser(req);

    // looks up a registered instance (from solrconfig.xml)
    // matching that name
    RequestParser solrParser = coreGetParserByName(p);

    // RequestParser is the only plugin class that knows about
    // HttpServletRequest, it builds up the SolrRequest (aka
    // SolrQueryRequest) which contains the SolrParams and streams
    SolrRequest solrReq = solrParser.parse(req);

    // does exactly what it does now: picks the RequestHandler to
    // use based on the params, calls its handleRequest method
    core.execute(solrReq, solrRsp)

    // the rest of this is cut/paste from the current SolrServlet.
    // use SolrParams to pick OutputWriter name, ask core for instance,
    // have that writer write the results.
    QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
    response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
    PrintWriter out = response.getWriter();
    responseWriter.write(out, solrReq, solrRsp);
  }
}

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs independently from the URL structure ... if we have a set of APIs, it's easy to come up with a URL structure that will map well (we could theoretically have several URL structures using different servlets) but if we worry too much about what the URL should look like, we may hamstring the model design.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, J.J. Larrea <[EMAIL PROTECTED]> wrote: - Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above. +++1, that's been needed forever. If one has the time, I'd also advocate moving to StAX (via woodstox for Java5, but it's built into Java6). -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, J.J. Larrea <[EMAIL PROTECTED]> wrote:
>POST:
>  if( multipart ) {
>    read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be done automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

Unfortunately, curl puts application/x-www-form-urlencoded in there by default.  Our current implementation of updates always ignores that and treats the stream as binary.  An alternative for non-multipart posts could check the URL for args, and if they are there, treat the body as the input instead of params.

$ curl http://localhost:5000/a/b?foo=bar --data-binary "hi there"

$ nc -l -p 5000
POST /a/b?foo=bar HTTP/1.1
User-Agent: curl/7.15.4 (i686-pc-cygwin) libcurl/7.15.4 OpenSSL/0.9.8d zlib/1.2.3
Host: localhost:5000
Accept: */*
Content-Length: 8
Content-Type: application/x-www-form-urlencoded

hi there

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/15/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring out the "model" or how updates should be handled in a generic way, what all of the "Plugin" types are, and what their APIs should be is the most important issue -- once we have those issues settled we could always write a new "SolrServlet2" that made the URL structure work any way we want.

The number of people writing update plugins will be small compared to the number of users using the external HTTP API (the URL + query parameters, and the relationship URL-wise between different update formats).  My main concern is making *that* as nice and utilitarian as possible, and any plugin stuff is implementation and a secondary concern IMO.

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm in frantic deadline mode so I'm just going to throw in some (hopefully) short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:
>>the one thing that still seems missing is those "micro-plugins" i was
>> [SNIP]
>>
>>  interface SolrRequestParser {
>>    SolrRequest process( HttpServletRequest req );
>>  }
>
>I left out "micro-plugins" because i don't quite have a good answer
>yet :)  This may be a place where a custom dispatcher servlet/filter
>defined in web.xml is the most appropriate solution.

If the issue is munging HTTPServletRequest information, then a proper separation of concerns suggests responsibility should lie with a Servlet Filter, as Ryan suggests.  For example, while the Servlet 2.4 spec doesn't have specifications for how the servlet container can/should "burst" a multipart-MIME payload into separate files or streams, there are a number of 3rd party Filters which do this.

The Iterator is a great idea because if each stream is read to completion before the next is opened it doesn't impose any limitation on individual stream length and doesn't require disk buffering.  (Of course some handlers may require access to more than one stream at a time; each time next() is called on the iterator before the current stream is closed, the remainder of that stream will have to be buffered in memory or on disk, depending on the part length.  Nonetheless that detail can be entirely hidden from the handler, as it should be.  I am not sure if any available ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets one immediately implement it with only one next() value: the entire POST stream.  That would answer the needs of the existing update request handling code, but establish an API to handle multi-part.  Whenever someone wants to write a multi-stream handler, they can write or find a better Iterator implementation, which would best be cast as a ServletFilter.
>I like the SolrRequestParser suggestion.

Me too.  It answers a hole in my vision for how this can all fit together.

>Consider:
>qt='RequestHandler'
>wt='ResponseWriter'
>rp='RequestParser' (rb='SolrBuilder'?)
>
>To avoid possible POST read-ahead stream mangling: qt, wt, and rp
>should be defined by the URL, not parameters.  (We can add special
>logic to allow /query?qt=xxx)
>
>For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
>define arbitrary path mapping for qt.
>
>We could append 'wt', 'rb', and arbitrary text to the
>registered path, something like
>  /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...
>
>(any other syntax ideas?)

No need for new syntax, I think.  The pathInfo or qt or other source resolves to a requestHandler CONFIG name.  The handler config is read to determine the handler class name.  It also can be consulted (with URL or form-POST params overriding if allowed by the config) to decide which RequestParser to invoke BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER.  Once those objects are set up, the request body gets executed.

Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly simplify, for example, creating a dozen query handlers which used a particular invariant combination of qt, wt, and rp

>The 'standard' RequestParser would:
>GET:
>  fill up SolrParams directly with req.getParameterMap()
>  if there is a 'post' parameter (post=XXX)
>    return a stream with XXX as its content
>  else
>    empty iterator.
>  Perhaps add a standard way to reference a remote URI stream.
>
>POST:
>  if( multipart ) {
>    read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be done automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

>    return an iterator over the collection of files

Collection of streams, per Hoss.
>  }
>  else {
>    no parameters?  parse parameters from the URL? /name:value/
>    return the body stream

As above, this introduces unneeded complexity and should be avoided.

>  }
>DEL:
>  throw unsupported exception?
>
>
>Maybe each RequestHandler could have a default RequestParser.  If we
>limited the 'arbitrary path' to one level, this could be used to
>generate more RESTful URLs.  Consider:
>
>/myadder////
>
>/myadder maps to MyCustomHandler and that gives you
>MyCustomRequestBuilder that maps /// to SolrParams

I think these are best left for an extra-SOLR layer, especially since SOLR URLs are meant for interprogram communication and not direct use by non-developer end users.  For example, for my org's website I have hundreds of Apache mod_rewrite rules which do URL munging such as /journals/abc/7/3/192a.pdf into /journalroot/index.cfm?journal=abc&volume=7&issue=3&page=192&seq=a&format=pdf

Or someone
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Bertrand Delacretaz wrote: With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Yonik Seeley wrote: Brainstorming: - for errors, use HTTP error codes instead of putting it in the XML as now. That doesn't work so well if there are multiple documents to be indexed in a single request. -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Jan 16, 2007, at 3:20 AM, Bertrand Delacretaz wrote: On 1/16/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: ...I think a DocumentParser registry is a good way to isolate this top level task... With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. +1 that, or HiveMind. It seems a lot of the wheel is being reinvented here, when solid plugin solutions already exist. Erik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
So to understand better: user request -> micro-plugin -> RequestHandler -> ResponseHandler Right? or: HttpServletRequest -> SolrRequestParser -> SolrRequestProcessor -> SolrResponse -> SolrResponseWriter
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On Mon, 2007-01-15 at 12:23 -0800, Chris Hostetter wrote:
> : > Right, you're getting at issues of why I haven't committed my CSV handler yet.
> : > It currently handles reading a local file (this is more like an SQL
> : > update handler... only a reference to the data is passed).  But I also
> : > wanted to be able to handle a POST of the data, or even a file
> : > upload from a browser.  Then I realized that this should be generic...
> : > the same should also apply to XML updates, and potential future update
> : > formats like JSON.
> :
> : I do not see the problem here.  One just needs to add a couple of lines in
> : the upload servlet and change the csv plugin to input stream (not local
> : file).
>
> what Yonik and i are worried about is that we don't want the list of all
> possible ways for an Update Plugin to get a Stream to be hardcoded in the
> UpdateServlet or Solr Core or in the Plugins themselves ... we'd like the
> notion of indexing docs expressed as CSV records or XML records or JSON
> records to be independent of where the CSV, XML, or JSON data stream came
> from ... in the same way that the current RequestHandlers can execute
> specific search logic, without needing to worry about what format the
> results are going to be returned in.
>
> It's not writing code to get the stream from one of N known ways
> that's hard -- it's designing an API so we can get the stream from one of
> any number of *unknown* ways that can be specified at run time that's
> tricky :)

Ok, I am still trying to understand your concept of micro-plugin, but I understand the above and your comments later in this thread that you are looking for a generic stream resolver/producer (or solrSource).

On Mon, 2007-01-15 at 12:42 -0800, Chris Hostetter wrote:
> i disagree ...
it should be possible to create "micro-plugins" (I
> think i called them "UpdateSource" instances in my original suggestion)
> that know about getting streams in various ways, but don't care what
> format of data is found on those streams -- that would be left for the
> (Update)RequestHandler (which wouldn't need to know where the data
> came from)
>
> a JDBC/SQL updater would probably be a very special case -- where the
> format and the stream are inherently related -- in which case a No-Op
> UpdateSource could be used that didn't provide any stream, and the
> JdbcUpdateRequestHandler would manage its JDBC streams directly.

So to understand better:

user request -> micro-plugin -> RequestHandler -> ResponseHandler

Right?

salu2
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: ...I think a DocumentParser registry is a good way to isolate this top level task... With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. -Bertrand
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: In addition to RequestProcessors, maybe there should be a general
: DocumentProcessor
:
: interface SolrDocumentParser
: {
:   Document parse(ContentStream content);
: }
:
: solrconfig could register "text/html" -> HtmlDocumentParser, and
: RequestProcessors could share the same parser.

what else would the RequestProcessor do if it was delegating all of the parsing to something else?

Parsing is just one task that a RequestProcessor may do.  It is the entry point for all kinds of stuff: searching, admin tasks, augmenting search results with SQL queries, writing uploaded files to the file system.  This is where people will do whatever suits their fancy.  RequestHandler is probably a better name than RequestProcessor, but I think we should choose a name that can live peacefully with existing RequestHandler code.  I imagine there will be a standard 'Processor' that gets a list of streams and processes them into Documents.  Since the way these documents are parsed depends totally on the schema, we will need some way to make this user configurable.

In addition, consider the case where you want to index a SVN repository.  Yes, this could be done in a SolrRequestParser that logs in and returns the files as a stream iterator.  But this seems like more 'work' than the RequestParser is supposed to do.  Not to mention you would need to augment the Document with svn specific attributes.

Parsing a PDF file from svn should (be able to) use the same parser as if it were uploaded via HTTP POST.

I think a DocumentParser registry is a good way to isolate this top level task.
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: > (the trick being that the servlet would need to parse the "st" info out
: > of the URL (either from the path or from the QueryString) directly without
: > using any of the HttpServletRequest.getParameter*() methods...
:
: I haven't followed all of the discussion, but wouldn't it be easier to
: use the request path, instead of parameters, to select these
: RequestParsers?

absolutely (hence my comment "either from the path or from the QueryString") ... my point is just that if we go this route, any servlets Solr has (there's no reason we can't have several -- changing the URL structure can be orthogonal to adding update plugins) have to be careful about dealing with the request to determine the plugin to use.

-Hoss
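The path-based selection being discussed could look something like the sketch below: pull the parser name straight out of the request path (e.g. /solr/update/pdf-parser) so that none of the getParameter*() methods ever touch the POST body. The method name and mapping are illustrative, not real Solr APIs:

```java
public class PathDispatchSketch {

    // Given the path after the servlet prefix, return the parser name,
    // e.g. "/update/pdf-parser" -> "pdf-parser".
    static String parserName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(parserName("/update/pdf-parser"));
    }
}
```

A dispatcher would call something like this on HttpServletRequest.getPathInfo(), which is safe to read before the body, and only then hand the raw input stream to the selected RequestParser.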
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: the one thing that still seems missing is those "micro-plugins" i was
[SNIP]
: interface SolrRequestParser {
:   SolrRequest process( HttpServletRequest req );
: }

I left out "micro-plugins" because i don't quite have a good answer yet :) This may be a place where a custom dispatcher servlet/filter defined in web.xml is the most appropriate solution. I like the SolrRequestParser suggestion. Consider:

  qt='RequestHandler'
  wt='ResponseWriter'
  rp='RequestParser' (rb='SolrBuilder'?)

To avoid possible POST read-ahead stream mangling, qt, wt, and rp should be defined by the URL, not parameters. (We can add special logic to allow /query?qt=xxx)

For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people define arbitrary path mapping for qt. We could append 'wt', 'rb', and arbitrary text to the registered path, something like:

  /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...

(any other syntax ideas?)

The 'standard' RequestParser would:

GET: fill up SolrParams directly with req.getParameterMap(); if there is a 'post' parameter (post=XXX), return a stream with XXX as its content, else an empty iterator. Perhaps add a standard way to reference a remote URI stream.

POST:
  if( multipart ) {
    read all form fields into the parameter map.
    return an iterator over the collection of files
  } else {
    no parameters? parse parameters from the URL? /name:value/
    return the body stream
  }

DELETE: throw an unsupported exception?

Maybe each RequestHandler could have a default RequestParser. If we limited the 'arbitrary path' to one level, this could be used to generate more RESTful URLs. Consider:

  /myadder////

/myadder maps to MyCustomHandler and that gives you MyCustomRequestBuilder that maps /// to SolrParams

: Thoughts?
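The proposed /registered/path/wt:json/rb:standard syntax above could be parsed along these lines: split the extra path on '/' and treat each "name:value" segment as a parameter. The syntax was only floated in this thread, so this is an illustration of the idea, not of anything Solr actually shipped:

```java
import java.util.HashMap;
import java.util.Map;

public class PathParamSketch {

    static Map<String, String> parseExtraPath(String extraPath) {
        Map<String, String> params = new HashMap<>();
        for (String segment : extraPath.split("/")) {
            int colon = segment.indexOf(':');
            // segments without a colon ("more", "stuff") are left
            // for the handler to interpret
            if (colon > 0) {
                params.put(segment.substring(0, colon),
                           segment.substring(colon + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parseExtraPath("wt:json/rb:standard/more/stuff");
        System.out.println(p.get("wt") + " " + p.get("rb")); // json standard
    }
}
```

Since this only inspects the URL, it sidesteps the POST read-ahead problem entirely: the body stream is untouched until the RequestParser asks for it.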
: one last thought: while the interfaces you outlined would make a lot of
: sense if we were starting from scratch, there are probably several cases
: where not having those exact names/APIs doesn't really hurt, and would
: allow backwards compatibility with more of the current code (and current
: SolrRequestHandler plugins people have written) ... just something we
: should keep in mind: we don't want to go hog wild renaming a lot of stuff
: and alienating our existing "plugin" user base. (nor do we want to make a
: bunch of unnecessary config file format changes)

I totally understand and agree. Perhaps the best approach is to offer a SolrRequestProcessor framework that can sit next to the existing SolrRequestHandler without affecting it much (if at all). For what i have suggested, i *think* it could all be done with simple additions to the solrconfig.xml syntax that would still work on an unedited 1.1.0 solrconfig.xml.

If we use a servlet filter for the dispatcher, this can sit next to the current /query?xxx servlet without problem. When the SolrRequestProcessor framework is rock solid, we would @Deprecate SolrRequestHandler and change the default solrconfig.xml to map /query to the new framework.

The stuff I *DO* think should get refactored/deprecated ASAP is to extract the constants from the functionality in SolrParams. While we are at it, it may be good to restructure the code to something like: http://issues.apache.org/jira/browse/SOLR-20#action_12464648

ryan
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

  interface SolrRequestParser {
    SolrRequest process( HttpServletRequest req );
  }

  (the trick being that the servlet would need to parse the "st" info out
  of the URL (either from the path or from the QueryString) directly without
  using any of the HttpServletRequest.getParameter*() methods...

I haven't followed all of the discussion, but wouldn't it be easier to use the request path, instead of parameters, to select these RequestParsers? i.e. solr/update/pdf-parser, solr/update/hssf-parser, solr/update/my-custom-parser, etc.

-Bertrand
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: Iterator getContentStreams();
:
: Consider the case where you iterate through a local file system.

right, a fixed-size in-memory array can be iterated, but an unbounded stream of objects from an external source can't always be read into an array effectively -- so when in doubt go with the Iterator (or my favorite: Iterable)

: In addition to RequestProcessors, maybe there should be a general
: DocumentProcessor
:
: interface SolrDocumentParser
: {
:   Document parse(ContentStream content);
: }
:
: solrconfig could register "text/html" -> HtmlDocumentParser, and
: RequestProcessors could share the same parser.

what else would the RequestProcessor do if it was delegating all of the parsing to something else?

-Hoss
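The local-file-system case mentioned above is a good example of why Iterable wins: a directory tree can be walked one entry at a time instead of being read into a fixed array first. A purely illustrative sketch, not part of any Solr API:

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class FileWalker implements Iterable<File> {

    private final File root;

    public FileWalker(File root) {
        this.root = root;
    }

    public Iterator<File> iterator() {
        final Deque<File> pending = new ArrayDeque<>();
        pending.push(root);
        return new Iterator<File>() {
            public boolean hasNext() {
                return !pending.isEmpty();
            }

            public File next() {
                if (pending.isEmpty()) throw new NoSuchElementException();
                File current = pending.pop();
                // directories queue their children lazily, as they are visited
                File[] children = current.listFiles();
                if (children != null) {
                    for (File child : children) pending.push(child);
                }
                return current;
            }
        };
    }

    public static void main(String[] args) {
        for (File f : new FileWalker(new File("."))) {
            System.out.println(f);
            break; // iteration starts without buffering the whole tree
        }
    }
}
```

An SVN crawl or any other unbounded external source would look the same from the consumer's side: just a for-each loop over the request's content streams.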
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: I hate to inundate you with more code, but it seems like the best way
: to describe a possible interface.
...

the one thing that still seems missing is those "micro-plugins" i was talking about that can act independent of the SolrRequestProcessor used to decide where the data streams come from. if you consider the current query request handling model, there's "qt" that picks a SolrRequestHandler (what you've called SolrRequestProcessor), and "wt" which independently determines the QueryResponseWriter (aka: SolrResponseWriter) ... i think we need an "st" (stream type) that the servlet uses to pick a "SolrRequestParser" to decide how to generate the SolrRequest and its underlying ContentStreams

  interface SolrRequestParser {
    SolrRequest process( HttpServletRequest req );
  }

(the trick being that the servlet would need to parse the "st" info out of the URL (either from the path or from the QueryString) directly without using any of the HttpServletRequest.getParameter*() methods which might "read ahead" into the ServletInputStream)

: interface SolrRequest
: {
:   SolrParams getParams();
:   ContentStream[] getContentStreams();  // Iterator?
:   long getStartTime();
: }

I'm not understanding why that wouldn't make sense as an Iterable ... then it could be an array if the SolrRequestParser wanted, or it could be something more lazy-loaded.

: Thoughts?

one last thought: while the interfaces you outlined would make a lot of sense if we were starting from scratch, there are probably several cases where not having those exact names/APIs doesn't really hurt, and would allow backwards compatibility with more of the current code (and current SolrRequestHandler plugins people have written) ... just something we should keep in mind: we don't want to go hog wild renaming a lot of stuff and alienating our existing "plugin" user base. (nor do we want to make a bunch of unnecessary config file format changes)

-Hoss
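The Iterable suggestion above could be sketched as follows: if getContentStreams() returns an Iterable, a RequestParser can hand back a plain array-backed list or something lazier, and handler code iterates either way. The names mirror the proposed (not final) interfaces from this thread, with SolrRequest and ContentStream reduced to stubs to keep the sketch self-contained:

```java
import java.util.Arrays;

public class IterableRequestSketch {

    interface ContentStream {
        String getName();
    }

    // reduced from the thread's proposal (getParams(), getStartTime()
    // omitted) to focus on the Iterable question
    interface SolrRequest {
        Iterable<ContentStream> getContentStreams();
    }

    // the simple case: a fixed array wrapped as an Iterable
    static SolrRequest fromArray(final ContentStream... streams) {
        return () -> Arrays.asList(streams);
    }

    // handler code doesn't care whether the Iterable is array-backed
    // or lazily produced from a file system / repository crawl
    static int countStreams(SolrRequest req) {
        int n = 0;
        for (ContentStream s : req.getContentStreams()) n++;
        return n;
    }

    public static void main(String[] args) {
        SolrRequest req = fromArray(() -> "a.pdf", () -> "b.pdf");
        System.out.println(countStreams(req));
    }
}
```

The array version of the interface would force every RequestParser to materialize its streams up front; the Iterable version makes the array just one possible implementation.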