Re: Handling disparate data sources in Solr

Chris Hostetter Sun, 07 Jan 2007 16:22:08 -0800

: > There has been some discussion about adding plugin support for the
: > "update" side of things as well -- at a very simple level this could allow
: > for messages to be sent via JSON, or CSV instead of just XML -- but


: I'm interested in discussing this further.  I've moved the discussion
: onto solr-dev, as suggested.

Currently, the "modularity" of updates is configurable only through the
updateHandler -- which decides how instances of "UpdateCommand" will be
handled by the SolrCore (directly, via a temp index, etc...)

The relevant discussion so far seems to have focused on two
different aspects of the issue related to how SolrCore gets those commands...
  1) parsing different String representations (ie: XML vs JSON vs CSV) of
     the same basic command structure (ie: "add" containing "doc"s,
     containing "field"s)
  2) different means of feeding those String commands to Solr (raw POST,
     CGI file upload, local file)

with this thread, a third aspect has been brought up:
  3) Sending Solr more "raw" data and letting a plugin extract the
     individual fields based on rules (ie: parsing a PDF and determining the
     "title" and "body" on the server side)

It seems like these issues could be addressed by modifying the
SolrUpdateServlet to support two low level query params similar to the
way the SolrServlet looks at "qt" and "wt".  The first param would be used
to pick an UpdateSource plugin that would have an API like...
  public interface UpdateSource {
     SolrUpdateRequest makeRequest(HttpServletRequest req);
  }

with the SolrUpdateRequest interface looking something like...
  public interface SolrUpdateRequest {
     SolrParams getParams();
     Iterable<java.io.Reader> getRawUpdates();
  }

different out of the box versions of UpdateSource would support building
SolrUpdateRequest objects from HttpServletRequests using...
  1) URL query args and the raw POST body
  2) query args from multipart form input and Readers from file uploads
  3) query args and local filenames specified in query args
  4) query args and remote URLs specified in query args
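
To make case 1 concrete, here is a rough sketch of what a raw POST
UpdateSource might look like.  It is only an illustration: SimpleRequest
is a made-up stand-in for HttpServletRequest (so the snippet compiles
without the servlet API), a plain Map stands in for SolrParams, and the
Sketch* names are hypothetical simplifications of the interfaces above.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for HttpServletRequest: just query args plus a body.
final class SimpleRequest {
    final Map<String, String> queryArgs;
    final String body;

    SimpleRequest(Map<String, String> queryArgs, String body) {
        this.queryArgs = queryArgs;
        this.body = body;
    }
}

// Simplified SolrUpdateRequest; a Map is assumed in place of SolrParams.
interface SketchUpdateRequest {
    Map<String, String> getParams();
    Iterable<Reader> getRawUpdates();
}

// Simplified UpdateSource taking the stand-in request type.
interface SketchUpdateSource {
    SketchUpdateRequest makeRequest(SimpleRequest req);
}

// Case 1: params come from the URL query args, and the raw POST body
// is exposed as a single Reader.
final class RawPostUpdateSource implements SketchUpdateSource {
    @Override
    public SketchUpdateRequest makeRequest(final SimpleRequest req) {
        return new SketchUpdateRequest() {
            @Override
            public Map<String, String> getParams() {
                return req.queryArgs;
            }

            @Override
            public Iterable<Reader> getRawUpdates() {
                // One update stream: the POST body itself.
                return Collections.<Reader>singletonList(new StringReader(req.body));
            }
        };
    }
}
```

The other three cases would differ only in where the params and Readers
come from (multipart parts, local files, remote URLs).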

The SolrUpdateServlet would then use SolrUpdateRequest.getParams() to
look up its second core param for picking an UpdateParser plugin, which
would be responsible for parsing all of those Readers in sequence,
converting them to UpdateCommands, and calling the appropriate methods on
the UpdateHandler.
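
The name-based selection could be as simple as a small registry, much
like "qt" and "wt" pick handlers and writers by name.  A minimal sketch,
assuming params are available as a plain Map; PluginRegistry and the
param name passed to it are hypothetical, since nothing above fixes what
the two params would be called.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry that picks a plugin by the value of one request param.
final class PluginRegistry<T> {
    private final Map<String, T> plugins = new HashMap<>();
    private final String paramName;   // assumed param name, e.g. for the parser
    private final String defaultName; // plugin used when the param is absent

    PluginRegistry(String paramName, String defaultName) {
        this.paramName = paramName;
        this.defaultName = defaultName;
    }

    void register(String name, T plugin) {
        plugins.put(name, plugin);
    }

    // Looks up the plugin named by the param, falling back to the default name.
    T select(Map<String, String> params) {
        String name = params.getOrDefault(paramName, defaultName);
        T plugin = plugins.get(name);
        if (plugin == null) {
            throw new IllegalArgumentException(
                "Unknown " + paramName + " plugin: " + name);
        }
        return plugin;
    }
}
```

The servlet would hold two of these: one consulted with the raw HTTP
query args to pick the UpdateSource, and one consulted with
SolrUpdateRequest.getParams() to pick the UpdateParser.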

Out of the box versions of UpdateParser could do the XML parsing currently
done, or JSON parsing, or CSV parsing.  Custom plugins written by users
could do more exotic schema specific parsing: ie, reading raw PDFs and
extracting specific field values.
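
As a rough idea of what a CSV flavored UpdateParser might do with those
Readers, here is a deliberately naive sketch.  Everything in it is an
assumption for illustration: the class name is made up, each document is
returned as a Map of field name to value instead of being turned into a
real UpdateCommand, the first non-blank line of each Reader is treated as
the field names, and quoting/escaping is not handled at all.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical, naive CSV parser: one document per data row.
final class NaiveCsvUpdateParser {

    // Parses every Reader in sequence; the caller remains responsible
    // for closing the Readers.
    static List<Map<String, String>> parse(Iterable<? extends Reader> rawUpdates) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Reader raw : rawUpdates) {
            List<String> lines = new BufferedReader(raw).lines()
                    .filter(line -> !line.trim().isEmpty())
                    .collect(Collectors.toList());
            if (lines.isEmpty()) {
                continue; // nothing to index from this Reader
            }
            // First non-blank line is assumed to hold the field names.
            String[] header = lines.get(0).split(",", -1);
            for (String line : lines.subList(1, lines.size())) {
                String[] values = line.split(",", -1);
                Map<String, String> doc = new LinkedHashMap<>();
                // Extra columns on either side are ignored.
                for (int i = 0; i < header.length && i < values.length; i++) {
                    doc.put(header[i].trim(), values[i].trim());
                }
                docs.add(doc);
            }
        }
        return docs;
    }
}
```

A real version would build "add" commands from each document and hand
them to the UpdateHandler instead of collecting them in a List.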


what do you guys think?


-Hoss
