: > : In addition to RequestProcessors, maybe there should be a general
: > : DocumentProcessor

: > : interface SolrDocumentParser
: > : {
: > :   Document parse(ContentStream content);
: > : }

: > what else would the RequestProcessor do if it was delegating all of the
: > parsing to something else?

: Parsing is just one task that a RequestProcessor may do.  It is the
: entry point for all kinds of stuff: searching, admin tasks, augmenting
: search results with SQL queries, writing uploaded files to the file
: system.  This is where people will do whatever suits their fancy.

ah ... i see what you mean.  so DocumentProcessors would be reusable
classes that RequestHandlers/RequestProcessors could use to parse streams
-- but instead of hardcoding class dependencies on specific
DocumentProcessors into the RequestHandler, the RequestHandler could do
a "lookup" on the MIME type of the stream (or any other key it wanted to, i
suppose) to find one to parse the stream ... so you could have a
SimpleHtmlDocumentProcessor that you use, and then one day you replace it
with a ComplexHtmlDocumentProcessor which you probably have to configure a
bit differently, but you don't have to recompile your RequestHandler ...
kind of like a binary stream equivalent to the way analyzers
can be customized -- is that kind of what you had in mind?
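
a rough sketch of that lookup idea, just to make it concrete.  everything
here is hypothetical: DocumentProcessorRegistry is a made-up name, a plain
String stands in for the ContentStream, and a Map stands in for the
Document, so none of this is existing Solr API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the proposed parser interface. A String body
// replaces ContentStream and a Map replaces Document to avoid dependencies.
interface DocumentProcessor {
    Map<String, String> parse(String body);
}

// Hypothetical registry: a handler asks for a processor by MIME type
// instead of depending on a concrete processor class.
class DocumentProcessorRegistry {
    private final Map<String, DocumentProcessor> byMimeType = new HashMap<>();

    void register(String mimeType, DocumentProcessor processor) {
        byMimeType.put(normalize(mimeType), processor);
    }

    DocumentProcessor lookup(String mimeType) {
        DocumentProcessor p = byMimeType.get(normalize(mimeType));
        if (p == null) {
            throw new IllegalArgumentException("no processor for " + mimeType);
        }
        return p;
    }

    // Drop parameters such as "; charset=utf-8" and ignore case.
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}

// Naive processor: strips tags and keeps the remaining text.
class SimpleHtmlDocumentProcessor implements DocumentProcessor {
    public Map<String, String> parse(String body) {
        Map<String, String> doc = new HashMap<>();
        doc.put("text", body.replaceAll("<[^>]*>", "").trim());
        return doc;
    }
}

// Slightly richer processor: also pulls the <title> into its own field.
class ComplexHtmlDocumentProcessor implements DocumentProcessor {
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public Map<String, String> parse(String body) {
        Map<String, String> doc = new HashMap<>();
        Matcher m = TITLE.matcher(body);
        if (m.find()) {
            doc.put("title", m.group(1).trim());
        }
        doc.put("text", body.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim());
        return doc;
    }
}
```

swapping SimpleHtmlDocumentProcessor for ComplexHtmlDocumentProcessor is
then one register("text/html", ...) call (or presumably a config change),
and the handler code that calls lookup(...) doesn't change.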

(i was confused and thinking that picking a DocumentProcessor would be
done by the core independent of picking the RequestHandler --- just like
the OutputWriter is)

: In addition, consider the case where you want to index an SVN
: repository.  Yes, this could be done in SolrRequestParser that logs in
: and returns the files as a stream iterator.  But this seems like more
: 'work' than the RequestParser is supposed to do.  Not to mention you
: would need to augment the Document with svn specific attributes.
:
: Parsing a PDF file from svn should (be able to) use the same parser if
: it were uploaded via HTTP POST.

i'm totally on board now ... the RequestParser decides where the streams
come from if any (post body, file upload, local file, remote url, etc...);
the RequestHandler decides what it wants to do with those streams, and has
a library of DocumentProcessors it can pick from to help it parse them if
it wants to, then it takes whatever actions it wants, and puts the
response information in the existing Solr(Query)Response class, which the
core hands off to any of the various OutputWriters to format according to
the user's wishes.
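
that division of labor could be sketched roughly like this.  the nested
interfaces are hypothetical stand-ins named after the roles above (they are
not the real Solr classes), with Strings for streams and a Map for the
response:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pipeline sketch: each stage only knows about its own job.
class PipelineSketch {
    // Decides where the streams come from, if any (post body, file, url, ...).
    interface RequestParser {
        List<String> streams(Map<String, String> request);
    }

    // Decides what to do with the streams and fills in the response.
    interface RequestHandler {
        void handle(List<String> streams, Map<String, Object> response);
    }

    // Formats the finished response for the client.
    interface OutputWriter {
        String write(Map<String, Object> response);
    }

    // The "core": wires the three stages together in order.
    static String run(Map<String, String> request,
                      RequestParser parser,
                      RequestHandler handler,
                      OutputWriter writer) {
        List<String> streams = parser.streams(request);
        Map<String, Object> response = new LinkedHashMap<>();
        handler.handle(streams, response);
        return writer.write(response);
    }
}
```

the point of the sketch is just that the handler never sees where a stream
came from, and the writer never sees the streams at all.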

The DocumentProcessors are the ones that are really going to need a lot of
configuration telling them how to map the chunks of data from the stream
to fields in the schema -- but in the same way that OutputWriters get the
request after the RequestHandler has had a chance to wrap the SolrParams,
it probably makes sense to let the request handler override configuration
for the DocumentProcessors as well (so i can say "normally i want the
HtmlDocumentProcessor to map these HTML elements to these schema fields
... but i have one type of HTML doc that breaks the rules, so i'll use a
separate RequestHandler to index them, and it will override some of those
field mappings...").

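  interface SolrDocumentParser {
    // one-time configuration (e.g. the default field mappings)
    void init(NamedList args);
    // per-request params may override the configured defaults
    Document parse(SolrParams p, ContentStream content);
  }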
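
to illustrate the override idea, here is a minimal sketch.  it is only an
assumption about how this could work: MappingHtmlParser is a made-up class,
plain Maps stand in for NamedList, SolrParams and Document, and the
"content" is taken as already-extracted HTML element name -> text pairs so
the example doesn't need a real HTML parser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser whose element-to-field mappings come from init(),
// but can be overridden per request by whatever params the handler passes.
class MappingHtmlParser {
    private final Map<String, String> defaultMappings = new HashMap<>();

    // One-time configuration: HTML element name -> schema field name.
    void init(Map<String, String> args) {
        defaultMappings.putAll(args);
    }

    // Per-request parse: params win over the configured defaults.
    Map<String, String> parse(Map<String, String> params, Map<String, String> elements) {
        Map<String, String> mappings = new HashMap<>(defaultMappings);
        mappings.putAll(params);

        Map<String, String> doc = new HashMap<>();
        for (Map.Entry<String, String> e : elements.entrySet()) {
            String field = mappings.get(e.getKey());
            if (field != null) {          // unmapped elements are skipped
                doc.put(field, e.getValue());
            }
        }
        return doc;
    }
}
```

a "normal" handler would pass empty params and get the configured mappings;
the handler for the rule-breaking docs would pass something like
{h1 -> title} and only that one mapping changes.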



-Hoss
