: In addition to RequestProcessors, maybe there should be a general
: DocumentProcessor
:
: interface SolrDocumentParser
: {
:   Document parse(ContentStream content);
: }
:
: solrconfig could register "text/html" -> HtmlDocumentParser, and
: RequestProcessors could share the same parser.

What else would the RequestProcessor do if it was delegating all of the parsing to something else?
Parsing is just one task that a RequestProcessor may do. It is the entry point for all kinds of stuff: searching, admin tasks, augmenting search results with SQL queries, writing uploaded files to the file system. This is where people will do whatever suits their fancy. RequestHandler is probably a better name than RequestProcessor, but I think we should choose a name that can live peacefully with the existing RequestHandler code.

I imagine there will be a standard 'Processor' that gets a list of streams and processes them into Documents. Since the way these documents are parsed depends totally on the schema, we will need some way to make this user configurable.

In addition, consider the case where you want to index an SVN repository. Yes, this could be done in a SolrRequestParser that logs in and returns the files as a stream iterator. But this seems like more 'work' than the RequestParser is supposed to do, not to mention that you would need to augment the Document with svn-specific attributes. Parsing a PDF file from svn should (be able to) use the same parser as if it were uploaded via HTTP POST. I think a DocumentParser registry is a good way to isolate this top-level task.
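To make the registry idea concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for discussion, not existing Solr code: ContentStream and Document are minimal stand-ins, and DocumentParserRegistry, PlainTextParser and StringContentStream are hypothetical names. Only the SolrDocumentParser shape comes from the proposal quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a stream of uploaded or fetched content.
interface ContentStream {
    String getContentType();
    InputStream getStream();
}

// Hypothetical stand-in for an indexable document: field name -> value.
final class Document {
    final Map<String, String> fields = new LinkedHashMap<>();
}

// Shape taken from the proposal quoted above.
interface SolrDocumentParser {
    Document parse(ContentStream content);
}

// Hypothetical registry: maps a content type such as "text/html" to a parser.
final class DocumentParserRegistry {
    private final Map<String, SolrDocumentParser> parsers = new HashMap<>();

    void register(String contentType, SolrDocumentParser parser) {
        parsers.put(normalize(contentType), parser);
    }

    Document parse(ContentStream content) {
        String type = content.getContentType();
        SolrDocumentParser parser = (type == null) ? null : parsers.get(normalize(type));
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.parse(content);
    }

    // Drop parameters such as "; charset=UTF-8" and ignore case.
    private static String normalize(String contentType) {
        int semi = contentType.indexOf(';');
        String base = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}

// Toy parser: stores the raw text in a "body" field.
final class PlainTextParser implements SolrDocumentParser {
    public Document parse(ContentStream content) {
        Document doc = new Document();
        try (InputStream in = content.getStream()) {
            doc.fields.put("body", new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return doc;
    }
}

// In-memory stream, standing in for an HTTP upload or a file fetched from svn.
final class StringContentStream implements ContentStream {
    private final String contentType;
    private final String text;

    StringContentStream(String contentType, String text) {
        this.contentType = contentType;
        this.text = text;
    }

    public String getContentType() { return contentType; }

    public InputStream getStream() {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
}

final class ParserRegistryDemo {
    public static void main(String[] args) {
        DocumentParserRegistry registry = new DocumentParserRegistry();
        registry.register("text/plain", new PlainTextParser());

        // The same parser is used regardless of where the stream came from.
        Document doc = registry.parse(
            new StringContentStream("text/plain; charset=UTF-8", "hello"));

        // An svn-aware handler would add its own attributes after parsing.
        doc.fields.put("source", "svn");
        System.out.println(doc.fields);
    }
}
```

The point of the split is that the handler decides where streams come from and what extra fields to add, while the registry only answers "given this content type, how do I turn the bytes into a Document". In a real setup the register() calls would presumably be driven by solrconfig rather than hard-coded.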
