: > : In addition to RequestProcessors, maybe there should be a general
: > : DocumentProcessor
: > :
: > : interface SolrDocumentParser
: > : {
: > :   Document parse(ContentStream content);
: > : }

: > what else would the RequestProcessor do if it was delegating all of the
: > parsing to something else?

: Parsing is just one task that a RequestProcessor may do.  It is the
: entry point for all kinds of stuff: searching, admin tasks, augment
: search results with SQL queries, writing uploaded files to the file
: system.  This is where people will do whatever suits their fancy.

ah ... i see what you mean.  so DocumentProcessors would be reusable
classes that RequestHandlers/RequestProcessors could use to parse
streams -- but instead of hardcoding class dependencies on specific
DocumentProcessors into the RequestHandler, the RequestHandler could do
a "lookup" on the mime type of the stream (or any other key it wanted
to, i suppose) to find a DocumentProcessor to parse the stream ...

so you could have a SimpleHtmlDocumentProcessor that you use, and then
one day you replace it with a ComplexHtmlDocumentProcessor which you
probably have to configure a bit differently, but you don't have to
recompile your RequestHandler ... kind of like a binary stream
equivalent to the way analyzers can be customized -- is that kind of
what you had in mind?

(i was confused and thinking that picking a DocumentProcessor would be
done by the core, independent of picking the RequestHandler -- just
like the OutputWriter is)

: In addition, consider the case where you want to index a SVN
: repository.  Yes, this could be done in SolrRequestParser that logs in
: and returns the files as a stream iterator.  But this seems like more
: 'work' than the RequestParser is supposed to do.  Not to mention you
: would need to augment the Document with svn specific attributes.
:
: Parsing a PDF file from svn should (be able to) use the same parser if
: it were uploaded via HTTP POST.

i'm totally on board now ...
the RequestParser decides where the streams come from, if any (post
body, file upload, local file, remote url, etc...); the RequestHandler
decides what it wants to do with those streams, and has a library of
DocumentProcessors it can pick from to help it parse them if it wants
to; then it takes whatever actions it wants, and puts the response
information in the existing Solr(Query)Response class, which the core
hands off to any of the various OutputWriters to format according to
the user's wishes.

The DocumentProcessors are the ones that are really going to need a lot
of configuration telling them how to map the chunks of data from the
stream to fields in the schema -- but in the same way that
OutputWriters get the request after the RequestHandler has had a chance
to wrap the SolrParams, it probably makes sense to let the request
handler override configuration for the DocumentProcessors as well (so i
can say "normally i want the HtmlDocumentProcessor to map these HTML
elements to these schema fields ... but i have one type of HTML doc
that breaks the rules, so i'll use a separate RequestHandler to index
them, and it will override some of those field mappings") ...

interface SolrDocumentParser
{
  void init(NamedList args);
  Document parse(SolrParams p, ContentStream content);
}

-Hoss
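
For illustration only, the "lookup by mime type, with per-request overrides" idea above might be sketched roughly like this. It is not Solr's actual API: ContentStream and SolrDocumentParser are simplified stand-ins declared locally, a plain Map<String, String> replaces NamedList, SolrParams and Document, and DocumentParserRegistry, PlainTextDocumentParser, StringContentStream and the "bodyField" key are hypothetical names made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a content stream: typed bytes from any source.
interface ContentStream {
    String getContentType();
    InputStream getStream();
}

// Stand-in for the proposed interface; Map replaces NamedList/SolrParams/Document.
interface SolrDocumentParser {
    void init(Map<String, String> args);
    Map<String, String> parse(Map<String, String> params, ContentStream content);
}

// Hypothetical in-memory stream, only here so the sketch can be exercised.
class StringContentStream implements ContentStream {
    private final String contentType;
    private final String body;

    StringContentStream(String contentType, String body) {
        this.contentType = contentType;
        this.body = body;
    }

    public String getContentType() { return contentType; }

    public InputStream getStream() {
        return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
    }
}

// Toy parser: puts the whole body into one field chosen by configuration.
class PlainTextDocumentParser implements SolrDocumentParser {
    private String bodyField = "text";

    public void init(Map<String, String> args) {
        bodyField = args.getOrDefault("bodyField", bodyField);
    }

    public Map<String, String> parse(Map<String, String> params, ContentStream content) {
        // Per-request params (set by the RequestHandler) win over init-time config.
        String field = params.getOrDefault("bodyField", bodyField);
        try (InputStream in = content.getStream()) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put(field, new String(in.readAllBytes(), StandardCharsets.UTF_8));
            return doc;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical registry: the handler asks for a parser by content type
// instead of naming a parser class directly.
class DocumentParserRegistry {
    private final Map<String, SolrDocumentParser> byType = new HashMap<>();

    void register(String contentType, SolrDocumentParser parser) {
        byType.put(contentType, parser);
    }

    SolrDocumentParser lookup(String contentType) {
        SolrDocumentParser parser = byType.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("no parser for " + contentType);
        }
        return parser;
    }
}
```

A handler would then call something like registry.lookup(stream.getContentType()).parse(overrides, stream), so swapping the parser registered for a content type needs only a configuration change and no recompile of the handler.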