Dirk Rudolph created SOLR-11869:
-----------------------------------

             Summary: Remote streaming UpdateRequestProcessor
                 Key: SOLR-11869
                 URL: https://issues.apache.org/jira/browse/SOLR-11869
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: UpdateRequestProcessors
            Reporter: Dirk Rudolph


When indexing documents from content management systems (or digital asset 
management systems), the documents usually have metadata fields filled in by an 
editor. For PDF, DOCX or other text formats they may also carry the binary 
content, which can be parsed to plain text using Tika. This is what the 
ExtractingRequestHandler currently supports. 

We are now facing situations where we index batches of documents using 
the UpdateRequestHandler and want to send the binary content of the documents 
mentioned above as part of a single request to the UpdateRequestHandler. As 
those documents might be of unknown size and it is difficult to send streams 
along the wire with javax.json APIs, I thought about sending the URL of the 
document instead, letting Solr fetch the document and have it parsed by Tika, 
using a RemoteStreamingUpdateRequestProcessor. 

Example:
{code:json}
{ 
 "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
 "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
}
{code}
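
The per-document step such a processor would perform can be sketched roughly as follows. This is only an illustration of the idea in Python, not an implementation against Solr's Java UpdateRequestProcessor API. The "_ref" suffix convention is inferred from the "text_ref" field in the example, the names resolve_refs, fetch_bytes and REF_SUFFIX are hypothetical, and the default extract step merely decodes bytes as UTF-8 as a stand-in for Tika parsing.

```python
from typing import Any, Callable, Dict
from urllib.request import urlopen

# Assumption: fields ending in "_ref" hold a URL, as "text_ref" does in the example.
REF_SUFFIX = "_ref"


def fetch_bytes(url: str) -> bytes:
    """Download the referenced document with a plain GET (no auth, no size limit)."""
    with urlopen(url) as response:
        return response.read()


def decode_utf8(raw: bytes) -> str:
    """Placeholder for Tika: treats the fetched bytes as UTF-8 text."""
    return raw.decode("utf-8", errors="replace")


def resolve_refs(
    doc: Dict[str, Any],
    fetch: Callable[[str], bytes] = fetch_bytes,
    extract: Callable[[bytes], str] = decode_utf8,
) -> Dict[str, Any]:
    """Return a copy of doc where each "<name>_ref" URL field is replaced
    by a "<name>" field holding the text extracted from the fetched content."""
    resolved: Dict[str, Any] = {}
    for name, value in doc.items():
        if name.endswith(REF_SUFFIX) and isinstance(value, str):
            target = name[: -len(REF_SUFFIX)]
            resolved[target] = extract(fetch(value))
        else:
            # Ordinary fields pass through unchanged.
            resolved[name] = value
    return resolved
```

Passing fetch and extract as parameters keeps the sketch independent of the network and of any particular parser. In an actual processor the same logic would run once per added document, before the document is handed to the next processor in the chain.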



