[jira] [Commented] (SOLR-11869) Remote streaming UpdateRequestProcessor

David Smiley (JIRA) Thu, 18 Jan 2018 05:47:22 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330528#comment-16330528
 ]


David Smiley commented on SOLR-11869:
-------------------------------------

If you wish to propose that Solr's FieldType.createField and related plumbing 
work nicely with a Reader, then I think you should create an issue dedicated to 
that.  Also keep in mind that such a field cannot be "stored", since at the 
Lucene level a stored value must be fully materialized as a String or BytesRef.  
A further consequence is that atomic updates are not possible.

Another thing that could be considered is using a BytesRef as the stored value, 
and wrapping a Reader around it for the Lucene Analyzer/TokenStream parts.  You 
wouldn't be truly streaming, but the RAM requirements should drop roughly in 
half, since you'd be working with UTF-8 (usually 1 byte per character) as 
opposed to a String (UTF-16, usually 2 bytes per character).  This may have 
some gotchas, such as highlighting and stored-data retrieval, which anticipate 
a String from Lucene, not raw bytes.  BTW, Lucene and Solr have code paths that 
recognize massive byte[]<->char[] conversions and avoid over-allocating arrays 
by first making a preliminary pass to count the Unicode chars, which tells them 
how big the array on the other side needs to be.
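
To make the "count first, then allocate" idea concrete, here is a minimal 
JDK-only sketch.  It is not the Lucene or Solr code; the class name Utf8Sizing 
and both method names are made up for illustration, and it assumes the input 
is well-formed UTF-8 (no validation is done).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, for illustration only; assumes well-formed UTF-8 input. */
public class Utf8Sizing {

    /** Preliminary pass: counts the UTF-16 code units the bytes decode to, without decoding. */
    public static int utf16Length(byte[] utf8) {
        int units = 0;
        int i = 0;
        while (i < utf8.length) {
            int b = utf8[i] & 0xFF;
            if (b < 0x80)      { i += 1; units += 1; } // ASCII, 1 byte
            else if (b < 0xE0) { i += 2; units += 1; } // 2-byte sequence
            else if (b < 0xF0) { i += 3; units += 1; } // 3-byte sequence
            else               { i += 4; units += 2; } // 4-byte sequence -> surrogate pair
        }
        return units;
    }

    /** Decodes into a char[] allocated at exactly the size computed by the counting pass. */
    public static char[] decodeExact(byte[] utf8) {
        char[] out = new char[utf16Length(utf8)];
        int i = 0;
        int o = 0;
        while (i < utf8.length) {
            int b = utf8[i] & 0xFF;
            int cp;
            if (b < 0x80) {
                cp = b;
                i += 1;
            } else if (b < 0xE0) {
                cp = ((b & 0x1F) << 6) | (utf8[i + 1] & 0x3F);
                i += 2;
            } else if (b < 0xF0) {
                cp = ((b & 0x0F) << 12) | ((utf8[i + 1] & 0x3F) << 6) | (utf8[i + 2] & 0x3F);
                i += 3;
            } else {
                cp = ((b & 0x07) << 18) | ((utf8[i + 1] & 0x3F) << 12)
                        | ((utf8[i + 2] & 0x3F) << 6) | (utf8[i + 3] & 0x3F);
                i += 4;
            }
            // Writes one or two chars and returns how many were written.
            o += Character.toChars(cp, out, o);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Solr \u00fcber \u65e5\u672c \uD83D\uDE00";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        char[] chars = decodeExact(utf8);
        System.out.println("UTF-8 bytes:       " + utf8.length);
        System.out.println("UTF-16 code units: " + chars.length
                + " (" + (2 * chars.length) + " bytes)");
        System.out.println("round trip ok:     " + text.equals(new String(chars)));
    }
}
```

For mostly-ASCII text the byte count is close to the char count, so the UTF-8 
form takes about half the memory of the char[]; text dominated by 3-byte 
characters (e.g. CJK) would not see that saving.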

> Remote streaming UpdateRequestProcessor
> ---------------------------------------
>
>                 Key: SOLR-11869
>                 URL: https://issues.apache.org/jira/browse/SOLR-11869
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors
>            Reporter: Dirk Rudolph
>            Priority: Minor
>
> When indexing documents from content management systems (or digital asset 
> management systems), the documents usually have metadata fields filled in by 
> an editor. In the case of PDFs, DOCX files, or other text formats, they may 
> also contain the binary content, which can be parsed to plain text using 
> Tika. This is what is currently supported by the ExtractingRequestHandler. 
> We are now facing situations where we index batches of documents using 
> the UpdateRequestHandler and want to send the binary content of the documents 
> mentioned above as part of the single request to the UpdateRequestHandler. As 
> those documents might be of unknown size and it's difficult to send streams 
> along the wire with javax.json APIs, I thought about sending the URL of the 
> document itself, letting Solr fetch the document and have it parsed by Tika, 
> using a RemoteStreamingUpdateRequestProcessor.  
> Example:
> {code:json}
> { 
>  "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
>  "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
