[ 
https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629971#comment-16629971
 ] 

Jan Høydahl edited comment on SOLR-12798 at 9/27/18 9:01 AM:
-------------------------------------------------------------

If I understand correctly, you now have a choice in MCF between two approaches:
 # Stream the original binary document to Solr's extracting request handler and 
use Solr's built-in Tika to parse it. 
 In this case there will NOT be a problem, since you won't send much metadata as 
request params, only the few you have configured statically.
 # Let MCF do the Tika conversion using the Tika Content Extractor 
(https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#tikaextractor).
 In this case MCF will have all the metadata parsed from the docs, which 
it may want to send to Solr alongside the plain-text parsed version of the 
document.

For 1) you don't have an issue, as you send the binary stream to the 
{{/extract}} endpoint.

For 2) I wonder why you use {{/extract}} at all, since Tika has already been 
invoked on the MCF side. This seems like an anti-pattern. The best way would be 
to construct a SolrInputDocument in which each {{literal.xyz}} param becomes a 
separate {{xyz}} field and the text body is put into a (configurable) 
{{content}} field, and then send everything to {{/update}} instead of 
{{/extract}}. In the case of jpg files the body text would of course be empty, 
as there is only metadata to be indexed.
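
A minimal sketch of the mapping suggested for 2), using only the JDK so that it 
stays self-contained. The class name {{LiteralParamsToFields}}, the method 
{{toFields}} and the sample values are hypothetical and are not taken from MCF 
or SolrJ. In a real connector each resulting entry would presumably be passed 
to {{SolrInputDocument.addField}} and the document sent to {{/update}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "literal.xyz" request params into plain "xyz" fields.
class LiteralParamsToFields {
    static final String LITERAL_PREFIX = "literal.";

    // params:    request parameters as they would have been sent to /extract
    // bodyField: name of the field that receives the extracted text (e.g. "content")
    // bodyText:  plain text already produced by Tika on the MCF side; may be empty
    static Map<String, List<String>> toFields(Map<String, List<String>> params,
                                              String bodyField,
                                              String bodyText) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : params.entrySet()) {
            String name = e.getKey();
            // Only literal.* params carry document metadata; skip everything else.
            if (name.startsWith(LITERAL_PREFIX) && name.length() > LITERAL_PREFIX.length()) {
                String fieldName = name.substring(LITERAL_PREFIX.length());
                fields.put(fieldName, new ArrayList<>(e.getValue()));
            }
        }
        // Metadata-only documents (e.g. jpg files) have no body text to index.
        if (bodyText != null && !bodyText.isEmpty()) {
            fields.put(bodyField, Collections.singletonList(bodyText));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        params.put("literal.author", Arrays.asList("Jane Doe"));
        params.put("literal.keywords", Arrays.asList("balloon", "trip"));
        params.put("commitWithin", Arrays.asList("10000")); // not a literal, ignored

        // A jpg yields no body text, so only the metadata fields remain.
        System.out.println(toFields(params, "content", ""));
    }
}
```

Because the fields then travel in the request body of {{/update}}, the amount 
of metadata is no longer constrained by what fits into a URL.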


> Structural changes in SolrJ since version 7.0.0 have effectively disabled 
> multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, 
> SOLR-12798-approach.patch, solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr.  When upgrading from 
> SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to 
> SolrJ's HttpSolrClient class that seemingly disable any use of multipart 
> post.  This is critical because ManifoldCF's documents often contain metadata 
> in excess of 4K that therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Paul Noble on 
> 10/31/2017, with the introduction of the RequestWriter mechanism.  Basically, 
> if a request has a RequestWriter, it is used exclusively to write the 
> request, and that overrides the stream mechanism completely.  I haven't 
> chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of 
> ContentStreamUpdateRequests for all posts meant for Solr Cell, and the 
> creation of UpdateRequests for posts not meant for Solr Cell (as well as for 
> delete and commit requests).  For our release cycle that is taking place 
> right now, we're shipping a modified version of HttpSolrClient that ignores 
> the RequestWriter when dealing with ContentStreamUpdateRequests.  We 
> apparently cannot use multipart for all requests, because when we do we 
> get "pfountz Should not get here!" errors on the Solr side, which 
> generate HTTP error code 500 responses.  That should not happen either, in my 
> opinion.



