[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629971#comment-16629971 ]
Jan Høydahl edited comment on SOLR-12798 at 9/27/18 9:01 AM:
-------------------------------------------------------------

If I understand correctly, you now have a choice in MCF whether to
# Stream the original binary document to Solr's extracting request handler and use Solr's built-in Tika to parse it. In this case there will NOT be a problem, since you won't have much metadata as request params, just the few you have configured statically.
# Let MCF do the Tika conversion using the Tika Content Extractor (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#tikaextractor). In this case MCF will have all the various metadata parsed from the docs, which it may want to send to Solr alongside the plain-text parsed version of the document.

For 1) you don't have an issue, as you send the binary stream to the {{/extract}} endpoint.

For 2) I wonder why you use {{/extract}} at all, since Tika has already been invoked on the MCF side. This seems like an anti-pattern. The best way would be to construct a SolrInputDocument in which each {{literal.xyz}} param becomes a separate {{xyz}} field and the text body is put into a {{content}} field (configurable), and to send everything to {{/update}} as opposed to {{/extract}}. In the case of jpg files the body text would of course be empty, as there is only metadata to be indexed.

> Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, SOLR-12798-approach.patch, solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr. When upgrading from SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to SolrJ's HttpSolrClient class that seemingly disable any use of multipart post.
> This is critical because ManifoldCF's documents often contain metadata in excess of 4K, which therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Paul Noble on 10/31/2017, with the introduction of the RequestWriter mechanism. Basically, if a request has a RequestWriter, it is used exclusively to write the request, and that overrides the stream mechanism completely. I haven't chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of ContentStreamUpdateRequests for all posts meant for Solr Cell, and the creation of UpdateRequests for posts not meant for Solr Cell (as well as for delete and commit requests). For our release cycle that is taking place right now, we're shipping a modified version of HttpSolrClient that ignores the RequestWriter when dealing with ContentStreamUpdateRequests. We apparently cannot use multipart for all requests, because when we do we get "pfountz Should not get here!" errors on the Solr side, which generate HTTP error code 500 responses. That should not happen either, in my opinion.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
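As a rough illustration of the alternative suggested in the comment for case 2), the mapping from {{literal.xyz}} request params to plain {{xyz}} fields could look something like the sketch below. It is not taken from ManifoldCF or from the attached patch: the class name, method name and sample values are hypothetical, and it uses only the JDK so it runs without SolrJ on the classpath. With SolrJ, each resulting entry would presumably be passed to SolrInputDocument.addField(name, value) and the document sent with SolrClient.add(doc), which targets {{/update}} by default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ or ManifoldCF.
public class LiteralParamsToFields {
    static final String LITERAL_PREFIX = "literal.";

    /**
     * Turns the params that would have gone to /extract into a field map
     * suitable for a document sent to /update.
     * Simplification: one value per param; real metadata may be multi-valued.
     */
    public static Map<String, Object> toFields(Map<String, String> extractParams,
                                               String bodyText,
                                               String contentField) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : extractParams.entrySet()) {
            String name = e.getKey();
            // Only literal.* params carry document metadata; others are dropped.
            if (name.startsWith(LITERAL_PREFIX)) {
                fields.put(name.substring(LITERAL_PREFIX.length()), e.getValue());
            }
        }
        // Metadata-only documents (e.g. jpg files) have no body text to index.
        if (bodyText != null && !bodyText.isEmpty()) {
            fields.put(contentField, bodyText);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc-1");      // sample values, made up
        params.put("literal.author", "Jane");
        params.put("commit", "true");           // not a literal.* param, ignored

        // Empty body, as for an image with metadata only.
        System.out.println(toFields(params, "", "content"));
        // prints {id=doc-1, author=Jane}
    }
}
```

Because all metadata travels in the request body as document fields, nothing has to fit into the URL, so the 4K limit mentioned in the issue description would not come into play on this path.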