[ 
https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631084#comment-16631084
 ] 

Karl Wright commented on SOLR-12798:
------------------------------------

[~janhoy], if you didn't mean that the metadata and content should be sent in 
the content body, then I'm completely missing what your suggestion is.

{quote}
My cURL examples were just to discuss what "metadata" might mean in this context.
{quote}

Repositories that are crawled by ManifoldCF have documents that are represented 
as follows:
- A long content stream, binary
- N pairs of name/value data, called metadata, which is fielded data associated 
with the document

If a ManifoldCF pipeline extracts metadata from the content stream, it does so 
via Tika, which converts the binary content stream into a simple text stream 
and also supplies additional metadata generated as a result of the 
extraction.  In other words, your JSON example is not like anything we do at 
all at this time.

If you want this translated into cURL, you can do it one of two ways:
(1) Put the metadata onto the URL as query parameters, e.g. 
name1=value1&name2=value2 etc., or
(2) Send the metadata as sections in a multipart post.  This too can be set up 
in cURL if you want me to propose an example.  Each section in a multipart post 
has a name, so you can transmit one section for every metadata name/value 
pair, as well as one for the content part (which has its own name, one that 
SolrCell in fact uses for metadata of its own).
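
To make the two options concrete, here is a minimal Python sketch. It is not 
ManifoldCF or SolrJ code; it only builds the request pieces by hand so the 
difference in shape is visible. The function names, the example endpoint URL, 
and the section name "content" are assumptions made for illustration, and the 
multipart builder does not escape quotes in names.

```python
import uuid
from urllib.parse import urlencode


def metadata_as_query(base_url, metadata):
    """Option (1): append every metadata name/value pair to the URL."""
    return base_url + "?" + urlencode(metadata)


def metadata_as_multipart(metadata, content_name, content, boundary=None):
    """Option (2): one multipart section per metadata pair, plus the content.

    Returns the Content-Type header value and the encoded request body.
    """
    boundary = boundary or uuid.uuid4().hex
    body = b""
    # One named section for each metadata name/value pair.
    for name, value in metadata:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
        )
        body += header.encode("utf-8") + value.encode("utf-8") + b"\r\n"
    # One named section carrying the binary content stream.
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{content_name}"; '
        f'filename="{content_name}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    )
    body += header.encode("utf-8") + content + b"\r\n"
    # Closing delimiter that ends the multipart body.
    body += f"--{boundary}--\r\n".encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", body


if __name__ == "__main__":
    pairs = [("name1", "value1"), ("name2", "value2")]
    # Hypothetical endpoint, used only to show the resulting URL.
    print(metadata_as_query("http://localhost:8983/solr/example/update/extract", pairs))
    content_type, body = metadata_as_multipart(pairs, "content", b"\x00\x01binary")
    print(content_type)
    print(len(body), "bytes in multipart body")
```

With option (1) the URL grows with every metadata pair, while with option (2) 
the URL stays fixed and the metadata travels in the request body alongside the 
content stream.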

Hope this helps.


> Structural changes in SolrJ since version 7.0.0 have effectively disabled 
> multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, 
> SOLR-12798-approach.patch, SOLR-12798-reproducer.patch, no params in url.png, 
> solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr.  When upgrading from 
> SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to 
> SolrJ's HttpSolrClient class that seemingly disable any use of multipart 
> post.  This is critical because ManifoldCF's documents often contain metadata 
> in excess of 4K that therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Paul Noble on 
> 10/31/2017, with the introduction of the RequestWriter mechanism.  Basically, 
> if a request has a RequestWriter, it is used exclusively to write the 
> request, and that overrides the stream mechanism completely.  I haven't 
> chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of 
> ContentStreamUpdateRequests for all posts meant for Solr Cell, and the 
> creation of UpdateRequests for posts not meant for Solr Cell (as well as for 
> delete and commit requests).  For our release cycle that is taking place 
> right now, we're shipping a modified version of HttpSolrClient that ignores 
> the RequestWriter when dealing with ContentStreamUpdateRequests.  We 
> apparently cannot use multipart for all requests because, when we do, we get 
> "pfountz Should not get here!" errors on the Solr side, which generate HTTP 
> error code 500 responses.  That should not happen either, in my opinion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
