[ 
https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631860#comment-16631860
 ] 

Shawn Heisey commented on SOLR-12798:
-------------------------------------

bq.  How do you suggest we handle binary data that is meant for SolrCell?

I would suggest that you don't do this.  At all.  Tika is prone to OOM and JVM 
crashes, as [~julienFL] already noted.  When this happens in SolrCell, Solr 
goes down too.  So all users are strongly advised never to use SolrCell in 
production, which in my opinion means that MCF should not be using SolrCell.  
Tika should run separately from Solr, so that if it explodes, the Solr server 
keeps running.

That said... I think support for multi-part POST should be first class in 
SolrJ, and I would even say that sending separate parts for parameters and the 
actual body should be what SolrJ *always* does when it's asked to do POST, so 
URL limits aren't exceeded no matter what gets thrown at it.  And we need to 
make sure that multi-part handling on the server side is rock-solid.  (I'm not 
suggesting there are any problems there ... but if any are found, they need 
attention.)
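
As a rough illustration of what "separate parts for parameters and the actual 
body" looks like on the wire, here is a minimal sketch that uses only the JDK.  
It is not SolrJ's implementation and not a proposed patch: the class name 
MultipartSketch, the method buildBody, the boundary string, and the part name 
"file" are all made up for the example, and the literal.* parameters are just 
sample values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of SolrJ.
public class MultipartSketch {

    /**
     * Builds a multipart/form-data body: one part per request parameter,
     * followed by a single part that carries the raw document bytes.
     * Nothing ends up in the URL, so URL length limits do not apply.
     * Note: names are written as-is; a real implementation would need to
     * escape quotes and line breaks in parameter and file names.
     */
    public static byte[] buildBody(String boundary, Map<String, String> params,
                                   String fileName, String contentType, byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Each parameter becomes its own text part.
        for (Map.Entry<String, String> e : params.entrySet()) {
            write(out, "--" + boundary + "\r\n");
            write(out, "Content-Disposition: form-data; name=\"" + e.getKey() + "\"\r\n");
            write(out, "Content-Type: text/plain; charset=UTF-8\r\n\r\n");
            write(out, e.getValue() + "\r\n");
        }
        // The document itself goes into a final binary part.
        write(out, "--" + boundary + "\r\n");
        write(out, "Content-Disposition: form-data; name=\"file\"; filename=\""
                + fileName + "\"\r\n");
        write(out, "Content-Type: " + contentType + "\r\n\r\n");
        out.write(data, 0, data.length);
        // Closing delimiter marks the end of the multipart body.
        write(out, "\r\n--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.write(b, 0, b.length);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc-1");          // sample values only
        params.put("literal.title", "Example");
        byte[] body = buildBody("sketchBoundary", params, "doc.bin",
                "application/octet-stream", new byte[] {1, 2, 3});
        System.out.println(body.length + " bytes in multipart body");
    }
}
```

The request would then be sent with a Content-Type header of 
multipart/form-data carrying the same boundary value.  Whether the server 
accepts this for every handler is exactly the part that needs to be verified.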

It's probably a good idea to support multiple *data* streams as well in SolrJ.  
This would probably require some changes on the server side, and a separate 
Jira issue.

If MCF creates SolrInputDocument objects, it can put everything there.  MCF 
wouldn't need to be concerned about format (the JSON mentioned earlier), only 
one POST part is required, URL parameters are not needed, and the standard 
/update handler can be used, even without a change for this issue.

> Structural changes in SolrJ since version 7.0.0 have effectively disabled 
> multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, 
> SOLR-12798-approach.patch, SOLR-12798-reproducer.patch, 
> SOLR-12798-workaround.patch, no params in url.png, solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr.  When upgrading from 
> SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to 
> SolrJ's HttpSolrClient class that seemingly disable any use of multipart 
> post.  This is critical because ManifoldCF's documents often contain metadata 
> in excess of 4K that therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Paul Noble on 
> 10/31/2017, with the introduction of the RequestWriter mechanism.  Basically, 
> if a request has a RequestWriter, it is used exclusively to write the 
> request, and that overrides the stream mechanism completely.  I haven't 
> chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of 
> ContentStreamUpdateRequests for all posts meant for Solr Cell, and the 
> creation of UpdateRequests for posts not meant for Solr Cell (as well as for 
> delete and commit requests).  For our release cycle that is taking place 
> right now, we're shipping a modified version of HttpSolrClient that ignores 
> the RequestWriter when dealing with ContentStreamUpdateRequests.  We 
> apparently cannot use multipart for all requests because, when we do, we 
> get "pfountz Should not get here!" errors on the Solr side, which generate 
> HTTP error code 500 responses.  That should not happen either, in my 
> opinion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
