[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631860#comment-16631860 ]
Shawn Heisey commented on SOLR-12798:
-------------------------------------

bq. How do you suggest we handle binary data that is meant for SolrCell?

I would suggest that you don't do this. At all. Tika is prone to OOM and JVM crashes, as [~julienFL] already noted. When this happens in SolrCell, Solr goes down too. So it's strongly recommended that users never run SolrCell in production, which in my opinion means that MCF should not be using SolrCell. Tika should run separately, so that if it explodes, the Solr server keeps running.

That said... I think support for multi-part POST should be first class in SolrJ, and I would even say that sending separate parts for parameters and the actual body should be what SolrJ *always* does when it's asked to do POST, so URL limits aren't exceeded no matter what gets thrown at it. And we need to make sure that multi-part handling on the server side is rock-solid. (I'm not suggesting there are any problems there ... but if any are found, they need attention.)

It's probably a good idea to support multiple *data* streams in SolrJ as well. This would probably require some changes on the server side, and a separate Jira issue.

If MCF creates SolrInputDocument objects, it can put everything there. MCF wouldn't need to be concerned about format (the JSON mentioned earlier), only one POST part is required, URL parameters are not needed, and the standard /update handler can be used, even without a change for this issue.

> Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, SOLR-12798-approach.patch, SOLR-12798-reproducer.patch, SOLR-12798-workaround.patch, no params in url.png, solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr. When upgrading from
> SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to
> SolrJ's HttpSolrClient class that seemingly disable any use of multipart
> post. This is critical because ManifoldCF's documents often contain metadata
> in excess of 4K, which therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Paul Noble on
> 10/31/2017, with the introduction of the RequestWriter mechanism. Basically,
> if a request has a RequestWriter, it is used exclusively to write the
> request, and that overrides the stream mechanism completely. I haven't
> chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of
> ContentStreamUpdateRequests for all posts meant for Solr Cell, and the
> creation of UpdateRequests for posts not meant for Solr Cell (as well as for
> delete and commit requests). For our release cycle that is taking place
> right now, we're shipping a modified version of HttpSolrClient that ignores
> the RequestWriter when dealing with ContentStreamUpdateRequests. We
> apparently cannot use multipart for all requests because, when we do, we
> get "pfountz Should not get here!" errors on the Solr side, which
> generate HTTP error code 500 responses. That should not happen either, in my
> opinion.
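
For readers unfamiliar with the mechanism under discussion, the following is a minimal sketch of the multipart idea: each request parameter travels in its own body part next to the binary document, so nothing large has to go into the URL. It uses only the Java standard library and is not SolrJ code. The class name MultipartSketch, the method buildBody, the part name "file", and the sample parameter are all hypothetical and chosen for illustration. How the server interprets the parts is outside the scope of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of SolrJ): assembles a multipart/form-data body. */
final class MultipartSketch {

    /** Appends text to the buffer as UTF-8 bytes. */
    private static void writeText(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    /**
     * Builds a body with one text part per parameter, followed by one binary part.
     * The boundary must not occur inside any part; this sketch does not check that.
     */
    static byte[] buildBody(String boundary, Map<String, String> params,
                            String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Parameters go into the body, one part each, instead of into the URL.
        for (Map.Entry<String, String> e : params.entrySet()) {
            writeText(out, "--" + boundary + "\r\n");
            writeText(out, "Content-Disposition: form-data; name=\"" + e.getKey() + "\"\r\n");
            writeText(out, "Content-Type: text/plain; charset=UTF-8\r\n\r\n");
            writeText(out, e.getValue() + "\r\n");
        }
        // The document itself is a separate binary part ("file" is an assumed part name).
        writeText(out, "--" + boundary + "\r\n");
        writeText(out, "Content-Disposition: form-data; name=\"file\"; filename=\""
                + fileName + "\"\r\n");
        writeText(out, "Content-Type: application/octet-stream\r\n\r\n");
        out.write(content, 0, content.length);
        // The closing delimiter ends the multipart body.
        writeText(out, "\r\n--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc-1"); // illustrative parameter only
        byte[] body = buildBody("sketch-boundary", params, "sample.bin", new byte[] {1, 2, 3});
        System.out.println("multipart body size in bytes: " + body.length);
    }
}
```

A client would send such a body with a Content-Type header of multipart/form-data that names the same boundary. Because the metadata lives in the body, its size is no longer bounded by URL length limits, which is the concern raised in the issue description.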