[ https://issues.apache.org/jira/browse/SOLR-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052059#comment-16052059 ]
Mark Miller commented on SOLR-9824:
-----------------------------------

Yeah, it's meant to go to 6.x. It was baking on 7. Tomas has likely fixed the main issue I was concerned about before backporting.

This is not resolved because it's still intended to be backported. It's just a complicated change that needed some baking time to make sure no major bugs were introduced.

> Documents indexed in bulk are replicated using too many HTTP requests
> ---------------------------------------------------------------------
>
>                 Key: SOLR-9824
>                 URL: https://issues.apache.org/jira/browse/SOLR-9824
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 6.3
>            Reporter: David Smiley
>            Assignee: Mark Miller
>         Attachments: SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch,
> SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch,
> SOLR-9824-tflobbe.patch
>
>
> This takes a while to explain; bear with me. While working on bulk indexing
> small documents, I looked at the logs of my SolrCloud nodes. I noticed that
> shards would see an /update log message every ~6ms which is *way* too much.
> These are requests from one shard (that isn't a leader/replica for these docs
> but the recipient from my client) to the target shard leader (no additional
> replicas). One might ask why I'm not sending docs to the right shard in the
> first place; I have a reason but it's beside the point -- there's a real
> Solr perf problem here and this probably applies equally to
> replicationFactor>1 situations too. I could turn off the logs but that would
> hide useful stuff, and it's disconcerting to me that so many short-lived HTTP
> requests are happening, somehow at the behest of DistributedUpdateProcessor.
> After lots of analysis and debugging and hair pulling, I finally figured it
> out.
> In SOLR-7333, [~tpot] introduced an optimization called
> {{UpdateRequest.isLastDocInBatch()}} in which ConcurrentUpdateSolrClient will
> poll the internal queue with a '0' timeout, so that it can close the
> connection without it hanging around any longer than needed. This part makes
> sense to me. Currently the only spot that has the smarts to set this flag is
> {{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}} at the
> last document. So if a shard received docs in a javabin stream (but not
> other formats) one would expect the _last_ document to have this flag.
> There's even a test. Docs without this flag get the default poll time; for
> javabin it's 25ms. Okay.
> I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send
> javabin data in a batch, the intended efficiencies of SOLR-7333 would apply.
> I didn't try. In my case, I'm using ConcurrentUpdateSolrClient (and BTW
> DistributedUpdateProcessor uses CUSC too). CUSC uses the RequestWriter
> (defaulting to javabin) to send each document separately without any leading
> marker or trailing marker. For the XML format by comparison, there is a
> leading and trailing marker (<stream> ... </stream>). Since there's no outer
> container for the javabin unmarshalling to detect the last document, it marks
> _every_ document as {{req.lastDocInBatch()}}! Ouch!
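
To make the poll-timeout behavior described in the issue concrete, here is a minimal Java model. It is a sketch under stated assumptions, not Solr code: the class PollTimeoutSketch, its methods, and the constant arrival gap between documents are all hypothetical. Only the 0 ms poll for flagged documents and the 25 ms default for javabin come from the description above. The model assumes a streaming connection stays open only while the poll timeout is at least as long as the wait for the next document.

```java
import java.util.Arrays;

// Hypothetical model, not Solr code: counts how many HTTP requests a
// streaming sender would open for a run of documents.
final class PollTimeoutSketch {

    // Default poll time for javabin, per the issue description.
    static final long DEFAULT_POLL_MS = 25;

    // Assumed rule: a doc flagged as last-in-batch polls the queue with a
    // 0 ms timeout; any other doc waits up to the default poll time.
    static long pollTimeoutMs(boolean lastDocInBatch) {
        return lastDocInBatch ? 0 : DEFAULT_POLL_MS;
    }

    // lastDocFlags[i] is the flag on document i; arrivalGapMs is the
    // (assumed constant) delay before the next document is queued.
    static int countRequests(boolean[] lastDocFlags, long arrivalGapMs) {
        int requests = 0;
        boolean connectionOpen = false;
        for (boolean flagged : lastDocFlags) {
            if (!connectionOpen) {
                requests++;            // no open stream: start a new request
                connectionOpen = true;
            }
            // The stream stays open only if the poll outlasts the gap.
            connectionOpen = arrivalGapMs <= pollTimeoutMs(flagged);
        }
        return requests;
    }

    public static void main(String[] args) {
        boolean[] everyDocFlagged = new boolean[1000];
        Arrays.fill(everyDocFlagged, true);
        boolean[] onlyLastFlagged = new boolean[1000];
        onlyLastFlagged[999] = true;

        // 6 ms is an assumed gap between documents, chosen for illustration.
        System.out.println(countRequests(everyDocFlagged, 6)); // 1000
        System.out.println(countRequests(onlyLastFlagged, 6)); // 1
    }
}
```

Under these assumptions, flagging every document yields one request per document, while flagging only the final document lets the whole run share a single request. That is the difference between the behavior reported here and the one SOLR-7333 intended.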