[ https://issues.apache.org/jira/browse/SOLR-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990572#comment-16990572 ]
Yonik Seeley commented on SOLR-14013: ------------------------------------- OK, found the primary issue.... it's an N^2 bug. h3. Background: SOLR-12885 changed SolrDocument (among other things) to make it so that when used through certain interfaces, gets would auto-check-and-convert keys & values of type Utf8CharSequence to String. This happened in methods getFieldValueMap.get(), getFieldValueMap.keySet(), remove() and getValues() h3. Issues: SolrDocument does not auto-convert through many other of its methods, such as .get(), .put(), .keySet(), so depending on this anywhere is extremely fragile and will break if you change how you access SolrDocument SolrInputField has some of the same issues as SolrDocument, the mere act of doing a .get() on a multi-valued field (which should be O(1)) scans the entire list for CharSequence and if it finds one, creates a new list and iterates over the whole thing again to convert each element. h3. Client side indexing: And it's worse, because it looks like this auto-check-and-convert logic is even triggered when the SolrJ is using JavaBinCodec to send documents... so even if some field values were Utf8CharSequence to begin with, they would still be converted to String before being converted back to utf8 by JavaBinCodec! h3. Server side indexing: Then on the server side, JavaBinCodec parses String values as Utf8CharSequence, and we start going through the update processor chain. FieldMutatingUpdateProcessor (used in our _default config to remove blank values) asks each SolrInputField for its value, which again triggers iteration over the complete list. Also, for *any* string values (single valued too), FieldMutatingUpdateProcessor replaces those Utf8CharSequence objects with String objects (destroying any attempted re-serializing optimization) Then comes NestedUpdateProcessorFactory, which triggers the auto-check-and-convert *twice*, because getValue() returned a pointer previously, which would have been optimized away. Both lines below iterate over all values, *before* the actual iteration by the explicit "for" loop: {code:java} boolean isSingleVal = !(field.getValue() instanceof Collection); for(Objectval: field) { {code} then isAtomicUpdate(), and then finally writeSolrInputDocument() to convert to JavaBin for the transaction log both trigger the extra iterate-over-all-values with each inspection. If FieldMutatingUpdateProcessor hadn't overwritten Utf8CharSequence already, all of these accesses would have also triggered a new collection creation each time (and an additional iteration to create the new collection) for every multi-valued string field. h3. Server side query: On the query side, we get a Lucene Document, and then convert it into a SolrDocument. Binary ResponseWriter uses MaskCharSeqSolrDocument which inherits from SolrDocument to do the auto-convert-on-access stuff more thoroughly. {code:java} for (IndexableField f : doc.getFields()) { final String fname = f.name(); if (null == fieldNamesNeeded || fieldNamesNeeded.contains(fname) ) { // Make sure multivalued fields are represented as lists Object existing = out.get(fname); {code} For multi-valued fields, what we get back from lucene is actually a flat list of all the values in the whole document. We need to collect all values with the same field into a list. So if there are 1000 values in a field, the outer loop executes over 1000 times. Then in the inner loop we retrieve any existing value for the field by calling "out.get(fname)", which triggers the auto-convert-on-access which scans all the values so far (on average 1000/2), and hence we have our O(N^2) behavior that the original poster reported. h3. Other: It took a really long time to review some of this code (and I've only reviewed some), often because a lack of comments around non-obvious things. I thought there might be lifetime/sharing bugs with BytesBlock for example, until I realized that strings are appended in the block rather than placed at the start. Same issue for FastInputStream.readDirectUtf8... since it looked like it was sharing the internal buffer, I thought there was a possible lifetime issue there. A single line comment in both of those cases would have saved me quite a bit. Actually... looking at it again, there still may be a subtle sharing bug in this new FastInputStream.readDirectUtf8. I can't say I quite understand the logic behind for when you can't share the internal buffer. {code:java} if (in !=null || end < pos + len) return false; {code} You can only share the buffer when the bytes you want are right up at the end of the buffer? I'm not sure I understand the logic around that, but ChannelFastInputStream (used by TransactionLog) derives from FastInputStream and doesn't have an input stream, so if it got unlucky and did a string read at the end of a buffer, there would likely be data corruption. > javabin performance regressions > ------------------------------- > > Key: SOLR-14013 > URL: https://issues.apache.org/jira/browse/SOLR-14013 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Affects Versions: 7.7 > Reporter: Yonik Seeley > Assignee: Yonik Seeley > Priority: Major > Attachments: test.json > > > As noted by [~rrockenbaugh] in SOLR-13963, javabin also recently became > orders of magnitude slower in certain cases since v7.7. The cases identified > so far include large numbers of values in a field. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org