[ https://issues.apache.org/jira/browse/SOLR-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990572#comment-16990572 ]

Yonik Seeley commented on SOLR-14013:
-------------------------------------

OK, found the primary issue... it's an O(N^2) bug.
h3. Background:
SOLR-12885 changed SolrDocument (among other things) so that, when it is used through
certain interfaces, gets auto-check-and-convert keys and values of type Utf8CharSequence
to String. This happens in the methods getFieldValueMap.get(), getFieldValueMap.keySet(),
remove() and getValues().
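
Roughly, the pattern looks like this (a simplified sketch with made-up names, not the
actual Solr code): every read through such an accessor checks the stored value and
converts it on the way out.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a Utf8CharSequence-style value held as UTF-8 bytes.
final class Utf8Bytes implements CharSequence {
  private final byte[] bytes;
  Utf8Bytes(String s) { bytes = s.getBytes(StandardCharsets.UTF_8); }
  @Override public String toString() { return new String(bytes, StandardCharsets.UTF_8); }
  public int length() { return toString().length(); }
  public char charAt(int i) { return toString().charAt(i); }
  public CharSequence subSequence(int a, int b) { return toString().subSequence(a, b); }
}

class AutoConvertSketch {
  private final Map<String, Object> fields = new HashMap<>();

  // Each access through this path checks the stored type and converts on the way out.
  Object getFieldValue(String name) {
    Object v = fields.get(name);
    if (v instanceof Utf8Bytes) {
      v = v.toString();        // auto-check-and-convert to String
      fields.put(name, v);     // whether the real code writes it back is an assumption here
    }
    return v;
  }

  public static void main(String[] args) {
    AutoConvertSketch doc = new AutoConvertSketch();
    doc.fields.put("title", new Utf8Bytes("hello"));
    System.out.println(doc.getFieldValue("title").getClass()); // class java.lang.String
  }
}
{code}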

h3. Issues: 
 SolrDocument does not auto-convert through many other of its methods,
 such as .get(), .put(), .keySet(), so depending on this anywhere is extremely
 fragile and will break if you change how you access SolrDocument

SolrInputField has some of the same issues as SolrDocument: the mere act of doing a
.get() on a multi-valued field (which should be O(1)) scans the entire list for a
CharSequence and, if it finds one, creates a new list and iterates over the whole
thing again to convert each element.
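
In other words, something like this (a simplified sketch, not the real SolrInputField
code): a single read of an N-value field does one O(N) scan, plus possibly a second
O(N) copy-and-convert pass and a new list allocation.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a multi-valued field holder; names are illustrative only.
class MultiValuedFieldSketch {
  private final List<Object> values = new ArrayList<>();

  // What should be a cheap accessor walks the list up to twice: once to look for any
  // non-String CharSequence, and once more to copy/convert every element.
  List<Object> getValues() {
    boolean needsConvert = false;
    for (Object v : values) {                // pass 1: O(N) scan
      if (v instanceof CharSequence && !(v instanceof String)) {
        needsConvert = true;
        break;
      }
    }
    if (!needsConvert) return values;

    List<Object> converted = new ArrayList<>(values.size());
    for (Object v : values) {                // pass 2: O(N) copy + convert
      converted.add(v instanceof CharSequence ? v.toString() : v);
    }
    return converted;                        // a brand-new list on every such call
  }

  public static void main(String[] args) {
    MultiValuedFieldSketch field = new MultiValuedFieldSketch();
    for (int i = 0; i < 1000; i++) field.values.add(new StringBuilder("v" + i));
    field.getValues(); // one "simple" read = up to 2000 element visits + a new 1000-element list
  }
}
{code}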

h3. Client side indexing:
And it's worse, because it looks like this auto-check-and-convert logic is even
triggered when SolrJ is using JavaBinCodec to send documents... so even if some field
values were Utf8CharSequence to begin with, they would still be converted to String
before being converted back to UTF-8 by JavaBinCodec!
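
The wasted work is equivalent to the following (plain JDK code, not SolrJ itself):
bytes that are already UTF-8 get decoded into a String only to be immediately
re-encoded back to UTF-8 for the wire.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripSketch {
  public static void main(String[] args) {
    byte[] alreadyUtf8 = "some field value".getBytes(StandardCharsets.UTF_8);

    // What the auto-convert forces: decode to String (allocates and copies)...
    String decoded = new String(alreadyUtf8, StandardCharsets.UTF_8);

    // ...only for the codec to immediately re-encode the same characters.
    byte[] reEncoded = decoded.getBytes(StandardCharsets.UTF_8);

    // The output is byte-for-byte identical to the input; both conversions were pure overhead.
    System.out.println(Arrays.equals(alreadyUtf8, reEncoded)); // true
  }
}
{code}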

h3. Server side indexing:
Then on the server side, JavaBinCodec parses String values as Utf8CharSequence, and we
start going through the update processor chain. FieldMutatingUpdateProcessor (used in
our _default config to remove blank values) asks each SolrInputField for its value,
which again triggers iteration over the complete list. Also, for *any* string values
(single-valued too), FieldMutatingUpdateProcessor replaces those Utf8CharSequence
objects with String objects (destroying any attempted re-serialization optimization).

Then comes NestedUpdateProcessorFactory, which triggers the auto-check-and-convert
*twice*, because getValue() previously just returned a pointer, so these extra calls
would have been optimized away. Both lines below iterate over all values *before* the
actual iteration by the explicit "for" loop:
{code:java}
  boolean isSingleVal = !(field.getValue() instanceof Collection);
  for (Object val : field) {
{code}
Then isAtomicUpdate(), and finally writeSolrInputDocument() (converting to JavaBin for
the transaction log), each trigger the extra iterate-over-all-values on every
inspection. If FieldMutatingUpdateProcessor hadn't already overwritten the
Utf8CharSequence values, all of these accesses would also have triggered a new
collection creation each time (and an additional iteration to create the new
collection) for every multi-valued string field.
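
Back-of-the-envelope cost of that chain (illustrative sketch only; the processor and
method names above are the real ones, the code below is not): if k stages each inspect
an N-value field through an accessor that walks the whole list, that's roughly k x N
element visits (and potentially k fresh lists) before any real indexing work happens.
{code:java}
import java.util.ArrayList;
import java.util.List;

class ChainCostSketch {
  static long elementVisits = 0;

  // Stand-in for the "inspect the field value" accessor each stage of the chain calls.
  static List<Object> inspect(List<Object> values) {
    for (Object ignored : values) {
      elementVisits++;                 // each inspection walks the full list
    }
    return new ArrayList<>(values);    // worst case: a fresh copy per inspection
  }

  public static void main(String[] args) {
    int n = 1000;                      // values in one multi-valued field
    List<Object> field = new ArrayList<>();
    for (int i = 0; i < n; i++) field.add("v" + i);

    // e.g. two accesses in the nested-doc processor, plus isAtomicUpdate()
    // and the transaction-log serialization
    int k = 4;
    for (int stage = 0; stage < k; stage++) inspect(field);

    System.out.println(elementVisits); // 4000 element visits for a single 1000-value field
  }
}
{code}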

h3. Server side query:
On the query side, we get a Lucene Document and then convert it into a SolrDocument.
BinaryResponseWriter uses MaskCharSeqSolrDocument, which inherits from SolrDocument,
to do the auto-convert-on-access stuff more thoroughly.
{code:java}
    for (IndexableField f : doc.getFields()) {
      final String fname = f.name();
      if (null == fieldNamesNeeded || fieldNamesNeeded.contains(fname) ) {
        // Make sure multivalued fields are represented as lists
        Object existing = out.get(fname);
{code}
For multi-valued fields, what we get back from Lucene is actually a flat list of all
the values in the whole document, so we need to collect the values for each field into
a list. If there are 1000 values in a field, the outer loop executes over 1000 times,
and each pass retrieves any existing value for the field by calling "out.get(fname)",
which triggers the auto-convert-on-access scan over all the values accumulated so far
(on average 1000/2). Hence we have the O(N^2) behavior that the original poster
reported.
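
Here is a self-contained model of that quadratic pattern (illustrative only, not the
actual Solr classes): accumulating N values into a map whose get() re-scans the list
built so far costs on the order of N^2/2 element visits.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QuadraticAccumulationSketch {
  static long elementVisits = 0;

  // Stand-in for the auto-convert-on-access get(): it scans every value stored so far.
  static Object convertingGet(Map<String, Object> out, String fname) {
    Object existing = out.get(fname);
    if (existing instanceof List) {
      for (Object ignored : (List<?>) existing) {
        elementVisits++;               // the hidden per-get scan
      }
    }
    return existing;
  }

  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    int n = 1000;                      // values of one multi-valued field
    Map<String, Object> out = new HashMap<>();
    String fname = "tags";

    for (int i = 0; i < n; i++) {      // outer loop: one pass per stored value
      Object existing = convertingGet(out, fname);
      if (existing == null) {
        List<Object> vals = new ArrayList<>();
        vals.add("v" + i);
        out.put(fname, vals);
      } else {
        ((List<Object>) existing).add("v" + i);
      }
    }
    System.out.println(elementVisits); // n*(n-1)/2 = 499500 hidden element visits
  }
}
{code}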

h3. Other:
It took a really long time to review some of this code (and I've only reviewed some),
often because of a lack of comments around non-obvious things. I thought there might
be lifetime/sharing bugs with BytesBlock, for example, until I realized that strings
are appended in the block rather than placed at the start.

Same issue for FastInputStream.readDirectUtf8... since it looked like it was sharing
the internal buffer, I thought there was a possible lifetime issue there. A single-line
comment in both of those cases would have saved me quite a bit of time.

Actually... looking at it again, there still may be a subtle sharing bug in this new
FastInputStream.readDirectUtf8. I can't say I quite understand the logic behind when
you can share the internal buffer and when you can't.
{code:java}
  if (in != null || end < pos + len) return false;
{code}
You can only share the buffer when the bytes you want are right up at the end of the
buffer? I'm not sure I understand the logic around that, but ChannelFastInputStream
(used by TransactionLog) derives from FastInputStream and doesn't have an input stream,
so if it got unlucky and did a string read at the end of a buffer, there would likely
be data corruption.
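
To illustrate the kind of failure mode I mean (a generic sketch, not the actual
FastInputStream or ChannelFastInputStream code): if a zero-copy read hands the caller
a view over the reader's internal buffer, a later refill of that buffer silently
changes what the caller sees.
{code:java}
import java.nio.charset.StandardCharsets;

class SharedBufferSketch {
  private final byte[] buf = new byte[16];   // the reader's internal buffer
  private int chunk = 0;

  // Simulate refilling the internal buffer with the next chunk of input.
  void refill() {
    byte[] next = ("chunk-" + (chunk++) + "-payload").getBytes(StandardCharsets.UTF_8);
    System.arraycopy(next, 0, buf, 0, Math.min(next.length, buf.length));
  }

  // "Zero-copy" read: hands back the live internal buffer instead of a copy.
  byte[] sharedView() {
    return buf;
  }

  public static void main(String[] args) {
    SharedBufferSketch reader = new SharedBufferSketch();
    reader.refill();
    byte[] view = reader.sharedView();
    String before = new String(view, StandardCharsets.UTF_8);

    reader.refill();                         // buffer gets reused for the next read
    String after = new String(view, StandardCharsets.UTF_8);

    // The value the caller kept has silently changed underneath it.
    System.out.println(before.equals(after)); // false
  }
}
{code}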

> javabin performance regressions
> -------------------------------
>
>                 Key: SOLR-14013
>                 URL: https://issues.apache.org/jira/browse/SOLR-14013
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.7
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>            Priority: Major
>         Attachments: test.json
>
>
> As noted by [~rrockenbaugh] in SOLR-13963, javabin also recently became 
> orders of magnitude slower in certain cases since v7.7.  The cases identified 
> so far include large numbers of values in a field.


