Hi solr users,

While tracking down a severe performance regression doing partial updates
when upgrading from Solr 6 to solr 9.5.0, I discovered the following
unexpected behavior.

In my schema.xml file, I have the following fields (among many others):

<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="name" type="text_en_splitting_tight" indexed="true"
stored="true" omitNorms="true"/>

<dynamicField name="playlist_index_*" type="int" .../> <!-- The int field
type has docValues="true" -->

The unexpected impact on partial update performance depends on whether the
dynamic field above is stored (with `docValues="true"`). The index in
question contains about 30 million documents and is 15GB in size, and there
are a large number of distinct `playlist_index_*`` field names (more than
100K). The purpose of that dynamic field is to support sorting, with
queries sometimes specifying a sort like `playlist_index_3141 asc` to sort
the results by that field.

For all the cases below, I added to the existing index the following 25 new
documents:

[
  {"id": "12345678901234567891", "name": "."},
  {"id": "12345678901234567892", "name": "."},
  {"id": "12345678901234567893", "name": "."},
  {"id": "12345678901234567894", "name": "."},
  {"id": "12345678901234567895", "name": "."},
  {"id": "12345678901234567896", "name": "."},
  {"id": "12345678901234567897", "name": "."},
  {"id": "12345678901234567898", "name": "."},
  {"id": "12345678901234567899", "name": "."},
  {"id": "12345678901234567900", "name": "."},
  {"id": "12345678901234567901", "name": "."},
  {"id": "12345678901234567902", "name": "."},
  {"id": "12345678901234567903", "name": "."},
  {"id": "12345678901234567904", "name": "."},
  {"id": "12345678901234567995", "name": "."},
  {"id": "12345678901234567996", "name": "."},
  {"id": "12345678901234567997", "name": "."},
  {"id": "12345678901234567998", "name": "."},
  {"id": "12345678901234567999", "name": "."},
  {"id": "12345678901234568000", "name": "."},
  {"id": "12345678901234568001", "name": "."},
  {"id": "12345678901234568002", "name": "."},
  {"id": "12345678901234568003", "name": "."},
  {"id": "12345678901234568004", "name": "."},
  {"id": "12345678901234568005", "name": "."}
]

and did both a soft and hard commit before continuing.

Then I did a partial update with `commitWithin=600000` to update each of
those documents:

[
  {"id": "12345678901234567891", "name": {"set": "1"}},
  {"id": "12345678901234567892", "name": {"set": "2"}},
  {"id": "12345678901234567893", "name": {"set": "3"}},
  {"id": "12345678901234567894", "name": {"set": "4"}},
  {"id": "12345678901234567895", "name": {"set": "5"}},
  {"id": "12345678901234567896", "name": {"set": "6"}},
  {"id": "12345678901234567897", "name": {"set": "7"}},
  {"id": "12345678901234567898", "name": {"set": "8"}},
  {"id": "12345678901234567899", "name": {"set": "9"}},
  {"id": "12345678901234567900", "name": {"set": "10"}},
  {"id": "12345678901234567901", "name": {"set": "11"}},
  {"id": "12345678901234567902", "name": {"set": "12"}},
  {"id": "12345678901234567903", "name": {"set": "13"}},
  {"id": "12345678901234567904", "name": {"set": "14"}},
  {"id": "12345678901234567995", "name": {"set": "15"}},
  {"id": "12345678901234567996", "name": {"set": "16"}},
  {"id": "12345678901234567997", "name": {"set": "17"}},
  {"id": "12345678901234567998", "name": {"set": "18"}},
  {"id": "12345678901234567999", "name": {"set": "19"}},
  {"id": "12345678901234568000", "name": {"set": "20"}},
  {"id": "12345678901234568001", "name": {"set": "21"}},
  {"id": "12345678901234568002", "name": {"set": "22"}},
  {"id": "12345678901234568003", "name": {"set": "23"}},
  {"id": "12345678901234568004", "name": {"set": "24"}},
  {"id": "12345678901234568005", "name": {"set": "25"}}
]

The time it takes to perform the update varies drastically depending on
whether the `playlist_index_*` dynamic field is stored:

- 0.017s: <dynamicField name="playlist_index_*" type="int" indexed="true"
 stored="true"  docValues="true"/>
- 0.016s: <dynamicField name="playlist_index_*" type="int" indexed="false"
stored="true"  docValues="true"/>
- 8.850s: <dynamicField name="playlist_index_*" type="int" indexed="false"
stored="false" docValues="true"/>
- 8.867s: <dynamicField name="playlist_index_*" type="int" indexed="true"
 stored="false" docValues="true"/>

The surprise is the poor performance of the last two, when stored is false
and so the partial update may need to use the docValues to generate the new
doc, compared to the first two when stored is true.

When I profiled the code for the last of the four settings above, I saw
that the majority of the time was spent in
`org.apache.solr.search.SolrDocumentFetcher.decorateDocValueFields(SolrDocumentBase,
int, Set, DocValuesIteratorCache)` and further down in the callees under
that hot spot there is a call to
`org.apache.solr.search.DocValuesIteratorCache.newEntry(String)` that I
added some print statements to in order to see which fields `newEntry` was
being called for. There were many thousands of calls to the
`playlist_index_*` fields, with each call being a different field name, and
no other fields.

My schema does have other dynamic fields, but none of them resulted in
calls to `newEntry`. What is distinct about this dynamic field is that it's
the only one that is a numeric field type and the only one that has more
than a few hundred or so distinct field names.

Does this seem like expected behavior that solr has to make so many
`newEntry` calls for that dynamic field to perform the update that it
seriously impacts update performance?

Thanks for your time,
Calvin

Reply via email to