Hi Rahul,

Thanks for the suggestion. I was using `solr.NRTCachingDirectoryFactory` in
all the versions I've tested, which was the default in the example
solrconfig:

  <directoryFactory name="DirectoryFactory"
                    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

I'll give `solr.MMapDirectoryFactory` a try later though.
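
For reference, a minimal sketch of what that change would presumably look
like. It assumes the same element as above with only the fallback class
swapped and the rest of solrconfig.xml untouched; I have not tested it yet:

```xml
<!-- Sketch only: same directoryFactory element as above, with the fallback
     class changed from NRTCachingDirectoryFactory to MMapDirectoryFactory.
     Assumes nothing else in solrconfig.xml depends on the old factory. -->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
```

Since the class is read from the `solr.directoryFactory` system property
with a fallback, starting Solr with
`-Dsolr.directoryFactory=solr.MMapDirectoryFactory` should have the same
effect without editing the file.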

Thanks,
Calvin

p.s. Christine, I am following up on your suggestion, but it's taking a
while, since I have to build the index from scratch each time. More later...


On Tue, Feb 20, 2024 at 8:35 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Did you happen to change the DirectoryFactory in solrconfig to
> SimpleFSDirectoryFactory or NIOFSDirectoryFactory by any chance? Default is
> Mmap which is much more performant for atomic updates (and also practical,
> especially given the small(ish) size of your index).
>
> -Rahul
>
> On Tue, Feb 20, 2024 at 4:52 AM Christine Poerschke (BLOOMBERG/ LONDON) <
> cpoersc...@bloomberg.net> wrote:
>
> > Hello Calvin,
> >
> > Thank you for this wonderful issue write-up!
> >
> > You mention upgrading from Solr 6 to 9.5 versions and I wonder if it might
> > be practical or insightful to assess for some versions in between too, e.g.
> > 9.5/9.4/9.3 going backwards or 6/7/8/9.0 going forward or some sort of
> > binary search variant.
> >
> > Best wishes,
> > Christine
> >
> > From: users@solr.apache.org At: 02/16/24 17:59:59 UTC To:
> > solr-u...@lucene.apache.org
> > Subject: Partial update slowness with a stored="false" dynamic field and
> > lots of distinct field names
> >
> > Hi solr users,
> >
> > While tracking down a severe performance regression doing partial updates
> > when upgrading from Solr 6 to Solr 9.5.0, I discovered the following
> > unexpected behavior.
> >
> > In my schema.xml file, I have the following fields (among many others):
> >
> > <field name="id" type="string" indexed="true" stored="true"
> > required="true"/>
> > <field name="_version_" type="long" indexed="false" stored="false"/>
> > <field name="name" type="text_en_splitting_tight" indexed="true"
> > stored="true" omitNorms="true"/>
> >
> > <dynamicField name="playlist_index_*" type="int" .../> <!-- The int field
> > type has docValues="true" -->
> >
> > The unexpected impact on partial update performance depends on whether the
> > dynamic field above is stored (with `docValues="true"`). The index in
> > question contains about 30 million documents and is 15GB in size, and there
> > are a large number of distinct `playlist_index_*` field names (more than
> > 100K). The purpose of that dynamic field is to support sorting, with
> > queries sometimes specifying a sort like `playlist_index_3141 asc` to sort
> > the results by that field.
> >
> > For all the cases below, I added to the existing index the following 25 new
> > documents:
> >
> > [
> >   {"id": "12345678901234567891", "name": "."},
> >   {"id": "12345678901234567892", "name": "."},
> >   {"id": "12345678901234567893", "name": "."},
> >   {"id": "12345678901234567894", "name": "."},
> >   {"id": "12345678901234567895", "name": "."},
> >   {"id": "12345678901234567896", "name": "."},
> >   {"id": "12345678901234567897", "name": "."},
> >   {"id": "12345678901234567898", "name": "."},
> >   {"id": "12345678901234567899", "name": "."},
> >   {"id": "12345678901234567900", "name": "."},
> >   {"id": "12345678901234567901", "name": "."},
> >   {"id": "12345678901234567902", "name": "."},
> >   {"id": "12345678901234567903", "name": "."},
> >   {"id": "12345678901234567904", "name": "."},
> >   {"id": "12345678901234567995", "name": "."},
> >   {"id": "12345678901234567996", "name": "."},
> >   {"id": "12345678901234567997", "name": "."},
> >   {"id": "12345678901234567998", "name": "."},
> >   {"id": "12345678901234567999", "name": "."},
> >   {"id": "12345678901234568000", "name": "."},
> >   {"id": "12345678901234568001", "name": "."},
> >   {"id": "12345678901234568002", "name": "."},
> >   {"id": "12345678901234568003", "name": "."},
> >   {"id": "12345678901234568004", "name": "."},
> >   {"id": "12345678901234568005", "name": "."}
> > ]
> >
> > and did both a soft and hard commit before continuing.
> >
> > Then I did a partial update with `commitWithin=600000` to update each of
> > those documents:
> >
> > [
> >   {"id": "12345678901234567891", "name": {"set": "1"}},
> >   {"id": "12345678901234567892", "name": {"set": "2"}},
> >   {"id": "12345678901234567893", "name": {"set": "3"}},
> >   {"id": "12345678901234567894", "name": {"set": "4"}},
> >   {"id": "12345678901234567895", "name": {"set": "5"}},
> >   {"id": "12345678901234567896", "name": {"set": "6"}},
> >   {"id": "12345678901234567897", "name": {"set": "7"}},
> >   {"id": "12345678901234567898", "name": {"set": "8"}},
> >   {"id": "12345678901234567899", "name": {"set": "9"}},
> >   {"id": "12345678901234567900", "name": {"set": "10"}},
> >   {"id": "12345678901234567901", "name": {"set": "11"}},
> >   {"id": "12345678901234567902", "name": {"set": "12"}},
> >   {"id": "12345678901234567903", "name": {"set": "13"}},
> >   {"id": "12345678901234567904", "name": {"set": "14"}},
> >   {"id": "12345678901234567995", "name": {"set": "15"}},
> >   {"id": "12345678901234567996", "name": {"set": "16"}},
> >   {"id": "12345678901234567997", "name": {"set": "17"}},
> >   {"id": "12345678901234567998", "name": {"set": "18"}},
> >   {"id": "12345678901234567999", "name": {"set": "19"}},
> >   {"id": "12345678901234568000", "name": {"set": "20"}},
> >   {"id": "12345678901234568001", "name": {"set": "21"}},
> >   {"id": "12345678901234568002", "name": {"set": "22"}},
> >   {"id": "12345678901234568003", "name": {"set": "23"}},
> >   {"id": "12345678901234568004", "name": {"set": "24"}},
> >   {"id": "12345678901234568005", "name": {"set": "25"}}
> > ]
> >
> > The time it takes to perform the update varies drastically depending on
> > whether the `playlist_index_*` dynamic field is stored:
> >
> > - 0.017s: <dynamicField name="playlist_index_*" type="int"
> >   indexed="true"  stored="true"  docValues="true"/>
> > - 0.016s: <dynamicField name="playlist_index_*" type="int"
> >   indexed="false" stored="true"  docValues="true"/>
> > - 8.850s: <dynamicField name="playlist_index_*" type="int"
> >   indexed="false" stored="false" docValues="true"/>
> > - 8.867s: <dynamicField name="playlist_index_*" type="int"
> >   indexed="true"  stored="false" docValues="true"/>
> >
> > The surprise is the poor performance of the last two, when stored is false
> > and so the partial update may need to use the docValues to generate the new
> > doc, compared to the first two when stored is true.
> >
> > When I profiled the code for the last of the four settings above, I saw
> > that the majority of the time was spent in
> > `org.apache.solr.search.SolrDocumentFetcher.decorateDocValueFields(SolrDocumentBase,
> > int, Set, DocValuesIteratorCache)`, and further down in the callees under
> > that hot spot there is a call to
> > `org.apache.solr.search.DocValuesIteratorCache.newEntry(String)`. I added
> > some print statements to it to see which fields `newEntry` was being called
> > for. There were many thousands of calls for the `playlist_index_*` fields,
> > each with a different field name, and none for any other fields.
> >
> > My schema does have other dynamic fields, but none of them resulted in
> > calls to `newEntry`. What is distinct about this dynamic field is that it's
> > the only one that is a numeric field type and the only one that has more
> > than a few hundred or so distinct field names.
> >
> > Is it expected behavior that Solr has to make so many `newEntry` calls
> > for that dynamic field to perform the update that it seriously impacts
> > update performance?
> >
> > Thanks for your time,
> > Calvin
> >
> >
> >
>
