Thanks David for the info! -John
On Wed, May 27, 2020 at 8:03 PM David Smiley <dsmi...@apache.org> wrote:

> John: you may benefit from more eagerly merging small segments on commit.
> At Salesforce we have a *ton* of indexes, and we cut the segment count in
> half from the default. The large number of fields was a positive factor in
> making this a desirable trade-off. You might look at this recent issue,
> https://issues.apache.org/jira/browse/LUCENE-8962, which isn't released
> yet, but in it I show (with PRs to code) how to accomplish this without
> hacking on Lucene itself. You may also find this conference presentation I
> gave with my colleagues interesting, which touches on this:
> https://youtu.be/hqeYAnsxPH8?t=855
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Wed, May 27, 2020 at 5:21 PM John Wang <john.w...@gmail.com> wrote:
>
>> Thanks Adrien!
>>
>> It is surprising to learn that this is an invalid use case and that
>> Lucene is planning to get rid of memory accounting...
>>
>> In our test there are indeed many fields: 1000 numeric doc values fields
>> and 5 million docs in 1 segment. (We will have many segments in our
>> production use case.)
>>
>> Accounting for the elements in the maps reports a memory usage of 363456
>> versus 59216 with the default behavior, almost a 600% difference.
>>
>> We have deployments with far more than 1000 fields, so I don't think
>> that is extreme.
>>
>> Our use case: we will have many segments/readers, and we found that
>> opening them at query time is expensive, so we are caching them.
>>
>> Since we don't know the data ahead of time, we use the reader's
>> accounted memory as the cache size.
>>
>> We found the reader's accounting to be unreliable, dug into it, and
>> found this.
>>
>> If we should not be using this, what would be the correct way to handle
>> it?
>>
>> Thank you
>>
>> -John
>>
>> On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> A couple of major versions ago, Lucene required tons of heap memory to
>>> keep a reader open, e.g. norms were on heap and so on. To my knowledge,
>>> the only thing that is now kept in memory as a function of maxDoc is
>>> live docs; all other codec components require very little memory. I'm
>>> actually wondering whether we should remove memory accounting on
>>> readers. When Lucene used tons of memory, we could focus on the main
>>> contributors to memory usage and be mostly correct. But now, given how
>>> little memory Lucene is using, it's quite hard to figure out what the
>>> main contributing factors to memory usage are. And it's probably not
>>> that useful either: why is it important to know how much memory
>>> something is using if it's not much?
>>>
>>> So I'd be curious to know more about your use case for reader caching.
>>> Would we break your use case if we removed memory accounting on
>>> readers? Given the lines that you are pointing out, I believe you must
>>> have either many fields or many segments if these maps are using lots
>>> of memory?
>>>
>>> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a reader cache that depends on the memory usage of each
>>>> reader. We found that the calculation of reader size for doc values
>>>> is undercounting.
>>>>
>>>> See this line:
>>>>
>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>>>
>>>> It looks like the memory estimate uses only the shallow size of the
>>>> class and does not include the objects stored in the maps:
>>>>
>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>>>
>>>> We made a local patch and saw a significant difference in the
>>>> reported size.
>>>>
>>>> Please let us know if this is the right thing to do; we are happy to
>>>> contribute our patch.
>>>>
>>>> Thanks
>>>>
>>>> -John
>>>
>>> --
>>> Adrien
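[Editor's note] To make the shallow-vs-deep distinction in the thread concrete, here is a small, self-contained sketch in plain Java. It is not Lucene code: the class, field names, and per-object byte constants below are all illustrative assumptions in the spirit of what Lucene's RamUsageEstimator approximates, not actual JVM or Lucene numbers. It only shows why a shallow estimate stays constant while a deep estimate grows with the number of per-field map entries:

```java
import java.util.HashMap;
import java.util.Map;

public class DeepVsShallow {
    // Rough cost assumptions for a 64-bit JVM with compressed oops.
    // Real values vary by JVM; these are illustrative only.
    static final long OBJECT_HEADER = 16;
    static final long REF = 4;
    static final long MAP_ENTRY_OVERHEAD = 40; // HashMap.Node + table slot

    // Shallow estimate: only the producer object itself (a few references),
    // ignoring everything the maps point to.
    static long shallowBytes() {
        return OBJECT_HEADER + 4 * REF;
    }

    // Deep estimate: also charge for every entry held in the
    // field-name -> metadata maps.
    static long deepBytes(Map<String, long[]> fieldMeta) {
        long bytes = shallowBytes();
        for (Map.Entry<String, long[]> e : fieldMeta.entrySet()) {
            bytes += MAP_ENTRY_OVERHEAD;
            bytes += OBJECT_HEADER + 2L * e.getKey().length();  // key chars
            bytes += OBJECT_HEADER + 8L * e.getValue().length;  // long[] payload
        }
        return bytes;
    }

    public static void main(String[] args) {
        Map<String, long[]> fieldMeta = new HashMap<>();
        // 1000 doc values fields, as in the test described above.
        for (int i = 0; i < 1000; i++) {
            fieldMeta.put("field_" + i, new long[8]); // hypothetical per-field metadata
        }
        System.out.println("shallow = " + shallowBytes());
        System.out.println("deep    = " + deepBytes(fieldMeta));
    }
}
```

With 1000 fields the deep estimate is orders of magnitude larger than the shallow one, which is the shape of the discrepancy reported in the thread: the shallow number never changes no matter how many fields the segment has, so a cache sized by it will badly underestimate per-reader cost.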