I opened https://issues.apache.org/jira/browse/LUCENE-9387.

-- Adrien
On Thu, May 28, 2020 at 2:41 PM Michael McCandless <luc...@mikemccandless.com> wrote:

+1 to remove Accountable from Lucene's reader classes. Let's open an issue and discuss there?

In the past, when we added Accountable, Lucene's Codec/LeafReaders used quite a bit of heap, and the implementation was much closer to correct (as measured by percentage difference).

But now that we've moved nearly everything off heap, the relatively small amount of RAM in use is really not worth tracking.

I think John's use case, which carefully relies on a precisely correct number aggregated across many readers with many fields and maybe many segments, highlights how dangerous it is for Lucene to claim it is reporting heap usage when that number can be off by a large percentage.

Mike McCandless
http://blog.mikemccandless.com

On Thu, May 28, 2020 at 2:51 AM Adrien Grand <jpou...@gmail.com> wrote:

To be clear, there is no plan to remove RAM accounting from readers yet; this is just something I have been thinking about recently, so your use case caught my attention.

Given how low the memory usage is nowadays, I believe it would be extremely hard to guarantee that RAM estimates are off by less than 2x in most cases. Everything counts now, so things that we historically ignored for memory usage, such as the maps you pointed out, can no longer be ignored. Unfortunately, those maps are not the only omission: there are also field infos, segment infos, BufferedIndexInput buffers, ... not to mention things that we cache in thread locals, as we would now need to count the number of threads that hold a cache entry.

I'd suggest looking into opportunities to better contain the number of fields and segments, as David suggested, and caching a fixed number of readers at most instead of relying on RAM accounting.

-- Adrien
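For illustration, a minimal sketch of such a fixed-count reader cache. The String key, the MAX_READERS bound, and the eviction policy are assumptions rather than anything from this thread; it leans on IndexReader's incRef/decRef so that evicting a reader cannot close it out from under an in-flight search.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.lucene.index.DirectoryReader;

    // Bound the cache by reader count instead of estimated RAM.
    final class FixedSizeReaderCache {
      private static final int MAX_READERS = 128; // illustrative; tune for your workload

      // An access-ordered LinkedHashMap gives simple LRU eviction.
      private final Map<String, DirectoryReader> lru =
          new LinkedHashMap<String, DirectoryReader>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
              if (size() <= MAX_READERS) {
                return false;
              }
              try {
                eldest.getValue().decRef(); // drop the cache's reference to the evicted reader
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
              return true;
            }
          };

      synchronized DirectoryReader acquire(String key) {
        DirectoryReader reader = lru.get(key);
        if (reader != null) {
          reader.incRef(); // caller must decRef() when done searching
        }
        return reader;
      }

      synchronized void put(String key, DirectoryReader reader) throws IOException {
        reader.incRef(); // the cache holds its own reference
        DirectoryReader previous = lru.put(key, reader);
        if (previous != null) {
          previous.decRef(); // release the reference held for the replaced entry
        }
      }
    }

A count bound like this sidesteps the estimate-accuracy problem entirely, at the cost of a less precise memory ceiling.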
On Wed, May 27, 2020 at 11:21 PM John Wang <john.w...@gmail.com> wrote:

Thanks Adrien!

It is surprising to learn this is an invalid use case and that Lucene is planning to get rid of memory accounting...

There are indeed many fields in our test: 1,000 numeric doc values fields and 5 million docs in one segment. (We will have many segments in our production use case.)

We found that memory usage computed by accounting for the elements in the maps versus the default behavior is 363,456 bytes to 59,216 bytes, a roughly sixfold difference.

We have deployments with far more than 1,000 fields, so I don't think that is extreme.

Our use case: we will have many segments/readers, and we found that opening them at query time is expensive, so we are caching them. Since we don't know the data ahead of time, we use each reader's accounted memory to size the cache. We found the reader's accounting unreliable, dug into it, and found this.

If we should not be using this, what would be the correct way to handle it?

Thank you

-John

On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:

A couple of major versions ago, Lucene required tons of heap memory to keep a reader open, e.g. norms were on heap, and so on. To my knowledge, the only thing that is now kept in memory and is a function of maxDoc is live docs; all other codec components require very little memory.

I'm actually wondering whether we should remove memory accounting on readers. When Lucene used tons of memory, we could focus on the main contributors to memory usage and be mostly correct. But now, given how little memory Lucene uses, it's quite hard to figure out the main contributing factors to memory usage. And it's probably not that useful either: why is it important to know how much memory something is using if it isn't much?

So I'd be curious to know more about your use case for reader caching. Would we break your use case if we removed memory accounting on readers? Given the lines you are pointing out, I believe you must have either many fields or many segments if these maps are using lots of memory?

-- Adrien

On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:

Hello,

We have a reader cache that depends on the memory usage of each reader. We found that the calculation of reader size for doc values is undercounting.

See this line:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69

It looks like the memory estimate uses only the shallow size of the class and does not include the objects stored in the maps:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55

We made a local patch and saw a significant difference in the reported size.

Please let us know if this is the right thing to do; we are happy to contribute our patch.

Thanks

-John
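For illustration, a sketch of the deeper accounting the patch describes, counting the per-field entry maps rather than only the shallow instance size. The class and its maps below are stand-ins, not the actual Lucene80DocValuesProducer internals or the patch itself, and it assumes RamUsageEstimator.sizeOfMap, which recent Lucene versions provide.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.util.Accountable;
    import org.apache.lucene.util.RamUsageEstimator;

    // Report the memory held by per-field entry maps, not just the shallow class size.
    final class DeepRamAccounting implements Accountable {
      private static final long BASE_RAM_BYTES_USED =
          RamUsageEstimator.shallowSizeOfInstance(DeepRamAccounting.class);

      // Stand-ins for the producer's per-field entry maps.
      private final Map<String, long[]> numerics = new HashMap<>();
      private final Map<String, long[]> binaries = new HashMap<>();

      @Override
      public long ramBytesUsed() {
        // sizeOfMap walks keys and values, so every field's entry is counted;
        // the shallow size alone misses everything the maps reference.
        return BASE_RAM_BYTES_USED
            + RamUsageEstimator.sizeOfMap(numerics)
            + RamUsageEstimator.sizeOfMap(binaries);
      }
    }

In a real producer the total would more likely be computed once at open time, since walking every map on each ramBytesUsed() call is not free.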