+1 to remove Accountable from Lucene's reader classes. Let's open an issue and discuss there?
In the past, when we added Accountable, Lucene's Codec/LeafReaders used quite a bit of heap, and the Accountable implementation was much closer to correct (as measured by percentage difference). But now that we've moved nearly everything off heap, the relatively small amount of RAM in use is really not worth tracking. I think John's use case, carefully relying on the precisely correct number when aggregated across many readers with many fields and maybe many segments, highlights how dangerous it is for Lucene to claim it is reporting heap usage when it can be substantially off (again as a percentage difference).

Mike McCandless
http://blog.mikemccandless.com

On Thu, May 28, 2020 at 2:51 AM Adrien Grand <jpou...@gmail.com> wrote:

> To be clear, there is no plan to remove RAM accounting from readers yet; this is just something that I have been thinking about recently, so your use-case caught my attention.
>
> Given how low the memory usage is nowadays, I believe that it would be extremely hard to make sure that RAM estimates are wrong by less than 2x in most cases. This is due to the fact that everything counts now, so things that we historically ignored for memory usage, such as the maps you pointed out, can no longer be ignored. Unfortunately, these maps are not the only thing that we ignored that we should no longer ignore: there are also field infos, segment infos, BufferedIndexInput buffers, ... not to mention things that we cache in threadlocals, as we would now need to count the number of threads that have a cache entry.
>
> I'd suggest looking into opportunities to better contain the number of fields and segments as David suggested, and caching a fixed number of readers at most instead of relying on RAM accounting.
>
>
> On Wed, May 27, 2020 at 11:21 PM John Wang <john.w...@gmail.com> wrote:
>
>> Thanks Adrien!
>>
>> It is surprising to learn this is an invalid use case and that Lucene is planning to get rid of memory accounting...
>>
>> There are indeed many fields in our test: 1000 numeric doc values fields and 5 million docs in 1 segment. (We will have many segments in our production use case.)
>>
>> We found that the memory usage when accounting for the elements in the maps vs. the default behavior is 363456 vs. 59216, almost a 600% difference.
>>
>> We have deployments with much more than 1000 fields, so I don't think that is extreme.
>>
>> Our use case:
>>
>> We will have many segments/readers, and we found opening them at query time is expensive. So we are caching them.
>>
>> Since we don't know the data ahead of time, we are using the reader's accounted memory as the cache size.
>>
>> We found the reader's accounting is unreliable, and dug into it and found this.
>>
>> If we should not be using this, what would be the correct way to handle this?
>>
>> Thank you
>>
>> -John
>>
>>
>> On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> A couple major versions ago, Lucene required tons of heap memory to keep a reader open, e.g. norms were on heap and so on. To my knowledge, the only thing that is now kept in memory and is a function of maxDoc is live docs; all other codec components require very little memory. I'm actually wondering whether we should remove memory accounting on readers. When Lucene used tons of memory we could focus on the main contributors to memory usage and be mostly correct.
>>> But now, given how little memory Lucene is using, it's quite hard to figure out what the main contributing factors to memory usage are. And it's probably not that useful either: why is it important to know how much memory something is using if it's not much?
>>>
>>> So I'd be curious to know more about your use-case for reader caching. Would we break your use-case if we removed memory accounting on readers? Given the lines that you are pointing out, I believe that you either have many fields or many segments if these maps are using lots of memory?
>>>
>>>
>>> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a reader cache that depends on the memory usage for each reader. We found the calculation of reader size for doc values to be undercounting.
>>>>
>>>> See line:
>>>>
>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>>>
>>>> Looks like the memory estimate is only using the shallow size of the class, and does not include the objects stored in the maps:
>>>>
>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>>>
>>>> We made a local patch and saw a significant difference in reported size.
>>>>
>>>> Please let us know if this is the right thing to do; we are happy to contribute our patch.
>>>>
>>>> Thanks
>>>>
>>>> -John
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>
>
>
> --
> Adrien
>
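
For illustration only, below is a minimal sketch of the kind of fixed-count reader cache Adrien suggests as an alternative to weighing cache entries by ramBytesUsed(). The ReaderCache class, its String keys, the eviction policy, and the choice of DirectoryReader are assumptions made for this example, not code from the thread; only DirectoryReader.open() and IndexReader's incRef()/decRef() reference counting are existing Lucene API.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.Directory;

    /** Count-bounded LRU cache of readers; evicted readers are released, not closed outright. */
    public class ReaderCache {

      private final int maxReaders;
      private final Map<String, DirectoryReader> lru;

      public ReaderCache(int maxReaders) {
        this.maxReaders = maxReaders;
        // Access-ordered LinkedHashMap gives simple LRU semantics.
        this.lru = new LinkedHashMap<String, DirectoryReader>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
            if (size() <= ReaderCache.this.maxReaders) {
              return false;
            }
            try {
              // Drop the cache's reference; the reader closes once callers release theirs.
              eldest.getValue().decRef();
            } catch (IOException e) {
              throw new UncheckedIOException(e);
            }
            return true;
          }
        };
      }

      /** Returns a reader for the key, opening one if absent. Callers must call release() when done. */
      public synchronized DirectoryReader acquire(String key, Directory dir) throws IOException {
        DirectoryReader reader = lru.get(key);
        if (reader == null) {
          reader = DirectoryReader.open(dir); // refCount == 1, owned by the cache
          lru.put(key, reader);
        }
        reader.incRef(); // extra reference for the caller
        return reader;
      }

      /** Releases a reader previously returned by acquire(). */
      public void release(DirectoryReader reader) throws IOException {
        reader.decRef();
      }
    }

A cache like this bounds the number of open readers directly, so it no longer depends on how accurate the per-reader RAM estimates are; sizing maxReaders is left to the application.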