I opened https://issues.apache.org/jira/browse/LUCENE-9387.

-- Adrien
On Thu, May 28, 2020 at 2:41 PM Michael McCandless <luc...@mikemccandless.com> wrote:

+1 to remove Accountable from Lucene's reader classes. Let's open an issue and discuss there?

In the past, when we added Accountable, Lucene's Codec/LeafReaders used quite a bit of heap, and the implementation was much closer to correct (as measured by percentage difference).

But now that we've moved nearly everything off heap, the relatively small amount of RAM in use is really not worth tracking.

I think John's use case, which carefully relies on a precisely correct number aggregated across many readers with many fields and maybe many segments, highlights how dangerous it is for Lucene to claim it is reporting heap usage when that number can be off by a large percentage.

Mike McCandless
http://blog.mikemccandless.com

On Thu, May 28, 2020 at 2:51 AM Adrien Grand <jpou...@gmail.com> wrote:

To be clear, there is no plan to remove RAM accounting from readers yet; this is just something I have been thinking about recently, so your use case caught my attention.

Given how low the memory usage is nowadays, I believe it would be extremely hard to guarantee that RAM estimates are off by less than 2x in most cases. Everything counts now, so things that we historically ignored for memory usage, such as the maps you pointed out, can no longer be ignored. Unfortunately, those maps are not the only omission: there are also field infos, segment infos, BufferedIndexInput buffers, ... not to mention things that we cache in thread locals, as we would now need to count the number of threads that hold a cache entry.

I'd suggest looking into opportunities to better contain the number of fields and segments, as David suggested, and caching a fixed number of readers at most instead of relying on RAM accounting.

-- Adrien
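For illustration, a minimal sketch of such a fixed-count reader cache. The String key, the MAX_READERS bound, and the eviction policy are assumptions rather than anything from this thread; it leans on IndexReader's incRef/decRef so that evicting a reader cannot close it out from under an in-flight search.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.lucene.index.DirectoryReader;

    // Bound the cache by reader count instead of estimated RAM.
    final class FixedSizeReaderCache {
      private static final int MAX_READERS = 128; // illustrative; tune for your workload

      // An access-ordered LinkedHashMap gives simple LRU eviction.
      private final Map<String, DirectoryReader> lru =
          new LinkedHashMap<String, DirectoryReader>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
              if (size() <= MAX_READERS) {
                return false;
              }
              try {
                eldest.getValue().decRef(); // drop the cache's reference to the evicted reader
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
              return true;
            }
          };

      synchronized DirectoryReader acquire(String key) {
        DirectoryReader reader = lru.get(key);
        if (reader != null) {
          reader.incRef(); // caller must decRef() when done searching
        }
        return reader;
      }

      synchronized void put(String key, DirectoryReader reader) throws IOException {
        reader.incRef(); // the cache holds its own reference
        DirectoryReader previous = lru.put(key, reader);
        if (previous != null) {
          previous.decRef(); // release the reference held for the replaced entry
        }
      }
    }

A count bound like this sidesteps the estimate-accuracy problem entirely, at the cost of a less precise memory ceiling.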
On Wed, May 27, 2020 at 11:21 PM John Wang <john.w...@gmail.com> wrote:

Thanks Adrien!

It is surprising to learn this is an invalid use case and that Lucene is planning to get rid of memory accounting...

There are indeed many fields in our test: 1,000 numeric doc values fields and 5 million docs in one segment. (We will have many segments in our production use case.)

We found that memory usage computed by accounting for the elements in the maps versus the default behavior is 363,456 bytes to 59,216 bytes, a roughly sixfold difference.

We have deployments with far more than 1,000 fields, so I don't think that is extreme.

Our use case: we will have many segments/readers, and we found that opening them at query time is expensive, so we are caching them. Since we don't know the data ahead of time, we use each reader's accounted memory to size the cache. We found the reader's accounting unreliable, dug into it, and found this.

If we should not be using this, what would be the correct way to handle it?

Thank you

-John

On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:

A couple of major versions ago, Lucene required tons of heap memory to keep a reader open, e.g. norms were on heap, and so on. To my knowledge, the only thing that is now kept in memory and is a function of maxDoc is live docs; all other codec components require very little memory.

I'm actually wondering whether we should remove memory accounting on readers. When Lucene used tons of memory, we could focus on the main contributors to memory usage and be mostly correct. But now, given how little memory Lucene uses, it's quite hard to figure out the main contributing factors to memory usage. And it's probably not that useful either: why is it important to know how much memory something is using if it isn't much?

So I'd be curious to know more about your use case for reader caching. Would we break your use case if we removed memory accounting on readers? Given the lines you are pointing out, I believe you must have either many fields or many segments if these maps are using lots of memory?

-- Adrien

On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:

Hello,

We have a reader cache that depends on the memory usage of each reader. We found that the calculation of reader size for doc values is undercounting.

See this line:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69

It looks like the memory estimate uses only the shallow size of the class and does not include the objects stored in the maps:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55

We made a local patch and saw a significant difference in the reported size.

Please let us know if this is the right thing to do; we are happy to contribute our patch.

Thanks

-John
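For illustration, a sketch of the deeper accounting the patch describes, counting the per-field entry maps rather than only the shallow instance size. The class and its maps below are stand-ins, not the actual Lucene80DocValuesProducer internals or the patch itself, and it assumes RamUsageEstimator.sizeOfMap, which recent Lucene versions provide.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.util.Accountable;
    import org.apache.lucene.util.RamUsageEstimator;

    // Report the memory held by per-field entry maps, not just the shallow class size.
    final class DeepRamAccounting implements Accountable {
      private static final long BASE_RAM_BYTES_USED =
          RamUsageEstimator.shallowSizeOfInstance(DeepRamAccounting.class);

      // Stand-ins for the producer's per-field entry maps.
      private final Map<String, long[]> numerics = new HashMap<>();
      private final Map<String, long[]> binaries = new HashMap<>();

      @Override
      public long ramBytesUsed() {
        // sizeOfMap walks keys and values, so every field's entry is counted;
        // the shallow size alone misses everything the maps reference.
        return BASE_RAM_BYTES_USED
            + RamUsageEstimator.sizeOfMap(numerics)
            + RamUsageEstimator.sizeOfMap(binaries);
      }
    }

In a real producer the total would more likely be computed once at open time, since walking every map on each ramBytesUsed() call is not free.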