Thanks David for the info!

-John

On Wed, May 27, 2020 at 8:03 PM David Smiley <dsmi...@apache.org> wrote:

> John: you may benefit from more eagerly merging small segments on commit.
> At Salesforce we have a *ton* of indexes, and we cut the segment count
> roughly in half relative to the default.  The large number of fields made
> this an especially desirable trade-off.  You might look at this recent issue,
> https://issues.apache.org/jira/browse/LUCENE-8962, which isn't released yet,
> but in it I show (with PRs to the code) how to accomplish this without
> hacking on Lucene itself.  You may also find this conference presentation I
> gave with my colleagues interesting, as it touches on this:
> https://youtu.be/hqeYAnsxPH8?t=855
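>
> To be concrete, the snippet below is *not* the merge-on-commit hook from
> LUCENE-8962; it's just a rough sketch of the more general idea of trading a
> little extra merge work for fewer segments, using the long-standing
> TieredMergePolicy knobs (the numbers are placeholders, not recommendations):
>
>     import org.apache.lucene.index.IndexWriterConfig;
>     import org.apache.lucene.index.TieredMergePolicy;
>
>     // Sketch: fewer, larger segments via ordinary merge-policy tuning.
>     TieredMergePolicy mp = new TieredMergePolicy();
>     mp.setSegmentsPerTier(5.0);   // default is 10; keep fewer segments per tier
>     mp.setFloorSegmentMB(16.0);   // small flushed segments get merged away sooner
>     IndexWriterConfig config = new IndexWriterConfig().setMergePolicy(mp);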
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 27, 2020 at 5:21 PM John Wang <john.w...@gmail.com> wrote:
>
>> Thanks Adrien!
>>
>> It is surprising to learn that this is considered an invalid use case and
>> that Lucene is considering getting rid of memory accounting...
>>
>> There are indeed many fields. Our test uses 1000 numeric doc values
>> fields and 5 million docs in a single segment. (We will have many
>> segments in our production use case.)
>>
>> With the elements in the maps accounted for, the reported memory usage is
>> 363456 bytes versus 59216 bytes with the default behavior, roughly a
>> sixfold difference.
>>
>> We have deployments with many more than 1000 fields, so I don't think
>> that is extreme.
>>
>> Our use case:
>>
>> We will have many segments/readers, and we found that opening them at
>> query time is expensive, so we are caching them.
>>
>> Since we don't know the data ahead of time, we use each reader's
>> accounted memory as its size in the cache.
>>
>> We found the reader's accounting to be unreliable, dug into it, and
>> found this.
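>>
>> Roughly what we do, as a simplified sketch (the helper name below is ours,
>> not Lucene's):
>>
>>     import org.apache.lucene.index.DirectoryReader;
>>     import org.apache.lucene.index.LeafReaderContext;
>>     import org.apache.lucene.util.Accountable;
>>
>>     // Simplified sketch: the weight we hand to the cache is the sum of
>>     // ramBytesUsed() over the leaves that expose Accountable; this is
>>     // the number the doc values producer under-counts.
>>     static long readerWeightBytes(DirectoryReader reader) {
>>       long bytes = 0;
>>       for (LeafReaderContext ctx : reader.leaves()) {
>>         if (ctx.reader() instanceof Accountable) {
>>           bytes += ((Accountable) ctx.reader()).ramBytesUsed();
>>         }
>>       }
>>       return bytes;
>>     }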
>>
>> If we should not be using this, what would be the correct way to handle
>> this?
>>
>> Thank you
>>
>> -John
>>
>>
>> On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> A couple of major versions ago, Lucene required tons of heap memory to keep
>>> a reader open, e.g. norms were on heap and so on. To my knowledge, the only
>>> thing that is still kept in memory and scales with maxDoc is live docs; all
>>> other codec components require very little memory. I'm actually wondering
>>> whether we should remove memory accounting on readers. When Lucene used
>>> tons of memory we could focus on the main contributors to memory usage and
>>> be mostly correct. But now, given how little memory Lucene uses, it's quite
>>> hard to figure out what the main contributing factors to memory usage are.
>>> And it's probably not that useful either: why is it important to know how
>>> much memory something is using if it isn't much?
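>>>
>>> As an aside, one way to see what the codec components actually report is
>>> to dump the Accountable tree per leaf, something like this sketch:
>>>
>>>     import org.apache.lucene.index.DirectoryReader;
>>>     import org.apache.lucene.index.LeafReaderContext;
>>>     import org.apache.lucene.util.Accountable;
>>>     import org.apache.lucene.util.Accountables;
>>>
>>>     // Sketch: print each leaf's reported RAM breakdown to see what
>>>     // actually dominates the accounted memory.
>>>     static void printRamBreakdown(DirectoryReader reader) {
>>>       for (LeafReaderContext ctx : reader.leaves()) {
>>>         if (ctx.reader() instanceof Accountable) {
>>>           System.out.println(Accountables.toString((Accountable) ctx.reader()));
>>>         }
>>>       }
>>>     }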
>>>
>>> So I'd be curious to know more about your use case for reader caching.
>>> Would we break your use case if we removed memory accounting on readers?
>>> Given the lines you are pointing to, I assume you have either many fields
>>> or many segments if these maps are using lots of memory?
>>>
>>>
>>> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a reader cache that depends on the memory usage of each reader.
>>>> We found that the calculation of reader size for doc values undercounts.
>>>>
>>>> See line:
>>>>
>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>>>
>>>> It looks like the memory estimate only uses the shallow size of the
>>>> class and does not include the objects stored in the maps:
>>>>
>>>>
>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>>>
>>>> We made a local patch and saw a significant difference in reported size.
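>>>>
>>>> The idea is roughly the following (a sketch, not the exact diff; it adds
>>>> shallow sizes only, so it is still a lower bound):
>>>>
>>>>     import java.util.Map;
>>>>     import org.apache.lucene.util.RamUsageEstimator;
>>>>
>>>>     // Sketch: fold the per-field entry objects held in the producer's
>>>>     // maps into the estimate, instead of reporting only the producer's
>>>>     // own shallow size. Shallow sizes only, so the map's internal nodes
>>>>     // and the key Strings' char data are still not counted.
>>>>     static long deepSizeOfEntryMap(Map<String, ?> entries) {
>>>>       long bytes = RamUsageEstimator.shallowSizeOf(entries);
>>>>       for (Map.Entry<String, ?> e : entries.entrySet()) {
>>>>         bytes += RamUsageEstimator.shallowSizeOf(e.getKey());
>>>>         bytes += RamUsageEstimator.shallowSizeOf(e.getValue());
>>>>       }
>>>>       return bytes;
>>>>     }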
>>>>
>>>> Please let us know if this is the right thing to do; we are happy to
>>>> contribute our patch.
>>>>
>>>> Thanks
>>>>
>>>> -John
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>
