Re: Needs help reviewing on Lucene PostingsFormat memory improvement

Anh Dũng Bùi Thu, 08 Feb 2024 05:18:03 -0800

Thanks Mike for the reply!

> Read-time for Lucene90BlockTreePostingsFormat was already off-heap?  And
> your PR changes write-time to do so as well?


Yeah that's the idea. I changed just the Terms Writer to be off-heap.
Thanks, let's monitor it after the merge.

> Maybe building the synonyms FST (SynonymMap.Builder) would be a good
> place for off-heap writing too?

This is a good idea. I see there's one ongoing PR that tackles this
already: https://github.com/apache/lucene/pull/13054. I'm excited to see
the feature rolling out to different parts of Lucene.

> And this exciting PR <https://github.com/apache/lucene/pull/12688> (still
> a work in progress) would likely strongly benefit from streaming FST
> building, since its FSTs will be much larger than the Lucene90BlockTree
> since it stores all terms (not just the sampled prefix/index) in a single
> FST for the segment.

I can try to fork this PR and convert it to off-heap writing as well.

Regards,
Anh Dung Bui

On Thu, Feb 8, 2024 at 7:43 AM Michael McCandless <[email protected]>
wrote:

> Hi Anh Dũng Bùi,
>
> Thank you for tackling these and being so patient and persistent!
> Sorry for the delay.  I will try to review them soon.  The off-heap
> (streaming?) building of FSTs is really a massive improvement to Lucene,
> inspired by Tantivy's FST implementation:
> https://blog.burntsushi.net/transducers/
>
> Read-time for Lucene90BlockTreePostingsFormat was already off-heap?  And
> your PR changes write-time to do so as well?  This will reduce RAM pressure
> during indexing which is great.  And some Lucene usages generate incredibly
> large FSTs (I'm looking at you HathiTrust!). I don't think we need to
> explicitly measure any performance impact before merging, but let's watch
> the nightly benchy to see if there is any measurable impact?
>
> And, yes, Lucene90BlockTreePostingsFormat is the default.  You find the
> default codec from Codec.getDefault() and then trace downwards to all its
> sources.
>
> Maybe building the synonyms FST (SynonymMap.Builder) would be a good place
> for off-heap writing too?
>
> And this exciting PR <https://github.com/apache/lucene/pull/12688> (still
> a work in progress) would likely strongly benefit from streaming FST
> building, since its FSTs will be much larger than the Lucene90BlockTree
> since it stores all terms (not just the sampled prefix/index) in a single
> FST for the segment.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 1, 2024 at 10:40 PM Anh Dũng Bùi <[email protected]> wrote:
>
>> Hi Lucene devs!
>>
>> I have 2 PRs to optimize Lucene PostingsFormat
>> (Lucene90BlockTreePostingsFormat and FSTPostingsFormat) by utilizing a new
>> feature to stream the FST to IndexOutput directly, bypassing the on-heap
>> writing:
>> - https://github.com/apache/lucene/pull/12980
>> - https://github.com/apache/lucene/pull/12985
>>
>> It would be great if someone could help review them. I also have some
>> general questions:
>> - How do I measure the memory improvement impact in Lucene?
>> - Is Lucene90BlockTreePostingsFormat the main index format used in
>> Lucene? If not, what is the main format?
>> - Are there other places worth using the new streaming FST feature?
>>
>> Thank you!
>> Anh Dung Bui
>>
>
