I think there is a certain beauty (of tape-backed storage flavor...) in
existing abstractions and I wouldn't change them unless absolutely
necessary (FST construction isn't the dominant cost in indexing). Also,
random seeks all over the place may be really problematic in certain
scenarios (as is opening a written-to file for reading, as Robert
mentioned).

> Failing that, our plan B is to wastefully duplicate the byte[] slices
from the already written bytes into our own private (heap resident, boo)
copy, which would use quite a bit more RAM while building the FST, and make
less minimal FSTs for a given RAM budget.

Well, this node cache doesn't have to be on heap... It can be a plain
temporary file (with full random access). It's a scratch-only structure
which you can delete after the fst is written. It does add I/O overhead but
doesn't interfere with the rest of the code in Lucene. Perhaps, instead of
changing IndexInput and IndexOutput, one could start with a plain temp file
(NIO API)?

I also think that the tradeoffs presented in graphs on the fst-node-cache
issue are not so bad at all. Yes, the FST is not minimal, but the
construction-space vs output size is quite all right to me.

Dawid

On Thu, Oct 19, 2023 at 3:50 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hi Team,
>
> Today, Lucene's Directory abstraction does not allow opening an IndexInput
> on a file until the file is fully written and closed via IndexOutput.  We
> enforce this in tests, and some of our core Directory implementations
> demand this (e.g. caching the file's length on opening an IndexInput).
>
> Yet, most filesystems will easily allow simultaneous read/append of a
> single file.  We just don't expose this IO semantics to Lucene, but could
> we allow random-access reads with append-only writes on one file?  Is there
> a strong reason that we don't allow this?
>
> Quick TL/DR context: we are trying to enable FST compilation to write
> off-heap (directly to disk), enabling creating arbitrarily large FSTs with
> bounded heap, matching how FSTs can now be read off-heap, and it would be
> much much more RAM efficient if we could read/append the same file at once.
>
> Full gory details context: inspired by how Tantivy
> <https://github.com/quickwit-oss/tantivy> (awesome and fast Rust search
> engine!) writes its FSTs <https://blog.burntsushi.net/transducers/>, over
> in this issue <https://github.com/apache/lucene/issues/12543> and PR
> <https://github.com/dungba88/lucene/commit/882f5a5b1f60d4321d2e09986335063368c08e9b>,
> we (thank you Dzung Bui / @dungba88!) are trying to fix Lucene's FST
> building to immediately stream the FST to disk, instead of buffering the
> whole thing in RAM and then writing to disk.
>
> This would allow building arbitrarily large FSTs without using up heap,
> and symmetrically matches how we can now read FSTs off-heap, plus FST
> building is already (mostly) append-only. This would also allow removing
> some of the crazy abstractions we have for writing FST bytes into RAM
> (FSTStore, BytesStore).  It would enable interesting things like a Codec
> whose term dictionary is stored entirely in an FST
> <https://github.com/apache/lucene/pull/12688> (also inspired by Tantivy).
>
> The wrinkle is that, while the FST is building, it sometimes looks back
> and reads previously written bytes, to share suffixes and create a minimal
> (or near minimal) FST.  So if IndexInput could read those bytes, even as
> the FST is still appending to IndexOutput, it would "just work".
>
> Failing that, our plan B is to wastefully duplicate the byte[] slices from
> the already written bytes into our own private (heap resident, boo) copy,
> which would use quite a bit more RAM while building the FST, and make less
> minimal FSTs for a given RAM budget.  I haven't measured the added wasted
> RAM if we have to go this route but I fear it is sizable in practice, i.e.
> it strongly negates the whole idea of writing an FST off-heap since its
> effectively storing a possibly large portion of the FST in many duplicated
> byte[] fragments (in the NodeHash).
>
> So ... could we somehow relax Lucene's Directory semantics to allow
> opening an IndexInput on a still appending IndexOutput, since most
> filesystems are fine with this?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Reply via email to