I think there is a certain beauty (of tape-backed storage flavor...) in existing abstractions and I wouldn't change them unless absolutely necessary (FST construction isn't the dominant cost in indexing). Also, random seeks all over the place may be really problematic in certain scenarios (as is opening a written-to file for reading, as Robert mentioned).
> Failing that, our plan B is to wastefully duplicate the byte[] slices from the already written bytes into our own private (heap resident, boo) copy, which would use quite a bit more RAM while building the FST, and make less minimal FSTs for a given RAM budget. Well, this node cache doesn't have to be on heap... It can be a plain temporary file (with full random access). It's a scratch-only structure which you can delete after the fst is written. It does add I/O overhead but doesn't interfere with the rest of the code in Lucene. Perhaps, instead of changing IndexInput and IndexOutput, one could start with a plain temp file (NIO API)? I also think that the tradeoffs presented in graphs on the fst-node-cache issue are not so bad at all. Yes, the FST is not minimal, but the construction-space vs output size is quite all right to me. Dawid On Thu, Oct 19, 2023 at 3:50 PM Michael McCandless < luc...@mikemccandless.com> wrote: > Hi Team, > > Today, Lucene's Directory abstraction does not allow opening an IndexInput > on a file until the file is fully written and closed via IndexOutput. We > enforce this in tests, and some of our core Directory implementations > demand this (e.g. caching the file's length on opening an IndexInput). > > Yet, most filesystems will easily allow simultaneous read/append of a > single file. We just don't expose this IO semantics to Lucene, but could > we allow random-access reads with append-only writes on one file? Is there > a strong reason that we don't allow this? > > Quick TL/DR context: we are trying to enable FST compilation to write > off-heap (directly to disk), enabling creating arbitrarily large FSTs with > bounded heap, matching how FSTs can now be read off-heap, and it would be > much much more RAM efficient if we could read/append the same file at once. > > Full gory details context: inspired by how Tantivy > <https://github.com/quickwit-oss/tantivy> (awesome and fast Rust search > engine!) writes its FSTs <https://blog.burntsushi.net/transducers/>, over > in this issue <https://github.com/apache/lucene/issues/12543> and PR > <https://github.com/dungba88/lucene/commit/882f5a5b1f60d4321d2e09986335063368c08e9b>, > we (thank you Dzung Bui / @dungba88!) are trying to fix Lucene's FST > building to immediately stream the FST to disk, instead of buffering the > whole thing in RAM and then writing to disk. > > This would allow building arbitrarily large FSTs without using up heap, > and symmetrically matches how we can now read FSTs off-heap, plus FST > building is already (mostly) append-only. This would also allow removing > some of the crazy abstractions we have for writing FST bytes into RAM > (FSTStore, BytesStore). It would enable interesting things like a Codec > whose term dictionary is stored entirely in an FST > <https://github.com/apache/lucene/pull/12688> (also inspired by Tantivy). > > The wrinkle is that, while the FST is building, it sometimes looks back > and reads previously written bytes, to share suffixes and create a minimal > (or near minimal) FST. So if IndexInput could read those bytes, even as > the FST is still appending to IndexOutput, it would "just work". > > Failing that, our plan B is to wastefully duplicate the byte[] slices from > the already written bytes into our own private (heap resident, boo) copy, > which would use quite a bit more RAM while building the FST, and make less > minimal FSTs for a given RAM budget. I haven't measured the added wasted > RAM if we have to go this route but I fear it is sizable in practice, i.e. > it strongly negates the whole idea of writing an FST off-heap since its > effectively storing a possibly large portion of the FST in many duplicated > byte[] fragments (in the NodeHash). > > So ... could we somehow relax Lucene's Directory semantics to allow > opening an IndexInput on a still appending IndexOutput, since most > filesystems are fine with this? > > Mike McCandless > > http://blog.mikemccandless.com >