what will happen on windows? sorry, could not resist.
On Thu, Oct 19, 2023 at 9:48 AM Michael McCandless <luc...@mikemccandless.com> wrote: > > Hi Team, > > Today, Lucene's Directory abstraction does not allow opening an IndexInput on > a file until the file is fully written and closed via IndexOutput. We > enforce this in tests, and some of our core Directory implementations demand > this (e.g. caching the file's length on opening an IndexInput). > > Yet, most filesystems will easily allow simultaneous read/append of a single > file. We just don't expose this IO semantics to Lucene, but could we allow > random-access reads with append-only writes on one file? Is there a strong > reason that we don't allow this? > > Quick TL/DR context: we are trying to enable FST compilation to write > off-heap (directly to disk), enabling creating arbitrarily large FSTs with > bounded heap, matching how FSTs can now be read off-heap, and it would be > much much more RAM efficient if we could read/append the same file at once. > > Full gory details context: inspired by how Tantivy (awesome and fast Rust > search engine!) writes its FSTs, over in this issue and PR, we (thank you > Dzung Bui / @dungba88!) are trying to fix Lucene's FST building to > immediately stream the FST to disk, instead of buffering the whole thing in > RAM and then writing to disk. > > This would allow building arbitrarily large FSTs without using up heap, and > symmetrically matches how we can now read FSTs off-heap, plus FST building is > already (mostly) append-only. This would also allow removing some of the > crazy abstractions we have for writing FST bytes into RAM (FSTStore, > BytesStore). It would enable interesting things like a Codec whose term > dictionary is stored entirely in an FST (also inspired by Tantivy). > > The wrinkle is that, while the FST is building, it sometimes looks back and > reads previously written bytes, to share suffixes and create a minimal (or > near minimal) FST. So if IndexInput could read those bytes, even as the FST > is still appending to IndexOutput, it would "just work". > > Failing that, our plan B is to wastefully duplicate the byte[] slices from > the already written bytes into our own private (heap resident, boo) copy, > which would use quite a bit more RAM while building the FST, and make less > minimal FSTs for a given RAM budget. I haven't measured the added wasted RAM > if we have to go this route but I fear it is sizable in practice, i.e. it > strongly negates the whole idea of writing an FST off-heap since its > effectively storing a possibly large portion of the FST in many duplicated > byte[] fragments (in the NodeHash). > > So ... could we somehow relax Lucene's Directory semantics to allow opening > an IndexInput on a still appending IndexOutput, since most filesystems are > fine with this? > > Mike McCandless > > http://blog.mikemccandless.com --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org