Re: Could we allow an IndexInput to read from a still writing IndexOutput?

Robert Muir Thu, 19 Oct 2023 11:17:40 -0700

what will happen on windows?

sorry, could not resist.


On Thu, Oct 19, 2023 at 9:48 AM Michael McCandless
<luc...@mikemccandless.com> wrote:
>
> Hi Team,
>
> Today, Lucene's Directory abstraction does not allow opening an IndexInput on 
> a file until the file is fully written and closed via IndexOutput.  We 
> enforce this in tests, and some of our core Directory implementations demand 
> this (e.g. caching the file's length on opening an IndexInput).
>
> Yet, most filesystems will easily allow simultaneous read/append of a single 
> file.  We just don't expose this IO semantics to Lucene, but could we allow 
> random-access reads with append-only writes on one file?  Is there a strong 
> reason that we don't allow this?
>
> Quick TL/DR context: we are trying to enable FST compilation to write 
> off-heap (directly to disk), enabling creating arbitrarily large FSTs with 
> bounded heap, matching how FSTs can now be read off-heap, and it would be 
> much much more RAM efficient if we could read/append the same file at once.
>
> Full gory details context: inspired by how Tantivy (awesome and fast Rust 
> search engine!) writes its FSTs, over in this issue and PR, we (thank you 
> Dzung Bui / @dungba88!) are trying to fix Lucene's FST building to 
> immediately stream the FST to disk, instead of buffering the whole thing in 
> RAM and then writing to disk.
>
> This would allow building arbitrarily large FSTs without using up heap, and 
> symmetrically matches how we can now read FSTs off-heap, plus FST building is 
> already (mostly) append-only. This would also allow removing some of the 
> crazy abstractions we have for writing FST bytes into RAM (FSTStore, 
> BytesStore).  It would enable interesting things like a Codec whose term 
> dictionary is stored entirely in an FST (also inspired by Tantivy).
>
> The wrinkle is that, while the FST is building, it sometimes looks back and 
> reads previously written bytes, to share suffixes and create a minimal (or 
> near minimal) FST.  So if IndexInput could read those bytes, even as the FST 
> is still appending to IndexOutput, it would "just work".
>
> Failing that, our plan B is to wastefully duplicate the byte[] slices from 
> the already written bytes into our own private (heap resident, boo) copy, 
> which would use quite a bit more RAM while building the FST, and make less 
> minimal FSTs for a given RAM budget.  I haven't measured the added wasted RAM 
> if we have to go this route but I fear it is sizable in practice, i.e. it 
> strongly negates the whole idea of writing an FST off-heap since its 
> effectively storing a possibly large portion of the FST in many duplicated 
> byte[] fragments (in the NodeHash).
>
> So ... could we somehow relax Lucene's Directory semantics to allow opening 
> an IndexInput on a still appending IndexOutput, since most filesystems are 
> fine with this?
>
> Mike McCandless
>
> http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Could we allow an IndexInput to read from a still writing IndexOutput?

Reply via email to