On Mon, Mar 16, 2009 at 05:51:22AM -0400, Michael McCandless wrote:
> Remind me again how KS/Lucy does the deletes? It must buffer if
> you've buffered add docs? Or is it not possible to delete a recently
> added doc?
The present implementation in KS proceeds commit by commit. You can't delete
a doc that hasn't been committed yet.
Lucy need not adopt that model; we could go the buffering route. However,
that path has implementation drawbacks. A common technique when you are
either adding or updating but don't know which is to always delete based on a
primary key field, then add:
while ( my $doc = give_me_another_doc() ) {
$indexer->delete_by_term( field => 'key', term => $doc->{key} );
$indexer->add_doc($doc);
}
If you use an app with that loop to add a very large number of docs, the
buffered deletes will take up a lot of space in RAM.
I think in Lucene, you flush more often to a new segment. In KS that only
happens once per indexing session. Lucy is ostensibly supposed to model its
internals after the KS design because of the OO overhead streamlining, so we'd
have to find a way to address that issue.
> > Claiming a new segment directory and committing a new master file
> > (segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
> > synchronization: if those ops fail because your process lost out in
> > the race condition, you just retry. The only time we have true
> > synchronization requirements is during merging.
>
> Wouldn't you need to synchronize here? If two writers try to suddenly
> write to the same snapshot_XXX.json, is that really safe (ie, one
> always wins, and the other always loses and can reliably detect that
> it lost)?
You're right; I hadn't thought things through well enough. There's no C API
for a rename() op with the semantics of POSIX open() with O_EXCL. No big
deal, though -- we can use a commit.lock file for synchronizing those two ops.
> I think this (reader ORs the multiple deleted docs) would be workable;
> though it makes realtime search challenging.
Yeah. I think if we cap the deletion percentage at 10% and merge any segment
that exceeds that rate, plus apply deletions at the Scorer_Collect level,
tombstones won't absolutely murder us. But there's some work to do there.
> > > Ugh, lock starvation. Really the OS should provide a FIFO lock
> > > queue of some sort.
Heh, I think I just figured out how to solve the lock starvation issue. :)
When the consolidator completes, it will attempt once to acquire write.lock.
If it fails, it will leave behind the consolidate.lock file, plus write a
"consolidate.done" file containing the name of the newly added segment and
maybe some other info, then exit.
Whenever one of the write processes finishes, it will look for
"consolidate.done". If it's found, the write process will pick up where the
consolidator left off, finishing all merge-related operations and then
releasing consolidate.lock.
> How, in general, are you approaching threads with Lucy? It seems
> a shame to forego threads entirely.
We've never formally addressed the issue of threads. The approach I've taken
so far with KS is to try to keep my options open. I've tried to keep most
classes thread-safe when it's not too much work. I think the overall
architecture should still be compatible with threads. The only gaping holes
are that KS lacks a lock-free hash for storing VTables and perhaps interned
strings, and that the reference-counting GC model might be a synchronization
bottleneck.
However, I'm of two minds about this. Threads are a PITA, and thanks to mmap
and the OS cache, we might be able to use OS processes as our exclusive
concurrency model. Launching any number of search processes won't be a
problem, and it looks like we might be able to pull off multiple write
processes as well.
There's a theoretical search-time performance issue on the horizon relating to
threads: it would be nice to have a large index divided up into several
segments of equal size, allowing us to dedicate one thread to each segment and
finish a single search faster. But even that is theoretically achievable
using multiple processes under a MultiSearcher.
It would be nice, though, if we could make Lucy thread-safe as a library, so
that we don't break other people's threading models. That way, you could,
say, run several indexers and one consolidator within a single multi-threaded
app -- realizing the benefit of threads even if no individual Lucy object
knows how to exploit them.
> > PS: FYI, your messages today have premature line-wrapping issues --
> > your original text, not just the quotes.
>
> GRRR. Thanks for pointing it out. Why can't everything just
> work? Is this email better?
Yes, it was perfect. Thanks. :)
Marvin Humphrey