Marvin Humphrey wrote:

On Fri, Mar 13, 2009 at 12:48:11PM -0700, Nathan Kurz wrote:
[moved from private to lucy-dev in case others are interested]
KS has a slow best-case write speed, but I'm not worried about that. The problem the Lucene folks and I are trying to address under the topic heading of "real-time indexing" is *worst-case* write speed: most of the time you're fine, but every once in a while you trigger a large merge and
you wait a loooooong time.  That problem has a different solution.

What is the actual case that would trigger such a problem?

The "big merge" problem is easy to trigger: repeatedly open an indexer, add a small number of documents, and close it. Each time you do that, a new segment is added, and search-time performance degrades. Every once in a while, the
segment merging algorithm decides that it needs to perform a big
consolidation, and you have to wait while it finishes.
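To make that concrete, here's roughly what such a workload looks like -- a sketch only, written against the Lucy::Index::Indexer interface and assuming a $schema that's already been built; treat the paths and parameters as illustrative:

 use Lucy::Index::Indexer;

 # Each pass opens an indexer, adds a document, and commits, leaving a
 # new small segment behind every time.
 for my $i ( 1 .. 1_000 ) {
   my $indexer = Lucy::Index::Indexer->new(
     index  => '/path/to/index',
     schema => $schema,                  # assumed to be built already
     create => ( $i == 1 ? 1 : 0 ),
   );
   $indexer->add_doc( { content => "document number $i" } );
   $indexer->commit;   # usually quick -- but every so often this call
                       # kicks off the big consolidation and blocks
 }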

Solving the big-merge problem requires setting up a background consolidation routine that allows new segments to be written while it's running. We'll need to make some mods to write locking, commit locking, segment naming, snapshot file naming, and merging policies; I'm currently working on an IndexManager
class which bundles all of those concerns together.

Lucene does background merging already using a single process with multiple threads. I'm thinking that we'd try a background process approach instead, so that additional write processes could be launched which add new segments while
the background merger churns.
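As a rough sketch of how that might look from the application side -- the class below is a made-up stand-in, not anything that exists in KS today -- the consolidator would run as its own process while ordinary writer processes keep committing new segments:

 # Hypothetical sketch only: MyBackgroundMerger is a placeholder name.
 package MyBackgroundMerger;
 sub new         { my ( $class, %args ) = @_; return bless {%args}, $class }
 sub consolidate { }   # would merge existing segments without the write lock
 sub commit      { }   # would take the commit lock only briefly, to publish

 package main;
 my $merger = MyBackgroundMerger->new( index => '/path/to/index' );
 $merger->consolidate;   # long-running; writer processes keep adding segments
 $merger->commit;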

Even w/ background merging, which allows new segments to be written &
reopened in a reader even while the big merge is running in the BG,
Lucene still has the challenge of warming a reader on the [large]
newly merged segment before using the reader "for real".  Such warming
cannot block newly added segments, either.

I think for Lucene the solution is simple: don't allow the merge to be
committed until a reader has been warmed on that segment (assuming
you've opened a reader from IndexWriter).

We also have a second major problem: the amount of time it takes to load certain caches in IndexReader from disk into process memory -- especially sort caches. This one we fix by 1) designing index file formats that can be mmap'd, and 2) implementing "segment-centric sorted search", which uses per-segment mmap'd caches rather than a per-index cache that has to be rebuilt from scratch in process memory each time the index is modified.
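The effect on reopen cost is easy to sketch. The snippet below is illustrative only (load_sort_cache is a made-up stand-in for mmap'ing the real per-segment cache file): when the index is reopened, segments that haven't changed keep their caches, and only new segments pay a warming cost.

 my %sort_cache;   # segment name => per-segment sort cache

 sub load_sort_cache { return "cache for $_[0]" }   # stand-in for an mmap

 sub reopen_sort_caches {
   my @seg_names = @_;
   my %live = map { $_ => 1 } @seg_names;
   # Warm only segments we haven't seen before; old caches are reused.
   $sort_cache{$_} ||= load_sort_cache($_) for @seg_names;
   # Drop caches for segments a merge has retired.
   delete $sort_cache{$_} for grep { !$live{$_} } keys %sort_cache;
 }

 reopen_sort_caches(qw( seg_1 seg_2 ));
 reopen_sort_caches(qw( seg_1 seg_2 seg_3 ));   # only seg_3 gets warmed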

Lucene (trunk) now does segment-centric sort caches, but since we
still load the full FieldCache into RAM (slowly -- by uninverting
the field), warming remains costly.  But as long as we "background"
the warming, it's relatively OK.

Lastly, we have a problem regarding writing out deletions, minor in comparison to the other two but still worth thinking over: the size of the deletions files we write out scales with the size of the index rather than the number of deletions we make. If you delete a single document, but that document happens to reside in a segment with 1 million documents, we have to write out a new 1 MB bit
vector file.
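A quick back-of-the-envelope sketch of the cost (the segment size below is just an illustrative figure):

 my $seg_doc_count = 8_000_000;              # a big segment, for illustration
 my $bits = "\0" x ( $seg_doc_count / 8 );   # one bit per doc, deleted or not
 vec( $bits, 123_456, 1 ) = 1;               # mark a single deletion
 print length($bits), " bytes to rewrite for one deletion\n";   # 1000000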

I thought we had that licked with the "tombstones" approach, but the
performance of iterated deletions turns out to be worse than expected, even before we get a priority queue involved to merge multiple tombstone rows.
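The gist of the problem, sketched in miniature: with tombstones, answering "which docs are deleted?" means merging several sorted tombstone rows (a real implementation would use a heap rather than the linear scan below), whereas a bit vector answers the same question with a single vec() lookup.

 use List::Util qw( min );

 # Three tombstone rows, each a sorted list of deleted doc nums.
 my @tombstone_rows = ( [ 3, 17, 90 ], [ 17, 42 ], [ 7 ] );

 sub next_deletion {
   my @heads = grep { defined } map { $_->[0] } @tombstone_rows;
   return undef unless @heads;
   my $min = min @heads;
   # Advance every row whose head matches, so duplicates collapse.
   shift @$_ for grep { defined $_->[0] && $_->[0] == $min } @tombstone_rows;
   return $min;
 }

 while ( defined( my $doc = next_deletion() ) ) {
   print "deleted: $doc\n";   # 3, 7, 17, 42, 90
 }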

With Lucene we intend to carry the deletions in RAM, not through disk;
this means we can leave our on-disk format as the "relatively"
inefficient write-the-whole-vector-out-each-time approach we have
now.

Still, we have the same challenge in RAM: how to make an efficient
transactional bit set so that if there are N readers all opened after
different add/deletes, we don't need a full 1 MB bit set for each of
them.  We need an incremental copy-on-write solution (eg only the
"page" that's change gets copied when a new deletion arrives).  We
need this for changes to norms too.  Some discussions have happened
here:

  https://issues.apache.org/jira/browse/LUCENE-1526
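A toy illustration of the copy-on-write idea (the page size and data structure here are mine, not anything from LUCENE-1526): the deletions bits are chunked into pages, and recording a new deletion clones only the page it lands in, so readers opened at different times share every untouched page.

 use constant PAGE_BITS => 8192;   # illustrative page size

 sub clone_with_deletion {
   my ( $pages, $doc_num ) = @_;   # $pages is an arrayref of string refs
   my @new_pages = @$pages;        # copies the refs, not the page data
   my $page_num  = int( $doc_num / PAGE_BITS );
   my $copy      = ${ $new_pages[$page_num] };   # duplicate one page...
   vec( $copy, $doc_num % PAGE_BITS, 1 ) = 1;    # ...and flip one bit
   $new_pages[$page_num] = \$copy;
   return \@new_pages;
 }

 my $empty = "\0" x ( PAGE_BITS / 8 );
 my $v0 = [ map { \( my $p = $empty ) } 1 .. 4 ];   # 4 pages = 32768 docs
 my $v1 = clone_with_deletion( $v0, 10_000 );       # shares 3 of 4 pages with $v0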

Would the equivalent of row-level-locking allow you to
modify the index files in real time, instead of adding addenda and
tombstones?   I'm not necessarily suggesting that the default index
format should do this, but that it might be worth considering whether
a proposed API would support such a real-time format.

I absolutely agree that deletions support should be pluggable.

How does this interface look?

 package MyDelWriter;
 use base qw( Lucy::Index::DeletionsWriter );

 ...

 package MyDelReader;
 use base qw( Lucy::Index::DeletionsReader );

 ...

 package MyArchitecture;
 use base qw( Lucy::Architecture );

 sub make_deletions_writer {
   shift;    # discard the invocant, pass remaining args through
   return MyDelWriter->new(@_);
 }

 sub make_deletions_reader {
   shift;    # discard the invocant, pass remaining args through
   return MyDelReader->new(@_);
 }

 package MySchema;
 use base qw( Lucy::Schema );

 sub make_architecture {
   return MyArchitecture->new;
 }

That's the interface I've worked into KS over the last few weeks.

The default deletions implementation uses BitVecDelWriter/BitVecDelReader. It behaves slightly differently from Lucene and old KS; new bit vector files belong to new segments but reference old ones (so "seg_8/deletions-seg_2.bv"
contains deletions which should be applied against segment 2).
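For concreteness, a snapshot under that naming scheme might look something like this (segment numbers are illustrative):

 seg_2/                          # an older segment's data
 seg_5/deletions-seg_2.bv        # deletions against seg_2, recorded by segment 5
 seg_8/deletions-seg_2.bv        # deletions against seg_2, recorded by segment 8
 seg_8/deletions-seg_5.bv        # deletions against seg_5, recorded by segment 8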

But then does deletions-seg_2.bv contain all deletes for seg_2? In which case this is just like the "generation" Lucene increments & tacks on when it saves a del; just a different naming scheme. Or is it somehow incremental?

Mike
