On 11/9/09 9:00 AM, Michael McCandless wrote:
Alas, I don't have any benchmarks offhand... if you want to run one,
you should be able to hardwire flushDocStores=true in
IndexWriter.doFlushInternal?  I think that'd turn off the sharing
without breaking things (run the tests to be sure ;) ).


Yes, I'm pretty sure that works. I think I've even done that in the LUCENE-1879 patch (which works with Lucene 2.4).
Btw: I'm not trying to say it's
required to remove them for parallel indexing. It'd just be simpler
without them. You can think about a segmented parallel index as a matrix of
segments. And about the shared doc stores as merging multiple cells in a
single row or column of a spreadsheet. It'd be a bit easier if that wasn't
possible and it always was a true matrix.
I agree, not sharing the stores would make things simpler.  Wouldn't
the parallel indexes be able to "privately" share their own stores?
Ie, how the sharing happens need not be in sync across the main &
parallel indexes?


I think that should be ok with parallel indexing, as long as we can always select all corresponding segments from *all* parallel indexes for a merge to keep the docIds in sync.

That actually leads me to another question: Let's say you have three segments a, b, c. b and c share the same doc store. You perform deletes on a and b. Then you call expungeDeletes(). Normally that call should only merge a and b, because c doesn't have any deletes. But b and c have to participate in the same merge, because they share the same doc store, right? So would it merge all three segments?
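To make the scenario concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class, the method and the store names are made up for illustration, and it simply assumes the premise of my question, namely that segments sharing a doc store must always be merged together.

```java
import java.util.*;

// Hypothetical model, not Lucene code: each segment name maps to the id of
// the doc store it writes into. Segments with equal ids share a doc store.
class SharedStoreMergeSketch {

    // Returns the segments that would have to take part in the merge IF
    // segments sharing a doc store must always be merged together
    // (an assumption, not confirmed behavior).
    static Set<String> expandSelection(Map<String, String> docStoreOf,
                                       Set<String> withDeletes) {
        // Collect the doc stores touched by the segments that have deletes.
        Set<String> stores = new HashSet<>();
        for (String seg : withDeletes) {
            stores.add(docStoreOf.get(seg));
        }
        // Pull in every segment that writes into one of those stores.
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : docStoreOf.entrySet()) {
            if (stores.contains(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docStoreOf = new LinkedHashMap<>();
        docStoreOf.put("a", "store0"); // a has its own doc store
        docStoreOf.put("b", "store1"); // b and c share store1
        docStoreOf.put("c", "store1");

        Set<String> withDeletes = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(expandSelection(docStoreOf, withDeletes)); // [a, b, c]
    }
}
```

Under that assumption c gets dragged into the merge although it has no deletes.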

If that's the case (that b and c must be part of the same merge) then it would make parallel indexing more difficult. The reason is that if two parallel indexes 1 and 2 can each decide on their own how to share e.g. doc stores across segments, then we might end up in a situation where 1a and 1b share the same doc store, and 2b and 2c share the same doc store. Then if index 1 needs to merge 1a and 1b, it can't assume that this merge is allowed. There would have to be someone on top of the whole thing who decides that all three segments need to be merged at the same time, because b is connected to a and c across the two parallel indexes. I wouldn't like such a restriction very much.
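The "someone on top" would basically have to compute a closure over the sharing groups of all parallel indexes. A rough sketch of that, again in plain Java with made-up names and under the same assumption as above (segments sharing a doc store must merge together):

```java
import java.util.*;

// Hypothetical model, not Lucene code. Segment positions (a, b, c) are the
// same in all parallel indexes; each index groups positions by shared doc
// store.
class ParallelMergeClosureSketch {

    // Starting from the positions one index wants to merge, keep adding every
    // position that shares a doc store with a selected one in ANY parallel
    // index, until nothing changes.
    static Set<String> mergeClosure(List<List<Set<String>>> sharingPerIndex,
                                    Set<String> wanted) {
        Set<String> selected = new TreeSet<>(wanted);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (List<Set<String>> index : sharingPerIndex) {
                for (Set<String> group : index) {
                    // A group that overlaps the selection is pulled in whole.
                    if (!Collections.disjoint(group, selected)
                            && selected.addAll(group)) {
                        changed = true;
                    }
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Index 1: 1a and 1b share a doc store. Index 2: 2b and 2c share one.
        Set<String> ab = new HashSet<>(Arrays.asList("a", "b"));
        Set<String> bc = new HashSet<>(Arrays.asList("b", "c"));
        List<List<Set<String>>> sharing = new ArrayList<>();
        sharing.add(Collections.singletonList(ab));
        sharing.add(Collections.singletonList(bc));

        Set<String> wanted = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(mergeClosure(sharing, wanted)); // [a, b, c]
    }
}
```

Index 1 only asked for a and b, but the sharing in index 2 forces c into the merge as well, and with longer chains of sharing the closure could grow to cover most of the index.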

We could think about allowing merges like ab->d, even if b,c share the same doc store. That would mean copying the b part of the shared bc doc store into the new segment d. Then, until c gets deleted, the stored docs of b would be on disk twice and temporarily require more disk space.
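A back-of-the-envelope sketch of the extra disk space, in plain Java. The byte counts are invented purely for illustration, and the model assumes that d only receives the live (non-deleted) stored docs of a and b and that a's old doc store is removed once the merge is done:

```java
// Hypothetical arithmetic only, not Lucene code. All sizes are made-up bytes.
class DocStoreCopySketch {

    // Stored-doc bytes on disk before the merge: a's own store plus the
    // shared bc store.
    static long bytesBefore(long storeA, long bPart, long cPart) {
        return storeA + bPart + cPart;
    }

    // Stored-doc bytes right after ab->d: d has a private copy of the live
    // docs of a and b, and the shared bc store must stay because c still
    // references it.
    static long bytesAfter(long liveA, long liveB, long bPart, long cPart) {
        long storeD = liveA + liveB;
        return storeD + bPart + cPart;
    }

    public static void main(String[] args) {
        long storeA = 100, liveA = 80; // a: 100 bytes stored, 80 still live
        long bPart = 60, liveB = 50;   // b's slice of the shared bc store
        long cPart = 70;               // c's slice of the shared bc store

        System.out.println("before=" + bytesBefore(storeA, bPart, cPart));
        System.out.println("after=" + bytesAfter(liveA, liveB, bPart, cPart));
        // The live docs of b exist twice (in d and in bc) until bc can go.
        System.out.println("duplicated=" + liveB);
    }
}
```

With these numbers the stored docs grow from 230 to 260 bytes after the merge, and the 50 live bytes of b are held twice until the bc store can be dropped.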

Well maybe there is already a solution for all this in the code and I'm just not aware of it?

 Michael


Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



