On 11/9/09 9:00 AM, Michael McCandless wrote:
Alas, I don't have any benchmarks offhand... if you want to run one,
you should be able to hardwire flushDocStores=true in
IndexWriter.doFlushInternal? I think that'd turn off the sharing
without breaking things (run the tests to be sure ;) ).
Yes, I'm pretty sure that works. I think I've even done that in the
LUCENE-1879 patch (which works with Lucene 2.4).
Btw: I'm not trying to say it's
required to remove them for parallel indexing. It'd just be simpler
without them. You can think of a segmented parallel index as a matrix of
segments, and of the shared doc stores as merging multiple cells in a
single row or column of a spreadsheet. It'd be a bit easier if that weren't
possible and it was always a true matrix.
I agree, not sharing the stores would make things simpler. Wouldn't
the parallel indexes be able to "privately" share their own stores?
Ie, how the sharing happens need not be in sync across the main &
parallel indexes?
I think that should be ok with parallel indexing, as long as we can
always select all corresponding segments from *all* parallel indexes for
a merge to keep the docIds in sync.
That actually leads me to another question: Let's say you have three
segments a, b, c. b and c share the same doc store. You perform deletes
on a and b. Then you call expungeDeletes(). Normally that call should
only merge a and b, because c doesn't have any deletes. But b and c have
to participate in the same merge, because they share the same doc store,
right? So would it merge all three segments?
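
To make the question concrete, here is a minimal sketch of the constraint as posed above. It is a toy model, not Lucene code: the class, the mergeSet method and the store names are all hypothetical, and it simply assumes that segments sharing a doc store must take part in the same merge. Whether IndexWriter really behaves this way is exactly what is being asked.

```java
import java.util.*;

// Toy model only: none of these names exist in Lucene.
public class SharedStoreMergeSketch {

    // Returns the segments that would end up in the merge if every segment
    // sharing a doc store with a segment that has deletes had to join it.
    static SortedSet<String> mergeSet(Map<String, String> docStoreOf,
                                      Set<String> withDeletes) {
        // Collect the doc stores touched by segments that have deletes.
        Set<String> touchedStores = new HashSet<>();
        for (String seg : withDeletes) {
            touchedStores.add(docStoreOf.get(seg));
        }
        // Pull in every segment that writes to one of those stores.
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : docStoreOf.entrySet()) {
            if (touchedStores.contains(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docStoreOf = new LinkedHashMap<>();
        docStoreOf.put("a", "store_a");   // a has its own doc store
        docStoreOf.put("b", "store_bc");  // b and c share one
        docStoreOf.put("c", "store_bc");

        Set<String> withDeletes = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(mergeSet(docStoreOf, withDeletes)); // [a, b, c]
    }
}
```

Under that assumption c is dragged into the merge even though it has no deletes.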
If that's the case (that b and c must be part of the same merge) then it
would make the parallel indexing more difficult. The reason is that if
two parallel indexes 1 and 2 can decide on their own how to share e.g.
doc stores across segments, then we might come into a situation where 1a
and 1b share the same doc store, and 2b and 2c share the same doc store.
Then if index 1 needs to merge 1a and 1b, it can't assume that this
merge is allowed. There would have to be someone on top of the whole
thing who decides that all three segments need to be merged at the same
time, because b is connected to a and c in the two parallel indexes. I
wouldn't like such a restriction very much.
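
The "someone on top" would essentially have to compute connected components over the sharing relations of all parallel indexes. A rough sketch of that idea with a small union-find follows. Everything in it is hypothetical (class name, method names, and the extra position d added for contrast), and it assumes both that shared doc stores force a joint merge and that corresponding segments of all parallel indexes must be merged together.

```java
import java.util.*;

// Toy model only: not part of Lucene or the LUCENE-1879 patch.
public class ParallelMergeClosureSketch {
    private final Map<String, String> parent = new HashMap<>();

    // Find the representative of x, compressing the path on the way back.
    String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(p);
            parent.put(x, p);
        }
        return p;
    }

    void union(String x, String y) {
        parent.put(find(x), find(y));
    }

    // sharedGroups: one list per shared doc store, in any parallel index,
    // naming the segment positions (a, b, c, ...) that share it.
    static SortedSet<String> mustMergeWith(String position,
                                           List<List<String>> sharedGroups,
                                           List<String> allPositions) {
        ParallelMergeClosureSketch uf = new ParallelMergeClosureSketch();
        for (List<String> group : sharedGroups) {
            for (int i = 1; i < group.size(); i++) {
                uf.union(group.get(0), group.get(i));
            }
        }
        String root = uf.find(position);
        SortedSet<String> result = new TreeSet<>();
        for (String p : allPositions) {
            if (uf.find(p).equals(root)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> sharedGroups = Arrays.asList(
                Arrays.asList("a", "b"),   // index 1: 1a and 1b share a doc store
                Arrays.asList("b", "c"));  // index 2: 2b and 2c share a doc store
        List<String> positions = Arrays.asList("a", "b", "c", "d");

        System.out.println(mustMergeWith("a", sharedGroups, positions)); // [a, b, c]
        System.out.println(mustMergeWith("d", sharedGroups, positions)); // [d]
    }
}
```

Neither index on its own links a to c; the link only appears once both sharing relations are combined, which is why no single index could make the decision locally.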
We could think about allowing merges like ab->d, even if b and c share the
same doc store. That would mean copying the b part of the shared bc doc
store into the new segment d. Then, until c gets deleted, the stored docs
of b would be on disk twice and temporarily require more disk space.
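
As a back-of-the-envelope illustration of that temporary overhead, here is a tiny sketch with made-up sizes. The class, the method names and the numbers are hypothetical, and it assumes a's private doc store is removed once d is written, while the shared bc store has to stay for as long as c references it.

```java
// Toy arithmetic only: sizes are arbitrary units, names are hypothetical.
public class DocStoreCopySketch {

    // Disk used by stored docs before the merge: a's store plus the shared bc store.
    static long before(long aPart, long bPart, long cPart) {
        return aPart + (bPart + cPart);
    }

    // After ab->d: d holds private copies of a's and b's stored docs, and
    // the shared bc store is still fully on disk because c needs it.
    static long afterMergeWhileCExists(long aPart, long bPart, long cPart) {
        return (aPart + bPart) + (bPart + cPart);
    }

    public static void main(String[] args) {
        long a = 100, b = 40, c = 60;
        long before = before(a, b, c);
        long during = afterMergeWhileCExists(a, b, c);
        System.out.println(before);          // 200
        System.out.println(during);          // 240
        System.out.println(during - before); // 40, i.e. the size of b's part
    }
}
```

The extra space is exactly the size of b's stored docs, and it goes away once the bc store is no longer referenced.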
Well, maybe there is already a solution for all this in the code and I'm
just not aware of it?
Michael
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org