[ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12428478 ]
Ning Li commented on LUCENE-528:
--------------------------------
In an email thread titled "LUCENE-528 and 565", I described a weakness of the
proposed solution:
"I'm totally for a version of addIndexes() where optimize() is not always
called. However, with the one proposed in the patch, we could end up with an
index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 has 8000, etc.
while Lucene desires the reverse. Or we could have a sandwich index where:
segment 0 has 4000 docs, 1 has 100, 2 has 100, 3 has 4000. While neither of
these will occur if you use addIndexesNoOpt() carefully, there should be a more
robust merge policy."
Here is an alternative solution which merges segments so that the docCount of
segment i is at least twice as big as the docCount of segment i+1. If we are
willing to make it a bit more complicated, we can take the merge factor into
consideration.
public synchronized void addIndexesNoOpt(Directory[] dirs) throws IOException
{
  for (int i = 0; i < dirs.length; i++) {
    SegmentInfos sis = new SegmentInfos(); // read infos from dir
    sis.read(dirs[i]);
    for (int j = 0; j < sis.size(); j++) {
      segmentInfos.addElement(sis.info(j)); // add each info
    }
  }

  int start = 0;
  int docCountFromStart = docCount();

  while (start < segmentInfos.size()) {
    int end;
    int docCountToMerge = 0;

    if (docCountFromStart <= minMergeDocs) {
      // if the total docCount of the remaining segments
      // is lte minMergeDocs, merge all of them
      end = segmentInfos.size() - 1;
      docCountToMerge = docCountFromStart;
    }
    else {
      // otherwise, merge some segments so that the docCount
      // of these segments is at least half of the remaining
      for (end = start; end < segmentInfos.size(); end++) {
        docCountToMerge += segmentInfos.info(end).docCount;
        if (docCountToMerge >= docCountFromStart / 2) {
          break;
        }
      }
    }

    mergeSegments(start, end + 1);
    start++;
    docCountFromStart -= docCountToMerge;
  }
}
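
To make it easier to try the loop above on specific segment layouts, here is
a rough standalone sketch that simulates it on plain doc counts instead of a
real index. The class name MergePolicySketch and the method simulate() are
made up for illustration and are not part of Lucene or of the patch. It
assumes that mergeSegments(start, end + 1) replaces segments start..end with
a single segment whose docCount is the sum of the merged ones (deletions are
ignored).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Lucene code: simulates the merge loop of
// addIndexesNoOpt() on a list of per-segment doc counts.
final class MergePolicySketch {

  static List<Integer> simulate(List<Integer> docCounts, int minMergeDocs) {
    List<Integer> segments = new ArrayList<>(docCounts);

    int start = 0;
    int docCountFromStart = 0;
    for (int count : segments) {
      docCountFromStart += count; // stands in for docCount()
    }

    while (start < segments.size()) {
      int end;
      int docCountToMerge = 0;

      if (docCountFromStart <= minMergeDocs) {
        // the remaining segments are small in total: merge all of them
        end = segments.size() - 1;
        docCountToMerge = docCountFromStart;
      } else {
        // merge segments until they hold at least half of the remaining docs
        for (end = start; end < segments.size(); end++) {
          docCountToMerge += segments.get(end);
          if (docCountToMerge >= docCountFromStart / 2) {
            break;
          }
        }
      }

      // Assumed effect of mergeSegments(start, end + 1):
      // segments start..end collapse into one segment at position start.
      segments.subList(start, end + 1).clear();
      segments.add(start, docCountToMerge);

      start++;
      docCountFromStart -= docCountToMerge;
    }
    return segments;
  }

  public static void main(String[] args) {
    // The two problem shapes from the quoted email, with a small minMergeDocs.
    System.out.println(simulate(Arrays.asList(1000, 2000, 4000, 8000), 10));
    System.out.println(simulate(Arrays.asList(4000, 100, 100, 4000), 10));
  }
}
```

Running main() feeds in the ascending layout and the sandwich layout described
above, which gives a quick way to check what the loop does with each before
touching a real index.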
> Optimization for IndexWriter.addIndexes()
> -----------------------------------------
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
> Priority: Minor
> Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to
> optimize the index both before and after adding the segments. When you have
> a very large index, to which you are adding batches of small updates, these
> calls to optimize make using addIndexes() impossible. It makes parallel
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on
> the newly added documents. It will try to avoid calling mergeSegments until
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works
> correctly if people are interested. I gave it a different name because it
> has very different performance characteristics which can make querying take
> longer.