In my testing, a mergeFactor of around 100 has worked well for collections like Wikipedia. The point about paying the cost back once optimize is undertaken is interesting and may be worth exploring more. I also wonder how partial optimizes might help.

The Javadocs say:
Determines how often segment indices are merged by addDocument(). With
   * smaller values, less RAM is used while indexing, and searches on
   * unoptimized indices are faster, but indexing speed is slower. With larger
   * values, more RAM is used during indexing, and while searches on
   * unoptimized indices are slower, indexing is faster. Thus larger values
   * (> 10) are best for batch index creation, and smaller values (< 10) for
   * indices that are interactively maintained.
   *
   * <p>Note that this method is a convenience method: it
   * just calls mergePolicy.setMergeFactor as long as
   * mergePolicy is an instance of {@link LogMergePolicy}.
   * Otherwise an IllegalArgumentException is thrown.</p>
   *
   * <p>This must never be less than 2.  The default value is 10.

I'd like to append to the last line to say something like:
Empirical testing suggests a maximum value around 100, but this depends on the collection. Really large values (>>> 100) are discouraged.
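
For context, a minimal sketch of how the convenience method is used against the 2.x API (the index path is illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MergeFactorDemo {
  public static void main(String[] args) throws Exception {
    // setMergeFactor is a convenience: it delegates to
    // mergePolicy.setMergeFactor when the policy is a LogMergePolicy.
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.setMergeFactor(10);    // the default; suits interactively maintained indices
    // writer.setMergeFactor(100); // batch creation; the empirical ceiling discussed here
    writer.close();
  }
}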



On Dec 18, 2007, at 12:10 AM, Doron Cohen wrote:

On Dec 18, 2007 2:38 AM, Mark Miller <[EMAIL PROTECTED]> wrote:

For the data that I normally work with (short articles), I found that
the sweet spot was around 80-120. I actually saw a slight decrease going
above that... not sure if that held forever, though. That was testing on
an earlier release (I think 2.1?). However, if you want to test
searching, it would seem that you are going to want to optimize the
index. I have always found that whatever I save by changing the merge
factor is paid back when you optimize. I have not "scientifically"
tested this, but found it to be the case in every speed test I ran. This
is an interesting thing to me for this test. Do you test with a full
optimize for indexing? If you don't, can you really test the search
performance with the advantage of a full optimize? So, if you are going
to optimize, why mess with the merge factor? It may still play a small
role, but at best I think it's a pretty weak lever.
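
For what it's worth, one way to test this "scientifically" would be to time the build and the optimize separately, so the cost paid back at optimize time is visible. A rough sketch against the 2.x API (the toy corpus and paths are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

import java.util.Arrays;
import java.util.List;

public class MergeFactorTiming {
  // Time the addDocument() loop and the optimize() separately for a
  // given mergeFactor, so the trade-off shows up in the numbers.
  static void run(List<String> docs, int mergeFactor) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/idx-" + mergeFactor, new StandardAnalyzer(), true);
    writer.setMergeFactor(mergeFactor);

    long t0 = System.currentTimeMillis();
    for (String text : docs) {
      Document doc = new Document();
      doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }
    long built = System.currentTimeMillis();

    writer.optimize(); // merge down to a single segment
    long optimized = System.currentTimeMillis();
    writer.close();

    System.out.println("mergeFactor=" + mergeFactor
        + " build=" + (built - t0) + "ms"
        + " optimize=" + (optimized - built) + "ms");
  }

  public static void main(String[] args) throws Exception {
    List<String> docs = Arrays.asList("hello world", "merge factor test"); // toy corpus
    run(docs, 10);
    run(docs, 100);
  }
}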


I had a similar experience - I set the merge factor to ~maxint and optimized
at the end, and it "felt" like it was the same (never measured, though).
In fact, with the new concurrent merges, I think it should be faster to
merge on the fly?

(One comment - it is important to set the merge factor back to a reasonable
number before the final optimize; otherwise you hit an OutOfMemoryError due
to so many segments being merged at once.)
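
A sketch of that overall pattern under the 2.3 API (ConcurrentMergeScheduler does the on-the-fly merging in background threads; the path and values are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;

public class BatchThenOptimize {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.setMergeScheduler(new ConcurrentMergeScheduler()); // merge on the fly, off the indexing thread
    writer.setMergeFactor(1000); // very lazy merging during the batch load

    // ... addDocument() loop for the batch ...

    // Drop back to a sane value first; optimizing with a huge
    // mergeFactor tries to merge far too many segments at once.
    writer.setMergeFactor(10);
    writer.optimize();
    writer.close();
  }
}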


- Mark

Grant Ingersoll wrote:
I did hear back from the authors.  Some of the issues were based on
values chosen for mergeFactor (10,000) I think, but there also seemed
to be some questions about parsing the TREC collection. It was split
out into individual files, as opposed to trying to stream in the
documents like we do with Wikipedia, so I/O overhead may be an issue.
At the time, 1.9.1 did not have much TREC support, so splitting files
is probably the easiest way to do it.  Their indexing code was based
on the demo and some LIA reading.

They thought they would try Lucene again when 2.3 comes out. From our
end, I think we need to improve the docs around mergeFactor.  We
generally just say bigger is better, but my understanding is there is
definitely a limit to this (100??  Maybe 1000) so we should probably
suggest that in the docs.  And, of course, I think the new
contrib/benchmark has support for reading TREC (although I don't know
if it handles streaming it) such that I think it shouldn't be a
problem this time around.


Yes, it does streaming - TREC compressed files are read with GZIPInputStream
"on demand": the next doc's text is read/parsed only when the indexer
requests it, and the indexable document is created in memory; no doc files
are created on disk.
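
Not the actual contrib/benchmark code, but the on-demand idea looks roughly like this, assuming one document per line in the compressed file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class StreamingDocReader {
  private final BufferedReader reader;

  public StreamingDocReader(String gzPath) throws IOException {
    // Decompress lazily: bytes are inflated only as the indexer pulls them.
    reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(gzPath)), "UTF-8"));
  }

  /** Returns the next document's text, or null at end of file. */
  public String nextDoc() throws IOException {
    return reader.readLine(); // assumption: one document per line
  }

  public void close() throws IOException {
    reader.close();
  }
}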



At any rate, I think we are for the most part doing the right things.
Anyone have thoughts on an upper bound for mergeFactor?

Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:

On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:

+1 I have been thinking about this too. Solr clearly demonstrates the benefits of this kind of approach, although even it doesn't make it seamless for users in the sense that they still need to divvy up
the docs on the app side.

Would be nice if this layer also took care of searchers/readers
refreshing & warming.
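
Such a layer would presumably wrap the usual open-warm-swap pattern; a rough sketch against the 2.x API (the field name and warming query are stand-ins):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearcherRefresher {
  private volatile IndexSearcher current;
  private final String indexPath;

  public SearcherRefresher(String indexPath) throws Exception {
    this.indexPath = indexPath;
    this.current = new IndexSearcher(indexPath);
  }

  /** Open a new searcher over the latest commit, warm it, then swap. */
  public synchronized void refresh() throws Exception {
    IndexSearcher fresh = new IndexSearcher(indexPath);
    Query warming = new QueryParser("body", new StandardAnalyzer()).parse("warmup"); // stand-in query
    fresh.search(warming); // pull norms/caches into memory before serving traffic
    IndexSearcher old = current;
    current = fresh;
    // A production version must also make sure no in-flight searches
    // still use `old` before closing it (Solr reference-counts for this).
    old.close();
  }

  public IndexSearcher get() {
    return current;
  }
}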

Solr has well-tested code that provides all this functionality and
more (except for automatically spawning extra indexing threads, which
I agree would be a useful addition). It does heavily depend on 1.5's
java.util.concurrent package, though. Many people seem to like using
Solr as an embedded library layer on top of Lucene to do it all
in-process, as well.

-Mike













--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




