In my testing, a mergeFactor of around 100 has worked well for collections like Wikipedia. The point about paying the cost back once optimize is undertaken is interesting and may be worth exploring more. I also wonder how partial optimizes might help.

The Javadocs say:
Determines how often segment indices are merged by addDocument(). With
   * smaller values, less RAM is used while indexing, and searches on
   * unoptimized indices are faster, but indexing speed is slower. With larger
   * values, more RAM is used during indexing, and while searches on
   * unoptimized indices are slower, indexing is faster. Thus larger values
   * (> 10) are best for batch index creation, and smaller values (< 10) for
   * indices that are interactively maintained.
   *
   * <p>Note that this method is a convenience method: it
   * just calls mergePolicy.setMergeFactor as long as
   * mergePolicy is an instance of {@link LogMergePolicy}.
   * Otherwise an IllegalArgumentException is thrown.</p>
   *
   * <p>This must never be less than 2.  The default value is 10.

I'd like to append to the last line to say something like:
Empirical testing suggests a maximum value around 100, but this depends on the collection. Really large values (>>> 100) are discouraged.
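
For context, a minimal sketch of how the convenience method is used against the 2.x API (the index path is illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MergeFactorDemo {
  public static void main(String[] args) throws Exception {
    // setMergeFactor is a convenience: it delegates to
    // mergePolicy.setMergeFactor when the policy is a LogMergePolicy.
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.setMergeFactor(10);    // the default; suits interactively maintained indices
    // writer.setMergeFactor(100); // batch creation; the empirical ceiling discussed here
    writer.close();
  }
}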



On Dec 18, 2007, at 12:10 AM, Doron Cohen wrote:

On Dec 18, 2007 2:38 AM, Mark Miller <[EMAIL PROTECTED]> wrote:

For the data that I normally work with (short articles), I found that
the sweet spot was around 80-120. I actually saw a slight decrease going
above that... not sure if that held forever, though. That was testing on
an earlier release (I think 2.1?). However, if you want to test
searching, it would seem that you are going to want to optimize the
index. I have always found that whatever I save by changing the merge
factor is paid back when you optimize. I have not "scientifically"
tested this, but found it to be the case in every speed test I ran. This
is an interesting thing to me for this test. Do you test with a full
optimize for indexing? If you don't, can you really test the search
performance with the advantage of a full optimize? So, if you are going
to optimize, why mess with the merge factor? It may still play a small
role, but at best I think it's a pretty weak lever.
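
For what it's worth, one way to test this "scientifically" would be to time the build and the optimize separately, so the cost paid back at optimize time is visible. A rough sketch against the 2.x API (the toy corpus and paths are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

import java.util.Arrays;
import java.util.List;

public class MergeFactorTiming {
  // Time the addDocument() loop and the optimize() separately for a
  // given mergeFactor, so the trade-off shows up in the numbers.
  static void run(List<String> docs, int mergeFactor) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/idx-" + mergeFactor, new StandardAnalyzer(), true);
    writer.setMergeFactor(mergeFactor);

    long t0 = System.currentTimeMillis();
    for (String text : docs) {
      Document doc = new Document();
      doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }
    long built = System.currentTimeMillis();

    writer.optimize(); // merge down to a single segment
    long optimized = System.currentTimeMillis();
    writer.close();

    System.out.println("mergeFactor=" + mergeFactor
        + " build=" + (built - t0) + "ms"
        + " optimize=" + (optimized - built) + "ms");
  }

  public static void main(String[] args) throws Exception {
    List<String> docs = Arrays.asList("hello world", "merge factor test"); // toy corpus
    run(docs, 10);
    run(docs, 100);
  }
}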


I had a similar experience - I set the merge factor to ~maxint and optimized
at the end, and it "felt" like it was the same (never measured, though).
In fact, with the new concurrent merges, I think it should be faster to
merge on the fly?

(One comment - it is important to set the merge factor back to a reasonable
number before the final optimize; otherwise you hit an OutOfMemoryError due
to so many segments being merged at once.)
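
A sketch of that overall pattern under the 2.3 API (ConcurrentMergeScheduler does the on-the-fly merging in background threads; the path and values are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;

public class BatchThenOptimize {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.setMergeScheduler(new ConcurrentMergeScheduler()); // merge on the fly, off the indexing thread
    writer.setMergeFactor(1000); // very lazy merging during the batch load

    // ... addDocument() loop for the batch ...

    // Drop back to a sane value first; optimizing with a huge
    // mergeFactor tries to merge far too many segments at once.
    writer.setMergeFactor(10);
    writer.optimize();
    writer.close();
  }
}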


- Mark

Grant Ingersoll wrote:
I did hear back from the authors.  Some of the issues were based on
values chosen for mergeFactor (10,000) I think, but there also seemed
to be some questions about parsing the TREC collection. It was split
out into individual files, as opposed to trying to stream in the
documents like we do with Wikipedia, so I/O overhead may be an issue.
At the time, 1.9.1 did not have much TREC support, so splitting files
is probably the easiest way to do it.  Their indexing code was based
on the demo and some LIA reading.

They thought they would try Lucene again when 2.3 comes out. From our
end, I think we need to improve the docs around mergeFactor.  We
generally just say bigger is better, but my understanding is there is
definitely a limit to this (100??  Maybe 1000) so we should probably
suggest that in the docs.  And, of course, I think the new
contrib/benchmark has support for reading TREC (although I don't know
if it handles streaming it) such that I think it shouldn't be a
problem this time around.


Yes, it does streaming - TREC compressed files are read with GZIPInputStream
"on demand": the next doc's text is read/parsed only when the indexer
requests it, and the indexable document is created in memory; no doc files
are created on disk.
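
Not the actual contrib/benchmark code, but the on-demand idea looks roughly like this, assuming one document per line in the compressed file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class StreamingDocReader {
  private final BufferedReader reader;

  public StreamingDocReader(String gzPath) throws IOException {
    // Decompress lazily: bytes are inflated only as the indexer pulls them.
    reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(gzPath)), "UTF-8"));
  }

  /** Returns the next document's text, or null at end of file. */
  public String nextDoc() throws IOException {
    return reader.readLine(); // assumption: one document per line
  }

  public void close() throws IOException {
    reader.close();
  }
}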



At any rate, I think we are for the most part doing the right things.
Anyone have thoughts on an upper bound for mergeFactor?

Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:

On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:

+1 I have been thinking about this too. Solr clearly demonstrates the benefits of this kind of approach, although even it doesn't make it seamless for users in the sense that they still need to divvy up
the docs on the app side.

Would be nice if this layer also took care of searchers/readers
refreshing & warming.
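
Such a layer would presumably wrap the usual open-warm-swap pattern; a rough sketch against the 2.x API (the field name and warming query are stand-ins):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearcherRefresher {
  private volatile IndexSearcher current;
  private final String indexPath;

  public SearcherRefresher(String indexPath) throws Exception {
    this.indexPath = indexPath;
    this.current = new IndexSearcher(indexPath);
  }

  /** Open a new searcher over the latest commit, warm it, then swap. */
  public synchronized void refresh() throws Exception {
    IndexSearcher fresh = new IndexSearcher(indexPath);
    Query warming = new QueryParser("body", new StandardAnalyzer()).parse("warmup"); // stand-in query
    fresh.search(warming); // pull norms/caches into memory before serving traffic
    IndexSearcher old = current;
    current = fresh;
    // A production version must also make sure no in-flight searches
    // still use `old` before closing it (Solr reference-counts for this).
    old.close();
  }

  public IndexSearcher get() {
    return current;
  }
}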

Solr has well-tested code that provides all this functionality and
more (except for automatically spawning extra indexing threads, which
I agree would be a useful addition). It does heavily depend on 1.5's
java.util.concurrent package, though. Many people seem to like using
Solr as an embedded library layer on top of Lucene to do it all
in-process, as well.

-Mike













--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




