Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-06 Thread Doug Cutting
Chuck Williams wrote:
> I've got about 30k documents and have 3 indexing scenarios:
> 1.   Full indexing and optimize
> 2.   Incremental indexing and optimize
> 3.   Parallel incremental indexing without optimize
> Search performance is critical.  For both cases 1 and 2, I'd like the
> fastest possible indexing time.  For case 3, I'd like minimal pauses and
> no noticeable degradation in search performance.
>
> Based on reading the code (including the javadocs comments), I'm
> thinking of values along these lines:
> mergeFactor:  1000 during Full indexing, and during optimize (for both
> cases 1 and 2); 10 during incremental indexing (cases 2 and 3)

1000 is too big of a mergeFactor for any practical purpose.
I don't see a point in using different mergeFactors in cases 1 and 2. 
If you're going to optimize before you search, then you want the fastest 
batch indexing mode.  I would use something like 50 for both cases 1 and 2.

For case 3, where unoptimized search performance is very important, I 
would use something smaller than 10.  For Technorati's blog search, 
which incrementally maintains a Lucene index with millions of documents, 
I used a mergeFactor of 2 in order to maximize search performance. 
Indexing performance on a single CPU is still adequate to keep up with 
the rate of change of today's blogosphere.
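
The trade-off described above can be illustrated with a toy model. The sketch below is not Lucene's merge code; it assumes a simplified policy in which every minMergeDocs documents are flushed as one segment and any run of mergeFactor equal-sized segments is merged at once. The class and method names are invented for illustration. It reports the peak number of segments a search would have to visit and the total number of documents rewritten by merges.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment merging; an assumption, NOT Lucene's implementation. */
public class MergeFactorSketch {

    /** Outcome of one simulated indexing run. */
    static final class Result {
        final int peakSegments;  // most segments present after any flush
        final long docsCopied;   // documents rewritten by merges in total
        Result(int peakSegments, long docsCopied) {
            this.peakSegments = peakSegments;
            this.docsCopied = docsCopied;
        }
    }

    /** Flushes a segment every minMergeDocs documents and merges uniform runs. */
    static Result simulate(int totalDocs, int mergeFactor, int minMergeDocs) {
        if (mergeFactor < 2 || minMergeDocs < 1) {
            throw new IllegalArgumentException("need mergeFactor >= 2 and minMergeDocs >= 1");
        }
        List<Long> segments = new ArrayList<>(); // doc counts, oldest first
        int peak = 0;
        long copied = 0;
        for (int flushed = minMergeDocs; flushed <= totalDocs; flushed += minMergeDocs) {
            segments.add((long) minMergeDocs);
            // Cascade: a merge may itself complete a run at the next size level.
            while (tailIsUniform(segments, mergeFactor)) {
                long merged = 0;
                for (int i = 0; i < mergeFactor; i++) {
                    merged += segments.remove(segments.size() - 1);
                }
                segments.add(merged);
                copied += merged;
            }
            peak = Math.max(peak, segments.size());
        }
        return new Result(peak, copied);
    }

    /** True when the newest mergeFactor segments all hold the same number of docs. */
    static boolean tailIsUniform(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        long size = segments.get(n - 1);
        for (int i = n - mergeFactor; i < n - 1; i++) {
            if (segments.get(i) != size) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 30000 documents as in the question; minMergeDocs of 10 is an assumed setting.
        for (int mergeFactor : new int[] {2, 10, 50}) {
            Result r = simulate(30000, mergeFactor, 10);
            System.out.println("mergeFactor=" + mergeFactor
                + " peakSegments=" + r.peakSegments
                + " docsCopied=" + r.docsCopied);
        }
    }
}
```

In this model a small mergeFactor keeps the segment count low at the price of rewriting documents more often, which is in line with the advice to use a small value when searching an unoptimized index and a larger one for batch indexing.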

> minMergeDocs:  1000 during Full indexing, 10 during incremental indexing

I see no reason to lower this when indexing incrementally.  1000 is a 
good value for high performance indexing when RAM is plentiful and 
documents are not too large.

> maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
> incremental indexing

1000 seems low to me, as it will result in too many segments, slowing 
search.  Here one should select the largest value that can be merged in 
the maximum time delay permitted in your application between a new 
document arriving and it appearing in search results.  So how up-to-date 
must your index be?  If it's okay for it to occasionally be a few 
minutes out of date, then you can probably safely increase this to at 
least tens or hundreds of thousands, perhaps even millions.  When 
incrementally indexing, the most recently added segments stay cached in 
RAM by the filesystem.  So, on a system with a gigabyte of RAM that's 
dedicated to incremental indexing, you might safely set maxMergeDocs to 
account for a few hundred megabytes of index without encountering slow, 
i/o-bound merges.
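
The rule of thumb above amounts to a small calculation. The sketch below takes the smaller of two bounds: how many documents can be merged within the permitted delay, and how many fit in the part of RAM expected to hold recent segments in the filesystem cache. Every number in it (merge throughput, index bytes per document, cache budget) is a made-up placeholder to be replaced with measurements from your own system, and the helper name is hypothetical.

```java
/** Back-of-the-envelope estimate for maxMergeDocs; all inputs are assumptions. */
public class MaxMergeDocsEstimate {

    /**
     * @param maxDelaySeconds    longest acceptable lag before a new document is searchable
     * @param mergeDocsPerSecond measured merge throughput (hypothetical figure here)
     * @param cacheBudgetBytes   RAM expected to stay available for cached segments
     * @param indexBytesPerDoc   average on-disk index size per document
     */
    static int estimate(double maxDelaySeconds, double mergeDocsPerSecond,
                        long cacheBudgetBytes, long indexBytesPerDoc) {
        if (indexBytesPerDoc <= 0) {
            throw new IllegalArgumentException("indexBytesPerDoc must be positive");
        }
        double byDelay = maxDelaySeconds * mergeDocsPerSecond;          // time bound
        double byCache = (double) cacheBudgetBytes / indexBytesPerDoc;  // RAM bound
        double bound = Math.max(1.0, Math.min(byDelay, byCache));
        return (int) Math.min(bound, (double) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Two minutes of allowed staleness, 300 MB of cache, 2 KB of index per document.
        long cache = 300L * 1024 * 1024;
        System.out.println("fast merges: " + estimate(120, 5000, cache, 2048));
        System.out.println("slow merges: " + estimate(120, 500, cache, 2048));
    }
}
```

With the placeholder figures the first case is limited by the cache budget and the second by the permitted delay, so it is worth measuring both before choosing a value.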

> Since mergeFactor is used in both addDocument() and optimize(), I'm
> thinking of using two different values in case 2:  10 during the
> incremental indexing, and then 1000 during the optimize.  Is changing
> the value like this going to cause a problem?

It should not cause problems to use different mergeFactors at different 
times.
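
One way to see what a different mergeFactor changes at optimize time is a toy model. The sketch assumes, as a simplification that may not match the actual implementation, that optimize repeatedly merges the newest mergeFactor segments until a single segment remains, and it counts the documents rewritten along the way. The segment sizes and all names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of optimize; an assumption, NOT Lucene's implementation. */
public class OptimizeCostSketch {

    /** Counts documents rewritten while collapsing the segments into one. */
    static long docsCopiedByOptimize(long[] segmentSizes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Long> segments = new ArrayList<>();
        for (long size : segmentSizes) {
            segments.add(size);
        }
        long copied = 0;
        while (segments.size() > 1) {
            // Merge the newest mergeFactor segments (or all of them if fewer remain).
            int from = Math.max(0, segments.size() - mergeFactor);
            List<Long> tail = segments.subList(from, segments.size());
            long merged = 0;
            for (long size : tail) {
                merged += size;
            }
            tail.clear();
            segments.add(merged);
            copied += merged;
        }
        return copied;
    }

    public static void main(String[] args) {
        // Hypothetical index left behind by incremental indexing, oldest segment first.
        long[] sizes = {10000, 10000, 1000, 1000, 1000, 100, 100, 10, 10, 10, 10, 10};
        System.out.println("mergeFactor 10: " + docsCopiedByOptimize(sizes, 10));
        System.out.println("mergeFactor 50: " + docsCopiedByOptimize(sizes, 50));
    }
}
```

Under this model a mergeFactor at least as large as the segment count collapses the index in a single pass, while a smaller one rewrites some documents more than once.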

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-04 Thread John Wang
We've found something interesting about mergeFactors.

We are indexing a million documents with a batch of 1000.
We first set the mergeFactor to 1000.

What we found is that at every 10th commit we see a significant spike in
indexing time.

The reason is that the indexer tries to merge the segments on every
10th commit (i.e. at 10*mergeFactor); since the mergeFactor is large, the
merge time is also long.
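
The periodic spike can be reproduced with a toy model. The sketch below assumes minMergeDocs was left at 10 (the message does not say) and uses a simplified merge policy rather than Lucene's code: every minMergeDocs documents become a segment, and a run of mergeFactor equal-sized segments is merged immediately. All names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-commit merge work; an assumption, NOT Lucene's implementation. */
public class MergeSpikeSketch {

    /** Returns, for each commit, how many documents were rewritten by merges. */
    static long[] docsMergedPerCommit(int commits, int batchSize,
                                      int mergeFactor, int minMergeDocs) {
        if (mergeFactor < 2 || minMergeDocs < 1) {
            throw new IllegalArgumentException("need mergeFactor >= 2 and minMergeDocs >= 1");
        }
        List<Long> segments = new ArrayList<>(); // doc counts, oldest first
        long[] mergedPerCommit = new long[commits];
        int buffered = 0;
        for (int c = 0; c < commits; c++) {
            for (int d = 0; d < batchSize; d++) {
                if (++buffered < minMergeDocs) {
                    continue; // still buffering documents in memory
                }
                segments.add((long) buffered);
                buffered = 0;
                mergedPerCommit[c] += mergeTail(segments, mergeFactor);
            }
        }
        return mergedPerCommit;
    }

    /** Merges runs of mergeFactor equal-sized newest segments; returns docs rewritten. */
    static long mergeTail(List<Long> segments, int mergeFactor) {
        long copied = 0;
        while (tailIsUniform(segments, mergeFactor)) {
            long merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += segments.remove(segments.size() - 1);
            }
            segments.add(merged);
            copied += merged;
        }
        return copied;
    }

    static boolean tailIsUniform(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        long size = segments.get(n - 1);
        for (int i = n - mergeFactor; i < n - 1; i++) {
            if (segments.get(i) != size) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 20 commits of 1000 documents, mergeFactor 1000, assumed minMergeDocs 10.
        long[] merged = docsMergedPerCommit(20, 1000, 1000, 10);
        for (int c = 0; c < merged.length; c++) {
            if (merged[c] > 0) {
                System.out.println("commit " + (c + 1) + ": " + merged[c] + " docs merged");
            }
        }
    }
}
```

With 1000 documents per commit, minMergeDocs of 10 and mergeFactor of 1000, the model merges 10*1000 = 10,000 documents on commits 10 and 20 and nothing in between, which is consistent with a spike on every 10th commit.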

The example given in the previous email thread indexes identical
documents, so merge time is very fast since no new terms are introduced
as indexing proceeds. Hence it may hide this overhead.

We found mergeFactor=100 worked well for our application.

Cheers

-John





Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-03 Thread Otis Gospodnetic
In my experiments with mergeFactor I found the point of diminishing/no
returns.  If I remember correctly, I hit the limit at mergeFactor of
50.

But here is something from Lucene in Action that you can use to play
with various index tuning factors and see their effect on indexing
performance.  It's simple, and if you want to test all 3 of your
scenarios, you will have to modify it.

package lia.indexing;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Times the indexing of identical one-field documents so that the effect
 * of mergeFactor, maxMergeDocs and minMergeDocs can be compared.
 * Arguments: <docsInIndex> <mergeFactor> <maxMergeDocs> <minMergeDocs>
 */
public class IndexTuningDemo {

  public static void main(String[] args) throws Exception {
    if (args.length < 4) {
      System.err.println("Usage: IndexTuningDemo <docsInIndex> "
          + "<mergeFactor> <maxMergeDocs> <minMergeDocs>");
      System.exit(1);
    }
    int docsInIndex  = Integer.parseInt(args[0]);

    // create an index called 'index-dir' in a temp directory
    Directory dir = FSDirectory.getDirectory(
      System.getProperty("java.io.tmpdir", "tmp") +
      System.getProperty("file.separator") + "index-dir", true);
    Analyzer analyzer = new SimpleAnalyzer();
    IndexWriter writer = new IndexWriter(dir, analyzer, true);

    // set variables that affect speed of indexing
    writer.mergeFactor   = Integer.parseInt(args[1]);
    writer.maxMergeDocs  = Integer.parseInt(args[2]);
    writer.minMergeDocs  = Integer.parseInt(args[3]);
    writer.infoStream    = System.out;

    System.out.println("Merge factor:   " + writer.mergeFactor);
    System.out.println("Max merge docs: " + writer.maxMergeDocs);
    System.out.println("Min merge docs: " + writer.minMergeDocs);

    // time the indexing of docsInIndex identical documents
    long start = System.currentTimeMillis();
    for (int i = 0; i < docsInIndex; i++) {
      Document doc = new Document();
      doc.add(Field.Text("fieldname", "Bibamus"));
      writer.addDocument(doc);
    }
    writer.close();
    long stop = System.currentTimeMillis();
    System.out.println("Time: " + (stop - start) + " ms");
  }
}


Otis







Recommended values for mergeFactor, minMergeDocs, maxMergeDocs

2004-12-03 Thread Chuck Williams
I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs
people have found to yield the best performance for different
configurations.  Is there a repository of this information anywhere?

I've got about 30k documents and have 3 indexing scenarios:

1.   Full indexing and optimize
2.   Incremental indexing and optimize
3.   Parallel incremental indexing without optimize

Search performance is critical.  For both cases 1 and 2, I'd like the
fastest possible indexing time.  For case 3, I'd like minimal pauses and
no noticeable degradation in search performance.

Based on reading the code (including the javadocs comments), I'm
thinking of values along these lines:

mergeFactor:  1000 during Full indexing, and during optimize (for both
cases 1 and 2); 10 during incremental indexing (cases 2 and 3)

minMergeDocs:  1000 during Full indexing, 10 during incremental indexing

maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
incremental indexing

Do these values seem reasonable?  Are there better settings before I
start experimenting?

Since mergeFactor is used in both addDocument() and optimize(), I'm
thinking of using two different values in case 2:  10 during the
incremental indexing, and then 1000 during the optimize.  Is changing
the value like this going to cause a problem?

Thanks for any advice,

Chuck