Re: Most efficient way to index 14M documents (out of memory/file handles)
Doug Cutting wrote:
> Julien,
>
> Thanks for the excellent explanation. I think this thread points to a
> documentation problem. We should improve the javadoc for these parameters.
> [...]

I'd like to see something like this done. BTW, I'm willing to add it to the wiki in the interim; this conversation has happened a few times now.

Kevin
Re: Most efficient way to index 14M documents (out of memory/file handles)
Julien,

Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to use them correctly.

In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended, since they can run into file handle limitations with FSDirectory. The maximum number of open files while merging is around mergeFactor * (5 + number of indexed fields).

Perhaps mergeFactor should be tagged an "Expert" parameter to discourage folks from playing with it, as it is such a common source of problems. The javadoc should instead encourage using minMergeDocs to increase indexing speed by using more memory. This parameter is unfortunately poorly named; it should really be called something like maxBufferedDocs.

Doug

Julien Nioche wrote:
> It is not surprising that you run out of file handles with such a large mergeFactor.
> [...]
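Doug's rule of thumb is easy to turn into a back-of-envelope check. The sketch below is only illustrative: the field count and the 1024-handle ulimit are assumptions, not numbers from this thread, but it shows why a mergeFactor in the thousands blows past a typical per-process file handle limit while a value like 10 stays well under it.

    public class FileHandleEstimate {
        // Doug's rule of thumb: open files while merging ~= mergeFactor * (5 + number of indexed fields)
        static long estimateOpenFiles(int mergeFactor, int indexedFields) {
            return (long) mergeFactor * (5 + indexedFields);
        }

        public static void main(String[] args) {
            int indexedFields = 10;   // hypothetical; count your own indexed fields
            int ulimit = 1024;        // a common default per-process file handle limit
            int[] factors = { 10, 100, 5000 };
            for (int i = 0; i < factors.length; i++) {
                long files = estimateOpenFiles(factors[i], indexedFields);
                System.out.println("mergeFactor=" + factors[i] + " -> ~" + files + " open files"
                        + (files > ulimit ? " (over a " + ulimit + "-handle ulimit)" : ""));
            }
        }
    }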
Re: Most efficient way to index 14M documents (out of memory/file handles)
It is not surprising that you run out of file handles with such a large mergeFactor. Before trying more complex strategies involving RAMDirectories and/or splitting your indexing across several machines, I reckon you should try simple things like using a low mergeFactor (e.g. 10) combined with a higher minMergeDocs (e.g. 1000), and optimizing only at the end of the process.

By setting a higher value for minMergeDocs, you index and merge in a RAMDirectory. When the limit is reached (e.g. 1000 docs), a segment is written to the FS. mergeFactor controls the number of segments to be merged, so when you have 10 segments on the FS (which is already 10 x 1000 docs), the IndexWriter merges them all into a single segment. This is equivalent to an optimize, I think. The process continues like that until it's finished.

Combining these parameters should be enough to achieve good performance. The good point of using minMergeDocs is that you make heavy use of the RAMDirectory inside your IndexWriter (== fast) without having to be too careful with the RAM (which would be the case if you managed a RAMDirectory yourself). At the same time, keeping your mergeFactor low limits the risk of "too many open files" problems.

----- Original Message -----
From: "Kevin A. Burton" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, July 07, 2004 7:44 AM
Subject: Most efficient way to index 14M documents (out of memory/file handles)

> I'm trying to burn an index of 14M documents.
> [...]
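For reference, here is a minimal sketch of the settings Julien suggests, written against the Lucene 1.3/1.4-era API where mergeFactor and minMergeDocs are public fields on IndexWriter. The index path, analyzer, field name and sample documents are placeholders, not details from this thread.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // true = create a new index from scratch
            IndexWriter writer = new IndexWriter("/tmp/bigindex", new StandardAnalyzer(), true);
            writer.mergeFactor = 10;      // merge only a few segments at a time -> few open files
            writer.minMergeDocs = 1000;   // buffer this many docs in RAM before writing a segment

            String[] texts = { "first document", "second document" };  // stand-in for the real corpus
            for (int i = 0; i < texts.length; i++) {
                Document doc = new Document();
                doc.add(Field.Text("contents", texts[i]));
                writer.addDocument(doc);
            }

            writer.optimize();   // a single optimize at the very end, not every 50k documents
            writer.close();
        }
    }

The point of this arrangement is that memory use is governed by minMergeDocs while the number of files touched during a merge is governed by mergeFactor, so indexing speed can be bought with RAM without multiplying open file handles.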
Re: Most efficient way to index 14M documents (out of memory/file handles)
A mergeFactor of 5000 is a bad idea. If you want to index faster, try increasing minMergeDocs instead. If you have lots of memory this can probably be 5000 or higher.

Also, why do you optimize before you're done? That only slows things down. Perhaps you have to do it because you've set mergeFactor to such an extreme value? I do not recommend a merge factor higher than 100.

Doug

Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
> [...]
Re: Most efficient way to index 14M documents (out of memory/file handles)
On Tue, Jul 06, 2004 at 10:44:40PM -0700, Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
>
> 1. I have to run optimize() every 50k documents or I run out of file handles.
> [...]

Recently I indexed roughly this many documents. I separated the whole thing first into 100 jobs (we happen to have that many machines in the cluster :-), each indexing its share into its own index. I used mergeFactor=100 and only optimized just before closing the index.

Then I merged them all into one index simply by:

  writer.mergeFactor = 150;
  writer.addIndexes(dirs);

I was surprised myself that it went through easily, in under two hours for each of the 101 indexes. The documents have, however, only three fields.

Maybe this helps,
Harald.

--
Harald Kirsch | [EMAIL PROTECTED] | +44 (0) 1223/49-2593
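For the final merge step Harald describes, something along these lines should work. It is a sketch only: the per-job index paths are hypothetical, and the mergeFactor of 150 is simply the value from his message.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            // Hypothetical per-job index paths; Harald's cluster produced 100 of these.
            String[] paths = { "/data/index-00", "/data/index-01", "/data/index-02" };
            Directory[] dirs = new Directory[paths.length];
            for (int i = 0; i < paths.length; i++) {
                dirs[i] = FSDirectory.getDirectory(paths[i], false);  // open the existing sub-indexes
            }

            IndexWriter writer = new IndexWriter("/data/index-merged", new StandardAnalyzer(), true);
            writer.mergeFactor = 150;   // the value from Harald's message
            writer.addIndexes(dirs);    // merge all sub-indexes into the new index
            writer.close();
        }
    }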
Re: Most efficient way to index 14M documents (out of memory/file handles)
[EMAIL PROTECTED] wrote:
> A colleague of mine found the fastest way to index was to use a RAMDirectory,
> letting it grow to a pre-defined maximum size, then merging it to a new
> temporary file-based index to flush it.
> [...]

I can confirm that this approach works quite well; I use it myself in some applications, both with Lucene 1.3 and 1.4. The disadvantage is of course that memory consumption goes up, so you have to be careful to cap the max size of the RAMDirectory according to your max heap size limits.

--
Best regards,
Andrzej Bialecki
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
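One way to enforce such a cap, using only methods the old Directory API exposes (list() and fileLength()), is to sum the lengths of the files held in the RAMDirectory and flush once the total crosses a threshold. This is a sketch under that assumption; the 64 MB figure is an arbitrary example, not a recommendation from the thread.

    import org.apache.lucene.store.RAMDirectory;

    public class RamSizeCheck {
        // Approximate the bytes held by a RAMDirectory by summing its file lengths.
        static long approximateSize(RAMDirectory ram) throws java.io.IOException {
            long total = 0;
            String[] files = ram.list();
            for (int i = 0; i < files.length; i++) {
                total += ram.fileLength(files[i]);
            }
            return total;
        }

        public static void main(String[] args) throws Exception {
            RAMDirectory ram = new RAMDirectory();
            long capBytes = 64L * 1024 * 1024;   // e.g. flush once the buffered index passes ~64 MB
            // ... add documents through an IndexWriter opened on 'ram', then periodically:
            if (approximateSize(ram) > capBytes) {
                // flush: merge 'ram' into a temporary file-based index and start a fresh RAMDirectory
            }
        }
    }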
Re: Most efficient way to index 14M documents (out of memory/file handles)
A colleague of mine found the fastest way to index was to use a RAMDirectory, letting it grow to a pre-defined maximum size, then merging it to a new temporary file-based index to flush it. Repeat this, creating new directories for all the file-based indexes, then perform a merge into one index once all docs are indexed.

I haven't managed to test this for myself, but my colleague says he noticed a considerable speed-up by merging once at the end with this approach, so you may want to give it a try. (This was with Lucene 1.3.)

I know Lucene uses a RAMDirectory internally, but the mergeFactor that controls the size of that cache is also used to determine the number of files created when it is flushed (which can get out of control).
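A rough sketch of that batching scheme, assuming Lucene 1.3/1.4-era APIs, is below. It caps each batch by document count rather than by the RAMDirectory's byte size (the original suggestion) because that is simpler to show; the batch size, paths, analyzer and field name are all made up for illustration. The final pass that merges the temporary indexes into one is the addIndexes() step shown earlier in the thread.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBufferedIndexer {
        static final int DOCS_PER_BATCH = 10000;   // stand-in for a "pre-defined maximum size"

        public static void main(String[] args) throws Exception {
            RAMDirectory ram = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
            int docsInRam = 0;
            int batch = 0;

            String[] texts = { "doc one", "doc two" };   // stand-in for the real document stream
            for (int i = 0; i < texts.length; i++) {
                Document doc = new Document();
                doc.add(Field.Text("contents", texts[i]));
                ramWriter.addDocument(doc);
                if (++docsInRam >= DOCS_PER_BATCH) {
                    flushToDisk(ram, ramWriter, batch++);
                    ram = new RAMDirectory();                                 // start a fresh buffer
                    ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
                    docsInRam = 0;
                }
            }
            flushToDisk(ram, ramWriter, batch);   // flush whatever is left

            // Afterwards, merge /tmp/part-0 ... /tmp/part-N into one index with addIndexes().
        }

        // Write the in-memory index out as one temporary file-based index.
        static void flushToDisk(RAMDirectory ram, IndexWriter ramWriter, int batch) throws Exception {
            ramWriter.close();
            IndexWriter fsWriter = new IndexWriter("/tmp/part-" + batch, new StandardAnalyzer(), true);
            fsWriter.addIndexes(new Directory[] { ram });
            fsWriter.close();
        }
    }

Whatever cap you choose has to fit the JVM heap, which is the caveat Andrzej raises in his reply.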
Re: Most efficient way to index 14M documents (out of memory/file handles)
Here's the thread you want:

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1722573

Nader Henein

Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
> [...]
Most efficient way to index 14M documents (out of memory/file handles)
I'm trying to burn an index of 14M documents. I have two problems.

1. I have to run optimize() every 50k documents or I run out of file handles. This takes TIME and of course is linear in the size of the index, so it just gets slower as I go. It starts to crawl at about 3M documents.

2. I will eventually run out of memory in this configuration.

I KNOW this has been covered before, but for the life of me I can't find it in the archives, the FAQ or the wiki.

I'm using an IndexWriter with a mergeFactor of 5k and then optimizing every 50k documents. Does it make sense to just create a new IndexWriter for every 50k docs and then do one big optimize() at the end?

Kevin

--
Please reply using PGP. http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster