On 10/11/06, peter <[EMAIL PROTECTED]> wrote:
> We've had somewhat of a similar situation ourselves, where we are indexing
> about a million records to an index, and each record can be somewhat large.
>
> Now..what happened on our side was that the index files (very similar in
> structure to what you have below) came up to a 2 gig limit and stopped
> there..and the indexer started crashing each time it hit this limit.
>
> On your side, I don't see your index file sizes really that large.  I think
> the compiling with large file support only really kicks in when you hit this
> 2 gig size limit.

Hi Peter,
Did you manage to compile Ferret successfully with large-file support yourself?

> Couple of thoughts that might help:
> 1.  On our side, to keep size down, I would optimize the index at every
> 100,000 documents.  The optimize call also flushes the index.

You can also just call Index#flush to flush the index without having
to optimize. Or IndexWriter#commit. They really both perform a
commit, so I'm going to alias commit to flush in the Index class in
the next version.
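The flush-every-N-documents pattern might look something like this. To keep the batching logic visible, a pure-Ruby stand-in (StubIndex) plays the role of Ferret::Index::Index here; StubIndex, index_in_batches and BATCH_SIZE are all hypothetical names, not part of Ferret's API:

```ruby
# Stand-in for Ferret::Index::Index that just records calls.
class StubIndex
  attr_reader :docs, :flushes
  def initialize; @docs = []; @flushes = 0; end
  def <<(doc); @docs << doc; end
  def flush; @flushes += 1; end
  def close; end
end

BATCH_SIZE = 3  # 100_000 in practice; tiny here for illustration

def index_in_batches(index, records, batch_size = BATCH_SIZE)
  records.each_with_index do |record, i|
    index << record
    # Flush (or index.optimize, as Peter does) at each batch boundary.
    index.flush if (i + 1) % batch_size == 0
  end
  index.flush  # catch the final partial batch
  index.close
end
```

With a real Ferret index you would pass Ferret::Index::Index.new(:path => ...) in place of the stub.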

> 2.  Make sure you close the index once you index your data.  Small
> thing..but just making sure.
>
> 3.  With the index being this large, we actually have two copies, one for
> searching against an already optimized index, and the other copy doing the
> indexing.  This way, no items are being searched on while the indexing is
> taking place.

This shouldn't be necessary. Whatever version of the index you open
the IndexReader on will be the version of the index that you are
searching. Even when its files are deleted, it holds on to the file
handles, so the data is still available. The operating system just
won't be able to reclaim that disk space until you close the
IndexReader (or Searcher).
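This is plain POSIX file-handle behaviour, which you can see without Ferret at all. In this sketch the open File object stands in for the IndexReader, and read_after_unlink is just a hypothetical helper name:

```ruby
require 'tmpdir'

# An open handle keeps a file's data readable after it is unlinked;
# the OS reclaims the space only when the last handle closes.
def read_after_unlink(dir)
  path = File.join(dir, 'segment_0')
  File.write(path, 'index segment data')

  reader = File.open(path, 'r')  # like an IndexReader holding a segment open
  File.unlink(path)              # like the writer deleting the old segment
  data = reader.read             # still readable through the handle
  reader.close                   # only now can the space be reclaimed
  data
end
```

(Note this relies on POSIX semantics; on Windows, deleting an open file behaves differently.)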

> 4.  One neat thing that I learned with indexing large items, was that I
> don't have to actually store everything.  I can have a field set to
> tokenize, but not store, so that it can be searched..but I don't need it to
> be displayed in the search results per se..I don't actually store it, so I
> was able to keep my index size down.

Very good tip. You should also set :term_vector to :no unless you are
using term-vectors.
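A hedged sketch of that field setup (the option names here are from my reading of Ferret's FieldInfos API, so double-check them against your version):

```ruby
require 'rubygems'
require 'ferret'

# :content is tokenized and searchable but not stored, and term
# vectors are off everywhere, which keeps the index size down.
field_infos = Ferret::Index::FieldInfos.new(:term_vector => :no)
field_infos.add_field(:title,   :store => :yes, :index => :yes)
field_infos.add_field(:content, :store => :no,  :index => :yes)

index = Ferret::Index::Index.new(:path => '/path/to/index',
                                 :field_infos => field_infos)
index << {:title => 'Large file support', :content => '...full text...'}
```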

Cheers,
Dave
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
