Hey Dave! Yes, we compiled with large-file support and everything seems to be working fine. In the end, once I figured out that I can tokenize a large chunk of text without actually storing it, the optimized index came out to only about 1 GB, so large-file support never became an issue, though we compiled with it anyway, just in case.
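In case it's useful to anyone else on the list, our field setup ended up looking roughly like this. The field names and path are made up and I'm going from memory, so treat it as a sketch against the 0.10.x API rather than our exact code:

  require 'rubygems'
  require 'ferret'

  # Small fields keep the stored/indexed defaults; the big text blob is
  # tokenized so it can be searched, but never stored, which is what keeps
  # the index size down.
  field_infos = Ferret::Index::FieldInfos.new(:store => :yes, :index => :yes)
  field_infos.add_field(:title, :store => :yes, :index => :yes)
  field_infos.add_field(:body,  :store => :no,  :index => :yes,
                        :term_vector => :no)  # term vectors off too (see below)

  index = Ferret::Index::Index.new(:path => '/data/search_index',
                                   :field_infos => field_infos,
                                   :create => true)

  index << {:title => 'Record 1', :body => 'the large text we never store...'}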
On the two-copies point: we actually have two boxes in our cluster, each with a copy of the index used for searching, but only one copy used for indexing. That way, each box in the cluster can search locally, while the "indexing" box indexes away and pushes updated copies out when it's done. Oh, and I do turn off :term_vector for most of my fields now; thanks for the tip. And thanks for all the hard work you put into making this product the best it can be.

> From: "David Balmain" <[EMAIL PROTECTED]>
> Reply-To: [email protected]
> Date: Wed, 11 Oct 2006 15:16:58 +0900
> To: [email protected]
> Subject: Re: [Ferret-talk] Indexing problem 10.9/10.10
>
> On 10/11/06, peter <[EMAIL PROTECTED]> wrote:
>> We've had somewhat of a similar situation ourselves, where we are
>> indexing about a million records to an index, and each record can be
>> somewhat large.
>>
>> Now, what happened on our side was that the index files (very similar in
>> structure to what you have below) grew to a 2 gig limit and stopped
>> there, and the indexer started crashing each time it hit this limit.
>>
>> On your side, I don't see your index file sizes being really that large.
>> I think compiling with large-file support only really kicks in when you
>> hit this 2 gig size limit.
>
> Hi Peter,
> Did you manage to compile Ferret successfully with large-file support
> yourself?
>
>> A couple of thoughts that might help:
>> 1. On our side, to keep size down, I would optimize the index every
>> 100,000 documents. The optimize call also flushes the index.
>
> You can also just call Index#flush to flush the index without having
> to optimize. Or IndexWriter#commit. Actually they should both be
> commit, so I'm going to alias commit to flush in the Index class in the
> next version.
>
>> 2. Make sure you close the index once you finish indexing your data.
>> Small thing, but just making sure.
>>
>> 3. With the index being this large, we actually keep two copies, one for
>> searching against an already optimized index, and the other copy doing
>> the indexing. This way, no items are being searched on while the
>> indexing is taking place.
>
> This shouldn't be necessary. Whatever version of the index you open
> the IndexReader on will be the version of the index that you are
> searching; even when its files are deleted it will hold on to the
> file handles, so the data will still be available. The operating
> system won't be able to reuse that disk space until you close the
> IndexReader (or Searcher).
>
>> 4. One neat thing that I learned while indexing large items was that I
>> don't have to store everything. I can have a field set to tokenize, but
>> not store, so that it can be searched; since I don't need it to be
>> displayed in the search results, I don't actually store it, and that
>> kept my index size down.
>
> Very good tip. You should also set :term_vector to :no unless you are
> using term-vectors.
>
> Cheers,
> Dave
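P.S. In case it helps anyone digging this thread out of the archives later, here's roughly the indexing loop we ended up with: the periodic optimize, the cheaper Index#flush that Dave mentions above, and the explicit close at the end. The batch sizes, field names and the records variable are placeholders, and this assumes the 0.10.x Index API:

  require 'rubygems'
  require 'ferret'

  index = Ferret::Index::Index.new(:path => '/data/search_index')

  records = []  # placeholder for whatever actually yields the million rows

  records.each_with_index do |record, i|
    index << {:id => record[:id], :body => record[:body]}

    # Index#flush writes buffered documents out without a full merge;
    # a full optimize every 100,000 documents keeps the segments compact.
    index.flush    if (i + 1) % 10_000  == 0
    index.optimize if (i + 1) % 100_000 == 0
  end

  index.optimize
  index.close  # close the index once indexing is finished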

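One more sketch, of Dave's point about the reader holding a consistent snapshot of the index while a writer works on it, since that's what makes the separate search copy unnecessary. Again, the path, field name and query term are made up, and this assumes the 0.10.x Searcher/TermQuery API:

  require 'rubygems'
  require 'ferret'

  # Open a searcher on the index as it exists right now.  It keeps serving
  # this snapshot even while a writer in another process adds documents or
  # optimizes, because it holds on to the old segment files' handles.
  searcher = Ferret::Search::Searcher.new('/data/search_index')
  query    = Ferret::Search::TermQuery.new(:body, 'ferret')

  searcher.search_each(query) do |doc_id, score|
    puts "doc #{doc_id} scored #{score}"
  end

  # Close and reopen periodically to pick up newly committed documents and
  # to let the OS reclaim the space of segments removed by optimize.
  searcher.close
  searcher = Ferret::Search::Searcher.new('/data/search_index')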
