Thanks for the tips; things seem happier now.  Yeah, the size of each
document (number of tokens) is actually quite small in my case - I
think this was just a case of me messing up the flush/optimize/close
tactics.
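In case it's useful to anyone searching the archives later, this is
roughly what my loop looks like now (the path and batch size are just
placeholders, not my real values):

  require 'rubygems'
  require 'ferret'

  index = Ferret::Index::Index.new(:path => '/tmp/my_index')

  # `docs' stands in for whatever enumerable yields the records
  docs.each_with_index do |doc, i|
    index << doc
    # optimize also flushes, so committing every N docs keeps the
    # segment count and memory use down
    index.optimize if (i + 1) % 100_000 == 0
  end

  index.flush  # write out the last partial batch
  index.close  # release the write lock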



On 10/10/06, peter <[EMAIL PROTECTED]> wrote:
> We've had a somewhat similar situation ourselves, where we are indexing
> about a million records into an index, and each record can be fairly large.
>
> Now, what happened on our side was that the index files (very similar in
> structure to what you have below) grew to a 2 gig limit and stopped
> there, and the indexer started crashing each time it hit that limit.
>
> On your side, I don't see index file sizes anywhere near that large.  I
> think compiling with large file support only really kicks in once you
> hit this 2 gig limit.
>
> A couple of thoughts that might help:
> 1.  On our side, to keep the size down, I would optimize the index every
> 100,000 documents.  The optimize call also flushes the index (there's a
> rough sketch of this after point 4).
>
> 2.  Make sure you close the index once you have indexed your data.  Small
> thing, but just making sure.
>
> 3.  With the index being this large, we actually keep two copies: one
> already-optimized copy that searches run against, and another copy that
> the indexer writes to.  This way, no items are being searched while the
> indexing is taking place.
>
> 4.  One neat thing I learned while indexing large items was that I
> don't actually have to store everything.  I can set a field to
> tokenize but not store: it can still be searched, but it won't be
> returned in the search results per se.  By not storing it, I was able
> to keep my index size down.
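>
> Rough sketch of points 1, 2 and 4 together (the field name, path and
> batch size here are made-up examples, not our actual setup):
>
>   require 'rubygems'
>   require 'ferret'
>
>   # Point 4: declare the big field as indexed (tokenized) but not
>   # stored, so it is searchable without bloating the index.
>   field_infos = Ferret::Index::FieldInfos.new
>   field_infos.add_field(:body, :store => :no, :index => :yes)
>
>   index = Ferret::Index::Index.new(:path => '/data/index',
>                                    :field_infos => field_infos)
>
>   records.each_with_index do |rec, i|
>     index << {:id => rec.id, :body => rec.body}
>     # Point 1: optimize (which also flushes) every 100,000 docs.
>     index.optimize if (i + 1) % 100_000 == 0
>   end
>
>   # Point 2: always close the index when you're done.
>   index.close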
>
>
>
> > From: "Ben Lee" <[EMAIL PROTECTED]>
> > Reply-To: [email protected]
> > Date: Tue, 10 Oct 2006 18:35:35 -0700
> > To: [email protected]
> > Subject: [Ferret-talk] Indexing problem 10.9/10.10
> >
> > Sorry if this is a repost -- I wasn't sure if the www.ruby-forum.com
> > list works for postings.
> > I've been having trouble indexing a large number of documents (2.4M).
> >
> >
> > Essentially, I have one process that is following the tutorial,
> > dumping documents to an index stored on the file system.  If I open the
> > index with another process and run the size() method, it is stuck at
> > a number of documents much smaller than the number I've added to the index.
> >
> > E.g. 290k, when the indexer process has already gone through 1M.
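> >
> > For reference, the check itself is just this (index path made up):
> >
> >   require 'rubygems'
> >   require 'ferret'
> >
> >   # open the same on-disk index from a second process
> >   index = Ferret::Index::Index.new(:path => '/data/my_index')
> >   puts index.size  # stays around 290k after the indexer added ~1M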
> >
> > Additionally, if I search, I don't get results past an even smaller
> > number of docs (22k).  I've tried the two latest Ferret releases.
> >
> >
> > Does this listing of the index directory look right?
> >
> > -rw-------  1 blee blee 3.8M Oct 10 17:06 _v.fdt
> > -rw-------  1 blee blee  51K Oct 10 17:06 _v.fdx
> > -rw-------  1 blee blee  12M Oct 10 16:49 _u.cfs
> > -rw-------  1 blee blee   97 Oct 10 16:49 fields
> >
> > -rw-------  1 blee blee   78 Oct 10 16:49 segments
> > -rw-------  1 blee blee  11M Oct 10 16:23 _t.cfs
> > -rw-------  1 blee blee  11M Oct 10 15:56 _s.cfs
> > -rw-------  1 blee blee  15M Oct 10 15:11 _r.cfs
> > -rw-------  1 blee blee  13M Oct 10 14:48 _q.cfs
> >
> > -rw-------  1 blee blee  14M Oct 10 14:37 _p.cfs
> > -rw-------  1 blee blee  13M Oct 10 14:28 _o.cfs
> > -rw-------  1 blee blee  12M Oct 10 14:19 _n.cfs
> > -rw-------  1 blee blee  12M Oct 10 14:16 _m.cfs
> > -rw-------  1 blee blee 118M Oct 10 14:10 _l.cfs
> >
> > -rw-------  1 blee blee 129M Oct 10 13:24 _a.cfs
> > -rw-------  1 blee blee    0 Oct 10 13:00 ferret-write.lck
> >
> > Thanks,
> > Ben
>
>
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
