I'm not sure the numbers are off w/ documents that big, although I imagine you are hitting the token limit. Is this all on one machine as you described, or are you saying you have several of them? If just one, have you tried using a single index?

Since you are using 2.3 (note to other readers: 2.3 is NOT released yet; he really means 2.3-dev), which MergeScheduler and MergePolicy are you using? Can you do some profiling to see where the time is being spent?
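
For other readers: in the 2.3-dev API these are set directly on the writer. A minimal sketch (both classes live in org.apache.lucene.index, and as I recall they are the new 2.3 defaults anyway):

    writer.setMergeScheduler(new ConcurrentMergeScheduler()); // run merges in background threads
    writer.setMergePolicy(new LogByteSizeMergePolicy());      // pick segments to merge by byte size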

Also, maybe Mike M. can chime in w/ how compressed fields are merged now. I want to say that with the new indexing changes they are all done right away and not revisited, so that shouldn't be an issue. Having said that, I am a bit confused by some of your terminology: you say some Fields are stored twice, but then say they are not stored. Can you share what the actual Field constructions are? There probably isn't a reason to compress the short biblio fields. Lucene Field compression, while not deprecated, really isn't recommended, b/c it doesn't give the application much control (it always uses the highest compression level and is not tunable). The better approach is to do the compression yourself and store the result as a binary field. Again, though, it doesn't sound like you need compression for those fields.
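
Something along these lines (an untested sketch; the helper and field name are made up, and BEST_SPEED is just one choice of level -- the point is that it is tunable, unlike Field.Store.COMPRESS):

    import java.io.ByteArrayOutputStream;
    import java.io.UnsupportedEncodingException;
    import java.util.zip.Deflater;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    static byte[] compress(byte[] input) {
      Deflater deflater = new Deflater(Deflater.BEST_SPEED); // pick your own speed/size tradeoff
      deflater.setInput(input);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
      byte[] buf = new byte[4096];
      while (!deflater.finished()) {
        out.write(buf, 0, deflater.deflate(buf));
      }
      deflater.end();
      return out.toByteArray();
    }

    static void addCompressed(Document doc, String name, String value)
        throws UnsupportedEncodingException {
      // binary stored field; decompress yourself with an Inflater at search time
      doc.add(new Field(name, compress(value.getBytes("UTF-8")), Field.Store.YES));
    }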

Are you using the compound file format or not?
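
For reference, that's the setUseCompoundFile flag on the writer (a sketch of the toggle, not a recommendation either way):

    // true (the default) packs each new segment into a single .cfs file;
    // false leaves many more files on disk but skips that extra copy step.
    writer.setUseCompoundFile(false);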

Also, were you using 2.2 before and upgraded, or is this an application built on 2.3 to begin with? If on 2.2, did you see these problems before?

Cheers,
Grant

On Nov 11, 2007, at 8:49 PM, Mark Miller wrote:

For a start, I would lower the merge factor quite a bit. A high merge factor is overrated :) You will build the index faster, but searches will be slower and an optimize takes much longer. Essentially, the time you save while indexing is paid back when optimizing anyway. You might as well amortize the cost with a lower merge factor, as in the sketch below.
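
For example, something like this (untested; 10 is the 2.3 default and a reasonable place to start experimenting, with dir and analyzer being whatever you use now):

    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    writer.setMergeFactor(10);      // well below the 50 you are using
    writer.setRAMBufferSizeMB(64);  // above the 16MB default, if you can spare the heap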

Grant seems to think the numbers are off anyway, so you may have more to do -- just a suggestion about the merge factor. How much RAM are you giving your application?

With a machine with 8 cores and 15,000 RPM disks, days does seem a little ridiculous.

- Mark

Barry Forrest wrote:
Hi,

Thanks for your help.

I'm using Lucene 2.3.

Raw document size is about 138G for 1.5M documents, which works out to
roughly 90k per document.

IndexWriter settings are MergeFactor 50, MaxMergeDocs 2000,
RAMBufferSizeMB 32, MaxFieldLength Integer.MAX_VALUE.
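
In code that is roughly (dir and analyzer elided):

    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    writer.setMergeFactor(50);
    writer.setMaxMergeDocs(2000);
    writer.setRAMBufferSizeMB(32);
    writer.setMaxFieldLength(Integer.MAX_VALUE);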

Each document has about 10 short bibliographic fields and 3 longer
content fields and 1 field that contains the entire contents of the
document.  The longer content fields are stored twice - in a stemmed
and unstemmed form.  So actually there are about 8 longer content
fields.  (The effect of storing stemmed and unstemmed versions is to
approximately double the index size over storing the content only
once).  About half the short bibliographic fields are stored
(compressed) in the index.  The longer content fields are not stored,
and no term vectors are stored.
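
Simplified, the constructions look like this (field names changed; we choose the
stemming analyzer per field with a PerFieldAnalyzerWrapper):

    // short biblio field; about half of these are stored compressed
    doc.add(new Field("title", title, Field.Store.COMPRESS, Field.Index.TOKENIZED));
    // long content fields: indexed, not stored, no term vectors;
    // the same text is added under a stemmed and an unstemmed field
    doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("contentStemmed", text, Field.Store.NO, Field.Index.TOKENIZED));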

The hardware is quite new and fast: 8 cores, 15,000 RPM disks.

Thanks again
Barry

On Nov 12, 2007 10:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Hmmm, something doesn't sound quite right. You have 10 million docs,
split into 5 or so indexes, right?  And each sub index is 150
gigabytes?  How big are your documents?

Can you provide more info about what your Directory and IndexWriter
settings are?  What version of Lucene are you using?  What are your
Field settings?  Are you storing info?  What about Term Vectors?

Can you explain more about your documents, etc?  10 million doesn't
sound like it would need to be split up that much, if at all,
depending on your hardware.

The wiki has some excellent resources on improving both indexing and
search speed.

-Grant



On Nov 11, 2007, at 6:16 PM, Barry Forrest wrote:


Hi,

Optimizing my index of 1.5 million documents takes days and days.

I have a collection of 10 million documents that I am trying to index
with Lucene. I've divided the collection into chunks of about 1.5 - 2
million documents each. Indexing 1.5 million documents is fast enough
(about 12 hours), but this results in an index directory containing
about 35,000 files. Optimizing this index takes several days, which is
a bit too long for my purposes. Each sub-index is about 150G.
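
The indexing driver is essentially (simplified):

    IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
    for (Document doc : chunkOfDocuments) { // about 1.5 - 2 million per chunk
      writer.addDocument(doc);
    }
    writer.optimize(); // this is the step that takes days
    writer.close();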

What can I do to make this process faster?

Thanks for your help,
Barry
