Mike is right about the occasional slowdown: it shows up as a pause and is 
caused by Lucene merging large index segments.  This should go away in newer 
versions of Lucene, where merging happens in the background.

That said, we just indexed about 20 million documents on a single 8-core 
machine with 8 GB of RAM, resulting in a nearly 20 GB index.  The whole process 
took a little less than 10 hours - that's over 550 docs/second.  Before some of 
our changes, the vanilla approach apparently required several days to index 
the same amount of data.
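
As a quick check of that rate, here is a minimal sketch of the arithmetic. The 
function name is made up for illustration, and the inputs are the rounded 
figures quoted above (about 20 million documents, just under 10 hours), so the 
result is a lower bound rather than a measured value.

```python
def docs_per_second(doc_count, hours):
    """Average indexing throughput over the whole run."""
    return doc_count / (hours * 3600)

# Rounded figures from the message above. The run took a little less than
# 10 hours, so the real average rate was somewhat higher than this.
rate = docs_per_second(20_000_000, 10)
print("at least %.0f docs/second" % rate)
```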

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large  
segment merge operations must occur.  However, this shouldn't really  
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.   
I would recommend trying to do the indexing via a webapp to eliminate  
all your code as a possible factor.  Then, look for signs of what is  
happening when indexing slows.  For instance, is Solr using a lot of CPU,  
is the computer thrashing, etc.?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

> Hi,
>
> Thanks for answering this question a while back. I have applied some  
> of the suggestions you mentioned, i.e. not committing until I've  
> finished indexing. What I am seeing, though, is that as the index gets  
> larger (around 1 GB), indexing takes a lot longer. In fact, it  
> slows down to a crawl. Have you got any pointers as to what I might  
> be doing wrong?
>
> Also, I was looking at using MultiCore Solr. Could this help in  
> some way?
>
> Thank you
> Brendan
>
> On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
>
>>
>> : I would think you would see better performance by allowing auto commit
>> : to handle the commit size instead of reopening the connection all the
>> : time.
>>
>> if your goal is "fast" indexing, don't use autoCommit at all ... just
>> index everything, and don't commit until you are completely done.
>>
>> autoCommitting will slow your indexing down (the benefit being that more
>> results will be visible to searchers as you proceed)
>>
>>
>>
>>
>> -Hoss
>>
>
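
The advice quoted above (send everything first, commit once at the very end) 
could be sketched roughly as follows. This is only an illustration under 
stated assumptions, not anyone's actual indexing code: the update URL, the 
batch size, and all helper names are hypothetical, and it posts plain Solr XML 
update messages (<add> and <commit/>) over HTTP.

```python
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape, quoteattr

# Assumed endpoint; adjust host, port, and path for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def add_message(docs):
    """Build one XML <add> message for a batch of docs (dicts of field -> value)."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batches(docs, size):
    """Yield lists of at most `size` docs from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post(xml, url=SOLR_UPDATE_URL):
    """POST one XML update message and return the HTTP status code."""
    req = Request(url, data=xml.encode("utf-8"),
                  headers={"Content-Type": "text/xml; charset=utf-8"})
    with urlopen(req) as resp:
        return resp.status


def index_all(docs, batch_size=1000, send=post):
    """Send every batch with no intermediate commits, then commit exactly once."""
    sent = 0
    for batch in batches(docs, batch_size):
        send(add_message(batch))
        sent += len(batch)
    send("<commit/>")  # the only commit, issued after all documents are sent
    return sent
```

Calling index_all(my_docs) against a running instance would post the batches 
and then a single commit. The send parameter exists only so the function can 
be tried without a server, for example by passing a list's append method. 
This sketch assumes autoCommit is left disabled on the server side, as 
suggested above.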



