> - As per
> http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf

Sorry, the presentation covers a lot of ground: see slide #20, "Standard
thread pools can have high contention for task queue and other data
structures when used with fine-grained tasks". [I haven't yet implemented
work stealing.]
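The per-index pool idea discussed in this thread (give each index its own ThreadPoolExecutor, so producers never contend on one shared task queue) can be sketched with plain java.util.concurrent. This is a minimal, self-contained illustration, not code from the thread: each "index" here is just a counter standing in for a Lucene IndexWriter, and `numIndexes`, `threadsPerPool`, and `totalDocs` are illustrative assumptions. In the real setup, each partition would wrap its own IndexWriter, and the partial indexes would be merged at the end (e.g. with IndexWriter.addIndexes).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerIndexPools {
    public static void main(String[] args) throws InterruptedException {
        final int numIndexes = 4;       // one partition (and one pool) per "index"
        final int threadsPerPool = 2;
        final int totalDocs = 10_000;

        ExecutorService[] pools = new ExecutorService[numIndexes];
        AtomicInteger[] indexed = new AtomicInteger[numIndexes];
        for (int i = 0; i < numIndexes; i++) {
            // Each pool has its own internal task queue, so submissions to
            // different partitions never contend on a single shared queue.
            pools[i] = Executors.newFixedThreadPool(threadsPerPool);
            indexed[i] = new AtomicInteger();
        }

        for (int doc = 0; doc < totalDocs; doc++) {
            int part = doc % numIndexes;   // round-robin partitioning of documents
            final AtomicInteger counter = indexed[part];
            // Stand-in for IndexWriter.addDocument() on that partition's writer.
            pools[part].submit(() -> { counter.incrementAndGet(); });
        }

        int total = 0;
        for (int i = 0; i < numIndexes; i++) {
            pools[i].shutdown();
            pools[i].awaitTermination(1, TimeUnit.MINUTES);
            // Stand-in for the final IndexWriter.addIndexes() merge step.
            total += indexed[i].get();
        }
        System.out.println("indexed " + total + " docs across " + numIndexes + " partitions");
    }
}
```

Round-robin keeps the partitions roughly equal in size; any partitioning that balances document counts would do, since the partial indexes are merged afterward anyway.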
-glen

2009/4/9 Glen Newton <glen.new...@gmail.com>:
> For Solr / Lucene:
> - Use -XX:+AggressiveOpts.
> - If available, huge pages can help. See
>   http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
>   I haven't yet followed up with my Lucene performance numbers using
>   huge pages: the gain is 10-15% for large indexing jobs.
>
> For Lucene:
> - Multi-thread using java.util.concurrent.ThreadPoolExecutor
>   (http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html:
>   6.4 million full-text articles + metadata indexed, resulting in an 83GB
>   index; those are old numbers: it is down to ~10 hours now).
> - While multithreading is particularly good on multicore, it also improves
>   performance on a single core for small numbers of threads (<6, YMMV)
>   with good I/O (test for your particular configuration).
> - Use multiple indexes and merge them at the end.
> - As per http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf,
>   use a separate ThreadPoolExecutor per index in the previous step,
>   reducing queue contention. This is giving me an additional ~10%. I will
>   blog about this in the near future...
>
> -glen
>
> 2009/4/9 sunnyfr <johanna...@gmail.com>:
>>
>> Hi Otis,
>> How did you manage that? I have an 8-core machine with 8GB of RAM and an
>> 11GB index for 14M docs, with 50,000 updates every 30 minutes, but my
>> replication kills everything. My segments are merged too often, so the
>> full index is replicated and the caches are lost, and so on. I have no
>> idea what I can do now. Some help would be brilliant.
>> BTW, I'm using Solr 1.4.
>>
>> Thanks,
>>
>>
>> Otis Gospodnetic wrote:
>>>
>>> Mike is right about the occasional slowdown, which appears as a pause
>>> and is due to large Lucene index segment merging. This should go away
>>> with newer versions of Lucene, where merging happens in the background.
>>>
>>> That said, we just indexed about 20MM documents on a single 8-core
>>> machine with 8 GB of RAM, resulting in a nearly 20 GB index.
>>> The whole process took a little less than 10 hours - that's over 550
>>> docs/second. The vanilla approach, before some of our changes,
>>> apparently required several days to index the same amount of data.
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>> ----- Original Message ----
>>> From: Mike Klaas <mike.kl...@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Monday, November 19, 2007 5:50:19 PM
>>> Subject: Re: Any tips for indexing large amounts of data?
>>>
>>> There should be some slowdown in larger indices, as occasionally large
>>> segment merge operations must occur. However, this shouldn't really
>>> affect overall speed too much.
>>>
>>> You haven't really given us enough data to tell you anything useful.
>>> I would recommend trying to do the indexing via a webapp to eliminate
>>> all your code as a possible factor. Then, look for signs of what is
>>> happening when indexing slows. For instance, is Solr high in CPU? Is
>>> the computer thrashing?
>>>
>>> -Mike
>>>
>>> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for answering this question a while back. I have made some of
>>>> the suggestions you mentioned, i.e. not committing until I've finished
>>>> indexing. What I am seeing, though, is that as the index gets larger
>>>> (around 1GB), indexing takes a lot longer. In fact, it slows to a
>>>> crawl. Do you have any pointers as to what I might be doing wrong?
>>>>
>>>> Also, I was looking at using multi-core Solr. Could this help in
>>>> some way?
>>>>
>>>> Thank you,
>>>> Brendan
>>>>
>>>> On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
>>>>
>>>>>
>>>>> : I would think you would see better performance by allowing auto
>>>>> : commit to handle the commit size instead of reopening the
>>>>> : connection all the time.
>>>>>
>>>>> If your goal is "fast" indexing, don't use autoCommit at all ...
>>>>> just index everything, and don't commit until you are completely done.
>>>>>
>>>>> autoCommitting will slow your indexing down (the benefit being that
>>>>> more results will be visible to searchers as you proceed).
>>>>>
>>>>>
>>>>> -Hoss
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Any-tips-for-indexing-large-amounts-of-data--tp13510670p22973205.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
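Hoss's advice above - index everything and commit once at the end - corresponds to leaving autoCommit disabled in solrconfig.xml and issuing a single explicit commit when the bulk load finishes. A minimal sketch, assuming the stock Solr 1.x update handler (the thresholds shown in the commented-out block are placeholders, not values from this thread):

```xml
<!-- solrconfig.xml: keep autoCommit disabled during bulk indexing -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!--
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  -->
</updateHandler>
```

Then, after the last document has been sent, post one explicit `<commit/>` to the update handler so all the buffered work becomes visible to searchers at once.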