On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

> On 2010-06-02 12:42, Grant Ingersoll wrote:
>> 
>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>> 
>>> 
>>> We have around 5 million items in our index, and each item has a description
>>> located in a separate physical database. These item descriptions vary in
>>> size and are, for the most part, quite large. Currently we are only indexing
>>> the items and not their corresponding descriptions, and a full import takes
>>> around 4 hours. Ideally we want to index both our items and their
>>> descriptions, but after some quick profiling I determined that a full import
>>> would take in excess of 24 hours.
>>> 
>>> - How would I profile the indexing process to determine whether the
>>> bottleneck is Solr or our database?
>> 
>> As a data point, I routinely see clients index 5M items on normal
>> hardware in approx. 1 hour (give or take 30 minutes).  
>> 
>> When you say "quite large", what do you mean?  Are we talking books here, or 
>> a couple of pages of text, or just a couple of KB of data?
>> 
>> How long does it take you to get that data out (and, from the sounds of it, 
>> merge it with your items) w/o going to Solr?
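
A quick way to settle the Solr-vs-DB question is to run the extraction query
on its own and time it, with Solr out of the picture entirely.  Below is a
rough sketch in plain JDBC; the driver class, connection URL, credentials, and
the item/description join are all made-up placeholders for whatever the real
schema looks like:

// Times the DB extraction alone -- no Solr involved.
// Driver, URL, credentials, and table/column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeExtraction {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver"); // older drivers need explicit loading
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://db-host/items", "user", "pass");
    Statement stmt = conn.createStatement();
    stmt.setFetchSize(500); // a hint only; some drivers need their own settings to stream

    long start = System.currentTimeMillis();
    long rows = 0, chars = 0;
    ResultSet rs = stmt.executeQuery(
        "SELECT i.id, i.name, d.description " +
        "FROM item i JOIN item_description d ON d.item_id = i.id");
    while (rs.next()) {
      rows++;
      String desc = rs.getString("description");
      if (desc != null) chars += desc.length();
    }
    long secs = (System.currentTimeMillis() - start) / 1000;
    System.out.println(rows + " rows, " + chars
        + " chars of description in " + secs + "s");
    rs.close(); stmt.close(); conn.close();
  }
}

If pulling the 5M rows this way already takes many hours, the bottleneck is
the DB (or the network between it and the indexer), and Solr-side tuning
won't help much.
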
>> 
>>> - In either case, how would one speed up this process? Is there a way to run
>>> parallel import processes and then merge them together at the end? Possibly
>>> use some sort of distributed computing?
>> 
>> DataImportHandler now supports multiple threads.  The absolute fastest way 
>> that I know of to index is via multiple threads sending batches of documents 
>> at a time (at least 100 per batch).  Often, with DBs, one can split up the 
>> table via SQL statements that can then be fetched separately.  You may want 
>> to write your own multithreaded client to index.
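
For anyone who goes the write-your-own-client route, here is a rough sketch
using SolrJ's CommonsHttpSolrServer; the URL, thread count, batch size, and
the fetchSlice() helper are all placeholders, and the MOD(id, N) trick in the
comments is just one way of carving the table into disjoint slices:

// Rough sketch of a multithreaded SolrJ indexing client.
// Each worker pulls its own slice of the table and posts docs in batches.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  static final int THREADS = 4;   // placeholder; tune to your hardware
  static final int BATCH = 250;   // "at least 100", per the advice above

  public static void main(String[] args) throws Exception {
    final SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    for (int n = 0; n < THREADS; n++) {
      final int slice = n;
      pool.execute(new Runnable() {
        public void run() {
          try {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH);
            // fetchSlice() stands in for your own JDBC query, e.g.
            // SELECT ... WHERE MOD(id, THREADS) = slice
            for (Object[] row : fetchSlice(slice)) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", row[0]);
              doc.addField("description", row[1]);
              batch.add(doc);
              if (batch.size() >= BATCH) {
                solr.add(batch);
                batch.clear();
              }
            }
            if (!batch.isEmpty()) solr.add(batch);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    solr.commit(); // one commit at the end, not per batch
  }

  // Hypothetical helper: replace with real JDBC paging for this thread's slice.
  static Iterable<Object[]> fetchSlice(int slice) {
    return new ArrayList<Object[]>();
  }
}

Committing once at the end (rather than per batch) keeps the Solr side cheap;
of course, if the DB can't feed the threads fast enough, adding more of them
won't buy anything.
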
> 
> SOLR-1301 is also an option if you are familiar with Hadoop ...
> 

If the bottleneck is the DB, will that do much?
