Andrzej Bialecki wrote:
> 
> On 2010-06-02 12:42, Grant Ingersoll wrote:
>> 
>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>> 
>>>
>>> We have around 5 million items in our index and each item has a
>>> description
>>> located on a separate physical database. These item descriptions vary in
>>> size and for the most part are quite large. Currently we are only
>>> indexing
>>> items and not their corresponding description and a full import takes
>>> around
>>> 4 hours. Ideally we want to index both our items and their descriptions
>>> but
>>> after some quick profiling I determined that a full import would take in
>>> excess of 24 hours. 
>>>
>>> - How would I profile the indexing process to determine whether the
>>> bottleneck is
>>> Solr or our database?
>> 
>> As a data point, I routinely see clients index 5M items on normal
>> hardware in approx. 1 hour (give or take 30 minutes).  
>> 
>> When you say "quite large", what do you mean?  Are we talking books here
>> or maybe a couple pages of text or just a couple KB of data?
>> 
>> How long does it take you to get that data out (and, from the sounds of
>> it, merge it with your item) w/o going to Solr?
>> 
>>> - In either case, how would one speed up this process? Is there a way to
>>> run
>>> parallel import processes and then merge them together at the end?
>>> Possibly
>>> use some sort of distributed computing?
>> 
>> DataImportHandler now supports multiple threads.  The absolute fastest
>> way that I know of to index is via multiple threads sending batches of
>> documents at a time (at least 100).  With databases, you can often split
>> the table into ranges via SQL statements that each thread fetches
>> separately.  You may want to write your own multithreaded client to index.
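[Editorial note: the multithreaded-client approach above can be sketched roughly as below. The table/column names, the `fetch_rows()` DB helper, and the Solr JSON update URL are all illustrative assumptions, not the poster's actual setup; in 2010 you would more likely use SolrJ, so treat this as a shape sketch only.]

```python
# Sketch: split the item table into contiguous id ranges, run one worker
# thread per range, and send documents to Solr in batches of at least 100.
# Table/column names, the Solr URL, and fetch_rows() are assumptions.
import json
import threading
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json"  # assumed endpoint
BATCH_SIZE = 100  # per the advice above: at least 100 docs per request


def make_ranges(max_id, workers):
    """Split ids 1..max_id into `workers` contiguous (lo, hi) ranges."""
    step = -(-max_id // workers)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(1, max_id + 1, step)]


def post_batch(docs):
    """POST one batch of documents to Solr's JSON update handler."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()


def index_range(lo, hi):
    """Worker: fetch rows for one id range and index them in batches."""
    batch = []
    # fetch_rows() stands in for something like:
    #   SELECT id, name, description FROM item WHERE id BETWEEN lo AND hi
    for row in fetch_rows(lo, hi):  # hypothetical DB helper
        batch.append({"id": row["id"], "name": row["name"],
                      "description": row["description"]})
        if len(batch) >= BATCH_SIZE:
            post_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        post_batch(batch)


def run(max_id=5_000_000, workers=4):
    """Start one indexing thread per id range and wait for all to finish."""
    threads = [threading.Thread(target=index_range, args=rng)
               for rng in make_ranges(max_id, workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A commit (and optionally an optimize) would be issued once at the end, rather than per batch, to keep throughput up.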
> 
> SOLR-1301 is also an option if you are familiar with Hadoop ...
> 
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 

I haven't worked with Hadoop before, but I'm willing to try anything to cut
down this full import time. I see SOLR-1301 currently uses the embedded Solr
server for indexing... would I have to scrap my DIH importing then? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html
Sent from the Solr - User mailing list archive at Nabble.com.
