There is already an issue open for writing to Solr in multiple threads: SOLR-1089.
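The client-push approach Grant describes below (several threads sending batches of docs instead of DIH pulling single-threaded) can be sketched with plain JDK classes. This is an illustrative sketch, not Solr API: the thread count, batch size, and the placeholder sender are made up here; a real client would POST each batch to `/solr/update` or use the SolrJ client from SOLR-1089.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Sketch: fan documents out over several sender threads instead of letting
// DIH pull them single-threaded. The Consumer stands in for real send code
// (e.g. an HTTP POST of the batch to /solr/update).
public class ParallelIndexer {

    /** Splits docs into batches, sends them concurrently, returns docs sent. */
    public static int indexAll(List<String> docs, int threads, int batchSize,
                               Consumer<List<String>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            final List<String> batch =
                new ArrayList<>(docs.subList(i, Math.min(i + batchSize, docs.size())));
            // Each batch becomes one task; the pool sends them in parallel.
            futures.add(pool.submit(() -> { sender.accept(batch); return batch.size(); }));
        }
        int total = 0;
        try {
            for (Future<Integer> f : futures) total += f.get(); // surfaces send errors
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) docs.add("<doc id='" + i + "'/>");
        // Placeholder sender: a real one would HTTP POST the batch to Solr.
        int sent = indexAll(docs, 4, 100, batch -> { /* post batch here */ });
        System.out.println("sent " + sent + " docs"); // prints "sent 1000 docs"
    }
}
```

The point of the pattern is that the bottleneck moves from one pulling thread to however many sender threads the server can absorb; error handling stays simple because each batch is an independent task.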
On Fri, May 22, 2009 at 6:08 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> Can you parallelize this? I don't know that the DIH can handle it, but
> having multiple threads sending docs to Solr is the best performance-wise,
> so maybe you need to look at alternatives to pulling with DIH and instead
> use a client to push into Solr.
>
> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
>
>> About 2.8M total docs were created. Only the first run finished. In my
>> 2nd try, it hangs there forever at the end of indexing (I guess right
>> before commit), with CPU usage at 100%. A total of 5GB (2050) index files
>> were created. Now I have two problems:
>> 1. Why does it hang there and fail?
>> 2. How can I speed up the indexing?
>>
>> Here is my solrconfig.xml:
>>
>>   <useCompoundFile>false</useCompoundFile>
>>   <ramBufferSizeMB>3000</ramBufferSizeMB>
>>   <mergeFactor>1000</mergeFactor>
>>   <maxMergeDocs>2147483647</maxMergeDocs>
>>   <maxFieldLength>10000</maxFieldLength>
>>   <unlockOnStartup>false</unlockOnStartup>
>>
>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>
>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>>> Subject: Re: How to index large set data
>>> To: solr-user@lucene.apache.org
>>> Date: Thursday, May 21, 2009, 10:39 PM
>>>
>>> What is the total no. of docs created? I guess it may not be memory
>>> bound; indexing is mostly an IO-bound operation. You may be able to
>>> get better perf if an SSD (solid state disk) is used.
>>>
>>> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>
>>>> Hi Paul,
>>>>
>>>> Thank you so much for answering my questions. It really helped.
>>>>
>>>> After some adjustment, basically setting mergeFactor to 1000 from the
>>>> default value of 10, I could finish the whole job in 2.5 hours. I checked
>>>> that during the run, only around 18% of memory was being used, and VIRT
>>>> was always 1418m. I am thinking it may be restricted by the JVM memory
>>>> setting. But I run the data import command through the web, i.e.,
>>>> http://<host>:<port>/solr/dataimport?command=full-import,
>>>> so how can I set the memory allocation for the JVM?
>>>>
>>>> Thanks again!
>>>>
>>>> JB
>>>>
>>>> --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>
>>>>> From: Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
>>>>> Subject: Re: How to index large set data
>>>>> To: solr-user@lucene.apache.org
>>>>> Date: Thursday, May 21, 2009, 9:57 PM
>>>>>
>>>>> Check the status page of DIH and see if it is working properly, and
>>>>> if yes, what is the rate of indexing?
>>>>>
>>>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have about 45GB of xml files to be indexed. I am using
>>>>>> DataImportHandler. I started the full import 4 hours ago,
>>>>>> and it's still running....
>>>>>>
>>>>>> My computer has 4GB memory. Any suggestion on the solutions?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> JB
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> http://www.lucidimagination.com/search

--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
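On JB's JVM question in the thread above: DIH runs inside the Solr webapp's JVM, so the heap cannot be set via the dataimport URL; it is set where the servlet container is launched. As a rough example, assuming the Jetty `start.jar` that ships with the Solr example distribution (the 2g figure is illustrative, not a recommendation):

```shell
# Heap is controlled by the JVM that hosts Solr, e.g. the bundled Jetty:
java -Xms512m -Xmx2g -jar start.jar

# Under Tomcat, the equivalent is usually an environment variable:
# export JAVA_OPTS="-Xms512m -Xmx2g"
```

Note that a ramBufferSizeMB of 3000 (as in the solrconfig.xml above) only helps if the heap is actually large enough to hold it.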