Re: How to index large set data

Noble Paul നോബിള്‍ नोब्ळ् Fri, 22 May 2009 21:07:11 -0700

There is no need to use the embedded Solr server. You can use SolrJ with
streaming in multiple threads.
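
To illustrate the multi-threaded push pattern suggested above, here is a
minimal sketch. It is not SolrJ: it swaps in plain HTTP posts of Solr's XML
<add> messages from a small pool of Python threads. The update URL, the
function names (build_add_xml, post_to_solr, push_in_parallel), the thread
count and the batch size are all assumptions made for illustration, not
values from this thread.

```python
import queue
import threading
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(docs):
    """Render a list of dicts as one Solr XML <add> update message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def post_to_solr(xml_body, update_url="http://localhost:8983/solr/update"):
    """POST one update message. The URL is an assumed local default."""
    request = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


def push_in_parallel(docs, send=post_to_solr, threads=4, batch_size=500):
    """Batch docs and push the batches to Solr from several worker threads."""
    work = queue.Queue(maxsize=threads * 2)  # bounded, so parsing cannot outrun sending
    errors = []

    def worker():
        while True:
            batch = work.get()
            if batch is None:  # sentinel: no more work for this thread
                return
            try:
                send(build_add_xml(batch))
            except Exception as exc:  # record the error and keep draining the queue
                errors.append(exc)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()

    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            work.put(batch)
            batch = []
    if batch:
        work.put(batch)

    for _ in pool:
        work.put(None)
    for thread in pool:
        thread.join()
    if errors:
        raise errors[0]
```

The docs argument can be a generator fed by your own XML parser, so the
45GB input never has to sit in memory at once. Nothing becomes searchable
until a commit is issued, for example by posting "<commit/>" to the same
update URL once all threads have finished.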


On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
>
> If I do the XML parsing myself and use the embedded client to do the push,
> would it be more efficient than DIH?
>
>
> --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> From: Grant Ingersoll <gsing...@apache.org>
>> Subject: Re: How to index large set data
>> To: solr-user@lucene.apache.org
>> Date: Friday, May 22, 2009, 5:38 AM
>> Can you parallelize this? I don't know that the DIH can handle it,
>> but having multiple threads sending docs to Solr is the best
>> performance-wise, so maybe you need to look at alternatives to
>> pulling with DIH and instead use a client to push into Solr.
>>
>>
>> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
>>
>> >
>> > About 2.8 million docs were created in total. Only the first run
>> > finished. On my second try, it hung forever at the end of indexing
>> > (I guess right before the commit), with CPU usage at 100%. A total of
>> > 5G (2050) index files were created. Now I have two problems:
>> > 1. Why did it hang there and fail?
>> > 2. How can I speed up the indexing?
>> >
>> >
>> > Here is my solrconfig.xml
>> >
>> > <useCompoundFile>false</useCompoundFile>
>> > <ramBufferSizeMB>3000</ramBufferSizeMB>
>> > <mergeFactor>1000</mergeFactor>
>> > <maxMergeDocs>2147483647</maxMergeDocs>
>> > <maxFieldLength>10000</maxFieldLength>
>> > <unlockOnStartup>false</unlockOnStartup>
>> >
>> >
>> >
>> >
>> > --- On Thu, 5/21/09, Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com> wrote:
>> >
>> >> From: Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>
>> >> Subject: Re: How to index large set data
>> >> To: solr-user@lucene.apache.org
>> >> Date: Thursday, May 21, 2009, 10:39 PM
>> >> What is the total number of docs created? I guess it may not be
>> >> memory bound. Indexing is mostly an IO-bound operation. You may be
>> >> able to get better performance if an SSD (solid-state disk) is used.
>> >>
>> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>> >>>
>> >>> Hi Paul,
>> >>>
>> >>> Thank you so much for answering my questions. It really helped.
>> >>> After some adjustment, basically setting mergeFactor to 1000 from
>> >>> the default value of 10, I can finish the whole job in 2.5 hours.
>> >>> I checked that during the run only around 18% of memory is being
>> >>> used, and VIRT is always 1418m. I think it may be restricted by the
>> >>> JVM memory setting. But I run the data import command through the
>> >>> web, i.e.,
>> >>> http://<host>:<port>/solr/dataimport?command=full-import,
>> >>> so how can I set the memory allocation for the JVM?
>> >>> Thanks again!
>> >>>
>> >>> JB
>> >>>
>> >>> --- On Thu, 5/21/09, Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com> wrote:
>> >>>
>> >>>> From: Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>
>> >>>> Subject: Re: How to index large set data
>> >>>> To: solr-user@lucene.apache.org
>> >>>> Date: Thursday, May 21, 2009, 9:57 PM
>> >>>> Check the status page of DIH and see if it is working properly.
>> >>>> If yes, what is the rate of indexing?
>> >>>>
>> >>>> On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> I have about 45GB of XML files to be indexed. I am using
>> >>>>> DataImportHandler. I started the full import 4 hours ago, and
>> >>>>> it's still running....
>> >>>>> My computer has 4GB of memory. Any suggestions on solutions?
>> >>>>> Thanks!
>> >>>>>
>> >>>>> JB
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> -----------------------------------------------------
>> >>>> Noble Paul | Principal Engineer| AOL | http://aol.com
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> -----------------------------------------------------
>> >> Noble Paul | Principal Engineer| AOL | http://aol.com
>> >>
>> >
>> >
>> >
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene: http://www.lucidimagination.com/search
>>
>>
>
>
>
>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
