On Mon, May 25, 2009 at 10:56 AM, nk 11 <nick.cass...@gmail.com> wrote:
> Hello
> Interesting thread. One request, please: because I don't have much
> experience with Solr, could you please use full terms and not DIH, RES, etc.?
nk11, DIH = DataImportHandler. RES = ? (going by the mail quoted below, that
one is the RES column of the Linux "top" output, i.e. resident memory). It is
unavoidable that we end up using short names, out of laziness/lack of time.
But if you ever come across one, do not hesitate to ask; we will be more than
glad to clarify.

> Thanks :)
>
> On Mon, May 25, 2009 at 4:44 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>
>> Hi Paul,
>>
>> Hope you have a great weekend so far. I still have a couple of questions
>> you might help me out with:
>>
>> 1. In your earlier email you said "if possible, you can setup multiple
>> DIH, say /dataimport1, /dataimport2 etc., and split your files and can
>> achieve parallelism". I am not sure I understood it right. I put two
>> requestHandlers in solrconfig.xml, like this:
>>
>> <requestHandler name="/dataimport"
>>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="config">./data-config.xml</str>
>>   </lst>
>> </requestHandler>
>>
>> <requestHandler name="/dataimport2"
>>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="config">./data-config2.xml</str>
>>   </lst>
>> </requestHandler>
>>
>> and created data-config.xml and data-config2.xml. Then I ran the command
>>
>> http://host:8080/solr/dataimport?command=full-import
>>
>> but only one data set (the first one) was indexed. Did I get something
>> wrong?
>>
>> 2. I noticed that after Solr indexed about 8M documents (around two
>> hours), it gets very, very slow. Using the "top" command in Linux, I
>> noticed that RES is at 1g of memory. I did several experiments; every time
>> RES reaches 1g, the indexing process becomes extremely slow. Is this
>> memory limit set by the JVM? And how can I set the JVM memory when I use
>> DIH through the web command full-import?
>>
>> Thanks!
>>
>> JB
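(Two quick notes on the two questions above, in case they help. On the first:
each DataImportHandler instance is only started by a request to its own
handler, so a full-import on /dataimport leaves /dataimport2 untouched. With
the configuration above, both imports would be kicked off, in parallel, by
hitting both URLs, e.g. from two shells:

  http://host:8080/solr/dataimport?command=full-import
  http://host:8080/solr/dataimport2?command=full-import

On the second: DIH runs inside the servlet container's JVM, so the heap
cannot be set through the import URL. It is fixed when the container is
started, for example "java -Xmx2048m -jar start.jar" for the example Jetty
that ships with Solr, or via JAVA_OPTS for Tomcat; the 2048m here is only an
illustration, size it to your machine.)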
>>
>> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>
>>> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>> Hi Paul, but in your previous post you said "there is already an issue
>>>> for writing to Solr in multiple threads, SOLR-1089". Do you think using
>>>> solrj alone would be better than DIH?
>>>
>>> nope. you will have to do the indexing in multiple threads
>>>
>>> if possible, you can set up multiple DIH, say /dataimport1, /dataimport2
>>> etc., split your files, and achieve parallelism
>>>
>>>> Thanks and have a good weekend!
>>>>
>>>> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>> no need to use the embedded SolrServer. you can use SolrJ with
>>>>> streaming, in multiple threads
>>>>>
>>>>> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>> If I do the XML parsing by myself and use the embedded client to do
>>>>>> the push, would it be more efficient than DIH?
>>>>>>
>>>>>> --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>>> Can you parallelize this? I don't know that the DIH can handle it,
>>>>>>> but having multiple threads sending docs to Solr is the best,
>>>>>>> performance-wise, so maybe you need to look at alternatives to
>>>>>>> pulling with DIH and instead use a client to push into Solr.
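Since "push with SolrJ in multiple threads" comes up more than once above,
here is a minimal sketch of that approach. It assumes the
StreamingUpdateSolrServer from Solr 1.4, which buffers documents in a queue
and sends them with several background threads; the URL, field names, and
document count are placeholders, not taken from this thread:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class PushIndexer {
    public static void main(String[] args) throws Exception {
      // queue up to 100 docs, drained by 4 background threads
      SolrServer server =
          new StreamingUpdateSolrServer("http://host:8080/solr", 100, 4);
      for (int i = 0; i < 1000; i++) {   // stand-in for your own XML parsing loop
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);  // placeholder field names
        doc.addField("text", "body of document " + i);
        server.add(doc);                 // returns quickly; sent in the background
      }
      server.commit();                   // one commit at the end
    }
  }

On Solr 1.3 the same loop works against a CommonsHttpSolrServer, only without
the built-in threading, in which case you would run several such loops in
parallel yourself.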
>> > >> >> >>> >> > >> >> >>> JB >> > >> >> >>> >> > >> >> >>> --- On Thu, 5/21/09, Noble >> > Paul >> > >> >> നോബിള് >> > >> >> >> नोब्ळ् <noble.p...@corp..aol.com> >> > >> >> >> wrote: >> > >> >> >>> >> > >> >> >>>> From: Noble Paul >> > >> നോബിള് >> > >> >> >> नोब्ळ् <noble.p...@corp.aol.com> >> > >> >> >>>> Subject: Re: How to >> > index large >> > >> set data >> > >> >> >>>> To: solr-u...@lucene.apache..org >> > >> >> >>>> Date: Thursday, May 21, >> > 2009, >> > >> 9:57 PM >> > >> >> >>>> check the status page of >> > DIH and >> > >> see >> > >> >> >>>> if it is working >> > properly. and >> > >> >> >>>> if, yes what is the rate >> > of >> > >> indexing >> > >> >> >>>> >> > >> >> >>>> On Thu, May 21, 2009 at >> > 11:48 AM, >> > >> Jianbin >> > >> >> Dai >> > >> >> >> <djian...@yahoo.com> >> > >> >> >>>> wrote: >> > >> >> >>>>> >> > >> >> >>>>> Hi, >> > >> >> >>>>> >> > >> >> >>>>> I have about 45GB >> > xml files >> > >> to be >> > >> >> indexed. I >> > >> >> >> am using >> > >> >> >>>> DataImportHandler. I >> > started the >> > >> full >> > >> >> import 4 >> > >> >> >> hours ago, >> > >> >> >>>> and it's still >> > running..... >> > >> >> >>>>> My computer has 4GB >> > memory. >> > >> Any >> > >> >> suggestion on >> > >> >> >> the >> > >> >> >>>> solutions? >> > >> >> >>>>> Thanks! >> > >> >> >>>>> >> > >> >> >>>>> JB >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> -- >> > >> >> >>>> >> > >> >> >> >> > >> >> >> > >> >> > ----------------------------------------------------- >> > >> >> >>>> Noble Paul | Principal >> > Engineer| >> > >> AOL | http://aol.com >> > >> >> >>>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> -- >> > >> >> >> >> > >> >> >> > >> >> > ----------------------------------------------------- >> > >> >> >> Noble Paul | Principal Engineer| >> > AOL | http://aol.com >> > >> >> >> >> > >> >> > >> > >> >> > >> > >> >> > >> > >> >> >> > >> >> -------------------------- >> > >> >> Grant Ingersoll >> > >> >> http://www.lucidimagination.com/ >> > >> >> >> > >> >> Search the Lucene ecosystem >> > >> >> (Lucene/Solr/Nutch/Mahout/Tika/Droids) >> > >> >> using Solr/Lucene: >> > >> >> http://www.lucidimagination...com/search >> > >> >> >> > >> >> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> > >> >> > >> >> > >> -- >> > >> >> > ----------------------------------------------------- >> > >> Noble Paul | Principal Engineer| AOL | http://aol.com >> > >> >> > > >> > > >> > > >> > > >> > > >> > >> > >> > >> > -- >> > ----------------------------------------------------- >> > Noble Paul | Principal Engineer| AOL | http://aol.com >> > >> >> >> >> >> > -- ----------------------------------------------------- Noble Paul | Principal Engineer| AOL | http://aol.com