On Mon, May 25, 2009 at 10:56 AM, nk 11 <nick.cass...@gmail.com> wrote:
> Hello
> Interesting thread. One request, please: because I don't have much
> experience with Solr, could you please use full terms and not DIH, RES, etc.?
nk11, DIH = DataImportHandler. RES = ? (going by the mail quoted below, that
one is the RES column of the Linux "top" output, i.e. resident memory). It is
unavoidable that we end up using short names, out of laziness/lack of time.
But if you ever come across one, do not hesitate to ask; we will be more than
glad to clarify.

> Thanks :)
>
> On Mon, May 25, 2009 at 4:44 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>
>> Hi Paul,
>>
>> Hope you have a great weekend so far. I still have a couple of questions
>> you might help me out with:
>>
>> 1. In your earlier email you said "if possible, you can setup multiple
>> DIH, say /dataimport1, /dataimport2 etc., and split your files and can
>> achieve parallelism". I am not sure I understood it right. I put two
>> requestHandlers in solrconfig.xml, like this:
>>
>> <requestHandler name="/dataimport"
>>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="config">./data-config.xml</str>
>>   </lst>
>> </requestHandler>
>>
>> <requestHandler name="/dataimport2"
>>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="config">./data-config2.xml</str>
>>   </lst>
>> </requestHandler>
>>
>> and created data-config.xml and data-config2.xml. Then I ran the command
>>
>> http://host:8080/solr/dataimport?command=full-import
>>
>> but only one data set (the first one) was indexed. Did I get something
>> wrong?
>>
>> 2. I noticed that after Solr indexed about 8M documents (around two
>> hours), it gets very, very slow. Using the "top" command in Linux, I
>> noticed that RES is at 1g of memory. I did several experiments; every time
>> RES reaches 1g, the indexing process becomes extremely slow. Is this
>> memory limit set by the JVM? And how can I set the JVM memory when I use
>> DIH through the web command full-import?
>>
>> Thanks!
>>
>> JB
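(Two quick notes on the two questions above, in case they help. On the first:
each DataImportHandler instance is only started by a request to its own
handler, so a full-import on /dataimport leaves /dataimport2 untouched. With
the configuration above, both imports would be kicked off, in parallel, by
hitting both URLs, e.g. from two shells:

  http://host:8080/solr/dataimport?command=full-import
  http://host:8080/solr/dataimport2?command=full-import

On the second: DIH runs inside the servlet container's JVM, so the heap
cannot be set through the import URL. It is fixed when the container is
started, for example "java -Xmx2048m -jar start.jar" for the example Jetty
that ships with Solr, or via JAVA_OPTS for Tomcat; the 2048m here is only an
illustration, size it to your machine.)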
>>
>> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>
>>> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>> Hi Paul, but in your previous post you said "there is already an issue
>>>> for writing to Solr in multiple threads, SOLR-1089". Do you think using
>>>> solrj alone would be better than DIH?
>>>
>>> nope. you will have to do the indexing in multiple threads
>>>
>>> if possible, you can set up multiple DIH, say /dataimport1, /dataimport2
>>> etc., split your files, and achieve parallelism
>>>
>>>> Thanks and have a good weekend!
>>>>
>>>> --- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com> wrote:
>>>>> no need to use the embedded SolrServer. you can use SolrJ with
>>>>> streaming, in multiple threads
>>>>>
>>>>> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai <djian...@yahoo.com> wrote:
>>>>>> If I do the XML parsing by myself and use the embedded client to do
>>>>>> the push, would it be more efficient than DIH?
>>>>>>
>>>>>> --- On Fri, 5/22/09, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>>> Can you parallelize this? I don't know that the DIH can handle it,
>>>>>>> but having multiple threads sending docs to Solr is the best,
>>>>>>> performance-wise, so maybe you need to look at alternatives to
>>>>>>> pulling with DIH and instead use a client to push into Solr.
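Since "push with SolrJ in multiple threads" comes up more than once above,
here is a minimal sketch of that approach. It assumes the
StreamingUpdateSolrServer from Solr 1.4, which buffers documents in a queue
and sends them with several background threads; the URL, field names, and
document count are placeholders, not taken from this thread:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class PushIndexer {
    public static void main(String[] args) throws Exception {
      // queue up to 100 docs, drained by 4 background threads
      SolrServer server =
          new StreamingUpdateSolrServer("http://host:8080/solr", 100, 4);
      for (int i = 0; i < 1000; i++) {   // stand-in for your own XML parsing loop
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);  // placeholder field names
        doc.addField("text", "body of document " + i);
        server.add(doc);                 // returns quickly; sent in the background
      }
      server.commit();                   // one commit at the end
    }
  }

On Solr 1.3 the same loop works against a CommonsHttpSolrServer, only without
the built-in threading, in which case you would run several such loops in
parallel yourself.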
>> > >> >> >>> >> > >> >> >>> JB >> > >> >> >>> >> > >> >> >>> --- On Thu, 5/21/09, Noble >> > Paul >> > >> >> നോബിള് >> > >> >> >> नोब्ळ् <noble.p...@corp..aol.com> >> > >> >> >> wrote: >> > >> >> >>> >> > >> >> >>>> From: Noble Paul >> > >> നോബിള് >> > >> >> >> नोब्ळ् <noble.p...@corp.aol.com> >> > >> >> >>>> Subject: Re: How to >> > index large >> > >> set data >> > >> >> >>>> To: solr-u...@lucene.apache..org >> > >> >> >>>> Date: Thursday, May 21, >> > 2009, >> > >> 9:57 PM >> > >> >> >>>> check the status page of >> > DIH and >> > >> see >> > >> >> >>>> if it is working >> > properly. and >> > >> >> >>>> if, yes what is the rate >> > of >> > >> indexing >> > >> >> >>>> >> > >> >> >>>> On Thu, May 21, 2009 at >> > 11:48 AM, >> > >> Jianbin >> > >> >> Dai >> > >> >> >> <djian...@yahoo.com> >> > >> >> >>>> wrote: >> > >> >> >>>>> >> > >> >> >>>>> Hi, >> > >> >> >>>>> >> > >> >> >>>>> I have about 45GB >> > xml files >> > >> to be >> > >> >> indexed. I >> > >> >> >> am using >> > >> >> >>>> DataImportHandler. I >> > started the >> > >> full >> > >> >> import 4 >> > >> >> >> hours ago, >> > >> >> >>>> and it's still >> > running..... >> > >> >> >>>>> My computer has 4GB >> > memory. >> > >> Any >> > >> >> suggestion on >> > >> >> >> the >> > >> >> >>>> solutions? >> > >> >> >>>>> Thanks! >> > >> >> >>>>> >> > >> >> >>>>> JB >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> >> > >> >> >>>> -- >> > >> >> >>>> >> > >> >> >> >> > >> >> >> > >> >> > ----------------------------------------------------- >> > >> >> >>>> Noble Paul | Principal >> > Engineer| >> > >> AOL | http://aol.com >> > >> >> >>>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> -- >> > >> >> >> >> > >> >> >> > >> >> > ----------------------------------------------------- >> > >> >> >> Noble Paul | Principal Engineer| >> > AOL | http://aol.com >> > >> >> >> >> > >> >> > >> > >> >> > >> > >> >> > >> > >> >> >> > >> >> -------------------------- >> > >> >> Grant Ingersoll >> > >> >> http://www.lucidimagination.com/ >> > >> >> >> > >> >> Search the Lucene ecosystem >> > >> >> (Lucene/Solr/Nutch/Mahout/Tika/Droids) >> > >> >> using Solr/Lucene: >> > >> >> http://www.lucidimagination...com/search >> > >> >> >> > >> >> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> > >> >> > >> >> > >> -- >> > >> >> > ----------------------------------------------------- >> > >> Noble Paul | Principal Engineer| AOL | http://aol.com >> > >> >> > > >> > > >> > > >> > > >> > > >> > >> > >> > >> > -- >> > ----------------------------------------------------- >> > Noble Paul | Principal Engineer| AOL | http://aol.com >> > >> >> >> >> >> > -- ----------------------------------------------------- Noble Paul | Principal Engineer| AOL | http://aol.com