Re: How to index large set data
Hello. Interesting thread. One request, please: because I don't have much experience
with Solr, could you please use full terms and not abbreviations like DIH
(DataImportHandler), RES (resident memory), etc.? Thanks :)

On Mon, May 25, 2009 at 4:44 AM, Jianbin Dai wrote:
>
> Hi Paul,
>
> Hope you have a great weekend so far. I still have a couple of questions
> you might help me out with:
>
> 1. In your earlier email, you said "if possible, you can setup multiple
> DIH say /dataimport1, /dataimport2 etc and split your files and can
> achieve parallelism". I am not sure if I understand it right. I put two
> requestHandlers in solrconfig.xml, like this:
>
>   class="org.apache.solr.handler.dataimport.DataImportHandler">
>     ./data-config.xml
>   class="org.apache.solr.handler.dataimport.DataImportHandler">
>     ./data-config2.xml
>
> and created data-config.xml and data-config2.xml. Then I ran the command
>
>   http://host:8080/solr/dataimport?command=full-import
>
> But only one data set (the first one) was indexed. Did I get something
> wrong?
>
> 2. I noticed that after Solr indexed about 8M documents (around two
> hours), it gets very, very slow. I used the "top" command in Linux and
> noticed that RES is 1g of memory. I did several experiments; every time
> RES reaches 1g, the indexing process becomes extremely slow. Is this
> memory limit set by the JVM? And how can I set the JVM memory when I use
> DIH through the web command full-import?
>
> Thanks!
>
> JB
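On the resident-memory (RES) question above: DataImportHandler runs inside the
servlet container's JVM, so the heap cannot be set per import request; it has to be
set where the container is launched. A minimal sketch, assuming Solr is deployed in
Tomcat (the sizes are illustrative):

```shell
# Set before starting Tomcat (e.g. in bin/setenv.sh or the shell that
# runs catalina.sh). The -Xmx ceiling applies to everything running in
# the container, including DataImportHandler full-imports.
export JAVA_OPTS="-Xms512m -Xmx2048m"
```

With a small heap, the JVM spends more and more time in garbage collection as usage
nears the ceiling, which would match the slowdown observed once RES approaches 1g.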
Re: Getting 404 for MoreLikeThis handler
Thanks. Will that still be the MoreLikeThisRequestHandler, or the
StandardRequestHandler with the mlt option?

>> Hi, I'm trying out the mlt handler but I'm getting a 404 error:
>>
>> HTTP Status 404 - /solr/mlt
>>
>> solrconfig.xml seems to say that the mlt handler is available by default.
>> I wonder if there's anything else I should do before I can use it?
>> I'm using version 1.3.
>
> Try /solr/select with the mlt=on parameter.
>
> Koji
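For reference, Solr 1.3 also ships a dedicated MoreLikeThisHandler, but it is not
registered by default, which would explain the 404 on /solr/mlt. A sketch of what an
explicit registration in solrconfig.xml might look like (the /mlt path and the field
names are illustrative choices, not defaults):

```xml
<!-- Registers the standalone MoreLikeThis handler at /solr/mlt.
     Without an entry like this, requests to /solr/mlt return 404. -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">name,features</str> <!-- illustrative field names -->
    <int name="mlt.mintf">1</int>
    <int name="mlt.mindf">1</int>
  </lst>
</requestHandler>
```

The alternative Koji suggests, passing mlt=on to /solr/select, uses the
MoreLikeThis *component* inside the standard handler and needs no extra
configuration.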
Re: How to index large set data
Hi Paul,

Hope you have a great weekend so far. I still have a couple of questions you
might help me out with:

1. In your earlier email, you said "if possible, you can setup multiple DIH say
/dataimport1, /dataimport2 etc and split your files and can achieve parallelism".
I am not sure if I understand it right. I put two requestHandlers in
solrconfig.xml, like this:

  class="org.apache.solr.handler.dataimport.DataImportHandler">
    ./data-config.xml
  class="org.apache.solr.handler.dataimport.DataImportHandler">
    ./data-config2.xml

and created data-config.xml and data-config2.xml. Then I ran the command

  http://host:8080/solr/dataimport?command=full-import

But only one data set (the first one) was indexed. Did I get something wrong?

2. I noticed that after Solr indexed about 8M documents (around two hours), it
gets very, very slow. I used the "top" command in Linux and noticed that RES is
1g of memory. I did several experiments; every time RES reaches 1g, the indexing
process becomes extremely slow. Is this memory limit set by the JVM? And how can
I set the JVM memory when I use DIH through the web command full-import?

Thanks!

JB

--- On Fri, 5/22/09, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> From: Noble Paul നോബിള്‍ नोब्ळ्
> Subject: Re: How to index large set data
> To: "Jianbin Dai"
> Date: Friday, May 22, 2009, 10:04 PM
>
> On Sat, May 23, 2009 at 10:27 AM, Jianbin Dai wrote:
> >
> > Hi Paul, but in your previous post, you said "there is already an issue
> > for writing to Solr in multiple threads SOLR-1089". Do you think using
> > SolrJ alone would be better than DIH?
>
> nope
> you will have to do indexing in multiple threads
>
> if possible, you can setup multiple DIH say /dataimport1, /dataimport2 etc
> and split your files and can achieve parallelism
>
> > Thanks and have a good weekend!
> >
> > --- On Fri, 5/22/09, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >
> >> no need to use embedded SolrServer..
> >> you can use SolrJ with streaming in multiple threads
> >>
> >> On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai wrote:
> >> >
> >> > If I do the XML parsing by myself and use the embedded client to do
> >> > the push, would it be more efficient than DIH?
> >> >
> >> > --- On Fri, 5/22/09, Grant Ingersoll wrote:
> >> >
> >> >> From: Grant Ingersoll
> >> >> Subject: Re: How to index large set data
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Friday, May 22, 2009, 5:38 AM
> >> >>
> >> >> Can you parallelize this? I don't know that the DIH can handle it,
> >> >> but having multiple threads sending docs to Solr is the best
> >> >> performance-wise, so maybe you need to look at alternatives to
> >> >> pulling with DIH and instead use a client to push into Solr.
> >> >>
> >> >> On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> >> >>
> >> >> > about 2.8m total docs were created. only the first run finishes.
> >> >> > In my 2nd try, it hangs there forever at the end of indexing (I
> >> >> > guess right before commit), with cpu usage of 100%. Total 5G
> >> >> > (2050) index files are created. Now I have two problems:
> >> >> > 1. why it hangs there and failed?
> >> >> > 2. how can i speed up the indexing?
> >> >> >
> >> >> > Here is my solrconfig.xml
> >> >> >
> >> >> >   false
> >> >> >   3000
> >> >> >   1000
> >> >> >   2147483647
> >> >> >   1
> >> >> >   false
> >> >> >
> >> >> > --- On Thu, 5/21/09, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >> >> >
> >> >> >> From: Noble Paul നോബിള്‍ नोब्ळ्
> >> >> >> Subject: Re: How to index large set data
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Date: Thursday, May 21, 2009, 10:39 PM
> >> >> >>
> >> >> >> what is the total no. of docs created? I guess it may not be
> >> >> >> memory bound. indexing is mostly an IO bound operation. You may
> >> >> >> be able to get a better perf if an SSD is used (solid state
> >> >> >> disk)
> >> >> >>
> >> >> >> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai wrote:
> >> >> >>>
> >> >> >>> Hi Paul,
> >> >> >>>
> >> >> >>> Thank you so much for answering my questions. It really
> >> >> >>> helped. After some adjustment, basically setting mergeFactor
> >> >> >>> to 1000 from the default value of 10, I can finish the whole
> >> >> >>> job in 2.5 hours. I checked that during running time, only
> >> >> >>> around 18% of memory is being used, and VIRT is always 1418m.
> >> >> >>> I am thinking it may be restricted by the JVM memory setting.
> >> >> >>> But I run the data import command through web, i.e.,
> >> >> >>> http://<host>:<port>/solr/dataimport?command=full-import,
> >> >> >>> how can I set the memory allocation for
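For question 1 above, the two-handler setup that Noble Paul suggests would
presumably look something like this in solrconfig.xml (handler paths follow his
/dataimport1, /dataimport2 example; the file names are the ones Jianbin mentions):

```xml
<!-- Two independent DataImportHandler instances. Each one must be
     triggered by its own request, e.g.
       /solr/dataimport1?command=full-import
       /solr/dataimport2?command=full-import
     Issuing a single full-import only runs the one handler it is
     addressed to, which would explain why only one data set was
     indexed. -->
<requestHandler name="/dataimport1"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport2"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config2.xml</str>
  </lst>
</requestHandler>
```

The parallelism comes from firing both full-import requests concurrently; each
handler then indexes its own slice of the files.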
Re: Getting 404 for MoreLikeThis handler
jlist9 wrote:
> Hi, I'm trying out the mlt handler but I'm getting a 404 error:
>
> HTTP Status 404 - /solr/mlt
>
> solrconfig.xml seems to say that the mlt handler is available by default.
> I wonder if there's anything else I should do before I can use it?
> I'm using version 1.3. Thanks

Try /solr/select with the mlt=on parameter.

Koji
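A sketch of such a request, assuming a local Solr on port 8983 and the example
schema's name and features fields (mlt.fl is required so MoreLikeThis knows which
fields to compare):

```text
http://localhost:8983/solr/select?q=id:SP2514N&mlt=on&mlt.fl=name,features&mlt.mintf=1&mlt.mindf=1
```

The response then carries a moreLikeThis section alongside the normal search
results, one entry per matched document.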
More questions about MoreLikeThis
The wiki page (http://wiki.apache.org/solr/MoreLikeThis) says:

  mlt.fl: The fields to use for similarity. NOTE: if possible, these should
  have a stored TermVector

I didn't set termVectors to true, yet MoreLikeThis with the
StandardRequestHandler seems to work fine. The first question is: are term
vectors only a performance optimization? The second question is: after I changed
the mlt.fl fields from both indexed and stored to indexed only, I started to get
zero results back. Do mlt.fl fields always need to be stored? Thanks
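For context, term vectors are a per-field setting in schema.xml. When a field has
them, MoreLikeThis reads the terms straight from the index; when it does not,
MoreLikeThis falls back to re-analyzing the stored field value at query time. That
is why it still works without term vectors, but returns nothing once the field is
neither term-vectored nor stored. A sketch, with an illustrative field name:

```xml
<!-- schema.xml: "description" is a hypothetical field name.
     termVectors="true" lets MoreLikeThis skip the stored-value
     re-analysis; without it, stored="true" is required for MLT. -->
<field name="description" type="text" indexed="true" stored="true"
       termVectors="true"/>
```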
Getting 404 for MoreLikeThis handler
Hi, I'm trying out the mlt handler but I'm getting a 404 error:

HTTP Status 404 - /solr/mlt

solrconfig.xml seems to say that the mlt handler is available by default. I
wonder if there's anything else I should do before I can use it? I'm using
version 1.3. Thanks
Re: solr replication 1.3
On Sun, May 24, 2009 at 12:37 AM, Otis Gospodnetic wrote:
> Yes. Although it might work under Cygwin, too.

Cygwin wouldn't work:
http://www.lucidimagination.com/search/document/32471da18a69b169/replication_in_1_3

-Yonik
http://www.lucidimagination.com