Re: How to index large set data
On Mon, May 25, 2009 at 10:56 AM, nk 11 nick.cass...@gmail.com wrote:
> Hello. Interesting thread. One request, please: because I don't have much
> experience with Solr, could you please use full terms and not DIH, RES,
> etc.? Thanks :)
> nk11

DIH = DataImportHandler. RES = ?

It is unavoidable that we end up using short names out of laziness/lack of
time. But whenever you come across one, do not hesitate to ask; we will be
more than glad to clarify.
Re: How to index large set data
Hi Paul,

Hope you have a great weekend so far. I still have a couple of questions you
might help me out with:

1. In your earlier email you said: "if possible, you can setup multiple DIH,
say /dataimport1, /dataimport2 etc., and split your files, and can achieve
parallelism". I am not sure I understand it right. I put two requestHandler
entries in solrconfig.xml, like this:

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">./data-config.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="/dataimport2"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">./data-config2.xml</str>
    </lst>
  </requestHandler>

and created data-config.xml and data-config2.xml. Then I ran the command

  http://host:8080/solr/dataimport?command=full-import

But only one data set (the first one) was indexed. Did I get something wrong?

2. I noticed that after Solr indexed about 8M documents (around two hours),
it gets very, very slow. I used the top command in Linux and noticed that RES
is 1g of memory. I did several experiments: every time RES reaches 1g, the
indexing process becomes extremely slow. Is this memory limit set by the JVM?
And how can I set the JVM memory when I use DIH through the web command
full-import?

Thanks!

JB

--- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com wrote:
> > Hi Paul, but in your previous post you said there is already an issue
> > for writing to Solr in multiple threads, SOLR-1089. Do you think using
> > solrj alone would be better than DIH?
> nope. you will have to do the indexing in multiple threads. if possible,
> you can setup multiple DIH, say /dataimport1, /dataimport2 etc., split
> your files, and achieve parallelism.
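On question 1, a quick sanity check: each DataImportHandler instance has to be triggered through its own URL, so hitting /dataimport alone leaves /dataimport2 idle. A minimal sketch, assuming the two handlers above and a hypothetical localhost:8080 install:

```shell
# Hypothetical host/port; handler names match the two requestHandler entries.
SOLR="http://localhost:8080/solr"
IMPORT1="$SOLR/dataimport?command=full-import"
IMPORT2="$SOLR/dataimport2?command=full-import"

# Fire both imports; each handler runs its own data-config in parallel.
curl -s "$IMPORT1" >/dev/null 2>&1 || true &
curl -s "$IMPORT2" >/dev/null 2>&1 || true &
wait
```

Progress for each import can then be polled separately via $SOLR/dataimport?command=status and $SOLR/dataimport2?command=status.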
Re: How to index large set data
Hello. Interesting thread. One request, please: because I don't have much
experience with Solr, could you please use full terms and not DIH, RES, etc.?

Thanks :)

On Mon, May 25, 2009 at 4:44 AM, Jianbin Dai djian...@yahoo.com wrote:
> Hi Paul, hope you have a great weekend so far. I still have a couple of
> questions you might help me out with: ...
Re: How to index large set data
about 2.8M total docs were created. Only the first run finished. In my 2nd
try it hangs there forever at the end of indexing (I guess right before the
commit), with CPU usage of 100%. In total 5G (2050 files) of index files were
created. Now I have two problems:
1. Why does it hang there and fail?
2. How can I speed up the indexing?

Here is my solrconfig.xml:

  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>3000</ramBufferSizeMB>
  <mergeFactor>1000</mergeFactor>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <unlockOnStartup>false</unlockOnStartup>

--- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com wrote:
> what is the total no. of docs created? I guess it may not be memory bound.
> indexing is mostly an IO bound operation. You may be able to get better
> perf if an SSD (solid state disk) is used.
>
> On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai djian...@yahoo.com wrote:
> > Hi Paul, thank you so much for answering my questions. It really helped.
> > After some adjustment, basically setting mergeFactor to 1000 from the
> > default value of 10, I could finish the whole job in 2.5 hours. I
> > checked that during the run only around 18% of memory is being used, and
> > VIRT is always 1418m. I am thinking it may be restricted by the JVM
> > memory setting. But I run the data import command through the web, i.e.,
> > http://host:port/solr/dataimport?command=full-import -- how can I set
> > the memory allocation for the JVM? Thanks again!
> > JB
> >
> > --- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् wrote:
> > > check the status page of DIH and see if it is working properly. and
> > > if yes, what is the rate of indexing?
> > >
> > > On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai djian...@yahoo.com
> > > wrote:
> > > > Hi, I have about 45GB of xml files to be indexed. I am using
> > > > DataImportHandler. I started the full import 4 hours ago, and it's
> > > > still running. My computer has 4GB memory. Any suggestion on the
> > > > solutions? Thanks!
> > > > JB

--
Noble Paul | Principal Engineer | AOL | http://aol.com
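On the JVM memory question in the quoted exchange: the full-import URL only triggers work inside the servlet container's JVM, so the heap is set where that JVM is launched, not on the request. A sketch with illustrative flag values, for the Jetty start.jar that ships with Solr:

```shell
# Heap flags go on the JVM that hosts Solr, not on the /dataimport request.
JAVA_MEM="-Xms512m -Xmx2048m"

# With the bundled Jetty (echoed here rather than executed):
echo java $JAVA_MEM -jar start.jar

# With Tomcat, the same flags would go into JAVA_OPTS before startup:
# export JAVA_OPTS="$JAVA_OPTS $JAVA_MEM"
```

Without an explicit -Xmx the JVM picks a default maximum heap, which would match the observed behavior of indexing slowing once resident memory stops growing.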
Re: How to index large set data
Can you parallelize this? I don't know that the DIH can handle it, but having
multiple threads sending docs to Solr is the best performance-wise, so maybe
you need to look at alternatives to pulling with DIH and instead use a client
to push into Solr.

On May 22, 2009, at 3:42 AM, Jianbin Dai wrote:
> about 2.8M total docs were created. Only the first run finished. In my 2nd
> try it hangs there forever at the end of indexing (I guess right before
> the commit), with CPU usage of 100%. ...

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene: http://www.lucidimagination.com/search
Re: How to index large set data
There is already an issue for writing to Solr in multiple threads: SOLR-1089.

On Fri, May 22, 2009 at 6:08 PM, Grant Ingersoll gsing...@apache.org wrote:
> Can you parallelize this? I don't know that the DIH can handle it, but
> having multiple threads sending docs to Solr is the best performance-wise,
> so maybe you need to look at alternatives to pulling with DIH and instead
> use a client to push into Solr. ...

--
Noble Paul | Principal Engineer | AOL | http://aol.com
Re: How to index large set data
Hi,

Those settings are a little crazy. Are you sure you want to give Solr/Lucene
3G to buffer documents before flushing them to disk? Are you sure you want to
use a mergeFactor of 1000?

Check the logs to see if there are any errors. Look at the index directory to
see if Solr is actually still writing to it (file sizes are changing, the
number of files is changing). kill -QUIT the JVM pid to see where things are
stuck, if they are stuck...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Jianbin Dai djian...@yahoo.com
To: solr-user@lucene.apache.org; noble.p...@gmail.com
Sent: Friday, May 22, 2009 3:42:04 AM
Subject: Re: How to index large set data

> about 2.8M total docs were created. Only the first run finished. In my 2nd
> try it hangs there forever at the end of indexing (I guess right before
> the commit), with CPU usage of 100%. ...
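The kill -QUIT suggestion, sketched below; the pgrep pattern is an assumption for the bundled Jetty launcher, and the resulting thread dump appears on the JVM's stdout log rather than in the shell:

```shell
# Find the Solr JVM (pattern assumes the bundled "java -jar start.jar").
SOLR_PID=$(pgrep -f start.jar || true)

if [ -n "$SOLR_PID" ]; then
  # SIGQUIT makes a HotSpot JVM print a full thread dump without exiting,
  # which shows whether threads are stuck in a merge, a commit, or GC.
  kill -QUIT "$SOLR_PID"
fi
```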
Re: How to index large set data
I don't know exactly what this 3G RAM buffer is used for. But what I noticed
was that both the index size and the file number kept increasing, yet it was
stuck in the commit.

--- On Fri, 5/22/09, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
> Those settings are a little crazy. Are you sure you want to give
> Solr/Lucene 3G to buffer documents before flushing them to disk? Are you
> sure you want to use a mergeFactor of 1000? ...
Re: How to index large set data
If I do the xml parsing by myself and use an embedded client to do the push,
would it be more efficient than DIH?

--- On Fri, 5/22/09, Grant Ingersoll gsing...@apache.org wrote:
> Can you parallelize this? I don't know that the DIH can handle it, but
> having multiple threads sending docs to Solr is the best performance-wise,
> so maybe you need to look at alternatives to pulling with DIH and instead
> use a client to push into Solr. ...
Re: How to index large set data
If the file numbers and index size were increasing, that means Solr was still working. It's possible it's taking extra long because of such high settings. Bring them both down and try: for example, don't go over 20 with mergeFactor, and try just 1GB for ramBufferSizeMB. Good luck! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message From: Jianbin Dai djian...@yahoo.com To: solr-user@lucene.apache.org Sent: Friday, May 22, 2009 11:05:27 AM Subject: Re: How to index large set data
I don't know exactly what this 3G RAM buffer is used for. But what I noticed was that both the index size and the file number kept increasing, yet it got stuck in the commit.
--- On Fri, 5/22/09, Otis Gospodnetic wrote: From: Otis Gospodnetic Subject: Re: How to index large set data To: solr-user@lucene.apache.org Date: Friday, May 22, 2009, 7:26 AM
Hi, those settings are a little crazy. Are you sure you want to give Solr/Lucene 3G to buffer documents before flushing them to disk? Are you sure you want to use a mergeFactor of 1000? Check the logs to see if there are any errors. Look at the index directory to see if Solr is actually still writing to it (file sizes changing, number of files changing). kill -QUIT the JVM pid to see where things are stuck, if they are stuck. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message From: Jianbin Dai To: solr-user@lucene.apache.org; noble.p...@gmail.com Sent: Friday, May 22, 2009 3:42:04 AM Subject: Re: How to index large set data
About 2.8M total docs were created. Only the first run finishes. In my 2nd try, it hangs there forever at the end of indexing (I guess right before the commit), with CPU usage at 100%. In total, 5G (2050) index files are created. Now I have two problems: 1. Why does it hang there and fail? 2. How can I speed up the indexing?
Here is my solrconfig.xml:
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>3000</ramBufferSizeMB>
<mergeFactor>1000</mergeFactor>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>1</maxFieldLength>
<unlockOnStartup>false</unlockOnStartup>
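Otis's suggested reduction, expressed as a solrconfig.xml fragment. This is a sketch only; the exact values are illustrative (mergeFactor kept at the default, RAM buffer at roughly 1GB instead of 3GB) and were not tested against this data set:

```xml
<!-- tamer index settings along the lines Otis suggests:
     mergeFactor no higher than ~20, and ~1GB of RAM buffer -->
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>
<maxMergeDocs>2147483647</maxMergeDocs>
<unlockOnStartup>false</unlockOnStartup>
```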
Re: How to index large set data
No need to use EmbeddedSolrServer; you can use SolrJ with streaming in multiple threads.
On Fri, May 22, 2009 at 8:36 PM, Jianbin Dai djian...@yahoo.com wrote: If I do the XML parsing by myself and use an embedded client to do the push, would it be more efficient than DIH?
--- On Fri, 5/22/09, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: Re: How to index large set data To: solr-user@lucene.apache.org Date: Friday, May 22, 2009, 5:38 AM
Can you parallelize this? I don't know that the DIH can handle it, but having multiple threads sending docs to Solr is the best performance-wise, so maybe you need to look at alternatives to pulling with DIH and instead use a client to push into Solr.
-- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: How to index large set data
Hi Paul, but in your previous post you said there is already an issue for writing to Solr in multiple threads, SOLR-1089. Do you think using SolrJ alone would be better than DIH? Thanks and have a good weekend!
--- On Fri, 5/22/09, Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com wrote: No need to use EmbeddedSolrServer; you can use SolrJ with streaming in multiple threads.
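Noble's and Grant's advice (push documents into Solr from multiple client threads instead of pulling with DIH) can be sketched with nothing but the JDK. This is only an illustration, not code from the thread: the Solr URL is a hypothetical localhost default, the field names are invented, and in a real setup SolrJ's StreamingUpdateSolrServer would handle batching and connection reuse for you.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelIndexer {

    // Build a minimal <add><doc>...</doc></add> update message.
    // (Real code must XML-escape the field values.)
    static String buildAddXml(String id, String title) {
        return "<add><doc>"
             + "<field name=\"id\">" + id + "</field>"
             + "<field name=\"title\">" + title + "</field>"
             + "</doc></add>";
    }

    // POST one update message to Solr's XML update handler.
    static void post(String updateUrl, String xml) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(updateUrl).openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = con.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        if (con.getResponseCode() != 200) {
            throw new RuntimeException("Solr returned HTTP " + con.getResponseCode());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No Solr URL given: just show the XML we would send.
            System.out.println(buildAddXml("demo-1", "demo title"));
            return;
        }
        String updateUrl = args[0]; // e.g. http://localhost:8983/solr/update (hypothetical)
        ExecutorService pool = Executors.newFixedThreadPool(4); // one thread per input slice
        for (int t = 0; t < 4; t++) {
            final int slice = t;
            pool.submit(() -> {
                try {
                    // In practice each thread would parse its own slice of the
                    // 45GB of XML files and post batches, not a single doc.
                    post(updateUrl, buildAddXml("doc-" + slice, "title " + slice));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}
```

Splitting the input files among threads this way is the same idea as Noble's multiple /dataimport1, /dataimport2 handlers, just done on the client side.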
How to index large set data
Hi, I have about 45GB of XML files to be indexed. I am using DataImportHandler. I started the full import 4 hours ago, and it's still running. My computer has 4GB memory. Any suggestion on the solutions? Thanks! JB
Re: How to index large set data
This isn't much data to go on. Do you have any idea what your throughput is? How many documents are you indexing: one 45G doc, or 4.5 billion 10-character docs? Have you looked at any profiling data to see how much memory is being consumed? Are you IO bound or CPU bound? Best, Erick
On Thu, May 21, 2009 at 2:18 AM, Jianbin Dai djian...@yahoo.com wrote: Hi, I have about 45GB of XML files to be indexed. I am using DataImportHandler. I started the full import 4 hours ago, and it's still running. My computer has 4GB memory. Any suggestion on the solutions? Thanks! JB
Re: How to index large set data
Check the status page of DIH and see if it is working properly, and if yes, what is the rate of indexing?
On Thu, May 21, 2009 at 11:48 AM, Jianbin Dai djian...@yahoo.com wrote: Hi, I have about 45GB of XML files to be indexed. I am using DataImportHandler. I started the full import 4 hours ago, and it's still running. My computer has 4GB memory. Any suggestion on the solutions? Thanks! JB
-- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: How to index large set data
Hi Paul, thank you so much for answering my questions. It really helped. After some adjustment, basically setting mergeFactor to 1000 from the default value of 10, I finished the whole job in 2.5 hours. I checked that during running time, only around 18% of memory was being used, and VIRT was always 1418m. I am thinking it may be restricted by the JVM memory setting. But I run the data import command through the web, i.e., http://host:port/solr/dataimport?command=full-import, so how can I set the memory allocation for the JVM? Thanks again! JB
--- On Thu, 5/21/09, Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com wrote: Check the status page of DIH and see if it is working properly, and if yes, what is the rate of indexing?
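To JB's recurring question: the heap for a full-import cannot be set through the /dataimport URL; it is fixed when the servlet container's JVM is started. A sketch, assuming the example Jetty distribution that ships with Solr (the flag values are illustrative, not a recommendation for this data set):

```shell
# Jetty example distribution: pass heap flags to the JVM that runs Solr
java -Xms512m -Xmx2048m -jar start.jar

# Tomcat: set JAVA_OPTS before starting the container
export JAVA_OPTS="-Xms512m -Xmx2048m"
```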
Re: How to index large set data
What is the total no. of docs created? I guess it may not be memory bound; indexing is mostly an IO-bound operation. You may be able to get better perf if an SSD (solid state disk) is used.
On Fri, May 22, 2009 at 10:46 AM, Jianbin Dai djian...@yahoo.com wrote: Hi Paul, thank you so much for answering my questions. It really helped. After some adjustment, basically setting mergeFactor to 1000 from the default value of 10, I finished the whole job in 2.5 hours. I checked that during running time, only around 18% of memory was being used, and VIRT was always 1418m. I am thinking it may be restricted by the JVM memory setting. But I run the data import command through the web, i.e., http://host:port/solr/dataimport?command=full-import, so how can I set the memory allocation for the JVM? Thanks again! JB
-- Noble Paul | Principal Engineer | AOL | http://aol.com