Re: Any tips for indexing large amounts of data?
They don't usually turn off the slave, but it is not a bad idea if you can take it offline; it is a logistical headache, though. BTW, do you have a very good cache hit ratio? If so, autowarming makes sense. --Noble

On Fri, Apr 10, 2009 at 4:07 PM, sunnyfr johanna...@gmail.com wrote: OK, but how do people handle frequent updates to a large database that also takes a lot of queries? Do they turn off the slave during the warmup?
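To make Noble's autowarming suggestion concrete, here is a minimal sketch of the relevant solrconfig.xml settings; the sizes and the query are made-up placeholders. autowarmCount seeds the new searcher's caches from the most recently used entries of the old one, and a newSearcher listener can replay a few important queries before the new searcher goes live:

<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="256"/>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">video</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>

The trade-off Noble hints at: the higher your cache hit ratio, the more of this warming work actually pays off before real traffic arrives.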
Re: Any tips for indexing large amounts of data?
Hi Otis, How did you manage that? I have an 8-core machine with 8GB of RAM and an 11GB index for 14M docs, with 5 updates every 30 minutes, but my replication kills everything. My segments are merged too often, so the full index is replicated and the caches are lost, and I have no idea what to do now. Some help would be brilliant; BTW I'm using Solr 1.4. Thanks,
Re: Any tips for indexing large amounts of data?
For Solr / Lucene:
- Use -XX:+AggressiveOpts.
- If available, huge pages can help. See http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html -- I haven't yet followed up with my Lucene performance numbers using huge pages: it is 10-15% for large indexing jobs.

For Lucene:
- Multi-thread using java.util.concurrent.ThreadPoolExecutor (http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html -- 6.4 million full-text articles + metadata indexed, resulting in an 83GB index; those are old numbers: things are down to ~10 hours now).
- While multithreading is particularly good on multicore machines, it also improves performance on a single core, given small numbers of threads (about 6; YMMV) and good I/O (test for your particular configuration).
- Use multiple indexes and merge at the end.
- As per http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf, use a separate ThreadPoolExecutor per index in the previous item, reducing queue contention. This is giving me an additional ~10%. I will blog about this in the near future...
-glen
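A minimal sketch of the multi-threaded indexing pattern Glen describes, assuming one shared IndexWriter fed by a bounded ThreadPoolExecutor (IndexWriter is thread-safe). Class names follow current Lucene (IndexWriterConfig, StringField/TextField); the Lucene 2.x API of this era spells these differently, but the pattern is the same, and paths, thread counts, and document contents are placeholders:

import java.nio.file.Paths;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index")),
                new IndexWriterConfig(new StandardAnalyzer()));

        // Bounded queue + CallerRunsPolicy gives back-pressure instead of
        // running out of memory when producers outrun the indexing threads.
        ExecutorService pool = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 1_000_000; i++) {
            final int id = i;
            pool.submit(() -> {
                try {
                    Document doc = new Document();
                    doc.add(new StringField("id", Integer.toString(id), Field.Store.YES));
                    doc.add(new TextField("body", "placeholder body " + id, Field.Store.NO));
                    writer.addDocument(doc); // IndexWriter handles its own locking
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        writer.close(); // implies a final commit
    }
}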
Re: Any tips for indexing large amounts of data?
Sorry, the presentation covers a lot of ground: see slide #20: "Standard thread pools can have high contention for the task queue and other data structures when used with fine-grained tasks." [I haven't yet implemented work stealing] -glen
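A sketch of the resulting combination -- one ThreadPoolExecutor per index partition, so each index has a private task queue rather than contending on a shared one, then a merge at the end. As above, this uses current Lucene names (addIndexes; the Lucene 2.x of this era called it addIndexesNoOptimize), and partition counts, paths, and fields are placeholders:

import java.nio.file.Paths;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class PartitionedIndexer {
    public static void main(String[] args) throws Exception {
        final int partitions = 4;
        Directory[] dirs = new Directory[partitions];
        final IndexWriter[] writers = new IndexWriter[partitions];
        ExecutorService[] pools = new ExecutorService[partitions];
        for (int p = 0; p < partitions; p++) {
            dirs[p] = FSDirectory.open(Paths.get("/tmp/part" + p));
            writers[p] = new IndexWriter(dirs[p],
                    new IndexWriterConfig(new StandardAnalyzer()));
            pools[p] = Executors.newFixedThreadPool(2); // private queue per index
        }
        for (int i = 0; i < 100_000; i++) {
            final int id = i;
            final int p = i % partitions; // route each doc to one partition
            pools[p].submit(() -> {
                try {
                    Document doc = new Document();
                    doc.add(new StringField("id", Integer.toString(id), Field.Store.YES));
                    writers[p].addDocument(doc);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        for (ExecutorService pool : pools) {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
        for (IndexWriter w : writers) w.close();
        // Merge all partitions into one final index.
        try (IndexWriter merged = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/final")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            merged.addIndexes(dirs);
        }
    }
}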
Re: Any tips for indexing large amounts of data?
On Thu, Apr 9, 2009 at 8:51 PM, sunnyfr johanna...@gmail.com wrote: Hi Otis, How did you manage that? I have an 8-core machine with 8GB of RAM and an 11GB index for 14M docs, with 5 updates every 30 minutes, but my replication kills everything. My segments are merged too often, so the full index is replicated and the caches are lost, and I have no idea what to do now. Some help would be brilliant; BTW I'm using Solr 1.4.

Sunnyfr, whether the replication is full or delta, the caches are lost completely. You could think about partitioning the index into separate Solr instances, updating one partition at a time, and performing a distributed search (see the example request below). --Noble Paul
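For reference, the distributed-search side of this is just the shards parameter (available since Solr 1.3); the host names and query here are placeholders, with each shard being one of the partitioned Solr instances:

http://box1:8983/solr/select?q=video&shards=box1:8983/solr,box2:8983/solr,box3:8983/solr&rows=10

Only the partition currently being updated loses its caches; the other shards keep serving from warm caches, which is the point of Noble's suggestion.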
Re: Any tips for indexing large amounts of data?
Brendan - yes, this is 64-bit Linux, and the JVM got a 5.5 GB heap, though it could have worked with less. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
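For illustration, a launch combining the details mentioned in this thread would look something like the line below. This is a guess, not Otis's actual command: it combines his heap figure with the -XX:+AggressiveOpts flag Glen recommended, and assumes the stock Jetty start.jar layout of the Solr example distribution:

java -Xms5500m -Xmx5500m -XX:+AggressiveOpts -jar start.jar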
Re: Any tips for indexing large amounts of data?
Hi Otis, Thanks for the reply. I am using a pretty vanilla approach right now and it's taking about 30 hours to build an index of about 5.5GB. Can you please tell me about some of the changes you made to optimize the indexing process? Thanks Brendan
Re: Any tips for indexing large amounts of data?
Hi Otis, Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How much heap are you giving your JVM? Thanks again Brendan
Re: Any tips for indexing large amounts of data?
Mike is right about the occasional slow-down, which appears as a pause and is due to large Lucene index segment merging. This should go away with newer versions of Lucene, where merging happens in the background. That said, we just indexed about 20MM documents on a single 8-core machine with 8 GB of RAM, resulting in a nearly 20 GB index. The whole process took a little less than 10 hours - that's over 550 docs/second. The vanilla approach, before some of our changes, apparently required several days to index the same amount of data. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Any tips for indexing large amounts of data?
Hi Otis, I understand this is a slightly off-track question, but I am just curious to know the performance of search on a 20 GB index file. What has been your observation? Regards, Eswar
Re: Any tips for indexing large amounts of data?
That's great. At what size of the index do you think we should look at partitioning the index file? Eswar

On Nov 21, 2007 12:57 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Just tried a search for "web" on this index - 1.1 seconds. This matches about 1MM of about 20MM docs. Redo the search, and it's 1 ms (cached). This is without any load nor serious benchmarking, clearly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Any tips for indexing large amounts of data?
Hi, Thanks for answering this question a while back. I have made some of the changes you suggested, i.e. not committing until I've finished indexing. What I am seeing, though, is that as the index gets larger (around 1GB), indexing takes a lot longer; in fact it slows to a crawl. Have you got any pointers as to what I might be doing wrong? Also, I was looking at using MultiCore Solr. Could this help in some way? Thank you Brendan
Re: Any tips for indexing large amounts of data?
There should be some slowdown in larger indices, as occasionally large segment merge operations must occur. However, this shouldn't really affect overall speed too much. You haven't really given us enough data to tell you anything useful. I would recommend trying to do the indexing via a webapp, to eliminate all your code as a possible factor. Then look for signs of what is happening when indexing slows. For instance, is Solr high on CPU? Is the machine thrashing? Etc. -Mike
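To illustrate Mike's "indexing via a webapp" suggestion: post the same documents to a stock Solr instance over HTTP and compare throughput. A hedged sketch -- the field names are placeholders, and the URL assumes the default example setup of that era, where updates were XML posted to /update:

curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
  --data-binary '<add><doc><field name="id">1</field><field name="body">hello</field></doc></add>'
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

If HTTP indexing stays fast while the embedded path crawls, the problem is in the client code rather than in Solr.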
Re: Any tips for indexing large amounts of data?
Thanks so much for your suggestions. I am attempting to index 550K docs at once, but have found I've had to break them up into smaller batches. Indexing seems to stop at around 47K docs (the index reaches 264M in size at this point). The index eventually grows to about 2GB. I am using embedded Solr and adding a document with code very similar to this:

private void addModel(Model model) throws IOException {
    UpdateHandler updateHandler = solrCore.getUpdateHandler();
    AddUpdateCommand addcmd = new AddUpdateCommand();
    DocumentBuilder builder = new DocumentBuilder(solrCore.getSchema());
    builder.startDoc();
    builder.addField("id", "Model:" + model.getUuid());
    builder.addField("class", "Model");
    builder.addField("uuid", model.getUuid());
    builder.addField("one_facet", model.getOneFacet());
    builder.addField("another_facet", model.getAnotherFacet());
    // .. other fields
    addcmd.doc = builder.getDoc();
    addcmd.allowDups = false;          // dedupe on the unique key
    addcmd.overwritePending = true;
    addcmd.overwriteCommitted = true;
    updateHandler.addDoc(addcmd);
}

I have other 'Model' objects I'm adding also. Thanks
RE: Any tips for indexing large amounts of data?
Usability consideration: not really answering your question, but I must comment that faceted navigation is very effective when searching over up to ~100K items, but becomes much less effective past 100K. You may want to consider breaking the 500K documents up into categories (a typical breadcrumb) of ~100K each for faceted browsing (see the example request after the quoted message below). Jeryl Cook

To: solr-user@lucene.apache.org From: [EMAIL PROTECTED] Subject: Any tips for indexing large amounts of data? Date: Wed, 31 Oct 2007 10:30:50 -0400

Hi, I am creating an index of approx 500K documents. I wrote an indexing program using embedded Solr (http://wiki.apache.org/solr/EmbeddedSolr) and am seeing probably a 10-fold increase in indexing speed. My problem, though, is that if I try to reindex, say, 20K docs at a time, it slows down considerably. I currently batch my updates in lots of 100, and between batches I close and reopen the connection to Solr like so:

private void openConnection(String environment) throws ParserConfigurationException, IOException, SAXException {
    System.setProperty("solr.solr.home", SOLR_HOME);
    solrConfig = new SolrConfig("solrconfig.xml");
    solrCore = new SolrCore(SOLR_HOME + "data/" + environment, solrConfig,
            new IndexSchema(solrConfig, "schema.xml"));
    logger.debug("Opened solr connection");
}

private void closeConnection() {
    solrCore.close();
    solrCore = null;
    logger.debug("Closed solr connection");
}

Does anyone have any pointers or see anything obvious I'm doing wrong? Thanks PS Sorry if this is posted twice.
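The category breakdown Jeryl describes maps directly onto Solr's facet parameters. A hedged example with made-up field and category names -- constrain to one breadcrumb category with fq, so facet counts are computed over ~100K docs rather than all 500K:

http://localhost:8983/solr/select?q=engine&fq=category:manuals&facet=true&facet.field=subcategory&rows=10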
Re: Any tips for indexing large amounts of data?
Greetings Brendan, In the solrconfig.xml file, under the updateHandler, is an autoCommit element. It looks like:

<autoCommit>
  <maxDocs>1000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

I would think you would see better performance by allowing autoCommit to handle the commit size instead of reopening the connection all the time. I believe maxTime is in milliseconds. Let us know if this helps, Scott Tabar
Re: Any tips for indexing large amounts of data?
: I currently batch my updates in lots of 100 and between batches I close and
: reopen the connection to solr like so:
:
: private void closeConnection() {
:     solrCore.close();
:     solrCore = null;
:     logger.debug("Closed solr connection");
: }
:
: Does anyone have any pointers or see anything obvious I'm doing wrong?

You haven't really shown us much about what you are actually doing to index your docs, so we can't see what might be taking time, but I can tell you that there is absolutely no reason whatsoever to close your SolrCore in the middle of a large indexing job. -Hoss
Re: Any tips for indexing large amounts of data?
: I would think you would see better performance by allowing auto commit
: to handle the commit size instead of reopening the connection all the
: time.

If your goal is fast indexing, don't use autoCommit at all ... just index everything, and don't commit until you are completely done. autoCommitting will slow your indexing down (the benefit being that more results will be visible to searchers as you proceed). -Hoss
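Putting Hoss's advice together with the embedded-Solr code from earlier in the thread, the whole job reduces to something like the sketch below. This assumes the Solr 1.x-era embedded API used above; CommitUpdateCommand and its boolean "optimize" constructor argument are from that API, but treat the details as approximate. Open the core once, add everything, commit exactly once at the end, with no autoCommit configured:

public void indexAll(List<Model> models) throws IOException {
    // The core is opened once, before the loop -- never closed/reopened per batch.
    for (Model model : models) {
        addModel(model);   // the addModel() shown earlier in this thread
    }
    // One commit for the entire job; searchers see nothing until now.
    CommitUpdateCommand commit = new CommitUpdateCommand(false); // false = don't optimize
    solrCore.getUpdateHandler().commit(commit);
}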