Re: Faster Solr Indexing
Hi Erick, Dmitry, and Mikhail, thank you all for your time. I tried all of the suggestions below and am happy to report that indexing speeds have improved. There were several confounding problems, including:

- a bank of (~20) regexes that were poorly optimized and recompiled at each indexing step
- single-threaded indexing
- not using StreamingUpdateSolrServer
- excessive logging

However, the biggest bottleneck was two Lucene searches (across ~9MM docs) at the time of building the Solr document. Indexing sped up after precomputing these values offline.

Thank you all for your help.

best

Peyman

On Mar 12, 2012, at 10:58 AM, Erick Erickson wrote:
> [...]
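A minimal sketch of the regex fix described above, with a hypothetical pattern and method name: the point is to hoist Pattern.compile() out of the per-document path, since compiling a pattern is far more expensive than matching with one.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FieldCleaner {
        // Compiled once per JVM, not once per document.
        private static final Pattern WHITESPACE_RUNS = Pattern.compile("\\s{2,}");

        // The slow variant would call Pattern.compile("\\s{2,}") inside this
        // method, recompiling the automaton for every one of the 12MM docs.
        public static String normalize(String raw) {
            Matcher m = WHITESPACE_RUNS.matcher(raw);
            return m.replaceAll(" ");
        }
    }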
Re: Faster Solr Indexing
How have you determined that it's the Solr add? By timing the call on the SolrJ side or by looking at the machine where Solr is running? This is the very first thing you have to answer. You can get a rough idea with any simple profiler (say, Activity Monitor on a Mac, Task Manager on a Windows box). The point is just to see whether the indexer machine is being well utilized. I'd guess it's not, actually.

One quick experiment would be to try using StreamingUpdateSolrServer (SUSS), which has the capability of having multiple threads fire at Solr at once. It is possible that your time is spent waiting for I/O.

Once you have that question answered, you can refine. But until you know which side of the wire the problem is on, you're flying blind.

Both Yandong and Peyman: these times are quite surprising. Running everything locally on my laptop, I'm indexing between 5-7K documents/second. The source is the Wikipedia dump.

I'm particularly surprised by the difference Yandong is seeing based on the various analysis chains. The first thing I'd back off is the MaxPermSize; 512M is huge for this parameter. If you're getting that kind of time differential and your CPU isn't pegged, you're probably swapping, in which case you need to give the processes more memory. I'd just take the MaxPermSize out completely as a start.

Not sure if you've seen this page, but something there might help:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

But throw a profiler at the indexer as a first step, just to see where the problem is, CPU or I/O.

Best
Erick

On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin wrote:
> [...]
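A minimal sketch of the SUSS experiment Erick suggests, against the SolrJ 3.x API; the URL, queue size, and thread count are illustrative. SUSS extends CommonsHttpSolrServer, so it is a drop-in replacement on the add path.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SussExperiment {
        public static void main(String[] args) throws Exception {
            // Buffer up to 100 docs internally and drain them over 4 concurrent
            // connections, so add() returns without waiting on the wire.
            SolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "test-1");
            doc.addField("content", "hello world");
            server.add(doc);   // queued; a background thread performs the POST

            server.commit();   // waits for the queue to drain, then commits
        }
    }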
Re: Faster Solr Indexing
Dmitry, if you start to speak about logging, don't forget to say that JDK logging is really not performant, though it is the default for 3.x. Logback is much faster.

Peyman,

1. Shingles have performance implications; that is, they can cost a lot. Why are term positions and phrase queries not enough for you?
2. Some time ago there was a similar thread caused by superfluous shingling, so it's worth double-checking that you don't produce more shingles than you really need (Captain Obvious speaking).
3. When I have a problem with performance, the first thing I do is run a profiler or sampler.
4. The way to look inside Lucene indexing is to enable infoStream; you'll get a lot of info.
5. Are all of your CPU cores utilized? If they aren't, index with multiple threads; it scales. Post several indexing requests in parallel. Be aware that DIH doesn't work with multiple threads yet (SOLR-3011).
6. Some time ago I needed very high throughput and fell into the trivial producer-consumer trap. The indexing app (it was DIH, hacked a little) pulled data from JDBC, but during that time Solr indexing was idle; then it pushed the constructed documents into Solr for indexing, but did so synchronously, sitting idle while Solr consumed them. As a result, the overall time was the sum of the producing and consuming times. So I organized an async buffer and reduced the total to the maximum of those two times (a sketch of such a buffer follows this message). Double-check that you have the maximum of producing and consuming, not the sum. I used perf4j to trace those times.
7. As your data is huge, you can try to employ cluster magic: spread your docs between two Solr instances, then search them in parallel (SolrShards, SolrCloud); I never did it myself. If you don't want to search in parallel, you can copy index shards between boxes to have a full replica on each box, but I haven't heard of that working out of the box.

Regards

On Sun, Mar 11, 2012 at 7:27 PM, Dmitry Kan wrote:
> [...]
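A minimal sketch of the async buffer Mikhail describes in point 6, using a standard java.util.concurrent.BlockingQueue so the JDBC producer and the Solr-posting consumer overlap instead of alternating; the queue size, batch size, and document source are illustrative.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AsyncBuffer {
        private static final SolrInputDocument POISON = new SolrInputDocument();

        public static void index(final SolrServer server,
                                 Iterable<SolrInputDocument> source)
                throws InterruptedException {
            // Bounded buffer: the producer only blocks when the consumer lags.
            final BlockingQueue<SolrInputDocument> queue =
                new ArrayBlockingQueue<SolrInputDocument>(10000);

            Thread consumer = new Thread(new Runnable() {
                public void run() {
                    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                    try {
                        while (true) {
                            SolrInputDocument doc = queue.take();
                            if (doc == POISON) break;     // producer is done
                            batch.add(doc);
                            if (batch.size() == 1000) {   // post in batches
                                server.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) server.add(batch);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            consumer.start();

            // Producer keeps fetching while the consumer posts, so the two overlap.
            for (SolrInputDocument doc : source) {
                queue.put(doc);
            }
            queue.put(POISON);
            consumer.join();
        }
    }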
Re: Faster Solr Indexing
One approach we have taken was decreasing the Solr logging level for the posting session, described here (implemented for 1.4, but should be easy to port to 3.x):

http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html

On 3/11/12, Yandong Yao wrote:
> [...]

--
Regards,

Dmitry Kan
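The same idea can be sketched with java.util.logging (the JDK logging that Mikhail notes is the 3.x default). Note the assumption: this only affects Solr when it runs in the indexer's JVM (e.g. EmbeddedSolrServer); for a standalone server the equivalent change belongs in its logging config or the admin UI.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class QuietPosting {
        // Drop org.apache.solr to WARNING for the posting session, then restore.
        public static void postQuietly(Runnable postingSession) {
            Logger solrLog = Logger.getLogger("org.apache.solr");
            Level saved = solrLog.getLevel();   // may be null (inherited level)
            solrLog.setLevel(Level.WARNING);    // only warnings/errors during the batch
            try {
                postingSession.run();           // do the batch posting here
            } finally {
                solrLog.setLevel(saved);        // null restores the inherited level
            }
        }
    }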
Re: Faster Solr Indexing
I have similar issues using DIH, and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) consumes most of the time when indexing 10K rows (each row is about 70K):

- DIH nextRow takes about 10 seconds in total
- If the index uses a whitespace tokenizer and lower-case filter, then the addDoc() calls take about 80 seconds
- If the index uses a whitespace tokenizer, lower-case filter, and WDF, then addDoc takes about 112 seconds
- If the index uses a whitespace tokenizer, lower-case filter, WDF, and Porter stemmer, then addDoc takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am doing something wrong, or is there any way to improve the performance of addDoc()?

Thanks very much in advance!

Following is the configuration:
1) JVM: -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from solr's example/solr directory; the archive has stripped the XML tags here, leaving only the bare values): false 10 64 2147483647 1000 1 native

2012/3/11 Peyman Faratin
> [...]
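One way to confirm where Yandong's extra seconds go is to time the analysis chain alone, outside Solr. A rough harness against the Lucene 3.5 API might look like this; the sample text and iteration count are arbitrary, and the WDF and stemmer variants can be swapped in the same way to compare chains.

    import java.io.StringReader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.util.Version;

    public class ChainTimer {
        public static void main(String[] args) throws Exception {
            String text = "Some representative document body would go here";
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++) {
                // whitespace tokenizer + lower-case filter, as in the first variant
                TokenStream ts = new LowerCaseFilter(Version.LUCENE_35,
                    new WhitespaceTokenizer(Version.LUCENE_35, new StringReader(text)));
                while (ts.incrementToken()) { /* consume tokens */ }
                ts.end();
                ts.close();
            }
            System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
        }
    }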
Faster Solr Indexing
Hi,

I am trying to index 12MM docs faster than is currently happening in Solr (using solrj). We have identified Solr's add method as the bottleneck (and not commit, which is tuned OK through mergeFactor, maxRamBufferSize, and JVM RAM).

Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming removing the HTTP overhead would speed things up with embedding), but the difference is marginal.

The docs being indexed are on average 20 fields long, mostly indexed but none stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice of Solr Enterprise Search Server, p. 305, section "The Solution: Shingling").

Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these 2 fields, which halved the indexing time). We've tried compressed=true also, but that is not working.

Any guidance on how to support our application logic (without having to change the schema too much) and improve the indexing speed (from the current 212 days for 12MM docs) would be much appreciated.

thank you

Peyman
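For reference, the batched add-and-commit pattern described above might look like this in SolrJ 3.x with CommonsHttpSolrServer. The field names follow the message (shingledContent is filled server-side by copyField, so the client never sends it); the document count and source method are hypothetical stand-ins.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int id = 0; id < 12000; id++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id);
                doc.addField("content", loadContent(id)); // copyField builds shingledContent
                batch.add(doc);
                if (batch.size() == 1000) {               // batch the adds...
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) server.add(batch);
            server.commit();                              // ...and commit once at the end
        }

        // Hypothetical stand-in for however the 12MM source docs are loaded.
        private static String loadContent(int id) {
            return "document body " + id;
        }
    }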