Re: Painfully slow indexing
Hey guys,

Your responses are welcome, but I still haven't gained a lot of improvement.

*Are you posting through HTTP/SOLRJ?*
I am using the RSolr gem, which internally uses Ruby's HTTP library to POST documents to Solr.

*Your script time 'T' includes time between sending POST request -to- the response fetched after successful response right??*
Correct. It also includes the time taken to convert all those documents from a Ruby Hash to XML.

*generate the ready-for-indexing XML documents on a file system*
Alain, I have somewhere around 6M documents for indexing. Do you mean to say that I should convert all of them into one XML file and then index that?

*are you calling commit after your batches or do an optimize by any chance?*
I am not optimizing, but I am performing an autocommit every 10 docs.

*Pranav Prakash*
temet nosce
Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Fri, Oct 21, 2011 at 16:32, Simon Willnauer simon.willna...@googlemail.com wrote:

> hey,
>
> are you calling commit after your batches or do an optimize by any
> chance? I would suggest that you stream your documents to Solr and
> commit only if you really need to. Set your RAM buffer to something
> between 256 and 320 MB and remove the maxBufferedDocs setting
> completely. You can also experiment with your merge settings a little;
> 10 merging threads seems like a lot. I know you have lots of CPU, but
> IO will be the bottleneck here.
>
> simon
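For illustration, a minimal sketch of the client side of Simon's advice, using RSolr as the original poster does. `each_batch` is a hypothetical helper that yields arrays of document hashes; batches are streamed to Solr and a single commit is issued at the end, instead of autocommitting every 10 docs:

  # Minimal sketch: stream batches to Solr, commit once at the end.
  # Assumes the RSolr gem; `each_batch` is a hypothetical helper that
  # yields arrays of document hashes (e.g. read from the DB in chunks).
  require 'rsolr'

  solr = RSolr.connect url: 'http://localhost:8983/solr'

  each_batch(size: 1000) do |docs|
    solr.add docs          # POST the batch; no commit per batch
  end

  solr.commit              # single commit once all documents are in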
Re: Painfully slow indexing
Are you posting through HTTP/SOLRJ?

Your script time 'T' includes the time between sending the POST request -to- fetching the response after a successful request, right?

Try sending in small batches, like 10-20. BTW, how many documents are you indexing?

Regds
Pravesh

--
View this message in context: http://lucene.472066.n3.nabble.com/Painfully-slow-indexing-tp3434399p3440175.html
Sent from the Solr - User mailing list archive at Nabble.com.
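Since the thread turns on the gap between the script's wall-clock time T and Solr's reported QTime, one quick way to narrow it down is to time the two client-side phases separately. A minimal sketch, where `fetch_docs` and `docs_to_solr_xml` are hypothetical stand-ins for the script's own batch-loading and Hash-to-XML conversion steps:

  # Minimal sketch: split time T into its two phases to see where the
  # gap against Solr's QTime comes from. `fetch_docs` and
  # `docs_to_solr_xml` are hypothetical stand-ins for the script's own
  # batch-loading and conversion steps.
  require 'benchmark'
  require 'net/http'

  uri  = URI('http://localhost:8983/solr/update')
  docs = fetch_docs

  xml = nil
  convert_time = Benchmark.realtime { xml = docs_to_solr_xml(docs) }
  post_time = Benchmark.realtime do
    Net::HTTP.start(uri.hostname, uri.port) do |http|
      req = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'text/xml')
      req.body = xml
      http.request(req)
    end
  end

  puts format('convert: %.1fs  post+response: %.1fs', convert_time, post_time)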
Re: Painfully slow indexing
As an alternative, I can suggest this approach, which worked great for me:

- generate the ready-for-indexing XML documents on a file system
- use curl to feed them into Solr

I am not dealing with huge volumes, but was surprised at how *fast* Solr was indexing my documents using this simple approach. Also, the workflow is easy to manage, and the XML contents can easily be provisioned to multiple systems, e.g. for setting up test environments.

Regards,
Alain

On Fri, Oct 21, 2011 at 9:46 AM, pravesh suyalprav...@yahoo.com wrote:

> Are you posting through HTTP/SOLRJ?
>
> Your script time 'T' includes the time between sending the POST request
> -to- fetching the response after a successful request, right?
>
> Try sending in small batches, like 10-20. BTW, how many documents are you
> indexing?
>
> Regds
> Pravesh
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Painfully-slow-indexing-tp3434399p3440175.html
> Sent from the Solr - User mailing list archive at Nabble.com.
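For concreteness, a minimal sketch of Alain's file-then-feed workflow. It assumes the `builder` gem for XML generation; `each_batch` is again a hypothetical helper yielding arrays of document hashes such as { id: 42, title: 'foo' }:

  # Minimal sketch: write each batch as a Solr <add> XML file on disk.
  # Assumes the `builder` gem; `each_batch` is a hypothetical helper.
  require 'builder'

  n = 0
  each_batch(size: 1000) do |docs|
    n += 1
    File.open(format('batches/batch_%04d.xml', n), 'w') do |f|
      xml = Builder::XmlMarkup.new(target: f, indent: 2)
      xml.tag!('add') do
        docs.each do |doc|
          xml.tag!('doc') do
            doc.each { |name, value| xml.tag!('field', value.to_s, name: name) }
          end
        end
      end
    end
  end

Each file can then be fed in with the standard update handler, e.g. curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' --data-binary @batches/batch_0001.xml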
Re: Painfully slow indexing
On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash pra...@gmail.com wrote:

> Hi guys,
>
> I have set up a Solr instance, and upon attempting to index documents, the
> whole process is painfully slow. I will try to put as much info as I can
> in this mail. Please feel free to ask me anything else that might be
> required.
>
> I am sending documents in batches not exceeding 2,000. The size of each
> batch varies but is usually around 10-15 MiB. My indexing script tells me
> that Solr took T seconds to add N documents of size S. For the same data,
> the add QTime in the Solr log is QT. Some sample data:
>
> N         | S                | T     | QT (ms)
> ----------|------------------|-------|--------
> 390 docs  | 3,478,804 bytes  | 14.5s | 2297
> 852 docs  | 6,039,535 bytes  | 25.3s | 4237
> 1345 docs | 11,147,512 bytes | 47s   | 8543
> 1147 docs | 9,457,717 bytes  | 44s   | 2297
> 1096 docs | 13,058,204 bytes | 54.3s | 8782
>
> The time T includes the time to convert an array of Hash objects into
> XML, POST it to Solr, and receive the acknowledgement from Solr. Clearly,
> there is a huge difference between T and QT. After a lot of effort, I
> have no clue why these times do not match.
>
> The server has 16 cores and 48 GiB RAM. JVM options are
> -Xms5000M -Xmx5000M -XX:+UseParNewGC
>
> I believe my indexing is slow. Relevant portions of my config
> (solrconfig.xml) are as follows. On a related note, every document has
> one dynamic field. At this rate, it takes me ~30 hrs to do a full index
> of my database. I would really appreciate the kindness of the community
> in getting this indexing faster.
>
> <indexDefaults>
>   <useCompoundFile>false</useCompoundFile>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxMergeCount">10</int>
>     <int name="maxThreadCount">10</int>
>   </mergeScheduler>
>   <ramBufferSizeMB>2048</ramBufferSizeMB>
>   <maxMergeDocs>2147483647</maxMergeDocs>
>   <maxFieldLength>300</maxFieldLength>
>   <writeLockTimeout>1000</writeLockTimeout>
>   <maxBufferedDocs>5</maxBufferedDocs>
>   <termIndexInterval>256</termIndexInterval>
>   <mergeFactor>10</mergeFactor>
>   <useCompoundFile>false</useCompoundFile>
>   <!--
>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>     <int name="maxMergeAtOnceExplicit">19</int>
>     <int name="segmentsPerTier">9</int>
>   </mergePolicy>
>   -->
> </indexDefaults>
>
> <mainIndex>
>   <unlockOnStartup>true</unlockOnStartup>
>   <reopenReaders>true</reopenReaders>
>   <deletionPolicy class="solr.SolrDeletionPolicy">
>     <str name="maxCommitsToKeep">1</str>
>     <str name="maxOptimizedCommitsToKeep">0</str>
>   </deletionPolicy>
>   <infoStream file="INFOSTREAM.txt">false</infoStream>
> </mainIndex>
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>10</maxDocs>
>   </autoCommit>
> </updateHandler>
>
> *Pranav Prakash*
> temet nosce
> Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
> Google http://www.google.com/profiles/pranny

hey,

are you calling commit after your batches or do an optimize by any chance?
I would suggest that you stream your documents to Solr and commit only if
you really need to. Set your RAM buffer to something between 256 and 320 MB
and remove the maxBufferedDocs setting completely. You can also experiment
with your merge settings a little; 10 merging threads seems like a lot. I
know you have lots of CPU, but IO will be the bottleneck here.

simon
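For concreteness, a sketch of how the settings above might look after applying Simon's suggestions. The specific values (a 320 MB RAM buffer, 3 merge threads, a 60-second time-based autocommit) are illustrative assumptions, not tested recommendations:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">4</int>   <!-- illustrative: far fewer than 10 -->
      <int name="maxThreadCount">3</int>  <!-- illustrative: merging is IO-bound -->
    </mergeScheduler>
    <ramBufferSizeMB>320</ramBufferSizeMB> <!-- Simon's 256-320 MB range -->
    <!-- maxBufferedDocs removed entirely, as suggested -->
    <mergeFactor>10</mergeFactor>
  </indexDefaults>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime> <!-- illustrative: commit at most once a minute,
                                    instead of every 10 docs -->
    </autoCommit>
  </updateHandler>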