Re: Painfully slow indexing

2011-10-24 Thread Pranav Prakash
Hey guys,

Your responses are welcome, but I still haven't gained much improvement.

*Are you posting through HTTP/SOLRJ?*
I am using the RSolr gem, which internally uses Ruby's HTTP library to POST
documents to Solr.
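
For reference, here is a minimal sketch of the kind of RSolr call involved (the
URL and the document fields are placeholders, not my actual setup):

    require 'rsolr'

    solr = RSolr.connect(url: 'http://localhost:8983/solr')

    # each document is a Ruby Hash; RSolr serializes the batch to XML
    # and POSTs it to the /update handler
    docs = [
      { id: 1, title: 'First document' },
      { id: 2, title: 'Second document' }
    ]
    solr.add(docs)
    # solr.commit  # commit explicitly only when really needed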

*Your script time 'T' includes the time between sending the POST request and
receiving the successful response, right?*
Correct. It also includes the time taken to convert all those documents from
a Ruby Hash to XML.


 *generate the ready-for-indexing XML documents on a file system*
Alain, I have somewhere around 6M documents to index. Do you mean that I
should convert all of them into one XML file and then index that?

*are you calling commit after your batches, or doing an optimize by any chance?*
I am not optimizing, but I am performing an autocommit every 10 docs.
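
For illustration, a sketch of what a less aggressive, time-based autocommit
could look like in solrconfig.xml (the values are arbitrary examples, not my
current settings):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <!-- example values: commit at most once a minute, or every 100k docs -->
        <maxTime>60000</maxTime>
        <maxDocs>100000</maxDocs>
      </autoCommit>
    </updateHandler>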

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Fri, Oct 21, 2011 at 16:32, Simon Willnauer 
simon.willna...@googlemail.com wrote:

 On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash pra...@gmail.com wrote:
  Hi guys,
 
  I have set up a Solr instance and, upon attempting to index documents, the
  whole process is painfully slow. I will try to put as much info as I can in
  this mail. Please feel free to ask me for anything else that might be required.

  I am sending documents in batches not exceeding 2,000 documents. The size of
  each batch varies but is usually around 10-15 MiB. My indexing script tells me
  that Solr took T seconds to add N documents of size S. For the same data, the
  Solr log add QTime is QT. Some sample data:
 
      N         |         S           |    T     |  QT (ms)
  ----------------------------------------------------------
    390 docs    |   3,478,804 Bytes   |  14.5s   |  2297
    852 docs    |   6,039,535 Bytes   |  25.3s   |  4237
   1345 docs    |  11,147,512 Bytes   |  47s     |  8543
   1147 docs    |   9,457,717 Bytes   |  44s     |  2297
   1096 docs    |  13,058,204 Bytes   |  54.3s   |  8782
 
  The time T includes the time to convert an array of Hash objects into XML,
  POST it to Solr, and receive the acknowledgement from Solr. Clearly, there is
  a huge difference between T and QT. After a lot of effort, I have no clue why
  these times do not match.

  The server has 16 cores and 48 GiB RAM. JVM options are -Xms5000M -Xmx5000M
  -XX:+UseParNewGC.
 
  My indexing seems slow; the relevant portions of my config are as follows.
  On a related note, every document has one dynamic field. At this rate, it
  takes me ~30 hours to do a full index of my database. I would really
  appreciate the community's help in getting this indexing faster.
 
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">10</int>
      <int name="maxThreadCount">10</int>
    </mergeScheduler>
    <ramBufferSizeMB>2048</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>300</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <maxBufferedDocs>5</maxBufferedDocs>
    <termIndexInterval>256</termIndexInterval>
    <mergeFactor>10</mergeFactor>
    <useCompoundFile>false</useCompoundFile>
    <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnceExplicit">19</int>
      <int name="segmentsPerTier">9</int>
    </mergePolicy> -->
  </indexDefaults>

  <mainIndex>
    <unlockOnStartup>true</unlockOnStartup>
    <reopenReaders>true</reopenReaders>
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="maxCommitsToKeep">1</str>
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>
    <infoStream file="INFOSTREAM.txt">false</infoStream>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10</maxDocs>
    </autoCommit>
  </updateHandler>
 
 
  *Pranav Prakash*
 
  temet nosce
 
  Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
  Google http://www.google.com/profiles/pranny
 

 hey,

 are you calling commit after your batches, or doing an optimize by any chance?

 I would suggest you stream your documents to Solr and commit
 only if you really need to. Set your RAM buffer to something between
 256 and 320 MB and remove the maxBufferedDocs setting completely. You
 can also experiment with your merge settings a little; 10 merging
 threads seems like a lot. I know you have lots of CPUs, but I/O will be
 the bottleneck here.

 simon



Re: Painfully slow indexing

2011-10-21 Thread pravesh
Are you posting through HTTP/SOLRJ?

Your script time 'T' includes the time between sending the POST request and
receiving the successful response, right?

Try sending in small batches of 10-20. By the way, how many documents are you
indexing?

Regds
Pravesh



Re: Painfully slow indexing

2011-10-21 Thread Alain Rogister
As an alternative, I can suggest an approach that worked great for me:

- generate the ready-for-indexing XML documents on a file system
- use curl to feed them into Solr

I am not dealing with huge volumes, but was surprised at how *fast* Solr was
indexing my documents using this simple approach. Also, the workflow is easy
to manage. And the XML contents can easily be provisioned to multiple
systems e.g. for setting up test environments.
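
A rough sketch of that approach, assuming a default local Solr instance (the
URL and file path are placeholders):

    # POST a pre-generated XML file to Solr's update handler
    curl 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml' \
         --data-binary @/path/to/docs.xml

    # issue a single commit after all the files have been sent
    curl 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'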

Regards,

Alain

On Fri, Oct 21, 2011 at 9:46 AM, pravesh suyalprav...@yahoo.com wrote:

 Are you posting through HTTP/SOLRJ?

 Your script time 'T' includes the time between sending the POST request and
 receiving the successful response, right?

 Try sending in small batches of 10-20. By the way, how many documents are you
 indexing?

 Regds
 Pravesh




Re: Painfully slow indexing

2011-10-21 Thread Simon Willnauer
On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash pra...@gmail.com wrote:
 Hi guys,

 I have set up a Solr instance and, upon attempting to index documents, the
 whole process is painfully slow. I will try to put as much info as I can in
 this mail. Please feel free to ask me for anything else that might be required.

 I am sending documents in batches not exceeding 2,000 documents. The size of
 each batch varies but is usually around 10-15 MiB. My indexing script tells me
 that Solr took T seconds to add N documents of size S. For the same data, the
 Solr log add QTime is QT. Some sample data:

     N         |         S           |    T     |  QT (ms)
 ----------------------------------------------------------
   390 docs    |   3,478,804 Bytes   |  14.5s   |  2297
   852 docs    |   6,039,535 Bytes   |  25.3s   |  4237
  1345 docs    |  11,147,512 Bytes   |  47s     |  8543
  1147 docs    |   9,457,717 Bytes   |  44s     |  2297
  1096 docs    |  13,058,204 Bytes   |  54.3s   |  8782

 The time T includes the time to convert an array of Hash objects into XML,
 POST it to Solr, and receive the acknowledgement from Solr. Clearly, there is
 a huge difference between T and QT. After a lot of effort, I have no clue why
 these times do not match.

 The server has 16 cores and 48 GiB RAM. JVM options are -Xms5000M -Xmx5000M
 -XX:+UseParNewGC.

 My indexing seems slow; the relevant portions of my config are as follows.
 On a related note, every document has one dynamic field. At this rate, it
 takes me ~30 hours to do a full index of my database. I would really
 appreciate the community's help in getting this indexing faster.

 <indexDefaults>
   <useCompoundFile>false</useCompoundFile>
   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
     <int name="maxMergeCount">10</int>
     <int name="maxThreadCount">10</int>
   </mergeScheduler>
   <ramBufferSizeMB>2048</ramBufferSizeMB>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>300</maxFieldLength>
   <writeLockTimeout>1000</writeLockTimeout>
   <maxBufferedDocs>5</maxBufferedDocs>
   <termIndexInterval>256</termIndexInterval>
   <mergeFactor>10</mergeFactor>
   <useCompoundFile>false</useCompoundFile>
   <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
     <int name="maxMergeAtOnceExplicit">19</int>
     <int name="segmentsPerTier">9</int>
   </mergePolicy> -->
 </indexDefaults>

 <mainIndex>
   <unlockOnStartup>true</unlockOnStartup>
   <reopenReaders>true</reopenReaders>
   <deletionPolicy class="solr.SolrDeletionPolicy">
     <str name="maxCommitsToKeep">1</str>
     <str name="maxOptimizedCommitsToKeep">0</str>
   </deletionPolicy>
   <infoStream file="INFOSTREAM.txt">false</infoStream>
 </mainIndex>

 <updateHandler class="solr.DirectUpdateHandler2">
   <autoCommit>
     <maxDocs>10</maxDocs>
   </autoCommit>
 </updateHandler>


 *Pranav Prakash*

 temet nosce

 Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
 Google http://www.google.com/profiles/pranny


hey,

are you calling commit after your batches, or doing an optimize by any chance?

I would suggest you stream your documents to Solr and commit
only if you really need to. Set your RAM buffer to something between
256 and 320 MB and remove the maxBufferedDocs setting completely. You
can also experiment with your merge settings a little; 10 merging
threads seems like a lot. I know you have lots of CPUs, but I/O will be
the bottleneck here.
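
For illustration, those suggestions might look roughly like this in
solrconfig.xml (a sketch; the exact values, especially the merge thread
counts, are examples rather than a recommendation for your hardware):

    <indexDefaults>
      <!-- smaller RAM buffer as suggested; maxBufferedDocs removed entirely -->
      <ramBufferSizeMB>320</ramBufferSizeMB>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <!-- fewer concurrent merge threads to keep I/O from saturating -->
        <int name="maxMergeCount">4</int>
        <int name="maxThreadCount">3</int>
      </mergeScheduler>
    </indexDefaults>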

simon