Guys, thank you for all the replies.

I think I figured out a partial solution to the problem on Friday night.
Adding a whole bunch of debug statements to the info stream showed that
every document is following the "update document" path instead of the "add
document" path. That means every document ID ends up in the "pending
deletes" queue, and Solr has to rescan its index on every commit for
potential deletions. This is single-threaded and seems to get progressively
slower as the index grows.
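
For anyone who wants to see the difference, this is roughly what the two
paths look like at the Lucene 3.x level. The IndexWriter methods are real
API, but the helper class and the unique-key field name "id" are just for
illustration; this is a sketch, not the Solr code path itself:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Sketch of the two Lucene 3.x indexing paths ("id" is an example unique key).
    public class AddVsUpdateSketch {

        // "add document" path: nothing is queued for deletion, so commits have
        // no per-document deletes to resolve against the existing index.
        static void addOnly(IndexWriter writer, Document doc) throws Exception {
            writer.addDocument(doc);
        }

        // "update document" path: queues a pending delete for the unique-key term;
        // the writer later has to apply those deletes, which is the single-threaded
        // rescan that gets slower as the index grows.
        static void addOrReplace(IndexWriter writer, Document doc, String id) throws Exception {
            writer.updateDocument(new Term("id", id), doc);
        }
    }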

Adding overwrite=false to the /update handler URL did NOT help: my debug
statements showed that documents still go through the updateDocument()
function with a non-null deleteTerm. So, as a temporary solution, I hacked
Lucene a little bit and set deleteTerm=null at the beginning of
updateDocument(), so it no longer calls applyDeletes().
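
To be concrete, the temporary patch amounts to something like this at the
top of IndexWriter.updateDocument(). This is a paraphrased sketch, not the
exact 3.4 source, and it is only viable when the same unique key is never
sent twice, because nothing gets deleted anymore:

    // Paraphrased sketch of the temporary hack inside Lucene's IndexWriter
    // (not the exact 3.4 source; shown only to illustrate the idea).
    public void updateDocument(Term term, Document doc) throws IOException {
        term = null;  // HACK: drop the delete term so no pending delete is queued
                      // and applyDeletes() never has to run for these documents
        // ... the rest of the original updateDocument() body is unchanged ...
    }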

This gave a 6-8x performance boost, and we can now index about 9 million
documents/hour (producing 20 GB of index every hour). Right now the index
is at 1 TB and growing, with no noticeable degradation in indexing speed.
This is decent, but the 24-core machine is still barely utilized :)

Now I think we are hitting a merge bottleneck, where all indexing threads
get paused, and ConcurrentMergeScheduler with 4 threads is not helping
much. I guess the changes on trunk would definitely help, but we will
likely stay on 3.4.
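
For reference, this is roughly how the merge scheduler is wired up on the
Lucene 3.4 side. This is a sketch using a plain IndexWriterConfig (Solr
actually configures it through solrconfig.xml); the index path and the
thread count of 4 are just example values:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import java.io.File;

    // Sketch: configuring ConcurrentMergeScheduler directly on a Lucene 3.4 IndexWriter.
    public class MergeSchedulerSketch {
        public static void main(String[] args) throws Exception {
            ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
            cms.setMaxThreadCount(4);  // merge threads allowed to run concurrently

            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_34,
                    new StandardAnalyzer(Version.LUCENE_34));
            cfg.setMergeScheduler(cms);

            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")), cfg);
            // ... indexing happens elsewhere; when merges fall behind, incoming
            // indexing threads get stalled, which is the bottleneck described above ...
            writer.close();
        }
    }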

Will dig more into the issue on Monday. Really curious to see why
"overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations.

Roman


