Shishir, we have 35 million "documents", and should be doing about 5,000-10,000
new "documents" a day, but the "documents" are very small: 40 fields, each with
at most a few terms, and many with only a single term.

You may occasionally see some impact from top-level index merges, but those
should be very infrequent given your stated volumes.

For more concrete advice, you should also provide information on the size of
your documents and on your search volume.

JRJ

-----Original Message-----
From: Awasthi, Shishir [mailto:shishir.awas...@baml.com] 
Sent: Tuesday, November 01, 2011 10:58 PM
To: solr-user@lucene.apache.org
Subject: RE: large scale indexing issues / single threaded bottleneck

Roman,
How frequently do you update your index? I have a need to do real-time
add/delete of Solr documents at a rate of approximately 20/min.
The total number of documents is in the range of 4 million. Will there
be any performance issues?

Thanks,
Shishir

-----Original Message-----
From: Roman Alekseenkov [mailto:ralekseen...@gmail.com] 
Sent: Sunday, October 30, 2011 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies.

I think I figured out a partial solution to the problem on Friday night.
Adding a whole bunch of debug statements to the info stream showed that
every document follows the "update document" path instead of the "add
document" path. Meaning that all document IDs end up in the "pending
deletes" queue, and Solr has to rescan its index on every commit for
potential deletions. This step is single-threaded and seems to get
progressively slower as the index grows.
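The cost difference Roman describes can be sketched with a toy model. This is plain Python and NOT Lucene's actual code; it only illustrates the mechanism: an "update" registers a delete-by-term that must be resolved against the whole index at commit time, while a pure "add" registers nothing, so commit does no scan work.

```python
# Toy model of Lucene's add vs. update paths -- an illustration only,
# not the real implementation.

class ToyIndex:
    def __init__(self):
        self.docs = []             # the "index": list of (id, doc) pairs
        self.pending_deletes = []  # delete-by-term queue filled by updates

    def add_document(self, doc_id, doc):
        # "add" path: append only; no delete term is registered
        self.docs.append((doc_id, doc))

    def update_document(self, doc_id, doc):
        # "update" path: queue a delete-by-term, then add the new version
        self.pending_deletes.append(doc_id)
        self.docs.append((doc_id, doc))

    def commit(self):
        # applyDeletes(): each pending term forces a scan over the index;
        # this is the single-threaded cost that grows with index size
        scanned = 0
        for _term in self.pending_deletes:
            scanned += len(self.docs)
        # resolve the deletes: keep only the newest doc for each id
        latest = {}
        for doc_id, doc in self.docs:
            latest[doc_id] = doc   # later entries overwrite earlier ones
        self.docs = list(latest.items())
        self.pending_deletes.clear()
        return scanned             # total docs scanned while applying deletes

# 1000 updates over 100 distinct ids: commit scans the index per delete term
idx = ToyIndex()
for n in range(1000):
    idx.update_document(n % 100, f"v{n}")
work_update = idx.commit()

# 1000 pure adds with unique ids: commit has no pending deletes to apply
idx2 = ToyIndex()
for n in range(1000):
    idx2.add_document(n, f"v{n}")
work_add = idx2.commit()
```

Under this toy model the update-path commit scans 1000 x 1000 = 1,000,000 doc entries, while the add-path commit scans none, which matches the symptom of commits slowing down as the index grows.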

Adding overwrite=false to the URL for the /update handler did NOT help:
my debug statements showed that requests still reach the updateDocument()
function with a non-null deleteTerm. So I hacked Lucene a little and, as
a temporary workaround, set deleteTerm=null at the beginning of
updateDocument(); it no longer calls applyDeletes().
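For reference, the overwrite flag Roman mentions is passed on the /update request itself. A minimal sketch of such a request follows; the host, port, and document body are placeholders, and the request is only constructed here, not sent, since there is no live Solr in this example:

```python
# Sketch: an /update request carrying overwrite=false.
# host/port and the <add> body are hypothetical placeholders.
import urllib.request
from urllib.parse import urlencode

params = urlencode({"overwrite": "false", "commit": "false"})
url = "http://localhost:8983/solr/update?" + params
body = b'<add><doc><field name="id">doc-1</field></doc></add>'

# Request with a body defaults to an HTTP POST; we build it without sending.
req = urllib.request.Request(
    url, data=body, headers={"Content-Type": "text/xml"})
```

Roman's observation is that even with this flag set, his Solr 3.4 build still routed documents through updateDocument() with a delete term, which is why he resorted to patching Lucene directly.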

This gave a 6-8x performance boost, and now we can index about 9 million
documents/hour (producing 20 GB of index every hour). Right now the index
is at 1 TB and growing, with no noticeable degradation in indexing speed.
This is decent, but the 24-core machine is still barely utilized :)

Now I think it's hitting a merge bottleneck, where all indexing threads
get paused, and ConcurrentMergeScheduler with 4 threads is not helping
much. I guess the changes on trunk would definitely help, but we will
likely stay on 3.4.
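For context, the merge scheduler Roman refers to is configured in solrconfig.xml. A sketch along these lines; note the element placement and the named-argument syntax shown are the later (4.x-style) <indexConfig> form, while in Solr 3.x the <mergeScheduler> element sits under <mainIndex>/<indexDefaults>, so treat the exact layout as an assumption to check against your version's example solrconfig:

```xml
<!-- Sketch only: 4.x-style syntax; placement differs in 3.x -->
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">4</int>
  </mergeScheduler>
</indexConfig>
```

Raising the thread count only helps up to a point: once merges fall behind, Lucene stalls the incoming indexing threads, which matches the pauses Roman is seeing.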

Will dig more into the issue on Monday. Really curious to see why
"overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations.

Roman



--
View this message in context:
http://lucene.472066.n3.nabble.com/large-scale-indexing-issues-single-threaded-bottleneck-tp3461815p3466523.html
Sent from the Solr - User mailing list archive at Nabble.com.

