On 2/7/2013 9:29 PM, Alexandre Rafalovitch wrote:
Hello,

What actually happens when using soft (as opposed to hard) commit?

I understand the very high-level picture (documents become available
faster, but you may lose them on power loss).
I don't care about low-level implementation details.

But I am trying to understand what is happening on the medium level of
details.

For example, what are the stages of a document if we are using all the
available transaction log, soft commit, and hard commit options? It feels
like there are three stages:
*) Uncommitted (soft or hard): accessible only via direct real-time get?
*) Soft-committed: accessible through all search operations? (but not on
disk? but where is it? in memory?)
*) Hard-committed: all the same as soft-committed but it is now on disk

Similarly, in the performance section of the Wiki, it says: "A commit
(including a soft commit) will free up almost all heap memory" - why would
a soft commit free up heap memory? I thought it was not flushed to disk.

Also, with soft commits and the transaction log enabled, doesn't the
transaction log allow replaying/recovering the latest state after a crash?
I believe that's what a transaction log does for a database. If not, how
does one recover, if at all?

And where does openSearcher=false fit into that? Does it cause
inconsistent results somehow?

I am missing something, but I am not sure what or where. Any pointers in
the right direction would be appreciated.

Let's see if I can answer your questions without giving you incorrect information.

New indexed content is not searchable until you open a new searcher, regardless of the type of commit that you do.

A hard commit will close the current transaction log and start a new one. It will also instruct the Directory implementation to flush to disk. If you specify openSearcher=false, then the content that has just been committed will NOT be searchable, as discussed in the previous paragraph. The existing searcher will remain open and continue to serve queries against the same index data.

A soft commit does not flush the new content to disk, but it does open a new searcher. I'm sure that the amount of memory available for caching this content is not large, so it's possible that if you do a lot of indexing with soft commits and your hard commits are too infrequent, you'll end up flushing part of the cached data to disk anyway. I'd love to hear from a committer about this, because I could be wrong.
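
To make the visibility rules above concrete, here is a toy Python model. It is only a sketch of the behavior as I've described it, not how Solr or Lucene is actually implemented; the class and attribute names are made up for illustration, and it ignores Directory-level caching entirely.

```python
class ToyIndex:
    """Toy model of commit visibility. Hypothetical names, not Solr code."""

    def __init__(self):
        self.pending = []    # indexed, but not yet committed in any way
        self.committed = []  # covered by a soft or hard commit
        self.visible = []    # what the currently open searcher can see
        self.durable = []    # what a hard commit has flushed to disk
        self.tlog = []       # contents of the current transaction log

    def add(self, doc):
        # Every update is recorded in the transaction log first.
        self.tlog.append(doc)
        self.pending.append(doc)

    def _open_searcher(self):
        # A new searcher sees everything committed so far.
        self.visible = list(self.committed)

    def soft_commit(self):
        # No flush to disk, but a new searcher is opened.
        self.committed += self.pending
        self.pending = []
        self._open_searcher()

    def hard_commit(self, open_searcher=True):
        # Flush to disk, then close the current log and start a new one.
        self.committed += self.pending
        self.pending = []
        self.durable = list(self.committed)
        self.tlog = []
        if open_searcher:
            self._open_searcher()
```

In this model, a document added and then hard committed with open_searcher=False ends up in durable but not in visible, while a document that is only soft committed ends up in visible but not in durable, and it stays in the transaction log.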

There's a caveat with that 'flush to disk' operation -- the default Directory implementation in the Solr example config, which is NRTCachingDirectoryFactory, will cache the last few megabytes of indexed data and not flush it to disk even with a hard commit. If your commits are small, then the net result is similar to a soft commit. If the server or Solr were to crash, the transaction logs would be replayed on Solr startup, recovering those last few megabytes. The transaction log may also recover documents that were soft committed, but I'm not 100% sure about that.

To take full advantage of NRT functionality, you can commit as often as you like with soft commits. On some reasonable interval, say every one to fifteen minutes, you can issue a hard commit with openSearcher set to false, to flush things to disk and cycle through transaction logs before they get huge. Solr will keep a few of the transaction logs around, and if they are huge, it can take a long time to replay them. You'll want to choose a hard commit interval that doesn't create giant transaction logs.
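
If you issue these commits explicitly from a client, the two kinds of request differ only in their URL parameters. Here is a minimal Python sketch that builds the URLs using the softCommit, commit and openSearcher update parameters. The helper name and the base URL are assumptions for illustration, so adjust them for your own host and core.

```python
from urllib.parse import urlencode


def commit_url(base_url, soft=False, open_searcher=True):
    """Build a Solr update URL that triggers a commit (hypothetical helper)."""
    if soft:
        # Soft commit: makes new documents searchable without a flush to disk.
        params = {"softCommit": "true"}
    else:
        # Hard commit: openSearcher=false flushes without opening a new searcher.
        params = {"commit": "true", "openSearcher": str(open_searcher).lower()}
    return base_url.rstrip("/") + "/update?" + urlencode(params)


# Assumed base URL for a local example install.
BASE = "http://localhost:8983/solr/collection1"
frequent_soft_commit = commit_url(BASE, soft=True)
periodic_hard_commit = commit_url(BASE, open_searcher=False)
```

The idea would be to send the first URL as often as you need new content to become visible, and the second one on the longer interval to keep the transaction logs small.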

If any of the info I've given here is wrong, someone should correct me!

Thanks,
Shawn
