On 2/7/2013 9:29 PM, Alexandre Rafalovitch wrote:
Hello,
What actually happens when using soft (as opposed to hard) commit?
I understand the very high-level picture (documents become available
faster, but you may lose them on power loss).
I don't care about low-level implementation details.
But I am trying to understand what is happening at a medium level of
detail.
For example, what are the stages of a document if we are using all of the
available transaction log, soft commit, and hard commit options? It feels
like there are three stages:
*) Uncommitted (soft or hard): accessible only via direct real-time get?
*) Soft-committed: accessible through all search operations? (but not on
disk? but where is it? in memory?)
*) Hard-committed: all the same as soft-committed but it is now on disk
Similarly, the performance section of the Wiki says: "A commit (including a
soft commit) will free up almost all heap memory" - why would soft commit
free up heap memory? I thought it was not flushed to disk.
Also, with soft commits and the transaction log enabled, doesn't the
transaction log allow replaying/recovering the latest state after a crash?
I believe that's what a transaction log does for a database. If not, how
does one recover, if at all?
And where does openSearcher=false fit into that? Does it cause
inconsistent results somehow?
I am missing something, but I am not sure what or where. Any pointers in the
right direction would be appreciated.
Let's see if I can answer your questions without giving you incorrect
information.
New indexed content is not searchable until you open a new searcher,
regardless of the type of commit that you do.
A hard commit will close the current transaction log and start a new
one. It will also instruct the Directory implementation to flush to
disk. If you specify openSearcher=false, then the content that has just
been committed will NOT be searchable, as discussed in the previous
paragraph. The existing searcher will remain open and continue to serve
queries against the same index data.
A soft commit does not flush the new content to disk, but it does open a
new searcher. I believe the amount of memory available for caching
this content is not large, so it's possible that if you do a lot of
indexing with soft commits and your hard commits are too infrequent,
you'll end up flushing part of the cached data to disk anyway. I'd love
to hear from a committer about this, because I could be wrong.
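To make the distinction above concrete, here is a toy model in Python. It is
only a sketch of the behavior described in this message, not Solr code: the
class ToyCore and all of its names are made up for illustration, and it
ignores any caching done by the Directory implementation.

```python
class ToyCore:
    """Toy model of the commit behavior described above; not Solr code."""

    def __init__(self):
        self.indexed = []      # every document added so far
        self.visible = set()   # what the currently open searcher can see
        self.on_disk = set()   # what a hard commit has flushed
        self.tlog = []         # entries in the current transaction log

    def add(self, doc_id):
        # Adding a document does not make it searchable by itself.
        self.indexed.append(doc_id)
        self.tlog.append(doc_id)

    def soft_commit(self):
        # Opens a new searcher, but does not flush to disk.
        self.visible = set(self.indexed)

    def hard_commit(self, open_searcher=True):
        # Flushes to disk and starts a new transaction log.
        self.on_disk = set(self.indexed)
        self.tlog = []
        # Only a new searcher makes the content searchable.
        if open_searcher:
            self.visible = set(self.indexed)

    def searchable(self, doc_id):
        return doc_id in self.visible


core = ToyCore()
core.add("doc1")
core.hard_commit(open_searcher=False)
print(core.searchable("doc1"), "doc1" in core.on_disk)
core.soft_commit()
print(core.searchable("doc1"))
```

In this model a hard commit with open_searcher=False leaves "doc1" on disk
but not searchable, and the later soft commit is what makes it visible.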
There's a caveat with that 'flush to disk' operation -- the default
Directory implementation in the Solr example config, which is
NRTCachingDirectoryFactory, will cache the last few megabytes of indexed
data and not flush it to disk even with a hard commit. If your commits
are small, then the net result is similar to a soft commit. If the
server or Solr were to crash, the transaction logs would be replayed on
Solr startup, recovering those last few megabytes. The transaction log
may also recover documents that were soft committed, but I'm not 100%
sure about that.
To take full advantage of NRT functionality, you can commit as often as
you like with soft commits. At some reasonable interval, say every one
to fifteen minutes, you can issue a hard commit with openSearcher set to
false, to flush things to disk and cycle through transaction logs before
they get huge. Solr will keep a few of the transaction logs around, and
if they are huge, it can take a long time to replay them. You'll want
to choose a hard commit interval that doesn't create giant transaction logs.
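If you issue the commits explicitly from a client, the two kinds of request
could be built roughly like this. This is a minimal Python sketch: the base
URL and core name are assumptions about your setup, and the helper names
commit_url and send_commit are made up. The commit, softCommit and
openSearcher parameters are the ones the update handler accepts.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of the core; adjust for your installation.
BASE_URL = "http://localhost:8983/solr/collection1"


def commit_url(base_url, soft=False, open_searcher=True):
    """Build an update URL that asks Solr for a soft or hard commit."""
    params = {"commit": "true"}
    if soft:
        # Soft commit: new searcher, no flush to disk.
        params["softCommit"] = "true"
    elif not open_searcher:
        # Hard commit that flushes but keeps the existing searcher.
        params["openSearcher"] = "false"
    return base_url + "/update?" + urlencode(params)


def send_commit(url):
    """Send the commit request; needs a running Solr at the given URL."""
    with urlopen(url) as response:
        return response.status


# Frequent: make new documents visible.
print(commit_url(BASE_URL, soft=True))
# Every one to fifteen minutes: flush and roll over the transaction log.
print(commit_url(BASE_URL, open_searcher=False))
```

The same two intervals can also be configured on the server side instead of
being driven by the client.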
If any of the info I've given here is wrong, someone should correct me!
Thanks,
Shawn