Informational

Hi,

This information is for anyone who might be running into problems when
performing explicit periodic backups of Solr indexes. I encountered this
problem myself, and hopefully the notes below will be useful to others.
A related Jira issue is: SOLR-1475.

The issue is: when you execute a 'command=backup' request, the snapshot
starts, but then fails later on with 'file not found' errors. This aborts
the snapshot, and you end up with no backup.
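
(For reference, the request in question is the one issued against the
ReplicationHandler, e.g.:

    http://localhost:8983/solr/replication?command=backup

with host, port, and path adjusted for your own setup.)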

This error occurs if, during the backup, Solr performs enough commits that
the commit point being backed up falls outside the 'maxCommitsToKeep' window
set in solrconfig.xml. If you don't commit very often, you probably won't
see this problem.
If, however, like me, you have Solr committing very often, the commit point
files for the backup can get deleted before the backup finishes. This is
particularly true of larger indexes, where the backup can take some time.

Workaround 1:
One workaround is to set 'maxCommitsToKeep' to a number higher than the
total number of commits that can occur during the time it takes to do a
backup. Sounds like a 'finger-in-the-air' number? Well, yes it is.
If you commit every 20secs and a full backup takes 10mins, that's 30 commits
during the backup, plus the commit point being backed up itself, so you'll
want a value of at least 31. The trouble is, how long will a backup take?
This can vary hugely with index size, system load, disk fragmentation, etc.
(my environment takes ~13mins to back up a 5.5GB index to a local folder)
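
For concreteness, this setting lives in the deletionPolicy section of
solrconfig.xml; the 31 here is just the example figure from above, not a
recommendation:

    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="maxCommitsToKeep">31</str>
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>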

An inefficiency of this approach to bear in mind: the higher the
'maxCommitsToKeep' number, the more files you're going to have lounging
around in your index data folder, the majority of which never get used. The
collective size of these commit point files can be significant, and a high
mergeFactor will increase the number of files as well.
You can set 'maxCommitAge' to delete old commit points after a certain time,
as long as it's not shorter than the 'worst-case' backup time.
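
'maxCommitAge' goes in the same deletionPolicy block and takes a
DateMath-style interval; the value below is a placeholder you'd size to
your worst-case backup time:

    <str name="maxCommitAge">30MINUTES</str>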

I set my 'maxCommitsToKeep' to 2400, and the file not found errors
disappeared (note that 2400 is a hugely conservative number, catering for a
backup taking 24hrs). My mergeFactor is 25, so I get a high number of files
in the index folder; they are generally small in size, but significant extra
storage can be required.

If you're willing to trade off some (ok, potentially a lot of) extraneous
disk usage to keep commit points around waiting for a backup command, this
approach addresses the problem.

Workaround 2:
A preferable method (IMHO): if you have an extra box, set up a read-only
replica, and then back up from the replica. You can then tune the slave to
suit your needs.
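
A minimal slave setup in the replica's solrconfig.xml might look like the
following (the master URL and poll interval are placeholders for your own
environment):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

You'd then issue the backup command against the slave, and tune how often
the slave polls (and hence commits) to fit around your backup window.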

Coding:
I'm not very familiar with the repl/backup code, but a coded way to address
this might be to pin the files of a commit point when a backup command is
received, then release them for deletion when the backup completes (see the
sketch below).
Perhaps someone with good knowledge of this part of Solr could comment more
authoritatively.
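
To illustrate the idea only - this is a hypothetical sketch, not Solr's
actual deletion-policy code, and all class and method names here are made
up:

    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch: pin commit points while a backup is running. */
    public class CommitPointTracker {

        // commit generation -> number of in-flight backups that still need it
        private final ConcurrentHashMap<Long, Integer> pinned =
                new ConcurrentHashMap<>();

        /** Call when a backup command is received: pin the commit point. */
        public void pin(long generation) {
            pinned.merge(generation, 1, Integer::sum);
        }

        /** Call when the backup completes (or fails): unpin it. */
        public void release(long generation) {
            // decrement the count; drop the entry entirely once it hits zero
            pinned.computeIfPresent(generation,
                    (gen, count) -> count > 1 ? count - 1 : null);
        }

        /** The deletion policy would consult this before deleting files. */
        public boolean isPinned(long generation) {
            return pinned.containsKey(generation);
        }
    }

The deletion policy would skip any commit point for which isPinned() returns
true, regardless of 'maxCommitsToKeep', so only the commits actually being
backed up are kept around.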


Thanks,
Peter
