Hi,
This information is for anyone who might be running into problems when performing explicit periodic backups of Solr indexes. I encountered this problem, and hopefully this will be useful to others. A related Jira issue is SOLR-1475.

The issue: when you execute a 'command=backup' request, the snapshot starts, but then fails later on with file-not-found errors. This aborts the snapshot, and you end up with no backup. The error occurs if, during the backup, Solr performs more commits than its 'maxCommitsToKeep' setting in solrconfig.xml allows. If you don't commit very often, you probably won't see this problem. If, however, like me, you have Solr committing very frequently, the commit point files needed by the backup can get deleted before the backup finishes. This is particularly true of larger indexes, where the backup can take some time.

Workaround 1: One workaround is to set 'maxCommitsToKeep' to a number higher than the total number of commits that can occur during the time it takes to do a backup. Sounds like a finger-in-the-air number? Well, yes, it is. If you commit every 20 seconds and a full backup takes 10 minutes, you'll want a value of at least 31. The trouble is: how long will a backup take? This can vary hugely as the index grows, the system gets busy, disks fragment, and so on (my environment takes ~13 minutes to back up a 5.5GB index to a local folder). An inefficiency of this approach that needs to be considered is that the higher 'maxCommitsToKeep' is, the more files you're going to have lounging around in your index data folder, the majority of which never get used. The collective size of these commit point files can be significant, and if you have a high mergeFactor, the number of files increases as well. You can set 'maxCommitAge' to delete old commit points after a certain time, as long as it's not shorter than the worst-case backup time.
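For reference, these settings live in the <deletionPolicy> section of solrconfig.xml. The snippet below is a sketch, and the numbers are illustrative only; size them against your own commit rate and worst-case backup duration:

```xml
<!-- Sketch of the relevant solrconfig.xml section. The values shown
     here are examples, not recommendations. -->
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Keep enough commit points to outlive the longest backup:
       e.g. a commit every 20s and a 10min backup needs at least 31. -->
  <str name="maxCommitsToKeep">31</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
  <!-- Optionally expire old commit points by age; this must not be
       shorter than the worst-case backup time. -->
  <str name="maxCommitAge">30MINUTES</str>
</deletionPolicy>
```

Note that 'maxCommitAge' takes a date-math style duration (e.g. 30MINUTES, 1DAY), so it can be expressed directly in terms of your expected backup window.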
I set my 'maxCommitsToKeep' to 2400, and the file-not-found errors disappeared (note that 2400 is a hugely conservative number, sized to cater for a backup taking 24 hours). My mergeFactor is 25, so I get a high number of files in the index folder; they are generally small in size, but significant extra storage can still be required. If you're willing to trade off some (OK, potentially a lot of) extraneous disk usage to keep commit points around waiting for a backup command, this approach addresses the problem.

Workaround 2: A preferable method (IMHO), if you have an extra box, is to set up a read-only replica and back up from the replica. You can then tune the slave to suit your needs.

Coding: I'm not very familiar with the replication/backup code, but a coded way to address this might be to reserve a commit point's index files when a backup command is received, then release them for deletion when the backup completes. Perhaps someone with good knowledge of this part of Solr could comment more authoritatively.

Thanks,
Peter