Hey all,

I've been getting familiar with our new operator over the last week or two, and noticed that the way its backups work will miss out on the "incremental" efficiency improvements added recently as part of SIP-12. For backups to be done incrementally, an ongoing backup has to be able to "see" the files stored by previous backups so that it knows which index files to skip over. Our current operator support does a few things that prevent this in practice:
- the operator "rm -rf"s all files at the backup location before starting each new backup
- the operator requests each backup at a unique name/location
- the operator compresses the backup file tree after finishing each backup

Everything will still work; the backups just won't be nearly as efficient as they could be for many common use cases.

There are a few ways we could address this.

In one approach, we could leave 'solrbackup' mostly untouched. For "incremental" situations, we would create a new resource-type ('solrbackupschedule'? 'solrbackuprepeating'?) that's explicitly geared towards repeated backups of the same collections and knows to store these all in the same location. Conceivably it could also have other useful ops features like cron-job-like scheduling of backups. 'solrbackupschedule' would then be our solution for users who want to do recurring or repeated backups, and 'solrbackup' could be repositioned in the docs as the solution for those doing an ad-hoc, standalone backup.

Another approach would be to focus instead on adding configuration options to 'solrbackup' that would make it suitable for incremental backups: enabling/disabling backup compression, cleaning/retaining the "location" prior to doing a backup, an override for the backup location, etc. 'solrbackup' would remain the option for anyone doing any sort of backup. (Of course, we could also add a 'solrbackupschedule' resource-type as a layer on top of this if the idea of cron-like backup triggering is appealing, but it could be implemented in terms of managing 'solrbackup' sub-resources that perform the actual "work".)

There are tradeoffs for both approaches IMO. The first approach is simplest in terms of backcompat. It may also prove simplest in handling discrepancies between Solr versions (incremental backups are only supported in v8.9+).
But it leaves a potential use-case gap: users may take backups frequently enough to benefit from "incrementality", but without any sort of defined schedule or set periodicity like a 'solrbackupschedule' resource might require. It also risks duplicating code, as both 'solrbackup' and 'solrbackupschedule' would involve similar actions.

OTOH, the second approach is more flexible ('solrbackup' would become suitable for any common backup use case), and 'solrbackupschedule', if created, would have a really nice conceptual separation, being implemented as a layer on top of 'solrbackup'. But it pays for all this by making 'solrbackup' more complex and harder for a non-Solr-SME to "get right" out of the box, and by opening up some backcompat questions/challenges. Lastly, it'd require us to think carefully about how cleanup and resource-deletion work, since this approach would allow multiple 'solrbackup' resources to share a backup "location".

Anyone have any thoughts or preferences between those two options? Or some third approach I missed? Or even general context around why our operator backup support looks the way it does?

Really appreciate any input!

Best,
Jason
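
P.S. In case it helps make the "see the previous files" point concrete, here's a rough toy model in Python. It is not how Solr or the operator actually implements anything: the function name and file names are made up for illustration, and it assumes index files can be matched purely by name. It only shows why wiping the location (or using a fresh one each time) forces every file to be copied again.

```python
def files_to_copy(index_files, already_at_location):
    """Return the index files a backup still has to copy.

    Toy model only: a file counts as "already backed up" if a file with
    the same name is visible at the backup location.
    """
    return sorted(set(index_files) - set(already_at_location))


# Hypothetical index contents at the time of the second backup.
current_index = {"_0.cfs", "_1.cfs", "_2.cfs"}

# Shared, retained location: the previous backup's files are still visible.
retained_location = {"_0.cfs", "_1.cfs"}

# Wiped or brand-new location: nothing from earlier backups is visible.
empty_location = set()

incremental_copy = files_to_copy(current_index, retained_location)
full_copy = files_to_copy(current_index, empty_location)

print("retained location ->", incremental_copy)
print("wiped/unique location ->", full_copy)
```

With the retained location only the one new file is copied, while the wiped or unique location degenerates into a full copy every time, which is roughly the situation we're in today.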