Sure thing. I put together a writeup on the file layout and formats here:
https://cwiki.apache.org/confluence/display/SOLR/Incremental+Backup+File+Format

The details get a little verbose, so I made it a subpage that the SIP-proper calls out to.
Let me know what you think when you get a chance to read - hopefully
that's sufficient to fill the gap.

Jason

On Thu, Jan 7, 2021 at 8:34 PM Tomás Fernández Löbbe <[email protected]> wrote:
>
> Thanks Jason! This is great, and a very much needed feature.
>
> > This helps to avoid confusion that would
> > otherwise arise between identically named files when e.g. a shard
> > leader changes between two incremental backups. (I'll try to expand
> > on this in the SIP, as it's a bit hard to give the full context here.)
>
> Thanks, I was wondering the same thing. Maybe it would be good to put an
> example of how the file structure of a backup looks like in the backup? and
> how the manifest file looks like. As you said, a file with the same name may
> refer to different segments created by different cores or the same one (even
> if the leader changed, it may be a file from a previous replication).
>
> On Thu, Jan 7, 2021 at 1:20 PM Jason Gerlowski <[email protected]> wrote:
>>
>> Thanks for the feedback Mike. I've gotta give any credit to Shalin
>> though, he wrote most of it before the holiday. He and Dat wrote much
>> of the code involved as well. I haven't done more than steward things
>> along so far. As you suggested, I've updated the SIP to mention the
>> related SOLR-13608 (see the bottom of the "Motivation" section).
>>
>> As for your questions, I've tried to answer them below.
>>
>> 1. Good catch - it doesn't. The SIP should read that each backup
>> creates its own manifest files as needed for directories it creates
>> under the base "location". This way, additional backups can be added
>> to the same location without needing to modify existing metadata
>> files. I've updated the SIP to reflect this.
>>
>> 2. The proposed metadata file is a lot like segments_n (in spirit) in
>> that it has pointers to each index file that comprise an
>> index/replica.
>> But it differs in that it stores additional
>> information about each file (checksum, size) separate from the file
>> itself. It also allows a layer of naming indirection between what
>> files are named in the storage repository and what name they should be
>> given upon restoration. This helps to avoid confusion that would
>> otherwise arise between identically named files when e.g. a shard
>> leader changes between two incremental backups. (I'll try to expand
>> on this in the SIP, as it's a bit hard to give the full context here.)
>>
>> 3. My intention was that the 'maxNumBackups' parameter would only
>> refer to the incremental backups in a given location. This was mostly
>> informed by the fact that traditional backups today are required to be
>> 1-per-location (i.e. a backup in 8.6.3 will error out if the
>> specified directory has files in it). We could fix that aspect of
>> traditional backups and find semantics for 'maxNumBackups' that might
>> include traditional ones, but IMO it'd add complexity and work for a
>> format that the SIP is trying to replace more broadly anyways.
>>
>> 4. I definitely intended to update LocalFileSystemRepository. I have
>> code to update HdfsBackupRepository as well, but wasn't quite sure
>> where that stood since it's currently deprecated. I haven't seen
>> plans to make it a plugin, but might've just missed those discussions
>> in other mail. Anyway, I plan to update it but that assumes it's
>> sticking around in one form or another.
>>
>> 5. Good idea - I didn't realize that was an option. But it would be
>> really nice if possible. I don't have an estimate on resources. I
>> expect the need would be relatively small - you could restrict the
>> tests to running on the nightly runs on ASF's Jenkins unless devs
>> provide their own (e.g.) s3 creds. But that's just a guess obviously,
>> and not even in concrete terms.
>>
>> Thanks again for taking the time to wade through the SIP - really
>> appreciate the feedback.
>> Hope the answers help!
>>
>> Best,
>>
>> Jason
>>
>> On Tue, Jan 5, 2021 at 11:52 AM Mike Drob <[email protected]> wrote:
>> >
>> > This is a very thorough SIP, thank you for spending the time on it, Jason!
>> >
>> > I have a few minor questions about points that are unclear to me.
>> >
>> > 1) If we assume that we cannot overwrite files, how does the manifest file
>> > stay current for incremental backup operations to the same directory?
>> > 2) How is the manifest file functionally different from the segments_n and
>> > segments.gen files?
>> > 3) Does the maxNumBackups parameter consider incremental backups or only
>> > full backups? What happens if we have a full backup and then N incremental
>> > ones? Do we delete the full backup and convert the oldest incremental one
>> > into a full? I imagine this might be a metadata operation, but then the
>> > concerns from question 1 apply.
>> > 4) Do we plan to retrofit HDFS Backup and Local File Backup to use the new
>> > interfaces? I believe we should, but may be willing to accept this as out
>> > of scope.
>> > 5) Regarding cloud provider test resources, we can also approach the ASF
>> > Infra team to ask for cloud credits. Can you give rough estimates on what
>> > kind of resourcing would be needed?
>> >
>> > I did not examine the new APIs in detail, but they looked fine at a high
>> > level overview. Will probably look again after questions regarding v1/v2
>> > are figured out.
>> >
>> > On Tue, Jan 5, 2021 at 10:11 AM Mike Drob <[email protected]> wrote:
>> >>
>> >> Can you explicitly call out in the SIP how it relates to the work done in
>> >> SOLR-13608?
>> >>
>> >> On Tue, Jan 5, 2021 at 8:55 AM Jason Gerlowski <[email protected]>
>> >> wrote:
>> >>>
>> >>> Hey, Happy New Year everybody.
>> >>>
>> >>> Some SIP updates based on the discussion above:
>> >>>
>> >>> I added v2 examples for each API to the SIP.
>> >>> Feedback welcome,
>> >>> especially on the v2 APIs that are net-new to this proposal (namely:
>> >>> "list backups" and "delete backup").
>> >>>
>> >>> I've also amended the backcompat/migration section to mention Jan's
>> >>> suggestion that the "incremental" features be exposed in the v2 API
>> >>> only. Though it's unclear to me whether that's still something people
>> >>> want since it turns out that we'll still have backcompat concerns with
>> >>> the existing v2 backup/restore APIs. So I've held off from
>> >>> removing/replacing the original plan.
>> >>>
>> >>> Link for convenience:
>> >>> https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore
>> >>>
>> >>> Best,
>> >>>
>> >>> Jason
>> >>>
>> >>>
>> >>> On Thu, Dec 24, 2020 at 8:11 AM Jan Høydahl <[email protected]>
>> >>> wrote:
>> >>> >
>> >>> > Ok, that’s the one I was looking for, it’s not documented in the
>> >>> > backup chapter of ref-guide :(
>> >>> >
>> >>> > Jan Høydahl
>> >>> >
>> >>> > > On 23 Dec 2020, at 17:10, Jason Gerlowski
>> >>> > > <[email protected]> wrote:
>> >>> > >
>> >>> > >
>> >>> > >>
>> >>> > >> We have a path alias to the old API ... but we don’t have a true v2
>> >>> > >> API spec for it, do we?
>> >>> > >
>> >>> > > Tbh I'm not yet familiar enough with the v2 APIs to understand the
>> >>> > > distinction you're making. (Do you have a pointer to something
>> >>> > > that'd fill me in?)
>> >>> > >
>> >>> > > To zoom in on "backup" as an example, the v2 API I'm referring to
>> >>> > > looks like: /v2/collections" -d '{ "backup-collection":
>> >>> > > {"collection": "books", "name": "asdf3", "location": "/tmp/foo"}}'.
>> >>> > > And it's included in the v2 "introspect" documentation returned by
>> >>> > > this API: /v2/collections/_introspect?command=backup-collection". To
>> >>> > > me that looked like a v2 API, but maybe path-aliases are also covered
>> >>> > > in the introspect docs.
>> >>> > >
>> >>> > > Jason
>> >>> > >
>> >>> > >> On Wed, Dec 23, 2020 at 10:29 AM Jan Høydahl
>> >>> > >> <[email protected]> wrote:
>> >>> > >>
>> >>> > >> Actually, don’t think we do have a v2 Backup/Restore API. We have a
>> >>> > >> path alias to the old API which takes GET ...&action=backup... but
>> >>> > >> we don’t have a true v2 API spec for it, do we? Where is that
>> >>> > >> documented?
>> >>> > >>
>> >>> > >> Jan Høydahl
>> >>> > >>
>> >>> > >>>> On 22 Dec 2020, at 18:04, Jason Gerlowski
>> >>> > >>>> <[email protected]> wrote:
>> >>> > >>>
>> >>> > >>> Hey guys,
>> >>> > >>>
>> >>> > >>> Following up to make sure I understand the specifics you're
>> >>> > >>> suggesting. You're proposing that:
>> >>> > >>>
>> >>> > >>> 1. The brand new backup-related APIs (list-backups and
>> >>> > >>> delete-backup) be added in v2-form only.
>> >>> > >>> 2. Tweaks to existing backup-related APIs (create-backup, restore)
>> >>> > >>> be made in v2-form only.
>> >>> > >>> 3. All existing v1 backup-related APIs be deprecated and left
>> >>> > >>> unchanged. Incremental backups will not be possible using the v1
>> >>> > >>> API.
>> >>> > >>>
>> >>> > >>> I'm not against going this route if there's consensus around it. But
>> >>> > >>> I'm not 100% clear on how it means we don't need to worry about
>> >>> > >>> backcompat. Backup and Restore currently exist as both a v1 and a v2
>> >>> > >>> API - I understand how leaving the v1 APIs untouched (other than
>> >>> > >>> deprecation) frees us of some backcompat concerns there, but we would
>> >>> > >>> still need to make tweaks to the v2 backup/restore APIs and would have
>> >>> > >>> to tread just as carefully there in terms of backcompat, afaict.
>> >>> > >>> Unless Solr's backcompatibility guarantees only cover the v1 API and
>> >>> > >>> leave v2 changes to be made freely?
>> >>> > >>> I looked around to see if the v2
>> >>> > >>> APIs had any sort of "experimental" designation, but couldn't find
>> >>> > >>> that clearly stated anywhere. Am I missing something?
>> >>> > >>>
>> >>> > >>> Best,
>> >>> > >>>
>> >>> > >>> Jason
>> >>> > >>>
>> >>> > >>>> On Tue, Dec 22, 2020 at 2:49 AM Noble Paul <[email protected]>
>> >>> > >>>> wrote:
>> >>> > >>>>
>> >>> > >>>>> , and implement the new improved version as a V2-api only, and
>> >>> > >>>>> then deprecate the v1 API?
>> >>> > >>>>
>> >>> > >>>>
>> >>> > >>>> V2 only please
>> >>> > >>>>
>> >>> > >>>>> On Tue, Dec 22, 2020 at 1:34 AM Jason Gerlowski
>> >>> > >>>>> <[email protected]> wrote:
>> >>> > >>>>>
>> >>> > >>>>> Hey Jan, thanks for the review.
>> >>> > >>>>>
>> >>> > >>>>> I hadn't thought about the V2 API in connection to this work.
>> >>> > >>>>> You're right though I think - the SIP proposes net-new APIs, so it
>> >>> > >>>>> should add V2 equivalents at the very least. I'll draft tentative
>> >>> > >>>>> details for these APIs on the SIP and we can refine things from
>> >>> > >>>>> there.
>> >>> > >>>>>
>> >>> > >>>>> I'm more up in the air on your specific suggestion to restrict
>> >>> > >>>>> the SIP changes to these v2 APIs. It is an elegant approach to the
>> >>> > >>>>> backcompat, and it provides a carrot for v2 adoption - both of
>> >>> > >>>>> which I like. But it would let users create snapshot-based backups
>> >>> > >>>>> (and keep us maintaining that code) longer than there's any strict
>> >>> > >>>>> need to. And users are left on the less-efficient format by
>> >>> > >>>>> default. (By contrast, the current SIP has snapshot-backup
>> >>> > >>>>> creation being replaced by incremental-backup creation as soon as
>> >>> > >>>>> the latter is available.)
>> >>> > >>>>> Did you have a particular lifespan in mind for snapshot-based
>> >>> > >>>>> creation if we go with this approach?
>> >>> > >>>>>
>> >>> > >>>>> Jason
>> >>> > >>>>>
>> >>> > >>>>> On Thu, Dec 17, 2020 at 3:54 PM Jan Høydahl
>> >>> > >>>>> <[email protected]> wrote:
>> >>> > >>>>>>
>> >>> > >>>>>> Much needed! Thanks for initiating this Jason!
>> >>> > >>>>>>
>> >>> > >>>>>> As we want to move away from v1 APIs where an HTTP GET is used
>> >>> > >>>>>> for creation and deletion, would it be an idea to leave the old
>> >>> > >>>>>> backup/restore APIs as-is, and implement the new improved
>> >>> > >>>>>> version as a V2-api only, and then deprecate the v1 API? Then
>> >>> > >>>>>> we don't need to worry about back-compat, and we get a
>> >>> > >>>>>> head-start on converting the COLLECTION API to v2 style.
>> >>> > >>>>>>
>> >>> > >>>>>> Jan
>> >>> > >>>>>>
>> >>> > >>>>>>> On 15 Dec 2020, at 15:48, Jason Gerlowski
>> >>> > >>>>>>> <[email protected]> wrote:
>> >>> > >>>>>>>
>> >>> > >>>>>>> Hey all,
>> >>> > >>>>>>>
>> >>> > >>>>>>> This morning I published SIP-12, which proposes an overhaul of
>> >>> > >>>>>>> Solr's backup and restore functionality. While the "headline"
>> >>> > >>>>>>> improvement in this SIP is a change to do backups incrementally,
>> >>> > >>>>>>> it bundles in a number of other improvements as well, including
>> >>> > >>>>>>> the addition of corruption checks, APIs to list and delete
>> >>> > >>>>>>> backups, and stronger integration points with popular object
>> >>> > >>>>>>> storage APIs.
>> >>> > >>>>>>>
>> >>> > >>>>>>> The SIP can be found here:
>> >>> > >>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore
>> >>> > >>>>>>>
>> >>> > >>>>>>> Please read the SIP description and come back here for
>> >>> > >>>>>>> discussion.
>> >>> > >>>>>>> As the discussion progresses we will update the SIP page with any
>> >>> > >>>>>>> outcomes and eventually move things to a VOTE.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Looking forward to hearing your feedback.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Best,
>> >>> > >>>>>>>
>> >>> > >>>>>>> Jason
>> >>> > >>>>>>>
>> >>> > >>>>>>> ---------------------------------------------------------------------
>> >>> > >>>>>>> To unsubscribe, e-mail: [email protected]
>> >>> > >>>>>>> For additional commands, e-mail: [email protected]
