Sure thing.  I put together a writeup on the file layout and formats
here: 
https://cwiki.apache.org/confluence/display/SOLR/Incremental+Backup+File+Format
The details get a little verbose, so I made it a subpage that the
SIP-proper calls out to.

Let me know what you think when you get a chance to read - hopefully
that's sufficient to fill the gap.

Jason
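
To make the naming indirection from point 2 of the quoted reply below a bit more concrete, here is a minimal sketch.  It is not the format from the SIP or the linked wiki page: the `ManifestEntry` fields, the UUID-based stored names, the SHA-256 checksum, and the dict standing in for a storage repository are all assumptions chosen for illustration.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical manifest record; field names are illustrative only."""
    stored_name: str    # unique name of the file inside the backup repository
    restore_name: str   # name the file should be given on restore (e.g. "_0.si")
    checksum: str       # hex digest kept separately from the file itself
    size: int           # length in bytes


def add_file(repo: dict, restore_name: str, content: bytes) -> ManifestEntry:
    """Store content under a fresh unique name and return its manifest entry."""
    stored_name = uuid.uuid4().hex
    repo[stored_name] = content
    return ManifestEntry(stored_name, restore_name,
                         hashlib.sha256(content).hexdigest(), len(content))


def restore(repo: dict, manifest: list) -> dict:
    """Rebuild an index directory, verifying size and checksum of each file."""
    index = {}
    for entry in manifest:
        content = repo[entry.stored_name]
        if len(content) != entry.size:
            raise ValueError(f"size mismatch for {entry.restore_name}")
        if hashlib.sha256(content).hexdigest() != entry.checksum:
            raise ValueError(f"checksum mismatch for {entry.restore_name}")
        index[entry.restore_name] = content
    return index


# Two backups taken from different leaders both contain a file named "_0.si",
# but with different contents; the unique stored names keep them apart.
repo = {}
backup_1 = [add_file(repo, "_0.si", b"segment written by old leader")]
backup_2 = [add_file(repo, "_0.si", b"segment written by new leader")]
```

Restoring either manifest yields a file called `_0.si` with the right contents, even though both versions sit side by side in the same repository.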

On Thu, Jan 7, 2021 at 8:34 PM Tomás Fernández Löbbe
<[email protected]> wrote:
>
> Thanks Jason! This is great, and a very much needed feature.
>
> > This helps to avoid confusion that would
> > otherwise arise between identically named files when e.g. a shard
> > leader changes between two incremental backups.  (I'll try to expand
> > on this in the SIP, as it's a bit hard to give the full context here.)
>
> Thanks, I was wondering the same thing. Maybe it would be good to put an
> example of what the file structure of a backup looks like in the SIP, and
> what the manifest file looks like? As you said, a file with the same name may
> refer to different segments created by different cores or the same one (even
> if the leader changed, it may be a file from a previous replication).
>
> On Thu, Jan 7, 2021 at 1:20 PM Jason Gerlowski <[email protected]> wrote:
>>
>> Thanks for the feedback Mike.  I've gotta give any credit to Shalin
>> though, he wrote most of it before the holiday.  He and Dat wrote much
>> of the code involved as well.  I haven't done more than steward things
>> along so far.  As you suggested, I've updated the SIP to mention the
>> related SOLR-13608 (see the bottom of the "Motivation" section).
>>
>> As for your questions, I've tried to answer them below.
>>
>> 1. Good catch - it doesn't. The SIP should read that each backup
>> creates its own manifest files as needed for directories it creates
>> under the base "location".  This way, additional backups can be added
>> to the same location without needing to modify existing metadata
>> files.  I've updated the SIP to reflect this.
>>
>> 2. The proposed metadata file is a lot like segments_n (in spirit) in
>> that it has pointers to each of the index files that comprise an
>> index/replica.  But it differs in that it stores additional
>> information about each file (checksum, size) separate from the file
>> itself.  It also allows a layer of naming indirection between what
>> files are named in the storage repository and what name they should be
>> given upon restoration.  This helps to avoid confusion that would
>> otherwise arise between identically named files when e.g. a shard
>> leader changes between two incremental backups.  (I'll try to expand
>> on this in the SIP, as it's a bit hard to give the full context here.)
>>
>> 3. My intention was that the 'maxNumBackups' parameter would only
>> refer to the incremental backups in a given location.  This was mostly
>> informed by the fact that traditional backups today are required to be
>> 1-per-location.  (i.e. a backup in 8.6.3 will error out if the
>> specified directory has files in it.)  We could fix that aspect of
>> traditional backups and find semantics for 'maxNumBackups' that might
>> include traditional ones, but IMO it'd add complexity and work for a
>> format that the SIP is trying to replace more broadly anyways.
>>
>> 4. I definitely intended to update LocalFileSystemRepository.  I have
>> code to update HdfsBackupRepository as well, but wasn't quite sure
>> where that stood since it's currently deprecated.  I haven't seen
>> plans to make it a plugin, but might've just missed those discussions
>> in other mail.  Anyway, I plan to update it but that assumes it's
>> sticking around in one form or another.
>>
>> 5. Good idea - I didn't realize that was an option.  But it would be
>> really nice if possible.  I don't have an estimate on resources.  I
>> expect the need would be relatively small - you could restrict the
>> tests to running on the nightly runs on ASF's Jenkins unless devs
>> provide their own (e.g.) s3 creds.  But that's just a guess obviously,
>> and not even in concrete terms.
>>
>> Thanks again for taking the time to wade through the SIP - really
>> appreciate the feedback.  Hope the answers help!
>>
>> Best,
>>
>> Jason
>>
>> On Tue, Jan 5, 2021 at 11:52 AM Mike Drob <[email protected]> wrote:
>> >
>> > This is a very thorough SIP, thank you for spending the time on it, Jason!
>> >
>> > I have a few minor questions about points that are unclear to me.
>> >
>> > 1) If we assume that we cannot overwrite files, how does the manifest file 
>> > stay current for incremental backup operations to the same directory?
>> > 2) How is the manifest file functionally different from the segments_n and 
>> > segments.gen files?
>> > 3) Does the maxNumBackups parameter consider incremental backups or only 
>> > full backups? What happens if we have a full backup and then N incremental 
>> > ones? Do we delete the full backup and convert the oldest incremental one 
>> > into a full? I imagine this might be a metadata operation, but then the 
>> > concerns from question 1 apply.
>> > 4) Do we plan to retrofit HDFS Backup and Local File Backup to use the new 
>> > interfaces? I believe we should, but may be willing to accept this as out 
>> > of scope.
>> > 5) Regarding cloud provider test resources, we can also approach the ASF 
>> > Infra team to ask for cloud credits. Can you give rough estimates on what 
>> > kind of resourcing would be needed?
>> >
>> > I did not examine the new APIs in detail, but they looked fine at a high 
>> > level overview. Will probably look again after questions regarding v1/v2 
>> > are figured out.
>> >
>> > On Tue, Jan 5, 2021 at 10:11 AM Mike Drob <[email protected]> wrote:
>> >>
>> >> Can you explicitly call out in the SIP how it relates to the work done in 
>> >> SOLR-13608?
>> >>
>> >> On Tue, Jan 5, 2021 at 8:55 AM Jason Gerlowski <[email protected]> 
>> >> wrote:
>> >>>
>> >>> Hey, Happy New Year everybody.
>> >>>
>> >>> Some SIP updates based on the discussion above:
>> >>>
>> >>> I added v2 examples for each API to the SIP.  Feedback welcome,
>> >>> especially on the v2 APIs that are net-new to this proposal (namely:
>> >>> "list backups" and "delete backup").
>> >>>
>> >>> I've also amended the backcompat/migration section to mention Jan's
>> >>> suggestion that the "incremental" features be exposed in the v2 API
>> >>> only.  Though it's unclear to me whether that's still something people
>> >>> want since it turns out that we'll still have backcompat concerns with
>> >>> the existing v2 backup/restore APIs.  So I've held off from
>> >>> removing/replacing the original plan.
>> >>>
>> >>> Link for convenience:
>> >>> https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore
>> >>>
>> >>> Best,
>> >>>
>> >>> Jason
>> >>>
>> >>>
>> >>> On Thu, Dec 24, 2020 at 8:11 AM Jan Høydahl <[email protected]> 
>> >>> wrote:
>> >>> >
>> >>> > Ok, that’s the one I was looking for, it’s not documented in the 
>> >>> > backup chapter of ref-guide :(
>> >>> >
>> >>> > Jan Høydahl
>> >>> >
>> >>> > > On 23 Dec 2020, at 17:10, Jason Gerlowski
>> >>> > > <[email protected]> wrote:
>> >>> > >
>> >>> > > 
>> >>> > >>
>> >>> > >> We have a path alias to the old API ... but we don’t have a true v2 
>> >>> > >> API spec for it, do we?
>> >>> > >
>> >>> > > Tbh I'm not yet familiar enough with the v2 APIs to understand the
>> >>> > > distinction you're making.  (Do you have a pointer to something
>> >>> > > that'd fill me in?)
>> >>> > >
>> >>> > > To zoom in on "backup" as an example, the v2 API I'm referring to
>> >>> > > looks like:  /v2/collections" -d '{ "backup-collection":
>> >>> > > {"collection": "books", "name": "asdf3", "location": "/tmp/foo"}}'.
>> >>> > > And it's included in the v2 "introspect" documentation returned by
>> >>> > > this API: /v2/collections/_introspect?command=backup-collection".  To
>> >>> > > me that looked like a v2 API, but maybe path-aliases are also covered
>> >>> > > in the introspect docs.
>> >>> > >
>> >>> > > Jason
>> >>> > >
>> >>> > >> On Wed, Dec 23, 2020 at 10:29 AM Jan Høydahl 
>> >>> > >> <[email protected]> wrote:
>> >>> > >>
>> >>> > >> Actually, don’t think we do have a v2 Backup/Restore API. We have a 
>> >>> > >> path alias to the old API which takes GET ...&action=backup... but 
>> >>> > >> we don’t have a true v2 API spec for it, do we? Where is that 
>> >>> > >> documented?
>> >>> > >>
>> >>> > >> Jan Høydahl
>> >>> > >>
>> >>> > >>>> On 22 Dec 2020, at 18:04, Jason Gerlowski
>> >>> > >>>> <[email protected]> wrote:
>> >>> > >>>
>> >>> > >>> Hey guys,
>> >>> > >>>
>> >>> > >>> Following up to make sure I understand the specifics you're
>> >>> > >>> suggesting.  You're proposing that:
>> >>> > >>>
>> >>> > >>> 1. The brand new backup-related APIs (list-backups and
>> >>> > >>> delete-backup) be added in v2-form only.
>> >>> > >>> 2. Tweaks to existing backup-related APIs (create-backup,
>> >>> > >>> restore) be made in v2-form only.
>> >>> > >>> 3. All existing v1 backup-related APIs be deprecated and left
>> >>> > >>> unchanged.  Incremental backups will not be possible using the
>> >>> > >>> v1 API.
>> >>> > >>>
>> >>> > >>> I'm not against going this route if there's consensus around it.
>> >>> > >>> But I'm not 100% clear on how it means we don't need to worry
>> >>> > >>> about backcompat.  Backup and Restore currently exist as both a
>> >>> > >>> v1 and a v2 API - I understand how leaving the v1 APIs untouched
>> >>> > >>> (other than deprecation) frees us of some backcompat concerns
>> >>> > >>> there, but we would still need to make tweaks to the v2
>> >>> > >>> backup/restore APIs and would have to tread just as carefully
>> >>> > >>> there in terms of backcompat, afaict.  Unless Solr's
>> >>> > >>> backcompatibility guarantees only cover the v1 API and leave v2
>> >>> > >>> changes to be made freely?  I looked around to see if the v2 APIs
>> >>> > >>> had any sort of "experimental" designation, but couldn't find
>> >>> > >>> that clearly stated anywhere.  Am I missing something?
>> >>> > >>>
>> >>> > >>> Best,
>> >>> > >>>
>> >>> > >>> Jason
>> >>> > >>>
>> >>> > >>>> On Tue, Dec 22, 2020 at 2:49 AM Noble Paul <[email protected]> 
>> >>> > >>>> wrote:
>> >>> > >>>>
>> >>> > >>>>> , and implement the new improved version as a V2-api only, and
>> >>> > >>>>> then deprecate the v1 API?
>> >>> > >>>>
>> >>> > >>>>
>> >>> > >>>> V2 only please
>> >>> > >>>>
>> >>> > >>>>> On Tue, Dec 22, 2020 at 1:34 AM Jason Gerlowski 
>> >>> > >>>>> <[email protected]> wrote:
>> >>> > >>>>>
>> >>> > >>>>> Hey Jan, thanks for the review.
>> >>> > >>>>>
>> >>> > >>>>> I hadn't thought about the V2 API in connection to this work.
>> >>> > >>>>> You're right though I think - the SIP proposes net-new APIs, so
>> >>> > >>>>> it should add V2 equivalents at the very least.  I'll draft
>> >>> > >>>>> tentative details for these APIs on the SIP and we can refine
>> >>> > >>>>> things from there.
>> >>> > >>>>>
>> >>> > >>>>> I'm more up in the air on your specific suggestion to restrict
>> >>> > >>>>> the SIP changes to these v2 APIs.  It is an elegant approach to
>> >>> > >>>>> the backcompat, and it provides a carrot for v2 adoption - both
>> >>> > >>>>> of which I like.  But it would let users create snapshot-based
>> >>> > >>>>> backups (and keep us maintaining that code) longer than there's
>> >>> > >>>>> any strict need to.  And users are left on the less-efficient
>> >>> > >>>>> format by default.  (By contrast, the current SIP has
>> >>> > >>>>> snapshot-backup creation being replaced by incremental-backup
>> >>> > >>>>> creation as soon as the latter is available.)  Did you have a
>> >>> > >>>>> particular lifespan in mind for snapshot-based creation if we go
>> >>> > >>>>> with this approach?
>> >>> > >>>>>
>> >>> > >>>>> Jason
>> >>> > >>>>>
>> >>> > >>>>> On Thu, Dec 17, 2020 at 3:54 PM Jan Høydahl 
>> >>> > >>>>> <[email protected]> wrote:
>> >>> > >>>>>>
>> >>> > >>>>>> Much needed! Thanks for initiating this Jason!
>> >>> > >>>>>>
>> >>> > >>>>>> As we want to move away from v1 APIs where an HTTP GET is used
>> >>> > >>>>>> for creation and deletion, would it be an idea to leave the old
>> >>> > >>>>>> backup/restore APIs as-is, and implement the new improved
>> >>> > >>>>>> version as a V2-api only, and then deprecate the v1 API? Then
>> >>> > >>>>>> we don't need to worry about back-compat, and we get a
>> >>> > >>>>>> head-start on converting the COLLECTION API to v2 style.
>> >>> > >>>>>>
>> >>> > >>>>>> Jan
>> >>> > >>>>>>
>> >>> > >>>>>>> On 15 Dec 2020, at 15:48, Jason Gerlowski
>> >>> > >>>>>>> <[email protected]> wrote:
>> >>> > >>>>>>>
>> >>> > >>>>>>> Hey all,
>> >>> > >>>>>>>
>> >>> > >>>>>>> This morning I published SIP-12, which proposes an overhaul of
>> >>> > >>>>>>> Solr's backup and restore functionality.  While the "headline"
>> >>> > >>>>>>> improvement in this SIP is a change to do backups
>> >>> > >>>>>>> incrementally, it bundles in a number of other improvements as
>> >>> > >>>>>>> well, including the addition of corruption checks, APIs to list
>> >>> > >>>>>>> and delete backups, and stronger integration points with
>> >>> > >>>>>>> popular object storage APIs.
>> >>> > >>>>>>>
>> >>> > >>>>>>> The SIP can be found here:
>> >>> > >>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore
>> >>> > >>>>>>>
>> >>> > >>>>>>> Please read the SIP description and come back here for
>> >>> > >>>>>>> discussion.  As the discussion progresses we will update the SIP
>> >>> > >>>>>>> page with any outcomes and eventually move things to a VOTE.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Looking forward to hearing your feedback.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Best,
>> >>> > >>>>>>>
>> >>> > >>>>>>> Jason
>> >>> > >>>>>>>
>> >>> > >>>>>>> ---------------------------------------------------------------------
>> >>> > >>>>>>> To unsubscribe, e-mail: [email protected]
>> >>> > >>>>>>> For additional commands, e-mail: [email protected]
>> >>> > >>>>>>>
>> >>> > >>>>
>> >>> > >>>>
>> >>> > >>>> --
>> >>> > >>>> -----------------------------------------------------
>> >>> > >>>> Noble Paul

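For reference, the v2 backup request quoted in the Dec 23 message above can be sketched as follows.  The path and JSON body are taken from that message; the host and port (`localhost:8983`) and the helper name `build_backup_request` are assumptions, and the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed base URL; the quoted message only shows the "/v2/collections" path.
SOLR_BASE = "http://localhost:8983"


def build_backup_request(collection: str, name: str,
                         location: str) -> urllib.request.Request:
    """Build (but do not send) a POST for the v2 "backup-collection" command."""
    body = {"backup-collection": {"collection": collection,
                                  "name": name,
                                  "location": location}}
    return urllib.request.Request(
        url=SOLR_BASE + "/v2/collections",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same values as the example in the quoted message.
req = build_backup_request("books", "asdf3", "/tmp/foo")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it, which of course needs a running Solr node at the assumed address.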