Re: documenting the backup manifest file format

2020-05-15 Thread David Steele

On 5/15/20 10:17 AM, Tom Lane wrote:

David Steele  writes:

On 5/15/20 9:34 AM, Tom Lane wrote:

I vote for following the backup_label precedent; that's stood for quite
some years now.



Of course, my actual preference is to use epoch time which is easy to
work with and eliminates the possibility of conversion errors. It is
also compact.


Well, if we did that then it'd be sufficiently different from the backup
label as to remove any risk of confusion.  But "easy to work with" is in
the eye of the beholder; do we really want a format that's basically
unreadable to the naked eye?


Well, I lost this argument before so it seems I'm in the minority on 
easy-to-use. We use epoch time in the pgBackRest manifests which has 
been easy to deal with in both C and Perl, so experience tells me it 
really is easy, at least for programs.


The manifest (to me, at least) is generally intended to be 
machine-processed. For instance, it contains checksums which are not all 
that useful unless they are checked programmatically -- they can't just 
be eye-balled.


Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-05-15 Thread Tom Lane
David Steele  writes:
> On 5/15/20 9:34 AM, Tom Lane wrote:
>> I vote for following the backup_label precedent; that's stood for quite
>> some years now.

> Of course, my actual preference is to use epoch time which is easy to 
> work with and eliminates the possibility of conversion errors. It is 
> also compact.

Well, if we did that then it'd be sufficiently different from the backup
label as to remove any risk of confusion.  But "easy to work with" is in
the eye of the beholder; do we really want a format that's basically
unreadable to the naked eye?

regards, tom lane




Re: documenting the backup manifest file format

2020-05-15 Thread David Steele

On 5/15/20 9:34 AM, Tom Lane wrote:

Robert Haas  writes:

It's a good question. My inclination was to think that GMT would be
the clearest thing, but I also didn't realize that the result would
thus be inconsistent with backup_label. Not sure what's best here.


I vote for following the backup_label precedent; that's stood for quite
some years now.


I'd rather keep it GMT. The timestamps in the backup label are purely 
informational, but the timestamps in the manifest are useful, e.g. to 
set the mtime on a restore to the original value.


Forcing the user to do timezone conversions is prone to error. Some 
languages, like C, simply aren't good at it.


Of course, my actual preference is to use epoch time which is easy to 
work with and eliminates the possibility of conversion errors. It is 
also compact.


Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-05-15 Thread Tom Lane
Robert Haas  writes:
> It's a good question. My inclination was to think that GMT would be
> the clearest thing, but I also didn't realize that the result would
> thus be inconsistent with backup_label. Not sure what's best here.

I vote for following the backup_label precedent; that's stood for quite
some years now.

regards, tom lane




Re: documenting the backup manifest file format

2020-05-15 Thread Robert Haas
On Fri, May 15, 2020 at 2:10 AM Fujii Masao  wrote:
> I have one question related to this; Why don't we use log_timezone,
> like backup_label? log_timezone is used for "START TIME" field in
> backup_label. Sorry if this was already discussed.
>
> /* Use the log timezone here, not the session timezone */
> stamp_time = (pg_time_t) time(NULL);
> pg_strftime(strfbuf, sizeof(strfbuf),
> "%Y-%m-%d %H:%M:%S %Z",
> pg_localtime(&stamp_time, 
> log_timezone));
>
> OTOH, *if* we want to use the same timezone for backup-related files because
> backup can be used in different environements and timezone setting
> may be different there or for other reasons, backup_label also should use
> GMT or something for the sake of consistency?

It's a good question. My inclination was to think that GMT would be
the clearest thing, but I also didn't realize that the result would
thus be inconsistent with backup_label. Not sure what's best here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-05-14 Thread Fujii Masao




On 2020/04/15 5:33, Andrew Dunstan wrote:


On 4/14/20 4:09 PM, Alvaro Herrera wrote:

On 2020-Apr-14, Andrew Dunstan wrote:


OK, but I think if we're putting a timestamp string in ISO-8601 format
in the manifest it should be in UTC / Zulu time, precisely to avoid
these issues. If that's too much trouble then yes an epoch time will
probably do.

The timestamp is always specified and always UTC (except the code calls
it GMT).

+   /*
+* Convert last modification time to a string and append it to the
+* manifest. Since it's not clear what time zone to use and since time
+* zone definitions can change, possibly causing confusion, use GMT
+* always.
+*/
+   appendStringInfoString(&buf, "\"Last-Modified\": \"");
+   enlargeStringInfo(&buf, 128);
+   buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z",
+  pg_gmtime(&mtime));
+   appendStringInfoString(&buf, "\"");

I was merely saying that it's trivial to make this iso-8601 compliant as

 buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ",

ie. omit the "GMT" string and replace it with a literal Z, and remove
the space and replace it with a T.


I have one question related to this; Why don't we use log_timezone,
like backup_label? log_timezone is used for "START TIME" field in
backup_label. Sorry if this was already discussed.

/* Use the log timezone here, not the session timezone */
stamp_time = (pg_time_t) time(NULL);
pg_strftime(strfbuf, sizeof(strfbuf),
"%Y-%m-%d %H:%M:%S %Z",
pg_localtime(&stamp_time, 
log_timezone));

OTOH, *if* we want to use the same timezone for backup-related files because
backup can be used in different environements and timezone setting
may be different there or for other reasons, backup_label also should use
GMT or something for the sake of consistency?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: documenting the backup manifest file format

2020-04-17 Thread Fujii Masao




On 2020/04/15 22:24, Robert Haas wrote:

On Tue, Apr 14, 2020 at 11:49 PM Fujii Masao
 wrote:

While reading the document that you pushed, I thought that it's better
to define index term for backup manifest, so that we can easily reach
this document from the index page. Thought? Patch attached.


Fine with me. I tend not to think about the index very much, so I'm
glad you are. :-)


Pushed! Thanks!

Regards,
 


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: documenting the backup manifest file format

2020-04-16 Thread Jehan-Guillaume de Rorthais
On Wed, 15 Apr 2020 18:54:14 -0400
David Steele  wrote:

> On 4/15/20 6:43 PM, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 15 Apr 2020 12:03:28 -0400
> > Robert Haas  wrote:
> >   
> >> On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
> >>  wrote:  
> >>> But for backup_manifest, it's kind of shame we have to check the checksum
> >>> against an transformed version of the file. Did you consider creating eg.
> >>> a separate backup_manifest.sha256 file?
> >>>
> >>> I'm very sorry in advance if this has been discussed previously.  
> >>
> >> It was briefly mentioned in the original (lengthy) discussion, but I
> >> think there was one vote in favor and two votes against or something
> >> like that, so it didn't go anywhere.  
> > 
> > Argh.
> >   
> >> I didn't realize that there were handy command-line tools for manipulating
> >> json like that, or I probably would have considered that idea more
> >> strongly.  
> > 
> > That was indeed a lengthy thread with various details discussed. I'm sorry I
> > didn't catch the ball back then.  
> 
> One of the reasons to use JSON was to be able to use command line tools 
> like jq to do tasks (I use it myself).

That's perfectly fine. I was only wondering about having the manifest checksum
outside of the manifest itself.

> But I think only the pg_verifybackup tool should be used to verify the
> internal checksum.

true.

> Two thoughts:
> 
> 1) You can always generate an external checksum when you generate the 
> backup if you want to do your own verification without running 
> pg_verifybackup.

Sure, but by the time I want to produce an external checksum, the manifest
would have travel around quite a bit with various danger on its way to corrupt
it. Checksuming it from the original process that produced it sounds safer.

> 2) Perhaps it would be good if the pg_verifybackup command had a 
> --verify-manifest-checksum option (or something) to check that the 
> manifest file looks valid without checking any files. That's not going 
> to happen for PG13, but it's possible for PG14.

Sure.

I just liked the idea to be able to check the manifest using an external
command line implementing the same standardized checksum algo. Without editing
the manifest first. But I understand it's too late to discuss this now.

Regards,




Re: documenting the backup manifest file format

2020-04-15 Thread David Steele

On 4/15/20 6:43 PM, Jehan-Guillaume de Rorthais wrote:

On Wed, 15 Apr 2020 12:03:28 -0400
Robert Haas  wrote:


On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
 wrote:

But for backup_manifest, it's kind of shame we have to check the checksum
against an transformed version of the file. Did you consider creating eg. a
separate backup_manifest.sha256 file?

I'm very sorry in advance if this has been discussed previously.


It was briefly mentioned in the original (lengthy) discussion, but I
think there was one vote in favor and two votes against or something
like that, so it didn't go anywhere.


Argh.


I didn't realize that there were handy command-line tools for manipulating
json like that, or I probably would have considered that idea more strongly.


That was indeed a lengthy thread with various details discussed. I'm sorry I
didn't catch the ball back then.


One of the reasons to use JSON was to be able to use command line tools 
like jq to do tasks (I use it myself). But I think only the 
pg_verifybackup tool should be used to verify the internal checksum.


Two thoughts:

1) You can always generate an external checksum when you generate the 
backup if you want to do your own verification without running 
pg_verifybackup.


2) Perhaps it would be good if the pg_verifybackup command had a 
--verify-manifest-checksum option (or something) to check that the 
manifest file looks valid without checking any files. That's not going 
to happen for PG13, but it's possible for PG14.


Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-15 Thread Jehan-Guillaume de Rorthais
On Wed, 15 Apr 2020 12:03:28 -0400
Robert Haas  wrote:

> On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
>  wrote:
> > But for backup_manifest, it's kind of shame we have to check the checksum
> > against an transformed version of the file. Did you consider creating eg. a
> > separate backup_manifest.sha256 file?
> >
> > I'm very sorry in advance if this has been discussed previously.  
> 
> It was briefly mentioned in the original (lengthy) discussion, but I
> think there was one vote in favor and two votes against or something
> like that, so it didn't go anywhere.

Argh.

> I didn't realize that there were handy command-line tools for manipulating
> json like that, or I probably would have considered that idea more strongly.

That was indeed a lengthy thread with various details discussed. I'm sorry I
didn't catch the ball back then.

Regards,




Re: documenting the backup manifest file format

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
 wrote:
> But for backup_manifest, it's kind of shame we have to check the checksum
> against an transformed version of the file. Did you consider creating eg. a
> separate backup_manifest.sha256 file?
>
> I'm very sorry in advance if this has been discussed previously.

It was briefly mentioned in the original (lengthy) discussion, but I
think there was one vote in favor and two votes against or something
like that, so it didn't go anywhere. I didn't realize that there were
handy command-line tools for manipulating json like that, or I
probably would have considered that idea more strongly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-04-15 Thread Jehan-Guillaume de Rorthais
On Tue, 14 Apr 2020 12:56:49 -0400
Robert Haas  wrote:

> On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera 
> wrote:
> > Yeah, I guess I'm just saying that it feels brittle to have a file
> > format that's supposed to be good for data exchange and then make it
> > itself depend on representation details such as the order that fields
> > appear in, the letter case, or the format of newlines.  Maybe this isn't
> > really of concern, but it seemed strange.  
> 
> I didn't want to use JSON for this at all, but I got outvoted. When I
> raised this issue, it was suggested that I deal with it in this way,
> so I did. I can't really defend it too far beyond that, although I do
> think that one nice thing about this is that you can verify the
> checksum using shell commands if you want. Just figure out the number
> of lines in the file, minus one, and do head -n$LINES backup_manifest
> | shasum -a256 and boom. If there were some whitespace-skipping thing
> figuring out how to reproduce the checksum calculation would be hard.

FWIW, shell commands (md5sum and sha*sum) read checksums from a separate file
with a very simple format: one file per line with format "CHECKSUM FILEPATH".

Thanks to json, it is fairly easy to extract checksums and filenames from the
current manifest file format and check them all with one command:

  jq -r '.Files|.[]|.Checksum+" "+.Path' backup_manifest > checksums.sha256
  sha256sum --check --quiet checksums.sha256

You can even pipe these commands together to avoid the intermediary file.

But for backup_manifest, it's kind of shame we have to check the checksum
against an transformed version of the file. Did you consider creating eg. a
separate backup_manifest.sha256 file?

I'm very sorry in advance if this has been discussed previously.

Regards,




Re: documenting the backup manifest file format

2020-04-15 Thread Robert Haas
On Tue, Apr 14, 2020 at 11:49 PM Fujii Masao
 wrote:
> While reading the document that you pushed, I thought that it's better
> to define index term for backup manifest, so that we can easily reach
> this document from the index page. Thought? Patch attached.

Fine with me. I tend not to think about the index very much, so I'm
glad you are. :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-04-14 Thread Fujii Masao



On 2020/04/14 2:40, Robert Haas wrote:

On Fri, Mar 27, 2020 at 4:32 PM Andres Freund  wrote:

I don't like having a file format that's intended to be used by external
tools too that's undocumented except for code that assembles it in a
piecemeal fashion.  Do you mean in a follow-on patch this release, or
later? I don't have a problem with the former.


Here is a patch for that.


While reading the document that you pushed, I thought that it's better
to define index term for backup manifest, so that we can easily reach
this document from the index page. Thought? Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/backup-manifest.sgml 
b/doc/src/sgml/backup-manifest.sgml
index 376aff0d6d..b9634f2706 100644
--- a/doc/src/sgml/backup-manifest.sgml
+++ b/doc/src/sgml/backup-manifest.sgml
@@ -3,6 +3,10 @@
 
  Backup Manifest Format
 
+  
+   Backup Manifest
+  
+
   
The backup manifest generated by  is
primarily intended to permit the backup to be verified using


Re: documenting the backup manifest file format

2020-04-14 Thread David Steele

On 4/14/20 4:11 PM, Alvaro Herrera wrote:

On 2020-Apr-14, David Steele wrote:


Happily ISO-8601 is always UTC.


Uh, it is not --
https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators


Whoops, you are correct. I've just never seen non-UTC in the wild yet.

--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-14 Thread Andrew Dunstan


On 4/14/20 4:09 PM, Alvaro Herrera wrote:
> On 2020-Apr-14, Andrew Dunstan wrote:
>
>> OK, but I think if we're putting a timestamp string in ISO-8601 format
>> in the manifest it should be in UTC / Zulu time, precisely to avoid
>> these issues. If that's too much trouble then yes an epoch time will
>> probably do.
> The timestamp is always specified and always UTC (except the code calls
> it GMT).
>
> +   /*
> +* Convert last modification time to a string and append it to the
> +* manifest. Since it's not clear what time zone to use and since time
> +* zone definitions can change, possibly causing confusion, use GMT
> +* always.
> +*/
> +   appendStringInfoString(&buf, "\"Last-Modified\": \"");
> +   enlargeStringInfo(&buf, 128);
> +   buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z",
> +  pg_gmtime(&mtime));
> +   appendStringInfoString(&buf, "\"");
>
> I was merely saying that it's trivial to make this iso-8601 compliant as
>
> buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ",
>
> ie. omit the "GMT" string and replace it with a literal Z, and remove
> the space and replace it with a T.
>

+1


cheers


andre



-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: documenting the backup manifest file format

2020-04-14 Thread Alvaro Herrera
On 2020-Apr-14, David Steele wrote:

> Happily ISO-8601 is always UTC.

Uh, it is not --
https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: documenting the backup manifest file format

2020-04-14 Thread Alvaro Herrera
On 2020-Apr-14, Andrew Dunstan wrote:

> OK, but I think if we're putting a timestamp string in ISO-8601 format
> in the manifest it should be in UTC / Zulu time, precisely to avoid
> these issues. If that's too much trouble then yes an epoch time will
> probably do.

The timestamp is always specified and always UTC (except the code calls
it GMT).

+   /*
+* Convert last modification time to a string and append it to the
+* manifest. Since it's not clear what time zone to use and since time
+* zone definitions can change, possibly causing confusion, use GMT
+* always.
+*/
+   appendStringInfoString(&buf, "\"Last-Modified\": \"");
+   enlargeStringInfo(&buf, 128);
+   buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z",
+  pg_gmtime(&mtime));
+   appendStringInfoString(&buf, "\"");

I was merely saying that it's trivial to make this iso-8601 compliant as

buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ",

ie. omit the "GMT" string and replace it with a literal Z, and remove
the space and replace it with a T.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: documenting the backup manifest file format

2020-04-14 Thread David Steele

On 4/14/20 3:55 PM, Andrew Dunstan wrote:


OK, but I think if we're putting a timestamp string in ISO-8601 format
in the manifest it should be in UTC / Zulu time, precisely to avoid
these issues. If that's too much trouble then yes an epoch time will
probably do.


Happily ISO-8601 is always UTC. The problem I'm referring to is the 
timezone setting on the host system when doing conversions in C.


To be fair most languages handle this well and C is C so I'm not sure we 
need to make a big deal of it. In JSON/XML it's pretty common to use 
ISO-8601 so that seems like a rational choice.


Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-14 Thread Andrew Dunstan


On 4/14/20 3:19 PM, David Steele wrote:
> On 4/14/20 3:03 PM, Andrew Dunstan wrote:
>>
>> On 4/14/20 1:33 PM, David Steele wrote:
>>> On 4/14/20 1:27 PM, Alvaro Herrera wrote:
 On 2020-Apr-14, David Steele wrote:

> On 4/14/20 12:56 PM, Robert Haas wrote:
>
>> Hmm, did David suggest that before? I don't recall for sure. I think
>> he had some suggestion, but I'm not sure if it was the same one.
>
> "I'm also partial to using epoch time in the manifest because it is
> generally easier for programs to work with.  But, human-readable
> doesn't
> suck, either."

 Ugh.  If you go down that road, why write human-readable contents at
 all?  You may as well just use a binary format.  But that's a very
 slippery slope and you won't like to be in the bottom -- I don't see
 what that gains you.  It's not like it's a lot of work to parse a
 timestamp in a non-internationalized well-defined human-readable
 format.
>>>
>>> Well, times are a special case because they are so easy to mess up.
>>> Try converting ISO-8601 to epoch time using the standard C functions
>>> on a system where TZ != UTC. Fun times.
>>
>> Even if it's a zulu time? That would be pretty damn sad.
> ZULU/GMT/UTC are all fine. But if the server timezone is EDT for
> example (not that I recommend this) you are likely to get the wrong
> result. Results vary based on your platform. For instance, we found
> MacOS was more likely to work the way you would expect and Linux was
> hopeless.
>
> There are all kinds of fun tricks to get around this (sort of). One is
> to temporarily set TZ=UTC which sucks if an error happens before it
> gets set back. There are some hacks to try to determine your offset
> which have inherent race conditions around DST changes.
>
> After some experimentation we just used the Posix definition for epoch
> time and used that to do our conversions:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16
>
>
>

OK, but I think if we're putting a timestamp string in ISO-8601 format
in the manifest it should be in UTC / Zulu time, precisely to avoid
these issues. If that's too much trouble then yes an epoch time will
probably do.


cheers


andrew



-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: documenting the backup manifest file format

2020-04-14 Thread David Steele

On 4/14/20 3:03 PM, Andrew Dunstan wrote:


On 4/14/20 1:33 PM, David Steele wrote:

On 4/14/20 1:27 PM, Alvaro Herrera wrote:

On 2020-Apr-14, David Steele wrote:


On 4/14/20 12:56 PM, Robert Haas wrote:


Hmm, did David suggest that before? I don't recall for sure. I think
he had some suggestion, but I'm not sure if it was the same one.


"I'm also partial to using epoch time in the manifest because it is
generally easier for programs to work with.  But, human-readable
doesn't
suck, either."


Ugh.  If you go down that road, why write human-readable contents at
all?  You may as well just use a binary format.  But that's a very
slippery slope and you won't like to be in the bottom -- I don't see
what that gains you.  It's not like it's a lot of work to parse a
timestamp in a non-internationalized well-defined human-readable format.


Well, times are a special case because they are so easy to mess up.
Try converting ISO-8601 to epoch time using the standard C functions
on a system where TZ != UTC. Fun times.


Even if it's a zulu time? That would be pretty damn sad.
ZULU/GMT/UTC are all fine. But if the server timezone is EDT for example 
(not that I recommend this) you are likely to get the wrong result. 
Results vary based on your platform. For instance, we found MacOS was 
more likely to work the way you would expect and Linux was hopeless.


There are all kinds of fun tricks to get around this (sort of). One is 
to temporarily set TZ=UTC which sucks if an error happens before it gets 
set back. There are some hacks to try to determine your offset which 
have inherent race conditions around DST changes.


After some experimentation we just used the Posix definition for epoch 
time and used that to do our conversions:


https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16

Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-14 Thread Andrew Dunstan


On 4/14/20 1:33 PM, David Steele wrote:
> On 4/14/20 1:27 PM, Alvaro Herrera wrote:
>> On 2020-Apr-14, David Steele wrote:
>>
>>> On 4/14/20 12:56 PM, Robert Haas wrote:
>>>
 Hmm, did David suggest that before? I don't recall for sure. I think
 he had some suggestion, but I'm not sure if it was the same one.
>>>
>>> "I'm also partial to using epoch time in the manifest because it is
>>> generally easier for programs to work with.  But, human-readable
>>> doesn't
>>> suck, either."
>>
>> Ugh.  If you go down that road, why write human-readable contents at
>> all?  You may as well just use a binary format.  But that's a very
>> slippery slope and you won't like to be in the bottom -- I don't see
>> what that gains you.  It's not like it's a lot of work to parse a
>> timestamp in a non-internationalized well-defined human-readable format.
>
> Well, times are a special case because they are so easy to mess up.
> Try converting ISO-8601 to epoch time using the standard C functions
> on a system where TZ != UTC. Fun times.
>
>


Even if it's a zulu time? That would be pretty damn sad.


cheers


andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: documenting the backup manifest file format

2020-04-14 Thread David Steele

On 4/14/20 1:27 PM, Alvaro Herrera wrote:

On 2020-Apr-14, David Steele wrote:


On 4/14/20 12:56 PM, Robert Haas wrote:


Hmm, did David suggest that before? I don't recall for sure. I think
he had some suggestion, but I'm not sure if it was the same one.


"I'm also partial to using epoch time in the manifest because it is
generally easier for programs to work with.  But, human-readable doesn't
suck, either."


Ugh.  If you go down that road, why write human-readable contents at
all?  You may as well just use a binary format.  But that's a very
slippery slope and you won't like to be in the bottom -- I don't see
what that gains you.  It's not like it's a lot of work to parse a
timestamp in a non-internationalized well-defined human-readable format.


Well, times are a special case because they are so easy to mess up. Try 
converting ISO-8601 to epoch time using the standard C functions on a 
system where TZ != UTC. Fun times.


Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-14 Thread Alvaro Herrera
On 2020-Apr-14, David Steele wrote:

> On 4/14/20 12:56 PM, Robert Haas wrote:
>
> > Hmm, did David suggest that before? I don't recall for sure. I think
> > he had some suggestion, but I'm not sure if it was the same one.
> 
> "I'm also partial to using epoch time in the manifest because it is
> generally easier for programs to work with.  But, human-readable doesn't
> suck, either."

Ugh.  If you go down that road, why write human-readable contents at
all?  You may as well just use a binary format.  But that's a very
slippery slope and you won't like to be in the bottom -- I don't see
what that gains you.  It's not like it's a lot of work to parse a
timestamp in a non-internationalized well-defined human-readable format.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: documenting the backup manifest file format

2020-04-14 Thread David Steele

On 4/14/20 12:56 PM, Robert Haas wrote:

On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera  wrote:

Yeah, I guess I'm just saying that it feels brittle to have a file
format that's supposed to be good for data exchange and then make it
itself depend on representation details such as the order that fields
appear in, the letter case, or the format of newlines.  Maybe this isn't
really of concern, but it seemed strange.


I didn't want to use JSON for this at all, but I got outvoted. When I
raised this issue, it was suggested that I deal with it in this way,
so I did. I can't really defend it too far beyond that, although I do
think that one nice thing about this is that you can verify the
checksum using shell commands if you want. Just figure out the number
of lines in the file, minus one, and do head -n$LINES backup_manifest
| shasum -a256 and boom. If there were some whitespace-skipping thing
figuring out how to reproduce the checksum calculation would be hard.


I think strict ISO 8601 might be preferable (with the T in the middle
and ending in Z instead of " GMT").


Hmm, did David suggest that before? I don't recall for sure. I think
he had some suggestion, but I'm not sure if it was the same one.


"I'm also partial to using epoch time in the manifest because it is 
generally easier for programs to work with.  But, human-readable doesn't 
suck, either."


Also you don't need to worry about time-zone conversion errors -- even 
if the source time is UTC this can easily happen if you are not careful. 
It also saves a parsing step.


The downside is it is not human-readable but this is intended to be a 
machine-readable format so I don't think it's a big deal (encoded 
filenames will be just as opaque). If a user really needs to know what 
time some file is (rare, I think) they can paste it with a web tool to 
find out.


Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-14 Thread Robert Haas
On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera  wrote:
> Yeah, I guess I'm just saying that it feels brittle to have a file
> format that's supposed to be good for data exchange and then make it
> itself depend on representation details such as the order that fields
> appear in, the letter case, or the format of newlines.  Maybe this isn't
> really of concern, but it seemed strange.

I didn't want to use JSON for this at all, but I got outvoted. When I
raised this issue, it was suggested that I deal with it in this way,
so I did. I can't really defend it too far beyond that, although I do
think that one nice thing about this is that you can verify the
checksum using shell commands if you want. Just figure out the number
of lines in the file, minus one, and do head -n$LINES backup_manifest
| shasum -a256 and boom. If there were some whitespace-skipping thing
figuring out how to reproduce the checksum calculation would be hard.

> I think strict ISO 8601 might be preferable (with the T in the middle
> and ending in Z instead of " GMT").

Hmm, did David suggest that before? I don't recall for sure. I think
he had some suggestion, but I'm not sure if it was the same one.

> > > Why is the top-level checksum only allowed to be SHA-256, if the
> > > files can use up to SHA-512?
>
> Thanks for the discussion.  I think you mostly want to make sure that
> the manifest is sensible (not corrupt) rather than defend against
> somebody maliciously giving you an attacking manifest (??).  I incline
> to agree that any SHA-2 hash is going to serve that purpose and have no
> further comment to make.

The code has other sanity checks against the manifest failing to parse
properly, so you can't (I hope) crash it or anything even if you
falsify the checksum. But suppose that there is a gremlin running
around your system flipping occasional bits. If said gremlin flips a
bit in a "0" that appears in a file's checksum string, it could become
a "1", a "3", or a "7", all of which are still valid characters for a
hex string. When you then tried to verify the backup, verification for
that file would fail, but you'd think it was a problem with the file,
rather than a problem with the manifest. The manifest checksum
prevents that: you'll get a complaint about the manifest checksum
being wrong rather than a complaint about the file not matching the
manifest checksum. A sufficiently smart gremlin could figure out the
expected checksum for the revised manifest and flip bits to make the
actual value match the expected one, but I think we're worried about
"chaotic neutral" gremlins, not "lawful evil" ones.

That having been said, there was some discussion on the original
thread about keeping your backup on regular storage and your manifest
checksum in a concrete bunker at the bottom of the ocean; in that
scenario, it should be possible to detect tampering in either the
manifest itself or in non-WAL data files, as long as the adversary
can't break SHA-256. But I'm not sure how much we should really worry
about that. For me, the design center for this feature is a user who
untars base.tar and forgets about 43965.tar. If that person runs
pg_verifybackup, it's gonna tell them that things are broken, and
that's good enough for me. It may not be good enough for everybody,
but it's good enough for me.

I think I'm going to go ahed and push this now, maybe with a small
wording tweak as discussed upthread with Andrew. The rest of this
discussion is really about whether the patch needs any design changes
rather than about whether the documentation describes what the patch
does, so it makes sense to me to commit this first and then if
somebody wants to argue for a change they certainly can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-04-13 Thread Alvaro Herrera
On 2020-Apr-13, Robert Haas wrote:

> On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera  
> wrote:
> > Are these hex figures upper or lower case?  No leading zeroes?  This
> > would normally not matter, but the toplevel checksum will care.
> 
> Not really. You just feed the whole file except for the last line
> through shasum and you get the answer.
> 
> It so happens that the server generates lower-case, but
> pg_verifybackup will accept either.
> 
> Leading zeroes are not omitted. If the checksum's not the right
> length, it ain't gonna work. If SHA is used, it's the same output you
> would get from running shasum -a on the file, which is
> certainly a fixed length. I assumed that this followed from the
> statement that there are two characters per byte in the checksum, and
> from the fact that no checksum algorithm I know about drops leading
> zeroes in the output.

Eh, apologies, I was completely unclear -- I was looking at the LSN
fields when writing the above.  So the leading zeroes and letter case
comment refers to those in the LSN values.  I agree that it doesn't
matter as long as the same tool generates the json file and writes the
checksum.

> > Also, I see no mention of prettification-chars such as newlines or
> > indentation.  I suppose if I pass a manifest file through
> > prettification (or Windows newline conversion), the checksum may
> > break.
> 
> It would indeed break. I'm not sure what you want me to say here,
> though. If you're trying to parse a manifest, you shouldn't care about
> how the whitespace is arranged. If you're trying to generate one, you
> can arrange it any way you like, as long as you also include it in the
> checksum.

Yeah, I guess I'm just saying that it feels brittle to have a file
format that's supposed to be good for data exchange and then make it
itself depend on representation details such as the order that fields
appear in, the letter case, or the format of newlines.  Maybe this isn't
really of concern, but it seemed strange.

> > As for Last-Modification, I think the spec should indicate the exact
> > format that's used, because it'll also be critical for checksumming.
> 
> Again, I don't think it really matters for checksumming, but it's
> "-MM-DD HH:MM:SS TZ" format, where TZ is always GMT.

I agree that whatever format you use will work as long as it isn't
modified.

I think strict ISO 8601 might be preferable (with the T in the middle
and ending in Z instead of " GMT").

> > Why is the top-level checksum only allowed to be SHA-256, if the
> > files can use up to SHA-512?

Thanks for the discussion.  I think you mostly want to make sure that
the manifest is sensible (not corrupt) rather than defend against
somebody maliciously giving you an attacking manifest (??).  I incline
to agree that any SHA-2 hash is going to serve that purpose and have no
further comment to make.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: documenting the backup manifest file format

2020-04-13 Thread David Steele

On 4/13/20 4:14 PM, Robert Haas wrote:

On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera  wrote:


Also, I
see no mention of prettification-chars such as newlines or indentation.
I suppose if I pass a manifest file through prettification (or Windows
newline conversion), the checksum may break.


It would indeed break. I'm not sure what you want me to say here,
though. If you're trying to parse a manifest, you shouldn't care about
how the whitespace is arranged. If you're trying to generate one, you
can arrange it any way you like, as long as you also include it in the
checksum.


pgBackRest ignores whitespace but this is a legacy of the way Perl 
calculated checksums, not an intentional feature. This worked well when 
the manifest was loaded as a whole, converted to JSON, and checksummed, 
but it is a major pain for the streaming code we now have in C.


I guarantee that that our next manifest version will do a simple 
checksum of bytes as Robert has done in this feature.


So, I'm +1 as implemented.


Why is the top-level checksum only allowed to be SHA-256, if the files
can use up to SHA-512?





I agree that it's a little bit weird that you can have a stronger
checksum for the files instead of the manifest itself, but I also
wonder what the use case would be for using a stronger checksum on the
manifest. David Steele argued that strong checksums on the files could
be useful to software that wants to rifle through all the backups
you've ever taken and find another copy of that file by looking for
something with a matching checksum. CRC-32C wouldn't be strong enough
for that, because eventually you could have enough files that you
start to have collisions. The SHA algorithms output enough bits to
make that quite unlikely. But this argument only makes sense for the
files, not the manifest.


Agreed. I think SHA-256 is *more* than enough to protect the manifest 
against corruption. That said, since the cost of SHA-256 vs. SHA-512 in 
the context on the manifest is negligible we could just use the stronger 
algorithm to deflect a similar question going forward.


That choice might not age well, but we could always say, well, we picked 
it because it was the strongest available at the time. Allowing a choice 
of which algorithm to use for to manifest checksum seems like it will 
just make verifying the file harder with no tangible benefit.


Maybe just a comment in the docs about why SHA-256 was used would be fine.


(Also, did we intentionally omit the dash in
hash names, so "SHA-256" to make it SHA256?  This will also be critical
for checksumming the manifest itself.)


I debated this with myself, settled on this spelling, and nobody
complained until now. It could be changed, though. I didn't have any
particular reason for choosing it except the feeling that people would
probably prefer to type --manifest-checksum=sha256 rather than
--manifest-checksum=sha-256.


+1 for sha256 rather than sha-256.

Regards,
--
-David
da...@pgmasters.net




Re: documenting the backup manifest file format

2020-04-13 Thread Robert Haas
On Mon, Apr 13, 2020 at 4:10 PM Andrew Dunstan
 wrote:
> Seems ok. A tiny example, or an excerpt, might be nice.

An empty database produces a manifest about 1200 lines long, so a full
example seems like too much to include in the documentation. An
excerpt could be included, I suppose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-04-13 Thread Robert Haas
On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera  wrote:
> Are these hex figures upper or lower case?  No leading zeroes?  This
> would normally not matter, but the toplevel checksum will care.

Not really. You just feed the whole file except for the last line
through shasum and you get the answer.

It so happens that the server generates lower-case, but
pg_verifybackup will accept either.

Leading zeroes are not omitted. If the checksum's not the right
length, it ain't gonna work. If SHA is used, it's the same output you
would get from running shasum -a on the file, which is
certainly a fixed length. I assumed that this followed from the
statement that there are two characters per byte in the checksum, and
from the fact that no checksum algorithm I know about drops leading
zeroes in the output.

> Also, I
> see no mention of prettification-chars such as newlines or indentation.
> I suppose if I pass a manifest file through prettification (or Windows
> newline conversion), the checksum may break.

It would indeed break. I'm not sure what you want me to say here,
though. If you're trying to parse a manifest, you shouldn't care about
how the whitespace is arranged. If you're trying to generate one, you
can arrange it any way you like, as long as you also include it in the
checksum.

> As for Last-Modification, I think the spec should indicate the exact
> format that's used, because it'll also be critical for checksumming.

Again, I don't think it really matters for checksumming, but it's
"-MM-DD HH:MM:SS TZ" format, where TZ is always GMT.

> Why is the top-level checksum only allowed to be SHA-256, if the files
> can use up to SHA-512?

If we allowed the top-level checksum to be changed to something else,
then we'd probably we want to indicate which kind of checksum is being
used at the beginning of the file, so as to enable incremental parsing
with checksum verification at the end. pg_verifybackup doesn't
currently do incremental parsing, but I'd like to add that sometime,
if I get time to hash out the details. I think the use case for
varying the checksum type of the manifest itself is much less than for
varying it for the files. The big problem with checksumming the files
is that it can be slow, because the files can be big. However, unless
you have a truckload of empty files in the database, the manifest is
going to be very small compared to the sizes of all the files, so it
seemed harmless to use a stronger checksum algorithm for the manifest
itself. Maybe someone with a ton of empty or nearly-empty relations
will complain, but they can always use --no-manifest if they want.

I agree that it's a little bit weird that you can have a stronger
checksum for the files instead of the manifest itself, but I also
wonder what the use case would be for using a stronger checksum on the
manifest. David Steele argued that strong checksums on the files could
be useful to software that wants to rifle through all the backups
you've ever taken and find another copy of that file by looking for
something with a matching checksum. CRC-32C wouldn't be strong enough
for that, because eventually you could have enough files that you
start to have collisions. The SHA algorithms output enough bits to
make that quite unlikely. But this argument only makes sense for the
files, not the manifest.

Naturally, all this is arguable, though, and a good deal of arguing
about it has been done, as you have probably noticed. I am still of
the opinion that if somebody's goal is to use this facility for its
intended purpose, which is to find out whether your backup got
corrupted, any of these algorithms are fine, and are highly likely to
tell you that you have a problem if, in fact, you do. In fact, I bet
that even a checksum algorithm considerably stupider than anything I'd
actually consider using would accomplish that goal in a high
percentage of cases. But not everybody agrees with me, to the point
where I am starting to wonder if I really understand how computers
work.

> (Also, did we intentionally omit the dash in
> hash names, so "SHA-256" to make it SHA256?  This will also be critical
> for checksumming the manifest itself.)

I debated this with myself, settled on this spelling, and nobody
complained until now. It could be changed, though. I didn't have any
particular reason for choosing it except the feeling that people would
probably prefer to type --manifest-checksum=sha256 rather than
--manifest-checksum=sha-256.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-04-13 Thread Andrew Dunstan


On 4/13/20 1:40 PM, Robert Haas wrote:
> On Fri, Mar 27, 2020 at 4:32 PM Andres Freund  wrote:
>> I don't like having a file format that's intended to be used by external
>> tools too that's undocumented except for code that assembles it in a
>> piecemeal fashion.  Do you mean in a follow-on patch this release, or
>> later? I don't have a problem with the former.
> Here is a patch for that.
>


Seems ok. A tiny example, or an excerpt, might be nice.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: documenting the backup manifest file format

2020-04-13 Thread Robert Haas
On Mon, Apr 13, 2020 at 2:28 PM Erik Rijkers  wrote:
> Can you double check this sentence?  Seems strange to me but I don't
> know why; it may well be that my english is not good enough.  Maybe a
> comma after 'required' makes reading easier?
>
> The timeline from which this range of WAL records will be required in
> order to make use of this backup. The value is an integer.

It sounds a little awkward to me, but not outright wrong. I'm not
exactly sure how to rephrase it, though. Maybe just shorten it to "the
timeline for this range of WAL records"?

> One typo:
>
> 'when making using'  should be
> 'when making use'

Right, thanks, fixed in my local copy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: documenting the backup manifest file format

2020-04-13 Thread Alvaro Herrera
+  The LSN at which replay must begin on the indicated timeline in order to
+  make use of this backup.  The LSN is stored in the format normally used
+  by PostgreSQL; that is, it is a string
+  consisting of two strings of hexademical characters, each with a length
+  of between 1 and 8, separated by a slash.

typo "hexademical"

Are these hex figures upper or lower case?  No leading zeroes?  This
would normally not matter, but the toplevel checksum will care.  Also, I
see no mention of prettification-chars such as newlines or indentation.
I suppose if I pass a manifest file through prettification (or Windows
newline conversion), the checksum may break.

As for Last-Modification, I think the spec should indicate the exact
format that's used, because it'll also be critical for checksumming.

Why is the top-level checksum only allowed to be SHA-256, if the files
can use up to SHA-512?  (Also, did we intentionally omit the dash in
hash names, so "SHA-256" to make it SHA256?  This will also be critical
for checksumming the manifest itself.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: documenting the backup manifest file format

2020-04-13 Thread Erik Rijkers

On 2020-04-13 20:08, Robert Haas wrote:

[v2-0001-Document-the-backup-manifest-file-format.patch]


Can you double check this sentence?  Seems strange to me but I don't 
know why; it may well be that my english is not good enough.  Maybe a 
comma after 'required' makes reading easier?


   The timeline from which this range of WAL records will be required in
   order to make use of this backup. The value is an integer.


One typo:

'when making using'  should be
'when making use'



Erik Rijkers





Re: documenting the backup manifest file format

2020-04-13 Thread Robert Haas
On Mon, Apr 13, 2020 at 1:55 PM Justin Pryzby  wrote:
> typos:
> manifes
> hexademical (twice)

Thanks. v2 attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-Document-the-backup-manifest-file-format.patch
Description: Binary data


Re: documenting the backup manifest file format

2020-04-13 Thread Justin Pryzby
On Mon, Apr 13, 2020 at 01:40:56PM -0400, Robert Haas wrote:
> On Fri, Mar 27, 2020 at 4:32 PM Andres Freund  wrote:
> > I don't like having a file format that's intended to be used by external
> > tools too that's undocumented except for code that assembles it in a
> > piecemeal fashion.  Do you mean in a follow-on patch this release, or
> > later? I don't have a problem with the former.
> 
> Here is a patch for that.

typos:
manifes
hexademical (twice)

-- 
Justin