Re: RFC: cksum --base64/-b support

2023-01-31 Thread Jim Meyering
On Tue, Jan 31, 2023 at 1:43 PM Pádraig Brady  wrote:
> s/accepted/supports/ in NEWS

Thanks. Good catch.
I prefer to use "accepts", matching the entry just below.
Will push soon.



Re: RFC: cksum --base64/-b support

2023-01-31 Thread Pádraig Brady

s/accepted/supports/ in NEWS

Otherwise good to push.

thanks!
Pádraig



Re: RFC: cksum --base64/-b support

2023-01-31 Thread Jim Meyering
On Tue, Jan 31, 2023 at 5:00 AM Pádraig Brady  wrote:
> On 31/01/2023 06:48, Jim Meyering wrote:
> > On Mon, Jan 30, 2023 at 11:29 AM Pádraig Brady  wrote:
> >> ...
> >
> > Thanks for the speedy feedback.
> >
> >> "If must be followed by white space." comment has a typo
> >> and also not enforced explicitly, so could be removed.
> >
> > Thanks. Removed.
> >
> >> valid_digits() may check beyond the end of the buffer
> >> in the case len != BASE64_LENGTH.
> >
> > Please share how that could happen.
> > I don't see how there could be a read-overrun there, since the input
> > string is always NUL-terminated.
>
> Oh right I missed that.
> Indeed the !isxdigit() would have terminated upon reaching the NUL.
> Thanks for using the len though as it's easier to read at least.
>
> Similarly I now see that the `while (!ISWHITE())` can write past the NUL 
> terminator,
> which could be triggered with a line full of spaces for example.
> Patch suggested below...

Good catch. An input of just three NUL bytes would trigger a read
buffer overrun:

  printf '\0\0\0' |valgrind src/cksum -a sha1 --check

V3 has a test to cover that.

> > V2 attached.
>
> We should also mention that --check autodetects the encoding.
> I wonder should we have the separate *sum utils autodetect the encoding also,
> but not support the --base64 option. Then `cksum -a blah -b` format
> would be supported by `blahsum -c`. Perhaps it's best to restrict to cksum,
> to aid in the deprecation of the separate utils.

I too prefer to limit this feature addition to cksum.

> So suggested changes...
>
> diff --git a/doc/coreutils.texi b/doc/coreutils.texi

Thanks. Applied along with test addition.


cu-cksum-base64-v3.diff
Description: Binary data


Re: RFC: cksum --base64/-b support

2023-01-31 Thread Pádraig Brady

On 31/01/2023 06:48, Jim Meyering wrote:

On Mon, Jan 30, 2023 at 11:29 AM Pádraig Brady  wrote:

...


Thanks for the speedy feedback.


"If must be followed by white space." comment has a typo
and also not enforced explicitly, so could be removed.


Thanks. Removed.


valid_digits() may check beyond the end of the buffer
in the case len != BASE64_LENGTH.


Please share how that could happen.
I don't see how there could be a read-overrun there, since the input
string is always NUL-terminated.


Oh right I missed that.
Indeed the !isxdigit() would have terminated upon reaching the NUL.
Thanks for using the len though as it's easier to read at least.

Similarly I now see that the `while (!ISWHITE())` can write past the NUL 
terminator,
which could be triggered with a line full of spaces for example.
Patch suggested below...


However, I've taken your suggestion below:


Perhaps change the "else" to "else if (len == digest_hex_bytes)".
Similarly it may be more defensive / efficient to pass
digest_len out of split_3().


I'd considered whether to make *split_3 return that length to the
caller, but erred on the side of keeping changes small.
However, given your feedback, I've made both bsd_split_3 and split_3
return digest length via *D_LEN, and thus avoid adding that strlen
call.


non padded base64 encodings are not supported.
Is that OK. Worth mentioning in texinfo if so.


IMHO, we must reject a digest that's had any change to its trailing
"=" (padding) bytes.
Done.


A non padded negative test in cksum-c.sh would be useful.


Good idea.
I've added tests like that in the new test file, removing one or both
"=" bytes in each of those examples.

V2 attached.


We should also mention that --check autodetects the encoding.
I wonder should we have the separate *sum utils autodetect the encoding also,
but not support the --base64 option. Then `cksum -a blah -b` format
would be supported by `blahsum -c`. Perhaps it's best to restrict to cksum,
to aid in the deprecation of the separate utils.

So suggested changes...

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 3624a7d03..f70c3d530 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4184,6 +4184,11 @@ For the @command{cksum} command, the @option{--check} 
option
 supports auto-detecting the digest algorithm to use,
 when presented with checksum information in the @option{--tag} output format.

+Also for the @command{cksum} command, the @option{--check} option
+supports auto-detecting the digest encoding to use,
+when presented with standard hexidecimal checksums or those generated
+with the @command{cksum} @option{--base64} option.
+
 Output with @option{--zero} enabled is not supported by @option{--check}.
 @sp 1
 For each such line, @command{md5sum} reads the named file and computes its
diff --git a/src/digest.c b/src/digest.c
index 1f98ea7bc..c0616fcb2 100644
--- a/src/digest.c
+++ b/src/digest.c
@@ -646,6 +646,9 @@ valid_digits (unsigned char const *s, size_t len)
   ++s;
 }
 }
+  else
+return false;
+
   return *s == '\0';
 }

@@ -847,7 +850,7 @@ split_3 (char *s, size_t s_len,

   /* This field must be the hexadecimal or base64 representation
  of the message digest.  */
-  while (!ISWHITE (s[i]))
+  while (s[i] && !ISWHITE (s[i]))
 i++;

   *d_len = [i] - (char *) *digest;





Re: RFC: cksum --base64/-b support

2023-01-31 Thread Pádraig Brady

On 31/01/2023 06:07, Jim Meyering wrote:

On Mon, Jan 30, 2023 at 10:17 AM Pádraig Brady  wrote:

BTW I noted the following possible option when I recently refactored cksum:

--digest_format={int, hex, base64, binary}
/* cksum output formats:
   int (sum, and cksum default),
   hex (rest default),
   b64 (to reduce size),
   bin (would auto suppress names? restrict to single argument? */


I'm not sure how --digest-format=int -a sha1 would work.


yes that would be awkward, "int" is probably not worth supporting


Or --digest-format=bin -a sum


That would be the bytes in network byte order.
This would be a separate patch anyway.
If we're just considering base64 with no possibility of a future other encoding 
option,
then we can stick with --base64 and --raw to select between base64 and binary.


But doing it that way does provide a way to override a prior --base64 option.
Without that, is a --hex option warranted?


If --hex was warranted then that would tip the balance
in favor of a single --digest-format={base16,base64,binary} option.
I'm 50:50 on whether -b would need to be overridden like that though.
I.e. whether -b would be set as a default in a script or alias.

cheers,
Pádraig



Re: RFC: cksum --base64/-b support

2023-01-30 Thread Jim Meyering
On Mon, Jan 30, 2023 at 11:29 AM Pádraig Brady  wrote:
> ...

Thanks for the speedy feedback.

> "If must be followed by white space." comment has a typo
> and also not enforced explicitly, so could be removed.

Thanks. Removed.

> valid_digits() may check beyond the end of the buffer
> in the case len != BASE64_LENGTH.

Please share how that could happen.
I don't see how there could be a read-overrun there, since the input
string is always NUL-terminated.
However, I've taking your suggestion below:

> Perhaps change the "else" to "else if (len == digest_hex_bytes)".
> Similarly it may be more defensive / efficient to pass
> digest_len out of split_3().

I'd considered whether to make *split_3 return that length to the
caller, but erred on the side of keeping changes small.
However, given your feedback, I've made both bsd_split_3 and split_3
return digest length via *D_LEN, and thus avoid adding that strlen
call.

> non padded base64 encodings are not supported.
> Is that OK. Worth mentioning in texinfo if so.

IMHO, we must reject a digest that's had any change to its trailing
"=" (padding) bytes.
Done.

> A non padded negative test in cksum-c.sh would be useful.

Good idea.
I've added tests like that in the new test file, removing one or both
"=" bytes in each of those examples.

V2 attached.


cu-cksum-base64-v2.diff
Description: Binary data


Re: RFC: cksum --base64/-b support

2023-01-30 Thread Jim Meyering
On Mon, Jan 30, 2023 at 10:17 AM Pádraig Brady  wrote:
> On 29/01/2023 20:40, Jim Meyering wrote:
> > Hi Pádraig! Happy new year (belatedly ;-). Hope you're well.
> >
> > I'd like our generated announcements to be able to include
> > base64-encoded checksums without having to recommend verifying them
> > using openbsd's cksum, so...

Thanks for the long reply.

> This is so the checksums are shorter right?

That was my primary motivation.

> I.e. 4x/3 rather than the 2x for hex.
> Playing devil's advocate, is the complexity of
> generally using base64 for this worth it, since say a 512 bit checksum
> would only reduce from 128 chars to 86 chars?

IMHO, added complexity is small, so yes, it's worth it.

> Also related to this is the use of variable length algorithms
> (supported with the existing -l option).
> A quick scan of https://www.keylength.com/ suggests newer algorithms
> like blake2 blake3 sha3 with -l 256 may be sufficient for this use case
> in which case the difference would only be 64 chars to 44 chars.
> The following demos that, while also using existing tools:
>
>$ sha256sum < /dev/null | xxd -r -p | basenc --base16
>E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
>$ sha256sum < /dev/null | xxd -r -p | basenc --base64
>47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
>
> > I've begun writing the code to add --base64/-b support for GNU cksum,
> > prompted by this thread:
> > https://lists.gnu.org/r/diffutils-devel/2023-01/msg00015.html and the
> > fact that OpenBSD already has this option (as -b):
> > https://man.openbsd.org/cksum
>
> We originally considered this back at:
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=7313
> which showed ways to achieve this with existing tools.
> Though if it was a common operation (like it would be for your use case),
> and for the fact that openbsd already implements this,
> then it would be worth adding an option.
>
> BTW I noted the following possible option when I recently refactored cksum:
>
>--digest_format={int, hex, base64, binary}
>/* cksum output formats:
>   int (sum, and cksum default),
>   hex (rest default),
>   b64 (to reduce size),
>   bin (would auto suppress names? restrict to single argument? */

I'm not sure how --digest-format=int -a sha1 would work.
Or --digest-format=bin -a sum

But doing it that way does provide a way to override a prior --base64 option.
Without that, is a --hex option warranted?



Re: RFC: cksum --base64/-b support

2023-01-30 Thread Pádraig Brady

On 30/01/2023 17:52, Jim Meyering wrote:

On Sun, Jan 29, 2023 at 1:10 PM Jim Meyering  wrote:

On Sun, Jan 29, 2023 at 12:40 PM Jim Meyering  wrote:
...

- I'm inclined to work like the openbsd cksum and accept invocations
like "cksum -a sha1x" and "cksum -a sha1b". Any objection?


Actually, I am now **disinclined** to implement this part. It'd make
sense only if we were able to compute multiple checksums in a single
invocation.
Allowing that would require a fundamental redesign, which feels out of
scope, at least initially.


Here's a proposed patch for that.


"If must be followed by white space." comment has a typo
and also not enforced explicitly, so could be removed.

valid_digits() may check beyond the end of the buffer
in the case len != BASE64_LENGTH.
Perhaps change the "else" to "else if (len == digest_hex_bytes)".
Similarly it may be more defensive / efficient to pass
digest_len out of split_3().

non padded base64 encodings are not supported.
Is that OK. Worth mentioning in texinfo if so.
A non padded negative test in cksum-c.sh would be useful.

thanks!
Pádraig



Re: RFC: cksum --base64/-b support

2023-01-30 Thread Pádraig Brady

On 29/01/2023 20:40, Jim Meyering wrote:

Hi Pádraig! Happy new year (belatedly ;-). Hope you're well.

I'd like our generated announcements to be able to include
base64-encoded checksums without having to recommend verifying them
using openbsd's cksum, so...


This is so the checksums are shorter right?
I.e. 4x/3 rather than the 2x for hex.
Playing devil's advocate, is the complexity of
generally using base64 for this worth it, since say a 512 bit checksum
would only reduce from 128 chars to 86 chars?

Also related to this is the use of variable length algorithms
(supported with the existing -l option).
A quick scan of https://www.keylength.com/ suggests newer algorithms
like blake2 blake3 sha3 with -l 256 may be sufficient for this use case
in which case the difference would only be 64 chars to 44 chars.
The following demos that, while also using existing tools:

  $ sha256sum < /dev/null | xxd -r -p | basenc --base16
  E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
  $ sha256sum < /dev/null | xxd -r -p | basenc --base64
  47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=


I've begun writing the code to add --base64/-b support for GNU cksum,
prompted by this thread:
https://lists.gnu.org/r/diffutils-devel/2023-01/msg00015.html and the
fact that OpenBSD already has this option (as -b):
https://man.openbsd.org/cksum


We originally considered this back at:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=7313
which showed ways to achieve this with existing tools.
Though if it was a common operation (like it would be for your use case),
and for the fact that openbsd already implements this,
then it would be worth adding an option.

BTW I noted the following possible option when I recently refactored cksum:

  --digest_format={int, hex, base64, binary}
  /* cksum output formats:
 int (sum, and cksum default),
 hex (rest default),
 b64 (to reduce size),
 bin (would auto suppress names? restrict to single argument? */


Two questions:
- blake2b's tag is inconsistently capitalized. All of the other tags
are all-caps versions of their lower-case strings, but this one is
spelled BLAKE2b, with a lower-case "b" at the end. I presume this is
desired. Likely too late to change to make it consistent. Arose while
considering how to implement support for the "x" and "b"
option suffixes, to denote "use hex" or "use base64" as encoding,
while the usual default is of course hex, and --base64 changes that.


BLAKE2b came from the original blake2 reference implementation.
I.e. that was the preferred naming, and couldn't be changed now
due to backwards compat.  It's not so bad to work around,
but yes would lead to messier code.


Leading to my second question:
- I'm inclined to work like the openbsd cksum and accept invocations
like "cksum -a sha1x" and "cksum -a sha1b". Any objection?


bsd compat so good.
multiple ways to do something, so possibly confusing.


Also, comparing algorithms, openbsd has two that we don't: rmd160, sha512/256
I'm not interested in adding those in this diff, of course, but it may
be something to consider for compatibility.


Yes I was considering it, also along with sha3 and blake3


What's the schedule for the next release?
Assuming this is desirable, want to include it there? >
My own ETA is variable, depending on pressure/desire.
I've written most of the code (but not yet suffix support) and minimal
tests, but no documentation or NEWS.


I hope to get the next release out in about 3 weeks,
so it would be good to include this if possible.

thanks,
Pádraig




Re: RFC: cksum --base64/-b support

2023-01-30 Thread Jim Meyering
On Sun, Jan 29, 2023 at 1:10 PM Jim Meyering  wrote:
> On Sun, Jan 29, 2023 at 12:40 PM Jim Meyering  wrote:
> ...
> > - I'm inclined to work like the openbsd cksum and accept invocations
> > like "cksum -a sha1x" and "cksum -a sha1b". Any objection?
>
> Actually, I am now **disinclined** to implement this part. It'd make
> sense only if we were able to compute multiple checksums in a single
> invocation.
> Allowing that would require a fundamental redesign, which feels out of
> scope, at least initially.

Here's a proposed patch for that.


cu-cksum-base64.diff
Description: Binary data


Re: RFC: cksum --base64/-b support

2023-01-29 Thread Jim Meyering
On Sun, Jan 29, 2023 at 12:40 PM Jim Meyering  wrote:
...
> - I'm inclined to work like the openbsd cksum and accept invocations
> like "cksum -a sha1x" and "cksum -a sha1b". Any objection?

Actually, I am now **disinclined** to implement this part. It'd make
sense only if we were able to compute multiple checksums in a single
invocation.
Allowing that would require a fundamental redesign, which feels out of
scope, at least initially.