On Sat, Jan 17, 2026 at 5:55 PM Timofei Zhakov <[email protected]> wrote:
> On Tue, Jan 13, 2026 at 11:43 AM Branko Čibej <[email protected]> wrote:
>
>> On 13. 1. 26 09:57, Timofei Zhakov wrote:
>>
>> On Mon, Jan 12, 2026 at 9:17 PM Branko Čibej <[email protected]> wrote:
>>
>>> On 12. 1. 26 18:31, Timofei Zhakov wrote:
>>>
>>> On Mon, Jan 12, 2026 at 3:29 PM Branko Čibej <[email protected]> wrote:
>>>
>>>> On 12. 1. 26 14:30, Timofei Zhakov wrote:
>>>>
>>>> On Mon, Jan 12, 2026 at 1:22 PM Branko Čibej <[email protected]> wrote:
>>>>
>>>>> On 12. 1. 26 12:52, [email protected] wrote:
>>>>>
>>>>> Author: brane
>>>>> Date: Mon Jan 12 11:52:07 2026
>>>>> New Revision: 1931262
>>>>>
>>>>> Log:
>>>>> Fix warnings in the new checksum code.
>>>>>
>>>>> * subversion/libsvn_subr/checksum.c
>>>>>   (svn_checksum): Remove unused variable.
>>>>> * subversion/libsvn_subr/checksum_apr.c: Include <limits.h>.
>>>>>   (svn_checksum__sha1, svn_checksum__sha1_ctx_update): Do not blindly cast
>>>>>   or implicitly narrow the data length. Check the limits first.
>>>>>
>>>>> Modified:
>>>>>   subversion/trunk/subversion/libsvn_subr/checksum.c
>>>>>   subversion/trunk/subversion/libsvn_subr/checksum_apr.c
>>>>>
>>>>> Modified: subversion/trunk/subversion/libsvn_subr/checksum.c
>>>>> ==============================================================================
>>>>> --- subversion/trunk/subversion/libsvn_subr/checksum.c  Mon Jan 12 11:51:21 2026  (r1931261)
>>>>> +++ subversion/trunk/subversion/libsvn_subr/checksum.c  Mon Jan 12 11:52:07 2026  (r1931262)
>>>>> @@ -475,8 +475,6 @@ svn_checksum(svn_checksum_t **checksum,
>>>>>                apr_size_t len,
>>>>>                apr_pool_t *pool)
>>>>>  {
>>>>> -  apr_sha1_ctx_t sha1_ctx;
>>>>> -
>>>>>    SVN_ERR(validate_kind(kind));
>>>>>    *checksum = svn_checksum_create(kind, pool);
>>>>>
>>>>> Modified: subversion/trunk/subversion/libsvn_subr/checksum_apr.c
>>>>> ==============================================================================
>>>>> --- subversion/trunk/subversion/libsvn_subr/checksum_apr.c  Mon Jan 12 11:51:21 2026  (r1931261)
>>>>> +++ subversion/trunk/subversion/libsvn_subr/checksum_apr.c  Mon Jan 12 11:52:07 2026  (r1931262)
>>>>> @@ -21,6 +21,8 @@
>>>>>   * ====================================================================
>>>>>   */
>>>>>
>>>>> +#include <limits.h>
>>>>> +
>>>>>  #include "svn_private_config.h"
>>>>>  #ifdef SVN_CHECKSUM_BACKEND_APR
>>>>>
>>>>> @@ -89,6 +91,10 @@ svn_checksum__sha1(unsigned char *digest
>>>>>                     apr_size_t len)
>>>>>  {
>>>>>    apr_sha1_ctx_t sha1_ctx;
>>>>> +
>>>>> +  /* Do not blindly truncate the data length. */
>>>>> +  SVN_ERR_ASSERT(len < UINT_MAX);
>>>>> +
>>>>>    apr_sha1_init(&sha1_ctx);
>>>>>    apr_sha1_update(&sha1_ctx, data, (unsigned int)len);
>>>>>    apr_sha1_final(digest, &sha1_ctx);
>>>>> @@ -121,7 +127,9 @@ svn_checksum__sha1_ctx_update(svn_checks
>>>>>                                const void *data,
>>>>>                                apr_size_t len)
>>>>>  {
>>>>> -  apr_sha1_update(&ctx->apr_ctx, data, len);
>>>>> +  /* Do not blindly truncate the data length. */
>>>>> +  SVN_ERR_ASSERT(len < UINT_MAX);
>>>>> +  apr_sha1_update(&ctx->apr_ctx, data, (unsigned int)len);
>>>>>    return SVN_NO_ERROR;
>>>>>  }
>>>>>
>>>>>
>>>>> Just a quick note, it's generally not a good idea to use unchecked
>>>>> typecasts anywhere. In this case there was a borderline chance of
>>>>> Subversion quietly computing invalid hashes. The assert at least removes
>>>>> the "quiet" part, although I suspect the better solution is to call
>>>>> apr_sha1_update in UINT_MAX-sized chunks until all data are consumed.
>>>>
>>>> I agree. It would be great if APR was accepting size as size_t.
>>>> But it's not a big deal. Anyways, I'm pretty sure the issue existed there
>>>> before.
>>>>
>>>> Maybe this function could feed it blocks of UINT_MAX if it exceeds the
>>>> limit. But I don't think we should bother to support such edge cases,
>>>> because if it happens, the issue is on the caller side -- it should not
>>>> allocate 4GB buffers in memory.
>>>>
>>>>
>>>> Is that a fact? Sorry but these days 4 gigs is laughably small in
>>>> certain applications. Our API promises "apr_size_t" chunks which on 64-bit
>>>> platforms means larger than 4 GiB. We can change our API or we can fix the
>>>> bug. It doesn't matter if the bug was there before, we know it's there now.
>>>>
>>>> -- Brane
>>>>
>>>
>>> Okay, it makes sense. I made a patch to support that behaviour. What do
>>> you think?
>>>
>>> [[[
>>> Index: subversion/libsvn_subr/checksum_apr.c
>>> ===================================================================
>>> --- subversion/libsvn_subr/checksum_apr.c      (revision 1931267)
>>> +++ subversion/libsvn_subr/checksum_apr.c      (working copy)
>>> @@ -91,12 +91,19 @@
>>>                     apr_size_t len)
>>>  {
>>>    apr_sha1_ctx_t sha1_ctx;
>>> +  const char *ptr = data;
>>>
>>> -  /* Do not blindly truncate the data length. */
>>> -  SVN_ERR_ASSERT(len < UINT_MAX);
>>> +  apr_sha1_init(&sha1_ctx);
>>>
>>> -  apr_sha1_init(&sha1_ctx);
>>> -  apr_sha1_update(&sha1_ctx, data, (unsigned int)len);
>>> +  while (len > 0)
>>> +    {
>>> +      unsigned int block = (len < UINT_MAX) ? (unsigned int)len
>>> +                                            : UINT_MAX;
>>> +      apr_sha1_update(&sha1_ctx, ptr, block);
>>> +      len -= block;
>>> +      ptr += block;
>>> +    }
>>> +
>>>    apr_sha1_final(digest, &sha1_ctx);
>>>    return SVN_NO_ERROR;
>>>  }
>>> @@ -127,9 +134,17 @@
>>>                                const void *data,
>>>                                apr_size_t len)
>>>  {
>>> -  /* Do not blindly truncate the data length. */
>>> -  SVN_ERR_ASSERT(len < UINT_MAX);
>>> -  apr_sha1_update(&ctx->apr_ctx, data, (unsigned int)len);
>>> +  const char *ptr = data;
>>> +
>>> +  while (len > 0)
>>> +    {
>>> +      unsigned int block = (len < UINT_MAX) ? (unsigned int)len
>>> +                                            : UINT_MAX;
>>> +      apr_sha1_update(&ctx->apr_ctx, ptr, block);
>>> +      len -= block;
>>> +      ptr += block;
>>> +    }
>>> +
>>>    return SVN_NO_ERROR;
>>>  }
>>> ]]]
>>>
>>> I don't really like casting 'void *' to 'char *' but it seems like the
>>> only option.
>>>
>>>
>>> That cast isn't a problem, it already happens implicitly when
>>> apr_sha1_update is called; this is from apr_sha1.h:
>>>
>>> APU_DECLARE(void) apr_sha1_update(apr_sha1_ctx_t *context, const char *input,
>>>                                   unsigned int inputLen);
>>>
>>> Oh, good to know.
>>
>>
>> [...]
>>
>>
>>>
>>> It would be nice, or at least it would satisfy my sense of aesthetics,
>>> if the loop wasn't copy-pasted twice ... I'd make a helper function that
>>> will almost certainly be inlined. As a bonus, that helper function can take
>>> a 'const char*' argument, then you don't need the cast nor a second
>>> variable. Something like this?
>>>
>>> static void update_sha1(apr_sha1_ctx_t *ctx, const char *data, apr_size_t len)
>>> {
>>>   while (len > UINT_MAX)
>>>     {
>>>       apr_sha1_update(ctx, data, UINT_MAX);
>>>       len -= UINT_MAX;
>>>       data += UINT_MAX;
>>>     }
>>>   if (len > 0)
>>>     apr_sha1_update(ctx, data, (unsigned int)len);  /* This cast is safe. */
>>> }
>>>
>>>
>>> then just call update_sha1() with no casts instead of apr_sha1_update().
>>> Nice.
>>>
>>>
>> I thought about making a function. It might be better for readability to
>> have everything inlined rather than adding extra abstractions. But as I
>> think now it is probably worth it.
>> This is like a wrapper around APR which allows for bigger buffers. And
>> again, it removes the risk of copy-paste mistakes.
>>
>> It's also a good argument for relying on the implicit cast to char pointer
>> at the call site.
>>
>> Could you please elaborate on why you are suggesting rewriting the loop in
>> this way? I think it was much simpler before. I would prefer to have a
>> single invocation of the function, because it ensures that, no matter what,
>> we get the same behaviour. Imagine mistyping the parameters in the first
>> call (the one inside the loop), which would almost never be executed, and
>> then trying to figure out where that bug comes from (and all of this would
>> most likely happen while performing some unbounded operation, which is
>> itself a bug).
>>
>>
>> It gets rid of this
>>
>> unsigned int block = (len < UINT_MAX) ? (unsigned int)len : UINT_MAX;
>>
>> happening on every operation, something the compiler can't just ignore or
>> optimise out. It also avoids
>>
>> len -= UINT_MAX;
>> data += UINT_MAX;
>>
>> when it isn't necessary; something that probably would be optimised out.
>>
>
>> *shrug* Compared to the actual SHA-1 calculation, that's trivial. I just
>> find it cleaner to not have to recalculate the block length every time.
>> It's really a matter of taste.
>>
>
>
> I think that the way to address this problem is a more general question.
> It's good to know what the "optimal" version of this while loop would be.
>
> Also, perhaps it was suboptimal to use the ternary here, because there is a
> chance of implicit conversions.
>
> But basically, yeah, it's just a matter of preference. I would prefer the
> one that just takes fewer lines of code / logic branches / assembly after
> optimisation.
>
> By the way, I tested both of those snippets on Compiler Explorer; I'm
> attaching the code I used below. I used this website: [1]. I didn't
> conclude anything special, though.
>
> #include <limits.h>
> #include <stdio.h>
> #include <stdint.h>
>
> /* noinline stand-in for the real APR function, so the loops survive
>    optimisation and can be inspected in the generated assembly. */
> void __attribute__ ((noinline)) apr_sha1_update(const void *data, unsigned int len)
> {
>   printf("%p %u\n", (void *)data, len);
> }
>
> void qq1(const char *data, size_t len)
> {
>   while (len > 0)
>     {
>       unsigned int block = (len <= UINT_MAX) ? (unsigned int)len : UINT_MAX;
>       apr_sha1_update(data, block);
>       len -= block;
>       data += block;
>     }
> }
>
> void qq2(const char *data, size_t len)
> {
>   while (len > UINT_MAX)
>     {
>       apr_sha1_update(data, UINT_MAX);
>       len -= UINT_MAX;
>       data += UINT_MAX;
>     }
>   if (len > 0)
>     apr_sha1_update(data, (unsigned int)len);  /* This cast is safe. */
> }
>
>
>>
>> I see that it removes the ternary operator, but it was good at explaining
>> what this code does: feed it UINT_MAX bytes or whatever amount we have. On
>> the other hand, after your rewrite, two completely separate branches would
>> be executed depending on whether it iterates over UINT_MAX-sized blocks or
>> feeds the remainder.
>>
>>
>> As I said, it's a matter of taste. I find it more obvious what's
>> happening when the while() condition tests against an explicit, constant
>> upper bound.
>>
>>
>>
>>> I'm planning to do a similar thing for the bcrypt backend. A tiny note:
>>> long is usually 32 bits even on 64-bit platforms (on Windows).
>>>
>>>
>>> +1. Yes, I remembered later that ULONG == DWORD == not something I'd use
>>> for object sizes on a 64-bit platform. Same problem as using unsigned int
>>> for the length in APR.
>>>
>>
> I'm going to go ahead and commit it to trunk soon. Thanks for reviewing! I
> really appreciate that!
>
> [1] https://godbolt.org/
>
>
Committed in r1931389. Thanks!

--
Timofei Zhakov
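
For readers who want to try the chunked-update idea from this thread in
isolation, below is a minimal, self-contained sketch. It is not the committed
Subversion code: the helper name update_sha1_chunked, the file name demo.c and
the tiny CHUNK_LIMIT constant are illustrative stand-ins (the real code splits
on UINT_MAX inside svn_checksum__sha1 and svn_checksum__sha1_ctx_update), and
the only real API it assumes is apr-util's <apr_sha1.h>, whose apr_sha1_update
signature is quoted earlier in the thread. The sketch checks that feeding the
data in blocks produces the same digest as a single update call.

[[[
#include <string.h>
#include <stdio.h>
#include <apr_sha1.h>

/* Stand-in for UINT_MAX so the chunking loop is exercised with a small
   test buffer; the real helper would use UINT_MAX and apr_size_t. */
#define CHUNK_LIMIT 7

static void
update_sha1_chunked(apr_sha1_ctx_t *ctx, const char *data, size_t len)
{
  /* Feed full CHUNK_LIMIT-sized blocks first ... */
  while (len > CHUNK_LIMIT)
    {
      apr_sha1_update(ctx, data, CHUNK_LIMIT);
      len -= CHUNK_LIMIT;
      data += CHUNK_LIMIT;
    }
  /* ... then the remainder, which now fits in an unsigned int. */
  if (len > 0)
    apr_sha1_update(ctx, data, (unsigned int)len);
}

int main(void)
{
  static const char input[] = "some data longer than the chunk limit";
  unsigned char one_shot[APR_SHA1_DIGESTSIZE];
  unsigned char chunked[APR_SHA1_DIGESTSIZE];
  apr_sha1_ctx_t ctx;

  /* Hash the whole buffer in a single update call. */
  apr_sha1_init(&ctx);
  apr_sha1_update(&ctx, input, (unsigned int)strlen(input));
  apr_sha1_final(one_shot, &ctx);

  /* Hash it again through the chunked helper. */
  apr_sha1_init(&ctx);
  update_sha1_chunked(&ctx, input, strlen(input));
  apr_sha1_final(chunked, &ctx);

  /* Both digests must match, regardless of how the data was split. */
  printf("digests %s\n",
         memcmp(one_shot, chunked, APR_SHA1_DIGESTSIZE) == 0
           ? "match" : "differ");
  return 0;
}
]]]

Assuming the apr-util development headers are installed, something like
"cc demo.c $(pkg-config --cflags --libs apr-util-1)" should build it; the
exact flags depend on the local APR installation.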

