Wow, please get this in!!!

This fixes cvs update on hard disks, to go much much faster. When I am
updating the entire set of cvs trees: www, src, xenocara, ports, I can
still use firefox and have it perfectly usable. There's a night and
day improvement, before and after. Thanks for debugging and fixing
this.

amit

On Wed, Jun 7, 2017 at 12:29 PM, Mike Belopuhov <m...@belopuhov.com> wrote:
> Hi,
>
> I've discovered that short reads (nonzero b_resid) aren't
> handled very well in our kernel and I've proposed a diff
> like this to handle short reads of buffercache read-ahead
> buffers:
>
> diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
> index 95bc80bc0e6..1cc1943d752 100644
> --- sys/kern/vfs_bio.c
> +++ sys/kern/vfs_bio.c
> @@ -534,11 +534,27 @@ bread_cluster_callback(struct buf *bp)
>                  */
>                 buf_fix_mapping(bp, newsize);
>                 bp->b_bcount = newsize;
>         }
>
> -       for (i = 1; xbpp[i] != 0; i++) {
> +       /* Invalidate read-ahead buffers if read short */
> +       if (bp->b_resid > 0) {
> +               for (i = 0; xbpp[i] != NULL; i++)
> +                       continue;
> +               for (i = i - 1; i != 0; i--) {
> +                       if (xbpp[i]->b_bufsize <= bp->b_resid) {
> +                               bp->b_resid -= xbpp[i]->b_bufsize;
> +                               SET(xbpp[i]->b_flags, B_INVAL);
> +                       } else if (bp->b_resid > 0) {
> +                               bp->b_resid = 0;
> +                               SET(xbpp[i]->b_flags, B_INVAL);
> +                       } else
> +                               break;
> +               }
> +       }
> +
> +       for (i = 1; xbpp[i] != NULL; i++) {
>                 if (ISSET(bp->b_flags, B_ERROR))
>                         SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
>                 biodone(xbpp[i]);
>         }
>
>
> Now I said before that the only issue that this diff didn't
> fix was with the xbpp[0] aka the buf we return to FFS: if we
> have a 64k sized cluster on our filesystem then we've never
> created read-ahead bufs and thus this code never runs and we
> never account for the b_resid.  However, this is thankfully
> not correct as FFS handles short reads itself (except one
> small detail...). Here's a chunk from ffs_read:
>
>         if (lblktosize(fs, nextlbn) >= DIP(ip, size))
>                 error = bread(vp, lbn, size, &bp);
>         else
>                 error = bread_cluster(vp, lbn, size, &bp);
>
>         if (error)
>                 break;
>
>         /*
>          * We should only get non-zero b_resid when an I/O error
>          * has occurred, which should cause us to break above.
>          * However, if the short read did not cause an error,
>          * then we want to ensure that we do not uiomove bad
>          * or uninitialized data.
>          */
>         size -= bp->b_resid;
>         if (size < xfersize) {
>                 if (size == 0)
>                         break;
>                 xfersize = size;
>         }
>         error = uiomove(bp->b_data + blkoffset, xfersize, uio);
>
> As you can see it copies (size - bp->b_resid) into the uio.
> That would be OK if b_resid was as large as the 'size'. But
> due to how bread_cluster extends the b_count to cover for
> all additional read-ahead buffers, the transfer in the end
> can have a b_resid anywhere in the interval of [0, MAXPHYS]
> which can be larger than 'size' that FFS has asked for.
>
> This leads to 'size' underflow because it's an integer and
> then uiomove gets a negative value for xfersize which gets
> converted to a very large unsigned long (size_t) parameter
> for uiomove. And this is bad.  Therefore, additionally I'd
> like to assert this in the FFS code itself.  If this is the
> way to go, I'll look into other filesystems and propose a
> similar check.
>
> diff --git sys/ufs/ffs/ffs_vnops.c sys/ufs/ffs/ffs_vnops.c
> index 160e187820f..56c222612a2 100644
> --- sys/ufs/ffs/ffs_vnops.c
> +++ sys/ufs/ffs/ffs_vnops.c
> @@ -244,10 +244,11 @@ ffs_read(void *v)
>                  * has occurred, which should cause us to break above.
>                  * However, if the short read did not cause an error,
>                  * then we want to ensure that we do not uiomove bad
>                  * or uninitialized data.
>                  */
> +               KASSERT(bp->b_resid <= size);
>                 size -= bp->b_resid;
>                 if (size < xfersize) {
>                         if (size == 0)
>                                 break;
>                         xfersize = size;
>
>
> So to make it clear: I'd like to commit both changes and
> if that's something we agree upon, I'll look into other
> filesystems and make sure that they implement similar
> assertions.
>
> Opinions?
>

Reply via email to