Wow, please get this in!!! This fixes cvs update on hard disks, to go much much faster. When I am updating the entire set of cvs trees: www, src, xenocara, ports, I can still use firefox and have it perfectly usable. There's a night and day improvement, before and after. Thanks for debugging and fixing this.
amit On Wed, Jun 7, 2017 at 12:29 PM, Mike Belopuhov <[email protected]> wrote: > Hi, > > I've discovered that short reads (nonzero b_resid) aren't > handled very well in our kernel and I've proposed a diff > like this to handle short reads of buffercache read-ahead > buffers: > > diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c > index 95bc80bc0e6..1cc1943d752 100644 > --- sys/kern/vfs_bio.c > +++ sys/kern/vfs_bio.c > @@ -534,11 +534,27 @@ bread_cluster_callback(struct buf *bp) > */ > buf_fix_mapping(bp, newsize); > bp->b_bcount = newsize; > } > > - for (i = 1; xbpp[i] != 0; i++) { > + /* Invalidate read-ahead buffers if read short */ > + if (bp->b_resid > 0) { > + for (i = 0; xbpp[i] != NULL; i++) > + continue; > + for (i = i - 1; i != 0; i--) { > + if (xbpp[i]->b_bufsize <= bp->b_resid) { > + bp->b_resid -= xbpp[i]->b_bufsize; > + SET(xbpp[i]->b_flags, B_INVAL); > + } else if (bp->b_resid > 0) { > + bp->b_resid = 0; > + SET(xbpp[i]->b_flags, B_INVAL); > + } else > + break; > + } > + } > + > + for (i = 1; xbpp[i] != NULL; i++) { > if (ISSET(bp->b_flags, B_ERROR)) > SET(xbpp[i]->b_flags, B_INVAL | B_ERROR); > biodone(xbpp[i]); > } > > > Now I said before that the only issue that this diff didn't > fix was with the xbpp[0] aka the buf we return to FFS: if we > have a 64k sized cluster on our filesystem then we've never > created read-ahead bufs and thus this code never runs and we > never account for the b_resid. However, this is thankfully > not correct as FFS handles short reads itself (except one > small detail...). Here's a chunk from ffs_read: > > if (lblktosize(fs, nextlbn) >= DIP(ip, size)) > error = bread(vp, lbn, size, &bp); > else > error = bread_cluster(vp, lbn, size, &bp); > > if (error) > break; > > /* > * We should only get non-zero b_resid when an I/O error > * has occurred, which should cause us to break above. > * However, if the short read did not cause an error, > * then we want to ensure that we do not uiomove bad > * or uninitialized data. > */ > size -= bp->b_resid; > if (size < xfersize) { > if (size == 0) > break; > xfersize = size; > } > error = uiomove(bp->b_data + blkoffset, xfersize, uio); > > As you can see it copies (size - bp->b_resid) into the uio. > That would be OK if b_resid was as large as the 'size'. But > due to how bread_cluster extends the b_count to cover for > all additional read-ahead buffers, the transfer in the end > can have a b_resid anywhere in the interval of [0, MAXPHYS] > which can be larger than 'size' that FFS has asked for. > > This leads to 'size' underflow because it's an integer and > then uiomove gets a negative value for xfersize which gets > converted to a very large unsigned long (size_t) parameter > for uiomove. And this is bad. Therefore, additionally I'd > like to assert this in the FFS code itself. If this is the > way to go, I'll look into other filesystems and propose a > similar check. > > diff --git sys/ufs/ffs/ffs_vnops.c sys/ufs/ffs/ffs_vnops.c > index 160e187820f..56c222612a2 100644 > --- sys/ufs/ffs/ffs_vnops.c > +++ sys/ufs/ffs/ffs_vnops.c > @@ -244,10 +244,11 @@ ffs_read(void *v) > * has occurred, which should cause us to break above. > * However, if the short read did not cause an error, > * then we want to ensure that we do not uiomove bad > * or uninitialized data. > */ > + KASSERT(bp->b_resid <= size); > size -= bp->b_resid; > if (size < xfersize) { > if (size == 0) > break; > xfersize = size; > > > So to make it clear: I'd like to commit both changes and > if that's something we agree upon, I'll look into other > filesystems and make sure that they implement similar > assertions. > > Opinions? >
