Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
On Mon, Apr 18, 2011 at 11:13:02PM +0300, Kostik Belousov wrote: > > The need is, as commented, to return EFBIG when the new file size will > > be larger than the FS supports. Without this code, passing in > > something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem > > out of space. > Handling max file size and not overflowing the fs are different things. > VOP_WRITE() will handle file size on its own too. I see no problem with > exhausting free space if this is what user asked for. This makes me wonder that current implementation isn't atomic. If we get out of space error, we won't shrink the file back. Even if we could shirk it at the end, we won't be able to put holes in the middle of it. Not sure if this is a big issue, but one doesn't expect from rename(2) to create new link and not remove old one. All in all, making it atomic would be impossible currently for various reasons (we can't put holes back and we can crash in the middle). -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! http://yomoli.com pgpGDsNehES8J.pgp Description: PGP signature
Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
2011/4/18 Kostik Belousov : > On Mon, Apr 18, 2011 at 12:49:38PM -0700, m...@freebsd.org wrote: >> On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov >> wrote: >> > On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote: >> >> Author: mdf >> >> Date: Mon Apr 18 16:32:22 2011 >> >> New Revision: 220791 >> >> URL: http://svn.freebsd.org/changeset/base/220791 >> >> >> >> Log: >> >> Add the posix_fallocate(2) syscall. The default implementation in >> >> vop_stdallocate() is filesystem agnostic and will run as slow as a >> >> read/write loop in userspace; however, it serves to correctly >> >> implement the functionality for filesystems that do not implement a >> >> VOP_ALLOCATE. >> >> >> >> Note that __FreeBSD_version was already bumped today to 900036 for any >> >> ports which would like to use this function. >> >> >> >> Also reserve space in the syscall table for posix_fadvise(2). >> The need is, as commented, to return EFBIG when the new file size will >> be larger than the FS supports. Without this code, passing in >> something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem >> out of space. > Handling max file size and not overflowing the fs are different things. > VOP_WRITE() will handle file size on its own too. I see no problem with > exhausting free space if this is what user asked for. This violates the standard, though, since it would return ENOSPC instead of EFBIG. Also, this is just the default implementation. I'm adding a second VOP_SETATTR in vop_stdallocate() to restore the file size after extending, which will let the implementation use bzero/VOP_WRITE instead of VOP_READ/VOP_WRITE when past the original EOF. This is more of an issue when calling the vop iteratively, since subsequent calls don't know what the "correct" file size is. Cheers, matthew ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
On Mon, Apr 18, 2011 at 12:47 PM, Pawel Jakub Dawidek wrote: > On Mon, Apr 18, 2011 at 10:28:10PM +0300, Kostik Belousov wrote: >> > + if (offset + len > vap->va_size) { >> > + VATTR_NULL(vap); >> > + vap->va_size = offset + len; >> > + error = VOP_SETATTR(vp, vap, td->td_ucred); >> > + if (error != 0) >> > + goto out; >> > + } >> I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will >> do auto-extend as needed. Also, see below. > > Yeah, also when we extend file size we could skip reading zeros. > >> > + if (offset < vap->va_size) { > [...] >> > + } else { >> > + bzero(buf, cur); >> > + } >> Wouldn't VOP_SETATTR() at the start of the function mostly prevent >> this bzero from executing ? > > Once we drop the vnode lock, the file size can change under us, no? Yes, which is why the VOP_GETATTR is re-run. The SETATTR doesn't need to be re-run, since the purpose was just to check that offset + len is less than the filesystem's maximum supported file size. >> I estimated what it would take to do the optimized implementation for UFS, >> and I think that the following change would allow to lessen the code >> duplication much. >> >> What if the vnode lock drop and looping be handled by the syscall, instead >> of the vop implementation ? In other words, allow the VOP_ALLOCATE() >> to allocate less then requested, and return the allocated amount to >> the caller. The loop would be centralized then, freeing fs from doing >> the dance. Also, if fs considers that suitable, it would do a whole >> allocation in one run. > > I'd still go with SEEK_DATA/SEEK_HOLE loop as I suggested on arch@. > If you would like to spend time on it, having SEEK_DATA/SEEK_HOLE > support in UFS would be beneficial for other purposes too. Well, if we had this functionality it could be used. I'd like the framework but the filesystems need modification to support it. We want it for OneFS at Isilon so eventually when I get to this part of the project I'll have some wrapper code if someone doesn't get there first. Cheers, matthew ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov wrote: > On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote: >> Author: mdf >> Date: Mon Apr 18 16:32:22 2011 >> New Revision: 220791 >> URL: http://svn.freebsd.org/changeset/base/220791 >> >> Log: >> Add the posix_fallocate(2) syscall. The default implementation in >> vop_stdallocate() is filesystem agnostic and will run as slow as a >> read/write loop in userspace; however, it serves to correctly >> implement the functionality for filesystems that do not implement a >> VOP_ALLOCATE. >> >> Note that __FreeBSD_version was already bumped today to 900036 for any >> ports which would like to use this function. >> >> Also reserve space in the syscall table for posix_fadvise(2). >> +#ifdef __notyet__ >> + /* >> + * Check if the filesystem sets f_maxfilesize; if not use >> + * VOP_SETATTR to perform the check. >> + */ >> + error = VFS_STATFS(vp->v_mount, &sfs, td); >> + if (error != 0) >> + goto out; >> + if (sfs.f_maxfilesize) { >> + if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize || >> + offset + len > sfs.f_maxfilesize) { >> + error = EFBIG; >> + goto out; >> + } >> + } else >> +#endif >> + if (offset + len > vap->va_size) { >> + VATTR_NULL(vap); >> + vap->va_size = offset + len; >> + error = VOP_SETATTR(vp, vap, td->td_ucred); >> + if (error != 0) >> + goto out; >> + } > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will > do auto-extend as needed. Also, see below. The need is, as commented, to return EFBIG when the new file size will be larger than the FS supports. Without this code, passing in something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem out of space. >> + >> + while (len > 0) { >> + if (should_yield()) { >> + VOP_UNLOCK(vp, 0); >> + locked = 0; >> + kern_yield(-1); > Please note that, despite dropping the vnode lock, the snapshot creation > is still blocked at this point, due to previous vn_start_write(). > > If doing vn_finished_write() there, then bwillwrite() before > next iteration is desired. >> + error = vn_lock(vp, LK_EXCLUSIVE); >> + if (error != 0) >> + break; >> + locked = 1; >> + error = VOP_GETATTR(vp, vap, td->td_ucred); >> + if (error != 0) >> + break; >> + } >> + >> + /* >> + * Read and write back anything below the nominal file >> + * size. There's currently no way outside the filesystem >> + * to know whether this area is sparse or not. >> + */ >> + cur = iosize; >> + if ((offset % iosize) != 0) >> + cur -= (offset % iosize); >> + if (cur > len) >> + cur = len; >> + if (offset < vap->va_size) { >> + aiov.iov_base = buf; >> + aiov.iov_len = cur; >> + auio.uio_iov = &aiov; >> + auio.uio_iovcnt = 1; >> + auio.uio_offset = offset; >> + auio.uio_resid = cur; >> + auio.uio_segflg = UIO_SYSSPACE; >> + auio.uio_rw = UIO_READ; >> + auio.uio_td = td; >> + error = VOP_READ(vp, &auio, 0, td->td_ucred); >> + if (error != 0) >> + break; >> + if (auio.uio_resid > 0) { >> + bzero(buf + cur - auio.uio_resid, >> + auio.uio_resid); >> + } >> + } else { >> + bzero(buf, cur); >> + } > Wouldn't VOP_SETATTR() at the start of the function mostly prevent > this bzero from executing ? Yes. If struct statfs had a member indicating the file system's max file size, then the extend wouldn't be necessary. We have that feature locally, but it's only implemented for ufs and our custom file system, and it requires an ABI change so it's a bit of work to upstream. And as with most of those things, it's hard to find the time to upstream it outside of work hours. > I estimated what it would take to do the optimized implementation for UFS, > and I think that the following change would allow to lessen the code > duplication much. > > What if the vnode lock drop and looping be handled by the syscall, instead > of the vop implementation ? In other words, allow the VOP_ALLOCATE() > to allocate less then requested, and return the allocated amount to > the caller. The loop would be centralized then, freeing fs fr
Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
On Mon, Apr 18, 2011 at 12:49:38PM -0700, m...@freebsd.org wrote: > On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov wrote: > > On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote: > >> Author: mdf > >> Date: Mon Apr 18 16:32:22 2011 > >> New Revision: 220791 > >> URL: http://svn.freebsd.org/changeset/base/220791 > >> > >> Log: > >> Add the posix_fallocate(2) syscall. The default implementation in > >> vop_stdallocate() is filesystem agnostic and will run as slow as a > >> read/write loop in userspace; however, it serves to correctly > >> implement the functionality for filesystems that do not implement a > >> VOP_ALLOCATE. > >> > >> Note that __FreeBSD_version was already bumped today to 900036 for any > >> ports which would like to use this function. > >> > >> Also reserve space in the syscall table for posix_fadvise(2). > > >> +#ifdef __notyet__ > >> + /* > >> + * Check if the filesystem sets f_maxfilesize; if not use > >> + * VOP_SETATTR to perform the check. > >> + */ > >> + error = VFS_STATFS(vp->v_mount, &sfs, td); > >> + if (error != 0) > >> + goto out; > >> + if (sfs.f_maxfilesize) { > >> + if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize || > >> + offset + len > sfs.f_maxfilesize) { > >> + error = EFBIG; > >> + goto out; > >> + } > >> + } else > >> +#endif > >> + if (offset + len > vap->va_size) { > >> + VATTR_NULL(vap); > >> + vap->va_size = offset + len; > >> + error = VOP_SETATTR(vp, vap, td->td_ucred); > >> + if (error != 0) > >> + goto out; > >> + } > > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will > > do auto-extend as needed. Also, see below. > > The need is, as commented, to return EFBIG when the new file size will > be larger than the FS supports. Without this code, passing in > something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem > out of space. Handling max file size and not overflowing the fs are different things. VOP_WRITE() will handle file size on its own too. I see no problem with exhausting free space if this is what user asked for. > > >> + > >> + while (len > 0) { > >> + if (should_yield()) { > >> + VOP_UNLOCK(vp, 0); > >> + locked = 0; > >> + kern_yield(-1); > > Please note that, despite dropping the vnode lock, the snapshot creation > > is still blocked at this point, due to previous vn_start_write(). > > > > If doing vn_finished_write() there, then bwillwrite() before > > next iteration is desired. > >> + error = vn_lock(vp, LK_EXCLUSIVE); > >> + if (error != 0) > >> + break; > >> + locked = 1; > >> + error = VOP_GETATTR(vp, vap, td->td_ucred); > >> + if (error != 0) > >> + break; > >> + } > >> + > >> + /* > >> + * Read and write back anything below the nominal file > >> + * size. There's currently no way outside the filesystem > >> + * to know whether this area is sparse or not. > >> + */ > >> + cur = iosize; > >> + if ((offset % iosize) != 0) > >> + cur -= (offset % iosize); > >> + if (cur > len) > >> + cur = len; > >> + if (offset < vap->va_size) { > >> + aiov.iov_base = buf; > >> + aiov.iov_len = cur; > >> + auio.uio_iov = &aiov; > >> + auio.uio_iovcnt = 1; > >> + auio.uio_offset = offset; > >> + auio.uio_resid = cur; > >> + auio.uio_segflg = UIO_SYSSPACE; > >> + auio.uio_rw = UIO_READ; > >> + auio.uio_td = td; > >> + error = VOP_READ(vp, &auio, 0, td->td_ucred); > >> + if (error != 0) > >> + break; > >> + if (auio.uio_resid > 0) { > >> + bzero(buf + cur - auio.uio_resid, > >> + auio.uio_resid); > >> + } > >> + } else { > >> + bzero(buf, cur); > >> + } > > Wouldn't VOP_SETATTR() at the start of the function mostly prevent > > this bzero from executing ? > > Yes. If struct statfs had a member indicating the file system's max > file size, then the extend wouldn't be necessary. We have that > feature locally, but it's only implemented for ufs and our custom file > system, and it requires an ABI change so it's a bit of work to > upstream. And as with most of those things, it's hard to find the > time to upstr
Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
On Mon, Apr 18, 2011 at 10:28:10PM +0300, Kostik Belousov wrote: > > + if (offset + len > vap->va_size) { > > + VATTR_NULL(vap); > > + vap->va_size = offset + len; > > + error = VOP_SETATTR(vp, vap, td->td_ucred); > > + if (error != 0) > > + goto out; > > + } > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will > do auto-extend as needed. Also, see below. Yeah, also when we extend file size we could skip reading zeros. > > + if (offset < vap->va_size) { [...] > > + } else { > > + bzero(buf, cur); > > + } > Wouldn't VOP_SETATTR() at the start of the function mostly prevent > this bzero from executing ? Once we drop the vnode lock, the file size can change under us, no? > I estimated what it would take to do the optimized implementation for UFS, > and I think that the following change would allow to lessen the code > duplication much. > > What if the vnode lock drop and looping be handled by the syscall, instead > of the vop implementation ? In other words, allow the VOP_ALLOCATE() > to allocate less then requested, and return the allocated amount to > the caller. The loop would be centralized then, freeing fs from doing > the dance. Also, if fs considers that suitable, it would do a whole > allocation in one run. I'd still go with SEEK_DATA/SEEK_HOLE loop as I suggested on arch@. If you would like to spend time on it, having SEEK_DATA/SEEK_HOLE support in UFS would be beneficial for other purposes too. -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! http://yomoli.com pgp7uMUMXLSum.pgp Description: PGP signature
Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys
On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote: > Author: mdf > Date: Mon Apr 18 16:32:22 2011 > New Revision: 220791 > URL: http://svn.freebsd.org/changeset/base/220791 > > Log: > Add the posix_fallocate(2) syscall. The default implementation in > vop_stdallocate() is filesystem agnostic and will run as slow as a > read/write loop in userspace; however, it serves to correctly > implement the functionality for filesystems that do not implement a > VOP_ALLOCATE. > > Note that __FreeBSD_version was already bumped today to 900036 for any > ports which would like to use this function. > > Also reserve space in the syscall table for posix_fadvise(2). > > Reviewed by:-arch (previous version) Thank you for taking the remarks into consideration. > > +int > +vop_stdallocate(struct vop_allocate_args *ap) > +{ > +#ifdef __notyet__ > + struct statfs sfs; > +#endif > + struct iovec aiov; > + struct vattr vattr, *vap; > + struct uio auio; > + off_t len, cur, offset; > + uint8_t *buf; > + struct thread *td; > + struct vnode *vp; > + size_t iosize; > + int error, locked; > + > + buf = NULL; > + error = 0; > + locked = 1; > + td = curthread; > + vap = &vattr; > + vp = ap->a_vp; > + len = ap->a_len; > + offset = ap->a_offset; > + > + error = VOP_GETATTR(vp, vap, td->td_ucred); > + if (error != 0) > + goto out; > + iosize = vap->va_blocksize; > + if (iosize == 0) > + iosize = BLKDEV_IOSIZE; > + if (iosize > MAXPHYS) > + iosize = MAXPHYS; > + buf = malloc(iosize, M_TEMP, M_WAITOK); > + > +#ifdef __notyet__ > + /* > + * Check if the filesystem sets f_maxfilesize; if not use > + * VOP_SETATTR to perform the check. > + */ > + error = VFS_STATFS(vp->v_mount, &sfs, td); > + if (error != 0) > + goto out; > + if (sfs.f_maxfilesize) { > + if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize || > + offset + len > sfs.f_maxfilesize) { > + error = EFBIG; > + goto out; > + } > + } else > +#endif > + if (offset + len > vap->va_size) { > + VATTR_NULL(vap); > + vap->va_size = offset + len; > + error = VOP_SETATTR(vp, vap, td->td_ucred); > + if (error != 0) > + goto out; > + } I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will do auto-extend as needed. Also, see below. > + > + while (len > 0) { > + if (should_yield()) { > + VOP_UNLOCK(vp, 0); > + locked = 0; > + kern_yield(-1); Please note that, despite dropping the vnode lock, the snapshot creation is still blocked at this point, due to previous vn_start_write(). If doing vn_finished_write() there, then bwillwrite() before next iteration is desired. > + error = vn_lock(vp, LK_EXCLUSIVE); > + if (error != 0) > + break; > + locked = 1; > + error = VOP_GETATTR(vp, vap, td->td_ucred); > + if (error != 0) > + break; > + } > + > + /* > + * Read and write back anything below the nominal file > + * size. There's currently no way outside the filesystem > + * to know whether this area is sparse or not. > + */ > + cur = iosize; > + if ((offset % iosize) != 0) > + cur -= (offset % iosize); > + if (cur > len) > + cur = len; > + if (offset < vap->va_size) { > + aiov.iov_base = buf; > + aiov.iov_len = cur; > + auio.uio_iov = &aiov; > + auio.uio_iovcnt = 1; > + auio.uio_offset = offset; > + auio.uio_resid = cur; > + auio.uio_segflg = UIO_SYSSPACE; > + auio.uio_rw = UIO_READ; > + auio.uio_td = td; > + error = VOP_READ(vp, &auio, 0, td->td_ucred); > + if (error != 0) > + break; > + if (auio.uio_resid > 0) { > + bzero(buf + cur - auio.uio_resid, > + auio.uio_resid); > + } > + } else { > + bzero(buf, cur); > + } Wouldn't VOP_SETATTR() at the start of the function mostly prevent this bzero from executing ? > + > + aiov.iov_base = buf; > + aiov.iov_len = cur; > + auio.uio_iov = &aiov; > + auio.uio_iovcnt = 1; > + auio.uio_offset = offset; > + auio.uio_r