Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread Pawel Jakub Dawidek
On Mon, Apr 18, 2011 at 11:13:02PM +0300, Kostik Belousov wrote:
> > The need is, as commented, to return EFBIG when the new file size will
> > be larger than the FS supports.  Without this code, passing in
> > something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem
> > out of space.
> Handling max file size and not overflowing the fs are different things.
> VOP_WRITE() will handle file size on its own too. I see no problem with
> exhausting free space if this is what user asked for.

This makes me wonder that current implementation isn't atomic.
If we get out of space error, we won't shrink the file back.
Even if we could shirk it at the end, we won't be able to put holes in
the middle of it.
Not sure if this is a big issue, but one doesn't expect from rename(2)
to create new link and not remove old one.
All in all, making it atomic would be impossible currently for various
reasons (we can't put holes back and we can crash in the middle).

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgpGDsNehES8J.pgp
Description: PGP signature


Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread mdf
2011/4/18 Kostik Belousov :
> On Mon, Apr 18, 2011 at 12:49:38PM -0700, m...@freebsd.org wrote:
>> On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov  
>> wrote:
>> > On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote:
>> >> Author: mdf
>> >> Date: Mon Apr 18 16:32:22 2011
>> >> New Revision: 220791
>> >> URL: http://svn.freebsd.org/changeset/base/220791
>> >>
>> >> Log:
>> >>   Add the posix_fallocate(2) syscall.  The default implementation in
>> >>   vop_stdallocate() is filesystem agnostic and will run as slow as a
>> >>   read/write loop in userspace; however, it serves to correctly
>> >>   implement the functionality for filesystems that do not implement a
>> >>   VOP_ALLOCATE.
>> >>
>> >>   Note that __FreeBSD_version was already bumped today to 900036 for any
>> >>   ports which would like to use this function.
>> >>
>> >>   Also reserve space in the syscall table for posix_fadvise(2).

>> The need is, as commented, to return EFBIG when the new file size will
>> be larger than the FS supports.  Without this code, passing in
>> something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem
>> out of space.
> Handling max file size and not overflowing the fs are different things.
> VOP_WRITE() will handle file size on its own too. I see no problem with
> exhausting free space if this is what user asked for.

This violates the standard, though, since it would return ENOSPC
instead of EFBIG.

Also, this is just the default implementation.  I'm adding a second
VOP_SETATTR in vop_stdallocate() to restore the file size after
extending, which will let the implementation use bzero/VOP_WRITE
instead of VOP_READ/VOP_WRITE when past the original EOF.  This is
more of an issue when calling the vop iteratively, since subsequent
calls don't know what the "correct" file size is.

Cheers,
matthew
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread mdf
On Mon, Apr 18, 2011 at 12:47 PM, Pawel Jakub Dawidek  wrote:
> On Mon, Apr 18, 2011 at 10:28:10PM +0300, Kostik Belousov wrote:
>> > +   if (offset + len > vap->va_size) {
>> > +           VATTR_NULL(vap);
>> > +           vap->va_size = offset + len;
>> > +           error = VOP_SETATTR(vp, vap, td->td_ucred);
>> > +           if (error != 0)
>> > +                   goto out;
>> > +   }
>> I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
>> do auto-extend as needed. Also, see below.
>
> Yeah, also when we extend file size we could skip reading zeros.
>
>> > +           if (offset < vap->va_size) {
> [...]
>> > +           } else {
>> > +                   bzero(buf, cur);
>> > +           }
>> Wouldn't VOP_SETATTR() at the start of the function mostly prevent
>> this bzero from executing ?
>
> Once we drop the vnode lock, the file size can change under us, no?

Yes, which is why the VOP_GETATTR is re-run.  The SETATTR doesn't need
to be re-run, since the purpose was just to check that offset + len is
less than the filesystem's maximum supported file size.

>> I estimated what it would take to do the optimized implementation for UFS,
>> and I think that the following change would allow to lessen the code
>> duplication much.
>>
>> What if the vnode lock drop and looping be handled by the syscall, instead
>> of the vop implementation ? In other words, allow the VOP_ALLOCATE()
>> to  allocate less then requested, and return the allocated amount to
>> the caller. The loop would be centralized then, freeing fs from doing
>> the dance. Also, if fs considers that suitable, it would do a whole
>> allocation in one run.
>
> I'd still go with SEEK_DATA/SEEK_HOLE loop as I suggested on arch@.
> If you would like to spend time on it, having SEEK_DATA/SEEK_HOLE
> support in UFS would be beneficial for other purposes too.

Well, if we had this functionality it could be used.  I'd like the
framework but the filesystems need modification to support it.  We
want it for OneFS at Isilon so eventually when I get to this part of
the project I'll have some wrapper code if someone doesn't get there
first.

Cheers,
matthew
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread mdf
On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov  wrote:
> On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote:
>> Author: mdf
>> Date: Mon Apr 18 16:32:22 2011
>> New Revision: 220791
>> URL: http://svn.freebsd.org/changeset/base/220791
>>
>> Log:
>>   Add the posix_fallocate(2) syscall.  The default implementation in
>>   vop_stdallocate() is filesystem agnostic and will run as slow as a
>>   read/write loop in userspace; however, it serves to correctly
>>   implement the functionality for filesystems that do not implement a
>>   VOP_ALLOCATE.
>>
>>   Note that __FreeBSD_version was already bumped today to 900036 for any
>>   ports which would like to use this function.
>>
>>   Also reserve space in the syscall table for posix_fadvise(2).

>> +#ifdef __notyet__
>> +     /*
>> +      * Check if the filesystem sets f_maxfilesize; if not use
>> +      * VOP_SETATTR to perform the check.
>> +      */
>> +     error = VFS_STATFS(vp->v_mount, &sfs, td);
>> +     if (error != 0)
>> +             goto out;
>> +     if (sfs.f_maxfilesize) {
>> +             if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize ||
>> +                 offset + len > sfs.f_maxfilesize) {
>> +                     error = EFBIG;
>> +                     goto out;
>> +             }
>> +     } else
>> +#endif
>> +     if (offset + len > vap->va_size) {
>> +             VATTR_NULL(vap);
>> +             vap->va_size = offset + len;
>> +             error = VOP_SETATTR(vp, vap, td->td_ucred);
>> +             if (error != 0)
>> +                     goto out;
>> +     }
> I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
> do auto-extend as needed. Also, see below.

The need is, as commented, to return EFBIG when the new file size will
be larger than the FS supports.  Without this code, passing in
something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem
out of space.

>> +
>> +     while (len > 0) {
>> +             if (should_yield()) {
>> +                     VOP_UNLOCK(vp, 0);
>> +                     locked = 0;
>> +                     kern_yield(-1);
> Please note that, despite dropping the vnode lock, the snapshot creation
> is still blocked at this point, due to previous vn_start_write().
>
> If doing vn_finished_write() there, then bwillwrite() before
> next iteration is desired.
>> +                     error = vn_lock(vp, LK_EXCLUSIVE);
>> +                     if (error != 0)
>> +                             break;
>> +                     locked = 1;
>> +                     error = VOP_GETATTR(vp, vap, td->td_ucred);
>> +                     if (error != 0)
>> +                             break;
>> +             }
>> +
>> +             /*
>> +              * Read and write back anything below the nominal file
>> +              * size.  There's currently no way outside the filesystem
>> +              * to know whether this area is sparse or not.
>> +              */
>> +             cur = iosize;
>> +             if ((offset % iosize) != 0)
>> +                     cur -= (offset % iosize);
>> +             if (cur > len)
>> +                     cur = len;
>> +             if (offset < vap->va_size) {
>> +                     aiov.iov_base = buf;
>> +                     aiov.iov_len = cur;
>> +                     auio.uio_iov = &aiov;
>> +                     auio.uio_iovcnt = 1;
>> +                     auio.uio_offset = offset;
>> +                     auio.uio_resid = cur;
>> +                     auio.uio_segflg = UIO_SYSSPACE;
>> +                     auio.uio_rw = UIO_READ;
>> +                     auio.uio_td = td;
>> +                     error = VOP_READ(vp, &auio, 0, td->td_ucred);
>> +                     if (error != 0)
>> +                             break;
>> +                     if (auio.uio_resid > 0) {
>> +                             bzero(buf + cur - auio.uio_resid,
>> +                                 auio.uio_resid);
>> +                     }
>> +             } else {
>> +                     bzero(buf, cur);
>> +             }
> Wouldn't VOP_SETATTR() at the start of the function mostly prevent
> this bzero from executing ?

Yes.  If struct statfs had a member indicating the file system's max
file size, then the extend wouldn't be necessary.  We have that
feature locally, but it's only implemented for ufs and our custom file
system, and it requires an ABI change so it's a bit of work to
upstream.  And as with most of those things, it's hard to find the
time to upstream it outside of work hours.

> I estimated what it would take to do the optimized implementation for UFS,
> and I think that the following change would allow to lessen the code
> duplication much.
>
> What if the vnode lock drop and looping be handled by the syscall, instead
> of the vop implementation ? In other words, allow the VOP_ALLOCATE()
> to  allocate less then requested, and return the allocated amount to
> the caller. The loop would be centralized then, freeing fs fr

Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread Kostik Belousov
On Mon, Apr 18, 2011 at 12:49:38PM -0700, m...@freebsd.org wrote:
> On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov  wrote:
> > On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote:
> >> Author: mdf
> >> Date: Mon Apr 18 16:32:22 2011
> >> New Revision: 220791
> >> URL: http://svn.freebsd.org/changeset/base/220791
> >>
> >> Log:
> >>   Add the posix_fallocate(2) syscall.  The default implementation in
> >>   vop_stdallocate() is filesystem agnostic and will run as slow as a
> >>   read/write loop in userspace; however, it serves to correctly
> >>   implement the functionality for filesystems that do not implement a
> >>   VOP_ALLOCATE.
> >>
> >>   Note that __FreeBSD_version was already bumped today to 900036 for any
> >>   ports which would like to use this function.
> >>
> >>   Also reserve space in the syscall table for posix_fadvise(2).
> 
> >> +#ifdef __notyet__
> >> +     /*
> >> +      * Check if the filesystem sets f_maxfilesize; if not use
> >> +      * VOP_SETATTR to perform the check.
> >> +      */
> >> +     error = VFS_STATFS(vp->v_mount, &sfs, td);
> >> +     if (error != 0)
> >> +             goto out;
> >> +     if (sfs.f_maxfilesize) {
> >> +             if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize ||
> >> +                 offset + len > sfs.f_maxfilesize) {
> >> +                     error = EFBIG;
> >> +                     goto out;
> >> +             }
> >> +     } else
> >> +#endif
> >> +     if (offset + len > vap->va_size) {
> >> +             VATTR_NULL(vap);
> >> +             vap->va_size = offset + len;
> >> +             error = VOP_SETATTR(vp, vap, td->td_ucred);
> >> +             if (error != 0)
> >> +                     goto out;
> >> +     }
> > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
> > do auto-extend as needed. Also, see below.
> 
> The need is, as commented, to return EFBIG when the new file size will
> be larger than the FS supports.  Without this code, passing in
> something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem
> out of space.
Handling max file size and not overflowing the fs are different things.
VOP_WRITE() will handle file size on its own too. I see no problem with
exhausting free space if this is what user asked for.

> 
> >> +
> >> +     while (len > 0) {
> >> +             if (should_yield()) {
> >> +                     VOP_UNLOCK(vp, 0);
> >> +                     locked = 0;
> >> +                     kern_yield(-1);
> > Please note that, despite dropping the vnode lock, the snapshot creation
> > is still blocked at this point, due to previous vn_start_write().
> >
> > If doing vn_finished_write() there, then bwillwrite() before
> > next iteration is desired.
> >> +                     error = vn_lock(vp, LK_EXCLUSIVE);
> >> +                     if (error != 0)
> >> +                             break;
> >> +                     locked = 1;
> >> +                     error = VOP_GETATTR(vp, vap, td->td_ucred);
> >> +                     if (error != 0)
> >> +                             break;
> >> +             }
> >> +
> >> +             /*
> >> +              * Read and write back anything below the nominal file
> >> +              * size.  There's currently no way outside the filesystem
> >> +              * to know whether this area is sparse or not.
> >> +              */
> >> +             cur = iosize;
> >> +             if ((offset % iosize) != 0)
> >> +                     cur -= (offset % iosize);
> >> +             if (cur > len)
> >> +                     cur = len;
> >> +             if (offset < vap->va_size) {
> >> +                     aiov.iov_base = buf;
> >> +                     aiov.iov_len = cur;
> >> +                     auio.uio_iov = &aiov;
> >> +                     auio.uio_iovcnt = 1;
> >> +                     auio.uio_offset = offset;
> >> +                     auio.uio_resid = cur;
> >> +                     auio.uio_segflg = UIO_SYSSPACE;
> >> +                     auio.uio_rw = UIO_READ;
> >> +                     auio.uio_td = td;
> >> +                     error = VOP_READ(vp, &auio, 0, td->td_ucred);
> >> +                     if (error != 0)
> >> +                             break;
> >> +                     if (auio.uio_resid > 0) {
> >> +                             bzero(buf + cur - auio.uio_resid,
> >> +                                 auio.uio_resid);
> >> +                     }
> >> +             } else {
> >> +                     bzero(buf, cur);
> >> +             }
> > Wouldn't VOP_SETATTR() at the start of the function mostly prevent
> > this bzero from executing ?
> 
> Yes.  If struct statfs had a member indicating the file system's max
> file size, then the extend wouldn't be necessary.  We have that
> feature locally, but it's only implemented for ufs and our custom file
> system, and it requires an ABI change so it's a bit of work to
> upstream.  And as with most of those things, it's hard to find the
> time to upstr

Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread Pawel Jakub Dawidek
On Mon, Apr 18, 2011 at 10:28:10PM +0300, Kostik Belousov wrote:
> > +   if (offset + len > vap->va_size) {
> > +   VATTR_NULL(vap);
> > +   vap->va_size = offset + len;
> > +   error = VOP_SETATTR(vp, vap, td->td_ucred);
> > +   if (error != 0)
> > +   goto out;
> > +   }
> I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
> do auto-extend as needed. Also, see below.

Yeah, also when we extend file size we could skip reading zeros.

> > +   if (offset < vap->va_size) {
[...]
> > +   } else {
> > +   bzero(buf, cur);
> > +   }
> Wouldn't VOP_SETATTR() at the start of the function mostly prevent
> this bzero from executing ?

Once we drop the vnode lock, the file size can change under us, no?

> I estimated what it would take to do the optimized implementation for UFS,
> and I think that the following change would allow to lessen the code
> duplication much.
> 
> What if the vnode lock drop and looping be handled by the syscall, instead
> of the vop implementation ? In other words, allow the VOP_ALLOCATE()
> to  allocate less then requested, and return the allocated amount to
> the caller. The loop would be centralized then, freeing fs from doing
> the dance. Also, if fs considers that suitable, it would do a whole
> allocation in one run.

I'd still go with SEEK_DATA/SEEK_HOLE loop as I suggested on arch@.
If you would like to spend time on it, having SEEK_DATA/SEEK_HOLE
support in UFS would be beneficial for other purposes too.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


pgp7uMUMXLSum.pgp
Description: PGP signature


Re: svn commit: r220791 - in head: lib/libc/sys sys/compat/freebsd32 sys/kern sys/sys

2011-04-18 Thread Kostik Belousov
On Mon, Apr 18, 2011 at 04:32:22PM +, Matthew D Fleming wrote:
> Author: mdf
> Date: Mon Apr 18 16:32:22 2011
> New Revision: 220791
> URL: http://svn.freebsd.org/changeset/base/220791
> 
> Log:
>   Add the posix_fallocate(2) syscall.  The default implementation in
>   vop_stdallocate() is filesystem agnostic and will run as slow as a
>   read/write loop in userspace; however, it serves to correctly
>   implement the functionality for filesystems that do not implement a
>   VOP_ALLOCATE.
>   
>   Note that __FreeBSD_version was already bumped today to 900036 for any
>   ports which would like to use this function.
>   
>   Also reserve space in the syscall table for posix_fadvise(2).
>   
>   Reviewed by:-arch (previous version)
Thank you for taking the remarks into consideration.

>  
> +int
> +vop_stdallocate(struct vop_allocate_args *ap)
> +{
> +#ifdef __notyet__
> + struct statfs sfs;
> +#endif
> + struct iovec aiov;
> + struct vattr vattr, *vap;
> + struct uio auio;
> + off_t len, cur, offset;
> + uint8_t *buf;
> + struct thread *td;
> + struct vnode *vp;
> + size_t iosize;
> + int error, locked;
> +
> + buf = NULL;
> + error = 0;
> + locked = 1;
> + td = curthread;
> + vap = &vattr;
> + vp = ap->a_vp;
> + len = ap->a_len;
> + offset = ap->a_offset;
> +
> + error = VOP_GETATTR(vp, vap, td->td_ucred);
> + if (error != 0)
> + goto out;
> + iosize = vap->va_blocksize;
> + if (iosize == 0)
> + iosize = BLKDEV_IOSIZE;
> + if (iosize > MAXPHYS)
> + iosize = MAXPHYS;
> + buf = malloc(iosize, M_TEMP, M_WAITOK);
> +
> +#ifdef __notyet__
> + /*
> +  * Check if the filesystem sets f_maxfilesize; if not use
> +  * VOP_SETATTR to perform the check.
> +  */
> + error = VFS_STATFS(vp->v_mount, &sfs, td);
> + if (error != 0)
> + goto out;
> + if (sfs.f_maxfilesize) {
> + if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize ||
> + offset + len > sfs.f_maxfilesize) {
> + error = EFBIG;
> + goto out;
> + }
> + } else
> +#endif
> + if (offset + len > vap->va_size) {
> + VATTR_NULL(vap);
> + vap->va_size = offset + len;
> + error = VOP_SETATTR(vp, vap, td->td_ucred);
> + if (error != 0)
> + goto out;
> + }
I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will
do auto-extend as needed. Also, see below.

> +
> + while (len > 0) {
> + if (should_yield()) {
> + VOP_UNLOCK(vp, 0);
> + locked = 0;
> + kern_yield(-1);
Please note that, despite dropping the vnode lock, the snapshot creation
is still blocked at this point, due to previous vn_start_write().

If doing vn_finished_write() there, then bwillwrite() before
next iteration is desired.
> + error = vn_lock(vp, LK_EXCLUSIVE);
> + if (error != 0)
> + break;
> + locked = 1;
> + error = VOP_GETATTR(vp, vap, td->td_ucred);
> + if (error != 0)
> + break;
> + }
> +
> + /*
> +  * Read and write back anything below the nominal file
> +  * size.  There's currently no way outside the filesystem
> +  * to know whether this area is sparse or not.
> +  */
> + cur = iosize;
> + if ((offset % iosize) != 0)
> + cur -= (offset % iosize);
> + if (cur > len)
> + cur = len;
> + if (offset < vap->va_size) {
> + aiov.iov_base = buf;
> + aiov.iov_len = cur;
> + auio.uio_iov = &aiov;
> + auio.uio_iovcnt = 1;
> + auio.uio_offset = offset;
> + auio.uio_resid = cur;
> + auio.uio_segflg = UIO_SYSSPACE;
> + auio.uio_rw = UIO_READ;
> + auio.uio_td = td;
> + error = VOP_READ(vp, &auio, 0, td->td_ucred);
> + if (error != 0)
> + break;
> + if (auio.uio_resid > 0) {
> + bzero(buf + cur - auio.uio_resid,
> + auio.uio_resid);
> + }
> + } else {
> + bzero(buf, cur);
> + }
Wouldn't VOP_SETATTR() at the start of the function mostly prevent
this bzero from executing ?

> +
> + aiov.iov_base = buf;
> + aiov.iov_len = cur;
> + auio.uio_iov = &aiov;
> + auio.uio_iovcnt = 1;
> + auio.uio_offset = offset;
> + auio.uio_r