Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write

2023-08-27 Thread Al Viro
On Thu, Jun 01, 2023 at 04:58:55PM +0200, Christoph Hellwig wrote:
> All callers of generic_perform_write need to updated ki_pos, move it into
> common code.

> @@ -4034,7 +4037,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   endbyte = pos + status - 1;
>   err = filemap_write_and_wait_range(mapping, pos, endbyte);
>   if (err == 0) {
> - iocb->ki_pos = endbyte + 1;
>   written += status;
>   invalidate_mapping_pages(mapping,
>pos >> PAGE_SHIFT,
> @@ -4047,8 +4049,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   }
>   } else {
>   written = generic_perform_write(iocb, from);
> - if (likely(written > 0))
> - iocb->ki_pos += written;
>   }
>  out:
>   return written ? written : err;

[another late reply, sorry]

That part is somewhat fishy - there's a case where you return a positive value
and advance ->ki_pos by more than that amount.  I really wonder if all callers
of ->write_iter() are OK with that.  Consider e.g. this:

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos, *ppos = file_ppos(f.file);
if (ppos) {
pos = *ppos;   
ppos = &pos;
}
ret = vfs_write(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f);
}

return ret;
}

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, 
loff_t *pos)
{
ssize_t ret;

if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
if (unlikely(!access_ok(buf, count)))
return -EFAULT;

ret = rw_verify_area(WRITE, file, pos, count);
if (ret)
return ret;
if (count > MAX_RW_COUNT)
count =  MAX_RW_COUNT;
file_start_write(file);
if (file->f_op->write)
ret = file->f_op->write(file, buf, count, pos);
else if (file->f_op->write_iter)
ret = new_sync_write(file, buf, count, pos);
else   
ret = -EINVAL;
if (ret > 0) {
fsnotify_modify(file);
add_wchar(current, ret);
}
inc_syscw(current);
file_end_write(file);
return ret;
}

static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t 
len, loff_t *ppos)
{
struct kiocb kiocb;
struct iov_iter iter;
ssize_t ret; 

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = (ppos ? *ppos : 0);
iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len);

ret = call_write_iter(filp, &kiocb, &iter);
BUG_ON(ret == -EIOCBQUEUED);
if (ret > 0 && ppos)
*ppos = kiocb.ki_pos;
return ret;
} 

Suppose ->write_iter() ends up doing returning a positive value smaller than
the increment of kiocb.ki_pos.  What do we get?  ret is positive, so
kiocb.ki_pos gets copied into *ppos, which is ksys_write's pos and there
we copy it into file->f_pos.

Is it really OK to have write() return 4096 and advance the file position
by 16K?  AFAICS, userland wouldn't get any indication of something
odd going on - just a short write to a regular file, with followup write
of remaining 12K getting quietly written in the range 16K..28K.

I don't remember what POSIX says about that, but it would qualify as
nasty surprise for any userland program - sure, one can check fsync()
results before closing the sucker and see if everything looks fine,
but the way it's usually discussed could easily lead to assumption that
(synchronous) O_DIRECT writes would not be affected by anything of that
sort.



Re: [Cluster-devel] [PATCH 02/11] xfs: add NOWAIT semantics for readdir

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2643,16 +2643,32 @@ xfs_da_read_buf(
>   struct xfs_buf_map  map, *mapp = ↦
>   int nmap = 1;
>   int error;
> + int buf_flags = 0;
>  
>   *bpp = NULL;
>   error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
>   if (error || !nmap)
>   goto out_free;
>  
> + /*
> +  * NOWAIT semantics mean we don't wait on the buffer lock nor do we
> +  * issue IO for this buffer if it is not already in memory. Caller will
> +  * retry. This will return -EAGAIN if the buffer is in memory and cannot
> +  * be locked, and no buffer and no error if it isn't in memory.  We
> +  * translate both of those into a return state of -EAGAIN and *bpp =
> +  * NULL.
> +  */

I would not include this comment.

> + if (flags & XFS_DABUF_NOWAIT)
> + buf_flags |= XBF_TRYLOCK | XBF_INCORE;
>   error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
>   &bp, ops);

what tsting did you do with this?  Because you don't actually _use_
buf_flags anywhere in this patch (presumably they should be the
sixth argument to xfs_trans_read_buf_map() instead of 0).  So I can only
conclude that either you didn't test, or your testing was inadequate.

>   if (error)
>   goto out_free;
> + if (!bp) {
> + ASSERT(flags & XFS_DABUF_NOWAIT);

I don't think this ASSERT is appropriate.

> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>   bp = NULL;
>   }
>  
> - if (*lock_mode == 0)
> - *lock_mode = xfs_ilock_data_map_shared(dp);
> + if (*lock_mode == 0) {
> + *lock_mode =
> + xfs_ilock_data_map_shared_generic(dp,
> + ctx->flags & DIR_CONTEXT_F_NOWAIT);
> + if (!*lock_mode) {
> + error = -EAGAIN;
> + break;
> + }
> + }

'generic' doesn't seem like a great suffix to mean 'takes nowait flag'.
And this is far too far indented.

xfs_dir2_lock(dp, ctx, lock_mode);

with:

STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx,
unsigned int lock_mode)
{
if (*lock_mode)
return;
if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
return xfs_ilock_data_map_shared_nowait(dp);
return xfs_ilock_data_map_shared(dp);
}

... which I think you can use elsewhere in this patch (reformat it to
XFS coding style, of course).  And then you don't need
xfs_ilock_data_map_shared_generic().



Re: [Cluster-devel] [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:28PM +0800, Hao Xu wrote:
> +++ b/include/linux/file.h
> @@ -81,6 +81,8 @@ static inline void fdput_pos(struct fd f)
>   fdput(f);
>  }
>  
> +extern int file_pos_lock_nowait(struct file *file, bool nowait);

No extern on function declarations.



Re: [Cluster-devel] [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:33PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> To enforce nowait semantics, error out -EAGAIN if atime needs to be
> updated.

Squash this into patch 6.  Otherwise patch 6 makes no sense.



Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> Add a boolean parameter for file_accessed() to support nowait semantics.
> Currently it is true only with io_uring as its initial caller.

So why do we need to do this as part of this series?  Apparently it
hasn't caused any problems for filemap_read().

> +++ b/mm/filemap.c
> @@ -2723,7 +2723,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct 
> iov_iter *iter,
>   folio_batch_init(&fbatch);
>   } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>  
> - file_accessed(filp);
> + file_accessed(filp, false);
>  
>   return already_read ? already_read : error;
>  }
> @@ -2809,7 +2809,7 @@ generic_file_read_iter(struct kiocb *iocb, struct 
> iov_iter *iter)
>   retval = kiocb_write_and_wait(iocb, count);
>   if (retval < 0)
>   return retval;
> - file_accessed(file);
> + file_accessed(file, false);
>  
>   retval = mapping->a_ops->direct_IO(iocb, iter);
>   if (retval >= 0) {
> @@ -2978,7 +2978,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t 
> *ppos,
>  
>  out:
>   folio_batch_release(&fbatch);
> - file_accessed(in);
> + file_accessed(in, false);
>  
>   return total_spliced ? total_spliced : error;
>  }



Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write

2023-08-27 Thread Al Viro
On Sun, Aug 27, 2023 at 08:41:22PM +0100, Al Viro wrote:
> On Thu, Jun 01, 2023 at 04:58:55PM +0200, Christoph Hellwig wrote:
> > All callers of generic_perform_write need to updated ki_pos, move it into
> > common code.
> 
> > @@ -4034,7 +4037,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> > struct iov_iter *from)
> > endbyte = pos + status - 1;
> > err = filemap_write_and_wait_range(mapping, pos, endbyte);
> > if (err == 0) {
> > -   iocb->ki_pos = endbyte + 1;
> > written += status;
> > invalidate_mapping_pages(mapping,
> >  pos >> PAGE_SHIFT,
> > @@ -4047,8 +4049,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> > struct iov_iter *from)
> > }
> > } else {
> > written = generic_perform_write(iocb, from);
> > -   if (likely(written > 0))
> > -   iocb->ki_pos += written;
> > }
> >  out:
> > return written ? written : err;
> 
> [another late reply, sorry]
> 
> That part is somewhat fishy - there's a case where you return a positive value
> and advance ->ki_pos by more than that amount.  I really wonder if all callers
> of ->write_iter() are OK with that.  Consider e.g. this:
> 
> ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret = -EBADF;
> 
> if (f.file) {
> loff_t pos, *ppos = file_ppos(f.file);
> if (ppos) {
> pos = *ppos;   
> ppos = &pos;
> }
> ret = vfs_write(f.file, buf, count, ppos);
> if (ret >= 0 && ppos)
> f.file->f_pos = pos;
> fdput_pos(f);
> }
> 
> return ret;
> }
> 
> ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, 
> loff_t *pos)
> {
> ssize_t ret;
> 
> if (!(file->f_mode & FMODE_WRITE))
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_WRITE))
> return -EINVAL;
> if (unlikely(!access_ok(buf, count)))
> return -EFAULT;
> 
> ret = rw_verify_area(WRITE, file, pos, count);
> if (ret)
> return ret;
> if (count > MAX_RW_COUNT)
> count =  MAX_RW_COUNT;
> file_start_write(file);
> if (file->f_op->write)
> ret = file->f_op->write(file, buf, count, pos);
> else if (file->f_op->write_iter)
> ret = new_sync_write(file, buf, count, pos);
> else   
> ret = -EINVAL;
> if (ret > 0) {
> fsnotify_modify(file);
> add_wchar(current, ret);
> }
> inc_syscw(current);
> file_end_write(file);
> return ret;
> }
> 
> static ssize_t new_sync_write(struct file *filp, const char __user *buf, 
> size_t len, loff_t *ppos)
> {
> struct kiocb kiocb;
> struct iov_iter iter;
> ssize_t ret; 
> 
> init_sync_kiocb(&kiocb, filp);
> kiocb.ki_pos = (ppos ? *ppos : 0);
> iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len);
> 
> ret = call_write_iter(filp, &kiocb, &iter);
> BUG_ON(ret == -EIOCBQUEUED);
> if (ret > 0 && ppos)
> *ppos = kiocb.ki_pos;
> return ret;
> } 
> 
> Suppose ->write_iter() ends up doing returning a positive value smaller than
> the increment of kiocb.ki_pos.  What do we get?  ret is positive, so
> kiocb.ki_pos gets copied into *ppos, which is ksys_write's pos and there
> we copy it into file->f_pos.
> 
> Is it really OK to have write() return 4096 and advance the file position
> by 16K?  AFAICS, userland wouldn't get any indication of something
> odd going on - just a short write to a regular file, with followup write
> of remaining 12K getting quietly written in the range 16K..28K.
> 
> I don't remember what POSIX says about that, but it would qualify as
> nasty surprise for any userland program - sure, one can check fsync()
> results before closing the sucker and see if everything looks fine,
> but the way it's usually discussed could easily lead to assumption that
> (synchronous) O_DIRECT writes would not be affected by anything of that
> sort.

IOW, I suspect that the right thing to do would be something along the lines
of

direct_write_fallback(): on error revert the ->ki_pos update from buffered write

If we fail filemap_write_and_wait_range() on the range the buffered write went
into, we only report the "number of bytes which we direct-written", to quote
the comment in there.  Which is fine, but buffered write has already advanced
iocb->ki_pos, so we need to roll that back.  Otherwise we end up with e.g.
write(2) advancing position by more than the amount it reports having written.

Fixes: 182c25e9c1

Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write

2023-08-27 Thread Al Viro
On Sun, Aug 27, 2023 at 08:41:22PM +0100, Al Viro wrote:

> That part is somewhat fishy - there's a case where you return a positive value
> and advance ->ki_pos by more than that amount.  I really wonder if all callers
> of ->write_iter() are OK with that.

Speaking of which, in case of negative return value we'd better *not* use
->ki_pos; consider e.g. generic_file_write_iter() with O_DSYNC and
vfs_fsync_range() failure.  An error gets returned, but ->ki_pos is left
advanced.  Normal write(2) is fine - it will only update file->f_pos if
->write_iter() has returned a non-negative.  However, io_uring
kiocb_done() starts with
if (req->flags & REQ_F_CUR_POS)
req->file->f_pos = rw->kiocb.ki_pos;
if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
if (!__io_complete_rw_common(req, ret)) {
/*
 * Safe to call io_end from here as we're inline
 * from the submission path.
 */
io_req_io_end(req);
io_req_set_res(req, final_ret,
   io_put_kbuf(req, issue_flags));
return IOU_OK;
}
} else {
io_rw_done(&rw->kiocb, ret);
}
Note that ->f_pos update is *NOT* conditional upon ret >= 0 - it happens
no matter what, provided that original request had ->kiocb.ki_pos equal
to -1 (on a non-FMODE_STREAM file).

Jens, is there any reason for doing that unconditionally?  IMO it's
a bad idea - there's a wide scope for fuckups that way, especially
since write(2) is not sensitive to that and this use of -1 ki_pos
is not particularly encouraged on io_uring side either, AFAICT.
Worse, it's handling of failure exits in the first place, which
already gets little testing...