Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write
On Thu, Jun 01, 2023 at 04:58:55PM +0200, Christoph Hellwig wrote: > All callers of generic_perform_write need to updated ki_pos, move it into > common code. > @@ -4034,7 +4037,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, > struct iov_iter *from) > endbyte = pos + status - 1; > err = filemap_write_and_wait_range(mapping, pos, endbyte); > if (err == 0) { > - iocb->ki_pos = endbyte + 1; > written += status; > invalidate_mapping_pages(mapping, >pos >> PAGE_SHIFT, > @@ -4047,8 +4049,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, > struct iov_iter *from) > } > } else { > written = generic_perform_write(iocb, from); > - if (likely(written > 0)) > - iocb->ki_pos += written; > } > out: > return written ? written : err; [another late reply, sorry] That part is somewhat fishy - there's a case where you return a positive value and advance ->ki_pos by more than that amount. I really wonder if all callers of ->write_iter() are OK with that. Consider e.g. this: ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; if (f.file) { loff_t pos, *ppos = file_ppos(f.file); if (ppos) { pos = *ppos; ppos = &pos; } ret = vfs_write(f.file, buf, count, ppos); if (ret >= 0 && ppos) f.file->f_pos = pos; fdput_pos(f); } return ret; } ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) { ssize_t ret; if (!(file->f_mode & FMODE_WRITE)) return -EBADF; if (!(file->f_mode & FMODE_CAN_WRITE)) return -EINVAL; if (unlikely(!access_ok(buf, count))) return -EFAULT; ret = rw_verify_area(WRITE, file, pos, count); if (ret) return ret; if (count > MAX_RW_COUNT) count = MAX_RW_COUNT; file_start_write(file); if (file->f_op->write) ret = file->f_op->write(file, buf, count, pos); else if (file->f_op->write_iter) ret = new_sync_write(file, buf, count, pos); else ret = -EINVAL; if (ret > 0) { fsnotify_modify(file); add_wchar(current, ret); } inc_syscw(current); file_end_write(file); return ret; } static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos) { struct kiocb kiocb; struct iov_iter iter; ssize_t ret; init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = (ppos ? *ppos : 0); iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len); ret = call_write_iter(filp, &kiocb, &iter); BUG_ON(ret == -EIOCBQUEUED); if (ret > 0 && ppos) *ppos = kiocb.ki_pos; return ret; } Suppose ->write_iter() ends up doing returning a positive value smaller than the increment of kiocb.ki_pos. What do we get? ret is positive, so kiocb.ki_pos gets copied into *ppos, which is ksys_write's pos and there we copy it into file->f_pos. Is it really OK to have write() return 4096 and advance the file position by 16K? AFAICS, userland wouldn't get any indication of something odd going on - just a short write to a regular file, with followup write of remaining 12K getting quietly written in the range 16K..28K. I don't remember what POSIX says about that, but it would qualify as nasty surprise for any userland program - sure, one can check fsync() results before closing the sucker and see if everything looks fine, but the way it's usually discussed could easily lead to assumption that (synchronous) O_DIRECT writes would not be affected by anything of that sort.
Re: [Cluster-devel] [PATCH 02/11] xfs: add NOWAIT semantics for readdir
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote: > +++ b/fs/xfs/libxfs/xfs_da_btree.c > @@ -2643,16 +2643,32 @@ xfs_da_read_buf( > struct xfs_buf_map map, *mapp = ↦ > int nmap = 1; > int error; > + int buf_flags = 0; > > *bpp = NULL; > error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap); > if (error || !nmap) > goto out_free; > > + /* > + * NOWAIT semantics mean we don't wait on the buffer lock nor do we > + * issue IO for this buffer if it is not already in memory. Caller will > + * retry. This will return -EAGAIN if the buffer is in memory and cannot > + * be locked, and no buffer and no error if it isn't in memory. We > + * translate both of those into a return state of -EAGAIN and *bpp = > + * NULL. > + */ I would not include this comment. > + if (flags & XFS_DABUF_NOWAIT) > + buf_flags |= XBF_TRYLOCK | XBF_INCORE; > error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0, > &bp, ops); what tsting did you do with this? Because you don't actually _use_ buf_flags anywhere in this patch (presumably they should be the sixth argument to xfs_trans_read_buf_map() instead of 0). So I can only conclude that either you didn't test, or your testing was inadequate. > if (error) > goto out_free; > + if (!bp) { > + ASSERT(flags & XFS_DABUF_NOWAIT); I don't think this ASSERT is appropriate. > @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents( > bp = NULL; > } > > - if (*lock_mode == 0) > - *lock_mode = xfs_ilock_data_map_shared(dp); > + if (*lock_mode == 0) { > + *lock_mode = > + xfs_ilock_data_map_shared_generic(dp, > + ctx->flags & DIR_CONTEXT_F_NOWAIT); > + if (!*lock_mode) { > + error = -EAGAIN; > + break; > + } > + } 'generic' doesn't seem like a great suffix to mean 'takes nowait flag'. And this is far too far indented. xfs_dir2_lock(dp, ctx, lock_mode); with: STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx, unsigned int lock_mode) { if (*lock_mode) return; if (ctx->flags & DIR_CONTEXT_F_NOWAIT) return xfs_ilock_data_map_shared_nowait(dp); return xfs_ilock_data_map_shared(dp); } ... which I think you can use elsewhere in this patch (reformat it to XFS coding style, of course). And then you don't need xfs_ilock_data_map_shared_generic().
Re: [Cluster-devel] [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock
On Sun, Aug 27, 2023 at 09:28:28PM +0800, Hao Xu wrote: > +++ b/include/linux/file.h > @@ -81,6 +81,8 @@ static inline void fdput_pos(struct fd f) > fdput(f); > } > > +extern int file_pos_lock_nowait(struct file *file, bool nowait); No extern on function declarations.
Re: [Cluster-devel] [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated
On Sun, Aug 27, 2023 at 09:28:33PM +0800, Hao Xu wrote: > From: Hao Xu > > To enforce nowait semantics, error out -EAGAIN if atime needs to be > updated. Squash this into patch 6. Otherwise patch 6 makes no sense.
Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote: > From: Hao Xu > > Add a boolean parameter for file_accessed() to support nowait semantics. > Currently it is true only with io_uring as its initial caller. So why do we need to do this as part of this series? Apparently it hasn't caused any problems for filemap_read(). > +++ b/mm/filemap.c > @@ -2723,7 +2723,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct > iov_iter *iter, > folio_batch_init(&fbatch); > } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); > > - file_accessed(filp); > + file_accessed(filp, false); > > return already_read ? already_read : error; > } > @@ -2809,7 +2809,7 @@ generic_file_read_iter(struct kiocb *iocb, struct > iov_iter *iter) > retval = kiocb_write_and_wait(iocb, count); > if (retval < 0) > return retval; > - file_accessed(file); > + file_accessed(file, false); > > retval = mapping->a_ops->direct_IO(iocb, iter); > if (retval >= 0) { > @@ -2978,7 +2978,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t > *ppos, > > out: > folio_batch_release(&fbatch); > - file_accessed(in); > + file_accessed(in, false); > > return total_spliced ? total_spliced : error; > }
Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write
On Sun, Aug 27, 2023 at 08:41:22PM +0100, Al Viro wrote: > On Thu, Jun 01, 2023 at 04:58:55PM +0200, Christoph Hellwig wrote: > > All callers of generic_perform_write need to updated ki_pos, move it into > > common code. > > > @@ -4034,7 +4037,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, > > struct iov_iter *from) > > endbyte = pos + status - 1; > > err = filemap_write_and_wait_range(mapping, pos, endbyte); > > if (err == 0) { > > - iocb->ki_pos = endbyte + 1; > > written += status; > > invalidate_mapping_pages(mapping, > > pos >> PAGE_SHIFT, > > @@ -4047,8 +4049,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, > > struct iov_iter *from) > > } > > } else { > > written = generic_perform_write(iocb, from); > > - if (likely(written > 0)) > > - iocb->ki_pos += written; > > } > > out: > > return written ? written : err; > > [another late reply, sorry] > > That part is somewhat fishy - there's a case where you return a positive value > and advance ->ki_pos by more than that amount. I really wonder if all callers > of ->write_iter() are OK with that. Consider e.g. this: > > ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) > { > struct fd f = fdget_pos(fd); > ssize_t ret = -EBADF; > > if (f.file) { > loff_t pos, *ppos = file_ppos(f.file); > if (ppos) { > pos = *ppos; > ppos = &pos; > } > ret = vfs_write(f.file, buf, count, ppos); > if (ret >= 0 && ppos) > f.file->f_pos = pos; > fdput_pos(f); > } > > return ret; > } > > ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, > loff_t *pos) > { > ssize_t ret; > > if (!(file->f_mode & FMODE_WRITE)) > return -EBADF; > if (!(file->f_mode & FMODE_CAN_WRITE)) > return -EINVAL; > if (unlikely(!access_ok(buf, count))) > return -EFAULT; > > ret = rw_verify_area(WRITE, file, pos, count); > if (ret) > return ret; > if (count > MAX_RW_COUNT) > count = MAX_RW_COUNT; > file_start_write(file); > if (file->f_op->write) > ret = file->f_op->write(file, buf, count, pos); > else if (file->f_op->write_iter) > ret = new_sync_write(file, buf, count, pos); > else > ret = -EINVAL; > if (ret > 0) { > fsnotify_modify(file); > add_wchar(current, ret); > } > inc_syscw(current); > file_end_write(file); > return ret; > } > > static ssize_t new_sync_write(struct file *filp, const char __user *buf, > size_t len, loff_t *ppos) > { > struct kiocb kiocb; > struct iov_iter iter; > ssize_t ret; > > init_sync_kiocb(&kiocb, filp); > kiocb.ki_pos = (ppos ? *ppos : 0); > iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len); > > ret = call_write_iter(filp, &kiocb, &iter); > BUG_ON(ret == -EIOCBQUEUED); > if (ret > 0 && ppos) > *ppos = kiocb.ki_pos; > return ret; > } > > Suppose ->write_iter() ends up doing returning a positive value smaller than > the increment of kiocb.ki_pos. What do we get? ret is positive, so > kiocb.ki_pos gets copied into *ppos, which is ksys_write's pos and there > we copy it into file->f_pos. > > Is it really OK to have write() return 4096 and advance the file position > by 16K? AFAICS, userland wouldn't get any indication of something > odd going on - just a short write to a regular file, with followup write > of remaining 12K getting quietly written in the range 16K..28K. > > I don't remember what POSIX says about that, but it would qualify as > nasty surprise for any userland program - sure, one can check fsync() > results before closing the sucker and see if everything looks fine, > but the way it's usually discussed could easily lead to assumption that > (synchronous) O_DIRECT writes would not be affected by anything of that > sort. IOW, I suspect that the right thing to do would be something along the lines of direct_write_fallback(): on error revert the ->ki_pos update from buffered write If we fail filemap_write_and_wait_range() on the range the buffered write went into, we only report the "number of bytes which we direct-written", to quote the comment in there. Which is fine, but buffered write has already advanced iocb->ki_pos, so we need to roll that back. Otherwise we end up with e.g. write(2) advancing position by more than the amount it reports having written. Fixes: 182c25e9c1
Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write
On Sun, Aug 27, 2023 at 08:41:22PM +0100, Al Viro wrote: > That part is somewhat fishy - there's a case where you return a positive value > and advance ->ki_pos by more than that amount. I really wonder if all callers > of ->write_iter() are OK with that. Speaking of which, in case of negative return value we'd better *not* use ->ki_pos; consider e.g. generic_file_write_iter() with O_DSYNC and vfs_fsync_range() failure. An error gets returned, but ->ki_pos is left advanced. Normal write(2) is fine - it will only update file->f_pos if ->write_iter() has returned a non-negative. However, io_uring kiocb_done() starts with if (req->flags & REQ_F_CUR_POS) req->file->f_pos = rw->kiocb.ki_pos; if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) { if (!__io_complete_rw_common(req, ret)) { /* * Safe to call io_end from here as we're inline * from the submission path. */ io_req_io_end(req); io_req_set_res(req, final_ret, io_put_kbuf(req, issue_flags)); return IOU_OK; } } else { io_rw_done(&rw->kiocb, ret); } Note that ->f_pos update is *NOT* conditional upon ret >= 0 - it happens no matter what, provided that original request had ->kiocb.ki_pos equal to -1 (on a non-FMODE_STREAM file). Jens, is there any reason for doing that unconditionally? IMO it's a bad idea - there's a wide scope for fuckups that way, especially since write(2) is not sensitive to that and this use of -1 ki_pos is not particularly encouraged on io_uring side either, AFAICT. Worse, it's handling of failure exits in the first place, which already gets little testing...