Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

        In fact, we *used* to (long ago) pass in the address of "file->f_pos"
        itself to the low-level read/write routines. We then changed it to do
        that indirection through a local copy of pos (and
        file_pos_read/file_pos_write) because we didn't do the proper locking,
        so different read/write versions could mess with each other (and with
        lseek).

        But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
        accesses as per POSIX") did was to add the proper locking at least for
        the cases that we care about deeply, so we *could* say that we have
        three cases:

         - FMODE_ATOMIC_POS: properly locked,
         - FMODE_STREAM: no pos at all
         - otherwise a "mostly don't care - don't mix!"

        and so we could go back to not copying the pos at all, and instead do
        something like

                loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : 
&file->f_pos;
                ret = vfs_write(f.file, buf, count, ppos);

        and perhaps have a long-term plan to try to get rid of the "don't mix"
        case entirely (ie "if you use f_pos, then we'll do the proper
        locking")

        (The above is obviously surrounded by the fdget_pos()/fdput_pos() that
        implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] 
https://lore.kernel.org/linux-fsdevel/CAHk-=whjtzt52snhbgrnmnuxfn3ge9x_e02x8bpxtkqrfyz...@mail.gmail.com/

Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Kirill Smelkov <k...@nexedi.com>
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, 
size_t count)
        ssize_t ret = -EBADF;
 
        if (f.file) {
-               loff_t pos, *ppos = file_ppos(f.file);
-               if (ppos) {
-                       pos = *ppos;
-                       ppos = &pos;
-               }
-               ret = vfs_read(f.file, buf, count, ppos);
-               if (ret >= 0 && ppos)
-                       f.file->f_pos = pos;
+               ret = vfs_read(f.file, buf, count, file_ppos(f.file));
                fdput_pos(f);
        }
        return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user 
*buf, size_t count)
        ssize_t ret = -EBADF;
 
        if (f.file) {
-               loff_t pos, *ppos = file_ppos(f.file);
-               if (ppos) {
-                       pos = *ppos;
-                       ppos = &pos;
-               }
-               ret = vfs_write(f.file, buf, count, ppos);
-               if (ret >= 0 && ppos)
-                       f.file->f_pos = pos;
+               ret = vfs_write(f.file, buf, count, file_ppos(f.file));
                fdput_pos(f);
        }
 
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct 
iovec __user *vec,
        ssize_t ret = -EBADF;
 
        if (f.file) {
-               loff_t pos, *ppos = file_ppos(f.file);
-               if (ppos) {
-                       pos = *ppos;
-                       ppos = &pos;
-               }
-               ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-               if (ret >= 0 && ppos)
-                       f.file->f_pos = pos;
+               ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
                fdput_pos(f);
        }
 
@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct 
iovec __user *vec,
        ssize_t ret = -EBADF;
 
        if (f.file) {
-               loff_t pos, *ppos = file_ppos(f.file);
-               if (ppos) {
-                       pos = *ppos;
-                       ppos = &pos;
-               }
-               ret = vfs_writev(f.file, vec, vlen, ppos, flags);
-               if (ret >= 0 && ppos)
-                       f.file->f_pos = pos;
+               ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
                fdput_pos(f);
        }
 
-- 
2.21.0.593.g511ec345e1

Reply via email to