bug#61300: wc -c doesn't advance stdin position when it's a regular file
"wc -c" without filename arguments is meant to read stdin til EOF and report the number of bytes it has read. When stdin is on a regular file, GNU wc has that optimisation whereby it skips the reading, does a pos = lseek(0,0,SEEK_CUR) to find out its current position within the file, fstat(0) and reports st_size - pos (assuming st_size > pos). However, it does not move the position to the end of the file. That means for instance that: $ echo test > file $ { wc -c; wc -c; } < file 5 5 Instead of 5, then 0: $ { wc -c; cat; } < file 5 test So the optimisation is incomplete. It also reports the size of the file even if it could not possibly read it because it's not open in read mode: { wc -c; } 0>> file 5 IMO, it should only do the optimisation if - fcntl(F_GETFL) to check that the file is opened in O_RDONLY or O_RDWR - current checks for /proc /sys-like filesystems - pos > st_size - lseek(0,st_size,SEEK_POS) is successful. (that leaves a race window above where it could move the cursor backward, but I would think that can be ignored as if something else reads at the same time, there's not much we can expect anyway). -- Stephane
bug#61300: wc -c doesn't advance stdin position when it's a regular file
On 05/02/2023 18:27, Stephane Chazelas wrote: "wc -c" without filename arguments is meant to read stdin til EOF and report the number of bytes it has read. When stdin is on a regular file, GNU wc has that optimisation whereby it skips the reading, does a pos = lseek(0,0,SEEK_CUR) to find out its current position within the file, fstat(0) and reports st_size - pos (assuming st_size > pos). However, it does not move the position to the end of the file. That means for instance that: $ echo test > file $ { wc -c; wc -c; } < file 5 5 Instead of 5, then 0: $ { wc -c; cat; } < file 5 test So the optimisation is incomplete. It also reports the size of the file even if it could not possibly read it because it's not open in read mode: { wc -c; } 0>> file 5 IMO, it should only do the optimisation if - fcntl(F_GETFL) to check that the file is opened in O_RDONLY or O_RDWR - current checks for /proc /sys-like filesystems - pos > st_size - lseek(0,st_size,SEEK_POS) is successful. (that leaves a race window above where it could move the cursor backward, but I would think that can be ignored as if something else reads at the same time, there's not much we can expect anyway). Yes I agree. Adjusting would also avoid the following inconsistencies: $ { wc -c; wc -c; } < file 5 5 $ { wc -l; wc -l; } < file 1 0 $ truncate -s $(getconf PAGESIZE) file $ { wc -c; wc -c; } < file 4096 0 Hopefully the attached addresses this. Note it doesn't add the constraint on the input being readable, which I'll think a bit more about. cheers, PádraigFrom 42f72ec424e7eecd6b56c5b6fca5f377ff73795b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= Date: Sun, 5 Feb 2023 19:52:31 + Subject: [PATCH] wc: ensure we update file offset * src/wc.c (wc): Update the offset when not reading, and do read if we can't update the offset. * tests/misc/wc-proc.sh: Add a test case. * NEWS: Mention the bug fix. Fixes https://bugs.gnu.org/61300 --- NEWS | 4 src/wc.c | 5 - tests/misc/wc-proc.sh | 12 3 files changed, 20 insertions(+), 1 deletion(-) diff --git a/NEWS b/NEWS index b3cde4a01..1cea8cc32 100644 --- a/NEWS +++ b/NEWS @@ -57,6 +57,10 @@ GNU coreutils NEWS-*- outline -*- sized files larger than SIZE_MAX. [bug introduced in coreutils-8.24] + `wc -c` will again correctly update the read offset of inputs. + Previously it deduced the size of inputs while leaving the offset unchanged. + [bug introduced in coreutils-8.27] + ** Changes in behavior Programs now support the new Ronna (R), and Quetta (Q) SI prefixes, diff --git a/src/wc.c b/src/wc.c index 5f3ef6eee..de04612e9 100644 --- a/src/wc.c +++ b/src/wc.c @@ -446,7 +446,10 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) beyond the end of the file. As in the example above. */ bytes = end_pos < current_pos ? 0 : end_pos - current_pos; - skip_read = true; + if (bytes && 0 <= lseek (fd, bytes, SEEK_CUR)) +skip_read = true; + else +bytes = 0; } else { diff --git a/tests/misc/wc-proc.sh b/tests/misc/wc-proc.sh index 5eb43b982..2b5026405 100755 --- a/tests/misc/wc-proc.sh +++ b/tests/misc/wc-proc.sh @@ -42,6 +42,18 @@ cat <<\EOF > exp EOF compare exp out || fail=1 +# Ensure we update the offset even when not reading, +# which wasn't the case from coreutils-8.27 to coreutils-9.1 +{ wc -c; wc -c; } < no_read > out || fail=1 +{ wc -c; wc -c; } < do_read >> out || fail=1 +cat <<\EOF > exp +2 +0 +1048576 +0 +EOF +compare exp out || fail=1 + # Ensure we don't read too much when reading, # as was the case on 32 bit systems # from coreutils-8.24 to coreutils-9.1 -- 2.26.2
bug#61300: wc -c doesn't advance stdin position when it's a regular file
On 2023-02-05 11:59, Pádraig Brady wrote: Hopefully the attached addresses this. Thanks for fixing that. Note it doesn't add the constraint on the input being readable, which I'll think a bit more about. Let's leave that as-is, please. If 'wc' can output the correct value without reading its input, POSIX does not require 'wc' to do the read, and it seems perverse to modify 'wc' to go to the effort to refuse to tell the user useful information that the user requested and that 'wc' knows.
bug#61300: wc -c doesn't advance stdin position when it's a regular file
On 2023-02-05 20:59, Paul Eggert wrote: On 2023-02-05 11:59, Pádraig Brady wrote: [...] Let's leave that as-is, please. If 'wc' can output the correct value without reading its input, POSIX does not require 'wc' to do the read, and it seems perverse to modify 'wc' to go to the effort to refuse to tell the user useful information that the user requested and that 'wc' knows. [...] But while I would agree it's very unlikely to ever be hit in practice, as I can't think of any reason why one would call wc with its input not input for reading, wc is meant to report how many bytes it has read, not the size of its input (though POSIX seems ambiguous on that). See also (with Pádraig's patch applied): $ { echo test > file; wc -c; echo test2 >&0; cat file; } 0> file 5 test test2 wc has lseek()ed to the end of the file even though it was opened in write-only mode. Compare with: $ { echo test > file; wc -lc; echo test2 >&0; cat file; } 0> file wc: 'standard input': Bad file descriptor 0 0 test2 -- Stephane
bug#61300: wc -c doesn't advance stdin position when it's a regular file
On 06/02/2023 06:27, Stephane Chazelas wrote: On 2023-02-05 20:59, Paul Eggert wrote: On 2023-02-05 11:59, Pádraig Brady wrote: [...] Let's leave that as-is, please. If 'wc' can output the correct value without reading its input, POSIX does not require 'wc' to do the read, and it seems perverse to modify 'wc' to go to the effort to refuse to tell the user useful information that the user requested and that 'wc' knows. [...] But while I would agree it's very unlikely to ever be hit in practice, as I can't think of any reason why one would call wc with its input not input for reading, wc is meant to report how many bytes it has read, not the size of its input (though POSIX seems ambiguous on that). See also (with Pádraig's patch applied): $ { echo test > file; wc -c; echo test2 >&0; cat file; } 0> file 5 test test2 wc has lseek()ed to the end of the file even though it was opened in write-only mode. Compare with: $ { echo test > file; wc -lc; echo test2 >&0; cat file; } 0> file wc: 'standard input': Bad file descriptor 0 0 test2 Some more thoughts on this. Note the orig thread with motivation for the st_size optimization is at: https://lists.gnu.org/archive/html/coreutils/2016-03/msg00020.html Note also wc -c has had an st_size optimization for all sizes since the very first coreutils implementation. A similar edge case to Stehpane's above is also seen when doing the lseek(near_end)+read() method, as shown by: ${ truncate -s 32768 file; wc -c; wc -c; } 0> file wc: 'standard input': Bad file descriptor 28679 wc: 'standard input': Bad file descriptor 0 One possible solution is avoid the above issue is: start_pos=lseek(0,SEEK_CUR); bytes += lseek(near_end) while (read()) { if (did_lseek && read error == EBADF|EINVAL) lseek(start_pos); did_lseek=false; bytes=0; continue; } That would also fix an issue I saw for one file in /sys, where: /sys/devices/pci:00/:00:02.0/rom st_size = 131072, available bytes = 0, wc -c = 127007 (EINVAL) Doing that method for all file sizes rather than just using st_size, would work but also penalize perf for the common case. Consider cached stats on a network file system for example. So I guess in addition to be able to keep the st_size optimization with stdin, consistent with other cases we could verify/restrict to readable also. Note this is only an issue for stdin. Files specified on the command line and explicitly opened, should get a permission error at that stage. Note also if you really want to read, you can always `cat | wc -c` rather than just `wc -c`, so I'm still not sure we should add the readable restriction for stdin, but I'm not very against it at least since it is such an edge case. cheers, Pádraig
bug#61300: wc -c doesn't advance stdin position when it's a regular file
On 2/6/23 11:38, Pádraig Brady wrote: Note also if you really want to read, you can always `cat | wc -c` rather than just `wc -c` Even that's not guaranteed, as 'cat' is not required to use the 'read' system call if it can determine that the standard input contains only NULs without calling 'read'. (GNU 'cat' doesn't do this, but POSIX allows it.) We shouldn't complicate 'wc' (thus slowing it down and worse, possibly introducing a bug) if the only goal is to make 'wc' fail more often in implausible scenarios.