Re: Modify buffering of standard streams via environment variables (not LD_PRELOAD)?

2024-04-20 Thread Carl Edquist via GNU coreutils General Discussion

On Thu, 18 Apr 2024, Zachary Santer wrote:


On Wed, Mar 20, 2024 at 4:54 AM Carl Edquist  wrote:


However, if stdbuf's magic env vars are exported in your shell (either 
by doing a trick like 'export $(env -i stdbuf -oL env)', or else more 
simply by first starting a new shell with 'stdbuf -oL bash'), then 
every command in your pipelines will start with the new default 
line-buffered stdout. That way your line-items from build.sh should get 
passed all the way through the pipeline as they are produced.


Finally had a chance to try to build with 'stdbuf --output=L --error=L 
--' in front of the build script, and it caused some crazy problems.


For what it's worth, when I was trying that out on msys2 (since that's what 
you said you were using), I also ran into some very weird errors when just 
trying to export LD_PRELOAD and _STDBUF_O to what 'stdbuf -oL' sets.  It was 
weird because I didn't see issues when just running a command (including 
bash) directly under stdbuf.  I didn't get to the bottom of it, though, and 
I don't have access to a Windows laptop anymore to experiment.


Also I might ask, why are you setting "--error=L" ?

Not that this is the problem you're seeing, but in any case stderr is 
unbuffered by default, and you might mess up the output a bit by line 
buffering it, if it's expecting to output partial lines for progress or 
whatever.


Carl


Re: RFE: enable buffering on null-terminated data

2024-03-20 Thread Carl Edquist via GNU coreutils General Discussion



On Tue, 19 Mar 2024, Zachary Santer wrote:


On Tue, Mar 19, 2024 at 1:24 AM Kaz Kylheku  wrote:


But what tee does is set up _IONBF on its output streams,
including stdout.


So it doesn't buffer at all. Awesome. Nevermind.


Yay!  :D

And since tee uses fwrite to copy whatever input is available, that will 
mean 'records' are output on the same boundaries as the input (whether 
that be newlines, nuls, or just block boundaries).  So putting tee in the 
middle of a pipeline shouldn't itself interfere with whatever else you're 
up to.  (AND it's still relatively efficient, compared to some tools like 
cut that putchar a byte at a time.)
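(Schematically, the copy loop amounts to something like the following - a 
simplified sketch for a single output, not tee's actual source:)

  #include <stdio.h>
  #include <unistd.h>

  /* Copy stdin to stdout, preserving input chunk boundaries.
     With stdout unbuffered, each fwrite(3) maps to one write(2), so
     records go out on the same boundaries they came in on.  */
  int
  main (void)
  {
    char buf[BUFSIZ];
    ssize_t n;
    setvbuf (stdout, NULL, _IONBF, 0);
    while ((n = read (STDIN_FILENO, buf, sizeof buf)) > 0)
      fwrite (buf, 1, n, stdout);
    return 0;
  }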


My note about pipelines like this though:

$ ./build.sh | sed s/what/ever/ | tee build.log

is that with the default stdio buffering, while all the commands in 
build.sh will be implicitly self-flushing, the sed in the middle will end 
up batching its output into blocks, so tee will also repeat them in 
blocks.


However, if stdbuf's magic env vars are exported in your shell (either by 
doing a trick like 'export $(env -i stdbuf -oL env)', or else more simply 
by first starting a new shell with 'stdbuf -oL bash'), then every command 
in your pipelines will start with the new default line-buffered stdout. 
That way your line-items from build.sh should get passed all the way 
through the pipeline as they are produced.



(But, proof's in the pudding, so whatever works for you :D )


Happy putting all the way!

Carl


Re: RFE: enable buffering on null-terminated data

2024-03-14 Thread Carl Edquist via GNU coreutils General Discussion



On Mon, 11 Mar 2024, Zachary Santer wrote:

On Mon, Mar 11, 2024 at 7:54 AM Carl Edquist  
wrote:


(In my coprocess management library, I effectively run every coproc 
with --output=L by default, by eval'ing the output of 'env -i stdbuf 
-oL env', because most of the time for a coprocess, that's what's 
wanted/necessary.)


Surrounded by 'set -a' and 'set +a', I guess? Now that's interesting.


Ah, no - I use the 'VAR=VAL command line' syntax so that it's specific to 
the command (it's not left exported to the shell).


Effectively the coprocess commands are run with

LD_PRELOAD=... _STDBUF_O=L command line

This allows running shell functions for the command line, which will all 
get the desired stdbuf behavior - because you can't pass a shell function 
(within the context of the current shell) as the command to stdbuf.


As far as I can tell, the stdbuf tool sets LD_PRELOAD (to point to 
libstdbuf.so) and your custom buffering options in _STDBUF_{I,O,E}, in the 
environment for the program it runs.  The double-env thing there is just a 
way to cleanly get exactly the env vars that stdbuf sets.  The values 
don't change, but since they are an implementation detail of stdbuf, it's 
a bit more portable to grab them this way rather than hard-code them. 
This is done only once per shell session to extract the values and save 
them to a private variable, and then they are used for the command line 
as shown above.
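Condensed, the setup is along these lines ('stdbuf_env' and 'my_filter' 
are just illustrative names here, and this assumes the values contain no 
whitespace):

  # once per shell session: capture the env vars stdbuf sets
  # (LD_PRELOAD=... and _STDBUF_O=L), without hard-coding them
  stdbuf_env=$(env -i stdbuf -oL env | tr '\n' ' ')

  # per command line, using the VAR=VAL prefix syntax - this works
  # even when my_filter is a shell function
  eval "$stdbuf_env my_filter"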


Of course, if "command line" starts with "stdbuf --output=0" or whatever, 
that will override the new line-buffered default.



You can definitely export it to your shell though, either with 'set -a' 
like you said, or with the export command.  After that everything you run 
should get line-buffered stdio by default.



I just added that to a script I have that prints lines output by another 
command that it runs (generally a build script) to the command line, 
updating the same line over and over again. I want to see if it updates 
more continuously like that.


So, a lot of times build scripts run a bunch of individual commands. 
Each of those commands has an implied flush when it terminates, so you 
will get the output from each of them promptly (as each command 
completes), even without using stdbuf.


Where things get sloppy is if you add some stuff in a pipeline after your 
build script, which results in things getting block-buffered along the 
way:


$ ./build.sh | sed s/what/ever/ | tee build.log

And there you will definitely see a difference.


sloppy () {
    for x in {1..10}; do sleep .2; echo $x; done |
        sed s/^/:::/ | cat
}

{
    echo before:
    sloppy
    echo

    export $(env -i stdbuf -oL env)

    echo after:
    sloppy
}

Yeah, there's really no way to break what I'm doing into a standard 
pipeline.


I admit I'm curious what you're up to  :)


Of course, using line-buffered or unbuffered output in this situation 
makes no sense. Where it might be useful in a pipeline is when an 
earlier command in a pipeline might only print things occasionally, and 
you want those things transformed and printed to the command line 
immediately.


Right ... And in that case, losing the performance benefit of a larger 
block buffer is a smaller price to pay.


My assumption is that line-buffering through setbuf(3) was implemented 
for printing to the command line, so its availability to stdbuf(1) is 
just a useful side effect.


Right, stdbuf(1) leverages setbuf(3).

setbuf(3) tweaks the buffering behavior of stdio streams (stdin, stdout, 
stderr, and anything else you open with, eg, fopen(3)).  It's not really 
limited to terminal applications, but yeah it makes it easier to ensure 
that your calls to printf(3) actually get output after each line (whether 
that's to a file or a pipe or a tty), without having to call an explicit 
fflush(3) of stdout every time.


stdbuf(1) sets LD_PRELOAD to libstdbuf.so for your program, causing it to 
call setbuf(3) at program startup based on the values of _STDBUF_* in the 
environment (which stdbuf(1) also sets).


(That's my read of it anyway.)
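(For illustration, a stripped-down sketch of what I understand the 
preloaded library to boil down to, for the stdout case only - not the 
actual libstdbuf.so source:)

  #include <stdio.h>
  #include <stdlib.h>

  /* Runs at program startup, because of LD_PRELOAD.  */
  __attribute__ ((constructor))
  static void
  apply_stdbuf_settings (void)
  {
    const char *mode = getenv ("_STDBUF_O");
    if (!mode)
      return;
    if (*mode == 'L')            /* stdbuf -oL */
      setvbuf (stdout, NULL, _IOLBF, 0);
    else if (*mode == '0')       /* stdbuf -o0 */
      setvbuf (stdout, NULL, _IONBF, 0);
  }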

In the BUGS section in the man page for stdbuf(1), we see: On GLIBC 
platforms, specifying a buffer size, i.e., using fully buffered mode 
will result in undefined operation.


Eheh xD

Oh, I imagine "undefined operation" means something more like 
"unspecified" here.  stdbuf(1) uses setbuf(3), so the behavior you'll get 
should be whatever the setbuf(3) from the libc on your system does.


I think all this means is that the C/POSIX standards are a bit loose about 
what is required of setbuf(3) when a buffer size is specified, and there 
is room in the standard for it to be interpreted as only a hint.


If I'm not mistaken, then buffer modes other than 0 and L don't actually 
work. Maybe I should count my blessings here. I don't know what's going 
on in 

Re: RFE: enable buffering on null-terminated data

2024-03-11 Thread Carl Edquist via GNU coreutils General Discussion

On Sun, 10 Mar 2024, Zachary Santer wrote:


On Sun, Mar 10, 2024 at 4:36 PM Carl Edquist  wrote:


Out of curiosity, do you have an example command line for your use case?


My use for 'stdbuf --output=L' is to be able to run a command within a
bash coprocess.


Oh, cool, now you're talking!  ;)


(Really, a background process communicating with the parent process 
through FIFOs, since Bash prints a warning message if you try to run 
more than one coprocess at a time. Shouldn't make a difference here.)


(Kind of a side-note ... bash's limited coprocess handling was a 
long-standing annoyance for me in the past, to the point that I wrote a 
bash coprocess management library to handle multiple active coprocesses 
and give convenient methods for interaction.  Perhaps the trickiest bit 
about multiple coprocesses open at once (which I suspect is the reason 
support was never added to bash) is that you don't want the second and 
subsequent coprocesses to inherit the pipe fds of prior open coprocesses. 
This can result in deadlock if, for instance, you close your write end to 
coproc1, but coproc1 continues to wait for input because coproc2 also has 
a copy of a write end of the pipe to coproc1's input.  So you need to be 
smart about having each subsequent coprocess first close all fds 
associated with other coprocesses.


Word to the wise: you might encounter this issue (coproc2 prevents coproc1 
from seeing its end-of-input) even though you are rigging this up yourself 
with FIFOs rather than bash's coproc builtin.)
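(A minimal way to see the problem with the coproc builtin itself, bash's 
multiple-coproc warning aside:)

  coproc C1 { cat; }
  in1=${C1[1]} out1=${C1[0]}    # save C1's fds

  coproc C2 { cat; }            # bash warns; C2 inherited C1's pipe fds

  exec {in1}>&-                 # close our write end to C1's stdin
  read -r -u "$out1" line       # hangs!  C1's cat never sees EOF,
                                # because C2 still holds a write end

Without the second coproc, that read returns immediately with EOF.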




See coproc-buffering, attached.


Thanks!

Without making the command's output either line-buffered or unbuffered, 
what I'm doing there would deadlock. I feed one line in and then expect 
to be able to read a transformed line immediately. If that transformed 
line is stuck in a buffer that's still waiting to be filled, then 
nothing happens.


I swear doing this actually makes sense in my application.


Yeah makes sense!  I am familiar with the problem you're describing.

(In my coprocess management library, I effectively run every coproc with 
--output=L by default, by eval'ing the output of 'env -i stdbuf -oL env', 
because most of the time for a coprocess, that's what's wanted/necessary.)



... Although, for your example coprocess use, where the shell both 
produces the input for the coproc and consumes its output, you might be 
able to simplify things by making the producer and consumer separate 
processes.  Then you could do a simpler 'producer | filter | consumer' 
without having to worry about buffering at all.  But if the producer and 
consumer need to be in the same process (eg they share state and are 
logically interdependent), then yeah that's where you need a coprocess for 
the filter.


... On the other hand, if the issue is that someone is producing one line 
at a time _interactively_ (that is, inputting text or commands from a 
terminal), then you might argue that the performance hit for unbuffered 
output will be insignificant compared to time spent waiting for terminal 
input.




$ ./coproc-buffering 10
Line-buffered:
real    0m17.795s
user    0m6.234s
sys     0m11.469s
Unbuffered:
real    0m21.656s
user    0m6.609s
sys     0m14.906s


Yeah, this makes sense in your particular example.

It looks like expand(1) uses putchar(3), so in unbuffered mode this 
translates to one write(2) call for every byte.  sed(1) is not quite as 
bad - in unbuffered mode it appears to output the line and the newline 
terminator separately, so two write(2) calls for every line.


So in both cases (but especially for expand), line buffering reduces the 
number of write(2) calls.


(Although given your time output, you might say the performance hit for 
unbuffered is not that huge.)



When I initially implemented this thing, I felt lucky that the data I 
was passing in were lines ending in newlines, and not null-terminated, 
since my script gets to benefit from 'stdbuf --output=L'.


:thumbsup:



Truth be told, I don't currently have a need for --output=N.


Mmm-hmm  :)


Of course, sed and all sorts of other Linux command-line tools can 
produce or handle null-terminated data.


Definitely.  So in the general case, theoretically it seems as useful to 
buffer output on nul bytes.


Note that for GNU sed in particular, there is a -u/--unbuffered option, 
which will effectively give you line-buffered output, including buffering 
on nul bytes with -z/--null-data.
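For example, this gives one write per nul-terminated record (the tr is 
just there to make the output printable):

  $ printf 'dir1/a\0dir2/b\0' | sed -zu 's|/.*||' | tr '\0' '\n'
  dir1
  dir2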


... I'll be honest though, I am having trouble imagining a realistic 
pipeline that filters filenames with embedded newlines using expand(1) 
;)


...

But, I want to be a good sport here and contrive an actual use case.

So for fun, say I want to use cut(1) (which performs poorly when 
unbuffered) in a coprocess that takes null-terminated file paths on input 
and outputs the first directory component (which possibly contains 
embedded newlines).


The basic command in the coprocess would be:

cut -d/ -f1 -z

but with the default 

Re: RFE: enable buffering on null-terminated data

2024-03-10 Thread Carl Edquist via GNU coreutils General Discussion

Hi Zack,

This sounds like a potentially useful feature (it'd probably belong with a 
corresponding new buffer mode in setbuf(3)) ...


Filenames should be passed between utilities in a null-terminated 
fashion, because the null byte is the only byte that can't appear within 
one.


Out of curiosity, do you have an example command line for your use case?

If I want to buffer output data on null bytes, the closest I can get is 
'stdbuf --output=0', which doesn't buffer at all. This is pretty 
inefficient.


I'm just thinking that find(1), for instance, will end up calling write(2) 
exactly once per filename (-print or -print0) if run under stdbuf 
unbuffered, which is the same as you'd get with a corresponding stdbuf 
line-buffered mode (newline or null-terminated).


It seems that where line buffering improves performance over unbuffered is 
when there are several calls to (for example) printf(3) in constructing a 
single line.  find(1), and some filters like grep(1), will write a line at 
a time in unbuffered mode, and thus don't seem to benefit at all from line 
buffering.  On the other hand, cut(1) appears to putchar(3) a byte at a 
time, which in unbuffered mode will (like you say) be pretty inefficient.


So, depending on your use case, a new null-terminated line buffered option 
may or may not actually improve efficiency over unbuffered mode.



You can run your commands under strace like

stdbuf --output=X  strace -c -ewrite  command ... | ...

to count the number of actual writes for each buffering mode.


Carl


PS, "find -printf" recognizes a '\c' escape to flush the output, in case 
that helps.  So "find -printf '%p\0\c'" would, for instance, already 
behave the same as "stdbuf --output=N  find -print0" with the new stdbuf 
output mode you're suggesting.


(Though again, this doesn't actually seem to be any more efficient than 
running "stdbuf --output=0  find -print0")


On Sun, 10 Mar 2024, Zachary Santer wrote:


Was "stdbuf feature request - line buffering but for null-terminated data"

See below.

On Sun, Mar 10, 2024 at 5:38 AM Pádraig Brady  wrote:


On 09/03/2024 16:30, Zachary Santer wrote:

'stdbuf --output=L' will line-buffer the command's output stream.
Pretty useful, but that's looking for newlines. Filenames should be
passed between utilities in a null-terminated fashion, because the
null byte is the only byte that can't appear within one.

If I want to buffer output data on null bytes, the closest I can get
is 'stdbuf --output=0', which doesn't buffer at all. This is pretty
inefficient.

0 means unbuffered, and Z is already taken for, I guess, zebibytes.
--output=N, then?

Would this require a change to libc implementations, or is it possible now?


This does seem like useful functionality,
but it would require support for libc implementations first.

cheers,
Pádraig





Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-13 Thread Carl Edquist via GNU coreutils General Discussion

On Tue, 13 Dec 2022, Pádraig Brady wrote:

Thanks to both of you for working through the details on this. This does 
seem like a useful feature, and would be appropriate to add.


A summarized patch set would be appreciated at this stage, thanks.

Re HAVE_INOTIFY, that's really a proxy for a linux kernel, and so would 
be most appropriately changed to:


 defined __linux__ || defined __ANDROID__

I'm thinking these hardcoded defines would be best for now at least as 
it covers the vast majority of systems, and avoids complicated (cross) 
compile time checks.


A modularised iopoll.c would be better, given the potential uses by 
other tools, though we'd probably release for just tee initially.


As for interface to this functionality I'm wondering if we could just 
have the existing tee {-p,--output-error} imply the use of poll() on 
output.


I.e. from a high level -p just means to deal more appropriately with non 
file outputs, and as part of that, dealing immediately with closed 
outputs would be an improvement.


Note also tail(1) enables this functionality by default. I'm not sure 
about other utilities, but we can deal with that later if needed.


Thanks Pádraig for the feedback - that all sounds good.

Will try to follow up sometime this week...

Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-12 Thread Carl Edquist via GNU coreutils General Discussion

On Mon, 12 Dec 2022, Arsen Arsenović wrote:


Hi Carl,

Carl Edquist  writes:


[2. text/x-diff; 0001-tee-only-fstat-outputs-if-pipe_check-is-active.patch]...

[3. text/x-diff; 
0002-tee-skip-pipe-checks-if-input-is-always-ready-for-re.patch]...


Thanks for writing these, and the other patches.  I've once again been 
stripped of time, but I think we've nailed the concept down for the most 
part.  I think we should wait for Pádraig to voice his opinion at this 
point.  Details can be ironed out later, and pretty easily too.


Sure...  Sorry for throwing so many patches at the list  :)

I did have in mind also to send a summarized patch series in a single 
email to make reviewing everything a bit easier.  Pádraig is the 
gatekeeper here, in any case.


The main question on my mind currently is still what's the best way to do 
the platform preprocessor logic for poll vs select ... (assuming 
HAVE_INOTIFY is only in tail.c for "tail -f" mode)


And, i don't know if this would help "sell" it or not, but locally i have 
separated out the iopoll implementation to a separate iopoll.c/.h, and as 
an example successfully added a --pipe-check option to cat and tr.  (Not 
to pressure anyone to take it, but, sometimes it's easier to tell if you 
like an idea or not when you have a patch to look at...)


Personally, i see this broken-pipe detection being useful in tee, cat, 
sed, and grep for starters.  (Though i know sed and grep live outside 
coreutils.)


In some ways i feel like tee is actually the most immediate use case 
though.


cat and sed mostly came into play for me as a way to type commands into a 
socket; though as it seems to me now, "broken-pipe" detection for a socket 
may not even be possible.



All for now.

Have a good one,

Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-12 Thread Carl Edquist via GNU coreutils General Discussion

On Mon, 12 Dec 2022, Arsen Arsenović wrote:


Hi Rob,

Rob Landley  writes:


This sort of thing is why I added -i to toybox's "timeout" command:

-i  Only kill for inactivity (restart timeout when command produces output)

It runs the command's stdout through a pipe and does a poll() with the 
-i seconds value, and signals the program if the poll() expires.


The android guys found it useful, but I was waiting to hear back about 
"cut -DF" before bringing it up here...


That's interesting, might be worth adding to the GNU timeout, however, 
it's not appropriate for what I'm using tee for, since compiler 
processes could appear idle for a long time, if doing LTO for instance.


Thanks Rob for the idea.

Sounds like it could be useful if you know what an appropriate timeout 
ought to be ... though in the general data-driven case i feel like filters 
should only quit for relevant events like end-of-input or write failure. 
(The broken-pipe detection idea being an extension that detects the point 
when future writes will fail.)


Sounds like a neat idea to add to a timeout command (where you know the 
output timeout you want) if it can be made to work with arbitrary 
programs.


Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-10 Thread Carl Edquist via GNU coreutils General Discussion

On Fri, 9 Dec 2022, Carl Edquist wrote:


On Fri, 9 Dec 2022, Arsen Arsenović wrote:

Originally i had imagined (or hoped) that this broken-pipe detection 
could also be used for sockets (that was how the issue came up for 
me), but it seems the semantics for sockets are different than for 
pipes.


This might require POLLPRI or POLLRDHUP or such.  Can you try with 
those to the set of events in pollfd?


Oh interesting!  I had assumed these wouldn't help - POLLPRI is for OOB data, 
and I had assumed POLLRDHUP was only for polling the read side of a socket 
(thus POLL*RD*HUP), to determine if the remote write end was shut down.  But 
you are right, poll() returns with POLLRDHUP in revents as soon as the remote 
end is closed.  Thanks for the tip!


I assume the reason this works is, even though i have in mind that i am 
monitoring the socket as an output, the socket serves as an input also (it's 
open for RW).  So, if my interpretation is correct, POLLRDHUP is actually 
monitoring the local read side of the socket.  And so, poll() is actually 
detecting the remote write side getting shutdown.


This is not technically the same thing as monitoring the local output side, 
but if the remote end of the socket closes entirely (shutting down the remote 
read and write ends together), then that tells us about the local output side 
as well.


This definitely seems to work for the case i was playing with, though i'm not 
sure if it would behave as intended if the remote side only shut down its 
read or write end (but not both).


Alright, i got a chance to play around with this more locally with 
socketpairs.  Unfortunately, i've confirmed that POLLRDHUP is clearly 
intended for monitoring the local read end of a socket, not the local 
write end.  (Thus it is not what we want for monitoring an output.)


In each case below, i started with a socket open for RW on both ends, and 
poll()ed the local socket fd with POLLRDHUP.  I observed the following:


- shutdown local read -> POLLRDHUP
- shutdown local write -> nothing detected
- shutdown remote read -> nothing detected
- shutdown remote write -> POLLRDHUP
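(For reference, the experiment harness was along these lines - a minimal 
sketch; uncomment one shutdown at a time:)

  #define _GNU_SOURCE            /* for POLLRDHUP */
  #include <poll.h>
  #include <stdio.h>
  #include <sys/socket.h>

  int
  main (void)
  {
    int sv[2];                   /* sv[0] local, sv[1] remote */
    socketpair (AF_UNIX, SOCK_STREAM, 0, sv);

    shutdown (sv[0], SHUT_RD);   /* local read   -> POLLRDHUP */
    /* shutdown (sv[0], SHUT_WR);   local write  -> nothing   */
    /* shutdown (sv[1], SHUT_RD);   remote read  -> nothing   */
    /* shutdown (sv[1], SHUT_WR);   remote write -> POLLRDHUP */

    struct pollfd pfd = { .fd = sv[0], .events = POLLRDHUP };
    int r = poll (&pfd, 1, 0);   /* 0 timeout: just a snapshot */
    printf ("poll: %d  revents: %#x\n", r, (unsigned) pfd.revents);
    return 0;
  }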

This is exactly the reverse of what we want when monitoring a socket as an 
output.  Shutting down the local read end should be fine (since we are 
using it for output).  Shutting down the local write end should be 
considered like a broken pipe, since nothing further can be sent. 
Shutting down the remote read end should also be considered like a broken 
pipe, as with a normal pipe.  Shutting down the remote write end should be 
fine, since supposedly the local end of the socket is used for output.


Again, the reason POLLRDHUP worked in my case is that the remote end 
closed completely (shutting down remote read and write).  But for cases 
where either end (local or remote) shuts down only read or write, using 
POLLRDHUP checks for exactly the wrong thing.


It seems that there ought to be a "POLLWRHUP", but as far as i can tell 
that does not exist.


...

My conclusion (as far as i can tell) is that we can't do this iopoll jazz 
correctly for sockets.


The upside, though, is that this makes it ok to explicitly check that the 
output is a pipe, and then only worry about pipe semantics.  That makes 
the select-based implementation safe, since pipe fds are not open for RW.


Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-09 Thread Carl Edquist via GNU coreutils General Discussion

Hi Pádraig,

Getting back to this portability question:

On Fri, 2 Dec 2022, Pádraig Brady wrote:


 Anyway if it's possible just to use poll(2) (the system one, not the
 gnulib replacement), that might simplify the portability logic.


Yes it would be better to use poll() if possible,
and that's what I attempted with tail(1) recently.
From the gnulib docs on poll():

"
 Portability problems fixed by Gnulib:

  This function is missing on some platforms:
  mingw, MSVC 14, HP NonStop.

  This function doesn't work on special files like @file{/dev/null}
  and ttys like @file{/dev/tty} on some platforms:
  AIX 5.3, Mac OS X 10.4.0

Portability problems not fixed by Gnulib:

  Under Windows, when passing a pipe, Gnulib's @code{poll} replacement
  might return 0 even before the timeout has passed.  Programs using it
  with pipes can thus busy wait.

  On some platforms, file descriptors other than sockets do not support
  POLLHUP; they will return a "readable" or "writable" status instead:
  AIX 7.2, HP NonStop.
"

So to use poll() everywhere we'd need the gnulib module.
But empirically the replacement didn't work on macos at least
for this use case, and we know the select emulation wouldn't work on AIX.

So portable use of poll() will be awkward.


Don't remember if i said so already or not, but thanks for the summary!

In my last patch i made a first attempt to define a 
HAVE_POLL_PIPE_CLOSE_DETECTION in configure.ac using AC_RUN_IFELSE to test 
poll() functionality.  Arsen pointed out that this does not work for 
cross-compiling.  (That's too bad, but, i think i understand now why it's 
a problem.)


We could still do just an AC_COMPILE_IFELSE to test whether native poll() 
is available, though if i understand your summary right there are a number 
of platforms where native poll() is available but won't work correctly 
here.


So, if we need to separate out the poll() vs select() implementations 
using a list of platforms, do we have a good idea what list of platforms 
are "ok for native poll pipe detection", or alternatively a list of 
platforms that are not ok?


So tail.c uses:

#if defined _AIX || defined __sun || defined __APPLE__ || HAVE_INOTIFY

but as far as i can tell HAVE_INOTIFY relates to the use of inotify in 
tail, which is specific to "tail -f" mode, and perhaps not relevant to the 
logic in tee (or other filters).


The section you quoted "From the gnulib docs on poll()" makes it sound 
like there might be problems with: Windows, mingw, MSVC 14, HP NonStop, 
AIX 5.3 & 7.2, and Mac OS X 10.4.0.  (Already that makes me wonder about 
_AIX and __APPLE__ being in the list of platforms to use poll().)


Anyway, just trying to get a handle on what might be a good way to do the 
preprocessor macro logic to decide between the poll() and select() 
implementations.


I guess in the worst case, it sounds like select() could be used everywhere. 
And there is some appeal to having a single implementation.  But select() 
does seem less efficient if poll() is available; for instance with 
select(), two fd_sets need to be zeroed out (128 bytes each) before every 
call to read(), compared to the poll() version that only needs to 
initialize two struct pollfds (8 bytes each).  So i still see some value 
in doing poll() where possible.



Any thoughts on how to put together preprocessor logic that gets at the 
heart of the issue?


Thanks,
Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-09 Thread Carl Edquist via GNU coreutils General Discussion

On Fri, 9 Dec 2022, Arsen Arsenović wrote:

Similar to the situation here, i was seeing things annoyingly look like 
they are still 'alive' longer than they ought to be when providing 
input from the terminal.


Huh, I never tried that, honestly.


Here is a simple example:

exec 3<> /dev/tcp/example.com/80

echo_crlf () { printf "%s\r\n" "$@"; }

{ echo_crlf "GET / HTTP/1.1"
  echo_crlf "Host: example.com"
  echo_crlf "Connection: close"
  echo_crlf ""
} >&3

cat <&3


In this example, i'm sending the input to the socket (on fd 3) directly 
using the 'printf' builtin and shell redirection, and i request the server 
close its side of the connection after this request.


But if i want to type the request by hand, i might do:

sed -u 's/$/\r/' >&3

to read from the terminal, convert line terminators to CRLF, and write to 
the socket.  Only, similar to the situation with tee and pipes, sed here 
will not detect when the remote read end of the socket has closed, so it 
hangs waiting for more input.



Did polling help your use-case?


Mostly no, as i kind of got into later last time.  It seems the semantics 
are different for sockets than pipes; for some reason the socket doesn't 
behave like a broken pipe until after the remote read end shuts down and 
then another write is made to the local end.  Only after that does poll() 
seem to detect the broken-pipe-like state.


Oh interesting.  That wasn't on my radar at all.  I guess this means 
that when cross-compiling, the configure script is run on the 
cross-compiling host, rather than on the target platform; so any test 
programs in configure.ac with AC_RUN_IFELSE don't necessarily check the 
target platform functionality (?)


Or worse, is unable to run at all (and always fails), if the binary is 
for a different kernel or architecture.


Ok, so i guess for the sake of cross-compiling we are limited to 
compile/link checks.


Originally i had imagined (or hoped) that this broken-pipe detection 
could also be used for sockets (that was how the issue came up for me), 
but it seems the semantics for sockets are different than for pipes.


This might require POLLPRI or POLLRDHUP or such.  Can you try with those 
to the set of events in pollfd?


Oh interesting!  I had assumed these wouldn't help - POLLPRI is for OOB 
data, and I had assumed POLLRDHUP was only for polling the read side of a 
socket (thus POLL*RD*HUP), to determine if the remote write end was shut 
down.  But you are right, poll() returns with POLLRDHUP in revents as 
soon as the remote end is closed.  Thanks for the tip!


I assume the reason this works is, even though i have in mind that i am 
monitoring the socket as an output, the socket serves as an input also 
(it's open for RW).  So, if my interpretation is correct, POLLRDHUP is 
actually monitoring the local read side of the socket.  And so, poll() is 
actually detecting the remote write side getting shutdown.


This is not technically the same thing as monitoring the local output 
side, but if the remote end of the socket closes entirely (shutting down 
the remote read and write ends together), then that tells us about the 
local output side as well.


This definitely seems to work for the case i was playing with, though i'm 
not sure if it would behave as intended if the remote side only shut down 
its read or write end (but not both).


Also, my bits/poll.h seems to suggest this is a Linux extension:

#ifdef __USE_GNU
/* These are extensions for Linux.  */
# define POLLMSG    0x400
# define POLLREMOVE 0x1000
# define POLLRDHUP  0x2000
#endif


Anyway, this is all Good To Know.  I don't know what the semantics with 
poll() for sockets are supposed to be in the general case (beyond Linux), 
so i don't feel i'm in a good position to advocate for it in coreutils. 
But maybe someone who knows better can comment on this topic.


A quick note, this check only needs to be done a total of once per 
output, it shouldn't be done inside iopoll(), which would result in an 
additional redundant fstat() per read().


Could this be handled by get_next_out?


Sure, either in that function or immediately after it gets called.  But 
also once for stdout before the while (n_outputs) loop.  Alternatively, 
allocate a file type array and populate it in the for loop that does all 
the fopen() calls.


Either way, the idea would be to then replace

if (pipe_check)

with something like

if (pipe_check && first_out_is_pipe)

or (with a file type array)

if (pipe_check && S_ISFIFO (output_types[first_out]))


... Also, i suspect that the pipe_check option can be disabled if the 
_input_ is a regular file (or block device), since (i think) these 
always show up as ready for reading.  (This check would only need to be 
done once for fd 0 at program start.)


Yes, there's no point poll-driving those, since it'll be always 
readable, up to EOF, and never hesitate to bring more data.  It might 

Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-08 Thread Carl Edquist via GNU coreutils General Discussion

On Thu, 8 Dec 2022, Arsen Arsenović wrote:

Apologies for my absence, Tuesdays and Wednesdays are long workdays for 
me.


No need for apologies - I feel like i am the one who should apologize for 
my high volume of email to the list.  People have lives after all!  :)


The timing of this thread caught my attention because I had recently been 
wrestling with a similar issue, trying to use shell utils to talk over a 
socket with the help of bash's /dev/tcp/host/port interface.


Similar to the situation here, i was seeing things annoyingly look like 
they are still 'alive' longer than they ought to be when providing input 
from the terminal.



Biggest item is making a new configure macro based on whether poll() is 
present and and works as intended for pipes.  With 0 timeout, polling 
the write-end of a pipe that is open on both ends for errors does not 
indicate a broken pipe; but polling the write-end of a pipe with the 
read-end closed does indicate a broken pipe.


This might be a bit problematic when cross compiling (which is why I 
imagine systems were hard-coded before).


Oh interesting.  That wasn't on my radar at all.  I guess this means that 
when cross-compiling, the configure script is run on the cross-compiling 
host, rather than on the target platform; so any test programs in 
configure.ac with AC_RUN_IFELSE don't necessarily check the target 
platform functionality (?)


That's too bad.  I had hoped to come up with a better way to indicate a 
working poll() for this feature than maintaining a list of platforms.
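(The guts of the test program were something like this: make a pipe, 
close the read end, and check whether poll() flags the write end - 
exiting 0 on platforms where the detection works:)

  #include <poll.h>
  #include <unistd.h>

  int
  main (void)
  {
    int fd[2];
    if (pipe (fd) != 0)
      return 1;
    close (fd[0]);               /* break the pipe */
    struct pollfd pfd = { .fd = fd[1], .events = 0 };
    if (poll (&pfd, 1, 0) != 1)  /* 0 timeout */
      return 1;
    return (pfd.revents & (POLLERR | POLLHUP)) ? 0 : 1;
  }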




So I guess (on Linux at least) that means a "readable event on STDOUT" is
equivalent to (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR).

So, it'll catch the relevant poll() errors (POLLHUP | POLLERR), but 
the inclusion of POLLIN results in the gotcha that it will be a false 
positive if stdout is already open for RW (eg a socket) and there is 
actually data ready.


Ah - yes.  tail.c guards against this by checking the type of the file
descriptor before selecting it, and makes sure it's among the "one-way"
file descriptors:

  if (forever && ignore_fifo_and_pipe (F, n_files))
    {
      /* If stdout is a fifo or pipe, then monitor it
         so that we exit if the reader goes away.  */
      struct stat out_stat;
      if (fstat (STDOUT_FILENO, &out_stat) < 0)
        die (EXIT_FAILURE, errno, _("standard output"));
      monitor_output = (S_ISFIFO (out_stat.st_mode)
                        || (HAVE_FIFO_PIPES != 1 && isapipe (STDOUT_FILENO)));

Good catch!  It completely slipped by my mind.


Ah, yeah if we know it's a pipe we shouldn't have to worry about an output 
being open for RW.


Originally i had imagined (or hoped) that this broken-pipe detection could 
also be used for sockets (that was how the issue came up for me), but it 
seems the semantics for sockets are different than for pipes.


Experimentally, it seems that once the remote read end of the socket is 
shut down, poll() does not detect a broken pipe - it will wait 
indefinitely.  But at this point if a write() is done on the local end of 
the socket, the first write() will succeed, and then _after_ this it will 
behave like a broken pipe - poll() returns POLLERR|POLLHUP, and write() 
results in SIGPIPE/EPIPE.


It's fairly confusing.  But it seems due to the difference in semantics 
with sockets, likely this broken-pipe detection will only really work 
properly for pipes.


So yeah, back to your point, there is a little room for improvement here 
by fstat()ing the output and only doing the iopoll() waiting if the output 
is a pipe.
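(Something along these lines, say - checked once per fd, up front:)

  #include <stdbool.h>
  #include <sys/stat.h>

  /* True if fd refers to a pipe/FIFO.  */
  static bool
  fd_is_pipe (int fd)
  {
    struct stat st;
    return fstat (fd, &st) == 0 && S_ISFIFO (st.st_mode);
  }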


A quick note, this check only needs to be done a total of once per output, 
it shouldn't be done inside iopoll(), which would result in an additional 
redundant fstat() per read().


... Also, i suspect that the pipe_check option can be disabled if the 
_input_ is a regular file (or block device), since (i think) these always 
show up as ready for reading.  (This check would only need to be done once 
for fd 0 at program start.)


But ... one step at a time!  :)


Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-06 Thread Carl Edquist via GNU coreutils General Discussion

Hi Arsen, thanks for your feedback!

On Tue, 6 Dec 2022, Arsen Arsenović wrote:

The stubborn part of me might say, for platforms that do not natively 
support poll(2), we could simply leave out support for this feature. 
If that's not acceptable, we could add a select(2)-based fallback for 
platforms that do not have a native poll(2).


There's no need to omit it.  iopoll() seems sufficiently easy to 
implement via select():


So, you're right...

I think the stubborn part of me was reacting (badly) to what looked to be 
a messy situation in tail.c's mix of poll() and select().  I got further 
disenchanted when i saw gnulib's select()-based emulation of poll()  :(


But hey, if it works...
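For reference, the select()-based shape i have in mind is roughly this 
(just a sketch to frame the discussion, not Arsen's actual patch):

  #include <sys/select.h>

  /* Wait for input on fdin, or an error condition on fdout.
     Returns 0 when fdin is readable, -1 on a (likely) broken pipe on
     fdout, -2 on select() error.  A readable event on the output fd is
     taken as the error indication, as in tail.c - but see below.  */
  static int
  iopoll_select (int fdin, int fdout)
  {
    int nfds = (fdin > fdout ? fdin : fdout) + 1;
    while (1)
      {
        fd_set rfd;
        FD_ZERO (&rfd);
        FD_SET (fdin, &rfd);
        FD_SET (fdout, &rfd);
        if (select (nfds, &rfd, NULL, NULL, NULL) < 0)
          return -2;
        if (FD_ISSET (fdout, &rfd))
          return -1;             /* output gone (or the RW gotcha below) */
        if (FD_ISSET (fdin, &rfd))
          return 0;              /* input ready (or EOF) */
      }
  }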


Anyway a few comments:

(1.)


+#if defined _AIX || defined __sun || defined __APPLE__ || HAVE_INOTIFY


I know this is what tail.c has, but I would like to seriously rethink 
this.  What do these macros have to do with poll()?  (And just as 
importantly, how can you tell by looking at it?)


First of all, I think poll(2) (the native system one, not the gnulib 
emulation) should be used wherever possible.  Which is to say, select() 
should only be used where native poll() is unavailable (or if it's 
"available" but broken in some important way).


I believe the HAVE_INOTIFY bit is specific to tail.c, which relates to its 
inotify mode.  I don't think it should be part of the equation here in 
tee.c.


The remaining macros _AIX, __sun, __APPLE__ -- do these cover all systems 
that have a working native poll() -- eg, Linux and the BSDs?  (I suspect 
not?)  And they do not clearly communicate anything to the reader about 
why they were picked / what they represent in relation to the availability 
of poll().


*Ideally,* there might be some configure macro like HAVE_POLL, except one 
that refers specifically to the system one and not the gnulib emulation. 
Something like "HAVE_NATIVE_poll".  Then it would be clear what the 
preprocessor logic is trying to accomplish.



(2.) if we want to use #if preprocessor logic (and preferably "#if 
HAVE_NATIVE_poll") for this poll/select fallback mechanism, I'd suggest 
including each entire version of the function (including the function header 
and comments) in separate preprocessor blocks.  (One complete function in 
#if, and the other in #else.) The iopoll() implementation I provided is 
straightforward by itself, but its readability gets trampled badly when 
trying to mix the body of the select()-based implementation in with it. 
Just my 2 cents.



(3.)


+  FD_SET(fdout, &xfd);
+  FD_SET(fdin, &xfd);


Is it documented somewhere that the 'exceptfds' argument (xfd) to 
select(2) can be used in a useful way here?


The only thing I've seen (from select_tut(2)) is that these can be used to 
detect out-of-band data for reading on a socket, or that control status 
information is available for a pty in packet mode (tty_ioctl(4)).  I 
believe both of these would be false positives if encountered on input, 
and it's not clear what they would mean for output.



(4.)


+  /* readable event on STDOUT is equivalent to POLLERR,
+ and implies an error condition on output like broken pipe.  */


I know this is what the comment from tail.c says, but is it actually 
documented to be true somewhere?  And on what platforms?  I don't see it 
being documented in my select(2) on Linux, anyway.  (Though it does seem 
to work.)  Wondering if this behavior is "standard".



I guess there is also a little gotcha.  In tee, all the outputs *except* 
stdout are explicitly open()ed, but stdout itself is not, and therefore it 
inherits the open mode that fd 1 had when tee was exec()ed.


In particular, if fd 1 / stdout is a socket, open for RW, then a "readable 
event on STDOUT" could be a legitimate indication that data is ready for 
reading.  So that would be a false positive causing stdout to be removed 
even though successful writes are still possible.


(Maybe nobody will care about this, but it's an interesting corner-case.)

...

"But other than that", sure!  I guess I'm all for it  :)

Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-02 Thread Carl Edquist via GNU coreutils General Discussion

On Fri, 2 Dec 2022, Arsen Arsenović wrote:

I'm concerned with adding such a behavior change by default still.  I 
can imagine this "lifetime extension" properly having been relied on in 
the last many decades it has been around for ;)


That's fair! :)  I'd be curious to hear about a use-case for that; but 
anyway yeah if it's an option at least there won't be any surprises.



Although tee has multiple outputs, you only need to monitor a single 
output fd at a time.  Because the only case you actually need to catch 
is when the final valid output becomes a broken pipe.  (So I don't 
think it's necessary to poll(2) all the output fds together.)


That is technically true, but I think coupling this to two FDs might 
prove a bit inelegant in implementation (since you'd have to decide 
which entry from an unorganized list with holes do you pick?  any of 
them could spontaneously go away), so I'm not sure the implementation 
would be cleaner that way.


It'd be best to explain with a patch - I'll plan to send one later for a 
concrete example.


But to try to answer your question:

you'd have to decide which entry from an unorganized list with holes do 
you pick?


So as you've seen there is a "descriptors" array corresponding to all the 
outputs.  What I had in mind is to maintain an int that keeps track of the 
index of the first item in the descriptors array that is still valid. 
(They are actually FILE *'s, which are set to NULL when they become 
invalid.)


So, you'd start with the first one (which happens always to be stdout). 
If the output at the current index (starting with stdout) ever becomes 
invalid (due to broken pipe detection, or due to other non-fatal write 
failure), then increase the index until the next valid output is found 
(skipping NULL entries).  If the last output is closed, it's not really 
important what happens to the index - it can be left as-is or set to -1; 
it won't be used again in any case.



any of them could spontaneously go away


I think it should be OK just to check the current output index in the list 
and advance if that one closes.  If a pipe becomes broken in the middle of 
the list, I think it's fine to let it be.


Why is this OK?

First of all, the current index still refers to a valid output.  That 
means you are _not_ in a situation where all outputs are broken pipes, so 
the right thing to do is to continue waiting for input.


Then, if input arrives, it will get written to all the valid (ie, 
non-NULL) outputs, including the one that has now become a broken pipe 
(without being detected).


But this is OK, because (as we've discussed before) we should already be 
ignoring SIGPIPE (and handling EPIPE), to prevent races that can happen 
where a fatal SIGPIPE comes in after a read() and before the corresponding 
write().  So, this broken pipe gets removed due to the write failure 
(EPIPE), rather than the broken-pipe detection.


But it does not affect the lifetime of the tee process, as any time poll() 
is waiting, the index will point to an output that is still valid and is 
not (yet) a broken pipe.


...


But the proof is in the pudding; so I'll try to draft something up and you 
can see what you think technically & aesthetically...



Cheers!
Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-02 Thread Carl Edquist via GNU coreutils General Discussion

PS:

On Fri, 2 Dec 2022, Carl Edquist wrote:

On the topic of implementation - I was thinking more about a general 
solution for filter utils, and I think the key thing is to provide 
a replacement (wrapper) for read(2) that polls two fds together (one 
input and one output), with no timeout.


It would check for POLLIN on input (in which case do the read()). 
Otherwise if there is an error (POLLERR or POLLHUP) on input, treat it 
as EOF. Otherwise if there's an error on output, remove this output, or 
handle it similar to SIGPIPE/EPIPE.


(Nothing is written to the output fd here, it's just used for polling.)

...

I think this general approach of using poll(2) for a single input along 
with a single output could be used for both tee and other filters that 
only write to stdout.



A question for Pádraig (or anyone else who might know) -

Are there portability concerns with the system poll(2), or just with the 
gnulib replacement for poll(), which is based on select(2) ?


The commit message (3a1c328cd) seems to suggest the latter.

In other words, are there any platforms that do not support using the 
system poll(2) directly?


tail.c seems to use poll(2) for _AIX || __sun || __APPLE__ || HAVE_INOTIFY 
and *otherwise* it uses select(2).


But I am thinking, if it's available, poll() is the more appropriate 
interface for this check in the first place.


The 'trick' in tail.c of using select() to test when STDOUT becomes 
available for reading, and taking that as an indication of an error 
condition, doesn't seem to be documented (in my linux select(2)), and also 
seems like it might not work as intended if fd 1 is open for read-write.



Anyway if it's possible just to use poll(2) (the system one, not the 
gnulib replacement), that might simplify the portability logic.



What do you think?


Thanks,
Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-12-02 Thread Carl Edquist via GNU coreutils General Discussion

On Wed, 30 Nov 2022, Arsen Arsenović wrote:


Carl Edquist  writes:

It sounds like one way or another you want to copy your endless but 
intermittent input to multiple output pipes, but you want to quit as 
soon as all the output pipes become broken.


Precisely.  The most important requirement there is that the tee-based 
substitute imitates the lifetime of its longest-lived output.  Now I'm 
thinking, maybe --pipe-check should also block SIGPIPE, to prevent the 
race between poll, process death and write (which would result in the 
process getting killed, as it'd happen right now; to see what I mean, 
try ``tee >(sleep 100) >(:)'' and press enter after a bit; a race could 
make --pipe-check behave like that).


Right, you need to ignore SIGPIPE (like "tee -p" does) for multiple 
outputs if any of them can exit early...  Sometimes I forget that '-p' is 
not on by default for tee.  I don't think I've encountered a use-case for 
specifically wanting this option to be off.



I'll keep this in mind for v2, which is currently waiting on me having 
some time to research the portability of this whole thing, and on a 
decision on whether to even include this feature.


On the topic of implementation - I was thinking more about a general 
solution for filter utils, and I think the key thing is to provide a 
replacement (wrapper) for read(2) that polls two fds together (one input 
and one output), with no timeout.


It would check for POLLIN on input (in which case do the read()). 
Otherwise if there is an error (POLLERR or POLLHUP) on input, treat it as 
EOF.  Otherwise if there's an error on output, remove this output, or 
handle it similar to SIGPIPE/EPIPE.


(Nothing is written to the output fd here, it's just used for polling.)
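A sketch of the shape I mean (the function name here is just for 
illustration):

  #include <errno.h>
  #include <poll.h>
  #include <unistd.h>

  /* read(2) replacement: returns the read count (0 at EOF), or -1 with
     errno = EPIPE if an error shows up on fdout before input arrives.  */
  static ssize_t
  read_watching_output (int fdin, int fdout, void *buf, size_t count)
  {
    struct pollfd pfd[2] = {
      { .fd = fdin,  .events = POLLIN },
      { .fd = fdout, .events = 0 },        /* error events only */
    };
    while (1)
      {
        if (poll (pfd, 2, -1) < 0)         /* no timeout */
          {
            if (errno == EINTR)
              continue;
            return -1;
          }
        if (pfd[0].revents & POLLIN)
          return read (fdin, buf, count);  /* input ready */
        if (pfd[0].revents & (POLLERR | POLLHUP))
          return 0;                        /* treat input error as EOF */
        if (pfd[1].revents & (POLLERR | POLLHUP))
          {
            errno = EPIPE;                 /* output is a broken pipe */
            return -1;
          }
      }
  }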

...

Although tee has multiple outputs, you only need to monitor a single 
output fd at a time.  Because the only case you actually need to catch is 
when the final valid output becomes a broken pipe.  (So I don't think it's 
necessary to poll(2) all the output fds together.)


I think this general approach of using poll(2) for a single input along 
with a single output could be used for both tee and other filters that only 
write to stdout.


(But again, "tail -f" is different, because it checks for regular files to 
grow as the source of intermittent input - which I don't think is 
something you can use poll() to wait for.)



I'll try to put together a patch to demo what I have in mind...


But if you don't have control over that, the fundamental problem is 
detecting broken pipes *without writing to them*, and I don't think 
that can be solved with any amount of extra pipes and fd redirection...


I imagine that, technically, this is attainable by editing the process 
substitutions involved to also signal the original process back; 
however, this feels less elegant and generally useful than tee handling 
this, given that tee's use-case is redirecting data to many places.


Ah, yeah, I bet you could rig something up waiting on processes and/or 
using signals.  In the general case there is a slight difference between 
"waiting for a process to terminate" and "waiting for the pipe to become 
broken" (the process attached to the read-end of the pipe could close its 
stdin early, or it could exit after forking a child that keeps the pipe 
open), but yeah in the most common case the two events happen together.



Have a nice day :)

Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-11-29 Thread Carl Edquist via GNU coreutils General Discussion

On Tue, 29 Nov 2022, Arsen Arsenović wrote:

The issue here isn't the compilers hanging, it's tee living longer than 
all the compilers do because its stdin doesn't EOF (it'd be preferable 
for it to only live as long as the last of the compilers).


I can imagine attempting to implement this with enough pipe and fd 
redirection magic, but I'm not sure how (in)elegant that becomes.


It sounds like one way or another you want to copy your endless but 
intermittent input to multiple output pipes, but you want to quit as soon 
as all the output pipes become broken.


To me, tee sounds like exactly the place to do that.  Otherwise, you'd 
have to add the broken-pipe detection (as in your patch) to your own 
program, along with the rest of tee's basic functionality  :)


It would be one thing if you controlled the programs that consume the 
input (you could have them handle 'heartbeats' in the input stream, and 
once these programs terminate, the heartbeats would trip on the broken 
pipe).  (However (in)elegant that becomes to implement...)


But if you don't have control over that, the fundamental problem is 
detecting broken pipes *without writing to them*, and I don't think that 
can be solved with any amount of extra pipes and fd redirection...


Carl


Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-11-29 Thread Carl Edquist via GNU coreutils General Discussion

On Tue, 29 Nov 2022, Pádraig Brady wrote:


On 29/11/2022 17:32, Carl Edquist wrote:

 ...

 If this kind of detect-broken-output-pipe logic were added to filter
 utils generally, the above example (with 4 cats) would return to the
 shell immediately.


Right. Thanks for discussing the more general pattern.
I.e. that SIGPIPE doesn't cascade back up the pipeline,
only upon attempted write to the pipe.
So it's not really a tee issue, more of a general pattern.

So it wouldn't be wrong to add this to tee (by default),
but I'm not sure how useful it is given this is a general issue for all 
filters.


That makes sense; though I suppose it would continue to become more useful 
for these types of cases as it gets added to more filters  :)


Also I'm a bit wary of inducing SIGPIPE as traditionally it hasn't been 
handled well:


But wait now, are we talking about inducing SIGPIPE?

In the current patch for tee, I think the idea was just to remove an 
output from the list when it's detected to be a broken pipe, allowing tee 
to exit sooner if all of its outputs are detected to be broken.


Similarly for the general filter case (usually with only a single output), 
the idea would be to allow the program to exit right away if it's detected 
that the output is a broken pipe.



https://www.pixelbeat.org/programming/sigpipe_handling.html


Actually, to me, if anything this page seems to serve as motivation to add 
broken-pipe checking to coreutils filters in general.


The article you linked ("Don't fear SIGPIPE!") discusses three topics:


1. First it discusses default SIGPIPE behavior - programs that do not 
specifically handle SIGPIPE do the right thing by default (they get 
killed) when they continue writing to their output pipe after it 
breaks.


We wouldn't be changing this for any filters in coreutils.  Writing to a 
broken pipe should still produce a SIGPIPE to kill the program.  (Even 
with broken-pipe detection, this can still happen if input becomes ready, 
then the output pipe's read end is closed, then the write is attempted.)


tee is actually a special case (if -p is given), because then SIGPIPE does 
not kill the process, but the write will still fail with EPIPE and tee 
will remove that output.


We also wouldn't really be inducing anything new for programs earlier in 
the pipeline.  If they don't handle SIGPIPE, they will just get killed 
with it more promptly - they will end up writing to a broken pipe one 
write(2) call sooner.



2. The "Incorrect handling of SIGPIPE" section discusses programs that 
attempt to handle SIGPIPE but do so poorly.  This doesn't apply to us 
either.  Filters that add broken-pipe detection do not need to add SIGPIPE 
handling.  And programs that handle it poorly, earlier in the pipeline, 
will have their problems regardless.  (Again, just one write(2) call 
sooner.)



3. Finally, the "Cases where SIGPIPE is not useful" section actually 
highlights why we *should* add this broken-pipe checking to filters in 
general.


The "Intermittent sources" subsection discusses exactly what we are 
talking about fixing:


For example 'cat | grep -m1 exit' will only exit, when you type a line 
after you type "exit".


If we added broken-pipe checking to cat, then this example would behave 
like the user would have wanted - typing "exit" would cause grep to exit, 
and cat will detect it's output pipe breaking, and exit immediately.


The other example about 'tail' was fixed already, as this kind of checking 
was added to tail, as we've discussed.  It's a good start!  The more utils 
we add it to, the more will be able to benefit.


The "Multiple outputs" subsection is specific to tee, and if anything 
perhaps suggests that the '-p' option should be on by default.  That is, 
it makes an argument for why it makes sense for tee to avoid letting a 
SIGPIPE kill it, but rather only to exit when all the input is consumed or 
all the outputs have been removed due to write errors.


The "Long lived services" subsection is a generalization of what was just 
said about tee - namely that it makes sense that some programs want to 
continue after a failed write attempt into a broken pipe, and such 
programs need to handle or ignore SIGPIPE.  This is true for such programs 
already, and adding broken-pipe checking to a filter in the same pipeline 
doesn't change that at all.  (Again, it will just cause them to get a 
SIGPIPE/EPIPE *promptly* - one write call sooner - when the final consumer 
of the output completes.)


...

Or perhaps when you mention "inducing SIGPIPE", you are referring to how 
tail(1) does things currently (when it detects a broken output), by 
attempting raise(SIGPIPE) followed by exit(EXIT_FAILURE).  It seems this 
is just an attempt to make it look to the waiting parent process that tail 
died trying to write to a broken pipe (somewhat of a white lie).  Most 
likely it could just exit(EXIT_FAILURE) without confusing the caller.  So 
if you'd like to avoid that, it's 

Re: [PATCH] tee: Add --pipe-check to allow instantly detecting closed outputs

2022-11-29 Thread Carl Edquist via GNU coreutils General Discussion

Hi all,

On Mon, 28 Nov 2022, Arsen Arsenović wrote:


Pádraig Brady  writes:


Trying to understand your use case better,
...


The bug we observed is that on occasion, for instance when running with 
a tty, or with a script that (for some reason) has a pipe on stdin, the 
tee-based "compiler" would hang.  To replicate this, try:


 tee >(gcc test.c -o a.out.1) >(gcc test.c -o a.out.2)

in a tty (here, the stdin is meant to be irrelevant).


If I may try to provide a simple example of the problem, consider the 
following command line:


tee >(sleep 3) | sleep 5

Let tee's stdin be a terminal to supply the "intermittent input".

You'll see that after 5 seconds, this will hang indefinitely until you hit 
Enter.


For the first 3 seconds, when hitting the Enter key, tee will successfully 
write the line to each pipe.  Between 3 and 5 seconds, the pipe to "sleep 
3" will be broken, which tee will notice, and then tee will continue 
writing the lines to the "sleep 5" pipe.


But after 5 seconds, when "sleep 5" has terminated and that pipe becomes 
broken, tee will continue to "hang" waiting for input (in this case the 
intermittent input from the terminal) indefinitely, despite the fact that 
all of tee's outputs are now broken pipes.  tee will only "notice" that 
the "sleep 5" pipe is broken when it receives input after that point, 
because then the write to that pipe fails with EPIPE (and/or a SIGPIPE is 
delivered).


...

It seems the ideal thing to happen here is for tee to terminate once it 
determines that all of its outputs are broken pipes.  It comes close to 
this already, but it only learns about this when write attempts fail, and 
it only attempts a write when it has input to tee.


As I suppose was suggested in the patch, perhaps poll(2) could be used to 
wait for POLLIN from fd 0, and POLLHUP for outputs (perhaps limited to 
pipes / sockets).
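
To make that concrete, here's roughly the shape it could take (just a 
sketch of the idea, with made-up names - not the actual patch):

   #include <poll.h>
   #include <stdbool.h>
   #include <unistd.h>

   /* Wait until stdin has input (return true), or until every output
      in out_fds[0..n_out-1] is a broken pipe (return false).  */
   static bool
   input_ready_or_outputs_gone (int const *out_fds, int n_out)
   {
     struct pollfd pfd[1 + n_out];   /* C99 VLA, for brevity */
     int live = n_out;

     pfd[0].fd = STDIN_FILENO;
     pfd[0].events = POLLIN;
     for (int i = 0; i < n_out; i++)
       {
         pfd[1 + i].fd = out_fds[i];
         pfd[1 + i].events = 0;      /* POLLERR/POLLHUP come for free */
       }

     while (true)
       {
         if (poll (pfd, 1 + n_out, -1) < 0)
           return true;              /* let the caller's read () complain */

         for (int i = 0; i < n_out; i++)
           if (pfd[1 + i].fd >= 0
               && (pfd[1 + i].revents & (POLLERR | POLLHUP)))
             {
               pfd[1 + i].fd = -1;   /* dead output; stop watching it */
               live--;
             }

         if (live == 0)
           return false;             /* all outputs are broken pipes */

         if (pfd[0].revents & (POLLIN | POLLHUP))
           return true;              /* input (or EOF) ready on stdin */
       }
   }

(Note that on Linux a pipe whose read end has closed reports POLLERR on 
the write end, so the sketch checks for POLLERR as well as POLLHUP.)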


The patch subject suggests adding --pipe-check as an option, but at first 
blush it seems like this would actually be a good thing to do by 
default...


Cheers,
Carl


Re: [PATCH] ls: add --files0-from=FILE option

2021-04-26 Thread Carl Edquist via GNU coreutils General Discussion

On Wed, 21 Apr 2021, Carl Edquist wrote:


Thanks Pádraig for the thoughtful reply!

You bring up some good points, which for the sake of interesting
discussion i'd like to follow up on also.  (Maybe later this week...)



So to follow up - i don't have any action items here, just some details 
that i thought might make for interesting discussion.



On Tue, 20 Apr 2021, p...@draigbrady.com wrote:


 In this case, xargs batches *13* separate invocations of ls; so the
 overall sorting is completely lost.

 But with the new option:

  [linux]$ find -name '*.[ch]' -print0 | ls -lrSh --files0-from=-

 The sizes all scroll in order


One can also implement this functionality with the DSU pattern like:

  nlargest=10
  find . -printf '%s\t%p\0' |
  sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2 |
  xargs -r0 ls -lUd --color=auto --


I appreciate you taking the time to write out a full DSU example.

I was going to save this topic for a separate thread, but yeah one of my 
observations over the years is i would see people make (i would say 
oversimplified) comments that "you can do that with find", but they rarely 
go through the effort to show it (and thus demonstrate just how much more 
complicated the equivalent find-based solution is to type out).


(More in this vein a bit later.)



Arguably that's more scalable as the sort operation will not
be restricted by available RAM, and will use temp storage.


Agreed!  This is a good point to keep in mind in general for ls.

On the one hand, ls already has file metadata information available to it 
in raw format (eg, size, timestamps), so there isn't the overhead to 
convert relevant keys to sortable strings and write it all across pipes 
for IPC.


But on the other hand, as you are saying, for the things that find(1) can 
give sort(1) to sort (which is _most_ of the ls --sort modes), sort(1) can 
sort a bit more efficiently and can handle situations with very many file 
names and very limited system memory.


But by the same argument,

find . -mindepth 1 -maxdepth 1 ! -name .\* -printf '%f\0' |
sort -z | xargs -r0 ls $LS_OPTIONS -dU --

is more scalable than a bare 'ls' invocation, since there can be literally 
millions of entries in a single directory.


But 'ls' is certainly easier to type.

And if you want to sort on anything more interesting than the name itself, 
the DSU pipeline just gets more complex.


So at some point you can weigh scalability vs usability based on the 
situation.



(In the linux tree example above, the "ls -lrSh --files0-from=-" run with 
45648 input source files has a maxrss of ~16M, so for instance this use 
case is small enough for me not to worry about mem use.)



[By the same token, the sorting of shell globs in general (which in bash 
can be expanded to an arbitrary number of arguments), can be more scalably 
done outside of the shell (with find|sort) than in a shell glob itself. 
And while that is not a coreutils issue, the point is that despite this, 
(as with ls) shell globs can still be more convenient to use for sorting a 
set of files, perhaps in most cases, than an equivalent multi-tool 
pipeline.]



But doing the sorting in ls (rather than find|sort|cut|xargs) is not just 
easier to type -- it's also easier to _get it right_ without a lot of 
trial and error.


And for casual interactive use, that's kind of a feature, too.


For instance, in your DSU example, i spy a subtle bug in the undecorate 
step: looks like it should be "cut -zf2-" instead of "cut -zf2" ... 
because after all, tabs are legal in filenames, too.
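
(That is, with the fix, the pipeline would read:

   nlargest=10
   find . -printf '%s\t%p\0' |
   sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2- |
   xargs -r0 ls -lUd --color=auto --

so a name containing a tab keeps everything after its first tab, too.)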


The DSU in pixelbeat's "newest" script (mentioned later) is also not free 
from filename handling issues [1].


(Point being - even for experienced users, writing a replacement for ls's 
internal sorting with an equivalent DSU pipeline can be tricky & time 
consuming to get right, and easy to get subtly wrong.)



Also there are some other subtle differences that might be worth keeping 
in mind:


- To get the same sort order as ls, you actually need "sort -rzk1,1n" 
instead of just "sort -zk1,1n", since for tie-breaking ls sorts the 
keys and the names in opposite directions for ls -t and -S (whether or not 
you pass -r to ls).


(Likewise if you want the same order as ls -S without the -r, you need a 
slightly different "sort -zk1,1nr".)


- when you put 'ls' after a pipe (as in "... | ls -lrSh --files0-from=-") 
you typically get the alias (eg, slackware has ls='/bin/ls $LS_OPTIONS').
Meanwhile 'xargs ls' is whichever 'ls' is first in PATH, and does not 
include LS_OPTIONS.  Not a big deal, but, buyer beware.


- lastly (and perhaps the only thing you can't do anything about), when 
you use 'xargs ls' rather than a single ls invocation, you lose the 
aggregate width alignments for any ls formats with column output (in 
this case -l, but it's also true for -C and -x).
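
(Putting the tie-breaking detail together with the cut fix above, a 
version faithful to the 'ls -lrSh' order over the whole set would be 
something like:

   find . -printf '%s\t%p\0' |
   sort -rzk1,1n | cut -z -f2- |
   xargs -r0 ls -lUdh --

- the 'n' modifier on the key keeps the sizes ascending, while the global 
-r only reverses the last-resort whole-line comparison used for ties.)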




Also it's more robust as ls is the last step in the pipeline,
presenting the files to the user. If you wanted the 

Re: [PATCH] ls: add --files0-from=FILE option

2021-04-21 Thread Carl Edquist via GNU coreutils General Discussion
On Tue, 20 Apr 2021, p...@draigbrady.com wrote:

> One can also implement this functionality with the DSU pattern like:
>
>   nlargest=10
>   find . -printf '%s\t%p\0' |
>   sort -z -k1,1n | tail -z -n"$nlargest" | cut -z -f2 |
>   xargs -r0 ls -lUd --color=auto --
>
> Arguably that's more scalable as the sort operation will not
> be restricted by available RAM, and will use temp storage.
>
> Also it's more robust as ls is the last step in the pipeline,
> presenting the files to the user. If you wanted the largest 10
> with ls --files0-from then file names containing a newline
> would mess up further processing by tail etc.
>
>>  Similarly, say you would like to view / scroll through your extensive mp3
>>  collection in chronological order (based on when you added the files to
>>  your collection).  You can do it now with the new option:
>>
>>   [music]$ find -name \*.mp3 -print0 | ls -lrth --files0-from=-
>
> I've used a https://www.pixelbeat.org/scripts/newest script
> for a long time with similar find|xargs technique as I described above.
>
> In saying all of the above, I do agree though that for consistency
> commands that need to process all arguments with global context,
> should have a --files0-from option.
> Currently that's du and wc for total counts, and sort(1) for sorting.
> Since ls has sorting functionality, it should have this option too.


Thanks Pádraig for the thoughtful reply!

You bring up some good points, which for the sake of interesting 
discussion i'd like to follow up on also.  (Maybe later this week...)

Have a good day!

Carl

Re: [PATCH] ls: add --files0-from=FILE option

2021-04-21 Thread Carl Edquist via GNU coreutils General Discussion

Hi Berny,

On Wed, 21 Apr 2021, Bernhard Voelker wrote:

shouldn't it use the 'argv-iter' gnulib module (like du.c and wc.c) 
instead of directly using the underlying ...



+#include "readtokens0.h"


I considered this, too!  :)

I think the short answer is that du and wc don't actually need to keep any 
information around from all the files they process except for the running 
totals, so "at scale" the argv-iter version for files0 processing allows 
for a constant (?) memory usage in du and wc regardless of the number of 
input items.


But for ls, even if using argv-iter would result in one less copy of the 
filenames, the scaled memory requirements are still the same for sorting 
(fileinfo for each entry has to be kept till the end).


[If anything, i would look more carefully into whether gobble_file in ls.c 
really needs to make a copy of each filename, rather than just using the 
command line arguments or names from files0 input directly.]



So i ended up following the pattern closer to what's in sort.c.  Note that 
argv-iter is not used in there either, perhaps for a similar reason (it's 
not going to bring the memory requirements down to O(1), the way it could 
for wc and du).



The one case ls might possibly find an improvement with argv-iter is for 
unsorted, non-column output (so, specifically -U1), where the names are 
only needed once.  But in that case there's no need to use the 
'--files0-from' option for global sort or summary info -- you could use 
'xargs -0 ls -U1 ...' instead for identical output.
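
That is, for that one case the two spellings should produce identical 
output:

   find . -print0 | ls -U1 --files0-from=-
   find . -print0 | xargs -r0 ls -U1 --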



It just didn't seem like there was a strong argument for it (same memory 
scaling regardless).  And the other frank truth is (if you look at wc.c 
and du.c for examples), it seemed like adding argv_iterator processing 
would significantly complicate the code.



Those were my thoughts anyway  :)

Have a good one,
Carl



Re: [PATCH] ls: add --files0-from=FILE option

2021-04-19 Thread Carl Edquist via GNU coreutils General Discussion

On Mon, 19 Apr 2021, Carl Edquist wrote:

I'm submitting for your consideration here a patch to add the standard 
'--files0-from=FILE' option to ls.


Oops! Somehow I ended up attaching an old version of the patch - this one 
includes some cleanups and help output.


Thanks,
Carl

From 30fb711962bbd432aea4268baf93d7ba17aae4bc Mon Sep 17 00:00:00 2001
From: Carl Edquist 
Date: Fri, 10 May 2019 17:05:47 -0500
Subject: [PATCH] ls: add --files0-from=FILE option

Useful for things like

find [...] -type f -print0 | ls -lrSh --files0-from=-

where an attempt to do the same with 'find -print0 | xargs -0' or
'find -exec' would fail to fit all the file names onto a single command
line, and thus ls would not sort all the input items together, nor
align the columns from -l across all items.

* src/ls.c (files_from): New var for input filename.
(long_options): Add new long option.
(read_files0): Add helper function to consume files0 input.
(main): Add logic for files_from handling.
(decode_switches): Handle FILES0_FROM_OPTION.
* tests/ls/files0-from-option.sh: Exercise new option.
* tests/local.mk: Include new test.
* doc/coreutils.texi: Document --files0-from=FILE option.
* NEWS: Mention the new feature.
---
 NEWS   |  3 ++
 doc/coreutils.texi |  2 ++
 src/ls.c   | 69 +++---
 tests/local.mk |  1 +
 tests/ls/files0-from-option.sh | 40 
 5 files changed, 111 insertions(+), 4 deletions(-)
 create mode 100755 tests/ls/files0-from-option.sh

diff --git a/NEWS b/NEWS
index 090fbc7..7f7f86b 100644
--- a/NEWS
+++ b/NEWS
@@ -73,6 +73,9 @@ GNU coreutils NEWS-*- outline -*-
   ls now accepts the --sort=width option, to sort by file name width.
   This is useful to more compactly organize the default vertical column output.
 
+  ls now accepts the --files0-from=FILE option, where FILE contains a
+  list of NUL-terminated file names.
+
   nl --line-increment can now take a negative number to decrement the count.
 
 ** Improvements
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 3e3aedb..e008746 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -7467,6 +7467,8 @@ option has been specified (@option{--classify} (@option{-F}),
 @option{--dereference} (@option{-L}), or
 @option{--dereference-command-line} (@option{-H})).
 
+@filesZeroFromOption{ls,,sorted output (and aligned columns in -l mode)}
+
 @item --group-directories-first
 @opindex --group-directories-first
 Group all the directories before the files and then sort the
diff --git a/src/ls.c b/src/ls.c
index 4586b5e..0d0678b 100644
--- a/src/ls.c
+++ b/src/ls.c
@@ -102,6 +102,7 @@
 #include "mpsort.h"
 #include "obstack.h"
 #include "quote.h"
+#include "readtokens0.h"
 #include "smack.h"
 #include "stat-size.h"
 #include "stat-time.h"
@@ -356,6 +357,9 @@ static bool align_variable_outer_quotes;
 static void **sorted_file;
 static size_t sorted_file_alloc;
 
+/* input filename for --files0-from */
+static char *files_from;
+
 /* When true, in a color listing, color each symlink name according to the
    type of file it points to.  Otherwise, color them according to the 'ln'
    directive in LS_COLORS.  Dangling (orphan) symlinks are treated specially,
@@ -833,6 +837,7 @@ enum
   COLOR_OPTION,
   DEREFERENCE_COMMAND_LINE_SYMLINK_TO_DIR_OPTION,
   FILE_TYPE_INDICATOR_OPTION,
+  FILES0_FROM_OPTION,
   FORMAT_OPTION,
   FULL_TIME_OPTION,
   GROUP_DIRECTORIES_FIRST_OPTION,
@@ -892,6 +897,7 @@ static struct option const long_options[] =
   {"block-size", required_argument, NULL, BLOCK_SIZE_OPTION},
   {"context", no_argument, 0, 'Z'},
   {"author", no_argument, NULL, AUTHOR_OPTION},
+  {"files0-from", required_argument, NULL, FILES0_FROM_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -1626,11 +1632,38 @@ signal_restore (void)
   signal_setup (false);
 }
 
+static void
+read_files0(struct Tokens *tokp)
+{
+  FILE *stream;
+  if (STREQ (files_from, "-"))
+    stream = stdin;
+  else
+    {
+      stream = fopen (files_from, "r");
+      if (stream == NULL)
+        die (EXIT_FAILURE, errno, _("cannot open %s for reading"),
+             quoteaf (files_from));
+    }
+
+  readtokens0_init (tokp);
+
+  if (! readtokens0 (stream, tokp) || fclose (stream) != 0)
+    die (LS_FAILURE, 0, _("cannot read file names from %s"),
+         quoteaf (files_from));
+
+  if (! tokp->n_tok)
+    die (LS_FAILURE, 0, _("no input from %s"),
+         quoteaf (files_from));
+}
+
 int
 main (int argc, char **argv)
 {
   int i;
   struct pending *thispend;
+  struct Tokens tok;
+  char **files;
   int n_files;
 
   initialize_main (&argc, &argv);
@@ -1727,6 +1760,23 @@ main (int argc, char **argv)
   clear_files ();
 
   n_files = argc - i;
+  files = argv + i;
+
+  if (files_from)
+    {
+      if (n_files > 0)
+        {
+          error (0, 0, _("extra operand %s"), quoteaf