Hi, I gave it a second look. I think it could be a flaw in your script after all.
===================================== a.sh =====================================
tmpdir=/var/tmp/FIFOs$$
trap "exec rm -rf $tmpdir" EXIT INT TERM PIPE
mkdir "$tmpdir" || exit
i=0; while test "$((i+=1))" -le 100; do
	fifo=$tmpdir/FIFO$i
	mkfifo "$fifo" || exit
	echo "this is FIFO $i" >"$fifo" &
	read foo <"$fifo"
done
================================================================================

$ truss -deafl bash -x a.sh

######### debug output from -x. 2164 is our shell PID
+ fifo=/var/tmp/FIFOs2164/FIFO3
2164/1:  0.143930 write(2, " + f i f o = / v a r /".., 32) = 32
+ mkfifo /var/tmp/FIFOs2164/FIFO3
2164/1:  0.144186 write(2, " + m k f i f o   / v a".., 34) = 34
######### running mkfifo
2175/1:  0.144673 forkx() (returning as child ...) = 2164
...
2175/1:  0.153917 execve("/usr/bin/mkfifo", 0x96328E508, 0x963286208)  argc = 2
...
2175/1:  0.164705 mknodat(AT_FDCWD, "/var/tmp/FIFOs2164/FIFO3", 010644, 0x00000000) = 0
2175/1:  0.164880 _exit(0)
######## Now, in case of the hang, the read trace gets printed prior to the write
######## (by the echo). Maybe creating the subshell takes some time?
+ read foo
2164/1:  0.170598 write(2, " + r e a d   f o o\n", 11) = 11
######## The write subshell is created
2177/1:  0.166957 forkx() (returning as child ...) = 2164
...
######## The subshell prints its trace message
+ echo 'this is FIFO 3'
2177/1:  0.173246 write(2, " + e c h o   ' t h i s".., 24) = 24
######## The subshell opens the fifo and prints the output there
2177/1:  0.175586 openat(AT_FDCWD, "/var/tmp/FIFOs2164/FIFO3", O_WRONLY|O_CREAT|O_TRUNC|O_XPG4OPEN, 0666) = 3
2177/1:  0.175763 fcntl(3, F_DUP2FD, 0x00000001) = 1
2177/1:  0.175889 close(3) = 0
2177/1:  0.176280 write(1, " t h i s   i s   F I F O".., 15) = 15
######## And exits
2177/1:  0.176580 _exit(0)
######## Our main shell receives SIGCLD
2164/1:  0.177089 Received signal #18, SIGCLD, in openat() [caught]
2164/1:  siginfo: SIGCLD CLD_EXITED pid=2177 status=0x0000
######## It tries to open the fifo, but gets ERESTART (???)
2164/1:  0.178976 openat(AT_FDCWD, "/var/tmp/FIFOs2164/FIFO3", O_RDONLY|O_XPG4OPEN) Err#91 ERESTART
2164/1:  0.179158 lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
2164/1:  0.179296 waitid(P_ALL, 0, 0x7FDC2CBA3CF0, WEXITED|WTRAPPED|WNOHANG) = 0
2164/1:  0.179453 waitid(P_ALL, 0, 0x7FDC2CBA3CF0, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
2164/1:  0.179582 setcontext(0x7FDC2CBA3B50)
####### It tries again ???
2164/1:  openat(AT_FDCWD, "/var/tmp/FIFOs2164/FIFO3", O_RDONLY|O_XPG4OPEN) (sleeping...)

In a good cycle I can see:

######## First it opens the fifo
2895/1:  0.271068 openat(AT_FDCWD, "/var/tmp/FIFOs2895/FIFO5", O_RDONLY|O_XPG4OPEN) = 3
######## Then the signal is received
2895/1:  0.274172 Received signal #18, SIGCLD [caught]
2895/1:  siginfo: SIGCLD CLD_EXITED pid=2916 status=0x0000

Another workaround thus seems to be to delay the writing side:

	( echo "this is FIFO $i" >"$fifo"
	  sleep 1 ) &

What confuses me slightly is that during the "bad" run the open is executed
again. My guess (at the moment) is that libc itself interprets the ERESTART
and retries the open internally.

Another thing that confuses me is that during the "bad" run the write to the
pipe didn't sleep waiting for a reader. I would say it should. But then maybe
all of this is really happening in parallel and the truss output is not an
exact representation of the timing.

That's where I am now.

> >Hmm, so we are waiting on a condition. That needs to be investigated
> >deeper. It would be great if you could open a case for this as we have
> >to prioritize our work ...
>
> How would I go about it? I'm not an Oracle customer, nor could I justify the
> expense to become one. I use my test Solaris VMs only to test modernish, as
> the free-of-charge licence permits. I've previously been given to understand
> that only customers can report Solaris bugs. But I'd like to give some bug
> reports back if you give me a way to do it.
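For what it's worth, the rendezvous the script relies on can be shown in
isolation. This is just an illustrative sketch (it uses `mktemp -d` rather
than the script's `/var/tmp/FIFOs$$`): per POSIX, an open for reading on a
FIFO blocks until some process opens it for writing, and vice versa, so the
two sides meet.

```shell
#!/bin/sh
# Minimal FIFO rendezvous: open(O_RDONLY) blocks until a writer
# opens the FIFO, and open(O_WRONLY) blocks until a reader does.
tmpdir=$(mktemp -d) || exit
fifo=$tmpdir/FIFO
mkfifo "$fifo" || exit

# Background writer: its open for writing blocks until the reader
# below opens the FIFO.
echo "this is the FIFO" >"$fifo" &

# Reader: its open for reading blocks until the writer's open
# completes; after that, read gets the line.
read line <"$fifo"
wait                    # reap the background writer
echo "got: $line"
rm -rf "$tmpdir"
```

This matches the trace above: in the hanging run it is exactly this
reader-side open that ends up sleeping forever after the ERESTART, even
though the writer has already come and gone.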
Yeah, you have to be a customer in order to open a case. I happen to maintain
bash in Solaris (among other things), which made me curious. I will file the
bug (as it is strange behavior if nothing else), but cannot make any promises.

> >At the moment it looks like a bug in Solaris to me, but it shows only on
> >VirtualBox.
>
> Thanks, that's good to have confirmed! I was hoping as much -- it would
> have been hard to believe that something this basic is broken on Solaris in
> general.

Heh :) I am a heavy shell scripter/user and I have found multiple bugs in
various shells. But I don't remember a single issue with bash.

> By the way, a side effect of developing modernish has been to put me on a
> crusade of sorts against shell bugs on all the shells, as I keep running
> into them.

Must be fun for sure :)

> The "standards compliant" /usr/xpg4/bin/sh (ksh88) on Solaris has a
> LOT of bugs, including some really serious ones that I wrote about on
> the Austin Group list last year.

Right, the xpgN tools are meant to be as stable as possible. That means that
your Solaris 8 scripts should still run fine on the current release (so should
binaries and kernel modules, by the way). This compatibility of course comes
at a price: the code there cannot evolve, or at least we are hesitant to
change it.

> Where/how could I reach a person who is interested in receiving bug
> reports?

We can take this offline and I'm happy to submit the bugs internally. Also
I'm curious what you have found :) But to be fair, generally the chances of
someone working on this without a customer contract are small.

Cheers
--
Vlad