Bug#593504: mawk: input buffering should be documented

Dima Kogan Sat, 14 Oct 2017 22:09:31 -0700

Resurrecting old bug thread because this hasn't yet been resolved.

This report was about an unexpected behavior of fflush(). The below was
expected to print() once per second, but instead it looks like it
printed all at once after all the sleeps:


$ for i in `seq 10` ; do echo x ; sleep 1 ; done | mawk '{ print $1 ; fflush() 
}' | ts
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x
Aug 18 12:17:14 x

I ran a few tests, and it looks like fflush() isn't doing anything
wrong, but rather the problem is that mawk buffers the INPUT in addition
to the output.

I did this:

$ date;                                                                  \
  for i in `seq 5` ; do                                                  \
      echo x;                                                            \
      sleep 1;                                                           \
  done | ~/projects/ltrace/ltrace -t -x '*@libc.so.6' -o /tmp/ltrace.out \
         mawk '{ print $1; fflush() }' | ts;                             \
  egrep 'flush' /tmp/ltrace.out | grep -v resumed

  Sat Oct 14 21:48:42 PDT 2017
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  Oct 14 21:48:47 x
  21:48:47 fflush( <unfinished ...>
  21:48:47 ffl...@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 ffl...@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 ffl...@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 ffl...@libc.so.6( <unfinished ...>
  21:48:47 fflush( <unfinished ...>
  21:48:47 ffl...@libc.so.6( <unfinished ...>


I.e. we started at 21:48:42, and I'd expect prints at :42, :43, :44, ...
We don't get those. Ltrace DOES show the fflush() libc calls, but those
all happen at the end, not as the data comes in. This is wrong, and
implies that the problem is on the data READING end, not the WRITING
end. And I can prove it even more conclusively:

  $ date; \
    for i in `seq 5` ; do echo x ; sleep 1 ; done | mawk '{ system("date") }'

  Sat Oct 14 21:53:05 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017
  Sat Oct 14 21:53:10 PDT 2017

I would expect mawk to read the data a line at a time and to invoke the
'date' command each time. Instead we get a bunch of dates at the end,
meaning that mawk is buffering the input.

Finally, looking at the syscalls I get this:

  $ for i in `seq 5` ; do echo x ; sleep 1 ; done | strace -f -e 
read,execve,clone -t mawk '{ system("date") }'

  21:56:48 execve("/usr/bin/mawk", ["mawk", "{ system(\"date\") }"], [/* 40 
vars */]) = 0
  21:56:48 read(3, 
"\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200V\0\0\0\0\0\0"..., 832) = 832
  21:56:48 read(3, 
"\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\4\2\0\0\0\0\0"..., 832) = 832
  21:56:48 read(0, "x\n", 4096)           = 2
  21:56:48 read(0, "x\n", 4094)           = 2
  21:56:49 read(0, "x\n", 4092)           = 2
  21:56:50 read(0, "x\n", 4090)           = 2
  21:56:51 read(0, "x\n", 4088)           = 2
  21:56:52 read(0, "", 4086)              = 0
  21:56:53 clone(strace: Process 16988 attached
  child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x7f5b16f029d0) = 16988
  [pid 16988] 21:56:53 execve("/bin/sh", ["/bin/sh", "-c", "date"], [/* 40 vars 
*/]) = 0
  [pid 16988] 21:56:53 read(3, 
"\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\4\2\0\0\0\0\0"..., 832) = 832
  [pid 16988] 21:56:53 clone(strace: Process 16989 attached
  child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x7f292bae69d0) = 16989
  [pid 16989] 21:56:53 execve("/bin/date", ["date"], [/* 40 vars */]) = 0
  ....

I.e. mawk is reading the data when it comes in (the read() syscalls are
not delayed), but it must be chunking the data in userspace in some way
before doing any processing. I'm guessing this is a performance
optimization. When using gawk this doesn't happen, and the only
buffering to worry about is the output buffering, which can be disabled
with fflush(). If this extra layer of buffering isn't a bug (and I'm
assuming it's not), then there really isn't anything to do, but we
REALLY should add a note in the manpage, maybe next to where fflush() is
documented. Because this layer of buffering breaks the thing that
users like me try to achive with fflush().

Thanks!

Bug#593504: mawk: input buffering should be documented

Reply via email to