Date:        Mon, 13 Jan 2020 20:26:33 -0800
    From:        "Ronald F. Guilmette" <r...@tristatelogic.com>
    Message-ID:  <38942.1578975...@segfault.tristatelogic.com>

  | While developing some C code recently, and testing that, I came upon a
  | VERY surprising and unexpected result.  It appears that various flavors
  | of *NIX are in agreement that after a fork, the set of flag bits (e.g.
  | O_NONBLOCK) that are associated with any specific file descriptor that
  | was open prior to the fork will, after the fork, exist only as a single
  | shared set of flag bits... *not* one copy for the parent and a separate
  | set for the child.

That is correct.

  | My reading of both the old and faded hardcopy of the 1993 POSIX API
  | standard that I have here, as well as the newer draft

If you can point at language in the current draft that suggested the
interpretation you have, and I agree it can be read the way you are
reading it, then I will file a defect report, as that would be incorrect.

Remember that the standard describes the way the implementations work
(which as you pointed out is the same for this issue, everywhere), it
does not legislate how they must behave (though to claim to conform one
must implement what the standard requires ... very few of the free OS versions,
if any, claim to conform however).    But the standard should be able to
be used by people like you so you can correctly understand the way that
systems work (including the places where different implementations are
different - not that that is relevant here).  So if you are reasonably
interpreting the standard to say something it shouldn't, it should be
corrected.

But you do need some experience reading standards, they are not intended
as tutorials - and the language tends to be very precise; a slight difference
in wording, which in normal text would signify nothing at all, can in a
standard entirely change the meaning of what was said (which is similar
in a way to legal documents, like statutes, contracts, etc - the language is
very precisely defined).

  | I have prepared two short example programs exclusively to illustrate the
  | issue.

There was no real need, we all know (and expect) the behaviour you mentioned.

But no, sorry, I do not have time to trawl through the standard looking
for where it says how things should work, but I will look at any text you
point out to me which you believe says something different.

(Give references to the 2018 draft, not the ancient 1993 version, much
has changed in that period.   You can reference section numbers, page
numbers, or line numbers (but also quote the text in question))

  | This occurs in both cases shortly after the recently forked child processes
  | have expressly and deliberately un-set the O_NONBLOCK flag for a file
  | descriptor that they have inherited a copy of at the time of the preceding
  | fork.

That is what is supposed to happen.

  | How can diddling the flags for one file descriptor cause the flags of a
  | different file descriptor to magically change also?

Short (or not so short) answer, references to files in unix pass through
three levels of indirection.   This is ancient, from the earliest days.

At the bottom (close to the file) there is the inode (or vnode, or ...)
which is the actual description of a file, and contains all of the file's
data (the content of the file, and the meta-data like owner, access/modify
times (etc), permission bits, ...).   That one is identified by the combination
of an "inode number" and device (a description of the filesystem in which the
inode is housed).    (File names are simply data in directories that map
human-readable strings to inode numbers - the device for the inode is the
same as the device holding the directory .... Names have files, files do not
have names.)
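
To make "names have files" concrete, here is a rough sketch (error
checking omitted; "a" and "b" are just names invented for the example)
showing two names resolving to the one inode:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int
        main(void)
        {
                struct stat sa, sb;

                close(creat("a", 0644));    /* make a file, call it "a" */
                link("a", "b");             /* give the same file a 2nd name */
                stat("a", &sa);
                stat("b", &sb);
                /* same device + same inode number => the very same file */
                printf("same file: %s\n",
                    sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino
                    ? "yes" : "no");
                return 0;
        }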

Next up is the file table.   An entry in this is created for each open()
(and similar operations, like socket() etc) - the operations that create
a file descriptor from other data (a file name in the case of open()).
This is a single global shared table accessed by all of the processes
currently running.   There can be many file table entries referencing the
same file, for example
        fd1 = open("file", ...);
        fd2 = open("file", ...);
        fd3 = open("file", ...);
will generate 3 different file table entries.   Each entry contains a
reference to the inode (all 3 of these would reference the same one),
plus the access mode (read/write/read-write) (one of the missing parameters
buried in the ellipsis above), the current file offset, and almost all of
the mode flags relating to the file (all the ones that can be manipulated
using the fcntl(F_SETFL, ...) and fcntl(F_GETFL, ...) operations).   That
includes the non-blocking mode.
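
You can see the absence of sharing between separate opens with a tiny
test - this is just a sketch, error checking omitted, and "file" is
whatever existing file you have handy:

        #include <fcntl.h>
        #include <stdio.h>

        int
        main(void)
        {
                int fd1 = open("file", O_RDONLY);
                int fd2 = open("file", O_RDONLY);   /* 2nd file table entry */

                /* turn on O_NONBLOCK in fd1's file table entry */
                fcntl(fd1, F_SETFL, fcntl(fd1, F_GETFL) | O_NONBLOCK);

                printf("fd1: %d fd2: %d\n",         /* prints "fd1: 1 fd2: 0" */
                    (fcntl(fd1, F_GETFL) & O_NONBLOCK) != 0,
                    (fcntl(fd2, F_GETFL) & O_NONBLOCK) != 0);
                return 0;
        }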

Next up is the process file descriptor table.   This maps the small
integer file descriptors (0, 1, 2, ...) that each process is given
from open() (etc) to references to the file table.   All the system
calls that manipulate file descriptors are manipulating this table
(dup(), fcntl(F_DUPFD), fork(), ...).   There can be many file descriptors
that reference the same file table entry - the only way to get a new file
table entry is via one of the file descriptor creating sys calls, like
open(); everything else (like dup(), fork()) simply copies the reference
to a single file table entry from one file descriptor table slot to
another (either in the same process, or a different one - the SCM_RIGHTS
way of passing a file descriptor through a socket to another process does
the same thing).   Aside from the reference to the file table entry, about
the only interesting thing in the file descriptor table is the close on exec
flag.   That is (aside from the small integer itself) the only (current)
open file property that is per-process-file-descriptor.   Those flags (well,
that flag, as there is just one currently) are manipulated by the
fcntl(F_SETFD, ...) and fcntl(F_GETFD, ...) operations.   (The flags given
to open() contain (mostly) file table flags, but can also contain file
descriptor flags.)
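
That split is easy to demonstrate with dup() - one file table entry, two
file descriptor table slots - comparing what F_SETFL and F_SETFD each
affect.   Again just a sketch, error checking omitted, "file" a placeholder:

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int
        main(void)
        {
                int fd1 = open("file", O_RDONLY);
                int fd2 = dup(fd1);                 /* same file table entry */

                fcntl(fd1, F_SETFL, fcntl(fd1, F_GETFL) | O_NONBLOCK);
                fcntl(fd1, F_SETFD, FD_CLOEXEC);

                printf("fd2 O_NONBLOCK: %d\n",      /* 1: file table, shared */
                    (fcntl(fd2, F_GETFL) & O_NONBLOCK) != 0);
                printf("fd2 FD_CLOEXEC: %d\n",      /* 0: per descriptor */
                    (fcntl(fd2, F_GETFD) & FD_CLOEXEC) != 0);
                return 0;
        }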

If it didn't work like this, nothing in unix would work as expected.
Consider a simple script

        #! /bin/sh
        cat file1
        echo ---------------------
        cat file2

which you run as

        script > file3

There you expect (I hope) that file3 will end up containing the contents
of file1, then a line of dashes (or hyphens, or minus signs, whatever you
prefer to call that character, please ignore here that writing dashes
using echo is an unsafe operation) and then the contents of file2.

But consider how that works:

1. a shell is created in a new process to run that script.  When started,
the shell you typed the command into will have set the new shell's stdout
(fd 1) to be a reference to "file3" which it will have created (if necessary)
and opened for writing.

2. this new shell parses the script and one after another executes the 3
commands contained in it (at this level the #! line is just a comment).

3. for each command, it forks, then the child exec's the appropriate command
(cat, or echo in this case ... we'll just ignore the probability that "echo"
is built into the shell here; if it is, to work properly, the shell must at
least pretend to operate as if it were not, just more efficiently).


Consider what would happen here if your interpretation were correct, and
each forked child had a new reference to the output file (file3).   When the
script starts, all would be OK, and the first cat would be OK too, as nothing
has changed yet.   But when the "echo" was forked, its new reference to the
file would still be at offset 0, not at the end of the data now in file3
that was placed there by the first "cat" command.  That would be because you
expect that the file meta-data is not shared between parent/child after a
fork(), so nothing that the first cat changed (such as the current offset in
the file) would be reflected back into the parent process, so even if there
was a mechanism to do so, the shell running the script would have no way to
inform the echo command where to start in the file - when the fork() happens
to start echo, a new file reference would be made, and the offset would be
zero again.

What actually happens is that the shell (running the script) has fd 1,
which references its file descriptor table, and from there a file table
entry (which then references the inode that the directory lookup of "file3"
says contains the appropriate data).   When it forks, the file descriptor
table is copied, more or less unchanged, from parent to child.  The child
has as its fd 1 (its standard output) a reference to the exact same file
table entry.   The cat is exec'd (which changes nothing about the file
descriptor table, unless the close-on-exec flag was set, which it will not
be in this example) and cat runs with its standard output (fd 1) referencing
the same file table entry that was created when file3 was initially opened.
As it writes data into the file, the file offset field of that file table
entry is updated (this is how write(1, ...) write(1, ...) write(1, ...)
(which is approximately what cat does) appends data to the file, each write
follows the one that preceded it, unless someone does an lseek() which
changes the offset field explicitly).   When cat is finished, and the script
shell has waited for it, the file table entry's offset field will be at the
size of file1 (as an offset into file3) as that is how much data has been
copied.   The shell running the script is sharing that same file table
entry, so its offset into its standard output is the same.   And when it
forks again to run the echo command, echo starts with its standard output
referencing the same file table entry, with the same offset, and the line
of dashes it writes will be placed immediately after the contents that
came from file1, in file3, and the file offset will be advanced by the
number of dashes, plus 1 for the trailing newline that is also written.
The same thing happens when the second cat starts, and its data goes after
all of that.
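
The same offset sharing is easy to see directly in C - a sketch only,
error checking omitted, and "out" is just a name made up for the example:

        #include <fcntl.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int
        main(void)
        {
                int fd = open("out", O_WRONLY|O_CREAT|O_TRUNC, 0644);

                if (fork() == 0) {
                        write(fd, "child\n", 6);    /* advances the shared offset */
                        _exit(0);
                }
                wait(NULL);
                write(fd, "parent\n", 7);   /* lands after the child's data */
                return 0;                   /* "out" ends up: child, then parent */
        }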

Now, I'll admit, it is surprising that the "non-blocking" flag  (and perhaps
one or two others) is in the file table, and not the file descriptor table
where it would make a lot more sense.   But that is the way it was done when
it was initially implemented, and now we're stuck with it.

My guess for why is that the file table has lots of flags, always did,
so applications are used to that, and deal with ones they don't understand
or care about just fine.   On the other hand, the file descriptor table,
since creation, has only ever had one flag (initially there were none),
the close on exec flag, and there used to be no #define'd name for that
flag, so old applications tend to assume that getting any non-zero value
from fcntl(F_GETFD) means "close on exec is set" ... so adding new flags
there is likely to break a lot of old code.   But that is just my guess.

  | I am fairly thick skinned, so by all means, please feel free to tell me
  | if you think I'm just crazy, and if in fact the standard does not mean
  | what it says when it says "The child process shall have its own copy of
  | the parent's file descriptors."

It does mean exactly that.   But what that means is its own copy of the
file descriptor table - when it does close(1) that affects only its own
copy of its standard output, it releases its reference to the file table
entry that is referencing "file3"'s inode table entry, without affecting
the similar reference in the parent (or any other) process.
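
A sketch of that too (error checking omitted): the child closing its fd 1
releases only its own reference, and the parent's fd 1 keeps working:

        #include <stdio.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int
        main(void)
        {
                if (fork() == 0) {
                        close(1);           /* drops only the child's reference */
                        _exit(0);
                }
                wait(NULL);
                printf("parent's fd 1 is still open\n");    /* unaffected */
                return 0;
        }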

But you need to understand the 3 level reference chain to really understand
what that implies, and you will need to read the standard very carefully,
paying particular attention to slight wording differences (which is what
Mark, that is, shareware_systems, was attempting to point out, I believe).

It does not mean that it has its own copy of everything related to the
file (there is only one copy of the data in the file, for example,
filesystems that implement snapshots ignored here); if one of the two
processes changes the data, that changes it for all of them.  If someone
changes the access time, that changes it for all of them, etc.  For
better or worse, "non blocking mode", and (far more reasonably) the file
offset (current pointer into the file) are just the same - except those
are only shared amongst file descriptors that reference the same file
table entry.   This is why

        cat file & cat file

where both cat processes run at the same time each produces a complete
copy of the file - each cat process does its own open("file"), which makes
a new file table entry, and hence each has its own offset.  When the first
cat reads from the file, the offset of the other file table entry does
not change.    The output will be all mixed up, since both processes
share their parent's standard output, meaning they share one output
file offset pointer, so all of the output from both copies of the file
will appear there, in whatever order the scheduler allowed the two
cat processes to run, and depending upon just how quickly they each
create the output.   Every byte from "file" will appear twice, and if
you were able to colour the bytes from the first cat red, and the ones
from the second cat blue, and you looked at only the red (or blue) output
data, each would be a complete, ordered, copy of "file" - but as a whole
you'd have red and blue sections all mixed together in a meaningless
pattern (and since the colours do not exist in reality, unscrambling
this, when it happens, is hard).
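
The independent offsets half of that is simple to verify - once more a
sketch, error checking omitted, "file" being any file with some data in it:

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int
        main(void)
        {
                char buf[16];
                int fd1 = open("file", O_RDONLY);
                int fd2 = open("file", O_RDONLY);   /* own file table entry */

                read(fd1, buf, sizeof buf);         /* moves only fd1's offset */
                printf("fd1 at %ld, fd2 at %ld\n",  /* fd2 is still at 0 */
                    (long)lseek(fd1, 0, SEEK_CUR),
                    (long)lseek(fd2, 0, SEEK_CUR));
                return 0;
        }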

kre
