Date: Mon, 13 Jan 2020 20:26:33 -0800
From: "Ronald F. Guilmette" <r...@tristatelogic.com>
Message-ID: <38942.1578975...@segfault.tristatelogic.com>
| While developing some C code recently, and testing that, I came upon a
| VERY surprising and unexpected result. It appears that various flavors
| of *NIX are in agreement that after a fork, the set of flag bits (e.g.
| O_NONBLOCK) that are associated with any specific file descriptor that
| was open prior to the fork will, after the fork, exist only as a single
| shared set of flag bits... *not* one copy for the parent and a separate
| set for the child.

That is correct.

| My reading of both the old and faded hardcopy of the 1993 POSIX API
| standard that I have here, as well as the newer draft

If you can point at language in the current draft that suggests the interpretation you have, and I agree it can be read the way you are reading it, then I will file a defect report, as that would be incorrect.

Remember that the standard describes the way the implementations work (which, as you pointed out, is the same for this issue, everywhere); it does not legislate how they must behave (though to claim to conform one must implement what the standard requires ... very few of the free OS versions, if any, claim to conform however). But the standard should be able to be used by people like you so you can correctly understand the way that systems work (including the places where different implementations differ - not that that is relevant here). So if you are reasonably interpreting the standard to say something it shouldn't, it should be corrected.

But you do need some experience reading standards; they are not intended as tutorials, and the language tends to be very precise. A slight difference in wording, which in normal text would signify nothing at all, can in a standard entirely change the meaning of what was said (which is similar in a way to legal documents, like statutes, contracts, etc - the language is very precisely defined).

| I have prepared two short example programs exclusively to illustrate the
| issue.
There was no real need; we all know (and expect) the behaviour you mentioned. But no, sorry, I do not have time to trawl through the standard looking for where it says how things should work, but I will look at any text you point out to me which you believe says something different. (Give references to the 2018 draft, not the ancient 1993 version; much has changed in that period. You can reference section numbers, page numbers, or line numbers, but also quote the text in question.)

| This occurs in both cases shortly after the recently forked child processes
| have expressly and deliberately un-set the O_NONBLOCK flag for a file
| descriptor that they have inherited a copy of at the time of the preceding
| fork.

That is what is supposed to happen.

| How can diddling the flags for one file descriptor cause the flags of a
| different file descriptor to magically change also?

Short (or not so short) answer: references to files in unix pass through three levels of indirection. This is ancient, from the earliest days.

At the bottom (close to the file) there is the inode (or vnode, or ...) which is the actual description of a file, and contains all of the file's data (the content of the file, and the meta-data like owner, access/modify times (etc), permission bits, ...). That one is identified by the combination of an "inode number" and a device (a description of the filesystem in which the inode is housed). (File names are simply data in directories that map human strings to inode numbers - the device for the inode is the same as the device holding the directory .... Names have files, files do not have names.)

Next up is the file table. An entry in this is created for each open() (and similar operations, like socket() etc) - the operations that create a file descriptor from other data (a file name in the case of open()). This is a single global shared table accessed by all of the processes currently running.
There can be many file table entries referencing the same file; for example

	fd1 = open("file", ...);
	fd2 = open("file", ...);
	fd3 = open("file", ...);

will generate 3 different file table entries. Each entry contains a reference to the inode (all 3 of these would reference the same one), plus the access mode (read/write/read-write) (one of the missing parameters buried in the ellipsis above), the current file offset, and almost all of the mode flags relating to the file (all the ones that can be manipulated using the fcntl(F_SETFL, ...) and fcntl(F_GETFL, ...) operations). That includes non-blocking mode.

Next up is the process file descriptor table. This maps the small integer file descriptors (0, 1, 2, ...) that each process is given from open() (etc) to references to the file table. All the system calls that manipulate file descriptors are manipulating this table (dup(), fcntl(F_DUPFD), fork(), ...). There can be many file descriptors that reference the same file table entry - the only way to get a new file table entry is via one of the file descriptor creating sys calls, like open(); everything else (like dup(), fork()) simply copies the reference to a single file table entry from one file descriptor table slot to another (either in the same process, or a different one; the SCM_RIGHTS way of passing a file descriptor through a socket to another process does the same thing).

Aside from the reference to the file table entry, about the only interesting thing in the file descriptor table is the close-on-exec flag. That is (aside from the small integer itself) the only (current) open file property that is per-process-file-descriptor. Those flags (well, that flag, as there is just one currently) are manipulated by the fcntl(F_SETFD, ...) and fcntl(F_GETFD, ...) operations. (The flags given to open() contain (mostly) file table flags, but can also contain file descriptor flags.)
If it didn't work like this, nothing in unix would work as expected. Consider a simple script

	#! /bin/sh
	cat file1
	echo ---------------------
	cat file2

which you run as

	script > file3

There you expect (I hope) that file3 will end up containing the contents of file1, then a line of dashes (or hyphens, or minus signs, whatever you prefer to call that character; please ignore here that writing dashes using echo is an unsafe operation) and then the contents of file2.

But consider how that works:

1. a shell is created in a new process to run that script. When started, the shell you typed the command into will have set the new shell's stdout (fd 1) to be a reference to "file3", which it will have created (if necessary) and opened for writing.

2. this new shell parses the script and one after another executes the 3 commands contained in it (at this level the #! line is just a comment).

3. for each command, it forks, then the child exec's the appropriate command (cat, or echo, in this case ... we'll just ignore the probability that "echo" is built into the shell here; if it is, to work properly, the shell must at least pretend to operate as if it were not, just be more efficient).

Consider what would happen here if your interpretation were correct, and each forked child had a new reference to the output file (file3). When the script starts, all would be OK, and the first cat would be OK too, as nothing has changed yet. But when the "echo" was forked, its new reference to the file would still be at offset 0, not at the end of the data now in file3 that was placed there by the first "cat" command.
That would be because you expect that the file meta-data is not shared between parent/child after a fork(), so nothing that the first cat changed (such as the current offset in the file) would be reflected back into the parent process. So even if there were a mechanism to do so, the shell running the script would have no way to inform the echo command where to start in the file - when the fork() happens to start echo, a new file reference would be made, and the offset would be zero again.

What actually happens is that the shell (running the script) has fd 1, that references its file descriptor table, and from there a file table entry (which then references the inode that the directory lookup of "file3" says contains the appropriate data). When it forks, the file descriptor table is copied, more or less unchanged, from parent to child. The child has as its fd 1 (its standard output) a reference to the exact same file table entry. The cat is exec'd (which changes nothing about the file descriptor table, unless the close-on-exec flag was set, which it will not be in this example); cat runs with its standard output (fd 1) referencing the same file table entry that was created when file3 was initially opened.

As it writes data into the file, the file offset field of that table entry is updated (this is how write(1, ...) write(1, ...) write(1, ...) (which is approximately what cat does) appends data to the file; each write follows the one that preceded it, unless someone does an lseek() which changes the offset field explicitly). When cat is finished, and the script shell has waited for it, the file table entry's offset field will be at the size of file1 (as an offset into file3), as that is how much data has been copied. The shell running the script is sharing that same file table entry, so its offset into its standard output is the same.
And when it forks again to run the echo command, echo starts with its standard output referencing the same file table entry, with the same offset, and the line of dashes it writes will be placed immediately after the contents that came from file1, in file3, and the file offset will be advanced by the number of dashes, plus 1 for the trailing newline that is also written. The same thing happens when the second cat starts, and its data goes after all of that.

Now, I'll admit, it is surprising that the "non-blocking" flag (and perhaps one or two others) is in the file table, and not the file descriptor table where it would make a lot more sense. But that is the way it was done when it was initially implemented, and now we're stuck with it. My guess for why is that the file table has lots of flags, always did, so applications are used to that, and deal with ones they don't understand or care about just fine. On the other hand, the file descriptor table, since creation, has only ever had one flag (initially there were none), the close-on-exec flag, and there used to be no #define'd name for that flag, so old applications tend to assume that getting any non-zero value from fcntl(F_GETFD) means "close on exec is set" ... so adding new flags there is likely to break a lot of old code. But that is just my guess.

| I am fairly thick skinned, so by all means, please feel free to tell me
| if you think I'm just crazy, and if in fact the standard does not mean
| what it says when it says "The child process shall have its own copy of
| the parent's file descriptors."

It does mean exactly that. But what that means is its own copy of the file descriptor table - when it does close(1) that affects only its own copy of its standard output; it releases its reference to the file table entry that is referencing "file3"'s inode, without affecting the similar reference in the parent (or any other) process.
But you need to understand the 3 level reference chain to really understand what that implies, and you will need to read the standard very carefully, paying particular attention to slight wording differences (which is what Mark, that is, shareware_systems, was attempting to point out, I believe).

It does not mean that it has its own copy of everything related to the file; there is only one copy of the data in the file, for example (filesystems that implement snapshots ignored here). If one of the two processes changes the data, that changes it for all of them. If someone changes the access time, that changes it for all of them, etc. For better or worse, "non blocking mode", and (far more reasonably) the file offset (the current pointer into the file) are just the same - except those are only shared amongst file descriptors that reference the same file table entry. This is why

	cat file & cat file

where both cat processes run at the same time, each produces a complete copy of the file - each cat process does open("file"), which each makes a new file table entry, and hence each has its own offset. When the first cat reads from the file, the offset of the other file table entry does not change.

The output will be all mixed up, since both processes share their parent's standard output, meaning they share one output file offset pointer, so all of the output from both copies of the file will appear there, in whatever order the scheduler allowed the two cat processes to run, and depending upon just how quickly they each create the output. Every byte from "file" will appear twice, and if you were able to colour the bytes from the first cat red, and the ones from the second cat blue, and you looked at only the red (or blue) output data, each would be a complete, ordered, copy of "file" - but as a whole you'd have red and blue sections all mixed together in a meaningless pattern (and since the colours do not exist in reality, unscrambling this, when it happens, is hard).

kre