Re: sys_write() racy for multi-threaded append?

Michael K. Edwards Fri, 09 Mar 2007 22:45:46 -0800

I apologize for throwing around words like "stupid".  Whether or not
the current semantics can be improved, that's not a constructive way
to characterize them.  I'm sorry.


As three people have ably pointed out :-), the particular case of a
pipe/FIFO isn't seekable and doesn't need the f_pos member anyway
(it's effectively always O_APPEND).  That's what I get for checking
against standards documents at 3AM.  Of course, this has nothing to do
with the point that led me to comment on pipes/FIFOs (which was that
there exist file types that never return 0 < ret < nbytes).  And it was in
the context of a very explicit aside that f_pos is not _interesting_
on a pipe/FIFO, except as an indicator of total bytes written.  You
could only peek at this with an (admittedly non-portable) llseek(fd,
0, SEEK_CUR) anyway -- which you would only do for diagnostic
purposes.  But diagnosis of odd corner cases (rarely in my code,
usually in other people's) is what I do day in and day out, so for me
it would be worth having.
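
For reference, a minimal userspace sketch of what lseek() does on a pipe
today (POSIX specifies ESPIPE for pipes and FIFOs, and Linux behaves that
way), which is why the diagnostic peek described above is not available
as things stand.  The helper name pipe_lseek_errno() is made up for
illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe, try to read its file offset with
 * lseek(fd, 0, SEEK_CUR), and report the resulting errno.
 * Returns 0 if the call unexpectedly succeeds.
 */
int pipe_lseek_errno(void)
{
    int fds[2];
    int err;
    off_t pos;

    if (pipe(fds) != 0)
        return errno;

    errno = 0;
    pos = lseek(fds[1], 0, SEEK_CUR);       /* write end of the pipe */
    err = (pos == (off_t)-1) ? errno : 0;   /* save before close() */

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A caller would compare the result against ESPIPE; a process that wants a
"total bytes written" figure for a pipe currently has to keep that count
itself.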

In any case, you're all right that the standard doesn't require you to
do anything useful with f_pos on a pipe/FIFO.  But you're permitted to
make it useful if you want to:

<1003.1 lseek()>
The behavior of lseek() on devices which are incapable of seeking is
implementation-defined. The value of the file offset associated with
such a device is undefined.
</1003.1>

Tracking f_pos accurately when writes from multiple threads hit the
same fd (pipe or not) isn't portable, but I recall situations where it
would have been useful.  And if f_pos has to be kept at all in the
uncontended case, it costs you little or nothing to do it in a
thread-safe manner -- as long as you don't overconstrain the semantics
such that you forbid the transient overshoot associated with a short
write.  In fact, unless there's something I've missed, increasing
f_pos before entering vfs_write() happens to be _faster_ than the
current code for common load patterns, both single- and multi-threaded
(although getting the full benefit in the multi-threaded case will
take some fiddling with f_count placement).
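
A rough userspace analogy of "increase the position first, write
afterwards, correct on a short write" might look like the sketch below.
It is not kernel code and not the read_write.c rewrite: struct
tracked_file and tracked_write() are hypothetical names, the shared
offset is a C11 atomic rather than f_pos, and pwrite() stands in for
vfs_write().

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for struct file: an fd plus a shared offset. */
struct tracked_file {
    int fd;
    _Atomic int64_t pos;
};

/*
 * Reserve [start, start + len) by bumping the shared offset before the
 * write is issued, so concurrent writers never pick the same start.
 * On a short write or an error the unused tail is handed back; until
 * then other threads can observe the transient overshoot.
 */
ssize_t tracked_write(struct tracked_file *f, const void *buf, size_t len)
{
    int64_t start = atomic_fetch_add(&f->pos, (int64_t)len);
    ssize_t ret = pwrite(f->fd, buf, len, (off_t)start);
    size_t done = ret > 0 ? (size_t)ret : 0;

    if (done < len)
        atomic_fetch_sub(&f->pos, (int64_t)(len - done));
    return ret;
}
```

Handing back the tail is only a sketch of the correction step: if
another writer reserved a later range in the meantime, a short write
leaves a gap, and this code makes no attempt to close it.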

I say it costs "little or nothing" only because altering an loff_t
atomically is not free.  But even on 32-bit x86, which cannot update a
64-bit entity in memory atomically with an ordinary load or store (it
takes a locked cmpxchg8b), an uncontended spinlock
on a cacheline already in L1 is so cheap that making the f_pos changes
atomic will (I think) be lost in the noise.
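
To make the spinlock variant concrete, here is a small userspace sketch,
assuming a C11 atomic_flag as the lock and a plain 64-bit integer as the
position.  struct locked_pos and locked_pos_add() are hypothetical
names; the kernel would use its own spinlock primitives, and nothing
here measures the cost claimed above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit position guarded by a tiny spinlock. */
struct locked_pos {
    atomic_flag lock;
    int64_t pos;
};

/* Add delta to the position under the lock; return the old position. */
int64_t locked_pos_add(struct locked_pos *p, int64_t delta)
{
    int64_t old;

    /* Uncontended case: one test-and-set, no spinning. */
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ; /* spin until the current holder clears the flag */

    old = p->pos;
    p->pos = old + delta;

    atomic_flag_clear_explicit(&p->lock, memory_order_release);
    return old;
}
```

The returned old value plays the role of the write's starting offset,
so the lock is held only for the read-modify-write of the position and
not for the write itself.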

In any case, rewriting read_write.c is proving interesting.  I'll let
you all know if anything comes of it.  In the meantime, thanks for
your (really quite friendly under the circumstances) comments.

Cheers,
- Michael
