On Sat, Nov 30, 2013 at 1:08 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > I still don't see what could be wrong with the pipe_inode_info thing, > but the fact that it's been so consistent in your traces does make me > suspect it really is *that* particular slab.
I think I finally found it. I've spent waaayy too much time looking at and thinking about that code without seeing anything wrong, but this morning I woke up and thought to myself "What if.." And looking at the code again, I went "BINGO". All our reference counting etc seems right, but we have one very subtle bug: on the freeing path, we have a pattern like this: spin_lock(&inode->i_lock); if (!--pipe->files) { inode->i_pipe = NULL; kill = 1; } spin_unlock(&inode->i_lock); __pipe_unlock(pipe); if (kill) free_pipe_info(pipe); which on the face of it is trying to be very careful in not accessing the pipe-info after it is released by having that "kill" flag, and doing the release last. And it's complete garbage. Why? Because the thread that decrements "pipe->files" *without* releasing it, will very much access it after it has been dropped: that "__pipe_unlock(pipe)" happens *after* we've decremented the pipe reference count and dropped the inode lock. So another CPU can come in and free the structure concurrently with that __pipe_unlock(pipe). This happens in two places, and we don't actually need or want the pipe lock for the pipe->files accesses (since pipe->files is protected by inode->i_lock, not the pipe lock), so the solution is to just do the __pipe_unlock() before the whole dance about the pipe->files reference count. Patch appended. And no wonder nobody has ever seen it, because the race is unlikely as hell to ever happen. Simon, I assume it will be another few months before we can say "yeah, that fixed it", but I really think this is it. It explains all the symptoms, including "DEBUG_PAGEALLOC didn't catch it" (because the access happens just as it is released, and DEBUG_PAGEALLOC takes too long to actually free unmap the page etc). Linus
fs/pipe.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index d2c45e14e6d8..18f1a4b2dbbc 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -743,13 +743,14 @@ pipe_release(struct inode *inode, struct file *file) kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } + __pipe_unlock(pipe); + spin_lock(&inode->i_lock); if (!--pipe->files) { inode->i_pipe = NULL; kill = 1; } spin_unlock(&inode->i_lock); - __pipe_unlock(pipe); if (kill) free_pipe_info(pipe); @@ -1130,13 +1131,14 @@ err_wr: goto err; err: + __pipe_unlock(pipe); + spin_lock(&inode->i_lock); if (!--pipe->files) { inode->i_pipe = NULL; kill = 1; } spin_unlock(&inode->i_lock); - __pipe_unlock(pipe); if (kill) free_pipe_info(pipe); return ret;