On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <[email protected]> wrote: > On 05/09/2016 07:01 AM, [email protected] wrote: >> >> From: Filipe Manana <[email protected]> >> >> When we do a direct IO write against a preallocated extent (fallocate) >> that does not go beyond the i_size of the inode, we do the write operation >> without holding the inode's i_mutex (an optimization that landed in >> commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows >> for a very tiny time window where a race can happen with a concurrent >> fsync using the fast code path, as the direct IO write path creates first >> a new extent map (no longer flagged as a prealloc extent) and then it >> creates the ordered extent, while the fast fsync path first collects >> ordered extents and then it collects extent maps. This allows for the >> possibility of the fast fsync path to collect the new extent map without >> collecting the new ordered extent, and therefore logging an extent item >> based on the extent map without waiting for the ordered extent to be >> created and complete. This can result in a situation where after a log >> replay we end up with an extent not marked anymore as prealloc but it was >> only partially written (or not written at all), exposing random, stale or >> garbage data corresponding to the unwritten pages and without any >> checksums in the csum tree covering the extent's range. >> >> This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix >> race between fsync and lockless direct IO writes"). >> >> So fix this by creating first the ordered extent and then the extent >> map, so that this way if the fast fsync patch collects the new extent >> map it also collects the corresponding ordered extent. >> > > So this seems to just be trading one problem for another bad behavior. Now > instead we'll end up with an ordered extent but possibly not the extent map, > which is fine, but seems ugly, and is a subtle and undocumented behavior > that is going to bite us in the ass in the future. > > I know everybody loves lockless writes, but we need to protect for this > particular case in an explicit way so I don't go change something in the > future and forget about this dependency. What if we add a rwsem around > adding the extent map and the ordered extent. In fact we pull this dance > out into a common helper function so everybody who wants to do this just > calls the helper function that is easy to audit and understand, then we just > put a > > down_read(&BTRFS_I(inode)->lets_not_fuck_fsync)); > /* Do our extent map + ordered extent dance here */ > up_read(&BTRFS_I(inode)->lets_not_fuck_fsync)); > > and then where we gather all of this stuff in fsync we do a > down_write/up_write. This won't affect the buffered write case currently as > we're still protected by the i_mutex, and then makes the direct io case much > cleaner and safer. Then later on when we want to remove the i_mutex from > the buffered write case we have one less gotcha to worry about. Thanks,
Agreed. I though about something similar initially but I didn't want to increase the size of struct btrfs_inode. Would you mind that if instead of changing this patch I do one on top that introduces that helper function, and makes this and the other place in the dio path that does the same logic use it too? thanks > > Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
