On Wed, Mar 20, 2019 at 10:40:07PM +0800, Qu Wenruo wrote:
>
>
> On 2019/3/20 10:00 PM, Anand Jain wrote:
> >
> >
> >>> Also any idea why the generation number for the extent data is not
> >>> incremented [2] when -o nodatacow and notrunc option is used, is it
> >>> a bug? the dump-tree is taken with the script as below [1]
> >>> (this corruption is seen with or without generation number is
> >>> being incremented, but as another way to fix for the corruption we can
> >>> verify the inode EXTENT_DATA generation from the same disk from which
> >>> the data is read).
> >>
> >> For the generation part, it's the generation when data is written to
> >> disk.
> >>
> >> Truncation/nocow overwrite shouldn't really change the generation of
> >> existing file extents.
> >>
> >> So I'm afraid you can't use that generation to do the check.
> >
> > Any idea why it shouldn't change? Albeit there isn't new allocation
> > due to nodatacow and notrunc overwrite, but sure data is overwritten.

The references to the extent in the subvol trees hold a copy of the
extent's generation, so if the extent's generation is modified, all
the references to the extent in all the subvol trees have to be modified
too, or they fail transid verification later on.  Compared to pure datacow
(nodatasum, compress=none), this would be the same number of iops, only
fragmentation can be saved because of block overwrites (and even that
isn't saved all the time).

> > If that's the case then I would guess there will be bug in send receive
> > as well.

Send requires a read-only snapshot, and the snapshot's reference to the
nodatacow extents automatically turns on datacow for those extents.
Thus, send behaves correctly because nodatacow is disabled.

The nodatacow flag is advisory.  It doesn't prevent btrfs from relocating
data when needed.

> I'm not sure about the send part.
>
> On the other hand, if btrfs is going to update the generation of
> nodatacow file extent overwrite, it should cause pretty big performance
> degradation.
>
> The idea of nodatacow is to skip all the expensive csum, extent
> allocation (maybe not that expensive) and the race of subvol tree.

nodatacow also skips RAID data integrity checks.  Generally, applications
and admins should plan for any data put in a nodatacow file to be silently
corrupted at any time, i.e. the same situation as an ext4-on-mdadm setup.
This is the price for minimizing the overhead for a write to a nodatacow
extent.

> If we're going to update file extents for such case, we're re-introduce
> performance impact to users who don't want that impact at all.
> I don't believe it's worthy at all.
>
> Thanks,
> Qu
>
> >
> > Thanks, Anand
> >
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> [1]
> >>> umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
> >>> mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
> >>> echo 1st write: && \
> >>> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
> >>> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> >>> echo --- && \
> >>> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA" && \
> >>> echo 2nd write: && \
> >>> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
> >>> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> >>> echo --- && \
> >>> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
> >>>
> >>>
> >>> 1st write:
> >>>     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> >>>         generation 6 transid 6 size 4096 nbytes 4096
> >>>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> >>>         sequence 1 flags 0x3(NODATASUM|NODATACOW)
> >>>         atime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>         ctime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>         mtime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>         otime 1553058460.163985452 (2019-03-20 13:07:40)
> >>> ---
> >>>     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> >>>         generation 6 type 1 (regular)
> >>>         extent data disk byte 13631488 nr 4096
> >>>         extent data offset 0 nr 4096 ram 4096
> >>>         extent compression 0 (none)
> >>> 2nd write:
> >>>     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> >>>         generation 6 transid 7 size 4096 nbytes 4096
> >>>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> >>>         sequence 2 flags 0x3(NODATASUM|NODATACOW)
> >>>         atime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>         ctime 1553058460.189985450 (2019-03-20 13:07:40)
> >>>         mtime 1553058460.189985450 (2019-03-20 13:07:40)
> >>>         otime 1553058460.163985452 (2019-03-20 13:07:40)
> >>> ---
> >>>     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> >>>         generation 6 type 1 (regular)        <----- [2]
> >>>         extent data disk byte 13631488 nr 4096
> >>>         extent data offset 0 nr 4096 ram 4096
> >>>         extent compression 0 (none)
> >>>
> >>>
> >>> Thanks, Anand
> >>
>
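
For anyone repeating the comparison in [1], a small filter can pull the
generation out of each EXTENT_DATA item so the before/after values are
easier to compare. This is only a rough sketch: it assumes the dump-tree
text layout shown in the quoted output above (a "key (<ino> EXTENT_DATA
<offset>)" line followed by a "generation N ..." line), and the function
name extent_generations is made up for illustration; it is not part of
btrfs-progs.

```shell
# Hypothetical helper: reads `btrfs inspect-internal dump-tree` text on
# stdin and prints "<inode> <file offset> <generation>" for every
# EXTENT_DATA item. Assumes the layout seen in the quoted dump.
extent_generations() {
    awk '
        # Item header, e.g. "item 6 key (257 EXTENT_DATA 0) itemoff ..."
        /key \([0-9]+ EXTENT_DATA [0-9]+\)/ {
            line = $0
            sub(/.*key \(/, "", line)   # drop everything up to "key ("
            sub(/\).*/, "", line)       # drop ")" and the rest
            split(line, k, " ")         # k[1]=inode, k[3]=file offset
            ino = k[1]; off = k[3]; want = 1
            next
        }
        # First "generation N ..." line after an EXTENT_DATA header.
        # INODE_ITEM generation lines are skipped because want is 0 there.
        want && $1 == "generation" {
            print ino, off, $2
            want = 0
        }
    '
}
```

Run after each write, e.g.

    btrfs in dump-tree /dev/sdb | extent_generations

With the quoted dump, both the 1st and the 2nd write give "257 0 6",
while the inode's transid moves from 6 to 7.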