Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result
On Thu, Apr 06, 2017 at 11:28:01AM -0500, Eric Sandeen wrote:
> On 4/6/17 11:26 AM, Theodore Ts'o wrote:
> > On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
> >>
> >> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
> >> after umount/mount cycle, then changed back to original result after
> >> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)
> >
> > I haven't had time to look at this, but I'm not sure this test is a
> > reasonable one on the face of it.
> >
> > A file system may choose to optimize a file's extent tree for whatever
> > reason it wants, whenever it wants, including on an unmount --- and
> > that would not be an invalid thing to do. So to have an xfstests that
> > causes a test failure if a file system were to, say, do some cleanup
> > at mount or unmount time, or when the file is next opened, to merge
> > adjacent extents together (and hence change what is returned by
> > FIEMAP) might be strange, or even weird --- but is this any of user
> > space's business? Or anything we want to enforce as wrong wrong wrong
> > by xfstests?

So I was asking for a review from the ext4 side instead of queuing it
for the next xfstests update :)

> I had the same question. If the exact behavior isn't defined anywhere,
> I don't know what we can be testing, TBH.

Agreed, I was about to ask for the expected behavior today if there
were no new review comments on this patch.

Thanks for the comments and review!

Eryu
Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?
Andrei Borzenkov posted on Sun, 02 Apr 2017 09:30:46 +0300 as excerpted:

> 02.04.2017 03:59, Duncan wrote:
>>
>> 4) In fact, since an in-place convert is almost certainly going to take
>> more time than a blow-away and restore from backup,
>
> This caught my eye. Why? In-place convert just needs to recreate
> metadata. If you have multiple terabytes worth of data, copying them
> twice can hardly be faster.

Why twice? If you care about the data, by definition you already have
backups, so it's only copying back from those backups to the newly
created filesystem, right?

And if you don't have backups, then by definition you don't care about
the data, at least not enough for it to be worth the hassle of a backup,
and thus arguably not enough to be worth the hassle of a convert. So
just blow it away with a new mkfs and start from scratch, since you
self-evidently didn't care enough about the data for it to be worth a
backup anyway. No problem.

And actually, it's not even a single extra copy, if you schedule the
creation of the new filesystem as part of your normal backup regime in
place of what would otherwise be a full backup that you now don't have
to make: copying the data from the old filesystem to the new one simply
replaces the full backup that you'd otherwise be doing at the same point
in time.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Volume appears full but TB's of space available
Interesting. That's the first time I'm hearing this. If that's the case
I feel like it's a stretch to call it RAID10 at all. It sounds a lot
more like basic replication, similar to Ceph, only Ceph understands
failure domains and therefore can be configured to handle device
failure (albeit at a higher level).

I do of course keep backups but I chose RAID10 for the mix of
performance and reliability. It doesn't seem worth losing 50% of my
usable space for the performance gain alone. Thank you for letting me
know about this. Knowing that, I think I may have to reconsider my
choice here.

I've really been enjoying the flexibility of btrfs, which is why I
switched to it in the first place, but with experimental RAID5/6 and
what you've just told me I'm beginning to doubt that it's the right
choice.

What's more concerning is that I haven't found a good way to monitor
btrfs. I might be able to accept that the array can only handle a
single drive failure if I was confident that I could detect it, but so
far I haven't found a good solution for this.

___

John Petrini
NOC Systems Administrator // CoreDial, LLC // coredial.com
Hillcrest I, 751 Arbor Way, Suite 150, Blue Bell PA, 19422
P: 215.297.4400 x232 // F: 215.297.4401 // E: jpetr...@coredial.com

On Thu, Apr 6, 2017 at 10:42 PM, Chris Murphy wrote:
> On Thu, Apr 6, 2017 at 7:31 PM, John Petrini wrote:
>> Hi Chris,
>>
>> I've followed your advice and converted the system chunk to raid10. I
>> hadn't noticed it was raid0 and it's scary to think that I've been
>> running this array for three months like that. Thank you for saving me
>> a lot of pain down the road!
>
> For what it's worth, it is imperative to keep frequent backups with
> Btrfs raid10; it is in some ways more like raid0+1. It can only
> tolerate the loss of a single device. It will continue to function
> with 2+ devices in a very deceptive degraded state, until it
> inevitably hits dual missing chunks of metadata or data, and then it
> will faceplant. And then you'll be looking at a scrape operation.
>
> It's not like raid10 where you can lose one of each mirrored pair.
> Btrfs raid10 mirrors chunks, not drives. So your metadata and data are
> all distributed across all of the drives, and that in effect means you
> can only lose 1 drive. If you lose a 2nd drive, some amount of
> metadata and data will have been lost.
>
> --
> Chris Murphy
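On the monitoring question above: btrfs does keep per-device error
counters that can be polled from cron, so a crude watchdog is only a few
lines of shell. A minimal sketch, not an official tool; the mount point
and alert address are placeholders, and pairing it with a periodic
"btrfs scrub start" makes the idle mirror copies actually get read:

#!/bin/sh
# Alert on any non-zero btrfs per-device error counter.
MNT=/mnt/storage-array           # placeholder mount point
ADDR=admin@example.com           # placeholder alert address

# "btrfs device stats" prints one counter per line, value last;
# any line not ending in " 0" is an error worth a closer look.
ERRORS=$(btrfs device stats "$MNT" | grep -vE ' 0$')
if [ -n "$ERRORS" ]; then
	printf '%s\n' "$ERRORS" | mail -s "btrfs errors on $MNT" "$ADDR"
fi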
Re: mix ssd and hdd in single volume
Roman Mamedov posted on Mon, 03 Apr 2017 13:41:07 +0500 as excerpted:

> On Mon, 3 Apr 2017 11:30:44 +0300 Marat Khalili wrote:
>
>> You may want to look here: https://www.synology.com/en-global/dsm/Btrfs
>> . Somebody forgot to tell Synology, which already supports btrfs in all
>> hardware-capable devices. I think the Rubicon has been crossed in
>> 'mass-market NAS[es]', for good or not.
>
> AFAIR Synology did not come to this list asking for (any kind of) advice
> prior to implementing that (else they would have gotten the same kind of
> post from Duncan and others)[.] I don't remember seeing them actively
> contribute improvements or fixes, especially for the RAID5 or RAID6
> features (which they ADVERTISE on that page as a fully working part of
> their product).
> That doesn't seem honest to end users or playing nicely with the
> upstream developers. What the upstream gets instead is just those
> end-users coming here one by one some years later, asking how to fix
> a broken Btrfs RAID5 on an embedded box running some 3.10 or 3.14
> kernel.

And of course then the user gets the real state of btrfs and of btrfs
raid56 mode, particularly back that far, explained to them.

Along with that we'll explain that any data on it is in all likelihood
lost data, with little to no chance at recovery. And we'll point out
that if there was serious value in the data, they would have
investigated the state of the filesystem before they put that data on
it, and of course, as I've already said, they'd have had backups for
anything that was of more value than the time/resources/hassle of doing
those backups.

And if they're lucky, that NAS will have /been/ the backup, and they'll
still have the actual working copy at least, and can make another
backup ASAP just in case that working copy dies too. But if they're
unlucky...

Of course the user will then blame the manufacturer, but by that time
the warranty will be up, and even if not, while they /might/ get their
money back, that won't get their data back. And the manufacturer will
get a bad name, but by then, having taken the money and run, they'll be
on to something else, or perhaps be bought out by someone bigger, or be
out of business.

And all the user will be able to do is chalk it up to experience, and
mourn the loss of their kids' baby pictures/videos or their wedding
videos, or whatever. If they're /really/ lucky, they'll have put them
on facebook or youtube or whatever, and can retrieve at least those
from there.

Meanwhile, the user, having been once burned, may never use the by then
much improved btrfs, or even worse, never trust anything Linux, again.

Oh, well. The best we can do here is warn those that /do/ value their
data enough to do their research first, so they /do/ have those backups,
or at least use something a bit more mature than btrfs raid56 mode, and
of course continue to work on full btrfs stabilization.

And I like to think we're reasonably good at those warnings, anyway.
The stabilization, too, but that takes time and patience, plus coder
skill, the last of which I personally don't have, so I just pitch in
where I can, answering questions, etc.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
[PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG]
Cycle mounting btrfs can cause fiemap to return different results.
Like:

 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..31]:         25088..25119        32   0x0
   1: [32..63]:        25120..25151        32   0x0
   2: [64..95]:        25152..25183        32   0x0
   3: [96..127]:       25184..25215        32   0x1

But after the above fiemap, we get the correct merged result if we call
fiemap again:

 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1

[REASON]
Btrfs will try to merge extent maps when inserting a new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=0 len=64k)
   |        |  Found on-disk (ino, EXTENT_DATA, 0)
   |        |- add_extent_mapping()
   |        |- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=16k len=48k)
   |        |  Found on-disk (ino, EXTENT_DATA, 16k)
   |        |- add_extent_mapping()
   |        |  |- try_merge_map()
   |        |     Merge with previous em start=0 len=16k
   |        |     resulting em start=0 len=32k
   |        |- Return (em->start=0, len=32K)  << Merged result
   |- Strip off the unrelated range (0~16K) of the returned em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
      ^^^ Causing split fiemap extent.

And since in add_extent_mapping() the em is already merged, in the next
fiemap() call we will get the merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records the
previous fiemap extent.

We will always try to merge the current fiemap extent with the cached
one before calling fiemap_fill_next_extent().
Only when we fail to merge the current fiemap extent with the cached
one will we call fiemap_fill_next_extent() to submit the cached one.

So by this method, we can merge all fiemap extents.

It could also be done in fs/ioctl.c, however the problem is that if
fieinfo->fi_extents_max == 0, we have no space to cache the previous
fiemap extent.
So I chose to merge it in btrfs.

Signed-off-by: Qu Wenruo
---
v2:
  Since fiemap_extent_info has a limit for the number of fiemap_extents,
  it's possible that fiemap_fill_next_extent() returns 1 halfway. Remove
  the WARN_ON() which can cause a kernel warning if fiemap is called on
  a large compressed file.
v3:
  Rename finish_fiemap_extent() to check_fiemap_extent(), as in v3 we
  ensured that submit_fiemap_extent() submits the fiemap cache, so it
  just acts as a sanity check.
  Remove the BTRFS_MAX_EXTENT_SIZE limit in submit_fiemap_extent(), as
  an extent map can be larger than BTRFS_MAX_EXTENT_SIZE.
  Don't do backward jumps, suggested by David.
  Better sanity check and recoverable fix.

To David:
  What about adding a btrfs_debug_warn(), which will only call
  WARN_ON(1) if BTRFS_CONFIG_DEBUG is specified, for recoverable bugs?
  And modify ASSERT() to always WARN_ON() and exit an error code?
---
 fs/btrfs/extent_io.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e8192..c4cb65d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4353,6 +4353,123 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
 	return NULL;
 }
 
+/*
+ * To cache the previous fiemap extent
+ *
+ * Will be used for merging fiemap extents
+ */
+struct fiemap_cache {
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+	bool cached;
+};
+
+/*
+ * Helper to submit a fiemap extent.
+ *
+ * Will try to merge the current fiemap extent specified by @offset, @phys,
+ * @len and @flags with the cached one.
+ * And only when we fail to merge will the cached one be submitted as a
+ * fiemap extent.
+ *
+ * Return value is the same as fiemap_fill_next_extent().
+ */
+static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
+				struct fiemap_cache *cache,
+				u64 offset, u64 phys, u64 len, u32 flags)
+{
+	int ret = 0;
+
+	if (!cache->cached)
+		goto assign;
+
+	/*
+	 * Sanity check, extent_fiemap() should have ensured that new
+	 * fiemap extent won't overlap with cached one.
+	 * Not recoverable.
+	 *
+	 * NOTE: Physical address can overlap, due to compression
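The hunk is cut off above before the merge test itself. Conceptually,
the decision is just a contiguity check between the cached extent and
the incoming one; a paraphrased sketch of that check (kernel-style types
as in the patch; this is an illustration, not the verbatim code, whose
flag handling is slightly richer):

/*
 * Paraphrased sketch of the merge condition, not the verbatim patch:
 * an incoming fiemap extent can be folded into the cached one only
 * when it starts exactly where the cache ends, both logically and
 * physically, and the flags agree (the real code also has to deal
 * with flags such as FIEMAP_EXTENT_LAST when merging).
 */
static bool fiemap_cache_mergeable(const struct fiemap_cache *cache,
				   u64 offset, u64 phys, u32 flags)
{
	return cache->cached &&
	       cache->offset + cache->len == offset &&
	       cache->phys + cache->len == phys &&
	       cache->flags == flags;
}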
Re: Volume appears full but TB's of space available
On Thu, Apr 6, 2017 at 7:31 PM, John Petrini wrote:
> Hi Chris,
>
> I've followed your advice and converted the system chunk to raid10. I
> hadn't noticed it was raid0 and it's scary to think that I've been
> running this array for three months like that. Thank you for saving me
> a lot of pain down the road!

For what it's worth, it is imperative to keep frequent backups with
Btrfs raid10; it is in some ways more like raid0+1. It can only
tolerate the loss of a single device. It will continue to function
with 2+ devices in a very deceptive degraded state, until it
inevitably hits dual missing chunks of metadata or data, and then it
will faceplant. And then you'll be looking at a scrape operation.

It's not like raid10 where you can lose one of each mirrored pair.
Btrfs raid10 mirrors chunks, not drives. So your metadata and data are
all distributed across all of the drives, and that in effect means you
can only lose 1 drive. If you lose a 2nd drive, some amount of
metadata and data will have been lost.

-- 
Chris Murphy
Re: Volume appears full but TB's of space available
Hi Chris,

I've followed your advice and converted the system chunk to raid10. I
hadn't noticed it was raid0 and it's scary to think that I've been
running this array for three months like that. Thank you for saving me
a lot of pain down the road!

Also thank you for the clarification on the output - this is making a
lot more sense.

Regards,

John Petrini
Re: Volume appears full but TB's of space available
On Thu, Apr 6, 2017 at 7:15 PM, John Petrini wrote:
> Okay so I came across this bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1243986
>
> It looks like I'm just misinterpreting the output of btrfs fi df. What
> should I be looking at to determine the actual free space? Is "Free
> (estimated): 13.83TiB (min: 13.83TiB)" the proper metric?
>
> Simply running df does not seem to report the usage properly
>
> /dev/sdj         25T   11T  5.9T  65% /mnt/storage-array

Free should be correct. And df -h should be IEC units, so I'd expect it
to be closer to the value of btrfs fi usage than this. But the code has
changed over time; I'm not sure when the last adjustment was made.

-- 
Chris Murphy
Re: Volume appears full but TB's of space available
On Thu, Apr 6, 2017 at 6:47 PM, John Petrini wrote:
> sudo btrfs fi df /mnt/storage-array/
> Data, RAID10: total=10.72TiB, used=10.72TiB
> System, RAID0: total=128.00MiB, used=944.00KiB
> Metadata, RAID10: total=14.00GiB, used=12.63GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

The third line is kinda scary. The system chunk is raid0, so ostensibly
a single device failure means the entire array is lost. The fastest way
to fix it is:

btrfs balance start -mconvert=raid10,soft

That will make the system chunk raid10.

> sudo btrfs fi usage /mnt/storage-array/
> Overall:
>     Device size:                  49.12TiB
>     Device allocated:             21.47TiB
>     Device unallocated:           27.65TiB
>     Device missing:                  0.00B
>     Used:                         21.45TiB
>     Free (estimated):             13.83TiB      (min: 13.83TiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,RAID10: Size:10.72TiB, Used:10.71TiB

This is saying you have 10.72T of data. But because it's raid10, it
will take up 2x that much space. This is what's reflected by the
Overall: Used: value of 21.45T, plus some extra for metadata, which is
also 2x.

-- 
Chris Murphy
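The arithmetic is easy to sanity-check against the numbers above; a
rough sketch (rounded, and ignoring the small data slack and metadata):

# Everything in btrfs raid10 is stored twice, so raw usage is about 2x
# the logical data, and usable free space is about half the raw
# unallocated space.
echo "10.71 * 2" | bc      # 21.42, close to "Used: 21.45TiB"
echo "27.65 / 2" | bc -l   # 13.825, matching "Free (estimated): 13.83TiB"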
Re: Volume appears full but TB's of space available
Okay so I came across this bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1243986

It looks like I'm just misinterpreting the output of btrfs fi df. What
should I be looking at to determine the actual free space? Is "Free
(estimated): 13.83TiB (min: 13.83TiB)" the proper metric?

Simply running df does not seem to report the usage properly:

/dev/sdj         25T   11T  5.9T  65% /mnt/storage-array

Thank you,

John Petrini
Re: [PATCH] btrfs: add missing memset while reading compressed inline extents
At 04/07/2017 12:07 AM, Filipe Manana wrote:
> On Wed, Mar 22, 2017 at 2:37 AM, Qu Wenruo wrote:
>> At 03/09/2017 10:05 AM, Zygo Blaxell wrote:
>>> On Wed, Mar 08, 2017 at 10:27:33AM +0000, Filipe Manana wrote:
>>>> On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell wrote:
>>>>> From: Zygo Blaxell
>>>>>
>>>>> This is a story about 4 distinct (and very old) btrfs bugs.
>>>>>
>>>>> [...]
>>>>>
>>>>> So enough about bug #0, let's get back to bug #3 (this patch).
>>>>>
>>>>> An example of on-disk structure leading to data corruption:
>>>>>
>>>>>         item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
>>>>>                 inode generation 50 transid 50 size 47424 nbytes 49141
>>>>>                 block group 0 mode 100644 links 1 uid 0 gid 0
>>>>>                 rdev 0 flags 0x0(none)
>>>>>         item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
>>>>>                 inode ref index 3 namelen 10 name: DB_File.so
>>>>>         item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
>>>>>                 inline extent data size 1341 ram 4085 compress(zlib)
>>>>>         item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
>>>>>                 extent data disk byte 5367308288 nr 20480
>>>>>                 extent data offset 0 nr 45056 ram 45056
>>>>>                 extent compression(zlib)
>>>>
>>>> So this case is actually different from the reproducer below, because
>>>> once a file has prealloc extents, future writes will never be
>>>> compressed. That is, the extent at offset 4096 can not have compressed
>>>> content using steps like the reproducer below. I can only imagine that
>>>> after an falloc to create the extent at 4K, we cloned some compressed
>>>> extent from some other file into that offset.
>>>
>>> The above comes from one of my captures of the bug appearing in the
>>> wild, not the reproducer from Liu Bo. I'll make that clearer.
>>
>> In fact, I found that btrfs/137 with compress=lzo seems to trigger such
>> an "inline-then-regular" case with a very high possibility.
>>
>> If I abort the test just after populating the fs, I found the
>> possibility of having an inline-then-regular extent layout is over 70%.
>
> 100%, as explained in your fstests patch thread.

Although in the btrfs/137 case, it's "inline-then-hole"
Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result
At 04/07/2017 12:28 AM, Eric Sandeen wrote:
> On 4/6/17 11:26 AM, Theodore Ts'o wrote:
>> On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
>>> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
>>> after umount/mount cycle, then changed back to original result after
>>> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)
>>
>> I haven't had time to look at this, but I'm not sure this test is a
>> reasonable one on the face of it.
>>
>> A file system may choose to optimize a file's extent tree for whatever
>> reason it wants, whenever it wants, including on an unmount --- and
>> that would not be an invalid thing to do. So to have an xfstests that
>> causes a test failure if a file system were to, say, do some cleanup
>> at mount or unmount time, or when the file is next opened, to merge
>> adjacent extents together (and hence change what is returned by
>> FIEMAP) might be strange, or even weird --- but is this any of user
>> space's business? Or anything we want to enforce as wrong wrong wrong
>> by xfstests?
>>
>> - Ted
>
> I had the same question. If the exact behavior isn't defined anywhere,
> I don't know what we can be testing, TBH.
>
> -Eric

For unmount cleanup, it's acceptable.

But if we get a different fiemap result even though we didn't modify
the fs at all during the same mount, then it's at least
counter-intuitive.

And unfortunately, btrfs and ext3 share the same problem here:

=== Fiemap after cycle mount ===
/mnt/scratch/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..95]:         139264..139359      96 0x1000
   1: [96..127]:       139360..139391      32 0x1000
   2: [128..511]:      138112..138495     384 0x1000
   3: [512..1023]:     138752..139263     512 0x1000
   4: [1024..2047]:    143360..144383    1024 0x1000
   5: [2048..4095]:    145408..147455    2048 0x1001

=== Fiemap after cycle mount and sleep ===
/mnt/scratch/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        139264..139391     128 0x1000
   1: [128..511]:      138112..138495     384 0x1000
   2: [512..1023]:     138752..139263     512 0x1000
   3: [1024..2047]:    143360..144383    1024 0x1000
   4: [2048..4095]:    145408..147455    2048 0x1001

I was using the same excuse as you guys, until I found that btrfs is
merging extent maps at read time, which causes the problem. That's why
the 2nd fiemap call in btrfs returns the merged result.

We fix btrfs by caching the fiemap extent result before calling
fiemap_fill_next_extent(), and trying to merge with the cached result,
which fixes the problem quite easily.

Thanks,
Qu
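For anyone who wants to poke at this outside of xfs_io: the results
above come straight from the FS_IOC_FIEMAP ioctl, and a minimal caller
is short. A sketch using only the standard UAPI headers; the fixed
32-extent buffer and the minimal error handling are simplifications for
illustration:

/* Minimal FIEMAP dumper: prints one line per extent the filesystem
 * reports, similar to "xfs_io -c 'fiemap -v'". The loop needed to
 * fetch more than 32 extents is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i, count = 32;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("%u: logical=%llu phys=%llu len=%llu flags=0x%x\n", i,
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length, fe->fe_flags);
	}
	free(fm);
	close(fd);
	return 0;
}

Running this against the same file before and after a cycle mount shows
the split and merged extents above.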
Volume appears full but TB's of space available
Hello List,

I have a volume that appears to be full despite having multiple
terabytes of free space available. Just yesterday I ran a re-balance
but it didn't change anything. I've just added two more disks to the
array and am currently in the process of another re-balance but the
available space has not increased.

Currently I can still write to the volume (I haven't tried any large
writes) so I'm not sure if this is just a reporting issue or if writes
will eventually fail. Any help is appreciated. Here's the details:

uname -a
Linux yuengling.johnpetrini.com 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version
btrfs-progs v4.4

sudo btrfs fi df /mnt/storage-array/
Data, RAID10: total=10.72TiB, used=10.72TiB
System, RAID0: total=128.00MiB, used=944.00KiB
Metadata, RAID10: total=14.00GiB, used=12.63GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

sudo btrfs fi show /mnt/storage-array/
Label: none  uuid: e113ab87-7869-4ec7-9508-95691f455018
	Total devices 10 FS bytes used 10.73TiB
	devid    1 size 4.55TiB used 2.65TiB path /dev/sdj
	devid    2 size 4.55TiB used 2.65TiB path /dev/sdk
	devid    3 size 3.64TiB used 2.65TiB path /dev/sdd
	devid    4 size 3.64TiB used 2.65TiB path /dev/sdf
	devid    5 size 3.64TiB used 2.65TiB path /dev/sdg
	devid    6 size 3.64TiB used 2.65TiB path /dev/sde
	devid    7 size 3.64TiB used 2.65TiB path /dev/sdb
	devid    8 size 3.64TiB used 2.65TiB path /dev/sdc
	devid    9 size 9.10TiB used 149.00GiB path /dev/sdh
	devid   10 size 9.10TiB used 149.00GiB path /dev/sdi

sudo btrfs fi usage /mnt/storage-array/
Overall:
    Device size:                  49.12TiB
    Device allocated:             21.47TiB
    Device unallocated:           27.65TiB
    Device missing:                  0.00B
    Used:                         21.45TiB
    Free (estimated):             13.83TiB      (min: 13.83TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID10: Size:10.72TiB, Used:10.71TiB
   /dev/sdb        1.32TiB
   /dev/sdc        1.32TiB
   /dev/sdd        1.32TiB
   /dev/sde        1.32TiB
   /dev/sdf        1.32TiB
   /dev/sdg        1.32TiB
   /dev/sdh       72.00GiB
   /dev/sdi       72.00GiB
   /dev/sdj        1.32TiB
   /dev/sdk        1.32TiB

Metadata,RAID10: Size:14.00GiB, Used:12.63GiB
   /dev/sdb        1.75GiB
   /dev/sdc        1.75GiB
   /dev/sdd        1.75GiB
   /dev/sde        1.75GiB
   /dev/sdf        1.75GiB
   /dev/sdg        1.75GiB
   /dev/sdj        1.75GiB
   /dev/sdk        1.75GiB

System,RAID0: Size:128.00MiB, Used:944.00KiB
   /dev/sdb       16.00MiB
   /dev/sdc       16.00MiB
   /dev/sdd       16.00MiB
   /dev/sde       16.00MiB
   /dev/sdf       16.00MiB
   /dev/sdg       16.00MiB
   /dev/sdj       16.00MiB
   /dev/sdk       16.00MiB

Unallocated:
   /dev/sdb        2.31TiB
   /dev/sdc        2.31TiB
   /dev/sdd        2.31TiB
   /dev/sde        2.31TiB
   /dev/sdf        2.31TiB
   /dev/sdg        2.31TiB
   /dev/sdh        9.03TiB
   /dev/sdi        9.03TiB
   /dev/sdj        3.22TiB
   /dev/sdk        3.22TiB

Thank You,

John Petrini
Re: [PATCH v3] fstests: btrfs: Check if btrfs will create inline-then-regular file extents
At 04/07/2017 12:02 AM, Filipe Manana wrote:
> On Thu, Apr 6, 2017 at 2:28 AM, Qu Wenruo wrote:
>> Btrfs allows an inline file extent if and only if
>> 1) It's at offset 0
>> 2) It's smaller than min(max_inline, page_size)
>>    Although we don't specify if the size is before compression or
>>    after compression.
>>    At least according to current behavior, we are only limiting the
>>    size after compression.
>> 3) It's the only file extent
>>    So that if we append to an existing inline extent, it should be
>>    converted to regular file extents.
>>
>> However several users on the btrfs mail list have reported invalid
>> inline file extents, which only meet the first two conditions, but
>> have regular file extents following.
>>
>> The bug has been here for a long, long time, so long that we have
>> modified kernel and btrfs-progs to accept such a case, but it's still
>> not designed behavior, and must be detected and fixed.
>
> So after looking at this better, I'm not convinced it's a problem.
> Other than making btrfs/137 fail when compression is enabled, what
> problems have you observed?

Inline-then-regular file extent layout itself is a problem.

Just check the behavior of plain btrfs. An inline extent must be the
*only* extent; hole/regular/prealloc extents shouldn't co-exist with an
inline extent. Such an append should convert the inline extent to a
regular one.

Furthermore, this behavior also exposes the confusion around max_inline.
If we are limiting inline extent size by its compressed size rather
than its plain size, then we're hiding a new limit, the page size.

Thanks,
Qu

> btrfs/137 fails due to the incremental send code not being prepared
> for this case, which does not seem harmful to me because the inline
> extent represents 4096 bytes of uncompressed data. It would be a
> problem only if the uncompressed data was less than 4096 bytes.
>
> So unless there's evidence that this particular case causes problems
> somewhere, I don't think it's useful to have this test. As for
> btrfs/137, I'm sending a fix for the incremental send code.
>
> More comments below anyway.
>
>> Signed-off-by: Qu Wenruo
>> ---
>> v2:
>>   All suggested by Eryu Guan
>>   Use loop_counts instead of runtime to avoid naming confusion.
>>   Fix whitespace issues
>>   Use fsync instead of sync
>>   Use $XFS_IO_PROG instead of calling xfs_io directly
>>   Always output corruption possibility into seqres.full
>>
>> v3:
>>   All suggested by Filipe Manana
>>   Fix grammar errors
>>   Use simpler reproducer
>>   Add test case to compress group
>> ---
>>  tests/btrfs/140     | 111 ++++++++++++++++++++++++++++++++++++++++++++
>>  tests/btrfs/140.out |   2 +
>>  tests/btrfs/group   |   1 +
>>  3 files changed, 114 insertions(+)
>>  create mode 100755 tests/btrfs/140
>>  create mode 100644 tests/btrfs/140.out
>>
>> diff --git a/tests/btrfs/140 b/tests/btrfs/140
>> new file mode 100755
>> index 000..183d9cd
>> --- /dev/null
>> +++ b/tests/btrfs/140
>> @@ -0,0 +1,111 @@
>> +#! /bin/bash
>> +# FS QA Test 140
>> +#
>> +# Check if btrfs will create inline-then-regular file layout.
>> [...]
>> +inline_size=2048
>> +page_size=$(get_page_size)
>> +loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
>
> Sorry I forgot about this before. But why is the possibility not 100%?
> The code that skips the inline extent expansion/conversion is 100%
> deterministic (btrfs_file_write_iter() -> btrfs_cont_expand() ->
> btrfs_truncate_block()) si
Re: Do different btrfs volumes compete for CPU?
On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Try to
> run it for months in a crowded server room, and inevitably you'll get
> disconnections and data corruption. Tried different cables, brackets --
> same result. If you've ever used an eSATA connector, you'd feel it.

Been using eSATA here for multiple disk packs continuously connected
for a few years now, for 48TB of data (not enough room in the host for
the disks). Never suffered an eSATA disconnect. Had the usual cooling
fan failures and HDD failures due to old age.

All just a case of ensuring undisturbed clean cabling and a good UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,

Martin
Re: [PATCH 7/8] nowait aio: xfs
On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> > +	if (unaligned_io) {
> > +		/* If we are going to wait for other DIO to finish, bail */
> > +		if ((iocb->ki_flags & IOCB_NOWAIT) &&
> > +		     atomic_read(&inode->i_dio_count))
> > +			return -EAGAIN;
> >  		inode_dio_wait(inode);
> 
> This checks i_dio_count twice in the nowait case, I think it should be:
> 
> 	if (iocb->ki_flags & IOCB_NOWAIT) {
> 		if (atomic_read(&inode->i_dio_count))
> 			return -EAGAIN;
> 	} else {
> 		inode_dio_wait(inode);
> 	}
> 
> >  	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> >  		if (flags & IOMAP_DIRECT) {
> > +			/* A reflinked inode will result in CoW alloc */
> > +			if (flags & IOMAP_NOWAIT) {
> > +				error = -EAGAIN;
> > +				goto out_unlock;
> > +			}
> 
> This is a bit pessimistic - just because the inode has any shared
> extents we could still write into unshared ones.  For now I think this
> pessimistic check is fine, but the comment should be corrected.

Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
is already an existing reservation in the CoW fork then we'll have to
CoW and therefore can't satisfy the NOWAIT flag.

If there isn't already anything in the CoW fork, then we have to see if
there are shared blocks by calling _reflink_trim_around_shared.  That
performs a refcountbt lookup, which involves locking the AGF, so we also
can't satisfy NOWAIT.

IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
cover both write-to-reflinked-file cases.

--D
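For context on what this series buys userspace: the point is to let a
caller attempt I/O that fails fast instead of blocking. A sketch of the
calling pattern, assuming the flag surfaces as RWF_NOWAIT for
pwritev2(2), which is the interface this work eventually landed as in
mainline:

/*
 * Sketch of the userspace side: try a non-blocking write first and
 * fall back to a normal, possibly blocking, write when the kernel
 * says the operation would block.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

static ssize_t write_nowait_or_block(int fd, void *buf, size_t len,
				     off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);
	if (ret < 0 && errno == EAGAIN)
		/*
		 * Would block: e.g. the unaligned-DIO wait on
		 * i_dio_count, or a CoW allocation on a reflinked
		 * file, as discussed above. Retry and accept blocking.
		 */
		ret = pwritev2(fd, &iov, 1, off, 0);
	return ret;
}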
Re: [PATCH] Btrfs: fix invalid dereference in btrfs_retry_endio
On Thu, Apr 06, 2017 at 04:21:50PM +0200, David Sterba wrote:
> On Wed, Apr 05, 2017 at 02:04:19PM -0700, Liu Bo wrote:
> > When doing directIO repair, we have this oops
> >
> > [ 1458.532816] general protection fault: [#1] SMP
> > ...
> > [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
> > [ 1458.536893] task: 88082a42d100 task.stack: c90002b3c000
> > [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
> > ...
> > [ 1458.543261] Call Trace:
> > [ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
> > [ 1458.544374]  bio_endio+0xed/0x100
> > [ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
> > [ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
> > [ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
> > [ 1458.546224]  process_one_work+0x34d/0xb70
> > [ 1458.546570]  ? process_one_work+0x29e/0xb70
> > [ 1458.546938]  worker_thread+0x1cf/0x960
> > [ 1458.547263]  ? process_one_work+0xb70/0xb70
> > [ 1458.547624]  kthread+0x17d/0x180
> > [ 1458.547909]  ? kthread_create_on_node+0x70/0x70
> > [ 1458.548300]  ret_from_fork+0x31/0x40
> >
> > It turns out that btrfs_retry_endio is trying to get the inode from a
> > directIO page.
> >
> > This fixes the problem by using the preserved inode pointer,
> > done->inode. btrfs_retry_endio_nocsum has the same problem, and it's
> > fixed as well.
> >
> > Also cleanup the unused @start (which is too trivial for a separate
> > patch).
> >
> > Cc: David Sterba
> > Signed-off-by: Liu Bo
> 
> Reviewed-by: David Sterba
> 
> > ---
> >  fs/btrfs/inode.c | 14 ++++----------
> >  1 file changed, 4 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3ec5a05..8e71ed7 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7910,7 +7910,6 @@ struct btrfs_retry_complete {
> >  static void btrfs_retry_endio_nocsum(struct bio *bio)
> >  {
> >  	struct btrfs_retry_complete *done = bio->bi_private;
> > -	struct inode *inode;
> >  	struct bio_vec *bvec;
> >  	int i;
> >
> > @@ -7918,12 +7917,12 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
> >  		goto end;
> >
> >  	ASSERT(bio->bi_vcnt == 1);
> > -	inode = bio->bi_io_vec->bv_page->mapping->host;
> > -	ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode));
> > +	ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));
> 
> I was interested how it got here in the first place; inode from mapping
> hadn't been used in the initial support for DIO repair. It was added
> in the subpage blocksize preparation patchset (2dabb3248453b) and
> slipped through the review.

Yes, it's a regression, and I've found another regression in this
commit.

Thanks,

-liubo
Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result
On 4/6/17 11:26 AM, Theodore Ts'o wrote:
> On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
>>
>> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
>> after umount/mount cycle, then changed back to original result after
>> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)
>
> I haven't had time to look at this, but I'm not sure this test is a
> reasonable one on the face of it.
>
> A file system may choose to optimize a file's extent tree for whatever
> reason it wants, whenever it wants, including on an unmount --- and
> that would not be an invalid thing to do. So to have an xfstests that
> causes a test failure if a file system were to, say, do some cleanup
> at mount or unmount time, or when the file is next opened, to merge
> adjacent extents together (and hence change what is returned by
> FIEMAP) might be strange, or even weird --- but is this any of user
> space's business? Or anything we want to enforce as wrong wrong wrong
> by xfstests?

I had the same question. If the exact behavior isn't defined anywhere,
I don't know what we can be testing, TBH.

-Eric

> - Ted
Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result
On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
>
> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
> after umount/mount cycle, then changed back to original result after
> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)

I haven't had time to look at this, but I'm not sure this test is a
reasonable one on the face of it.

A file system may choose to optimize a file's extent tree for whatever
reason it wants, whenever it wants, including on an unmount --- and
that would not be an invalid thing to do. So to have an xfstests that
causes a test failure if a file system were to, say, do some cleanup
at mount or unmount time, or when the file is next opened, to merge
adjacent extents together (and hence change what is returned by
FIEMAP) might be strange, or even weird --- but is this any of user
space's business? Or anything we want to enforce as wrong wrong wrong
by xfstests?

- Ted
[PATCH] Btrfs: send, fix file hole not being preserved due to inline extent
From: Filipe Manana

Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar

In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7bf6 ("btrfs: add
missing memset while reading compressed inline extents")) except
trigger an unexpected case in the incremental send code path that makes
us issue an operation to write a hole when it's not needed, resulting
in more writes at the receiver and wasting space at the receiver.

So teach the incremental send code to deal with this particular case.

The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).

Signed-off-by: Filipe Manana
---
 fs/btrfs/send.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 456c890..f66095a 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5184,13 +5184,19 @@ static int is_extent_unchanged(struct send_ctx *sctx,
 	while (key.offset < ekey->offset + left_len) {
 		ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
 		right_type = btrfs_file_extent_type(eb, ei);
-		if (right_type != BTRFS_FILE_EXTENT_REG) {
+		if (right_type != BTRFS_FILE_EXTENT_REG &&
+		    right_type != BTRFS_FILE_EXTENT_INLINE) {
 			ret = 0;
 			goto out;
 		}
 
 		right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
-		right_len = btrfs_file_extent_num_bytes(eb, ei);
+		if (right_type == BTRFS_FILE_EXTENT_INLINE) {
+			right_len = btrfs_file_extent_inline_len(eb, slot, ei);
+			right_len = PAGE_ALIGN(right_len);
+		} else {
+			right_len = btrfs_file_extent_num_bytes(eb, ei);
+		}
 		right_offset = btrfs_file_extent_offset(eb, ei);
 		right_gen = btrfs_file_extent_generation(eb, ei);
 
@@ -5204,6 +5210,19 @@ static int is_extent_unchanged(struct send_ctx *sctx,
 			goto out;
 		}
 
+		/*
+		 * We just wanted to see if when we have an inline extent, what
+		 * follows it is a regular extent (wanted to check the above
+		 * condition for inline extents too). This should normally not
+		 * happen but it's possible for example when we have an inline
+		 * compressed extent representing data with a size matching
+		 * the page size (currently the same as sector size).
+		 */
+		if (right_type == BTRFS_FILE_EXTENT_INLINE) {
+			ret = 0;
+			goto out;
+		}
+
 		left_offset_fixed = left_offset;
 		if (key.offset < ekey->offset) {
 			/* Fix the right offset for 2a and 7. */
-- 
2.7.0.rc3
Re: [PATCH] btrfs: add missing memset while reading compressed inline extents
On Wed, Mar 22, 2017 at 2:37 AM, Qu Wenruo wrote:
>
> At 03/09/2017 10:05 AM, Zygo Blaxell wrote:
>>
>> On Wed, Mar 08, 2017 at 10:27:33AM +0000, Filipe Manana wrote:
>>>
>>> On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell wrote:
>>>> From: Zygo Blaxell
>>>>
>>>> This is a story about 4 distinct (and very old) btrfs bugs.
>>>>
>>>> Commit c8b978188c ("Btrfs: Add zlib compression support") added
>>>> three data corruption bugs for inline extents (bugs #1-3).
>>>>
>>>> Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
>>>> fixed bug #1: uncompressed inline extents followed by a hole and more
>>>> extents could get non-zero data in the hole as they were read. The
>>>> fix was to add a memset in btrfs_get_extent to zero out the hole.
>>>>
>>>> Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
>>>> fixed bug #2: compressed inline extents which contained non-zero bytes
>>>> might be replaced with zero bytes in some cases. This patch removed an
>>>> unhelpful memset from uncompress_inline, but the case where memset is
>>>> required was missed.
>>>>
>>>> There is also a memset in the decompression code, but this only covers
>>>> decompressed data that is shorter than the ram_bytes from the extent
>>>> ref record. This memset doesn't cover the region between the end of
>>>> the decompressed data and the end of the page. It has also moved
>>>> around a few times over the years, so there's no single patch to
>>>> refer to.
>>>>
>>>> This patch fixes bug #3: compressed inline extents followed by a hole
>>>> and more extents could get non-zero data in the hole as they were read
>>>> (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
>>>> The fix is the same: zero out the hole in the compressed case too, by
>>>> putting a memset back in uncompress_inline, but this time with correct
>>>> parameters.
>>>>
>>>> The last and oldest bug, bug #0, is the cause of the offending inline
>>>> extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless
>>>> quirk of behavior somewhere in the btrfs write code. In a few special
>>>> cases, an inline extent and hole are allowed to persist where they
>>>> normally would be combined with later extents in the file.
>>>>
>>>> A fast reproducer for bug #0 is presented below. A few offending
>>>> extents are also created in the wild during large rsync transfers with
>>>> the -S flag. A Linux kernel build (git checkout; make allyesconfig;
>>>> make -j8) will produce a handful of offending files as well. Once an
>>>> offending file is created, it can present different content to
>>>> userspace each time it is read.
>>>>
>>>> Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y
>>>> kernel back to v3.5 has this behavior. There are fossil records of
>>>> this bug's effects in commits all the way back to v2.6.32. I have no
>>>> reason to believe bug #0 wasn't present at the beginning of btrfs
>>>> compression support in v2.6.29, but I can't easily test kernels that
>>>> old to be sure.
>>>>
>>>> It is not clear whether bug #0 is worth fixing. A fix would likely
>>>> require injecting extra reads into currently write-only paths, and
>>>> most of the exceptional cases caused by bug #0 are already handled
>>>> now.
>>>>
>>>> Whether we like them or not, bug #0's inline extents followed by holes
>>>> are part of the btrfs de-facto disk format now, and we need to be able
>>>> to read them without data corruption or an infoleak. So enough about
>>>> bug #0, let's get back to bug #3 (this patch).
>>>>
>>>> An example of on-disk structure leading to data corruption:
>>>>
>>>>         item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
>>>>                 inode generation 50 transid 50 size 47424 nbytes 49141
>>>>                 block group 0 mode 100644 links 1 uid 0 gid 0
>>>>                 rdev 0 flags 0x0(none)
>>>>         item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
>>>>                 inode ref index 3 namelen 10 name: DB_File.so
>>>>         item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
>>>>                 inline extent data size 1341 ram 4085 compress(zlib)
>>>>         item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
>>>>                 extent data disk byte 5367308288 nr 20480
>>>>                 extent data offset 0 nr 45056 ram 45056
>>>>                 extent compression(zlib)
>>>
>>> So this case is actually different from the reproducer below, because
>>> once a file has prealloc extents, future writes will never be
>>> compressed. That is, the extent at offset 4096 can not have compressed
>>> content using steps like the reproducer below. I can only imagine that
>>> after an falloc to create the extent at 4K, we cloned some compressed
>>> extent from some other file into that offset.
>>
>> The above comes from one of my captures of the bug appearing in the
>> wild, not the reproducer from Liu Bo. I'll make that c
Re: [PATCH v3] fstests: btrfs: Check if btrfs will create inline-then-regular file extents
On Thu, Apr 6, 2017 at 2:28 AM, Qu Wenruo wrote:
> Btrfs allows an inline file extent if and only if
> 1) It's at offset 0
> 2) It's smaller than min(max_inline, page_size)
>    Although we don't specify if the size is before compression or
>    after compression.
>    At least according to current behavior, we are only limiting the
>    size after compression.
> 3) It's the only file extent
>    So that if we append to an existing inline extent, it should be
>    converted to regular file extents.
>
> However several users on the btrfs mail list have reported invalid
> inline file extents, which only meet the first two conditions, but
> have regular file extents following.
>
> The bug has been here for a long, long time, so long that we have
> modified kernel and btrfs-progs to accept such a case, but it's still
> not designed behavior, and must be detected and fixed.

So after looking at this better, I'm not convinced it's a problem.
Other than making btrfs/137 fail when compression is enabled, what
problems have you observed?

btrfs/137 fails due to the incremental send code not being prepared for
this case, which does not seem harmful to me because the inline extent
represents 4096 bytes of uncompressed data. It would be a problem only
if the uncompressed data was less than 4096 bytes.

So unless there's evidence that this particular case causes problems
somewhere, I don't think it's useful to have this test. As for
btrfs/137, I'm sending a fix for the incremental send code.

More comments below anyway.

> Signed-off-by: Qu Wenruo
> ---
> v2:
>   All suggested by Eryu Guan
>   Use loop_counts instead of runtime to avoid naming confusion.
>   Fix whitespace issues
>   Use fsync instead of sync
>   Use $XFS_IO_PROG instead of calling xfs_io directly
>   Always output corruption possibility into seqres.full
>
> v3:
>   All suggested by Filipe Manana
>   Fix grammar errors
>   Use simpler reproducer
>   Add test case to compress group
> ---
>  tests/btrfs/140     | 111 ++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/140.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 114 insertions(+)
>  create mode 100755 tests/btrfs/140
>  create mode 100644 tests/btrfs/140.out
>
> diff --git a/tests/btrfs/140 b/tests/btrfs/140
> new file mode 100755
> index 000..183d9cd
> --- /dev/null
> +++ b/tests/btrfs/140
> @@ -0,0 +1,111 @@
> +#! /bin/bash
> +# FS QA Test 140
> +#
> +# Check if btrfs will create inline-then-regular file layout.
> +#
> +# Btrfs only allows inline file extent if file is small enough, and any
> +# incoming write enlarges the file beyond max_inline should replace inline
> +# extents with regular extents.
> +# This is a long standing bug, so fsck won't detect it and kernel will allow
> +# normal read on it, but should be avoided.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2017 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_btrfs_command inspect-internal dump-tree
> +
> +inline_size=2048
> +page_size=$(get_page_size)
> +loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case

Sorry I forgot about this before. But why is the possibility not 100%?
The code that skips the inline extent expansion/conversion is 100%
deterministic (btrfs_file_write_iter() -> btrfs_cont_expand() ->
btrfs_truncate_block()) since the first extent always has a size of
4096 bytes. So even if we end up having this test for some reason,
there's no need to have loops and the test could be added to the quick
group.

> +
> +do_test()
> +{
> +	_scratch_mkfs > /dev/null 2>&1
> +
> +	# specify max_inline and compress explicitly to create
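For reference, the deterministic sequence Filipe describes can be hit
by hand with the same steps as the reproducer in the send fix elsewhere
in this thread; a sketch, assuming a scratch device /dev/sdb
(placeholder):

mkfs.btrfs -f /dev/sdb
mount -o compress /dev/sdb /mnt
xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
umount /mnt
# The first EXTENT_DATA item should be inline, the next a regular extent:
btrfs inspect-internal dump-tree /dev/sdb | grep -A1 'EXTENT_DATA'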
Re: [PATCH v2] btrfs: fiemap: Cache and merge fiemap extent before submit it to user
On Thu, Apr 06, 2017 at 05:05:16PM +0800, Qu Wenruo wrote:
> [BUG]
> Cycle mount btrfs can cause fiemap to return different result.
>
> [...]
>
> Signed-off-by: Qu Wenruo
> ---
> v2:
>   Since fiemap_extent_info has a limit for number of fiemap_extent,
>   it's possible that fiemap_fill_next_extent() return 1 halfway.
>   Remove the WARN_ON() which can cause kernel warning if fiemap is
>   called on large compressed file.
> ---
>  fs/btrfs/extent_io.c | 116 +++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 110 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 28e81922a21c..84f4090dfaff 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4353,6 +4353,107 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
>  	return NULL;
>  }
>
> +/*
> + * To cache previous fiemap extent
> + *
> + * Will be used for merging fiemap extent
> + */
> +struct fiemap_cache {
> +	bool cached;
> +	u64 offset;
> +	u64 phys;
> +	u64 len;
> +	u32 flags;
> +};
> +
> +/*
> + * Helper to submit fiemap extent.
> + *
> + * Will try to merge current fiemap extent specified by @offset, @phys,
> + * @len and @flags with cached one.
> + * And only when we fails to merge, cached one will be submitted as
> + * fiemap extent.
> + *
> + * Return 0 if merged or submitted.
> + * Return <0 for error.
> + */
> +static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> +				struct fiemap_cache *cache,
> +				u64 offset, u64 phys, u64 len, u32 flags)
> +{
> +	int ret;
> +
> +	if (!cache->cached) {
> +assign:
> +		cache->cached = true;
> +		cache->offset = offset;
> +		cache->phys = phys;
> +		cache->len = len;
> +		cache->flags = flags;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Sanity check, extent_fiemap() should have ensured that new
> +	 * fiemap extent won't overlap with cahced one.
> +	 * NOTE: Physical address can overlap, due to compression
> +	 */
> +	WARN_ON(cache->offset + cache->len > offset);
> +
> +	/*
> +	 * Only merge fiemap extents if
Re: [PATCH v2] btrfs: fiemap: Cache and merge fiemap extent before submit it to user
On Thu, Apr 06, 2017 at 05:05:16PM +0800, Qu Wenruo wrote: > [BUG] > Cycle mount btrfs can cause fiemap to return different results. > Like: > # mount /dev/vdb5 /mnt/btrfs > # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file > # xfs_io -c "fiemap -v" /mnt/btrfs/file > /mnt/test/file: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS >0: [0..127]: 25088..25215 128 0x1 > # umount /mnt/btrfs > # mount /dev/vdb5 /mnt/btrfs > # xfs_io -c "fiemap -v" /mnt/btrfs/file > /mnt/test/file: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS >0: [0..31]: 25088..25119 32 0x0 >1: [32..63]: 25120..25151 32 0x0 >2: [64..95]: 25152..25183 32 0x0 >3: [96..127]: 25184..25215 32 0x1 > But after the above fiemap, we get the correct merged result if we call fiemap > again. > # xfs_io -c "fiemap -v" /mnt/btrfs/file > /mnt/test/file: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS >0: [0..127]: 25088..25215 128 0x1 > > [REASON] > Btrfs will try to merge extent maps when inserting a new extent map. > > btrfs_fiemap(start=0 len=(u64)-1) > |- extent_fiemap(start=0 len=(u64)-1) >|- get_extent_skip_holes(start=0 len=64k) >| |- btrfs_get_extent_fiemap(start=0 len=64k) >| |- btrfs_get_extent(start=0 len=64k) >|| Found on-disk (ino, EXTENT_DATA, 0) >||- add_extent_mapping() >||- Return (em->start=0, len=16k) >| >|- fiemap_fill_next_extent(logic=0 phys=X len=16k) >| >|- get_extent_skip_holes(start=0 len=64k) >| |- btrfs_get_extent_fiemap(start=0 len=64k) >| |- btrfs_get_extent(start=16k len=48k) >|| Found on-disk (ino, EXTENT_DATA, 16k) >||- add_extent_mapping() >|| |- try_merge_map() >|| Merge with previous em start=0 len=16k >|| resulting em start=0 len=32k >||- Return (em->start=0, len=32K) << Merged result >|- Strip off the unrelated range (0~16K) of the returned em >|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K) > ^^^ Causing split fiemap extent. > > And since in add_extent_mapping() the em is already merged, in the next > fiemap() call, we will get the merged result. > > [FIX] > Here we introduce a new structure, fiemap_cache, which records the previous > fiemap extent. > > And we will always try to merge the current fiemap extent with the cached one before > calling fiemap_fill_next_extent(). > Only when we fail to merge the current fiemap extent with the cached one will we > call fiemap_fill_next_extent() to submit the cached one. > > So by this method, we can merge all fiemap extents. > > It could also be done in fs/ioctl.c, however the problem is that if > fieinfo->fi_extents_max == 0, we have no space to cache the previous fiemap > extent. > So I chose to merge it in btrfs. > > Signed-off-by: Qu Wenruo > --- > v2: > Since fiemap_extent_info has a limit for the number of fiemap_extents, it's > possible > that fiemap_fill_next_extent() returns 1 halfway. Remove the WARN_ON() which > can > cause a kernel warning if fiemap is called on a large compressed file. > --- > fs/btrfs/extent_io.c | 116 > --- > 1 file changed, 110 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 28e81922a21c..84f4090dfaff 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -4353,6 +4353,107 @@ static struct extent_map > *get_extent_skip_holes(struct inode *inode, > return NULL; > } > > +/* > + * To cache previous fiemap extent > + * > + * Will be used for merging fiemap extent > + */ > +struct fiemap_cache { > + bool cached; > + u64 offset; > + u64 phys; > + u64 len; > + u32 flags; Please move bool cached after flags, for better packing. > +}; > + > +/* > + * Helper to submit fiemap extent. > + * > + * Will try to merge current fiemap extent specified by @offset, @phys, > + * @len and @flags with cached one. > + * And only when we fail to merge, cached one will be submitted as > + * fiemap extent. > + * > + * Return 0 if merged or submitted. > + * Return <0 for error. > + */ > +static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo, > + struct fiemap_cache *cache, > + u64 offset, u64 phys, u64 len, u32 flags) > +{ > + int ret; > + > + if (!cache->cached) { > +assign: > + cache->cached = true; > + cache->offset = offset; > + cache->phys = phys; > + cache->len = len; > + cache->flags = flags; > + return 0; > + } > + > + /* > + * Sanity check, extent_fiemap() should have ensured that new > + * fiemap extent won't overlap with cached one. > + * NOTE: Physical address can overlap, due to compression > + */ > + WARN_ON(cache->offset + cache->len > offset); > +
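To illustrate the packing remark: with the bool first, the compiler has to insert seven bytes of padding before the u64 members, while moving it after flags lets it share the struct's tail padding (40 bytes down to 32 on x86_64, assuming default alignment). A sketch of the suggested layout, not the committed code:

struct fiemap_cache {
	u64 offset;
	u64 phys;
	u64 len;
	u32 flags;
	bool cached;	/* fits in the padding after flags */
};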
Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags
On Thu, Apr 06, 2017 at 03:20:43PM +0100, Filipe Manana wrote: > On Thu, Apr 6, 2017 at 3:18 PM, Eryu Guan wrote: > > On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote: > >> From: Filipe Manana > >> > >> For example NFS 4.2 supports fallocate but it does not support its > >> KEEP_SIZE flag, so we want to skip tests that use fallocate with that > >> flag on filesystems that don't support it. > >> > >> Suggested-by: Eryu Guan > >> Signed-off-by: Filipe Manana > >> --- > >> common/rc | 4 ++-- > >> 1 file changed, 2 insertions(+), 2 deletions(-) > >> > >> diff --git a/common/rc b/common/rc > >> index e1ab2c6..3d0f089 100644 > >> --- a/common/rc > >> +++ b/common/rc > >> @@ -2021,8 +2021,8 @@ _require_xfs_io_command() > >> "chproj") > >> testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1` > >> ;; > >> - "falloc" ) > >> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1` > >> + "falloc*" ) > > > > This doesn't work as expected with quotes. I can remove the quotes at > > commit time though. > Hum, it did work for me, strange. Quoted globs don't work the same way as unquoted globs inside a case statement; this

#!/bin/sh
i=abc
case $i in
abc*) echo notquoted;;
"abc*") echo quoted;;
esac

will print 'notquoted'. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix invalid dereference in btrfs_retry_endio
On Wed, Apr 05, 2017 at 02:04:19PM -0700, Liu Bo wrote: > When doing directIO repair, we have this oops > > [ 1458.532816] general protection fault: [#1] SMP > ... > [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs] > [ 1458.536893] task: 88082a42d100 task.stack: c90002b3c000 > [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs] > ... > [ 1458.543261] Call Trace: > [ 1458.543958] ? rcu_read_lock_sched_held+0xc4/0xd0 > [ 1458.544374] bio_endio+0xed/0x100 > [ 1458.544750] end_workqueue_fn+0x3c/0x40 [btrfs] > [ 1458.545257] normal_work_helper+0x9f/0x900 [btrfs] > [ 1458.545762] btrfs_endio_repair_helper+0x12/0x20 [btrfs] > [ 1458.546224] process_one_work+0x34d/0xb70 > [ 1458.546570] ? process_one_work+0x29e/0xb70 > [ 1458.546938] worker_thread+0x1cf/0x960 > [ 1458.547263] ? process_one_work+0xb70/0xb70 > [ 1458.547624] kthread+0x17d/0x180 > [ 1458.547909] ? kthread_create_on_node+0x70/0x70 > [ 1458.548300] ret_from_fork+0x31/0x40 > > It turns out that btrfs_retry_endio is trying to get the inode from a directIO > page. > > This fixes the problem by using the preserved inode pointer, done->inode. > btrfs_retry_endio_nocsum has the same problem, and it's fixed as well. > > Also clean up the unused @start (which is too trivial for a separate patch). > > Cc: David Sterba > Signed-off-by: Liu Bo Reviewed-by: David Sterba > --- > fs/btrfs/inode.c | 14 -- > 1 file changed, 4 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 3ec5a05..8e71ed7 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7910,7 +7910,6 @@ struct btrfs_retry_complete { > static void btrfs_retry_endio_nocsum(struct bio *bio) > { > struct btrfs_retry_complete *done = bio->bi_private; > - struct inode *inode; > struct bio_vec *bvec; > int i; > > @@ -7918,12 +7917,12 @@ static void btrfs_retry_endio_nocsum(struct bio *bio) > goto end; > > ASSERT(bio->bi_vcnt == 1); > - inode = bio->bi_io_vec->bv_page->mapping->host; > - ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode)); > + ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode)); I was interested in how it got here in the first place; the inode from the mapping wasn't used in the initial support for DIO repair. So it was added in the subpage blocksize preparation patchset (2dabb3248453b) and slipped through the review. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
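The underlying issue is that direct I/O pages are not inserted into the inode's page cache, so page->mapping->host is not a valid route to the inode in the repair endio path. A rough sketch of the resulting pattern (struct layout inferred from the hunk context above, and the body heavily simplified; not the verbatim kernel code):

struct btrfs_retry_complete {
	struct completion done;
	struct inode *inode;	/* preserved when the repair bio is submitted */
	u64 start;
	int uptodate;
};

static void btrfs_retry_endio_nocsum(struct bio *bio)
{
	struct btrfs_retry_complete *done = bio->bi_private;

	if (bio->bi_error)
		goto end;

	ASSERT(bio->bi_vcnt == 1);
	/* Use the preserved inode, never bv_page->mapping->host here */
	ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));

	done->uptodate = 1;
end:
	complete(&done->done);
	bio_put(bio);
}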
Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags
On Thu, Apr 6, 2017 at 3:18 PM, Eryu Guan wrote: > On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote: >> From: Filipe Manana >> >> For example NFS 4.2 supports fallocate but it does not support its >> KEEP_SIZE flag, so we want to skip tests that use fallocate with that >> flag on filesystems that don't support it. >> >> Suggested-by: Eryu Guan >> Signed-off-by: Filipe Manana >> --- >> common/rc | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/common/rc b/common/rc >> index e1ab2c6..3d0f089 100644 >> --- a/common/rc >> +++ b/common/rc >> @@ -2021,8 +2021,8 @@ _require_xfs_io_command() >> "chproj") >> testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1` >> ;; >> - "falloc" ) >> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1` >> + "falloc*" ) > > This doesn't work as expected with quotes. I can remove the quotes at > commit time though. Hum, it did work for me, strange. But please do, thanks. > > Thanks, > Eryu > >> + testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1` >> ;; >> "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare") >> testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \ >> -- >> 2.7.0.rc3 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe fstests" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags
On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > For example NFS 4.2 supports fallocate but it does not support its > KEEP_SIZE flag, so we want to skip tests that use fallocate with that > flag on filesystems that don't support it. > > Suggested-by: Eryu Guan > Signed-off-by: Filipe Manana > --- > common/rc | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/common/rc b/common/rc > index e1ab2c6..3d0f089 100644 > --- a/common/rc > +++ b/common/rc > @@ -2021,8 +2021,8 @@ _require_xfs_io_command() > "chproj") > testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1` > ;; > - "falloc" ) > - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1` > + "falloc*" ) This doesn't work as expected with quotes. I can remove the quotes at commit time though. Thanks, Eryu > + testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1` > ;; > "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare") > testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \ > -- > 2.7.0.rc3 > > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG] Cycle mount btrfs can cause fiemap to return different results. Like: # mount /dev/vdb5 /mnt/btrfs # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/test/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 25088..25215 128 0x1 # umount /mnt/btrfs # mount /dev/vdb5 /mnt/btrfs # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/test/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..31]: 25088..25119 32 0x0 1: [32..63]: 25120..25151 32 0x0 2: [64..95]: 25152..25183 32 0x0 3: [96..127]: 25184..25215 32 0x1 But after the above fiemap, we get the correct merged result if we call fiemap again. # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/test/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 25088..25215 128 0x1 [REASON] Btrfs will try to merge extent maps when inserting a new extent map. btrfs_fiemap(start=0 len=(u64)-1) |- extent_fiemap(start=0 len=(u64)-1) |- get_extent_skip_holes(start=0 len=64k) | |- btrfs_get_extent_fiemap(start=0 len=64k) | |- btrfs_get_extent(start=0 len=64k) || Found on-disk (ino, EXTENT_DATA, 0) ||- add_extent_mapping() ||- Return (em->start=0, len=16k) | |- fiemap_fill_next_extent(logic=0 phys=X len=16k) | |- get_extent_skip_holes(start=0 len=64k) | |- btrfs_get_extent_fiemap(start=0 len=64k) | |- btrfs_get_extent(start=16k len=48k) || Found on-disk (ino, EXTENT_DATA, 16k) ||- add_extent_mapping() || |- try_merge_map() || Merge with previous em start=0 len=16k || resulting em start=0 len=32k ||- Return (em->start=0, len=32K) << Merged result |- Strip off the unrelated range (0~16K) of the returned em |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K) ^^^ Causing split fiemap extent. And since in add_extent_mapping() the em is already merged, in the next fiemap() call, we will get the merged result. [FIX] Here we introduce a new structure, fiemap_cache, which records the previous fiemap extent. And we will always try to merge the current fiemap extent with the cached one before calling fiemap_fill_next_extent(). Only when we fail to merge the current fiemap extent with the cached one will we call fiemap_fill_next_extent() to submit the cached one. So by this method, we can merge all fiemap extents. It could also be done in fs/ioctl.c, however the problem is that if fieinfo->fi_extents_max == 0, we have no space to cache the previous fiemap extent. So I chose to merge it in btrfs. Signed-off-by: Qu Wenruo --- v2: Since fiemap_extent_info has a limit for the number of fiemap_extents, it's possible that fiemap_fill_next_extent() returns 1 halfway. Remove the WARN_ON() which can cause a kernel warning if fiemap is called on a large compressed file. --- fs/btrfs/extent_io.c | 116 --- 1 file changed, 110 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 28e81922a21c..84f4090dfaff 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4353,6 +4353,107 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode, return NULL; } +/* + * To cache previous fiemap extent + * + * Will be used for merging fiemap extent + */ +struct fiemap_cache { + bool cached; + u64 offset; + u64 phys; + u64 len; + u32 flags; +}; + +/* + * Helper to submit fiemap extent. + * + * Will try to merge current fiemap extent specified by @offset, @phys, + * @len and @flags with cached one. + * And only when we fail to merge, cached one will be submitted as + * fiemap extent. + * + * Return 0 if merged or submitted. + * Return <0 for error.
+ */ +static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo, + struct fiemap_cache *cache, + u64 offset, u64 phys, u64 len, u32 flags) +{ + int ret; + + if (!cache->cached) { +assign: + cache->cached = true; + cache->offset = offset; + cache->phys = phys; + cache->len = len; + cache->flags = flags; + return 0; + } + + /* + * Sanity check, extent_fiemap() should have ensured that new + * fiemap extent won't overlap with cached one. + * NOTE: Physical address can overlap, due to compression + */ + WARN_ON(cache->offset + cache->len > offset); + + /* + * Only merge fiemap extents if + * 1) Their logical addresses are continuous + * + * 2) Their physical addresses are continuous + * So truly compressed (physical size smaller than logical size) + * extents won't get merged with each other + * + * 3) Share same flags except FIE
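The hunk is cut off above. Based on the changelog's description, the merge-and-submit logic presumably continues roughly as follows; this is a reconstructed sketch, with the exact flag mask (FIEMAP_EXTENT_LAST) and control flow assumed rather than quoted from the patch:

	if (cache->offset + cache->len == offset &&
	    cache->phys + cache->len == phys &&
	    (cache->flags & ~FIEMAP_EXTENT_LAST) ==
			(flags & ~FIEMAP_EXTENT_LAST)) {
		/* Mergeable: extend the cached extent and keep caching */
		cache->len += len;
		cache->flags |= flags;
		return 0;
	}

	/* Not mergeable: submit the cached extent, then cache the new one */
	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
				      cache->len, cache->flags);
	cache->cached = false;
	if (ret)
		return ret;
	goto assign;
}

Note that extent_fiemap() would then also need to flush any still-cached extent after its loop finishes, since otherwise the final extent would never be submitted to userspace.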
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu 06-04-17 11:12:02, NeilBrown wrote: > On Wed, Apr 05 2017, Jan Kara wrote: > >> If you want to ensure read-only files can remain cached over a crash, > >> then you would have to mark a file in some way on stable storage > >> *before* allowing any change. > >> e.g. you could use the lsb. Odd i_versions might have been changed > >> recently and crash-count*large-number needs to be added. > >> Even i_versions have not been changed recently and nothing need be > >> added. > >> > >> If you want to change a file with an even i_version, you subtract > >> crash-count*large-number > >> from the i_version, then set lsb. This is written to stable storage before > >> the change. > >> > >> If a file has not been changed for a while, you can add > >> crash-count*large-number > >> and clear lsb. > >> > >> The lsb of the i_version would be for internal use only. It would not > >> be visible outside the filesystem. > >> > >> It feels a bit clunky, but I think it would work and is the best > >> combination of Jan's idea and your requirement. > >> The biggest cost would be switching to 'odd' before any changes, and the > >> unknown is when it makes sense to switch to 'even'. > > > > Well, there is also a problem that you would need to somehow remember with > > which 'crash count' the i_version has been previously reported, as that is > > not stored on disk with my scheme. So I don't think we can easily use your > > scheme. > > I don't think there is a problem here; maybe I didn't explain > it properly or something. > > I'm assuming there is a crash-count that is stored once per filesystem. > This might be a disk-format change, or maybe the "Last checked" time > could be used with ext4 (that is a bit horrible though). > > Every on-disk i_version has a flag to choose between: > - use this number as it is, but update it on-disk before any change > - add a multiple of the current crash-count to this number before use. > If you crash during an update, the i_version is thus automatically > increased. > > To change from the first option to the second option you subtract the > multiple of the current crash-count (which might make the stored > i_version negative), and flip the bit. > To change from the second option to the first, you add the multiple > of the current crash-count, and flip the bit. > In each case, the externally visible i_version does not change. > Nothing needs to be stored except the per-inode i_version and the per-fs > crash_count. Right, I didn't realize you would subtract the crash counter when flipping the bit and then add it back when flipping again. That would work. > > So the options we have are: > > > > 1) Keep i_version as is, make clients also check for i_ctime. > >Pro: No on-disk format changes. > >Cons: After a crash, i_version can go backwards (but when the file changes, > >the i_version, i_ctime pair should still be different) or not, data can be > >old or not. > > I like to think of this approach as using the i_version as an extension > to the i_ctime. > i_ctime doesn't necessarily change on every file modification, either > because it is not a modification that is meant to change i_ctime, or > because i_ctime doesn't have the resolution to show a very small change > in time, or because the clock that is used to update i_ctime doesn't > have much resolution. > So when a change happens, if the stored c_time changes, set i_version to > zero, otherwise increment i_version. > Then the externally visible i_version is a combination of the stored > c_time and the stored i_version.
> If you only used 1-second ctime resolution for versioning purposes, you > could provide a 64bit i_version as 34 bits of ctime and 30 bits of > changes-in-one-second. > It is important that the resolution of ctime used is less than the > fastest possible restart after a crash. > > I don't think that i_version going backwards should be a problem, as > long as an old version means exactly the same old data. Presumably > journalling would ensure that the data and ctime/version are updated > atomically. So as Dave and I wrote earlier in this thread, journalling does not ensure data vs ctime/version consistency (well, except for ext4 in data=journal mode, but people rarely run that due to the performance implications). So you can get old data and a new version, as well as new data and an old version, after a crash. The only thing filesystems guarantee is that you will not see uninitialized blocks and that fsync makes both data & ctime/version persistent. But as Bruce wrote, for NFS open-to-close semantics this may actually be good enough. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
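To make the two schemes in this exchange concrete, here is a rough userspace sketch in C; all names, the per-crash step, and the 34/30-bit split are illustrative assumptions drawn from the discussion above, not code proposed in the thread:

#include <stdint.h>

/* Scheme A (lsb flag + crash count): the stored value's lsb is internal
 * and marks "recently changed"; such values get crash_count * CRASH_STEP
 * added on use, so an unclean shutdown bumps them automatically. */
#define CRASH_STEP (1ULL << 40)	/* the "large number" per crash, assumed */

static uint64_t visible_version(uint64_t stored, uint64_t crash_count)
{
	uint64_t v = stored >> 1;	/* drop the internal lsb */
	if (stored & 1)			/* odd: possibly changed before a crash */
		v += crash_count * CRASH_STEP;
	return v;
}

/* Scheme B (ctime extension): pack 34 bits of ctime seconds with a
 * 30-bit changes-in-one-second counter into one 64-bit i_version. */
static uint64_t packed_version(uint64_t ctime_sec, uint32_t changes)
{
	return ((ctime_sec & ((1ULL << 34) - 1)) << 30) |
	       (changes & ((1U << 30) - 1));
}

With scheme A, flipping the internal bit while adding or subtracting crash_count * CRASH_STEP leaves visible_version() unchanged across the transition, which is exactly the property Jan and Neil agree makes the scheme workable.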