Re: [PATCHv2 00/24] tree-log inode vs btrfs_inode cleanups
On 19.01.2017 20:21, David Sterba wrote:
> On Wed, Jan 18, 2017 at 12:31:26AM +0200, Nikolay Borisov wrote:
>> So here is a new set of patches cleaning up tree-log function
>> w.r.t inode vs btrfs_inode. There are still some which remain
>> but I didn't find compelling arguments to cleaning them up, so
>> I've left them unchanged. This time there is some size shrinkage:
>>
>> text    data   bss   dec     hex    filename
>> 2530598 174661 28288 2733547 29b5eb fs/btrfs/btrfs.ko - upstream master
>>
>> text    data   bss   dec     hex    filename
>> 2530774 174661 28288 2733723 29b69b fs/btrfs/btrfs.ko - before tree-log cleanup
>>
>> text    data   bss   dec     hex    filename
>> 2530163 174661 28288 2733112 29b438 fs/btrfs/btrfs.ko - both series applied
>>
>> So the net result of the 2 series is 435 bytes and I assume there
>> will be further reduction in size once further cleanups are made
>>
>> Changes since v1:
>> * Rebased all patches to latest master
>>
>> Nikolay Borisov (24):
>>   btrfs: Make btrfs_must_commit_transaction take btrfs_inode
>>   btrfs: Make btrfs_record_unlink_dir take btrfs_inode
>>   btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode
>>   btrfs: Make btrfs_inode_in_log take btrfs_inode
>>   btrfs: Make btrfs_log_new_name take btrfs_inode
>>   btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode
>>   btrfs: Make btrfs_del_inode_ref take btrfs_inode
>>   btrfs: Make logged_inode_size take btrfs_inode
>>   btrfs: Make btrfs_check_ref_name_override take btrfs_inode
>>   btrfs: Make copy_items take btrfs_inode
>>   btrfs: Make btrfs_log_all_xattrs take btrfs_inode
>>   btrfs: Make btrfs_log_trailing_hole take btrfs_inode
>>   btrfs: Make btrfs_get_logged_extents take btrfs_inode
>>   btrfs: Make btrfs_log_changed_extents take btrfs_inode
>>   btrfs: Make log_dir_items take btrfs_inode
>>   btrfs: Make log_directory_changes take btrfs_inode
>>   btrfs: Make log_new_dir_dentries take btrfs_inode
>>   btrfs: Make btrfs_unlink_inode take btrfs_inode
>>   btrfs: Make drop_one_dir_item take btrfs_inode
>>   btrfs: Make __add_inode_ref take btrfs_inode
>>   btrfs: Make log_inode_item take btrfs_inode
>>   btrfs: Make btrfs_log_inode take btrfs_inode
>>   btrfs: Make count_inode_extrefs take btrfs_inode
>>   btrfs: Make count_inode_refs take btrfs_inode
>
> Added to 4.11 queue, thanks. There were several 80+ character lines
> added by the patches; I've updated the patches during review. Please be
> more careful with changes like this.
>
> From https://patchwork.kernel.org/patch/9522051/
>
> -	if (S_ISDIR(inode->i_mode) ||
> +	if (S_ISDIR(inode->vfs_inode.i_mode) ||
> 	    (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> -		       &BTRFS_I(inode)->runtime_flags) &&
> -	     inode_only >= LOG_INODE_EXISTS))
> +		       &inode->runtime_flags) &&
> +	     inode_only == LOG_INODE_EXISTS))
>
> Notice the change from >= to == .

Yeah, I did notice this after the rebase I did when you told me patch 6
wasn't applying cleanly, however I thought I had fixed it during the
rebase. Apparently I do need to recheck things one more time for good
measure.

Anyway, thanks for pulling!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check lowmem vs original
At 01/20/2017 12:38 PM, Chris Murphy wrote:
> All of my Btrfs file systems, including new ones, have errors according
> to lowmem mode, and no errors reported at all for original mode. So
> which is correct? If lowmem mode is correct, then there are obviously
> kernel bugs that are causing problems right away, even on minutes old
> file systems.
>
> I can't tell from the output whether these are serious errors or minor
> errors either, because the output from check is totally
> incomprehensible. Further, the usage summary at the end, extent bytes,
> fs bytes, etc. are different, sometimes by an order of magnitude,
> between lowmem and original.
>
> Errors like this:
>
> ERROR: block group[46200258560 1073741824] used 1073741824 but extent items used 1144422400
> ERROR: block group[85928706048 1073741824] used 1073741824 but extent items used 0
> ERROR: block group[178270502912 1073741824] used 1073741824 but extent items used 1083211776
> ERROR: block group[360806612992 1073741824] used 1073479680 but extent items used 1074769920

Christoph Anton Mitterer also reported this bug. And it turns out to be
a false alert. Patch under test, should be out soon.
(But the image is not easy to create, so no test case yet)

> ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 818, owner: 73782, offset: 134217728) wanted: 4, have: 26

This is a little different. Not sure which fsck is wrong until a
btrfs-debug-tree extent and fs tree dump is provided.

> One file system has over 100 lines of (exactly the same thing, no difference):
>
> ERROR: data extent[16913485824 7577600] backref lost

Same, need a btrfs-debug-tree extent and fs tree dump.

> Another file system, 15 minutes old with two mounts in its whole
> lifetime, and only written with kernel 4.10-rc3, has over 30 lines with
> varying numbers:
>
> ERROR: root 257 EXTENT DATA[150134 11317248] prealloc shouln't have datasum
>
> That file system should have no preallocated extents (it's a clean
> installation of Fedora Rawhide, using only rsync)

btrfs-debug-tree will help to make sure what is wrong.

> Again, zero errors with original mode; but all file systems of all ages
> have errors with lowmem. This is kind of annoying to say the least;
> it's like we're adding dice to the check and repair situation on Btrfs,
> which is already incredibly more complicated and unhelpful than any
> other file system.

That's why lowmem mode is still not the default option.

The problem of original mode is, if you're checking a TB-level fs with
only 2 or 4G of RAM, then it's quite possible you run out of memory and
won't be able to check the fs at all, which is more severe than annoying.

Thanks,
Qu

> btrfs-progs 4.9. kernels vary from 4.4 to 4.10-rc4, but one fs is only
> a month old having used 4.7 and 4.8 kernels; and another one just
> 4.10-rc3 as I said.
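The block-group accounting behind the "used X but extent items used Y" errors in this thread comes down to a simple invariant: the EXTENT_ITEMs that fall inside a block group's range should add up to the block group item's `used` field. Below is a minimal model of that rule; the function name and data shapes are illustrative only, not the actual btrfs-progs lowmem code:

```python
def check_block_group(bg_start, bg_len, bg_used, extent_items):
    # Simplified model of the consistency rule behind lowmem mode's
    # "used X but extent items used Y" error (illustrative only; this is
    # not the btrfs-progs implementation).
    # extent_items: list of (bytenr, num_bytes) EXTENT_ITEMs.
    # Returns an error string, or None if the accounting matches.
    total = sum(num_bytes for bytenr, num_bytes in extent_items
                if bg_start <= bytenr < bg_start + bg_len)
    if total != bg_used:
        return ("ERROR: block group[%d %d] used %d but extent items used %d"
                % (bg_start, bg_len, bg_used, total))
    return None

# A block group whose extent items exactly cover its "used" bytes passes:
assert check_block_group(0, 1073741824, 16384, [(0, 16384)]) is None
# Counting an extent twice (one way such a false alert could arise):
err = check_block_group(0, 1073741824, 16384, [(0, 16384), (0, 16384)])
assert "used 16384 but extent items used 32768" in err
```

Either miscounting on the checker's side or real accounting damage trips the same message, which is why Qu asks for a btrfs-debug-tree dump before deciding which fsck is wrong.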
Re: [PATCH v4 2/3] fstests: test btrfs incremental send after moving a directory
On Thu, Jan 12, 2017 at 03:13:37AM +, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Test that an incremental send operation works after moving a directory
> into a new parent directory, deleting its previous parent directory and
> creating a new inode that has the same inode number as the old parent.
>
> This issue is fixed by the following patch for the linux kernel:
>
>   "Btrfs: incremental send, do not delay rename when parent inode is new"
>
> Signed-off-by: Robbie Ko
> Signed-off-by: Filipe Manana

I tested this patchset with/without the proposed kernel patches, all
tests worked as expected. Just one tiny update on the commit below.

> ---
> +# Remount the filesystem so that the next created inodes will have the numbers
> +# 258 and 259. This is because when a filesystem is mounted, btrfs sets the
> +# subvolume's inode counter to a value corresponding to the highest inode number
> +# in the subvolume plus 1. This inode counter is used to assign a unique number
> +# to each new inode and it's incremented by 1 after every inode creation.
> +# Note: we unmount and then mount instead of doing a mount with "-o remount"
> +# because otherwise the inode counter remains at value 260.
> +_scratch_unmount
> +_scratch_mount

I replaced these with _scratch_cycle_mount.

Thanks, Eryu
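The inode-counter behaviour the quoted comment describes — reset to highest-inode-plus-1 on a fresh mount, but kept in memory across "-o remount" — can be sketched with a toy model. The `Subvolume` class below is purely illustrative, not btrfs code:

```python
class Subvolume:
    # Toy model of btrfs's per-subvolume inode counter (not btrfs code).
    def __init__(self):
        self.inodes = {256}      # 256 is the subvolume's root directory
        self.counter = 257       # next inode number to hand out

    def create_inode(self):
        ino = self.counter
        self.inodes.add(ino)
        self.counter += 1        # incremented after every inode creation
        return ino

    def delete_inode(self, ino):
        self.inodes.remove(ino)  # the counter is NOT decremented on delete

    def cycle_mount(self):
        # On a real mount, btrfs sets the counter to the highest existing
        # inode number in the subvolume plus 1.
        self.counter = max(self.inodes) + 1

sv = Subvolume()
sv.create_inode(); sv.create_inode(); sv.create_inode()  # inodes 257, 258, 259
sv.delete_inode(258); sv.delete_inode(259)
assert sv.counter == 260  # "-o remount" keeps the counter: next inode is 260
sv.cycle_mount()
assert sv.counter == 258  # unmount + mount: next inodes are 258 and 259 again
```

This is exactly why the test must cycle the mount rather than remount: only a fresh mount re-derives the counter from what is actually on disk.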
btrfs check lowmem vs original
All of my Btrfs file systems, including new ones, have errors according
to lowmem mode, and no errors reported at all for original mode. So
which is correct? If lowmem mode is correct, then there are obviously
kernel bugs that are causing problems right away, even on minutes old
file systems.

I can't tell from the output whether these are serious errors or minor
errors either, because the output from check is totally incomprehensible.
Further, the usage summary at the end, extent bytes, fs bytes, etc. are
different, sometimes by an order of magnitude, between lowmem and
original.

Errors like this:

ERROR: block group[46200258560 1073741824] used 1073741824 but extent items used 1144422400
ERROR: block group[85928706048 1073741824] used 1073741824 but extent items used 0
ERROR: block group[178270502912 1073741824] used 1073741824 but extent items used 1083211776
ERROR: block group[360806612992 1073741824] used 1073479680 but extent items used 1074769920
ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 818, owner: 73782, offset: 134217728) wanted: 4, have: 26

One file system has over 100 lines of (exactly the same thing, no difference):

ERROR: data extent[16913485824 7577600] backref lost

Another file system, 15 minutes old with two mounts in its whole
lifetime, and only written with kernel 4.10-rc3, has over 30 lines with
varying numbers:

ERROR: root 257 EXTENT DATA[150134 11317248] prealloc shouln't have datasum

That file system should have no preallocated extents (it's a clean
installation of Fedora Rawhide, using only rsync).

Again, zero errors with original mode; but all file systems of all ages
have errors with lowmem. This is kind of annoying to say the least; it's
like we're adding dice to the check and repair situation on Btrfs, which
is already incredibly more complicated and unhelpful than any other file
system.

btrfs-progs 4.9. Kernels vary from 4.4 to 4.10-rc4, but one fs is only a
month old having used 4.7 and 4.8 kernels; and another one just
4.10-rc3 as I said.

--
Chris Murphy
[RFC] btrfs-progs: btrfs-corrupt-block refactor
This is the RFC proposal for refactoring btrfs-corrupt-block.

Just as Lakshmipathi and the xfstests guys commented in the btrfs ML, we
need btrfs-corrupt-block either to craft a test case, or for later RAID56
corruption test cases.

However the current situation of btrfs-corrupt-block has several
problems, and those problems prevent us from using it:

1) Badly arranged options
   Quite a lot of combinations make no sense. And quite a lot of
   functions cover each other. This is quite frustrating when someone
   wants to use it.

2) No documentation

3) No longer installed/compiled by default
   This is OK if such a tool is only used to corrupt things. But if the
   long term objective is to provide a tool to modify the fs offline at
   will, that's not a good idea.

So here I want to refactor the current btrfs-corrupt-block to achieve
the following objectives:

1) Logical layer divided parameters
   The current idea is like:
   btrfs-corrupt-block <destination> <options>

   <destination> is one of "item", "data", "leaf", "node" (with more to come)

   "item"/"leaf"/"node" common options are:
   "--nocow": Don't do CoW, but overwrite (default)
   "--cow": Do normal tree block CoW
   "--pattern DATA": The pattern to fill, default to 0xcd

   "item" options are:
   "--root NUMBER|ROOTID": Mandatory
   "--key (OBJID,TYPE,OFFSET)": Mandatory
   "--field FIELD": Optional, default to corrupting all fields of the item
   "--delete": Optional; if specified, the item will be deleted.
               Conflicts with "--field"

   "leaf" options are:
   "--bytenr BYTENR", or "--key" with "--root": Mandatory, to locate the leaf
   "--field FIELD": Optional, default to corrupting all fields of the leaf
                    header. With --key, can also be used to corrupt a
                    certain item field.
   NOTE: No delete is provided. Please use "node" "--delete" to delete a
   leaf, or use "--field ALL" with "--pattern 0x00" to wipe the leaf header.

   "node" options are:
   "--bytenr BYTENR", or "--key" "--root" and "--level": Mandatory, to
                    locate the node
   "--field FIELD": Optional, default to corrupting all fields of the node
                    header. With --key, can also be used to corrupt a
                    certain nodeptr field.
   "--delete": Optional, only works with "--key" "--root" "--level".
               Deletes the nodeptr.

   "data" options are:
   "--bytenr BYTENR": Mandatory, logical address
   "--length LENGTH": Optional, default to sectorsize of the fs
   "--mirror": Optional, default to all mirrors

   May add a "raid56" destination for raid56 tests later.

2) Better documentation
   Both asciidoc docs and better help strings

3) Better error strings
   At least the same level we have in other btrfs commands

Any advice and ideas are welcome.

Thanks,
Qu
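To make the proposed grammar concrete, here is a rough sketch of the option layout in Python argparse form. This only illustrates the *proposed* interface from this RFC — none of these options exist in btrfs-corrupt-block today, and the real tool is C, not Python:

```python
import argparse

def build_parser():
    # Sketch of the *proposed* btrfs-corrupt-block interface from this RFC.
    # None of these options exist yet; this only shows the option grouping.
    p = argparse.ArgumentParser(prog="btrfs-corrupt-block")
    sub = p.add_subparsers(dest="destination", required=True)

    def add_common(sp):
        # "item"/"leaf"/"node" common options
        cow = sp.add_mutually_exclusive_group()
        cow.add_argument("--nocow", action="store_true",
                         help="don't CoW, overwrite in place (default)")
        cow.add_argument("--cow", action="store_true",
                         help="do a normal tree block CoW")
        sp.add_argument("--pattern", default="0xcd",
                        help="fill pattern, default 0xcd")

    item = sub.add_parser("item")
    add_common(item)
    item.add_argument("--root", required=True)
    item.add_argument("--key", required=True, metavar="(OBJID,TYPE,OFFSET)")
    mode = item.add_mutually_exclusive_group()
    mode.add_argument("--field", help="default: corrupt all fields")
    mode.add_argument("--delete", action="store_true",
                      help="delete the item instead of corrupting it")

    data = sub.add_parser("data")
    data.add_argument("--bytenr", required=True, type=int,
                      help="logical address")
    data.add_argument("--length", type=int,
                      help="default: sectorsize of the fs")
    data.add_argument("--mirror", type=int,
                      help="default: all mirrors")
    return p

args = build_parser().parse_args(
    ["item", "--root", "5", "--key", "(257,INODE_ITEM,0)", "--delete"])
assert args.delete and args.root == "5"
```

Encoding "--field"/"--delete" as a mutually exclusive group is one way to enforce the "conflicts with --field" rule mechanically instead of documenting it in prose.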
Re: btrfs recovery
At 01/19/2017 06:06 PM, Sebastian Gottschall wrote:
> Hello
>
> I have a question. After a power outage my system was turning into an
> unrecoverable state using btrfs (kernel 4.9). Since I'm running
> --init-extent-tree now for 3 days, I'm asking how long this process
> normally takes and why it outputs millions of lines like

--init-extent-tree will trash *ALL* of the current extent tree, and
*REBUILD* it from the fs-trees.

This can take a long time depending on the size of the fs, and on how
many shared extents there are (snapshots and reflinks all count).

Such a huge operation should only be used if you're sure only the extent
tree is corrupted, and the other trees are all OK. Otherwise you'll just
screw up your fs further, especially if it's interrupted.

> Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not found in extent tree
> Incorrect local backref count on 1562890240 root 262 owner 483059214 offset 0 found 1 wanted 0 back 0x23b0211d0
> backpointer mismatch on [1562890240 4096]

This is common, since --init-extent-tree trashes the whole extent tree,
so every tree-block/data extent will trigger such output.

> adding new data backref on 1562890240 root 262 owner 483059214 offset 0 found 1
> Repaired extent references for 1562890240

But as you can see, it repaired the extent tree by adding the
EXTENT_ITEM/METADATA_ITEM back into the extent tree, so far it works.

If you see such output with all the same bytenr, then things are going
really wrong and it may be a dead loop.

Personally speaking, a normal problem like a failure to mount should not
need --init-extent-tree. Especially since extent-tree corruption
normally is not related to mount failure, but to a sudden remount to RO
and a kernel warning.

Thanks,
Qu

> please avoid typical answers like "potentially dangerous operation"
> since all repair options are declared as potentially dangerous.
>
> Sebastian
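Qu's description of the rebuild — throw away the extent tree, walk every fs tree, and re-add one backref at a time — maps directly onto the repeated "not found in extent tree" / "adding new data backref" pairs in the log. A toy model of that loop (hypothetical names, not the btrfs-progs implementation):

```python
def rebuild_extent_tree(fs_tree_refs):
    # Toy model of what --init-extent-tree does: the old extent tree is
    # discarded and reference counts are rebuilt purely from the backrefs
    # found by walking every fs tree. Illustrative only -- this is not the
    # btrfs-progs code, and the names here are made up.
    extent_tree = {}                    # bytenr -> num_refs, starts empty
    for bytenr, root, owner in fs_tree_refs:
        if bytenr not in extent_tree:
            # corresponds to "Backref ... num_refs 0 not found in extent tree"
            extent_tree[bytenr] = 0
        # corresponds to "adding new data backref on <bytenr> ..."
        extent_tree[bytenr] += 1
    return extent_tree

# The extent from the log above, shared by a second (snapshot) root:
refs = [(1562890240, 262, 483059214),
        (1562890240, 263, 483059214)]
tree = rebuild_extent_tree(refs)
assert tree[1562890240] == 2  # every shared ref is revisited, hence the cost
```

This also shows why snapshots and reflinks dominate the runtime: the walk visits one reference per sharing root, not one per extent.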
btrfs_alloc_tree_block: Faulting instruction address: 0xc02d4584
Hi,

after upgrading this powerpc32 box from 4.10-rc2 to -rc4, the message
below occurred a few hours after boot. Full dmesg and .config:
http://nerdbynature.de/bits/4.10-rc4/

Any ideas?
Thanks,
Christian.

Faulting instruction address: 0xc02d4584
Oops: Kernel access of bad area, sig: 11 [#1]
PowerMac
Modules linked in: ecb xt_tcpudp iptable_filter ip_tables x_tables nfnetlink_log nfnetlink sha256_generic twofish_generic twofish_common usb_storage therm_adt746x loop i2c_powermac arc4 firewire_sbp2 b43 rng_core ssb bcma mac80211 cfg80211 ecryptfs [last unloaded: nbd]
CPU: 0 PID: 1395 Comm: btrfs-transacti Tainted: G W 4.10.0-rc4-1-gab8184b #1
task: ee7162e0 task.stack: ee9cc000
NIP: c02d4584 LR: c02d4574 CTR: c00d0df0
REGS: ee9cdaa0 TRAP: 0300 Tainted: G W (4.10.0-rc4-1-gab8184b)
MSR: 9032 CR: 24422248 XER:
DAR: 10581054 DSISR: 4200
GPR00: c02d4574 ee9cdb50 ee71d4b8 10581050 0001 dbc88118 0020
GPR08: 1205 0004 24422444 93c3
GPR16: f000 0001 ee260800
GPR24: 0001 ee9cdc1b eef5c1a0 1000 ee47 dbc88118 ee470170
NIP [c02d4584] btrfs_alloc_tree_block+0x18c/0x5c4
LR [c02d4574] btrfs_alloc_tree_block+0x17c/0x5c4
Call Trace:
[ee9cdb50] [c02d4574] btrfs_alloc_tree_block+0x17c/0x5c4 (unreliable)
[ee9cdbf0] [c02b86d4] __btrfs_cow_block+0x110/0x638
[ee9cdc70] [c02b8d74] btrfs_cow_block+0xdc/0x1b0
[ee9cdca0] [c02bc48c] btrfs_search_slot+0x1c0/0x904
[ee9cdd10] [c02dc680] btrfs_lookup_inode+0x3c/0x124
[ee9cdd50] [c02ec204] btrfs_update_inode_item+0x4c/0x10c
[ee9cdd80] [c02d05e4] cache_save_setup+0xc0/0x400
[ee9cdde0] [c02d4d54] btrfs_start_dirty_block_groups+0x184/0x47c
[ee9cde50] [c02e7e84] btrfs_commit_transaction+0x148/0xac4
[ee9cdeb0] [c02e313c] transaction_kthread+0x1d0/0x1ec
[ee9cdf00] [c004f1fc] kthread+0xf8/0x124
[ee9cdf40] [c0011480] ret_from_kernel_thread+0x5c/0x64
--- interrupt: 0 at (null) LR = (null)
Instruction dump:
4800b3ed 7f838040 7c7e1b78 419d0430 806300d4 81db 81fb0004 4bdfe2b9 3924 7ee6bb78 38630050 7fc5f378 <7dc91d2c> 7de01d2c 809501cf 807501cb
---[ end trace 937683537ecd986b ]---
--
BOFH excuse #342: HTTPD Error 4004 : very old Intel cpu - insufficient processing power
Re: dup vs raid1 in single disk
On 2017-01-19 13:23, Roman Mamedov wrote:
> On Thu, 19 Jan 2017 17:39:37 +0100 "Alejandro R. Mosteo" wrote:
>> I was wondering, from a point of view of data safety, if there is any
>> difference between using dup or making a raid1 from two partitions in
>> the same disk. This is thinking on having some protection against the
>> typical aging HDD that starts to have bad sectors.
>
> RAID1 will write slower compared to DUP, as any optimization to make
> RAID1 devices work in parallel will cause a total performance disaster
> for you as you will start trying to write to both partitions at the
> same time, turning all linear writes into random ones, which are about
> two orders of magnitude slower than linear on spinning hard drives.
> DUP shouldn't have this issue, but still it will be twice slower than
> single, since you are writing everything twice.

As of right now, there will actually be near zero impact on write
performance (or at least, it's way less than the theoretical 50%)
because there really isn't any optimization to speak of in the
multi-device code. That will hopefully change over time, but it's not
likely to do so any time soon since nobody appears to be working on
multi-device write performance.

> You could consider DUP data for when a disk is already known to be
> getting bad sectors from time to time -- but then it's a fringe
> exercise to try and keep using such a disk in the first place. Yeah,
> with DUP data DUP metadata you can likely get some more life out of
> such a disk as a throwaway storage space for non-essential data, at
> half capacity, but is it worth the effort, as it's likely to start
> failing progressively worse over time. In all other cases the
> performance and storage space penalty of DUP within a single device
> are way too great (and the gained redundancy too low) compared to a
> proper system of single profile data + backups, or a RAID5/6 system
> (not Btrfs-based) + backups.

That really depends on your usage. In my case, I run DUP data on single
disks regularly. I still do backups of course, but the performance is
worth far less for me (especially in the cases where I'm using NVMe
SSD's which have performance measured in thousands of MB/s for both
reads and writes) than the ability to recover from transient data
corruption without needing to go to a backup.

As long as /home and any other write-heavy directories are on a separate
partition, I would actually advocate using DUP data on your root
filesystem if you can afford the space, simply because it's a whole lot
easier to recover other data if the root filesystem still works. Most of
the root filesystem except some stuff under /var follows a WORM access
pattern, and even the stuff that doesn't in /var is usually not
performance critical, so the write performance penalty won't have
anywhere near as much impact on how well the system runs as you might
think. There's also the fact that you're writing more metadata than data
most of the time unless you're dealing with really big files, and
metadata is already DUP mode (unless you are using an SSD), so the
performance hit isn't 50%, it's actually a bit more than half the ratio
of data writes to metadata writes.

>> On a related note, I see this caveat about dup in the manpage:
>>
>> "For example, a SSD drive can remap the blocks internally to a single
>> copy thus deduplicating them. This negates the purpose of increased
>> redunancy (sic) and just wastes space"
>
> That ability is vastly overestimated in the man page. There is no
> miracle content-addressable storage system working at 500 MB/sec
> speeds all within a little cheap controller on SSDs. Likely most of
> what it can do is just compress simple stuff, such as runs of zeroes
> or other repeating byte sequences.

Most of those that do in-line compression don't implement it in
firmware, they implement it in hardware, and even DEFLATE can hit
500 MB/second speeds if properly implemented in hardware. The firmware
may control how the hardware works, but it's usually the hardware doing
the heavy lifting in that case, and getting a good ASIC made that can
hit the required performance point for a reasonable compression
algorithm like LZ4 or Snappy is insanely cheap once you've gotten past
the VLSI work.

> And the DUP mode is still useful on SSDs, for cases when one copy of
> the DUP gets corrupted in-flight due to a bad controller or RAM or
> cable; you could then restore that block from its good-CRC DUP copy.

The only window of time during which bad RAM could result in only one
copy of a block being bad is after the first copy is written but before
the second is, which is usually an insanely small amount of time. As far
as the cabling, the window for errors resulting in a single bad copy of
a block is pretty much the same as for RAM, and if they're persistently
bad, you're more likely to lose data for other reasons.

That said, I do still feel that DUP mode has value on SSD's. The primary
arguments against it are:
1. It
Re: Raid 1 recovery
On Thu, Jan 19, 2017 at 12:15 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Wed, 18 Jan 2017 14:30:28 -0700 as excerpted:
>
>> On Wed, Jan 18, 2017 at 2:07 PM, Jon wrote:
>>> So, I had a raid 1 btrfs system setup on my laptop. Recently I
>>> upgraded the drives and wanted to get my data back. I figured I could
>>> just plug in one drive, but I found that the volume simply would not
>>> mount. I tried the other drive alone and got the same thing. Plugging
>>> in both at the same time and the volume mounted without issue.
>>
>> Requires mount option degraded.
>>
>> If this is a boot volume, this is difficult because the current udev
>> rule prevents a mount attempt so long as all devices for a Btrfs
>> volume aren't present.
>
> OK, so I've known about this from the list for some time, but what is
> the status with regard to udev/systemd (has a bug/issue been filed,
> results, link?), and what are the alternatives, both for upstream, and
> for a dev, either trying to be proactive, or currently facing a refusal
> to boot due to the issue?

If the udev rule isn't there, there's a chance that there's a mount
failure with any multiple device setup if one member device is late to
the party.

If the udev rule is removed and rootflags=degraded is set, then whenever
there's a late device, there's always a degraded boot and the drive late
to the party is out of sync. And we have no fast resync like mdadm with
write intent bitmaps, so it requires a complete volume scrub (initiated
manually) to avoid corruption, as soon as the volume is made whole.

Now maybe the udev rule could be made smarter, I don't really know. If
it's multiple device you'd want a rule that just waits for, say, 30
seconds or a minute or something sane, whatever that'd be. That way
normal operation just delays things a bit to make sure all member drives
are available at the same time, so that the mount command (without the
degraded option) works. And the only failure case is when there is in
fact a bad drive.

Someone willing to take the risk could use such a udev rule along with
rootflags=degraded, but this is asking for trouble.

What's really needed is a daemon or other service that manages the pool
status. And that includes dealing with degradedness and resyncs
automatically.

--
Chris Murphy
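The "wait a while for late member devices before deciding" policy described above can be sketched as a small polling loop. This is only an illustration of the idea being discussed — no such helper exists in udev or btrfs-progs today, and all names and signatures here are made up:

```python
import time

def wait_for_members(present, expected, timeout=30.0, poll=0.5,
                     now=time.monotonic, sleep=time.sleep):
    # Sketch of the "wait a bit for late member devices" policy from the
    # thread. Hypothetical helper -- no such daemon or udev helper exists.
    # present() returns the set of member devices currently visible.
    # Returns (complete, missing_devices).
    deadline = now() + timeout
    while True:
        missing = expected - present()
        if not missing:
            return True, set()       # all members present: mount normally
        if now() >= deadline:
            return False, missing    # policy decision: degraded mount or fail
        sleep(poll)

# Simulate a member device that shows up on the second poll.
appearances = iter([{"/dev/sda1"}, {"/dev/sda1", "/dev/sdb1"}])
ok, missing = wait_for_members(lambda: next(appearances),
                               {"/dev/sda1", "/dev/sdb1"},
                               timeout=5, poll=0, sleep=lambda s: None)
assert ok and not missing
```

The timeout branch is exactly where the hard policy question from the thread lives: returning "incomplete" can either fail the boot or trigger a degraded mount, and only the latter risks an out-of-sync member needing a full scrub afterwards.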
Re: [PATCHv2 00/24] tree-log inode vs btrfs_inode cleanups
On Wed, Jan 18, 2017 at 12:31:26AM +0200, Nikolay Borisov wrote:
> So here is a new set of patches cleaning up tree-log function
> w.r.t inode vs btrfs_inode. There are still some which remain
> but I didn't find compelling arguments to cleaning them up, so
> I've left them unchanged. This time there is some size shrinkage:
>
> text    data   bss   dec     hex    filename
> 2530598 174661 28288 2733547 29b5eb fs/btrfs/btrfs.ko - upstream master
>
> text    data   bss   dec     hex    filename
> 2530774 174661 28288 2733723 29b69b fs/btrfs/btrfs.ko - before tree-log cleanup
>
> text    data   bss   dec     hex    filename
> 2530163 174661 28288 2733112 29b438 fs/btrfs/btrfs.ko - both series applied
>
> So the net result of the 2 series is 435 bytes and I assume there
> will be further reduction in size once further cleanups are made
>
> Changes since v1:
> * Rebased all patches to latest master
>
> Nikolay Borisov (24):
>   btrfs: Make btrfs_must_commit_transaction take btrfs_inode
>   btrfs: Make btrfs_record_unlink_dir take btrfs_inode
>   btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode
>   btrfs: Make btrfs_inode_in_log take btrfs_inode
>   btrfs: Make btrfs_log_new_name take btrfs_inode
>   btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode
>   btrfs: Make btrfs_del_inode_ref take btrfs_inode
>   btrfs: Make logged_inode_size take btrfs_inode
>   btrfs: Make btrfs_check_ref_name_override take btrfs_inode
>   btrfs: Make copy_items take btrfs_inode
>   btrfs: Make btrfs_log_all_xattrs take btrfs_inode
>   btrfs: Make btrfs_log_trailing_hole take btrfs_inode
>   btrfs: Make btrfs_get_logged_extents take btrfs_inode
>   btrfs: Make btrfs_log_changed_extents take btrfs_inode
>   btrfs: Make log_dir_items take btrfs_inode
>   btrfs: Make log_directory_changes take btrfs_inode
>   btrfs: Make log_new_dir_dentries take btrfs_inode
>   btrfs: Make btrfs_unlink_inode take btrfs_inode
>   btrfs: Make drop_one_dir_item take btrfs_inode
>   btrfs: Make __add_inode_ref take btrfs_inode
>   btrfs: Make log_inode_item take btrfs_inode
>   btrfs: Make btrfs_log_inode take btrfs_inode
>   btrfs: Make count_inode_extrefs take btrfs_inode
>   btrfs: Make count_inode_refs take btrfs_inode

Added to 4.11 queue, thanks. There were several 80+ character lines added
by the patches; I've updated the patches during review. Please be more
careful with changes like this.

From https://patchwork.kernel.org/patch/9522051/

-	if (S_ISDIR(inode->i_mode) ||
+	if (S_ISDIR(inode->vfs_inode.i_mode) ||
 	    (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-		       &BTRFS_I(inode)->runtime_flags) &&
-	     inode_only >= LOG_INODE_EXISTS))
+		       &inode->runtime_flags) &&
+	     inode_only == LOG_INODE_EXISTS))

Notice the change from >= to == .
Re: dup vs raid1 in single disk
On Thu, 19 Jan 2017 17:39:37 +0100 "Alejandro R. Mosteo" wrote:
> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.

RAID1 will write slower compared to DUP, as any optimization to make
RAID1 devices work in parallel will cause a total performance disaster
for you as you will start trying to write to both partitions at the same
time, turning all linear writes into random ones, which are about two
orders of magnitude slower than linear on spinning hard drives. DUP
shouldn't have this issue, but still it will be twice slower than
single, since you are writing everything twice.

You could consider DUP data for when a disk is already known to be
getting bad sectors from time to time -- but then it's a fringe exercise
to try and keep using such disk in the first place. Yeah with DUP data
DUP metadata you can likely have some more life out of such disk as a
throwaway storage space for non-essential data, at half capacity, but is
it worth the effort, as it's likely to start failing progressively worse
over time.

In all other cases the performance and storage space penalty of DUP
within a single device are way too great (and gained redundancy is too
low) compared to a proper system of single profile data + backups, or a
RAID5/6 system (not Btrfs-based) + backups.

> On a related note, I see this caveat about dup in the manpage:
>
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"

That ability is vastly overestimated in the man page. There is no
miracle content-addressable storage system working at 500 MB/sec speeds
all within a little cheap controller on SSDs. Likely most of what it can
do, is just compress simple stuff, such as runs of zeroes or other
repeating byte sequences.

And the DUP mode is still useful on SSDs, for cases when one copy of the
DUP gets corrupted in-flight due to a bad controller or RAM or cable,
you could then restore that block from its good-CRC DUP copy.

--
With respect,
Roman
Re: read-only fs, kernel 4.9.0, fs/btrfs/delayed-inode.c:1170 __btrfs_run_delayed_items,
I don't know if it is btrfs related, but I'm getting hard freezes on
4.8.17. So I went back to 4.8.14 (with an identical .config file), which
is one of my kernels known to be trouble-free for a long time.

Since they are hard lock-ups for real, I can't provide anything.

Does anyone experience anything like that?
Re: [PATCH] btrfs-progs: lowmem-check: Fix wrong extent tree iteration
Hey Qu.

On Wed, 2017-01-18 at 16:48 +0800, Qu Wenruo wrote:
> To Christoph,
>
> Would you please try this patch, and to see if it suppress the block
> group warning?

I did another round of fsck in both modes (original/lowmem), first WITHOUT your patch, then WITH it... both on progs version 4.9... no further RW mount between these 4 runs:

btrfs-progs v4.9 WITHOUT patch:
***

# btrfs check /dev/nbd0 ; echo $?
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/nbd0
UUID: 326d292d-f97b-43ca-b1e8-c722d3474719
found 7469206884352 bytes used err is 0
total csum bytes: 7281779252
total tree bytes: 10837262336
total fs tree bytes: 2011906048
total extent tree bytes: 1015349248
btree space waste bytes: 922444044
file data blocks allocated: 7458369622016
 referenced 7579485159424
0

# btrfs check --mode=lowmem /dev/nbd0 ; echo $?
checking extents
ERROR: block group[74117545984 1073741824] used 1073741824 but extent items used 0
ERROR: block group[239473786880 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[500393050112 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[581997428736 1073741824] used 1073741824 but extent items used 0
ERROR: block group[626557714432 1073741824] used 1073741824 but extent items used 0
ERROR: block group[668433645568 1073741824] used 1073741824 but extent items used 0
ERROR: block group[948680261632 1073741824] used 1073741824 but extent items used 0
ERROR: block group[982503129088 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1039411445760 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1054443831296 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[1190809042944 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1279392743424 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1481256206336 1073741824] used 1073741824 but extent items used 0
ERROR: block group[1620842643456 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[1914511032320 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[3055361720320 1073741824] used 1073741824 but extent items used 0
ERROR: block group[3216422993920 1073741824] used 1073741824 but extent items used 0
ERROR: block group[3670615785472 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[3801612288000 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[3828455833600 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[4250973241344 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4261710659584 1073741824] used 1073741824 but extent items used 1074266112
ERROR: block group[4392707162112 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4558063403008 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4607455526912 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4635372814336 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4640204652544 1073741824] used 1073741824 but extent items used 0
ERROR: block group[4642352136192 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[4681006841856 1073741824] used 1073741824 but extent items used 0
ERROR: block group[5063795802112 1073741824] used 1073741824 but extent items used 0
ERROR: block group[5171169984512 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[5216267141120 1073741824] used 1073741824 but extent items used 1207959552
ERROR: block group[5290355326976 1073741824] used 1073741824 but extent items used 0
ERROR: block group[5445511020544 1073741824] used 1073741824 but extent items used 1074266112
ERROR: block group[6084387405824 1073741824] used 1073741824 but extent items used 0
ERROR: block group[6104788500480 1073741824] used 1073741824 but extent items used 0
ERROR: block group[6878956355584 1073741824] used 1073741824 but extent items used 0
ERROR: block group[6997067956224 1073741824] used 1073741824 but extent items used 0
ERROR: block group[7702516334592 1073741824] used 1073741824 but extent items used 0
ERROR: block group[8051482427392 1073741824] used 1073741824 but extent items used 1084751872
ERROR: block group[8116980678656 1073741824] used 1073217536 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Checking filesystem on /dev/nbd0
UUID: 326d292d-f97b-43ca-b1e8-c722d3474719
found 7469206884352 bytes used err is -5
total csum bytes: 7281779252
total tree bytes: 10837262336
total fs tree bytes: 2011906048
total extent tree bytes: 1015349248
btree space waste bytes: 922444044
file data
Re: Fwd: dup vs raid1 in single disk
On 2017-01-19 11:39, Alejandro R. Mosteo wrote:
> Hello list,
>
> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.
>
> On a related note, I see this caveat about dup in the manpage:
>
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"
>
> SSDs failure modes are different (more an all or nothing thing, I'm
> told) so it wouldn't apply to the use case above, but I'm curious for
> curiosity's sake if there would be any difference too.

On a traditional HDD, there actually is a reasonable safety benefit to using 2 partitions in raid1 mode over using dup mode. This is because most traditional HDD firmware still keeps the mapping of physical sectors to logical sectors mostly linear, so having separate partitions will (usually) mean that the two copies are not located near each other on physical media. A similar but weaker version of the same effect can be achieved by using the 'ssd_spread' mount option, but I would not suggest relying on that.

This doesn't apply to hybrid drives (because they move stuff around however they want, like SSDs), or SMR drives (because they rewrite large portions of the disk when one place gets rewritten, so physical separation of the data copies doesn't get you as much protection). For most SSDs, there is no practical benefit because the FTL in the SSD firmware generally maps physical sectors to logical sectors in whatever arbitrary way it wants, which is usually not going to be linear.

As far as failure modes on an SSD, you usually see one of two things happen: either the whole disk starts acting odd (or stops working), or individual blocks a few MB in size (which seem to move around the disk as they get over-written) start behaving oddly. The first case is the firmware or primary electronics going bad, while the second is individual erase blocks going bad. As a general rule, SSDs will run longer as they're going bad than HDDs will, but in both cases you should look at replacing the device once you start seeing the error counters going up consistently over time (or if you see them suddenly jump to a much higher number).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
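The physical-separation argument above can be sketched with a toy model. Everything in it is hypothetical (the disk size, the chunk placement offsets, the helper names); real btrfs chunk allocation and drive firmware behave far less predictably. The point is only that raid1 over two partitions confines each copy to its own half of the LBA range, while dup gives no such guarantee:

```python
# Toy model of copy placement on a single HDD (all numbers invented).
# Assumption: the drive maps logical sectors to physical sectors mostly
# linearly, so LBA distance approximates physical distance.
DISK_SECTORS = 1_000_000          # hypothetical whole-disk size in sectors

def dup_copies(offset):
    # dup profile: both copies may be allocated from nearby chunks;
    # 2048 sectors apart is an arbitrary illustrative figure.
    return (offset, offset + 2048)

def raid1_copies(offset):
    # raid1 over two equal partitions: one copy lives in each partition,
    # so the copies are at least half a disk apart in LBA terms.
    return (offset, DISK_SECTORS // 2 + offset)

def separation(copies):
    a, b = copies
    return abs(b - a)
```

With an HDD's mostly linear LBA-to-physical mapping, a large LBA distance between the copies translates to a large physical distance, which is what helps against a localized patch of bad sectors.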
Fwd: dup vs raid1 in single disk
Hello list,

I was wondering, from a point of view of data safety, if there is any difference between using dup or making a raid1 from two partitions in the same disk. This is thinking on having some protection against the typical aging HDD that starts to have bad sectors.

On a related note, I see this caveat about dup in the manpage:

"For example, a SSD drive can remap the blocks internally to a single copy thus deduplicating them. This negates the purpose of increased redunancy (sic) and just wastes space"

SSDs' failure modes are different (more an all-or-nothing thing, I'm told) so it wouldn't apply to the use case above, but I'm curious for curiosity's sake if there would be any difference too.

Thanks,
Alex.
Re: [PATCH v3 2/4] xfstests: btrfs/132: add test for invalid update time by an incremental send
On Thu, Jan 5, 2017 at 11:31 AM, Filipe Manana wrote:
> On Thu, Jan 5, 2017 at 2:45 AM, robbieko wrote:
>> Filipe Manana wrote on 2017-01-04 21:09:
>>
>>> On Wed, Jan 4, 2017 at 10:53 AM, robbieko wrote:

From: Robbie Ko

Test that an incremental send operation doesn't work because it tries
to update the time of a deleted directory after it finishes a move
operation. The other one is that an operation is applied to a file
using the old name, not the new name.

This test exercises scenarios that used to fail in btrfs and are fixed
by the following patches for the linux kernel:

"Btrfs: incremental send, add generation check for the inode waiting
for rmdir operation."
"Btrfs: incremental send, add generation check in existence
determination for the parent directory"

Signed-off-by: Robbie Ko
---
V3: remove "run_" based helpers
V2: improve the change log

 tests/btrfs/132     | 123
 tests/btrfs/132.out |   7 +++
 tests/btrfs/group   |   1 +
 3 files changed, 131 insertions(+)
 create mode 100755 tests/btrfs/132
 create mode 100644 tests/btrfs/132.out

diff --git a/tests/btrfs/132 b/tests/btrfs/132
new file mode 100755
index 000..f1bb698
--- /dev/null
+++ b/tests/btrfs/132
@@ -0,0 +1,123 @@
+#! /bin/bash
+# FS QA Test No. btrfs/132
+#
+# Test that an incremental send operation doesn't work because
+# it tries to update the time of a deleted directory after it finishes
+# a move operation.
+#
+# The other one is that an operation is applied to a file using the old
+# name, not the new name.
+#
+#---
+# Copyright (C) 2016 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -fr $send_files_dir
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+mkdir $SCRATCH_MNT/dir257
+mkdir $SCRATCH_MNT/dir258
+mkdir $SCRATCH_MNT/dir259
+mv $SCRATCH_MNT/dir257 $SCRATCH_MNT/dir258/dir257
+
+# Filesystem looks like:
+#
+# .                       (ino 256)
+# |--- dir258/            (ino 258)
+# |       |--- dir257/    (ino 257)
+# |
+# |--- dir259/            (ino 259)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/mysnap1 > /dev/null
+
+mv $SCRATCH_MNT/dir258/dir257 $SCRATCH_MNT/dir257
+rmdir $SCRATCH_MNT/dir258
+rmdir $SCRATCH_MNT/dir259
+_scratch_unmount
+_scratch_mount

>>> Why do need to remount the fs? There's also a _scratch_remount function.
>>>
>>> Again, I would like to see comments in the test explaining why and
>>> how/why did the incremental send operation used to fail (like most of
>>> the existing send/receive tests have).
>>>
>>> thanks
>>
>> Because we need to use the same inode 258 recreate, so need to delete the
>> old dir258 and dir259, then remount the fs, the next new file will be
[PATCH 2/3] Btrfs: incremental send, do not delay rename when parent inode is new
From: Filipe Manana

When we are checking if we need to delay the rename operation for an
inode, we are not checking if a parent inode that exists in the send
and parent snapshots is really the same inode or not, that is, we are
not comparing the generation number of the parent inode in the send and
parent snapshots. Not only does this result in unnecessarily delaying a
rename operation, but it can also later on make us generate an
incorrect name for a new inode in the send snapshot that has the same
number as another inode in the parent snapshot but a different
generation.

Here follows an example where this happens.

Parent snapshot:

 .                           (ino 256, gen 3)
 |--- dir258/                (ino 258, gen 7)
 |       |--- dir257/        (ino 257, gen 7)
 |
 |--- dir259/                (ino 259, gen 7)

Send snapshot:

 .                           (ino 256, gen 3)
 |--- file258                (ino 258, gen 10)
 |
 |--- new_dir259/            (ino 259, gen 10)
 |       |--- dir257/        (ino 257, gen 7)

The following steps happen when computing the incremental send stream:

1) When processing inode 257, its new parent is created using its orphan
   name (o259-10-0), and the rename operation for inode 257 is delayed
   because its new parent (inode 259) was not yet processed - this
   decision to delay the rename operation does not make much sense
   because the inode 259 in the send snapshot is a new inode, it's not
   the same as inode 259 in the parent snapshot.

2) When processing inode 258 we end up delaying its rmdir operation,
   because inode 257 was not yet renamed (moved away from the directory
   inode 258 represents). We also create the new inode 258 using its
   orphan name "o258-10-0", then rename it to its final name of
   "file258" and then issue a truncate operation for it. However this
   truncate operation contains an incorrect name, which corresponds to
   the orphan name and not to the final name, which makes the receiver
   fail.

This happens because when we attempt to compute the inode's current
name, we verify that there's another inode with the same number (258)
that has its rmdir operation pending, and because of that we generate
an orphan name for the new inode 258 (we do this in the function
get_cur_path()).

Fix this by not delaying the rename operation of an inode if it has
parents with the same number but different generations in both
snapshots.

The following steps reproduce this example scenario.

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1

$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259

# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after every inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.

$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2

$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap

$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmp/1.snap
$ btrfs receive /mnt -f /tmp/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory

Reported-by: Robbie Ko
Signed-off-by: Filipe Manana
---
 fs/btrfs/send.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2ae32c4..2742324 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3559,6 +3559,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 {
 	int ret = 0;
 	u64 ino = parent_ref->dir;
+	u64 ino_gen = parent_ref->dir_gen;
 	u64 parent_ino_before, parent_ino_after;
 	struct fs_path *path_before = NULL;
 	struct fs_path *path_after = NULL;
@@ -3579,6 +3580,8 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 	 * at
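For readers following along, the decision this patch changes can be modelled in a few lines of Python. This is an illustrative sketch, not kernel code: the snapshot tables and the must_delay_rename() helper are invented for the example, and the real logic lives in wait_for_parent_move() in fs/btrfs/send.c:

```python
# Illustrative model of the generation check added by this patch.
# Each table maps inode number -> generation, mirroring the example
# trees from the changelog (these tables are invented for the sketch,
# not read from a real filesystem).
PARENT_SNAPSHOT = {256: 3, 257: 7, 258: 7, 259: 7}
SEND_SNAPSHOT = {256: 3, 257: 7, 258: 10, 259: 10}

def must_delay_rename(parent_ino, parent_gen_in_send):
    """Delay an inode's rename only if its new parent directory exists
    in the parent snapshot as the *same* inode: same number and same
    generation. A number match alone is not enough."""
    old_gen = PARENT_SNAPSHOT.get(parent_ino)
    if old_gen is None:
        return False  # parent directory is brand new, nothing to wait for
    return old_gen == parent_gen_in_send
```

In the example scenario, inode 259 has generation 7 in the parent snapshot but 10 in the send snapshot, so the rename of dir257 must not be delayed.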
[PATCH 3/3] Btrfs: incremental send, do not issue invalid rmdir operations
From: Robbie Ko

When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.

The following example scenario shows how this happens.

Parent snapshot:

 .
 |--- d259_old/          (ino 259, gen 9)
 |       |--- d1/        (ino 258, gen 9)
 |
 |--- f                  (ino 257, gen 9)

Send snapshot:

 .
 |--- d258/              (ino 258, gen 7)
 |
 |--- d259/              (ino 259, gen 7)
 |       |--- d1/        (ino 257, gen 7)

When the kernel is processing inode 258 it notices that in both
snapshots there is an inode numbered 259 that is a parent of an inode
258. However it ignores the fact that the inodes numbered 259 have
different generations in both snapshots, which means they are
effectively different inodes. Then it checks that both inodes 259 have
a dentry named "d1" and because of that it issues a rmdir operation
with the orphanized name of the inode 258 from the parent snapshot.
This happens at send.c:process_recorded_refs(), which calls
send.c:did_overwrite_first_ref() that returns true, and because of that
later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that
has the same number and is a directory too).

Fix this issue by comparing the generations of parent directory inodes
that have the same number and making send.c:did_overwrite_first_ref()
return false when the generations are different.

The following steps reproduce the problem.

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/f
$ mkdir /mnt/d1
$ mkdir /mnt/d259_old
$ mv /mnt/d1 /mnt/d259_old/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/d1
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/d1 /mnt/dir259/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt/ -f /tmp/1.snap

# Take note that once the filesystem is created, its current
# generation has value 7 so the inodes from the second snapshot all have
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10. This means all the inodes
# in the first snapshot (snap1) have a generation value of 9.

$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt -f /tmp/1.snap
$ btrfs receive -vv /mnt -f /tmp/2.snap
receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
utimes
unlink f
mkdir o257-7-0
mkdir o259-7-0
rename o257-7-0 -> o259-7-0/d1
chown o259-7-0/d1 - uid=0, gid=0
chmod o259-7-0/d1 - mode=0755
utimes o259-7-0/d1
rmdir o258-9-0
ERROR: rmdir o258-9-0 failed: No such file or directory

Signed-off-by: Robbie Ko
Reviewed-by: Filipe Manana
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana
---
 fs/btrfs/send.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2742324..712922e 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1937,6 +1937,19 @@ static int did_overwrite_ref(struct send_ctx *sctx,
 	if (ret <= 0)
 		goto out;
 
+	if (dir != BTRFS_FIRST_FREE_OBJECTID) {
+		ret = get_inode_info(sctx->send_root, dir, NULL, &gen, NULL,
+				     NULL, NULL, NULL);
+		if (ret < 0 && ret != -ENOENT)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+		if (gen != dir_gen)
+			goto out;
+	}
+
 	/* check if the ref was overwritten by another ref */
 	ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
 				    &ow_inode, &ow_type);
--
2.7.0.rc3
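The generation comparison this patch adds can be summarized with a small model. The table and helper below are invented for illustration; the real code looks the directory up with get_inode_info() against sctx->send_root inside did_overwrite_ref():

```python
BTRFS_FIRST_FREE_OBJECTID = 256  # the subvolume root inode number

# Send snapshot: inode number -> generation. Invented example data,
# loosely following the changelog's send snapshot where the regular
# inodes all have generation 7.
SEND_ROOT = {256: 3, 257: 7, 258: 7, 259: 7}

def dir_ref_may_be_overwritten(dir_ino, dir_gen):
    """Model of the early bail-out: if the parent directory recorded
    for a reference does not exist in the send snapshot with the same
    generation, it is a different directory, so the reference cannot
    have been overwritten there."""
    if dir_ino != BTRFS_FIRST_FREE_OBJECTID:
        gen = SEND_ROOT.get(dir_ino)
        if gen is None or gen != dir_gen:
            return False  # different or missing dir: no overwrite possible
    return True  # proceed to the name lookup as before
```

So for the "d1" reference under directory 259 from the parent snapshot (generation 9), the check now bails out early and the bogus rmdir against the orphanized name o258-9-0 is never emitted.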
Re: [PATCH v3 2/6] Btrfs: incremental send, fix invalid path for truncate operations
On Thu, Jan 5, 2017 at 8:24 AM, robbieko wrote:
> From: Robbie Ko
>
> Under certain situations, an incremental send operation can

Again, missing some word after the word "can", without it the phrase
doesn't make any sense. Tip: don't copy paste change logs and then
modify them for each patch, it's a lot easier to make errors like that.

> a truncate operation that will make the receiving end fail when
> attempting to execute it, because the path is not exist.

The problem is generating wrong paths, and that can be for any
operation, not just truncates. The subject should reflect that. As it
is, it is misleading.

>
> Example scenario:
> Parent snapshot:
> | dir258/            (ino 258, gen 15, dir)
> |--- dir257/         (ino 257, gen 15, dir)
> | dir259/            (ino 259, gen 15, dir)
>
> Send snapshot:
> | file258            (ino 258, gen 21, file)
> | new_dir259/        (ino 259, gen 21, dir)
> |--- dir257/         (ino 257, gen 15, dir)
>
> utimes
> mkdir o259-21-0
> rename dir258 -> o258-15-0
> utimes
> mkfile o258-21-0
> rename o258-21-0 -> file258
> utimes
> truncate o258-21-0 size=0
> ERROR: truncate o258-21-0 failed: No such file or directory
>
> While computing the send stream the following steps happen:
>
> 1) While processing inode 257, we create o259-21-0 and delay
>    dir257 rename operation, because its new parent in the send
>    snapshot, inode 259, was not yet processed and therefore not
>    yet renamed.

The problem is exactly here. There's no need to delay the rename of
257, because inode 259 is a new inode. The whole logic for delaying
renames is meant to be applied when there's really a need to do so,
when the child and parent inodes exist in both snapshots. So the
problem is fully fixed by not delaying the rename for 257, doing this
also avoids the extra complexity and the need for all the patches in
your patchset except for patches 1/6 and 4/6. I've added such a fix
into my 4.11 integration branch and just sent it to the list too.

It's here:
https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/commit/?h=integration-4.11=b271d6c72e1dc34e93ff6035096e29d45ffa933e

thanks

> 2) Later when processing inode 258, we delay rmdir operation for dir258
>    because it's not empty, and then orphanize it.
>
> 3) After we create o258-21-0 and rename it to file258, we get ENOENT on
>    truncate file258. The reason is that the dir258 with inode 258 is
>    waiting for rmdir operation and file258's inode is also 258, then
>    get_current_path called before truncate will return a unique name.
>
> Fix this by adding generation check for the inode waiting for rmdir
> operation.
>
> Signed-off-by: Robbie Ko
> ---
> V3: improve the change log
>  fs/btrfs/send.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 2060e75..22eca86 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -311,7 +311,7 @@ static int is_waiting_for_move(struct send_ctx *sctx, u64 ino);
>  static struct waiting_dir_move *
>  get_waiting_dir_move(struct send_ctx *sctx, u64 ino);
>
> -static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino);
> +static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 dir_gen);
>
>  static int need_send_hole(struct send_ctx *sctx)
>  {
> @@ -2292,7 +2292,7 @@ static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
>
>         fs_path_reset(name);
>
> -       if (is_waiting_for_rm(sctx, ino)) {
> +       if (is_waiting_for_rm(sctx, ino, gen)) {
>                 ret = gen_unique_name(sctx, ino, gen, name);
>                 if (ret < 0)
>                         goto out;
> @@ -2899,11 +2899,11 @@ get_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino)
>         return NULL;
> }
>
> -static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino)
> +static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 dir_gen)
> {
>         struct orphan_dir_info *odi = get_orphan_dir_info(sctx, dir_ino);
>
> -        return odi != NULL;
> +        return (odi != NULL && odi->gen == dir_gen);
> }
>
> static void free_orphan_dir_info(struct send_ctx *sctx,
> @@ -3166,7 +3166,7 @@ static int path_loop(struct send_ctx *sctx, struct fs_path *name,
>         while (ino != BTRFS_FIRST_FREE_OBJECTID) {
>                 fs_path_reset(name);
>
> -               if (is_waiting_for_rm(sctx, ino))
> +               if (is_waiting_for_rm(sctx, ino, gen))
>                         break;
>                 if (is_waiting_for_move(sctx, ino)) {
>                         if (*ancestor_ino == 0)
> --
> 1.9.1

--
Filipe David Manana,
"People will forget what you said, people
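The essence of the patch under review — keying "is this directory waiting for rmdir?" on generation as well as inode number — fits in a few lines. A sketch with invented data; the real state lives in the orphan_dir_info structures in fs/btrfs/send.c:

```python
# Pending-rmdir bookkeeping: inode number -> generation of the directory
# whose rmdir was delayed. Invented example: dir258 from the parent
# snapshot in the scenario above (ino 258, gen 15).
orphan_dirs = {258: 15}

def is_waiting_for_rm(ino, gen):
    # Only a (number, generation) match means it is the same inode;
    # a new inode reusing the number must not be treated as "waiting".
    return orphan_dirs.get(ino) == gen
```

With this check, get_cur_path() stops generating a unique (orphan) name for the new file258 (ino 258, gen 21), which is what made the receiver's truncate fail with ENOENT.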
Re: [PATCH v3 1/6] Btrfs: incremental send, fix failure to rename with the name collision
On Thu, Jan 5, 2017 at 8:24 AM, robbieko wrote:
> From: Robbie Ko
>
> Under certain situations, an incremental send operation can

Missing some word after the word "can".

> a rename operation that will make the receiving end fail when
> attempting to execute it, because the target is exist.
>
> Example scenario:
> Parent snapshot:

Two consecutive sentences ending with a colon is odd.

> |.    (ino 256, gen 5)

. is not a child of anything, shouldn't have the | preceding it.

> | a1/ (ino 257, gen 5)
> | a2/ (ino 258, gen 5)
>
> Send snapshot:
> |.    (ino 256, gen 7)

Same here.

> | a2  (ino 257, gen 7)
>
> rmdir a1
> mkfile o257-7-0
> rename o257-7-0 -> a2
> ERROR: rename o257-7-0 -> a2 failed: Is a directory

Alright what is this? So you just paste here the output of "btrfs
receive -vv" without mentioning where it comes from. Let the reader
guess and make some sense of it.

>
> While computing the send stream the following steps happen:
>
> 1) delete a1;
>
> 2) mkfile o257-7-0;
>
> 3) rename o257-7-0->a2;
>
> In step 3 we will check whether it will lead to overwrite.
>
> The generation number of inode 257's parent (ino 256) in send snapshot
> is 7, which is inconsistent with the one in parent snapshot (gen 5).
> For the general parent directory except inode 256, if its generation
> is not identical between parent and send snapshot, it will be removed
> then created. Thus it's not possible to happen overwrite under the new
> directory. However for the special inode 256, the above logic does not
> work since it is a subvolume. So overwrite check is required for the
> inode 256.
> Fix by skipping generation inconsistency check for inode 256.

I've heavily reworded the explanation and made other changes to the
changelog myself. You should have given an example on how to create the
snapshots so that it's clear how to get different inodes with the same
number but different generations.

I've pushed it into an integration branch in my git repository with the
goal of including it for 4.11:
https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.11

The subject was also changed to "Btrfs: send, fix failure to rename top
level inode due to name collision". So for this patch, you don't need
to do anything else. Thanks.

> Signed-off-by: Robbie Ko
> ---
> V3: improve the change log
>  fs/btrfs/send.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index a87675f..2060e75 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -1681,6 +1681,9 @@ static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
> {
>         int ret;
>
> +       if (ino == BTRFS_FIRST_FREE_OBJECTID)
> +               return 1;
> +
>         ret = get_cur_inode_state(sctx, ino, gen);
>         if (ret < 0)
>                 goto out;
> @@ -1866,7 +1869,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
>          * not deleted and then re-created, if it was then we have no overwrite
>          * and we can just unlink this entry.
>          */
> -       if (sctx->parent_root) {
> +       if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID) {
>                 ret = get_inode_info(sctx->parent_root, dir, NULL, &gen, NULL,
>                                      NULL, NULL, NULL);
>                 if (ret < 0 && ret != -ENOENT)
> --
> 1.9.1

--
Filipe David Manana,

"People will forget what you said, people will forget what you did,
but people will never forget how you made them feel."
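The special-casing of inode 256 that this patch introduces can be modelled compactly. The lookup callback below is a hypothetical stand-in for get_cur_inode_state(); the constant matches the kernel's BTRFS_FIRST_FREE_OBJECTID:

```python
BTRFS_FIRST_FREE_OBJECTID = 256

def is_inode_existent(ino, gen, lookup):
    """The subvolume root (ino 256) always exists and is never deleted
    and recreated, so the generation comparison is skipped for it; for
    every other inode, existence means the looked-up generation matches
    (lookup is an invented stand-in for the real snapshot query)."""
    if ino == BTRFS_FIRST_FREE_OBJECTID:
        return True
    return lookup(ino) == gen
```

This mirrors why will_overwrite_ref() must still perform the overwrite check when the parent directory is inode 256, even though its generation differs between the two snapshots.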
Re: [PATCH v2 4/4] xfstests: btrfs/134: add test for incremental send which renames a directory already being deleted
On Wed, Nov 2, 2016 at 3:52 AM, robbieko wrote:
> Hi Eryu Guan,
>
> Yes, it need apply
> [PATCH] "Btrfs: incremental send, do not skip generation inconsistency
> check for inode 256."
> and test again, it will failed.
>
> because current code there is a problem, but just will not happen.

Then it's not a problem... What you're saying is confusing to say the
least. I really don't like the idea of adding a patch that is known to
introduce a regression and then adding another patch that fixes it.
Anyway, your patchset has been reviewed and those patches are not
needed anymore as there's a better solution that doesn't imply
introducing a regression temporarily, so this test is not really
needed.

thanks

> Thanks
> Robbie Ko
>
> Eryu Guan wrote on 2016-11-01 15:20:
>
>> On Fri, Oct 28, 2016 at 09:44:06AM +0800, robbieko wrote:
>>> From: Robbie Ko
>>>
>>> Test that an incremental send operation dosen't work because
>>> it tries to rename a directory which is already deleted.
>>>
>>> This test exercises scenarios used to fail in btrfs and are fixed by
>>> the following patch for the linux kernel:
>>>
>>> "Btrfs: incremental send, add generation check for inode is waiting
>>> for move."
>>
>> I was testing with v4.9-rc1+ kernel and btrfs-progs v4.6. Seems above
>> patch is not merged in 4.9-rc1 kernel, but test passed for me, is that
>> expected?
>>
>> Thanks,
>> Eryu

--
Filipe David Manana,

"People will forget what you said, people will forget what you did,
but people will never forget how you made them feel."
[PATCH v4 1/3] fstests: test btrfs incremental send after replacing a top level inode
From: Filipe Manana

Test that an incremental send operation does not fail when a new inode
replaces an old inode that has the same number but different
generation, and both are direct children of the subvolume/snapshot
root.

This is fixed by the following patch for the linux kernel:

"Btrfs: send, fix failure to rename top level inode due to name
collision"

Signed-off-by: Robbie Ko
Signed-off-by: Filipe Manana
---
v4: Improved changelog and added comments to the test that explain what
is being tested and the problem.

 tests/btrfs/133     | 127
 tests/btrfs/133.out |   8
 tests/btrfs/group   |   1 +
 3 files changed, 136 insertions(+)
 create mode 100755 tests/btrfs/133
 create mode 100644 tests/btrfs/133.out

diff --git a/tests/btrfs/133 b/tests/btrfs/133
new file mode 100755
index 000..5839f17
--- /dev/null
+++ b/tests/btrfs/133
@@ -0,0 +1,127 @@
+#! /bin/bash
+# FS QA Test No. btrfs/133
+#
+# Test that an incremental send operation does not fail when a new inode
+# replaces an old inode that has the same number but different generation,
+# and both are direct children of the subvolume/snapshot root.
+#
+#---
+# Copyright (C) 2017 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -fr $send_files_dir
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+mkdir $SCRATCH_MNT/a1
+mkdir $SCRATCH_MNT/a2
+
+# Filesystem looks like:
+#
+# .           (ino 256)
+# |--- a1/    (ino 257)
+# |--- a2/    (ino 258)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/mysnap1 > /dev/null
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \
+	$send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch
+
+_scratch_unmount
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+touch $SCRATCH_MNT/a2
+
+# Filesystem now looks like:
+#
+# .          (ino 256)
+# |--- a2    (ino 257)
+#
+# Notice that at this point inode 257 has a generation with value 7, which is
+# the generation value for a brand new filesystem.
+
+# Now create the second snapshot. This makes the filesystem's current generation
+# value to increase to the value 8, due to a transaction commit performed by the
+# snapshot creation ioctl.
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/mysnap2 > /dev/null
+
+# Now receive the first snapshot created in the first filesystem.
+# Before creating any inodes, the receive command creates the first snapshot,
+# which causes a transaction commit and therefore bumps the filesystem's current
+# generation to the value 9. All the inodes created end up getting a generation
+# value of 9 and the snapshot's root inode (256) gets a generation value of 8.
+$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $send_files_dir/1.snap > /dev/null
+rm $send_files_dir/1.snap
+
+$FSSUM_PROG -A -f -w $send_files_dir/1.fssum $SCRATCH_MNT/mysnap1
+$FSSUM_PROG -A -f -w $send_files_dir/2.fssum $SCRATCH_MNT/mysnap2
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \
+	$send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch
+$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
+	-f $send_files_dir/2.snap 2>&1 1>/dev/null | _filter_scratch
+
+# Now recreate the filesystem by receiving both send streams and verify we get
+# the same content
[PATCH v4 3/3] fstests: test btrfs incremental send after replacing directory
From: Filipe Manana

Test that an incremental send operation works when in both snapshots
there are two directory inodes that have the same number but different
generations and have an entry with the same name that corresponds to
different inodes in each snapshot.

The btrfs issue is fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, do not issue invalid rmdir operations"

Signed-off-by: Robbie Ko
Signed-off-by: Filipe Manana
---
v4: Improved changelog and added comments to the test that explain what is
being tested and the problem.

 tests/btrfs/135     | 159
 tests/btrfs/135.out |   8 +++
 tests/btrfs/group   |   1 +
 3 files changed, 168 insertions(+)
 create mode 100755 tests/btrfs/135
 create mode 100644 tests/btrfs/135.out

diff --git a/tests/btrfs/135 b/tests/btrfs/135
new file mode 100755
index 000..bdc70b3
--- /dev/null
+++ b/tests/btrfs/135
@@ -0,0 +1,159 @@
+#! /bin/bash
+# FS QA Test No. btrfs/135
+#
+# Test that an incremental send operation works when in both snapshots there are
+# two directory inodes that have the same number but different generations and
+# have an entry with the same name that corresponds to different inodes in each
+# snapshot.
+#
+#---
+# Copyright (C) 2017 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -fr $send_files_dir + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux +_require_test +_require_scratch +_require_fssum + +send_files_dir=$TEST_DIR/btrfs-test-$seq + +rm -f $seqres.full +rm -fr $send_files_dir +mkdir $send_files_dir + +_scratch_mkfs >>$seqres.full 2>&1 +_scratch_mount + +touch $SCRATCH_MNT/f +mkdir $SCRATCH_MNT/d1 +mkdir $SCRATCH_MNT/d259_old +mv $SCRATCH_MNT/d1 $SCRATCH_MNT/d259_old/d1 + +# Filesystem looks like: +# +# . (ino 256) +# |--- f(ino 257) +# | +# |--- d259_old/(ino 259) +# |--- d1/ (ino 258) +# +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \ + $SCRATCH_MNT/mysnap1 > /dev/null + +$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \ + $send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch + +_scratch_unmount +_scratch_mkfs >>$seqres.full 2>&1 +_scratch_mount +mkdir $SCRATCH_MNT/d1 +mkdir $SCRATCH_MNT/dir258 +mkdir $SCRATCH_MNT/dir259 +mv $SCRATCH_MNT/d1 $SCRATCH_MNT/dir259/d1 + +# Filesystem now looks like: +# +# . (ino 256) +# |--- dir258/ (ino 258) +# | +# |--- dir259/ (ino 259) +#|--- d1/ (ino 257) +# +# Notice that at this point all inodes have a generation with value 7, which is +# the generation value for a brand new filesystem. + +# Now create the second snapshot. This makes the filesystem's current generation +# value to increase to the value 8, due to a transaction commit performed by the +# snapshot creation ioctl. 
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \ + $SCRATCH_MNT/mysnap2 > /dev/null + +# Now receive the first snapshot created in the first filesystem. +# Before creating any inodes, the receive command creates the first snapshot, +# which causes a transaction commit and therefore bumps the filesystem's current +# generation to the value 9. All the inodes created end up getting a generation +# value of 9 and the snapshot's root inode (256) gets a generation value of 8. +$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $send_files_dir/1.snap > /dev/null +rm $send_files_dir/1.snap
[PATCH v4 2/3] fstests: test btrfs incremental send after moving a directory
From: Filipe Manana

Test that an incremental send operation works after moving a directory
into a new parent directory, deleting its previous parent directory and
creating a new inode that has the same inode number as the old parent.

This issue is fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, do not delay rename when parent inode is new"

Signed-off-by: Robbie Ko
Signed-off-by: Filipe Manana
---
v4: Improved changelog and added comments to the test that explain what is
being tested and the problem.

 tests/btrfs/134     | 126
 tests/btrfs/134.out |   6 +++
 tests/btrfs/group   |   1 +
 3 files changed, 133 insertions(+)
 create mode 100755 tests/btrfs/134
 create mode 100644 tests/btrfs/134.out

diff --git a/tests/btrfs/134 b/tests/btrfs/134
new file mode 100755
index 000..e4b4c6d
--- /dev/null
+++ b/tests/btrfs/134
@@ -0,0 +1,126 @@
+#! /bin/bash
+# FS QA Test No. btrfs/134
+#
+# Test that an incremental send operation works after moving a directory into
+# a new parent directory, deleting its previous parent directory and creating
+# a new inode that has the same inode number as the old parent.
+#
+#---
+# Copyright (C) 2017 Synology Inc. All Rights Reserved.
+# Author: Robbie Ko
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -fr $send_files_dir + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux +_require_test +_require_scratch +_require_fssum + +send_files_dir=$TEST_DIR/btrfs-test-$seq + +rm -f $seqres.full +rm -fr $send_files_dir +mkdir $send_files_dir + +_scratch_mkfs >>$seqres.full 2>&1 +_scratch_mount + +mkdir $SCRATCH_MNT/dir257 +mkdir $SCRATCH_MNT/dir258 +mkdir $SCRATCH_MNT/dir259 +mv $SCRATCH_MNT/dir257 $SCRATCH_MNT/dir258/dir257 + +# Filesystem looks like: +# +# . (ino 256, gen 3) +# |--- dir258/ (ino 258, gen 7) +# | |--- dir257/ (ino 257, gen 7) +# | +# |--- dir259/ (ino 259, gen 7) +# +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \ +$SCRATCH_MNT/mysnap1 > /dev/null + +mv $SCRATCH_MNT/dir258/dir257 $SCRATCH_MNT/dir257 +rmdir $SCRATCH_MNT/dir258 +rmdir $SCRATCH_MNT/dir259 +# Remount the filesystem so that the next created inodes will have the numbers +# 258 and 259. This is because when a filesystem is mounted, btrfs sets the +# subvolume's inode counter to a value corresponding to the highest inode number +# in the subvolume plus 1. This inode counter is used to assign a unique number +# to each new inode and it's incremented by 1 after very inode creation. +# Note: we unmount and then mount instead of doing a mount with "-o remount" +# because otherwise the inode counter remains at value 260. 
+_scratch_unmount +_scratch_mount +touch $SCRATCH_MNT/file258 +mkdir $SCRATCH_MNT/new_dir259 +mv $SCRATCH_MNT/dir257 $SCRATCH_MNT/new_dir259/dir257 + +# Filesystem now looks like: +# +# .(ino 256, gen 3) +# |--- file258 (ino 258, gen 10) +# | +# |--- new_dir259/ (ino 259, gen 10) +# |--- dir257/(ino 257, gen 7) +# +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \ +$SCRATCH_MNT/mysnap2 > /dev/null + +$FSSUM_PROG -A -f -w $send_files_dir/1.fssum $SCRATCH_MNT/mysnap1 +$FSSUM_PROG -A -f -w $send_files_dir/2.fssum \ +-x $SCRATCH_MNT/mysnap2/mysnap1 $SCRATCH_MNT/mysnap2 + +$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f \ + $send_files_dir/1.snap 2>&1 1>/dev/null | _filter_scratch +$BTRFS_UTIL_PROG send -p
Re: [PATCH v3] btrfs-progs: Fix disable backtrace assert error
On Wed, Jan 18, 2017 at 09:42:45PM -0600, Goldwyn Rodrigues wrote: > >>> +#define BUG_ON(c) ASSERT(!(c)) > >> > >> The problem with this is that you are killing the value printed as a > >> part of the trace for BUG_ON(). The main reason why commit > >> 00e769d04c2c83029d6c71 was written. Please be careful with your notting > >> especially when the values are being printed. > >> > > > > This is designed. > > > > As ASSERT() is more meaningful than the abused BUG_ON(), I changed it to > > print correct value for ASSERT() and ignore BUG_ON(). > > If ASSERT is meaningful, correct it. But please don't make BUG_ON worse > and leave it as it is, at least until the time you can remove the > BUG_ONs or convert it. It should print the correct value of why it did > bug, or else it will print just 1, which is of no debug value. Agreed, we want to see the exact value. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 4/6] Btrfs: incremental send, fix invalid path for rmdir operations
On Thu, Jan 5, 2017 at 8:24 AM, robbieko wrote:
> From: Robbie Ko
>
> Under certain situations, an incremental send operation can

Again, missing some word after the word "can". Copy pasting change logs
is not that good.

> a rmdir operation that will make the receiving end fail when
> attempting to execute it, because the path is not exist.
>
> Example scenario:
> Parent snapshot:
> | d259_old/ (ino 259, gen 96)
> | d1/ (ino 258, gen 96)
> | f (ino 257, gen 96)
>
> Send snapshot:
> | d258/ (ino 258, gen 98)
> | d259/ (ino 259, gen 98)
> | d1/ (ino 257, gen 98)

Please try to align things for easier readability.

>
> unlink f
> mkdir o257-98-0
> mkdir o259-98-0
> chown o257-98-0 - uid=0, gid=0
> chmod o257-98-0 - mode=0755
> rmdir o258-96-0
> ERROR: rmdir o258-96-0 failed: No such file or directory

Same comment as before (patch 1/6). Don't just paste the output of
receive -vv out of nowhere without mentioning what it is.

>
> While computing the send stream the following steps happen:
>
> 1) While processing inode 257 we create o257-98-0 and o259-98-0,
>    then delay o257-98-0 rename operation because its new parent
>    in the send snapshot, inode 259, was not yet processed and therefore
>    not yet renamed;
>
> 2) Later we want to delete d1 (ino 258, gen 96) while processing inode
>    258. In order to get its path for delete, we need to check if it is
>    overwritten in the send snapshot. And we find it is overwritten so
>    we delete it via unique name , which leads to error. The reason is
>    we will find out d1 is under parent directory (inode 259) in the send
>    snapshot, and for this case, because d1 (inode 257, gen 98) is not same
>    as d1 (inode 258, gen 96), we conclude d1 has been overwritten.
>
> Fix this by adding generation check for the parent directory. Because
> both parent directory are not identical, we can just skip the overwrite
> check. In addition, inode 256 should not check for this since it is a
> subvolume.
I've heavily reworded the change log and included the patch in my 4.11 integration branch. Thanks. > > Signed-off-by: Robbie Ko > --- > V3: improve the change log > fs/btrfs/send.c | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c > index eaf1c92..139f492 100644 > --- a/fs/btrfs/send.c > +++ b/fs/btrfs/send.c > @@ -1938,6 +1938,19 @@ static int did_overwrite_ref(struct send_ctx *sctx, > if (ret <= 0) > goto out; > > + if (dir != BTRFS_FIRST_FREE_OBJECTID) { > + ret = get_inode_info(sctx->send_root, dir, NULL, , NULL, > +NULL, NULL, NULL); > + if (ret < 0 && ret != -ENOENT) > + goto out; > + if (ret) { > + ret = 0; > + goto out; > + } > + if (gen != dir_gen) > + goto out; > + } > + > /* check if the ref was overwritten by another ref */ > ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len, > _inode, _type); > -- > 1.9.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, "People will forget what you said, people will forget what you did, but people will never forget how you made them feel." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Btrfs: send, fix failure to rename top level inode due to name collision
From: Robbie Ko

Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.

Consider the following example scenario.

Parent snapshot:

 .           (ino 256, gen 8)
 | a1/       (ino 257, gen 9)
 | a2/       (ino 258, gen 9)

Send snapshot:

 .           (ino 256, gen 3)
 | a2/       (ino 257, gen 7)

In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (run in verbose mode, -vv argument):

 rmdir a1
 mkfile o257-7-0
 rename o257-7-0 -> a2
 ERROR: rename o257-7-0 -> a2 failed: Is a directory

What happens when computing the incremental send stream is:

1) An operation to remove the directory with inode number 257 and
   generation 9 is issued.

2) An operation to create the inode with number 257 and generation 7 is
   issued. This creates the inode with an orphanized name of "o257-7-0".

3) An operation to rename the new inode 257 to its final name, "a2", is
   issued. This is incorrect because inode 258, which has the same name
   and is a child of the same parent (root inode 256), was not yet
   processed and therefore no rmdir operation for it was yet issued. The
   rename operation is issued because we fail to detect that the name of
   the new inode 257 collides with inode 258, because their parent, a
   subvolume/snapshot root (inode 256), has a different generation in
   both snapshots.

So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
We can achieve this scenario of different inodes with the same number but different generation values either by mounting a filesystem with the inode cache option (-o inode_cache) or by creating and sending snapshots across different filesystems, like in the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a1 $ mkdir /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inode from the second snapshot has # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt /tmp/1.snap # Receive of snapshot snap2 used to fail. 
$ btrfs receive /mnt /tmp/2.snap

Signed-off-by: Robbie Ko
Reviewed-by: Filipe Manana
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana
---
 fs/btrfs/send.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index d145ce8..2ae32c4 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1681,6 +1681,9 @@ static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
 {
 	int ret;

+	if (ino == BTRFS_FIRST_FREE_OBJECTID)
+		return 1;
+
 	ret = get_cur_inode_state(sctx, ino, gen);
 	if (ret < 0)
 		goto out;
@@ -1866,7 +1869,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
 	 * not deleted and then re-created, if it was then we have no overwrite
 	 * and we can just unlink this entry.
 	 */
-	if (sctx->parent_root) {
+	if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID) {
 		ret = get_inode_info(sctx->parent_root, dir, NULL, , NULL,
 				     NULL, NULL, NULL);
 		if (ret < 0 && ret != -ENOENT)
--
2.7.0.rc3
Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Thu 19-01-17 10:22:36, Jan Kara wrote: > On Thu 19-01-17 09:39:56, Michal Hocko wrote: > > On Tue 17-01-17 18:29:25, Jan Kara wrote: > > > On Tue 17-01-17 17:16:19, Michal Hocko wrote: > > > > > > But before going to play with that I am really wondering whether we > > > > > > need > > > > > > all this with no journal at all. AFAIU what Jack told me it is the > > > > > > journal lock(s) which is the biggest problem from the reclaim > > > > > > recursion > > > > > > point of view. What would cause a deadlock in no journal mode? > > > > > > > > > > We still have the original problem for why we need GFP_NOFS even in > > > > > ext2. If we are in a writeback path, and we need to allocate memory, > > > > > we don't want to recurse back into the file system's writeback path. > > > > > > > > But we do not enter the writeback path from the direct reclaim. Or do > > > > you mean something other than pageout()'s mapping->a_ops->writepage? > > > > There is only try_to_release_page where we get back to the filesystems > > > > but I do not see any NOFS protection in ext4_releasepage. > > > > > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage() > > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0), > > > shrinkers. That's it. So the recursion possibilities are rather more > > > limited > > > than they used to be several years ago and we likely do not need as much > > > GFP_NOFS protection as we used to. > > > > Thanks for making my remark more clear Jack! I would just want to add > > that I was playing with the patch below (it is basically > > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the > > debugging patch which means they are called from within transaction) and > > it didn't hit the lockdep when running xfstests both with or without the > > enabled journal. > > > > So am I still missing something or the nojournal mode is safe and the > > current series is OK wrt. ext*? 
> > I'm convinced the current series is OK, only real life will tell us whether > we missed something or not ;) I would like to extend the changelog of "jbd2: mark the transaction context with the scope GFP_NOFS context". " Please note that setups without journal do not suffer from potential recursion problems and so they do not need the scope protection because neither ->releasepage nor ->evict_inode (which are the only fs entry points from the direct reclaim) can reenter a locked context which is doing the allocation currently. " > > The following patch in its current form is WIP and needs a proper review > > before I post it. > > So jbd2 changes look confusing (although technically correct) to me - we > *always* should run in NOFS context in those place so having GFP_KERNEL > there looks like it is unnecessarily hiding what is going on. So in those > places I'd prefer to keep GFP_NOFS or somehow else make it very clear these > allocations are expected to be GFP_NOFS (and assert that). Otherwise the > changes look good to me. I would really like to get rid most of NOFS direct usage and only dictate it via the scope API otherwise I suspect we will just grow more users and end up in the same situation as we are now currently over time. In principle only the context which changes the reclaim reentrancy policy should care about NOFS and everybody else should just pretend nothing like that exists. There might be few exceptions of course, I am not yet sure whether jbd2 is that case. But I am not proposing this change yet (thanks for checking anyway)... -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid 1 recovery
On Thu, Jan 19, 2017 at 10:15 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Wed, 18 Jan 2017 14:30:28 -0700 as excerpted:
>
>> On Wed, Jan 18, 2017 at 2:07 PM, Jon wrote:
>>> So, I had a raid 1 btrfs system setup on my laptop. Recently I upgraded
>>> the drives and wanted to get my data back. I figured I could just plug
>>> in one drive, but I found that the volume simply would not mount. I
>>> tried the other drive alone and got the same thing. Plugging in both at
>>> the same time and the volume mounted without issue.
>>
>> Requires mount option degraded.
>>
>> If this is a boot volume, this is difficult because the current udev
>> rule prevents a mount attempt so long as all devices for a Btrfs volume
>> aren't present.
>
> OK, so I've known about this from the list for some time, but what is the
> status with regard to udev/systemd (has a bug/issue been filed, results,
> link?), and what are the alternatives, both for upstream, and for a dev,
> either trying to be proactive, or currently facing a refusal to boot due
> to the issue?
>

The only possible solution is to introduce a separate object that
represents btrfs "storage" (the raid array) and exports whether it can be
mounted. Then it is possible to start a timer and decide to mount
degraded after the timer expires. Compare with Linux MD, which is
assembled by udev as well and does exactly that.

Where this object is best implemented, I do not know. It could be on the
btrfs side (export a pseudo-device similar to /dev/mdX for Linux MD). It
could be in user space, as a special service that waits for device
assembly. Abstracting it at the btrfs level is cleaner and does not
depend on a specific user space (a.k.a. udev/systemd). OTOH, zfs
ultimately shares the same problem, so a common solution to multi-device
filesystem handling would be good.
> IOW, I run btrfs raid1 on /, setup before the udev rule I believe as it
> worked back then, and I've known about the issue but haven't followed it
> closely enough to know what I should do if faced with a dead device, or
> what the current status is on alternatives to fix the problem longer
> term, either locally (does simply disabling the rule work?) or upstream,
> and what the alternatives might be along with the reasons this is still
> shipping instead. And I'm asking in ordered to try to remedy that. =:^)
>
btrfs recovery
Hello,

I have a question. After a power outage my system turned into an
unrecoverable state using btrfs (kernel 4.9). Since I'm now running
--init-extent-tree for 3 days, I'm asking how long this process normally
takes and why it outputs millions of lines like:

Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 1562890240 root 262 owner 483059214 offset 0 found 1 wanted 0 back 0x23b0211d0
backpointer mismatch on [1562890240 4096]
adding new data backref on 1562890240 root 262 owner 483059214 offset 0 found 1
Repaired extent references for 1562890240

Please avoid typical answers like "potentially dangerous operation",
since all repair options are declared as potentially dangerous.

Sebastian

--
Mit freundlichen Grüssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz: Berliner Ring 101, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Geschäftsführer: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottsch...@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565
Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Thu 19-01-17 09:39:56, Michal Hocko wrote: > On Tue 17-01-17 18:29:25, Jan Kara wrote: > > On Tue 17-01-17 17:16:19, Michal Hocko wrote: > > > > > But before going to play with that I am really wondering whether we > > > > > need > > > > > all this with no journal at all. AFAIU what Jack told me it is the > > > > > journal lock(s) which is the biggest problem from the reclaim > > > > > recursion > > > > > point of view. What would cause a deadlock in no journal mode? > > > > > > > > We still have the original problem for why we need GFP_NOFS even in > > > > ext2. If we are in a writeback path, and we need to allocate memory, > > > > we don't want to recurse back into the file system's writeback path. > > > > > > But we do not enter the writeback path from the direct reclaim. Or do > > > you mean something other than pageout()'s mapping->a_ops->writepage? > > > There is only try_to_release_page where we get back to the filesystems > > > but I do not see any NOFS protection in ext4_releasepage. > > > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage() > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0), > > shrinkers. That's it. So the recursion possibilities are rather more limited > > than they used to be several years ago and we likely do not need as much > > GFP_NOFS protection as we used to. > > Thanks for making my remark more clear Jack! I would just want to add > that I was playing with the patch below (it is basically > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the > debugging patch which means they are called from within transaction) and > it didn't hit the lockdep when running xfstests both with or without the > enabled journal. > > So am I still missing something or the nojournal mode is safe and the > current series is OK wrt. ext*? 
I'm convinced the current series is OK, only real life will tell us whether we missed something or not ;) > The following patch in its current form is WIP and needs a proper review > before I post it. So jbd2 changes look confusing (although technically correct) to me - we *always* should run in NOFS context in those place so having GFP_KERNEL there looks like it is unnecessarily hiding what is going on. So in those places I'd prefer to keep GFP_NOFS or somehow else make it very clear these allocations are expected to be GFP_NOFS (and assert that). Otherwise the changes look good to me. Honza > --- > fs/ext4/inode.c | 4 ++-- > fs/ext4/mballoc.c | 14 +++--- > fs/ext4/xattr.c | 2 +- > fs/jbd2/journal.c | 4 ++-- > fs/jbd2/revoke.c | 2 +- > fs/jbd2/transaction.c | 2 +- > 6 files changed, 14 insertions(+), 14 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index b7d141c3b810..841cb8c4cb5e 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page, > return __ext4_journalled_writepage(page, len); > > ext4_io_submit_init(_submit, wbc); > - io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS); > + io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL); > if (!io_submit.io_end) { > redirty_page_for_writepage(wbc, page); > unlock_page(page); > @@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t > *handle, > int err = 0; > > page = find_or_create_page(mapping, from >> PAGE_SHIFT, > -mapping_gfp_constraint(mapping, ~__GFP_FS)); > +mapping_gfp_mask(mapping)); > if (!page) > return -ENOMEM; > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index d9fd184b049e..67b97cd6e3d6 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, > ext4_group_t group, > static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, > struct ext4_buddy *e4b) > { > - return ext4_mb_load_buddy_gfp(sb, group, 
e4b, GFP_NOFS); > + return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL); > } > > static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) > @@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct > ext4_allocation_context *ac, > > /* We only do this if the grp has never been initialized */ > if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { > - int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS); > + int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL); > if (ret) > return ret; > } > @@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) > BUG_ON(ac->ac_status != AC_STATUS_FOUND); > BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); > > - pa = kmem_cache_alloc(ext4_pspace_cachep,
Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Tue 17-01-17 18:29:25, Jan Kara wrote: > On Tue 17-01-17 17:16:19, Michal Hocko wrote: > > > > But before going to play with that I am really wondering whether we need > > > > all this with no journal at all. AFAIU what Jack told me it is the > > > > journal lock(s) which is the biggest problem from the reclaim recursion > > > > point of view. What would cause a deadlock in no journal mode? > > > > > > We still have the original problem for why we need GFP_NOFS even in > > > ext2. If we are in a writeback path, and we need to allocate memory, > > > we don't want to recurse back into the file system's writeback path. > > > > But we do not enter the writeback path from the direct reclaim. Or do > > you mean something other than pageout()'s mapping->a_ops->writepage? > > There is only try_to_release_page where we get back to the filesystems > > but I do not see any NOFS protection in ext4_releasepage. > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage() > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0), > shrinkers. That's it. So the recursion possibilities are rather more limited > than they used to be several years ago and we likely do not need as much > GFP_NOFS protection as we used to. Thanks for making my remark more clear Jack! I would just want to add that I was playing with the patch below (it is basically GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the debugging patch which means they are called from within transaction) and it didn't hit the lockdep when running xfstests both with or without the enabled journal. So am I still missing something or the nojournal mode is safe and the current series is OK wrt. ext*? The following patch in its current form is WIP and needs a proper review before I post it. 
---
 fs/ext4/inode.c       |  4 ++--
 fs/ext4/mballoc.c     | 14 +++++++-------
 fs/ext4/xattr.c       |  2 +-
 fs/jbd2/journal.c     |  4 ++--
 fs/jbd2/revoke.c      |  2 +-
 fs/jbd2/transaction.c |  2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b7d141c3b810..841cb8c4cb5e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
 		return __ext4_journalled_writepage(page, len);
 
 	ext4_io_submit_init(&io_submit, wbc);
-	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+	io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
 	if (!io_submit.io_end) {
 		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
@@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	int err = 0;
 
 	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
-				   mapping_gfp_constraint(mapping, ~__GFP_FS));
+				   mapping_gfp_mask(mapping));
 	if (!page)
 		return -ENOMEM;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..67b97cd6e3d6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 			      struct ext4_buddy *e4b)
 {
-	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
 }
 
 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
@@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
 		if (ret)
 			return ret;
 	}
@@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
 	if (pa == NULL)
 		return -ENOMEM;
@@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 	BUG_ON(ext4_pspace_cachep == NULL);
 
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
 	if (pa == NULL)
 		return -ENOMEM;
@@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		}
 	}
 
-	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
+	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
 	if (!ac) {
 		ar->len = 0;
 		*errp =
Re: [PATCH 1/2] btrfs-progs: cmds-check.c: supports inode nbytes fix in lowmem
Hi,

The test flag override only decides whether all tests run in lowmem mode
or not; it can't decide whether a single test runs repair or not. I have
some ideas below:

1. Create a hidden empty file under each test dir that needs to be
   repaired, and edit tests/common.local:_skip_spec() to decide whether
   to run repair based on the existence of that hidden file.

2. Or just add a test.sh under the test dir. I test my patches this way,
   but it may not be graceful enough.

I'm wondering which one is preferred.

Thanks
Su

On 01/17/2017 11:40 PM, David Sterba wrote:
> Hi,
>
> I have some comments, see below.
>
> On Mon, Jan 09, 2017 at 01:38:07PM +0800, Su Yue wrote:
>> Added 'repair_inode_item' which dispatches functions such as
>> 'repair_inode_nbytes_lowmem' to correct errors and
>> 'struct inode_item_fix_info' to store correct values and errors.
>>
>> Signed-off-by: Su Yue
>> ---
>>  cmds-check.c | 161 +++
>>  1 file changed, 152 insertions(+), 9 deletions(-)
>>
>> diff --git a/cmds-check.c b/cmds-check.c
>> index 1dba298..567ca80 100644
>> --- a/cmds-check.c
>> +++ b/cmds-check.c
>> @@ -371,6 +371,17 @@ struct root_item_info {
>>  };
>>
>>  /*
>> + * Use inode_item_fix_info as function check_inode_item's arg.
>> + */
>> +struct inode_item_fix_info {
>> +	u64 ino;
>> +	u64 isize;
>> +	u64 nbytes;
>> +
>> +	int err;
>> +};
>> +
>> +/*
>>   * Error bit for low memory mode check.
>>   *
>>   * Currently no caller cares about it yet. Just internal use for error
>> @@ -1866,13 +1877,16 @@ struct node_refs {
>>  static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
>>  			     struct node_refs *nrefs, u64 level);
>>  static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
>> -			    unsigned int ext_ref);
>> -
>> +			    unsigned int ext_ref,
>> +			    struct inode_item_fix_info *info);
>> +static int repair_inode_item(struct btrfs_root *root,
>> +			     struct inode_item_fix_info *info);
>>  static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
>>  			       struct node_refs *nrefs, int *level, int ext_ref)
>>  {
>>  	struct extent_buffer *cur = path->nodes[0];
>>  	struct btrfs_key key;
>> +	struct inode_item_fix_info info;
>>  	u64 cur_bytenr;
>>  	u32 nritems;
>>  	u64 first_ino = 0;
>> @@ -1881,6 +1895,7 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
>>  	int ret = 0; /* Final return value */
>>  	int err = 0; /* Positive error bitmap */
>>
>> +	memset(&info, 0, sizeof(info));
>>  	cur_bytenr = cur->start;
>>
>>  	/* skip to first inode item or the first inode number change */
>> @@ -1900,8 +1915,26 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
>>  		path->slots[0] = i;
>>  again:
>> -		err |= check_inode_item(root, path, ext_ref);
>> +		err |= check_inode_item(root, path, ext_ref, &info);
>> +
>> +		if (repair && (err & ~LAST_ITEM)) {
>> +			ret = repair_inode_item(root, &info);
>> +			if (ret < 0)
>> +				goto out;
>> +			/*
>> +			 * if some errors were repaired, path shall be searched
>> +			 * again since path has been changed
>> +			 */
>> +			if (ret == 0) {
>> +				btrfs_item_key_to_cpu(path->nodes[0], &key,
>> +						      path->slots[0]);
>> +				btrfs_release_path(path);
>> +				btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> +
>> +				cur = path->nodes[0];
>> +			}
>> +		}
>>  		if (err & LAST_ITEM)
>>  			goto out;
>> @@ -2211,7 +2244,8 @@ out:
>>  }
>>  static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
>> -			    unsigned int ext_ref);
>> +			    unsigned int ext_ref,
>> +			    struct inode_item_fix_info *info);
>>  static int walk_down_tree_v2(struct btrfs_root *root, struct btrfs_path *path,
>>  			     int *level, struct node_refs *nrefs, int ext_ref)
>> @@ -2293,7 +2327,7 @@ static int walk_down_tree_v2(struct btrfs_root *root, struct btrfs_path *path,
>>  	}
>>  	ret = check_child_node(root, cur, path->slots[*level], next);
>> -	if (ret < 0)
>> +	if (ret < 0)
>>  		break;
>>  	if (btrfs_is_leaf(next))
>> @@ -2383,6 +2417,105 @@ out:
>>  	return ret;
>>  }
>> +/*
>> + * Set inode's nbytes to correct value in @info
>> + *
>> + * Returns <0 on error
>> + * Returns 0 on successful repair
>> + */
>> +static int repair_inode_nbytes_lowmem(struct btrfs_trans_handle *trans,
>> +				      struct btrfs_root *root,
>> +				      struct inode_item_fix_info *info)
>> +{
>> +	struct btrfs_inode_item *ei;
>> +	struct btrfs_key key;
>> +	struct btrfs_path path;
>> +	int ret;
>> +
>> +	ASSERT(info);
>> +	key.objectid