Re: Very slow balance / btrfs-transaction
> Should quota support generally be disabled during balances?

If this is true and quota impacts balance throughput, there should at least be an alert message like "Running balance with quota enabled will affect performance" before starting.

Cheers,
Lakshmipathi.G
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Very slow balance / btrfs-transaction
February 4, 2017 1:07 AM, "Goldwyn Rodrigues" wrote:
> On 02/03/2017 06:30 PM, Jorg Bornschein wrote:
>> February 3, 2017 11:26 PM, "Goldwyn Rodrigues" wrote:
>>
>> Hi,
>>
>> I'm currently running a balance (without any filters) on a 4-drive raid1
>> filesystem. The array contains three 3TB drives and one 6TB drive; I'm
>> running the rebalance because the 6TB drive recently replaced a 2TB drive.
>>
>> I know that balance is not supposed to be a fast operation, but this one
>> is now running for ~6 days and it has managed to balance ~18% (754 out of
>> about 4250 chunks balanced (755 considered), 82% left) -- so I expect it
>> to take another ~4 weeks.
>>
>> That seems excessively slow for ~8TiB of data.
>>
>> Is this expected behavior? In case it's not: Is there anything I can do
>> to help debug it?
>>
>>> Do you have quotas enabled?
>>
>> I might have activated it when playing with "snapper" -- I remember using
>> some quota command without knowing what it does.
>>
>> How can I check if it's active? Shall I just disable it with "btrfs quota
>> disable"?
>
> To check your quota limits:
> # btrfs qgroup show
>
> To disable:
> # btrfs quota disable
>
> Yes, please check if disabling quotas makes a difference in execution
> time of btrfs balance.

Quota support was indeed active -- and it warned me that the qgroup data was inconsistent. Disabling quotas had an immediate impact on balance throughput -- it's *much* faster now! From a quick glance at iostat I would guess it's at least a factor of 100 faster.

Should quota support generally be disabled during balances? Or did I somehow push my fs into a weird state where it triggered a slow path?

Thanks!

   j
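The workaround that emerged in this thread -- inspect qgroup state, disable quotas for the duration of the balance, re-enable afterwards -- can be sketched as a small wrapper. This is only a sketch: the mount point is an example, and the script prints the commands rather than running them; on a live system you would replace the `echo`s with the real invocations (note that re-enabling quotas triggers a full qgroup rescan).

```shell
#!/bin/sh
# Sketch: run a balance with qgroup accounting paused.
# The echoes stand in for real btrfs calls; /storage is an example path.
MNT=${1:-/storage}

balance_without_quota() {
    echo "btrfs qgroup show $1"     # inspect current qgroup state
    echo "btrfs quota disable $1"   # avoid per-extent qgroup updates during balance
    echo "btrfs balance start $1"
    echo "btrfs quota enable $1"    # re-enable afterwards (starts a rescan)
}

balance_without_quota "$MNT"
```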
Re: Very slow balance / btrfs-transaction
On 02/03/2017 06:30 PM, Jorg Bornschein wrote:
> February 3, 2017 11:26 PM, "Goldwyn Rodrigues" wrote:
>
>>> Hi,
>>>
>>> I'm currently running a balance (without any filters) on a 4-drive raid1
>>> filesystem. The array contains three 3TB drives and one 6TB drive; I'm
>>> running the rebalance because the 6TB drive recently replaced a 2TB drive.
>>>
>>> I know that balance is not supposed to be a fast operation, but this one
>>> is now running for ~6 days and it has managed to balance ~18% (754 out of
>>> about 4250 chunks balanced (755 considered), 82% left) -- so I expect it
>>> to take another ~4 weeks.
>>>
>>> That seems excessively slow for ~8TiB of data.
>>>
>>> Is this expected behavior? In case it's not: Is there anything I can do
>>> to help debug it?
>>
>> Do you have quotas enabled?
>
> I might have activated it when playing with "snapper" -- I remember using
> some quota command without knowing what it does.
>
> How can I check if it's active? Shall I just disable it with "btrfs quota
> disable"?

To check your quota limits:
# btrfs qgroup show

To disable:
# btrfs quota disable

Yes, please check if disabling quotas makes a difference in execution time of btrfs balance.

--
Goldwyn
Re: Very slow balance / btrfs-transaction
February 3, 2017 11:26 PM, "Goldwyn Rodrigues" wrote:
>> Hi,
>>
>> I'm currently running a balance (without any filters) on a 4-drive raid1
>> filesystem. The array contains three 3TB drives and one 6TB drive; I'm
>> running the rebalance because the 6TB drive recently replaced a 2TB drive.
>>
>> I know that balance is not supposed to be a fast operation, but this one
>> is now running for ~6 days and it has managed to balance ~18% (754 out of
>> about 4250 chunks balanced (755 considered), 82% left) -- so I expect it
>> to take another ~4 weeks.
>>
>> That seems excessively slow for ~8TiB of data.
>>
>> Is this expected behavior? In case it's not: Is there anything I can do
>> to help debug it?
>
> Do you have quotas enabled?

I might have activated it when playing with "snapper" -- I remember using some quota command without knowing what it does.

How can I check if it's active? Shall I just disable it with "btrfs quota disable"?

   j
Re: Very slow balance / btrfs-transaction
On 02/03/2017 04:13 PM, j...@capsec.org wrote:
> Hi,
>
> I'm currently running a balance (without any filters) on a 4-drive raid1
> filesystem. The array contains three 3TB drives and one 6TB drive; I'm
> running the rebalance because the 6TB drive recently replaced a 2TB drive.
>
> I know that balance is not supposed to be a fast operation, but this one
> is now running for ~6 days and it has managed to balance ~18% (754 out of
> about 4250 chunks balanced (755 considered), 82% left) -- so I expect it
> to take another ~4 weeks.
>
> That seems excessively slow for ~8TiB of data.
>
> Is this expected behavior? In case it's not: Is there anything I can do
> to help debug it?

Do you have quotas enabled?

--
Goldwyn
Re: [PATCH 0/9] Lowmem mode fsck fixes with fsck-tests framework update
Hey Qu

On Fri, 2017-02-03 at 14:20 +0800, Qu Wenruo wrote:
> Great thanks for that!

You're welcome. :)

> I also added missing error message output for other places I found, and
> updated the branch; the name remains "lowmem_tests"
>
> Please try it.

# btrfs check /dev/nbd0 ; echo $?
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/nbd0
UUID: 326d292d-f97b-43ca-b1e8-c722d3474719
found 7519512838144 bytes used, no error found
total csum bytes: 7330834320
total tree bytes: 10902437888
total fs tree bytes: 2019704832
total extent tree bytes: 1020149760
btree space waste bytes: 925714197
file data blocks allocated: 7509228494848
 referenced 7630551511040
0

=> looks good this time :)

btw: In your commit messages, please change my email to cales...@scientia.net everywhere... I accidentally used my university address (christoph.anton.mitte...@lmu.de) sometimes when sending mail.

Cheers,
Chris.
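The `; echo $?` idiom in the transcript above prints the exit status of the preceding command, which is what the fsck-tests framework inspects to decide whether a check passed. A tiny illustration, with `true` and `false` standing in for a passing and a failing `btrfs check`:

```shell
#!/bin/sh
# The '; echo $?' idiom: print the exit status of the previous command.
# 'true'/'false' are stand-ins for a passing/failing 'btrfs check'.
true ; echo $?    # prints 0 (success, i.e. "no error found")
false ; echo $?   # prints 1 (the check reported a problem)
```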
Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists
On Thu, 02 Feb 2017 13:01:03 +0100, Marc Joliet wrote:

> On Sunday 28 August 2016 15:29:08 Kai Krakow wrote:
>> Hello list!
>
> Hi list
>
>> It happened again. While using VirtualBox the following crash happened,
>> and btrfs check found a lot of errors which it couldn't repair. Earlier
>> that day my system crashed, which may already have introduced errors into
>> my filesystem. Apparently, I couldn't create an image (not enough space
>> available); I can only give this trace from dmesg:
>>
>> [44819.903435] [ cut here ]
>> [44819.903443] WARNING: CPU: 3 PID: 2787 at fs/btrfs/extent-tree.c:2963 btrfs_run_delayed_refs+0x26c/0x290
>> [44819.903444] BTRFS: Transaction aborted (error -17)
>> [44819.903445] Modules linked in: nls_iso8859_15 nls_cp437 vfat fat fuse rfcomm veth af_packet ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc w83627ehf bnep hwmon_vid cachefiles btusb btintel bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec rfkill snd_hwdep snd_hda_core snd_pcm snd_timer coretemp hwmon snd r8169 mii kvm_intel kvm iTCO_wdt iTCO_vendor_support rtc_cmos irqbypass soundcore ip_tables uas usb_storage nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO) nvidia(PO) efivarfs unix ipv6
>> [44819.903484] CPU: 3 PID: 2787 Comm: BrowserBlocking Tainted: P O 4.7.2-gentoo #2
>> [44819.903485] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
>> [44819.903487] 8130af2d 8800b7d03d20
>> [44819.903489] 810865fa 880409374428 8800b7d03d70 8803bf299760
>> [44819.903491] ffef 8803f677f000 8108666a
>> [44819.903493] Call Trace:
>> [44819.903496] [] ? dump_stack+0x46/0x59
>> [44819.903500] [] ? __warn+0xba/0xe0
>> [44819.903502] [] ? warn_slowpath_fmt+0x4a/0x50
>> [44819.903504] [] ? btrfs_run_delayed_refs+0x26c/0x290
>> [44819.903507] [] ? btrfs_release_path+0xe/0x80
>> [44819.903509] [] ? btrfs_start_dirty_block_groups+0x2da/0x420
>> [44819.903511] [] ? btrfs_commit_transaction+0x143/0x990
>> [44819.903514] [] ? kmem_cache_free+0x165/0x180
>> [44819.903516] [] ? btrfs_wait_ordered_range+0x7c/0x110
>> [44819.903518] [] ? btrfs_sync_file+0x286/0x360
>> [44819.903522] [] ? do_fsync+0x33/0x60
>> [44819.903524] [] ? SyS_fdatasync+0xa/0x10
>> [44819.903528] [] ? entry_SYSCALL_64_fastpath+0x13/0x8f
>> [44819.903529] ---[ end trace 6944811e170a0e57 ]---
>> [44819.903531] BTRFS: error (device bcache2) in btrfs_run_delayed_refs:2963: errno=-17 Object already exists
>> [44819.903533] BTRFS info (device bcache2): forced readonly
>
> I got the same error myself, with this stack trace:
>
> -- Logs begin at Fr 2016-04-01 17:07:28 CEST, end at Mi 2017-02-01 22:03:57 CET. --
> Feb 01 01:46:26 diefledermaus kernel: [ cut here ]
> Feb 01 01:46:26 diefledermaus kernel: WARNING: CPU: 1 PID: 16727 at fs/btrfs/extent-tree.c:2967 btrfs_run_delayed_refs+0x278/0x2b0
> Feb 01 01:46:26 diefledermaus kernel: BTRFS: Transaction aborted (error -17)
> Feb 01 01:46:26 diefledermaus kernel: BTRFS: error (device sdb2) in btrfs_run_delayed_refs:2967: errno=-17 Object already exists
> Feb 01 01:46:27 diefledermaus kernel: BTRFS info (device sdb2): forced readonly
> Feb 01 01:46:27 diefledermaus kernel: Modules linked in: msr ctr ccm tun arc4 snd_hda_codec_idt applesmc snd_hda_codec_generic input_polldev hwmon snd_hda_intel ath5k snd_hda_codec mac80211 snd_hda_core ath snd_pcm cfg80211 snd_timer video acpi_cpufreq snd backlight sky2 rfkill processor button soundcore sg usb_storage sr_mod cdrom ata_generic pata_acpi uhci_hcd ahci libahci ata_piix libata ehci_pci ehci_hcd
> Feb 01 01:46:27 diefledermaus kernel: CPU: 1 PID: 16727 Comm: kworker/u4:0 Not tainted 4.9.6-gentoo #1
> Feb 01 01:46:27 diefledermaus kernel: Hardware name: Apple Inc. Macmini2,1/Mac-F4208EAA, BIOS MM21.88Z.009A.B00.0706281359 06/28/07
> Feb 01 01:46:27 diefledermaus kernel: Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
> Feb 01 01:46:27 diefledermaus kernel: 812cf739 c9000285fd60
> Feb 01 01:46:27 diefledermaus kernel: 8104908a 8800428df1e0 c9000285fdb0 0020
> Feb 01 01:46:27 diefledermaus kernel: 880003c1b1b8 8800bb73e900 810490fa
> Feb 01 01:46:27 diefledermaus kernel: Call Trace:
> Feb 01 01:46:27 diefledermaus kernel: [] ? dump_stack+0x46/0x5d
> Feb 01 01:46:27 diefledermaus kernel: [] ? __warn+0xba/0xe0
> Feb 01 01:46:27 diefledermaus kernel: [] ? warn_slowpath_fmt+0x4a/0x50
> Feb 01 01:46:27 diefledermaus kernel: [] ?
Very slow balance / btrfs-transaction
Hi,

I'm currently running a balance (without any filters) on a 4-drive raid1 filesystem. The array contains three 3TB drives and one 6TB drive; I'm running the rebalance because the 6TB drive recently replaced a 2TB drive.

I know that balance is not supposed to be a fast operation, but this one is now running for ~6 days and it has managed to balance ~18% (754 out of about 4250 chunks balanced (755 considered), 82% left) -- so I expect it to take another ~4 weeks.

That seems excessively slow for ~8TiB of data.

Is this expected behavior? In case it's not: Is there anything I can do to help debug it?

The 4 individual devices are bcache devices with currently no SSD cache partition attached; the bcache backing devices sit on top of LUKS-encrypted devices.

Maybe a few words about the history of this fs: It used to be a 1-drive btrfs on top of a bcache partition with a 30GiB SSD cache (actively used for >1 year). During the last month, I gradually added devices (always with active bcaches). At some point, after adding the 4th device, I deactivated (detached) the bcache caching device and instead activated raid1 for data and metadata and ran a rebalance (which was reasonably fast -- I don't remember how fast exactly, but probably <24h). The final steps that led to the current situation: I activated "nossd" and replaced the smallest device with "btrfs dev replace" (which was also reasonably fast, <12h).
Best & thanks,

   j

--

[joerg@dorsal ~]$ lsblk
NAME              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                 8:0    0 111.8G  0 disk
├─sda1              8:1    0     1G  0 part  /boot
└─sda2              8:2    0 110.8G  0 part
  └─crypted       254:0    0 110.8G  0 crypt
    ├─ssd-root    254:1    0  72.8G  0 lvm   /
    ├─ssd-swap    254:2    0     8G  0 lvm   [SWAP]
    └─ssd-cache   254:3    0    30G  0 lvm
sdb                 8:16   0   2.7T  0 disk
└─sdb1              8:17   0   2.7T  0 part
  └─crypted-sdb   254:7    0   2.7T  0 crypt
    └─bcache2     253:2    0   2.7T  0 disk
sdc                 8:32   0   2.7T  0 disk
└─sdc1              8:33   0   2.7T  0 part
  └─crypted-sdc   254:4    0   2.7T  0 crypt
    └─bcache1     253:1    0   2.7T  0 disk
sdd                 8:48   0   2.7T  0 disk
└─sdd1              8:49   0   2.7T  0 part
  └─crypted-sdd   254:6    0   2.7T  0 crypt
    └─bcache0     253:0    0   2.7T  0 disk
sde                 8:64   0   5.5T  0 disk
└─sde1              8:65   0   5.5T  0 part
  └─crypted-sde   254:5    0   5.5T  0 crypt
    └─bcache3     253:3    0   5.5T  0 disk  /storage

--

[joerg@dorsal ~]$ sudo btrfs fi usage -h /storage/
Overall:
    Device size:        13.64TiB
    Device allocated:    8.35TiB
    Device unallocated:  5.29TiB
    Device missing:        0.00B
    Used:                8.34TiB
    Free (estimated):    2.65TiB  (min: 2.65TiB)
    Data ratio:             2.00
    Metadata ratio:         2.00
    Global reserve:    512.00MiB  (used: 15.77MiB)

Data,RAID1: Size:4.17TiB, Used:4.16TiB
   /dev/bcache0    2.38TiB
   /dev/bcache1    2.37TiB
   /dev/bcache2    2.38TiB
   /dev/bcache3    1.20TiB

Metadata,RAID1: Size:9.00GiB, Used:7.49GiB
   /dev/bcache1    8.00GiB
   /dev/bcache2    1.00GiB
   /dev/bcache3    9.00GiB

System,RAID1: Size:32.00MiB, Used:624.00KiB
   /dev/bcache1   32.00MiB
   /dev/bcache3   32.00MiB

Unallocated:
   /dev/bcache0  355.52GiB
   /dev/bcache1  356.49GiB
   /dev/bcache2  355.52GiB
   /dev/bcache3    4.25TiB

--

[joerg@dorsal ~]$ ps -xal | grep btrfs
1 0 227 2 0 -20 0 0 - S< ? 0:00 [btrfs-worker]
1 0 229 2 0 -20 0 0 - S< ? 0:00 [btrfs-worker-hi]
1 0 230 2 0 -20 0 0 - S< ? 0:00 [btrfs-delalloc]
1 0 231 2 0 -20 0 0 - S< ? 0:00 [btrfs-flush_del]
1 0 232 2 0 -20 0 0 - S< ? 0:00 [btrfs-cache]
1 0 233 2 0 -20 0 0 - S< ? 0:00 [btrfs-submit]
1 0 234 2 0 -20 0 0 - S< ? 0:00 [btrfs-fixup]
1 0 235 2 0 -20 0 0 - S< ? 0:00 [btrfs-endio]
1 0 236 2 0 -20 0 0 - S< ? 0:00 [btrfs-endio-met]
1 0 237 2 0 -20 0 0 - S< ? 0:00 [btrfs-endio-met]
1 0 238 2 0 -20 0 0 - S< ? 0:00 [btrfs-endio-rai]
1 0 239 2 0 -20 0 0 - S< ? 0:00 [btrfs-endio-rep]
1 0 240 2 0 -20 0 0 - S< ? 0:00 [btrfs-rmw]
1 0 241 2 0 -20 0 0 - S< ? 0:00 [btrfs-endio-wri]
1 0 242 2 0 -20 0 0 - S< ? 0:00 [btrfs-freespace]
1 0 243 2 0 -20 0 0 - S< ? 0:00 [btrfs-delayed-m]
1 0 244 2 0 -20 0
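The "~4 more weeks" estimate in the report above follows directly from the chunk counters: 754 of ~4250 chunks done after ~6 days extrapolates, at a constant rate, to roughly four more weeks. A quick sketch of that arithmetic (figures taken from the message; the constant-rate assumption is of course a simplification):

```shell
#!/bin/sh
# Extrapolate remaining balance time from chunk progress.
# Numbers from the report: 754 of ~4250 chunks balanced in ~6 days.
awk 'BEGIN {
    done = 754; total = 4250; days = 6.0
    rate = done / days                # chunks balanced per day
    left_days = (total - done) / rate # days remaining at the same rate
    printf "~%.1f more weeks\n", left_days / 7
}'
# prints: ~4.0 more weeks
```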
[PATCH v2] btrfs-progs: better document btrfs receive security
This adds some extra documentation to the btrfs-receive manpage that explains some of the security-related aspects of btrfs-receive. The first part covers the fact that the subvolume being received is writable until the receive finishes, and the second covers the current lack of sanity checking of the send stream.

Signed-off-by: Austin S. Hemmelgarn
Suggested-by: Graham Cobb
---
Changes since v1:
* Updated the description based on suggestions from Graham Cobb.

Inspired by a recent thread on the ML. This could probably be more thorough, but I felt it was more important to get it documented as quickly as possible, and this should cover the basic info that most people will care about.

 Documentation/btrfs-receive.asciidoc | 20
 1 file changed, 20 insertions(+)

diff --git a/Documentation/btrfs-receive.asciidoc b/Documentation/btrfs-receive.asciidoc
index a6838e5e..1ada396f 100644
--- a/Documentation/btrfs-receive.asciidoc
+++ b/Documentation/btrfs-receive.asciidoc
@@ -31,7 +31,7 @@ the stream, and print the stream metadata, one operation per line.
 3. default subvolume has changed or you didn't mount the filesystem at the
 toplevel subvolume
 
-A subvolume is made read-only after the receiving process finishes succesfully.
+A subvolume is made read-only after the receiving process finishes succesfully (see BUGS below).
 
 `Options`
 
@@ -68,6 +68,26 @@ dump the stream metadata, one line per operation
 +
 Does not require the 'path' parameter. The filesystem chanded.
 
+BUGS
+
+*btrfs receive* sets the subvolume read-only after it completes
+successfully. However, while the receive is in progress, users who have
+write access to files or directories in the receiving 'path' can add,
+remove, or modify files, in which case the resulting read-only subvolume
+will not be an exact copy of the sent subvolume.
+
+If the intention is to create an exact copy, the receiving 'path'
+should be protected from access by users until the receive operation
+has completed and the subvolume is set to read-only.
+
+Additionally, receive does not currently do a very good job of validating
+that an incremental send streams actually makes sense, and it is thus
+possible for a specially crafted send stream to create a subvolume with
+reflinks to arbitrary files in the same filesystem. Because of this,
+users are advised to not use *btrfs receive* on send streams from
+untrusted sources, and to protect trusted streams when sending them
+across untrusted networks.
+
 EXIT STATUS
 ---
 *btrfs receive* returns a zero exit status if it succeeds. Non zero is
-- 
2.11.0
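The "protect the receiving 'path'" advice from the patch above can be sketched as a small wrapper: create the destination directory with permissions that keep other non-root users out before the receive starts, and only relax them once the received subvolume has been set read-only. A hedged sketch -- the directory name is an example, and the `btrfs receive` line is shown as a comment since it needs a real btrfs mount and a send stream:

```shell
#!/bin/sh
# Sketch: keep other users out of the receive destination until the
# received subvolume has been made read-only.
set -e

DEST=${1:-/tmp/receive-dest}   # example path; normally a directory on a btrfs mount

mkdir -p "$DEST"
chmod 700 "$DEST"              # only the owner (typically root) may enter

# On a real filesystem you would now run, e.g.:
#   btrfs receive "$DEST" < snapshot.send
# and re-open the directory only after receive has completed.

stat -c '%a' "$DEST"           # prints: 700
```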
Re: btrfs receive leaves new subvolume modifiable during operation
On 2017-02-03 14:17, Graham Cobb wrote:
> On 03/02/17 16:01, Austin S. Hemmelgarn wrote:
>> Ironically, I ended up having time sooner than I thought. The message
>> doesn't appear to be in any of the archives yet, but the message ID is:
>> <20170203134858.75210-1-ahferro...@gmail.com>
>
> Ah. I didn't notice it until after I had sent my message.

No worries.

>> I actually like how you explained things a bit better though, so if you
>> are OK with it I'll update the patch I sent using your description (and
>> credit you in the commit message too of course).
>
> You are welcome to use any of my phrasing or approach, of course!

I'll send out the new version shortly then.
Re: btrfs receive leaves new subvolume modifiable during operation
On 03/02/17 16:01, Austin S. Hemmelgarn wrote:
> Ironically, I ended up having time sooner than I thought. The message
> doesn't appear to be in any of the archives yet, but the message ID is:
> <20170203134858.75210-1-ahferro...@gmail.com>

Ah. I didn't notice it until after I had sent my message.

> I actually like how you explained things a bit better though, so if you
> are OK with it I'll update the patch I sent using your description (and
> credit you in the commit message too of course).

You are welcome to use any of my phrasing or approach, of course!
Re: [PATCH 08/24] btrfs: Convert to separately allocated bdi
On Thu, Feb 02, 2017 at 06:34:06PM +0100, Jan Kara wrote:
> Allocate struct backing_dev_info separately instead of embedding it
> inside superblock. This unifies handling of bdi among users.

Looks good.

Reviewed-by: Liu Bo

Thanks,
-liubo

>
> CC: Chris Mason
> CC: Josef Bacik
> CC: David Sterba
> CC: linux-btrfs@vger.kernel.org
> Signed-off-by: Jan Kara
> ---
>  fs/btrfs/ctree.h   |  1 -
>  fs/btrfs/disk-io.c | 36 +++-
>  fs/btrfs/super.c   |  7 +++
>  3 files changed, 14 insertions(+), 30 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6a823719b6c5..1dc06f66dfcf 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -801,7 +801,6 @@ struct btrfs_fs_info {
>  	struct btrfs_super_block *super_for_commit;
>  	struct super_block *sb;
>  	struct inode *btree_inode;
> -	struct backing_dev_info bdi;
>  	struct mutex tree_log_mutex;
>  	struct mutex transaction_kthread_mutex;
>  	struct mutex cleaner_mutex;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 37a31b12bb0c..b25723e729c0 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1810,21 +1810,6 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
>  	return ret;
>  }
>
> -static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
> -{
> -	int err;
> -
> -	err = bdi_setup_and_register(bdi, "btrfs");
> -	if (err)
> -		return err;
> -
> -	bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
> -	bdi->congested_fn = btrfs_congested_fn;
> -	bdi->congested_data = info;
> -	bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
> -	return 0;
> -}
> -
>  /*
>   * called by the kthread helper functions to finally call the bio end_io
>   * functions. This is where read checksum verification actually happens
> @@ -2598,16 +2583,10 @@ int open_ctree(struct super_block *sb,
>  		goto fail;
>  	}
>
> -	ret = setup_bdi(fs_info, &fs_info->bdi);
> -	if (ret) {
> -		err = ret;
> -		goto fail_srcu;
> -	}
> -
>  	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
>  	if (ret) {
>  		err = ret;
> -		goto fail_bdi;
> +		goto fail_srcu;
>  	}
>  	fs_info->dirty_metadata_batch = PAGE_SIZE * (1 + ilog2(nr_cpu_ids));
> @@ -2715,7 +2694,6 @@ int open_ctree(struct super_block *sb,
>
>  	sb->s_blocksize = 4096;
>  	sb->s_blocksize_bits = blksize_bits(4096);
> -	sb->s_bdi = &fs_info->bdi;
>
>  	btrfs_init_btree_inode(fs_info);
>
> @@ -2912,9 +2890,12 @@ int open_ctree(struct super_block *sb,
>  		goto fail_sb_buffer;
>  	}
>
> -	fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super);
> -	fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages, SZ_4M / PAGE_SIZE);
> +	sb->s_bdi->congested_fn = btrfs_congested_fn;
> +	sb->s_bdi->congested_data = fs_info;
> +	sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
> +	sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
> +	sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
> +	sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
>
>  	sb->s_blocksize = sectorsize;
>  	sb->s_blocksize_bits = blksize_bits(sectorsize);
> @@ -3282,8 +3263,6 @@ int open_ctree(struct super_block *sb,
>  	percpu_counter_destroy(&fs_info->delalloc_bytes);
>  fail_dirty_metadata_bytes:
>  	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
> -fail_bdi:
> -	bdi_destroy(&fs_info->bdi);
>  fail_srcu:
>  	cleanup_srcu_struct(&fs_info->subvol_srcu);
>  fail:
> @@ -4010,7 +3989,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
>  	percpu_counter_destroy(&fs_info->delalloc_bytes);
>  	percpu_counter_destroy(&fs_info->bio_counter);
> -	bdi_destroy(&fs_info->bdi);
>  	cleanup_srcu_struct(&fs_info->subvol_srcu);
>
>  	btrfs_free_stripe_hash_table(fs_info);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b5ae7d3d1896..08ef08b63132 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1133,6 +1133,13 @@ static int btrfs_fill_super(struct super_block *sb,
>  #endif
>  	sb->s_flags |= MS_I_VERSION;
>  	sb->s_iflags |= SB_I_CGROUPWB;
> +
> +	err = super_setup_bdi(sb);
> +	if (err) {
> +		btrfs_err(fs_info, "super_setup_bdi failed");
> +		return err;
> +	}
> +
>  	err = open_ctree(sb, fs_devices, (char *)data);
>  	if (err) {
>  		btrfs_err(fs_info, "open_ctree failed");
> --
> 2.10.2
Re: [PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems
On Fri, Feb 03, 2017 at 02:50:42PM +0100, Jan Kara wrote:
> On Thu 02-02-17 11:28:27, Liu Bo wrote:
>> Hi,
>>
>> On Thu, Feb 02, 2017 at 06:34:02PM +0100, Jan Kara wrote:
>>> Provide helper functions for setting up dynamically allocated
>>> backing_dev_info structures for filesystems and cleaning them up on
>>> superblock destruction.
>>
>> Just one concern, will this cause problems for multiple superblock cases
>> like nfs with nosharecache?
>
> Can you elaborate a bit? I've looked for a while at what nfs with
> nosharecache does but I didn't see how it would influence anything with
> bdis...

Oh, I missed that bdi_seq was static, so it should be fine. (I was worried that nfs with nosharecache would have multiple superblocks, and that each superblock's bdi would use the same bdi name, nfs-xx.)

Thanks for the reply.

Thanks,
-liubo
[PATCH] btrfs-progs: Remove unused function arg in delete_extent_records
From: Goldwyn Rodrigues

new_len is not used in delete_extent_records().

Signed-off-by: Goldwyn Rodrigues
---
 cmds-check.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 84e1d99..9fb85e4 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7969,7 +7969,7 @@ out:
 static int delete_extent_records(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
 				struct btrfs_path *path,
-				u64 bytenr, u64 new_len)
+				u64 bytenr)
 {
 	struct btrfs_key key;
 	struct btrfs_key found_key;
@@ -8975,7 +8975,7 @@ static int fixup_extent_refs(struct btrfs_fs_info *info,
 	/* step two, delete all the existing records */
 	ret = delete_extent_records(trans, info->extent_root, &path,
-				    rec->start, rec->max_size);
+				    rec->start);
 	if (ret < 0)
 		goto out;
-- 
2.10.2
Re: btrfs receive leaves new subvolume modifiable during operation
On 2017-02-03 10:44, Graham Cobb wrote:
> On 03/02/17 12:44, Austin S. Hemmelgarn wrote:
>> I can look at making a patch for this, but it may be next week before I
>> have time (I'm not great at multi-tasking when it comes to software
>> development, and I'm in the middle of helping to fix a bug in Ansible
>> right now).
>
> That would be great, Austin! It is about 15 years since I last submitted
> a patch under kernel development patch rules and things have changed a
> fair bit in that time. So if you are set up to do it that sounds good.
>
> As a starting point, I have created a suggested text (patch attached).

Ironically, I ended up having time sooner than I thought. The message doesn't appear to be in any of the archives yet, but the message ID is:
<20170203134858.75210-1-ahferro...@gmail.com>

I actually like how you explained things a bit better though, so if you are OK with it I'll update the patch I sent using your description (and credit you in the commit message too of course).
Re: btrfs receive leaves new subvolume modifiable during operation
On 03/02/17 12:44, Austin S. Hemmelgarn wrote:
> I can look at making a patch for this, but it may be next week before I
> have time (I'm not great at multi-tasking when it comes to software
> development, and I'm in the middle of helping to fix a bug in Ansible
> right now).

That would be great, Austin! It is about 15 years since I last submitted a patch under kernel development patch rules and things have changed a fair bit in that time. So if you are set up to do it that sounds good.

As a starting point, I have created a suggested text (patch attached).

diff --git a/Documentation/btrfs-receive.asciidoc b/Documentation/btrfs-receive.asciidoc
index 6be4aa6..db525d9 100644
--- a/Documentation/btrfs-receive.asciidoc
+++ b/Documentation/btrfs-receive.asciidoc
@@ -31,7 +31,7 @@ the stream, and print the stream metadata, one operation per line.
 3. default subvolume has changed or you didn't mount the filesystem at the
 toplevel subvolume
 
-A subvolume is made read-only after the receiving process finishes succesfully.
+A subvolume is made read-only after the receiving process finishes succesfully (see BUGS below).
 
 `Options`
 
@@ -73,6 +73,16 @@ EXIT STATUS
 *btrfs receive* returns a zero exit status if it succeeds. Non zero is
 returned in case of failure.
 
+BUGS
+
+*btrfs receive* sets the subvolume read-only after it completes successfully.
+However, while the receive is in progress, users who have write access to files
+or directories in the receiving 'path' can add, remove or modify files, in which
+case the resulting read-only subvolume will not be a copy of the sending subvolume.
+
+If the intention is to create an exact copy, the receiving 'path' should be protected
+from access by users until the receive has completed and the subvolume set to read-only.
+
 AVAILABILITY
 *btrfs* is part of btrfs-progs.
Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Mon 30-01-17 09:12:10, Michal Hocko wrote:
> On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
>> On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
>>> If this ever turns out to be a problem, and with the vmapped stacks we
>>> have good chances to get a proper stack trace on a potential overflow,
>>> we can add the scope API around the problematic code path with the
>>> explanation why it is needed.
>>
>> Yeah, or maybe we can automate it? Can the reclaim code check how
>> much stack space is left and do the right thing automatically?
>
> I am not sure how to do that. Checking for some magic value sounds quite
> fragile to me. It also sounds a bit strange to focus only on the reclaim
> while other code paths might suffer from the same problem.
>
> What is actually the deepest possible call chain from the slab reclaim
> where I stopped? I have tried to follow that path but hit the callback
> wall quite early.
>
>> The reason why I'm nervous is that nojournal mode is not a common
>> configuration, and "wait until production systems start failing" is
>> not a strategy that I or many SRE-types find comforting.
>
> I understand that, but I would be much happier if we made the
> decision based on actual data rather than a fear something would
> break down.

ping on this. I would really like to move forward here and target the 4.11 merge window. Is your concern so serious as to block this patch?

--
Michal Hocko
SUSE Labs
Re: Btrfs Heatmap - v4 ... colors!!
On 02/03/2017 03:36 PM, Timofey Titovets wrote: > > ➜ python-btrfs git:(master) ✗ tar -vxf python-btrfs-5-1-any.pkg.tar.xz > .PKGINFO > .BUILDINFO > .MTREE > usr/ > usr/lib/ > usr/lib/python3.6/ > usr/lib/python3.6/site-packages/ > usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/ > usr/lib/python3.6/site-packages/btrfs/ > usr/lib/python3.6/site-packages/btrfs/__pycache__/ > usr/lib/python3.6/site-packages/btrfs/utils.py > usr/lib/python3.6/site-packages/btrfs/ioctl.py > usr/lib/python3.6/site-packages/btrfs/ctree.py > usr/lib/python3.6/site-packages/btrfs/crc32c.py > usr/lib/python3.6/site-packages/btrfs/__init__.py > usr/lib/python3.6/site-packages/btrfs/__pycache__/utils.cpython-36.pyc > usr/lib/python3.6/site-packages/btrfs/__pycache__/ioctl.cpython-36.pyc > usr/lib/python3.6/site-packages/btrfs/__pycache__/ctree.cpython-36.pyc > usr/lib/python3.6/site-packages/btrfs/__pycache__/crc32c.cpython-36.pyc > usr/lib/python3.6/site-packages/btrfs/__pycache__/__init__.cpython-36.pyc > usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/installed-files.txt > usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/PKG-INFO > usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/dependency_links.txt > usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/top_level.txt > usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/SOURCES.txt > > I think it's building automaticaly by: > PIP_CONFIG_FILE=/dev/null pip install --isolated --root="$pkgdir" > --ignore-installed --no-deps btrfs > Even better :-) Thanks, -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs Heatmap - v4 ... colors!!
2017-02-03 17:27 GMT+03:00 Hans van Kranenburg: > On 02/03/2017 03:18 PM, Timofey Titovets wrote: >> 2017-02-03 15:57 GMT+03:00 Hans van Kranenburg >> : >>> On 02/03/2017 12:25 PM, Timofey Titovets wrote: Thank you for your great work: JFYI Packaged in AUR: https://aur.archlinux.org/packages/python-btrfs-heatmap/ >>> >>> Hey, thanks. >>> >>> Just wondering... what is that btrfs.py you refer to in... >>> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=python-btrfs ? >>> >>> It's a package, not a single file, so maybe that compile command doesn't >>> do anything? I'm not familiar with arch, so correct me if I'm wrong. >>> >>> . >>> ├── btrfs >>> │ ├── crc32c.py >>> │ ├── ctree.py >>> │ ├── __init__.py >>> │ ├── ioctl.py >>> │ ├── utils.py >>> >> >> You right, compile command are useless, so i did remove it, thanks. > > -$ python2 -m compileall btrfs > Listing btrfs ... > Compiling btrfs/__init__.py ... > Compiling btrfs/crc32c.py ... > Compiling btrfs/ctree.py ... > Compiling btrfs/ioctl.py ... > Compiling btrfs/utils.py ... > > . > ├── btrfs > │ ├── crc32c.py > │ ├── crc32c.pyc > │ ├── ctree.py > │ ├── ctree.pyc > │ ├── __init__.py > │ ├── __init__.pyc > │ ├── ioctl.py > │ ├── ioctl.pyc > │ ├── utils.py > │ └── utils.pyc > > > -$ python3 -m compileall btrfs > Listing 'btrfs'... > Compiling 'btrfs/__init__.py'... > Compiling 'btrfs/crc32c.py'... > Compiling 'btrfs/ctree.py'... > Compiling 'btrfs/ioctl.py'... > Compiling 'btrfs/utils.py'... > > . 
> ├── btrfs > │ ├── crc32c.py > │ ├── ctree.py > │ ├── __init__.py > │ ├── ioctl.py > │ ├── __pycache__ > │ │ ├── crc32c.cpython-35.pyc > │ │ ├── ctree.cpython-35.pyc > │ │ ├── __init__.cpython-35.pyc > │ │ ├── ioctl.cpython-35.pyc > │ │ └── utils.cpython-35.pyc > │ └── utils.py > > > -- > Hans van Kranenburg ➜ python-btrfs git:(master) ✗ tar -vxf python-btrfs-5-1-any.pkg.tar.xz .PKGINFO .BUILDINFO .MTREE usr/ usr/lib/ usr/lib/python3.6/ usr/lib/python3.6/site-packages/ usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/ usr/lib/python3.6/site-packages/btrfs/ usr/lib/python3.6/site-packages/btrfs/__pycache__/ usr/lib/python3.6/site-packages/btrfs/utils.py usr/lib/python3.6/site-packages/btrfs/ioctl.py usr/lib/python3.6/site-packages/btrfs/ctree.py usr/lib/python3.6/site-packages/btrfs/crc32c.py usr/lib/python3.6/site-packages/btrfs/__init__.py usr/lib/python3.6/site-packages/btrfs/__pycache__/utils.cpython-36.pyc usr/lib/python3.6/site-packages/btrfs/__pycache__/ioctl.cpython-36.pyc usr/lib/python3.6/site-packages/btrfs/__pycache__/ctree.cpython-36.pyc usr/lib/python3.6/site-packages/btrfs/__pycache__/crc32c.cpython-36.pyc usr/lib/python3.6/site-packages/btrfs/__pycache__/__init__.cpython-36.pyc usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/installed-files.txt usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/PKG-INFO usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/dependency_links.txt usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/top_level.txt usr/lib/python3.6/site-packages/btrfs-5-py3.6.egg-info/SOURCES.txt I think it's building automaticaly by: PIP_CONFIG_FILE=/dev/null pip install --isolated --root="$pkgdir" --ignore-installed --no-deps btrfs -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs Heatmap - v4 ... colors!!
On 02/03/2017 03:18 PM, Timofey Titovets wrote: > 2017-02-03 15:57 GMT+03:00 Hans van Kranenburg >: >> On 02/03/2017 12:25 PM, Timofey Titovets wrote: >>> Thank you for your great work: >>> JFYI Packaged in AUR: >>> https://aur.archlinux.org/packages/python-btrfs-heatmap/ >> >> Hey, thanks. >> >> Just wondering... what is that btrfs.py you refer to in... >> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=python-btrfs ? >> >> It's a package, not a single file, so maybe that compile command doesn't >> do anything? I'm not familiar with arch, so correct me if I'm wrong. >> >> . >> ├── btrfs >> │ ├── crc32c.py >> │ ├── ctree.py >> │ ├── __init__.py >> │ ├── ioctl.py >> │ ├── utils.py >> > > You right, compile command are useless, so i did remove it, thanks. -$ python2 -m compileall btrfs Listing btrfs ... Compiling btrfs/__init__.py ... Compiling btrfs/crc32c.py ... Compiling btrfs/ctree.py ... Compiling btrfs/ioctl.py ... Compiling btrfs/utils.py ... . ├── btrfs │ ├── crc32c.py │ ├── crc32c.pyc │ ├── ctree.py │ ├── ctree.pyc │ ├── __init__.py │ ├── __init__.pyc │ ├── ioctl.py │ ├── ioctl.pyc │ ├── utils.py │ └── utils.pyc -$ python3 -m compileall btrfs Listing 'btrfs'... Compiling 'btrfs/__init__.py'... Compiling 'btrfs/crc32c.py'... Compiling 'btrfs/ctree.py'... Compiling 'btrfs/ioctl.py'... Compiling 'btrfs/utils.py'... . ├── btrfs │ ├── crc32c.py │ ├── ctree.py │ ├── __init__.py │ ├── ioctl.py │ ├── __pycache__ │ │ ├── crc32c.cpython-35.pyc │ │ ├── ctree.cpython-35.pyc │ │ ├── __init__.cpython-35.pyc │ │ ├── ioctl.cpython-35.pyc │ │ └── utils.cpython-35.pyc │ └── utils.py -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs Heatmap - v4 ... colors!!
2017-02-03 15:57 GMT+03:00 Hans van Kranenburg: > On 02/03/2017 12:25 PM, Timofey Titovets wrote: >> Thank you for your great work: >> JFYI Packaged in AUR: >> https://aur.archlinux.org/packages/python-btrfs-heatmap/ > > Hey, thanks. > > Just wondering... what is that btrfs.py you refer to in... > https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=python-btrfs ? > > It's a package, not a single file, so maybe that compile command doesn't > do anything? I'm not familiar with arch, so correct me if I'm wrong. > > . > ├── btrfs > │ ├── crc32c.py > │ ├── ctree.py > │ ├── __init__.py > │ ├── ioctl.py > │ ├── utils.py > > -- > Hans van Kranenburg You're right, the compile command is useless, so I removed it. Thanks. -- Have a nice day, Timofey.
Re: [PATCH 04/24] fs: Provide infrastructure for dynamic BDIs in filesystems
On Thu 02-02-17 11:28:27, Liu Bo wrote: > Hi, > > On Thu, Feb 02, 2017 at 06:34:02PM +0100, Jan Kara wrote: > > Provide helper functions for setting up dynamically allocated > > backing_dev_info structures for filesystems and cleaning them up on > > superblock destruction. > > Just one concern, will this cause problems for multiple superblock cases > like nfs with nosharecache? Can you elaborate a bit? I've looked for a while at what nfs with nosharecache does but I didn't see how it would influence anything with bdis... Honza -- Jan Kara, SUSE Labs, CR
[PATCH] btrfs-progs: better document btrfs receive security
This adds some extra documentation to the btrfs-receive manpage that explains some of the security related aspects of btrfs-receive. The first part covers the fact that the subvolume being received is writable until the receive finishes, and the second covers the current lack of sanity checking of the send stream. Signed-off-by: Austin S. Hemmelgarn --- Inspired by a recent thread on the ML. This could probably be more thorough, but I felt it was more important to get it documented as quickly as possible, and this should cover the basic info that most people will care about. Documentation/btrfs-receive.asciidoc | 20 1 file changed, 20 insertions(+) diff --git a/Documentation/btrfs-receive.asciidoc b/Documentation/btrfs-receive.asciidoc index a6838e5e..1ada396f 100644 --- a/Documentation/btrfs-receive.asciidoc +++ b/Documentation/btrfs-receive.asciidoc @@ -68,6 +68,26 @@ dump the stream metadata, one line per operation + Does not require the 'path' parameter. The filesystem is not changed. +SECURITY + +Because of how *btrfs receive* works, subvolumes that are in the +process of being received are writable. As a result of this, there +is a possibility that a subvolume for which the receive operation just +completed may not be identical to the originally sent subvolume. + +To avoid this in cases where more than one user may have access to the +area where the received subvolumes are stored, it is recommended to +receive subvolumes into a staging area which is only accessible to the +user who is running receive, and then move them into the final destination +when the receive command completes. + +Additionally, receive does not currently do a very good job of validating +that an incremental send stream actually makes sense, and it is thus +possible for a specially crafted send stream to create a subvolume with +reflinks to arbitrary files in the same filesystem. Because of this, +users are advised to not use *btrfs receive* on send streams from +untrusted sources. 
+ EXIT STATUS --- *btrfs receive* returns a zero exit status if it succeeds. Non zero is returned in case of failure. -- 2.11.0
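The staging-area workflow recommended in the patch above might look like this in practice. All paths are illustrative, and the btrfs commands themselves are shown commented out so the sketch stays runnable on any machine; in real use they would run against the mounted btrfs filesystem.

```shell
# Illustrative layout; in real use, STAGING and DEST would live on the
# btrfs filesystem that receives the stream.
top=$(mktemp -d)                  # stands in for the mounted filesystem
STAGING="$top/.receive-staging"
DEST="$top/backups"

# Mode 0700 keeps other users out of the staging area while the
# received subvolume is still writable.
mkdir -p -m 0700 "$STAGING"
mkdir -p "$DEST"

# The subvolume only becomes read-only when receive exits successfully:
# btrfs receive "$STAGING" < /tmp/snapshot.stream

# Once receive has finished (and the ro flag is set), move the now
# read-only subvolume to its user-visible location:
# mv "$STAGING/snapshot" "$DEST/snapshot"

stat -c '%a' "$STAGING"           # prints 700 with GNU stat
```

Because a plain `mv` within one filesystem is a rename, the subvolume never passes through an unprotected intermediate state on its way to the final destination.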
Re: btrfs_drop_snapshot "IO failure" after RAID controller reset
On Fri, Feb 03, 2017 at 11:16:51AM +0100, Juergen 'Louis' Fluk wrote: > Dear all, > > the RAID controller underneath our 32T BTRFS container had a sudden reset, > and after rebooting BTRFS drops to readonly after some list of messages. > > I did recovery + btrfs-zero-log + recovery (using a LVM snapshot), yet > the error persists. From "transid verify failed" I understand that journal > and data are not in sync (data is newer). BTRFS tries to drop a snapshot > and fails there - is there a way to ignore it or force it? > > RAID controller does not signal new errors so I assume it's not a problem > of accessing some single disk block, but possibly some information was not > written to disk at the time of controller reset. ... > > mount -o recovery /dev/vg/snap /mnt/backup > > Feb 3 08:05:57 zeus kernel: [336619.494618] BTRFS info (device dm-2): > enabling auto recovery > Feb 3 08:05:57 zeus kernel: [336619.494625] BTRFS info (device dm-2): disk > space caching is enabled > Feb 3 08:09:32 zeus kernel: [336834.568348] BTRFS: checking UUID tree > Feb 3 08:10:44 zeus kernel: [336905.752787] BTRFS info (device dm-2): The > free space cache file (814462533632) is invalid. 
skip it > Feb 3 08:10:44 zeus kernel: [336905.752787] > Feb 3 08:11:26 zeus kernel: [336948.358199] BTRFS (device dm-2): parent > transid verify failed on 4052030455808 wanted 451805 found 451973 > Feb 3 08:11:26 zeus kernel: [336948.397901] BTRFS (device dm-2): parent > transid verify failed on 4052030455808 wanted 451805 found 451973 > Feb 3 08:11:46 zeus kernel: [336968.341996] BTRFS (device dm-2): parent > transid verify failed on 4052030455808 wanted 451805 found 451973 > Feb 3 08:11:46 zeus kernel: [336968.362567] BTRFS (device dm-2): parent > transid verify failed on 4052030455808 wanted 451805 found 451973 > Feb 3 08:11:46 zeus kernel: [336968.406344] BTRFS: error (device dm-2) in > btrfs_drop_snapshot:8367: errno=-5 IO failure > Feb 3 08:11:46 zeus kernel: [336968.418816] BTRFS info (device dm-2): forced > readonly > ... > The server is running kernel 3.19.0-79-generic (ubuntu 14.04), btrfs-tools > 3.12-1ubuntu0.1. > Does it make sense to use newer kernel and/or tools to recover? Running on kernel 4.4.0-62-generic now, procedure looks quite similar: mount -o recovery /dev/vg/snap /mnt/backup Feb 3 11:38:30 zeus kernel: [ 297.414369] BTRFS info (device dm-2): enabling auto recovery Feb 3 11:38:30 zeus kernel: [ 297.414375] BTRFS info (device dm-2): disk space caching is enabled Feb 3 11:41:54 zeus kernel: [ 501.145009] BTRFS: checking UUID tree Feb 3 11:43:02 zeus kernel: [ 568.938947] BTRFS info (device dm-2): The free space cache file (814462533632) is invalid. 
skip it Feb 3 11:43:02 zeus kernel: [ 568.938947] Feb 3 11:44:57 zeus kernel: [ 683.656849] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:44:57 zeus kernel: [ 683.718674] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:44:59 zeus kernel: [ 686.344684] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:44:59 zeus kernel: [ 686.370777] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:44:59 zeus kernel: [ 686.374094] BTRFS: error (device dm-2) in btrfs_drop_snapshot:9008: errno=-5 IO failure Feb 3 11:44:59 zeus kernel: [ 686.32] BTRFS info (device dm-2): forced readonly umount /mnt/backup Feb 3 11:46:36 zeus kernel: [ 783.112240] BTRFS error (device dm-2): cleaner transaction attach returned -30 btrfs-zero-log /dev/vg/snap # takes 180s, no messages mount -o recovery /dev/vg/snap /mnt/backup Feb 3 11:49:35 zeus kernel: [ 961.805605] BTRFS info (device dm-2): enabling auto recovery Feb 3 11:49:35 zeus kernel: [ 961.805611] BTRFS info (device dm-2): disk space caching is enabled Feb 3 11:53:03 zeus kernel: [ 1170.373099] BTRFS: checking UUID tree Feb 3 11:54:12 zeus kernel: [ 1238.660425] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:54:12 zeus kernel: [ 1238.807281] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:54:25 zeus kernel: [ 1252.132065] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:54:25 zeus kernel: [ 1252.422404] BTRFS error (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 11:54:25 zeus kernel: [ 1252.425953] BTRFS: error (device dm-2) in btrfs_drop_snapshot:9008: errno=-5 IO failure Feb 3 
11:54:25 zeus kernel: [ 1252.429649] BTRFS info (device dm-2): forced readonly Feb 3 11:59:14 zeus kernel: [ 1541.593077] BTRFS warning (device dm-2): btrfs_uuid_scan_kthread failed -30 Feb 3 12:00:28 zeus kernel: [ 1614.931233] BTRFS error (device dm-2): parent
Re: Btrfs Heatmap - v4 ... colors!!
On 02/03/2017 12:25 PM, Timofey Titovets wrote: > Thank you for your great work: > JFYI Packaged in AUR: > https://aur.archlinux.org/packages/python-btrfs-heatmap/ Hey, thanks. Just wondering... what is that btrfs.py you refer to in... https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=python-btrfs ? It's a package, not a single file, so maybe that compile command doesn't do anything? I'm not familiar with arch, so correct me if I'm wrong. . ├── btrfs │ ├── crc32c.py │ ├── ctree.py │ ├── __init__.py │ ├── ioctl.py │ ├── utils.py -- Hans van Kranenburg
Re: btrfs receive leaves new subvolume modifiable during operation
On 2017-02-03 04:14, Duncan wrote: Graham Cobb posted on Thu, 02 Feb 2017 10:52:26 + as excerpted: On 02/02/17 00:02, Duncan wrote: If it's a workaround, then many of the Linux procedures we as admins and users use every day are equally workarounds. Setting 007 perms on a dir that doesn't have anything immediately security vulnerable in it, simply to keep other users from even potentially seeing or being able to write to something N layers down the subdir tree, is standard practice. No. There is no need to normally place a read-only snapshot below a no-execute directory just to prevent write access to it. That is not part of the admin's expectation. Which is my point. This is no different than standard security practice, that an admin should be familiar with and using without even having to think about it. Btrfs is simply making the same assumptions that everyone else does, that an admin knows what they are doing and sets the upstream permissions with that in mind. If they don't, how is that btrfs' fault? Because btrfs intends the receive snapshot to be read-only. That is the expectation of the sysadmin. Read-only *after* completion, yes. But a sysadmin that believes really setting something read-only and then trying to write to it from userspace, which is what btrfs does until the receive is done, should work, doesn't understand the meaning of read-only. Meanwhile, Austin said most of what I'd say, probably better than I'd say it, so I won't repeat it here, but I agree with him. Even though it is security-related, I agree it isn't the highest priority btrfs bug. It can probably wait until receive is being worked on for other reasons. But if it isn't going to be fixed any time soon, it should be documented in the Wiki and the man page, with the suggested workround for anyone who needs to make sure the receive won't be tampered with. 
One thing I was going to say in the previous post and forgot, is that notwithstanding all the technical reasons, I do agree that documenting it in the manpage, etc, would be a good idea. In that I agree with both you and Austin. =:^) I can look at making a patch for this, but it may be next week before I have time (I'm not great at multi-tasking when it comes to software development, and I'm in the middle of helping to fix a bug in Ansible right now).
Re: Btrfs Heatmap - v4 ... colors!!
Thank you for your great work: JFYI Packaged in AUR: https://aur.archlinux.org/packages/python-btrfs-heatmap/ -- Have a nice day, Timofey.
How to dump/find parity of RAID-5 file?
Hi. Came across this thread https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html Exploring the possibility of adding test-scripts around this area using dump-tree & corrupt-block. But I'm unable to figure out how to get the parity of a file or find its location. dump-tree output gave, item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 145096704) itemoff 15557 itemsize 144 length 134217728 owner 2 stripe_len 65536 type DATA|RAID5 io_align 65536 io_width 65536 sector_size 4096 num_stripes 3 sub_stripes 0 stripe 0 devid 3 offset 6368 # Is this parity? Seems empty? dev_uuid f62df114-186c-4e48-8152-9ed15aa078b4 stripe 1 devid 2 offset 6368 # Contains file data-stripe-1 dev_uuid c0aeaab0-e57e-4f7a-9356-db1878876d9f stripe 2 devid 1 offset 83034112 # Contains file data-stripe-2 dev_uuid 637b3666-9d8f-4ec4-9969-53b0b933b9b1 thanks. Cheers. Lakshmipathi.G
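On the question in the listing: as far as I understand the on-disk format, none of the three stripes in the chunk item is a dedicated parity device. btrfs RAID5 rotates which device holds parity from one horizontal stripe to the next (each stripe_len = 64 KiB here), and the parity block is simply the bytewise XOR of the data blocks in that row. A toy sketch of that XOR relation (stripe length shortened to 4 bytes for illustration); in a test script, reading the candidate blocks from the dump-tree offsets and comparing against this XOR identifies which device holds parity at a given offset:

```python
def raid5_parity(stripes):
    """XOR equal-length stripes together, byte by byte.

    On a healthy RAID5 row, XOR-ing all data stripes yields the parity
    stripe; equivalently, XOR-ing parity with all-but-one data stripe
    reconstructs the missing one.
    """
    assert stripes and all(len(s) == len(stripes[0]) for s in stripes)
    parity = bytearray(len(stripes[0]))
    for s in stripes:
        for i, b in enumerate(s):
            parity[i] ^= b
    return bytes(parity)

# Toy 2-data-stripe row (devids here are illustrative, matching the
# layout quoted above: two data stripes plus one parity stripe):
d1 = b"\x01\x02\x03\x04"   # data stripe, e.g. from devid 2
d2 = b"\x10\x20\x30\x40"   # data stripe, e.g. from devid 1
p = raid5_parity([d1, d2])

# Recovery check: parity XOR one data stripe gives back the other.
assert raid5_parity([p, d1]) == d2
```

An all-zero region XORs to all-zero parity, which is one reason a parity stripe over unwritten space can look "empty" when dumped.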
btrfs_drop_snapshot "IO failure" after RAID controller reset
Dear all, the RAID controller underneath our 32T BTRFS container had a sudden reset, and after rebooting BTRFS drops to readonly after some list of messages. I did recovery + btrfs-zero-log + recovery (using a LVM snapshot), yet the error persists. From "transid verify failed" I understand that journal and data are not in sync (data is newer). BTRFS tries to drop a snapshot and fails there - is there a way to ignore it or force it? RAID controller does not signal new errors so I assume it's not a problem of accessing some single disk block, but possibly some information was not written to disk at the time of controller reset. mount -o recovery /dev/vg/snap /mnt/backup Feb 3 08:05:57 zeus kernel: [336619.494618] BTRFS info (device dm-2): enabling auto recovery Feb 3 08:05:57 zeus kernel: [336619.494625] BTRFS info (device dm-2): disk space caching is enabled Feb 3 08:09:32 zeus kernel: [336834.568348] BTRFS: checking UUID tree Feb 3 08:10:44 zeus kernel: [336905.752787] BTRFS info (device dm-2): The free space cache file (814462533632) is invalid. 
skip it Feb 3 08:10:44 zeus kernel: [336905.752787] Feb 3 08:11:26 zeus kernel: [336948.358199] BTRFS (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 08:11:26 zeus kernel: [336948.397901] BTRFS (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 08:11:46 zeus kernel: [336968.341996] BTRFS (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 08:11:46 zeus kernel: [336968.362567] BTRFS (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 08:11:46 zeus kernel: [336968.406344] BTRFS: error (device dm-2) in btrfs_drop_snapshot:8367: errno=-5 IO failure Feb 3 08:11:46 zeus kernel: [336968.418816] BTRFS info (device dm-2): forced readonly umount /mnt/backup Feb 3 08:14:13 zeus kernel: [337114.733143] BTRFS warning (device dm-2): page private not zero on page 4049746657280 Feb 3 08:14:13 zeus kernel: [337114.733148] BTRFS warning (device dm-2): page private not zero on page 4049746661376 Feb 3 08:14:13 zeus kernel: [337114.733151] BTRFS warning (device dm-2): page private not zero on page 4049746665472 Feb 3 08:14:13 zeus kernel: [337114.733154] BTRFS warning (device dm-2): page private not zero on page 4049746669568 btrfs-zero-log /dev/vg/snap # takes about 180s, no messages mount -o recovery /dev/vg/snap /mnt/backup Feb 3 08:17:01 zeus kernel: [337282.701412] BTRFS info (device dm-2): enabling auto recovery Feb 3 08:17:01 zeus kernel: [337282.701418] BTRFS info (device dm-2): disk space caching is enabled Feb 3 08:20:30 zeus kernel: [337492.359931] BTRFS: checking UUID tree Feb 3 08:21:01 zeus kernel: [337523.269214] BTRFS (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 08:21:01 zeus kernel: [337523.382927] BTRFS (device dm-2): parent transid verify failed on 4052030455808 wanted 451805 found 451973 Feb 3 08:26:06 zeus kernel: [337828.19] BTRFS (device dm-2): 
parent transid verify failed on 4052043694080 wanted 451805 found 451973 Feb 3 08:26:06 zeus kernel: [337828.291338] BTRFS (device dm-2): parent transid verify failed on 4052043694080 wanted 451805 found 451973 Feb 3 08:26:11 zeus kernel: [337833.611569] BTRFS (device dm-2): parent transid verify failed on 4050351652864 wanted 451804 found 451973 Feb 3 08:26:12 zeus kernel: [337833.662051] BTRFS (device dm-2): parent transid verify failed on 4050351652864 wanted 451804 found 451973 Feb 3 08:26:15 zeus kernel: [337837.077964] BTRFS (device dm-2): parent transid verify failed on 4052066533376 wanted 451806 found 451974 Feb 3 08:26:15 zeus kernel: [337837.106540] BTRFS (device dm-2): parent transid verify failed on 4052066533376 wanted 451806 found 451974 Feb 3 08:26:20 zeus kernel: [337842.595882] BTRFS (device dm-2): parent transid verify failed on 4050367676416 wanted 451804 found 451973 Feb 3 08:26:21 zeus kernel: [337842.686296] BTRFS (device dm-2): parent transid verify failed on 4050367676416 wanted 451804 found 451973 Feb 3 08:26:24 zeus kernel: [337845.666495] BTRFS (device dm-2): parent transid verify failed on 4051971883008 wanted 451804 found 451973 Feb 3 08:26:24 zeus kernel: [337845.728624] BTRFS (device dm-2): parent transid verify failed on 4051971883008 wanted 451804 found 451973 Feb 3 08:26:27 zeus kernel: [337848.780978] BTRFS (device dm-2): parent transid verify failed on 4051397328896 wanted 451804 found 451973 Feb 3 08:26:27 zeus kernel: [337848.827572] BTRFS (device dm-2): parent transid verify failed on 4051397328896 wanted 451804 found 451973 Feb 3 08:26:27 zeus kernel: [337849.116946] BTRFS (device dm-2): parent transid verify failed on 4052072022016 wanted 451806 found 451974 Feb 3 08:26:27 zeus kernel: [337849.164664] BTRFS (device dm-2): parent transid verify failed on 4052072022016 wanted 451806 found 451974 Feb 3 08:26:29
Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.
Austin S. Hemmelgarn posted on Thu, 02 Feb 2017 07:49:50 -0500 as excerpted: > I think (although I'm not sure about it) that this: > http://www.spinics.net/lists/linux-btrfs/msg47283.html is the first > posting of the patch series. Yes. That looks like it. Thanks. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: btrfs receive leaves new subvolume modifiable during operation
Graham Cobb posted on Thu, 02 Feb 2017 10:52:26 + as excerpted: > On 02/02/17 00:02, Duncan wrote: >> If it's a workaround, then many of the Linux procedures we as admins >> and users use every day are equally workarounds. Setting 007 perms on >> a dir that doesn't have anything immediately security vulnerable in it, >> simply to keep other users from even potentially seeing or being able >> to write to something N layers down the subdir tree, is standard >> practice. > > No. There is no need to normally place a read-only snapshot below a > no-execute directory just to prevent write access to it. That is not > part of the admin's expectation. > >> Which is my point. This is no different than standard security >> practice, >> that an admin should be familiar with and using without even having to >> think about it. Btrfs is simply making the same assumptions that >> everyone else does, that an admin knows what they are doing and sets >> the upstream permissions with that in mind. If they don't, how is that >> btrfs' fault? > > Because btrfs intends the receive snapshot to be read-only. That is the > expectation of the sysadmin. Read-only *after* completion, yes. But a sysadmin that believes really setting something read-only and then trying to write to it from userspace, which is what btrfs does until the receive is done, should work, doesn't understand the meaning of read-only. Meanwhile, Austin said most of what I'd say, probably better than I'd say it, so I won't repeat it here, but I agree with him. > Even though it is security-related, I agree it isn't the highest > priority btrfs bug. It can probably wait until receive is being worked > on for other reasons. But if it isn't going to be fixed any time soon, > it should be documented in the Wiki and the man page, with the suggested > workround for anyone who needs to make sure the receive won't be > tampered with. 
One thing I was going to say in the previous post and forgot, is that notwithstanding all the technical reasons, I do agree that documenting it in the manpage, etc, would be a good idea. In that I agree with both you and Austin. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
[PATCH 2/5] btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 which has recoverable data corruption (only one data stripe is corrupted), sometimes scrub will report more csum errors than expected. Sometimes even unrecoverable error will be reported. The problem can be easily reproduced by the following steps: 1) Create a btrfs with RAID5 data profile with 3 devs 2) Mount it with nospace_cache or space_cache=v2 To avoid extra data space usage. 3) Create a 128K file and sync the fs, unmount it Now the 128K file lies at the beginning of the data chunk 4) Locate the physical bytenr of data chunk on dev3 Dev3 is the 1st data stripe. 5) Corrupt the first 64K of the data chunk stripe on dev3 6) Mount the fs and scrub it The correct csum error number should be 16(assuming using x86_64). Larger csum error number can be reported in a 1/3 chance. And unrecoverable error can also be reported in a 1/10 chance. The root cause of the problem is RAID5/6 recover code has race condition, due to the fact that full scrub is initiated per device. While for other mirror based profiles, each mirror is independent with each other, so race won't cause any big problem. For example: Corrupted | Correct | Correct| | Scrub dev3 (D1) |Scrub dev2 (D2) |Scrub dev1(P)| Read out D1 |Read out D2 |Read full stripe | Check csum |Check csum |Check parity | Csum mismatch |Csum match, continue|Parity mismatch | handle_errored_block||handle_errored_block | Read out full stripe || Read out full stripe| D1 csum error(err++) || D1 csum error(err++)| Recover D1 || Recover D1 | So D1's csum error is accounted twice, just because handle_errored_block() doesn't have enough protect, and race can happen. On even worse case, for example D1's recovery code is re-writing D1/D2/P, and P's recovery code is just reading out full stripe, then we can cause unrecoverable error. This patch will use previously introduced lock_full_stripe() and unlock_full_stripe() to protect the whole scrub_handle_errored_block() function for RAID56 recovery. 
So no extra csum error nor unrecoverable error. Reported-by: Goffredo BaroncelliSigned-off-by: Qu Wenruo --- fs/btrfs/scrub.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index e68369d425b0..2be7f2546e3a 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1067,6 +1067,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) unsigned int have_csum; struct scrub_block *sblocks_for_recheck; /* holds one for each mirror */ struct scrub_block *sblock_bad; + int lock_ret; int ret; int mirror_index; int page_num; @@ -1096,6 +1097,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) have_csum = sblock_to_check->pagev[0]->have_csum; dev = sblock_to_check->pagev[0]->dev; + /* +* For RAID5/6 race can happen for different dev scrub thread. +* For data corruption, Parity and Data thread will both try +* to recovery the data. +* Race can lead to double added csum error, or even unrecoverable +* error. +*/ + ret = lock_full_stripe(fs_info, logical, GFP_NOFS); + if (ret < 0) + return ret; + if (sctx->is_dev_replace && !is_metadata && !have_csum) { sblocks_for_recheck = NULL; goto nodatasum_case; @@ -1430,6 +1442,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) kfree(sblocks_for_recheck); } + lock_ret = unlock_full_stripe(fs_info, logical); + if (lock_ret < 0) + return lock_ret; return 0; } -- 2.11.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] btrfs: replace: Use ref counts to avoid destroying target device when canceled
When dev-replace and scrub are run at the same time, dev-replace can be
canceled by scrub. It's quite common for btrfs/069.

The backtrace would be like:
 general protection fault: [#1] SMP
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 RIP: 0010:[] [] generic_make_request_checks+0x198/0x5a0
 Call Trace:
  [] ? generic_make_request+0xcf/0x290
  [] generic_make_request+0x24/0x290
  [] ? generic_make_request+0xcf/0x290
  [] submit_bio+0x6e/0x120
  [] ? page_in_rbio+0x4d/0x80 [btrfs]
  [] ? rbio_orig_end_io+0x80/0x80 [btrfs]
  [] finish_rmw+0x401/0x550 [btrfs]
  [] validate_rbio_for_rmw+0x36/0x40 [btrfs]
  [] raid_rmw_end_io+0x7d/0x90 [btrfs]
  [] bio_endio+0x56/0x60
  [] end_workqueue_fn+0x3c/0x40 [btrfs]
  [] btrfs_scrubparity_helper+0xef/0x610 [btrfs]
  [] btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  [] process_one_work+0x2af/0x720
  [] ? process_one_work+0x22b/0x720
  [] worker_thread+0x4b/0x4f0
  [] ? process_one_work+0x720/0x720
  [] ? process_one_work+0x720/0x720
  [] kthread+0xf3/0x110
  [] ? kthread_park+0x60/0x60
  [] ret_from_fork+0x27/0x40

In that case, the target device can be destroyed at cancel time,
leading to a use-after-free bug:

 Process A (dev-replace)                 | Process B (scrub)
-----------------------------------------------------------------------
                                         | (Any RW is OK)
                                         | scrub_setup_recheck_block()
                                         | |- btrfs_map_sblock()
                                         |    Got a bbio with tgtdev
 btrfs_dev_replace_finishing()           |
 |- btrfs_destroy_dev_replace_tgtdev()   |
    |- call_rcu(free_device)             |
       |- __free_device()                |
          |- kfree(device)               |
                                         | Scrub worker:
                                         | Access bbio->stripes[],
                                         | which contains tgtdev.
                                         | This triggers general
                                         | protection fault.

The bug is mostly obvious for RAID5/6, since raid56 chooses to keep the
old rbio and rbio->bbio for later steal; this hugely enlarges the race
window and makes it much easier to trigger the bug.

This patch introduces 'tgtdev_refs' and 'tgtdev_wait' for btrfs_device,
to wait until all users have released the target device.
Signed-off-by: Qu Wenruo
---
 fs/btrfs/dev-replace.c |  7 ++++++-
 fs/btrfs/volumes.c     | 36 +++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h     | 10 ++++++++++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5de280b9ad73..794a6a0bedf2 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -558,7 +558,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		 rcu_str_deref(src_device->name),
 		 src_device->devid,
 		 rcu_str_deref(tgt_device->name));
-	tgt_device->is_tgtdev_for_dev_replace = 0;
 	tgt_device->devid = src_device->devid;
 	src_device->devid = BTRFS_DEV_REPLACE_DEVID;
 	memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
@@ -579,6 +578,12 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 
 	btrfs_dev_replace_unlock(dev_replace, 1);
 
+	/*
+	 * Only change the is_tgtdev_for_dev_replace flag after all its
+	 * users get released.
+	 */
+	wait_target_device(tgt_device);
+	tgt_device->is_tgtdev_for_dev_replace = 0;
 	btrfs_rm_dev_replace_blocked(fs_info);
 
 	btrfs_rm_dev_replace_remove_srcdev(fs_info, src_device);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3c3c69c0eee4..84db9fb22b7d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2064,6 +2064,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	WARN_ON(!tgtdev);
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 
+	wait_target_device(tgtdev);
 	btrfs_sysfs_rm_device_link(fs_info->fs_devices, tgtdev);
 
 	if (tgtdev->bdev)
@@ -2598,6 +2599,8 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	device->is_tgtdev_for_dev_replace = 1;
 	device->mode = FMODE_EXCL;
 	device->dev_stats_valid = 1;
+	atomic_set(&device->tgtdev_refs, 0);
+	init_waitqueue_head(&device->tgtdev_wait);
 	set_blocksize(device->bdev, 4096);
 	device->fs_devices = fs_info->fs_devices;
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
@@ -2624,6 +2627,8 @@ void btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
 	tgtdev->sector_size = sectorsize;
 	tgtdev->fs_info = fs_info;
 	tgtdev->in_fs_metadata = 1;
+	atomic_set(&tgtdev->tgtdev_refs, 0);
+	init_waitqueue_head(&tgtdev->tgtdev_wait);
 }
 
 static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
@@ -5302,6 +5307,32 @@ static struct btrfs_bio *alloc_btrfs_bio(int
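[Editor's note] The tgtdev_refs/tgtdev_wait mechanism above is a
classic "wait for all references to drain before teardown" pattern. A
userspace Python sketch of the same idea (the class and method names
mirror the patch but the mechanics here are plain Python, not the
kernel's atomic_t/waitqueue API):

```python
import threading
import time

class TargetDevice:
    """Sketch of the tgtdev_refs/tgtdev_wait idea: the replace-finish
    and cancel paths block until every bbio user drops its reference."""
    def __init__(self):
        self.refs = 0
        self.cond = threading.Condition()

    def get(self):                       # a bbio takes a reference
        with self.cond:
            self.refs += 1

    def put(self):                       # a bbio drops its reference
        with self.cond:
            self.refs -= 1
            if self.refs == 0:
                self.cond.notify_all()   # wake wait_target_device()

    def wait_target_device(self):        # teardown path
        with self.cond:
            while self.refs:             # don't free while still in use
                self.cond.wait()

dev = TargetDevice()
dev.get()                                # simulated in-flight scrub bio
threading.Thread(target=lambda: (time.sleep(0.05), dev.put())).start()
dev.wait_target_device()                 # returns only after the put()
assert dev.refs == 0
```

With this ordering, kfree(device) in Process A cannot run while
Process B's scrub worker still holds a reference through its bbio.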
[PATCH 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe. And if we don't do proper protection, it can easily
cause race conditions.

Introduce 2 new functions, lock_full_stripe() and unlock_full_stripe(),
for RAID5/6, which maintain an rb_tree of mutexes for full stripes, so
scrub callers can use them to lock a full stripe to avoid races.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/ctree.h       |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c       | 177 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 184 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6a823719b6c5..0dc0b113a691 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -639,6 +639,10 @@ struct btrfs_block_group_cache {
 	 * Protected by free_space_lock.
 	 */
 	int needs_free_space;
+
+	/* Scrub full stripe lock tree for RAID5/6 scrub */
+	struct rb_root scrub_lock_root;
+	spinlock_t scrub_lock;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcd2e798767e..79769c168230 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -130,6 +130,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
 	if (atomic_dec_and_test(&cache->count)) {
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
+		WARN_ON(!RB_EMPTY_ROOT(&cache->scrub_lock_root));
 		kfree(cache->free_space_ctl);
 		kfree(cache);
 	}
@@ -9910,6 +9911,8 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	atomic_set(&cache->count, 1);
 	spin_lock_init(&cache->lock);
+	spin_lock_init(&cache->scrub_lock);
+	cache->scrub_lock_root = RB_ROOT;
 	init_rwsem(&cache->data_rwsem);
 	INIT_LIST_HEAD(&cache->list);
 	INIT_LIST_HEAD(&cache->cluster_list);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9a94670536a6..e68369d425b0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
 	struct btrfs_device *dev;
 };
 
+struct scrub_full_stripe_lock {
+	struct rb_node node;
+	u64 logical;
+	u64 refs;
+	struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -351,6 +358,176 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Caller must hold cache->scrub_lock.
+ *
+ * Return the existing full stripe lock and increase its refs,
+ * or return NULL and insert @fstripe_lock into the bg cache.
+ */
+static struct scrub_full_stripe_lock *
+add_scrub_lock(struct btrfs_block_group_cache *cache,
+	       struct scrub_full_stripe_lock *fstripe_lock)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct scrub_full_stripe_lock *entry;
+
+	p = &cache->scrub_lock_root.rb_node;
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct scrub_full_stripe_lock, node);
+		if (fstripe_lock->logical < entry->logical) {
+			p = &(*p)->rb_left;
+		} else if (fstripe_lock->logical > entry->logical) {
+			p = &(*p)->rb_right;
+		} else {
+			entry->refs++;
+			return entry;
+		}
+	}
+	/* Insert new one */
+	rb_link_node(&fstripe_lock->node, parent, p);
+	rb_insert_color(&fstripe_lock->node, &cache->scrub_lock_root);
+
+	return NULL;
+}
+
+static struct scrub_full_stripe_lock *
+search_scrub_lock(struct btrfs_block_group_cache *cache, u64 bytenr)
+{
+	struct rb_node *node;
+	struct scrub_full_stripe_lock *entry;
+
+	node = cache->scrub_lock_root.rb_node;
+	while (node) {
+		entry = rb_entry(node, struct scrub_full_stripe_lock, node);
+		if (bytenr < entry->logical)
+			node = node->rb_left;
+		else if (bytenr > entry->logical)
+			node = node->rb_right;
+		else
+			return entry;
+	}
+	return NULL;
+}
+
+/*
+ * Helper to get the full stripe logical from a normal bytenr.
+ * Thanks to the chaos of scrub structures, we need to get it all
+ * by ourselves, using btrfs_map_sblock().
+ */
+static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr,
+				   u64 *bytenr_ret)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 len;
+	int ret;
+
+	/* Just use map_sblock() to get the full stripe logical */
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, bytenr,
+			       &len, &bbio, 0, 1);
+	if (ret || !bbio || !bbio->raid_map)
+		goto error;
+	*bytenr_ret = bbio->raid_map[0];
+	btrfs_put_bbio(bbio);
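[Editor's note] The rb_tree of refcounted mutexes above can be
summarized with a short userspace sketch. This is illustrative Python,
not kernel code: a dict stands in for the rb_tree, and `_guard` plays
the role of cache->scrub_lock protecting the tree itself.

```python
import threading

class FullStripeLockTable:
    """Per-block-group lock table: one refcounted mutex per full-stripe
    start, keyed by logical bytenr (like the patch's scrub_lock_root)."""
    def __init__(self):
        self._guard = threading.Lock()   # protects the table itself
        self._locks = {}                 # fstripe_start -> [mutex, refs]

    def lock_full_stripe(self, fstripe_start):
        with self._guard:                # like add_scrub_lock()
            entry = self._locks.setdefault(fstripe_start,
                                           [threading.Lock(), 0])
            entry[1] += 1                # took a reference
        entry[0].acquire()               # blocks if another dev's scrub
                                         # thread holds this full stripe

    def unlock_full_stripe(self, fstripe_start):
        with self._guard:
            entry = self._locks[fstripe_start]
            entry[0].release()
            entry[1] -= 1
            if entry[1] == 0:            # last user frees the node
                del self._locks[fstripe_start]

tbl = FullStripeLockTable()
tbl.lock_full_stripe(65536)              # lock full stripe at 64K
tbl.unlock_full_stripe(65536)
assert not tbl._locks                    # table is empty again
```

Note the mutex is acquired outside the table guard, exactly as the
kernel version must drop the spinlock before sleeping on the mutex.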
[PATCH 3/5] btrfs: raid56: Use correct stolen pages to calculate P/Q
In the following situation, scrub will calculate the wrong parity to
overwrite the correct one:

RAID5 full stripe:

Before
|     Dev 1      |     Dev 2      |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:
|     Dev 1      |     Dev 2      |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The calltrace of such corruption is as follows:

 scrub_bio_end_io_worker() gets called for each extent read out
 |- scrub_block_complete()
    |- Data extent csum mismatch
    |- scrub_handle_errored_block()
       |- scrub_recheck_block()
          |- scrub_submit_raid56_bio_wait()
             |- raid56_parity_recover()

Now we have an rbio with correct data stripe 1 recovered. Let's call it
"good_rbio".

 scrub_parity_check_and_repair()
 |- raid56_parity_submit_scrub_rbio()
    |- lock_stripe_add()
    |  |- steal_rbio()
    |     |- Recovered data is stolen from "good_rbio" and stored into
    |        rbio->stripe_pages[].
    |        Now rbio->bio_pages[] is the bad data read from disk.
    |- async_scrub_parity()
       |- scrub_parity_work() (delayed call to scrub_parity_work)

 scrub_parity_work()
 |- raid56_parity_scrub_stripe()
    |- validate_rbio_for_parity_scrub()
       |- finish_parity_scrub()
          |- Recalculate parity using the *BAD* pages in
             rbio->bio_pages[].
             So the good parity is overwritten with the *BAD* one.

The fix is to introduce 2 new members, bad_ondisk_a/b, in struct
btrfs_raid_bio, to tell the scrub code to use the correct data pages to
re-calculate parity.
Reported-by: Goffredo Baroncelli
Signed-off-by: Qu Wenruo
---
 fs/btrfs/raid56.c | 62 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d2a9a1ee5361..453eefdcb591 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -133,6 +133,16 @@ struct btrfs_raid_bio {
 	/* second bad stripe (for raid6 use) */
 	int failb;
 
+	/*
+	 * For steal_rbio, we can steal the recovered correct page,
+	 * but in finish_parity_scrub(), we still use the bad on-disk
+	 * page to calculate parity.
+	 * Use these members to tell finish_parity_scrub() to use the
+	 * correct pages.
+	 */
+	int bad_ondisk_a;
+	int bad_ondisk_b;
+
 	int scrubp;
 
 	/*
 	 * number of pages needed to represent the full
@@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
 	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
 		return;
 
+	/* Record recovered stripe number */
+	if (src->faila != -1)
+		dest->bad_ondisk_a = src->faila;
+	if (src->failb != -1)
+		dest->bad_ondisk_b = src->failb;
+
 	for (i = 0; i < dest->nr_pages; i++) {
 		s = src->stripe_pages[i];
 		if (!s || !PageUptodate(s)) {
@@ -999,6 +1015,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
+	rbio->bad_ondisk_a = -1;
+	rbio->bad_ondisk_b = -1;
 	atomic_set(&rbio->refs, 1);
 	atomic_set(&rbio->error, 0);
 	atomic_set(&rbio->stripes_pending, 0);
@@ -2261,6 +2279,9 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
 	int bit;
 	int index;
 	struct page *page;
+	struct page *bio_page;
+	void *ptr;
+	void *bio_ptr;
 
 	for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) {
 		for (i = 0; i < rbio->real_stripes; i++) {
@@ -2271,6 +2292,23 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
 			page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
 			if (!page)
 				return -ENOMEM;
+
+			/*
+			 * It's possible that only several pages need recovery,
+			 * and the rest are all good.
+			 * In that case we need to copy good bio pages into
+			 * stripe_pages[], as we will use stripe_pages[] rather
+			 * than bio_pages[] in finish_parity_scrub().
+			 */
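[Editor's note] The parity corruption in the commit message's tables is
just XOR arithmetic, and a tiny userspace sketch makes it concrete
(illustrative Python, using 4-byte "pages" instead of 4K ones; the
helper name is made up for the example):

```python
def raid5_parity(d1, d2):
    """P is the byte-wise XOR of the data stripes (2-data-disk case)."""
    return bytes(a ^ b for a, b in zip(d1, d2))

d1_good = b"\xcd" * 4                 # what data stripe 1 should hold
d2      = b"\xcd" * 4                 # data stripe 2
p_good  = raid5_parity(d1_good, d2)
assert p_good == b"\x00" * 4          # the 0x0000 parity column "Before"

d1_bad = b"\x00" * 4                  # corrupted on-disk copy of D1,
                                      # i.e. what ends up in bio_pages[]
p_from_bad = raid5_parity(d1_bad, d2)
assert p_from_bad == b"\xcd" * 4      # the "0xcdcd (Bad)" parity "After"
```

This is exactly why finish_parity_scrub() must use the recovered
stripe_pages[] (good data) rather than bio_pages[] (bad on-disk data)
when recomputing P.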
[PATCH 4/5] btrfs: raid56: Don't keep rbio for later steal
Before this patch, btrfs raid56 will keep the raid56 rbio even after
all its IO is done.
This may save some time allocating rbios, but it can cause a deadly
use-after-free bug, in the following case:

Original fs: 4 devices RAID5

 Process A                  |          Process B
----------------------------------------------------------
                            | Start device replace
                            |  Now the fs has 5 devices
                            |  devid 0: replace device
                            |  devid 1~4: old devices
 btrfs_map_bio()            |
 |- __btrfs_map_block()     |
 |   bbio has 5 stripes     |
 |   including devid 0      |
 |- raid56_parity_write()   |
                            |
 raid_write_end_io()        |
 |- rbio_orig_end_io()      |
    |- unlock_stripe()      |
       Keeps the old rbio   |
       for later steal,     |
       which has the stripe |
       for devid 0          |
                            | Cancel device replace
                            |  Now the fs has 4 devices
                            |  devid 0 is freed
 Some IO happens            |
 raid_write_end_io()        |
 |- rbio_orig_end_io()      |
    |- unlock_stripe()      |
       |- steal_rbio()      |
          Uses the old rbio,|
          whose bbio has the|
          freed devid 0     |
          stripe            |
 Any access to rbio->bbio will cause a general protection fault or
 NULL pointer dereference.

Such a bug can already be triggered by fstests btrfs/069 for RAID5/6
profiles.

Fix it by not keeping the old rbio in unlock_stripe(): we just free the
finished rbio and rbio->bbio, so the above problem won't happen.
Signed-off-by: Qu Wenruo
---
 fs/btrfs/raid56.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 453eefdcb591..aba82b95ec73 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -776,7 +776,6 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 	int bucket;
 	struct btrfs_stripe_hash *h;
 	unsigned long flags;
-	int keep_cache = 0;
 
 	bucket = rbio_bucket(rbio);
 	h = rbio->fs_info->stripe_hash_table->table + bucket;
@@ -788,19 +787,6 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 	spin_lock(&rbio->bio_list_lock);
 
 	if (!list_empty(&rbio->hash_list)) {
-		/*
-		 * if we're still cached and there is no other IO
-		 * to perform, just leave this rbio here for others
-		 * to steal from later
-		 */
-		if (list_empty(&rbio->plug_list) &&
-		    test_bit(RBIO_CACHE_BIT, &rbio->flags)) {
-			keep_cache = 1;
-			clear_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
-			BUG_ON(!bio_list_empty(&rbio->bio_list));
-			goto done;
-		}
-
 		list_del_init(&rbio->hash_list);
 		atomic_dec(&rbio->refs);
@@ -848,13 +834,11 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 			goto done_nolock;
 		}
 	}
-done:
 	spin_unlock(&rbio->bio_list_lock);
 	spin_unlock_irqrestore(&h->lock, flags);
 
 done_nolock:
-	if (!keep_cache)
-		remove_rbio_from_cache(rbio);
+	remove_rbio_from_cache(rbio);
 }
 
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
-- 
2.11.0
[PATCH 0/5] raid56: various bug fixes
This patchset can be fetched from my github repo:
https://github.com/adam900710/linux.git raid56_fixes

It's based on v4.10-rc6 and none of the patches were modified after
their first appearance on the mailing list.

The patchset fixes the following bugs:

1) False alert or wrong csum error number when scrubbing RAID5/6
   The bug itself won't cause any damage to the fs; it's a pure race.

2) Corrupted data stripe rebuild corrupts P/Q
   So scrub turns one error into another, not really fixing anything.

3) Use-after-free caused by cancelling dev-replace
   This is quite a deadly bug, since cancelling dev-replace can easily
   cause a kernel panic, and thanks to raid bio stealing, the race
   window is quite large.
   Can be triggered by btrfs/069.

After all the fixes are applied, no scrub/replace related regression
can be detected.

Qu Wenruo (5):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: raid56: Use correct stolen pages to calculate P/Q
  btrfs: raid56: Don't keep rbio for later steal
  btrfs: replace: Use ref counts to avoid destroying target device when
    canceled

 fs/btrfs/ctree.h       |   4 ++
 fs/btrfs/dev-replace.c |   7 +-
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/raid56.c      |  80 +++++++++++++-----
 fs/btrfs/scrub.c       | 192 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c     |  36 +++++++-
 fs/btrfs/volumes.h     |  10 +++
 7 files changed, 309 insertions(+), 23 deletions(-)

-- 
2.11.0