Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
On Thu, Aug 20, 2015 at 7:38 AM, Austin S Hemmelgarn wrote:
> Just for reference, I've found that it is usually safer to delete the
> missing device first if possible, then add the new one and re-balance.
> There seem to be some edge-cases in the code for deleting missing devices.

The problem is that you can't do that if there's not enough space on
the remaining devices to hold all the data.

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
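As a sketch of the two orderings being discussed (device names and mount point are hypothetical, not taken from this thread):

```shell
# Ordering Austin suggests: delete the missing device first, then add
# and re-balance. This only works if the remaining devices have enough
# free space to hold all the data.
btrfs device delete missing /mnt/btrfs
btrfs device add /dev/sde /mnt/btrfs
btrfs balance start /mnt/btrfs

# When there is NOT enough space, 'btrfs replace' copies onto the new
# device directly instead of rebalancing through the survivors:
btrfs replace start /dev/sdd /dev/sde /mnt/btrfs
```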
Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
On Wed, Aug 19, 2015 at 1:22 AM, Qu Wenruo wrote:
>
> Timothy Normand Miller wrote on 2015/08/18 22:55 -0400:
>>
>> On Tue, Aug 18, 2015 at 10:48 PM, Qu Wenruo wrote:
>>>
>>> Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
>>>>
>>>> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo wrote:
>>>>>
>>>>> Hi Timothy,
>>>>>
>>>>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>>>>> discuss it on the mailing list, as it's not a kernel bug.
>>>>
>>>> All four devices were online. The "missing" one was a drive that
>>>> died, which was replaced by a new one, but btrfs wouldn't finish the
>>>> deletion of the missing device.
>>>
>>> By replaced, did you mean "btrfs replace"? Or just change the physical
>>> disk without using "btrfs replace"?
>>
>> Here's what happened:
>>
>> - A drive started throwing bad sectors. Somehow this caused metadata
>>   on other drives to get messed up.
>
> Did that cause any huge damage?

It seems that metadata was damaged on all drives.

>> - I took that drive offline and mounted degraded (it's a 4-drive RAID1).
>> - I did a "btrfs add" on a new drive and then a "btrfs delete missing".
>> - The replacement drive failed during the replacement operation, and
>>   everything went to crap.
>> - With some help, I got a kernel patch that allowed me to mount the
>>   original three drives with TWO missing devices.
>
> So the original 3 drives are still OK,
> the original bad one is missing, and the newly added one is also missing?
>
> That sounds quite repairable.

Nothing I tried would run to completion. There were always errors.

>> - I added a brand new drive and then did "delete missing" again. This
>>   time, the first "delete missing" was successful, but it didn't fully
>>   balance the drives, and there was another missing device, so I had to
>>   do a "delete missing" again, and that failed.
>>
>> I wanted to get this back online and restored from a backup, but I was
>> willing to keep it this way if people wanted to probe at it, in case we
>> can uncover any btrfs bugs. So it was suggested to get a metadata
>> image, but that ran into some kind of bug in btrfs-image.
>
> If btrfs-image doesn't work, you can also try btrfs-debug-tree.
> IIRC, debug-tree should be more robust than btrfs-image.
>
> BTW, have you tried btrfsck on it? Does it also cause the infinite loop?
>
> I'll also try to reproduce it and investigate the code directly.

Well, I had to get things back online, so I've restored from backup. I
do have what limited metadata image I could get from btrfs-image.

> Thanks,
> Qu
>
>> Currently, I'm restoring from backup, but I have at least a partial
>> metadata dump.
Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
On Tue, Aug 18, 2015 at 10:48 PM, Qu Wenruo wrote:
>
> Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
>>
>> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo wrote:
>>>
>>> Hi Timothy,
>>>
>>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>>> discuss it on the mailing list, as it's not a kernel bug.
>>
>> All four devices were online. The "missing" one was a drive that
>> died, which was replaced by a new one, but btrfs wouldn't finish the
>> deletion of the missing device.
>
> By replaced, did you mean "btrfs replace"? Or just change the physical
> disk without using "btrfs replace"?

Here's what happened:

- A drive started throwing bad sectors. Somehow this caused metadata
  on other drives to get messed up.
- I took that drive offline and mounted degraded (it's a 4-drive RAID1).
- I did a "btrfs add" on a new drive and then a "btrfs delete missing".
- The replacement drive failed during the replacement operation, and
  everything went to crap.
- With some help, I got a kernel patch that allowed me to mount the
  original three drives with TWO missing devices.
- I added a brand new drive and then did "delete missing" again. This
  time, the first "delete missing" was successful, but it didn't fully
  balance the drives, and there was another missing device, so I had to
  do a "delete missing" again, and that failed.

I wanted to get this back online and restored from a backup, but I was
willing to keep it this way if people wanted to probe at it, in case we
can uncover any btrfs bugs. So it was suggested to get a metadata
image, but that ran into some kind of bug in btrfs-image.

Currently, I'm restoring from backup, but I have at least a partial
metadata dump.
Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo wrote:
> Hi Timothy,
>
> Although I have replied to the bugzilla, IMHO it's more appropriate to
> discuss it on the mailing list, as it's not a kernel bug.

All four devices were online. The "missing" one was a drive that died,
which was replaced by a new one, but btrfs wouldn't finish the deletion
of the missing device.
Re: So, wipe it out and start over or keep debugging?
I was doing it on an unmounted volume anyhow.

On Tue, Aug 18, 2015 at 5:09 PM, Chris Murphy wrote:
> On Tue, Aug 18, 2015 at 5:21 AM, Austin S Hemmelgarn wrote:
>> On 2015-08-17 14:52, Timothy Normand Miller wrote:
>>>
>>> I'm not sure if I'm doing this wrong. Here's what I'm seeing:
>>>
>>> # btrfs-image -c9 -t4 -w /mnt/btrfs ~/btrfs_dump.z
>>> Superblock bytenr is larger than device size
>>> Open ctree failed
>>> create failed (No such file or directory)
>>
>> For the source, you need to specify the underlying block device, not the
>> top of the mounted filesystem. It's trying to read the directory as a
>> block device and getting very confused. We should probably add some kind
>> of check to btrfs-image to warn about that.
>
> Should it even be possible to use btrfs-image on a mounted volume? If
> it's written to at all, the collected image is going to be
> inconsistent.
>
> --
> Chris Murphy
Re: chattr +C on subvolume
Never mind on that last lsattr question. I needed a "-d" option. Silly me. :)

On Tue, Aug 18, 2015 at 1:39 PM, Timothy Normand Miller wrote:
> Another weird thing I've noticed. I did this:
>
> chattr +C /mnt/btrfs/vms
>
> But both of these report nothing:
>
> lsattr /mnt/btrfs/vms
> lsattr /mnt/vms
>
> Shouldn't at least one show the C attribute?
>
> On Tue, Aug 18, 2015 at 1:36 PM, Timothy Normand Miller wrote:
>> Maybe this is a dumb question, but there are always corner cases.
>>
>> I have a subvolume where I want to disable CoW for VM disks. Maybe
>> that's a dumb idea, but that's a recommendation I've seen here and
>> there. Now, in the docs I've seen, +C applies to a directory. Does
>> it apply to subvolumes? And do I apply it to the subvolume within the
>> main volume, or do I apply it to the mount point where I've mounted
>> the subvolume separately? Are there any cases where the flag applies
>> or not depending on how you access the files?
>>
>> The same subvolume for me is accessible via /mnt/btrfs/vms (via the
>> /mnt/btrfs mount point) and /mnt/vms (where the subvolume is mounted).
>> I applied +C to /mnt/btrfs/vms. So what I'm trying to find out is if
>> it also applies when files are accessed via /mnt/vms.
>>
>> Thanks.
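For reference, plain lsattr lists the attributes of a directory's *contents*, not of the directory itself, which is why neither command showed the flag. A minimal sketch (paths taken from the thread; behavior as I understand chattr/lsattr):

```shell
# Set the No_COW flag on the directory. Files created in it afterwards
# inherit the flag; files that already exist are unaffected.
chattr +C /mnt/btrfs/vms

# Without options, lsattr lists attributes of entries *inside* the
# directory; -d shows the attributes of the directory entry itself,
# where the 'C' flag should now appear.
lsattr -d /mnt/btrfs/vms
```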
Re: chattr +C on subvolume
Another weird thing I've noticed. I did this:

chattr +C /mnt/btrfs/vms

But both of these report nothing:

lsattr /mnt/btrfs/vms
lsattr /mnt/vms

Shouldn't at least one show the C attribute?

On Tue, Aug 18, 2015 at 1:36 PM, Timothy Normand Miller wrote:
> Maybe this is a dumb question, but there are always corner cases.
>
> I have a subvolume where I want to disable CoW for VM disks. Maybe
> that's a dumb idea, but that's a recommendation I've seen here and
> there. Now, in the docs I've seen, +C applies to a directory. Does
> it apply to subvolumes? And do I apply it to the subvolume within the
> main volume, or do I apply it to the mount point where I've mounted
> the subvolume separately? Are there any cases where the flag applies
> or not depending on how you access the files?
>
> The same subvolume for me is accessible via /mnt/btrfs/vms (via the
> /mnt/btrfs mount point) and /mnt/vms (where the subvolume is mounted).
> I applied +C to /mnt/btrfs/vms. So what I'm trying to find out is if
> it also applies when files are accessed via /mnt/vms.
>
> Thanks.
chattr +C on subvolume
Maybe this is a dumb question, but there are always corner cases.

I have a subvolume where I want to disable CoW for VM disks. Maybe
that's a dumb idea, but that's a recommendation I've seen here and
there. Now, in the docs I've seen, +C applies to a directory. Does
it apply to subvolumes? And do I apply it to the subvolume within the
main volume, or do I apply it to the mount point where I've mounted
the subvolume separately? Are there any cases where the flag applies
or not depending on how you access the files?

The same subvolume for me is accessible via /mnt/btrfs/vms (via the
/mnt/btrfs mount point) and /mnt/vms (where the subvolume is mounted).
I applied +C to /mnt/btrfs/vms. So what I'm trying to find out is if
it also applies when files are accessed via /mnt/vms.

Thanks.
btrfs-image gets stuck, using 100%, looping on bad file descriptor
I've filed a bug report on this:

https://bugzilla.kernel.org/show_bug.cgi?id=103081
Re: So, wipe it out and start over or keep debugging?
I ran the following command. It spent a lot of time creating a
1672450048-byte file. Then it stopped writing to the file and started
using 100% CPU. It's currently doing no I/O, and it's been doing that
for a while now. Is that supposed to happen?

On Tue, Aug 18, 2015 at 9:30 AM, Timothy Normand Miller wrote:
> In that case, do I need to do all four block devices separately, or
> will the tool figure it out?
>
> On Tue, Aug 18, 2015 at 7:21 AM, Austin S Hemmelgarn wrote:
>> On 2015-08-17 14:52, Timothy Normand Miller wrote:
>>>
>>> I'm not sure if I'm doing this wrong. Here's what I'm seeing:
>>>
>>> # btrfs-image -c9 -t4 -w /mnt/btrfs ~/btrfs_dump.z
>>> Superblock bytenr is larger than device size
>>> Open ctree failed
>>> create failed (No such file or directory)
>>
>> For the source, you need to specify the underlying block device, not the
>> top of the mounted filesystem. It's trying to read the directory as a
>> block device and getting very confused. We should probably add some kind
>> of check to btrfs-image to warn about that.
Re: So, wipe it out and start over or keep debugging?
In that case, do I need to do all four block devices separately, or
will the tool figure it out?

On Tue, Aug 18, 2015 at 7:21 AM, Austin S Hemmelgarn wrote:
> On 2015-08-17 14:52, Timothy Normand Miller wrote:
>>
>> I'm not sure if I'm doing this wrong. Here's what I'm seeing:
>>
>> # btrfs-image -c9 -t4 -w /mnt/btrfs ~/btrfs_dump.z
>> Superblock bytenr is larger than device size
>> Open ctree failed
>> create failed (No such file or directory)
>
> For the source, you need to specify the underlying block device, not the
> top of the mounted filesystem. It's trying to read the directory as a
> block device and getting very confused. We should probably add some kind
> of check to btrfs-image to warn about that.
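Assuming /dev/sdc is one of the filesystem's member devices (the device name is hypothetical, not confirmed in this thread), the invocation Austin describes would look like:

```shell
# Point btrfs-image at a member block device, not at the mount point.
# Flags as used earlier in the thread:
#   -c9  compress the image at zlib level 9
#   -t4  use four worker threads
#   -w   walk all trees manually (useful when the extent tree is damaged)
btrfs-image -c9 -t4 -w /dev/sdc ~/btrfs_dump.z
```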
Re: So, wipe it out and start over or keep debugging?
I'm not sure if I'm doing this wrong. Here's what I'm seeing:

# btrfs-image -c9 -t4 -w /mnt/btrfs ~/btrfs_dump.z
Superblock bytenr is larger than device size
Open ctree failed
create failed (No such file or directory)

On Mon, Aug 17, 2015 at 7:43 AM, Austin S Hemmelgarn wrote:
> On 2015-08-15 17:46, Timothy Normand Miller wrote:
>>
>> To those of you who have been helping out with my 4-drive RAID1
>> situation, is there anything further we should do to investigate this,
>> in case we can uncover any more bugs, or should I just wipe everything
>> out and restore from backup?
>
> If you need the system back online, then my suggestion would be to use
> btrfs-image to get metadata images of the disks (there's an option to
> clear out private data if need be), and then restore from backup. That
> way, we still have the problematic images to work with and examine.
So, wipe it out and start over or keep debugging?
To those of you who have been helping out with my 4-drive RAID1
situation, is there anything further we should do to investigate this,
in case we can uncover any more bugs, or should I just wipe everything
out and restore from backup?
Re: "delete missing" with two missing devices doesn't delete both missing, only does a partial reconstruction
Here's the associated bug report with the full dmesg:

https://bugzilla.kernel.org/show_bug.cgi?id=102941

On Sat, Aug 15, 2015 at 9:13 AM, Timothy Normand Miller wrote:
> So I tried deleting the files that I think are the problem, and the
> file system went suddenly read-only, and I got this in dmesg:
>
> A bunch of these first messages:
> [39710.420118] item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 53
> [39710.420118]   extent refs 1 gen 166914 flags 1
> [39710.420119]   extent data backref root 949 objectid 440675 offset 2621440 count 1
> [39710.420120] item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 53
> [39710.420120]   extent refs 1 gen 166914 flags 1
> [39710.420121]   extent data backref root 949 objectid 440675 offset 3145728 count 1
> [39710.420121] item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 53
> [39710.420122]   extent refs 1 gen 166914 flags 1
> [39710.420122]   extent data backref root 949 objectid 440675 offset 3670016 count 1
> [39710.420123] item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 53
> [39710.420123]   extent refs 1 gen 166914 flags 1
> [39710.420124]   extent data backref root 949 objectid 440675 offset 4194304 count 1
> [39710.420125] item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 53
> [39710.420125]   extent refs 1 gen 166914 flags 1
> [39710.420126]   extent data backref root 949 objectid 440675 offset 4718592 count 1
> [39710.420126] item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 53
> [39710.420127]   extent refs 1 gen 166914 flags 1
> [39710.420127]   extent data backref root 949 objectid 440675 offset 5242880 count 1
> [39710.420128] BTRFS error (device sdc): unable to find ref byte nr 1668272218112 parent 0 root 949 owner 1032823 offset 655360
> [39710.420129] BTRFS: error (device sdc) in __btrfs_free_extent:6232: errno=-2 No such entry
> [39710.420131] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2821: errno=-2 No such entry
> [39710.431108] pending csums is 5795840
>
> On Sat, Aug 15, 2015 at 8:51 AM, Timothy Normand Miller wrote:
>> I didn't quite understand "profile and convert", since I can't find a
>> profile option. Is this something your patch adds?
>>
>> Before I do that, however, I have to deal with this:
>>
>> compute0 ~ # btrfs device delete missing /mnt/btrfs
>> ERROR: error removing the device 'missing' - Input/output error
>>
>> [13058.298763] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
>> [13058.298775] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
>> [13058.298782] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
>> [13058.298788] BTRFS warning (device sdc): csum failed ino 596 off 623230976 csum 3298529275 expected csum 1155389604
>> [13058.298794] BTRFS warning (device sdc): csum failed ino 596 off 623235072 csum 2603391790 expected csum 1861925401
>> [13058.298801] BTRFS warning (device sdc): csum failed ino 596 off 623239168 csum 2044148708 expected csum 3227559459
>> [13058.298807] BTRFS warning (device sdc): csum failed ino 596 off 623243264 csum 615351306 expected csum 2720021058
>> [13058.329747] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
>> [13058.329759] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
>> [13058.329770] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
>>
>> Because of this, it won't delete the missing device. How do I get
>> past this? I'm pretty sure the problem is in some files I want to
>> delete anyhow. Would deleting them solve the problem?
>>
>> On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain wrote:
>>>
>>>> BTW, when this is all over with, how do I make sure there are really
>>>> two copies of everything? Will a scrub verify this? Should I run a
>>>> balance operation?
>>>
>>> pls use 'btrfs bal profile and convert' to migrate single chunk (if any
>>> created when there were lesser number of RW-able devices) back to your
>>> desired raid1. Do this when all the devices are back online. Kindly note
>>> there is a bug in the btrfs VM that you won't be able to br
Re: "delete missing" with two missing devices doesn't delete both missing, only does a partial reconstruction
Oh, it went read-only because it OOPSed:

[39710.419966] [ cut here ]
[39710.419969] WARNING: CPU: 1 PID: 5624 at fs/btrfs/extent-tree.c:6226 __btrfs_free_extent+0x873/0xc80()
[39710.419970] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl ipv6 binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek ppdev snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel snd_hda_intel snd_hda_controller kvm snd_hda_codec snd_hda_core microcode snd_hwdep pcspkr snd_pcm snd_timer i2c_i801 snd lpc_ich mfd_core parport_pc battery xts gf128mul aes_x86_64 cbc sha256_generic libiscsi scsi_transport_iscsi tg3 ptp pps_core libphy sky2 r8169 pcnet32 mii e1000 bnx2 fuse nfs lockd grace sunrpc reiserfs multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod firewire_core hid_sunplus hid_sony hid_samsung hid_pl hid_petalynx hid_gyration usbhid uhci_hcd usb_storage ehci_pci
[39710.419991] ehci_hcd aic94xx libsas qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 cciss 3w_9xxx 3w_ mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg sata_mv sata_sil24 sata_sil pata_marvell
[39710.420003] CPU: 1 PID: 5624 Comm: kworker/u8:7 Tainted: GW 4.1.4-gentoo #1
[39710.420003] Hardware name: ECS H87H3-M/H87H3-M, BIOS 4.6.5 07/16/2013
[39710.420005] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[39710.420006] 8197e672 81794418
[39710.420008] 81049cbc 01846cc5e000 880064d12000 e000
[39710.420009] fffe 8127bc03 000fc277
[39710.420010] Call Trace:
[39710.420012] [] ? dump_stack+0x40/0x50
[39710.420014] [] ? warn_slowpath_common+0x7c/0xb0
[39710.420015] [] ? __btrfs_free_extent+0x873/0xc80
[39710.420018] [] ? cpumask_next_and+0x30/0x50
[39710.420019] [] ? enqueue_task_fair+0x2c3/0xdb0
[39710.420021] [] ? btrfs_delayed_ref_lock+0x2c/0x260
[39710.420022] [] ? __btrfs_run_delayed_refs+0x42c/0x1280
[39710.420024] [] ? __sb_start_write+0x3d/0xe0
[39710.420025] [] ? btrfs_run_delayed_refs.part.58+0x5e/0x270
[39710.420026] [] ? delayed_ref_async_start+0x78/0x90
[39710.420028] [] ? normal_work_helper+0x73/0x2a0
[39710.420029] [] ? process_one_work+0x13c/0x3d0
[39710.420031] [] ? worker_thread+0x63/0x480
[39710.420032] [] ? process_one_work+0x3d0/0x3d0
[39710.420033] [] ? kthread+0xce/0xf0
[39710.420034] [] ? kthread_create_on_node+0x180/0x180
[39710.420036] [] ? ret_from_fork+0x42/0x70
[39710.420037] [] ? kthread_create_on_node+0x180/0x180
[39710.420038] ---[ end trace 0b4fe6057cd7a1a4 ]---

On Sat, Aug 15, 2015 at 9:13 AM, Timothy Normand Miller wrote:
> So I tried deleting the files that I think are the problem, and the
> file system went suddenly read-only, and I got this in dmesg:
>
> A bunch of these first messages:
> [39710.420118] item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 53
> [39710.420118]   extent refs 1 gen 166914 flags 1
> [39710.420119]   extent data backref root 949 objectid 440675 offset 2621440 count 1
> [39710.420120] item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 53
> [39710.420120]   extent refs 1 gen 166914 flags 1
> [39710.420121]   extent data backref root 949 objectid 440675 offset 3145728 count 1
> [39710.420121] item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 53
> [39710.420122]   extent refs 1 gen 166914 flags 1
> [39710.420122]   extent data backref root 949 objectid 440675 offset 3670016 count 1
> [39710.420123] item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 53
> [39710.420123]   extent refs 1 gen 166914 flags 1
> [39710.420124]   extent data backref root 949 objectid 440675 offset 4194304 count 1
> [39710.420125] item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 53
> [39710.420125]   extent refs 1 gen 166914 flags 1
> [39710.420126]   extent data backref root 949 objectid 440675 offset 4718592 count 1
> [39710.420126] item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 53
> [39710.420127]   extent refs 1 gen 166914 flags 1
> [39710.420127]   extent data backref root 949 objectid 440675 offset 5242880 count 1
> [39710.420128] BTRFS error (device sdc): unable to find ref byte nr 1668272218112 parent 0 root 949 owner 1032823 offset 655360
> [39710.420129] BTRFS: error (device sdc) in __btrfs_free_extent:6232: errno=-2 No such entry
> [39710.420131] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2821: errno=-2 No s
Re: "delete missing" with two missing devices doesn't delete both missing, only does a partial reconstruction
So I tried deleting the files that I think are the problem, and the
file system went suddenly read-only, and I got this in dmesg:

A bunch of these first messages:

[39710.420118] item 45 key (1668296151040 168 524288) itemoff 1557 itemsize 53
[39710.420118]   extent refs 1 gen 166914 flags 1
[39710.420119]   extent data backref root 949 objectid 440675 offset 2621440 count 1
[39710.420120] item 46 key (1668296675328 168 524288) itemoff 1504 itemsize 53
[39710.420120]   extent refs 1 gen 166914 flags 1
[39710.420121]   extent data backref root 949 objectid 440675 offset 3145728 count 1
[39710.420121] item 47 key (1668297199616 168 524288) itemoff 1451 itemsize 53
[39710.420122]   extent refs 1 gen 166914 flags 1
[39710.420122]   extent data backref root 949 objectid 440675 offset 3670016 count 1
[39710.420123] item 48 key (1668297723904 168 524288) itemoff 1398 itemsize 53
[39710.420123]   extent refs 1 gen 166914 flags 1
[39710.420124]   extent data backref root 949 objectid 440675 offset 4194304 count 1
[39710.420125] item 49 key (1668298248192 168 524288) itemoff 1345 itemsize 53
[39710.420125]   extent refs 1 gen 166914 flags 1
[39710.420126]   extent data backref root 949 objectid 440675 offset 4718592 count 1
[39710.420126] item 50 key (1668298772480 168 524288) itemoff 1292 itemsize 53
[39710.420127]   extent refs 1 gen 166914 flags 1
[39710.420127]   extent data backref root 949 objectid 440675 offset 5242880 count 1
[39710.420128] BTRFS error (device sdc): unable to find ref byte nr 1668272218112 parent 0 root 949 owner 1032823 offset 655360
[39710.420129] BTRFS: error (device sdc) in __btrfs_free_extent:6232: errno=-2 No such entry
[39710.420131] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2821: errno=-2 No such entry
[39710.431108] pending csums is 5795840

On Sat, Aug 15, 2015 at 8:51 AM, Timothy Normand Miller wrote:
> I didn't quite understand "profile and convert", since I can't find a
> profile option. Is this something your patch adds?
>
> Before I do that, however, I have to deal with this:
>
> compute0 ~ # btrfs device delete missing /mnt/btrfs
> ERROR: error removing the device 'missing' - Input/output error
>
> [13058.298763] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
> [13058.298775] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
> [13058.298782] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
> [13058.298788] BTRFS warning (device sdc): csum failed ino 596 off 623230976 csum 3298529275 expected csum 1155389604
> [13058.298794] BTRFS warning (device sdc): csum failed ino 596 off 623235072 csum 2603391790 expected csum 1861925401
> [13058.298801] BTRFS warning (device sdc): csum failed ino 596 off 623239168 csum 2044148708 expected csum 3227559459
> [13058.298807] BTRFS warning (device sdc): csum failed ino 596 off 623243264 csum 615351306 expected csum 2720021058
> [13058.329747] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738
> [13058.329759] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414
> [13058.329770] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799
>
> Because of this, it won't delete the missing device. How do I get
> past this? I'm pretty sure the problem is in some files I want to
> delete anyhow. Would deleting them solve the problem?
>
> On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain wrote:
>>
>>> BTW, when this is all over with, how do I make sure there are really
>>> two copies of everything? Will a scrub verify this? Should I run a
>>> balance operation?
>>
>> pls use 'btrfs bal profile and convert' to migrate single chunk (if any
>> created when there were lesser number of RW-able devices) back to your
>> desired raid1. Do this when all the devices are back online. Kindly note
>> there is a bug in the btrfs VM that you won't be able to bring a device
>> online with out unmount -> mount (I am working to fix). btrfs-progs will be
>> wrong in this case don't depend too much on that.
>> So to understand inside of btrfs kernel volume I generally use:
>> https://patchwork.kernel.org/patch/5816011/
>>
>> In there if bdev is null it indicates device is scanned but not part of VM
>> yet. Then unmount -> mount will bring device back to be part of VM.
>>
>>> After applying Anand's patch, I was able to mount my 4-drive RAID1
Re: "delete missing" with two missing devices doesn't delete both missing, only does a partial reconstruction
I didn't quite understand "profile and convert", since I can't find a profile option. Is this something your patch adds? Before I do that, however, I have to deal with this: compute0 ~ # btrfs device delete missing /mnt/btrfs ERROR: error removing the device 'missing' - Input/output error [13058.298763] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738 [13058.298775] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414 [13058.298782] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799 [13058.298788] BTRFS warning (device sdc): csum failed ino 596 off 623230976 csum 3298529275 expected csum 1155389604 [13058.298794] BTRFS warning (device sdc): csum failed ino 596 off 623235072 csum 2603391790 expected csum 1861925401 [13058.298801] BTRFS warning (device sdc): csum failed ino 596 off 623239168 csum 2044148708 expected csum 3227559459 [13058.298807] BTRFS warning (device sdc): csum failed ino 596 off 623243264 csum 615351306 expected csum 2720021058 [13058.329747] BTRFS warning (device sdc): csum failed ino 596 off 623218688 csum 2756583412 expected csum 4104700738 [13058.329759] BTRFS warning (device sdc): csum failed ino 596 off 623222784 csum 2568037276 expected csum 275151414 [13058.329770] BTRFS warning (device sdc): csum failed ino 596 off 623226880 csum 2227564114 expected csum 3824181799 Because of this, it won't delete the missing device. How do I get past this? I'm pretty sure the problem is in some files I want to delete anyhow. Would deleting them solve the problem? On Sat, Aug 15, 2015 at 12:59 AM, Anand Jain wrote: > >> BTW, when this is all over with, how do I make sure there are really >> two copies of everything? Will a scrub verify this? Should I run a >> balance operation? 
> > pls use 'btrfs bal profile and convert' to migrate single chunk (if any > created when there were lesser number of RW-able devices) back to your > desired raid1. Do this when all the devices are back online. Kindly note > there is a bug in the btrfs VM that you won't be able to bring a device > online with out unmount -> mount (I am working to fix). btrfs-progs will be > wrong in this case don't depend too much on that. > So to understand inside of btrfs kernel volume I generally use: > https://patchwork.kernel.org/patch/5816011/ > > In there if bdev is null it indicates device is scanned but not part of VM > yet. Then unmount -> mount will bring device back to be part of VM. > >>> After applying Anand's patch, I was able to mount my 4-drive RAID1 >>> and bring a new fourth drive online. > >>> However, something weird happened >>> where the first "delete missing" only deleted one missing drive and >>> only did a partial duplication. I've posted a bug report here: > > that seems to be normal to me. unless I am missing something else / clarity. > > > Thanks, Anand -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
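An aside on the csum warnings quoted above: the failing offsets are evenly spaced, which suggests one damaged extent in a single file rather than scattered corruption. A quick sketch (plain POSIX shell; the offsets are copied verbatim from the dmesg lines in this thread):

```shell
# Offsets from the "csum failed ino 596" warnings above.
offs="623218688 623222784 623226880 623230976 623235072 623239168 623243264"
prev=""
for o in $offs; do
  # Print the gap between consecutive failing offsets.
  if [ -n "$prev" ]; then echo $((o - prev)); fi
  prev=$o
done
# Every gap is 4096 -- seven consecutive 4KiB blocks of inode 596.
```

Since all the failures land in one run of blocks in inode 596, resolving that inode to a path (e.g. `btrfs inspect-internal inode-resolve 596 /mnt/btrfs`, mountpoint taken from the thread) and deleting or restoring that one file is a plausible way to clear the I/O error blocking `device delete missing` — consistent with Timothy's guess that the problem is in files he wants to delete anyhow.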
Re: "delete missing" with two missing devices doesn't delete both missing, only does a partial reconstruction
BTW, when this is all over with, how do I make sure there are really two copies of everything? Will a scrub verify this? Should I run a balance operation? On Fri, Aug 14, 2015 at 11:29 PM, Timothy Normand Miller wrote: > After applying Anand's patch, I was able to mount my 4-drive RAID1 and > bring a new fourth drive online. However, something weird happened > where the first "delete missing" only deleted one missing drive and > only did a partial duplication. I've posted a bug report here: > > https://bugzilla.kernel.org/show_bug.cgi?id=102901 > > -- > Timothy Normand Miller, PhD > Assistant Professor of Computer Science, Binghamton University > http://www.cs.binghamton.edu/~millerti/ > Open Graphics Project -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
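On the "are there really two copies" question above: a scrub only verifies checksums against the copies that exist, while chunks allocated during a degraded mount can carry the "single" profile, which only a convert balance fixes. A sketch of how one might spot leftover single chunks — the `btrfs filesystem df` text below is a made-up sample, not output from this system:

```shell
# Hypothetical 'btrfs filesystem df' capture; on a live system, pipe in
# the real command output instead of this sample.
sample='Data, RAID1: total=760.00GiB, used=745.00GiB
Data, single: total=8.00GiB, used=7.50GiB
Metadata, RAID1: total=4.00GiB, used=3.20GiB'

# Count chunk lines still using the "single" profile.
singles=$(echo "$sample" | grep -c ', single:')
echo "single-profile chunk lines: $singles"
# A non-zero count would call for something like:
#   btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/btrfs
# ('soft' skips chunks that already have the target profile.)
```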
"delete missing" with two missing devices doesn't delete both missing, only does a partial reconstruction
After applying Anand's patch, I was able to mount my 4-drive RAID1 and bring a new fourth drive online. However, something weird happened where the first "delete missing" only deleted one missing drive and only did a partial duplication. I've posted a bug report here:

https://bugzilla.kernel.org/show_bug.cgi?id=102901

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Can't mount degraded. How to remove/add drives OFFLINE?
I applied that patch to my 4.1.4, it mounted degraded, and now it's balancing to the new drive. Thanks for all the help!

On Fri, Aug 14, 2015 at 8:28 PM, Anand Jain wrote:
>
>> Just to be clear, I removed the drive (the original failed drive) when
>> the power was off, then powered up, and then mounted degraded. That's
>> not dangerous that I know of.
>
> patch has details. pls refer.
>
>> Where is this patch, and what kernel versions can this be applied to?
>
> https://patchwork.kernel.org/patch/7014141/
>
> its on 4.3. but should apply nice on below.
>
> thanks
> Anand

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Can't mount degraded. How to remove/add drives OFFLINE?
On Fri, Aug 14, 2015 at 7:49 PM, Anand Jain wrote: > >> >> - I had a drive fail, so I removed it and mounted degraded. > > > that bit dangerous to do without the below patch. patch has more details > why. Just to be clear, I removed the drive (the original failed drive) when the power was off, then powered up, and then mounted degraded. That's not dangerous that I know of. > >> - I hooked up a replacement drive, did an "add" on that one, and did a >> "delete missing". >> - During the rebalance, the replacement drive failed, there were OOPSes, >> etc. >> - Now, although all of my data is there, I can't mount degraded, >> because btrfs is complaining that too many devices are missing (3 are >> there, but it sees 2 missing). > > > > This is addressed in the patch > > [PATCH 23/23] Btrfs: allow -o rw,degraded for single group profile > Where is this patch, and what kernel versions can this be applied to? -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't mount degraded. How to remove/add drives OFFLINE?
I'm not sure my situation is quite like the one you linked, so here's my bug report: https://bugzilla.kernel.org/show_bug.cgi?id=102881 On Fri, Aug 14, 2015 at 2:44 PM, Chris Murphy wrote: > On Fri, Aug 14, 2015 at 12:12 PM, Timothy Normand Miller > wrote: >> Sorry about that empty email. I hit a wrong key, and gmail decided to send. >> >> Anyhow, my replacement drive is going to arrive this evening, and I >> need to know how to add it to my btrfs array. Here's the situation: >> >> - I had a drive fail, so I removed it and mounted degraded. >> - I hooked up a replacement drive, did an "add" on that one, and did a >> "delete missing". >> - During the rebalance, the replacement drive failed, there were OOPSes, etc. >> - Now, although all of my data is there, I can't mount degraded, >> because btrfs is complaining that too many devices are missing (3 are >> there, but it sees 2 missing). > > It might be related to this (long) bug: > https://bugzilla.kernel.org/show_bug.cgi?id=92641 > > While Btrfs RAID 1 can tolerate only a single device failure, what you > have is an in-progress rebuild of a missing device. If it becomes > missing, the volume should be no worse off than it was before. But > Btrfs doesn't see it this way, instead is sees this as two separate > missing devices and now too many devices missing and it refuses to > proceed. And there's no mechanism to remove missing devices unless you > can mount rw. So it's stuck. > > >> So I could use some help with cleaning up this mess. All the data is >> there, so I need to know how to either force it to mount degraded, or >> add and remove devices offline. Where do I begin? > > You can try to ask on IRC. I have no ideas for this scenario, I've > tried and failed. My case was throw away, what should still be > possible is using btrfs restore. > > >> Also, doesn't it seem a bit arbitrary that there are "too many >> missing," when all of the data is there? 
If I understand correctly, >> all four drives in my RAID1 should all have copies of the metadata, > > No that's not correct. RAID 1 means 2 copies of metadata. In a 4 > device RAID 1 that's still only 2 copies. It is not n-way RAID 1. > > But that doesn't matter here, the problem is that Btrfs has a narrow > idea of the volume, it assumes without context that once the number of > devices is below the minimum, the volume can't be mounted. In reality, > an exception exists if the failure is for an in-progress rebuild of a > missing drive. That drive failing should mean the volume is no worse > off than before but Btrfs doesn't know that. > > Pretty sure about that anyway. > > >> and of the remaining three good drives, there should be one or two >> copies of every data block. So it's all there, but btrfs has decided, >> based on the NUMBER of missing devices, that it won't mount. >> Shouldn't it refuse to mount if it knows there is data missing? For >> that matter, why should it even refuse in that case? So some data >> might missing, so it should throw some errors if you try to access >> that missing data. Right? > > I think no data is missing, no metadata is missing, and Btrfs is > confused and stuck in this case. > > -- > Chris Murphy -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Can't mount degraded. How to remove/add drives OFFLINE?
Sorry about that empty email. I hit a wrong key, and gmail decided to send.

Anyhow, my replacement drive is going to arrive this evening, and I need to know how to add it to my btrfs array. Here's the situation:

- I had a drive fail, so I removed it and mounted degraded.
- I hooked up a replacement drive, did an "add" on that one, and did a "delete missing".
- During the rebalance, the replacement drive failed, there were OOPSes, etc.
- Now, although all of my data is there, I can't mount degraded, because btrfs is complaining that too many devices are missing (3 are there, but it sees 2 missing).

So I could use some help with cleaning up this mess. All the data is there, so I need to know how to either force it to mount degraded, or add and remove devices offline. Where do I begin?

Also, doesn't it seem a bit arbitrary that there are "too many missing," when all of the data is there? If I understand correctly, all four drives in my RAID1 should have copies of the metadata, and of the remaining three good drives, there should be one or two copies of every data block. So it's all there, but btrfs has decided, based on the NUMBER of missing devices, that it won't mount. Shouldn't it refuse to mount only if it knows data is actually missing? For that matter, why should it even refuse in that case? If some data is missing, it could just throw errors when you try to access that missing data. Right?

Thanks!

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Can't mount degraded. How to remove/add drives OFFLINE?
My

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
Ok, here's what's happening. A few years ago, I took my old WD green drives and put them in a box as backups to a new array of Seagate drives. When one of those Seagate drives failed (just out of warranty, of course), I replaced it with one of the WDs. That was cooking along just fine until a few days ago, when it started throwing bad sectors and for some reason caused btrfs to have lots of problems with the system block on the other three drives. I tried to add the other spare and remove the old spare, but for whatever reason, this second spare (which had been fine when I boxed it in an anti-static bag) is now failing catastrophically. Now that that has happened, the btrfs volume is stuck in a funny state where it won't mount in degraded mode, because it thinks there should be five devices, but there are only the original three.

I'm going to go ahead and order a new drive. Meanwhile, is there a way to add and remove drives from volumes that can't be mounted?

On Wed, Aug 12, 2015 at 4:48 PM, Timothy Normand Miller wrote:
> Actually, it didn't resume. The "btrfs delete missing" was using 100%
> of the I/O bandwidth but wasn't actually doing any disk reads or
> writes. I tried to reboot, but the system wouldn't go down, so after
> waiting 10 minutes, I power-cycled. Now I can't mount at all, and
> here's what dmesg says about that:
>
> [ 236.118419] BTRFS info (device sdb): allowing degraded mounts
> [ 236.118421] BTRFS info (device sdb): disk space caching is enabled
> [ 236.165470] BTRFS: bdev (null) errs: wr 1724, rd 305, flush 45, corrupt 0, gen 2
> [ 245.883595] BTRFS: too many missing devices, writeable mount is not allowed
> [ 245.946570] BTRFS: open_ctree failed
>
> It thinks now that there should be five devices, and since there are
> only three available, it won't let me mount.
> > # btrfs filesystem show > Label: none uuid: 49ac9ad2-b529-4e6e-aef9-1c5b9e8a72f8 > Total devices 1 FS bytes used 28.26GiB > devid1 size 79.69GiB used 41.03GiB path /dev/sda3 > > warning, device 1 is missing > warning, device 1 is missing > warning devid 1 not found already > warning devid 5 not found already > Label: none uuid: ecdff84d-b4a2-4286-a1c1-cd7e5396901c > Total devices 5 FS bytes used 1.46TiB > devid2 size 931.51GiB used 767.00GiB path /dev/sdd > devid3 size 931.51GiB used 745.03GiB path /dev/sdc > devid4 size 931.51GiB used 767.00GiB path /dev/sdb > *** Some devices missing > > btrfs-progs v4.1.2 > > > > On Wed, Aug 12, 2015 at 4:27 PM, Timothy Normand Miller > wrote: >> It resumed on its own. Weird. >> >> On Wed, Aug 12, 2015 at 4:23 PM, Timothy Normand Miller >> wrote: >>> On Wed, Aug 12, 2015 at 2:10 PM, Chris Murphy >>> wrote: >>> >>>> >>>> Anyway it looks like it's hardware related, but I don't know what >>>> device ata4.00 is, so maybe this helps: >>>> http://superuser.com/questions/617192/mapping-ata-device-number-to-logical-device-name >>> >>> # ata=4; ls -l /sys/block/sd* | grep $(grep $ata >>> /sys/class/scsi_host/host*/unique_id | awk -F'/' '{print $5}') >>> lrwxrwxrwx 1 root root 0 Aug 12 16:21 /sys/block/sde -> >>> ../devices/pci:00/:00:1f.5/ata4/host3/target3:0:0/3:0:0:0/block/sde >>> >>> sde is the newly attached drive, replacing the one that had appeared >>> to have bad sectors. So it looks like either this new motherboard has >>> a bad connector, or the cable is bad. I'm going to swap it out for a >>> different SATA cable. How do I resume the failed operation? And >>> should I reboot because of the OOPSes? 
>>> >>> -- >>> Timothy Normand Miller, PhD >>> Assistant Professor of Computer Science, Binghamton University >>> http://www.cs.binghamton.edu/~millerti/ >>> Open Graphics Project >> >> >> >> -- >> Timothy Normand Miller, PhD >> Assistant Professor of Computer Science, Binghamton University >> http://www.cs.binghamton.edu/~millerti/ >> Open Graphics Project > > > > -- > Timothy Normand Miller, PhD > Assistant Professor of Computer Science, Binghamton University > http://www.cs.binghamton.edu/~millerti/ > Open Graphics Project -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Damaged filesystem, can read, can't repair, error says to contact devs
Actually, it didn't resume. The "btrfs delete missing" was using 100% of the I/O bandwidth but wasn't actually doing any disk reads or writes. I tried to reboot, but the system wouldn't go down, so after waiting 10 minutes, I power-cycled. Now I can't mount at all, and here's what dmesg says about that:

[ 236.118419] BTRFS info (device sdb): allowing degraded mounts
[ 236.118421] BTRFS info (device sdb): disk space caching is enabled
[ 236.165470] BTRFS: bdev (null) errs: wr 1724, rd 305, flush 45, corrupt 0, gen 2
[ 245.883595] BTRFS: too many missing devices, writeable mount is not allowed
[ 245.946570] BTRFS: open_ctree failed

It thinks now that there should be five devices, and since there are only three available, it won't let me mount.

# btrfs filesystem show
Label: none  uuid: 49ac9ad2-b529-4e6e-aef9-1c5b9e8a72f8
        Total devices 1 FS bytes used 28.26GiB
        devid 1 size 79.69GiB used 41.03GiB path /dev/sda3

warning, device 1 is missing
warning, device 1 is missing
warning devid 1 not found already
warning devid 5 not found already
Label: none  uuid: ecdff84d-b4a2-4286-a1c1-cd7e5396901c
        Total devices 5 FS bytes used 1.46TiB
        devid 2 size 931.51GiB used 767.00GiB path /dev/sdd
        devid 3 size 931.51GiB used 745.03GiB path /dev/sdc
        devid 4 size 931.51GiB used 767.00GiB path /dev/sdb
        *** Some devices missing

btrfs-progs v4.1.2

On Wed, Aug 12, 2015 at 4:27 PM, Timothy Normand Miller wrote:
> It resumed on its own. Weird.
> > On Wed, Aug 12, 2015 at 4:23 PM, Timothy Normand Miller > wrote: >> On Wed, Aug 12, 2015 at 2:10 PM, Chris Murphy >> wrote: >> >>> >>> Anyway it looks like it's hardware related, but I don't know what >>> device ata4.00 is, so maybe this helps: >>> http://superuser.com/questions/617192/mapping-ata-device-number-to-logical-device-name >> >> # ata=4; ls -l /sys/block/sd* | grep $(grep $ata >> /sys/class/scsi_host/host*/unique_id | awk -F'/' '{print $5}') >> lrwxrwxrwx 1 root root 0 Aug 12 16:21 /sys/block/sde -> >> ../devices/pci:00/:00:1f.5/ata4/host3/target3:0:0/3:0:0:0/block/sde >> >> sde is the newly attached drive, replacing the one that had appeared >> to have bad sectors. So it looks like either this new motherboard has >> a bad connector, or the cable is bad. I'm going to swap it out for a >> different SATA cable. How do I resume the failed operation? And >> should I reboot because of the OOPSes? >> >> -- >> Timothy Normand Miller, PhD >> Assistant Professor of Computer Science, Binghamton University >> http://www.cs.binghamton.edu/~millerti/ >> Open Graphics Project > > > > -- > Timothy Normand Miller, PhD > Assistant Professor of Computer Science, Binghamton University > http://www.cs.binghamton.edu/~millerti/ > Open Graphics Project -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
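The refusal above follows directly from the device bookkeeping: the filesystem registers five devids, but only three are present. A small sketch of that arithmetic, using a trimmed copy of the `btrfs filesystem show` output from this thread (second volume only):

```shell
# Trimmed from the 'btrfs filesystem show' output quoted above.
sample='Total devices 5 FS bytes used 1.46TiB
devid 2 size 931.51GiB used 767.00GiB path /dev/sdd
devid 3 size 931.51GiB used 745.03GiB path /dev/sdc
devid 4 size 931.51GiB used 767.00GiB path /dev/sdb'

total=$(echo "$sample" | awk '/Total devices/ {print $3}')
present=$(echo "$sample" | grep -c '^devid')
echo "missing: $((total - present))"
# 5 registered - 3 present = 2 missing; RAID1 tolerates at most 1, hence
# "too many missing devices, writeable mount is not allowed".
```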
Re: Damaged filesystem, can read, can't repair, error says to contact devs
It resumed on its own. Weird.

On Wed, Aug 12, 2015 at 4:23 PM, Timothy Normand Miller wrote:
> On Wed, Aug 12, 2015 at 2:10 PM, Chris Murphy wrote:
>
>> Anyway it looks like it's hardware related, but I don't know what
>> device ata4.00 is, so maybe this helps:
>> http://superuser.com/questions/617192/mapping-ata-device-number-to-logical-device-name
>
> # ata=4; ls -l /sys/block/sd* | grep $(grep $ata /sys/class/scsi_host/host*/unique_id | awk -F'/' '{print $5}')
> lrwxrwxrwx 1 root root 0 Aug 12 16:21 /sys/block/sde ->
> ../devices/pci0000:00/0000:00:1f.5/ata4/host3/target3:0:0/3:0:0:0/block/sde
>
> sde is the newly attached drive, replacing the one that had appeared
> to have bad sectors. So it looks like either this new motherboard has
> a bad connector, or the cable is bad. I'm going to swap it out for a
> different SATA cable. How do I resume the failed operation? And
> should I reboot because of the OOPSes?
>
> --
> Timothy Normand Miller, PhD
> Assistant Professor of Computer Science, Binghamton University
> http://www.cs.binghamton.edu/~millerti/
> Open Graphics Project

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Wed, Aug 12, 2015 at 2:10 PM, Chris Murphy wrote:
>
> Anyway it looks like it's hardware related, but I don't know what
> device ata4.00 is, so maybe this helps:
> http://superuser.com/questions/617192/mapping-ata-device-number-to-logical-device-name

# ata=4; ls -l /sys/block/sd* | grep $(grep $ata /sys/class/scsi_host/host*/unique_id | awk -F'/' '{print $5}')
lrwxrwxrwx 1 root root 0 Aug 12 16:21 /sys/block/sde ->
../devices/pci0000:00/0000:00:1f.5/ata4/host3/target3:0:0/3:0:0:0/block/sde

sde is the newly attached drive, replacing the one that had appeared to have bad sectors. So it looks like either this new motherboard has a bad connector, or the cable is bad. I'm going to swap it out for a different SATA cable. How do I resume the failed operation? And should I reboot because of the OOPSes?

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
I added a new device and then did a delete missing. I lost the terminal (should have used GNU screen), so I didn't see the stdout, but the operation aborted at some point. There's a ton of output in dmesg related to this, along with some OOPSes, which I have attached as "dmesg2" here:

https://bugzilla.kernel.org/show_bug.cgi?id=102691

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Tue, Aug 11, 2015 at 5:24 PM, Chris Murphy wrote:
>> There is still data redundancy. Will a scrub at least notice that the
>> copies differ?
>
> No, that's what I mean by "nodatasum means no raid1 self-healing is
> possible". You have data redundancy, but without checksums btrfs has
> no way to know if they differ. It doesn't do two reads and compare
> them; it's just like md raid: it picks one device, and so long as
> there's no read error from the device, that copy of the data is
> assumed to be good.

Ok, that makes sense. I'm guessing it wouldn't be worth it to add a feature like this because (a) few people use nodatacow or end up in my situation, and (b) if they did, and the two copies were inconsistent, what would you do? I suppose for me, it would be nice to know which files were affected.

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Tue, Aug 11, 2015 at 4:48 PM, Chris Murphy wrote:
>
> The compress is ignored, and it looks like nodatasum and nodatacow
> apply to everything. The nodatasum means no raid1 self-healing is
> possible for any data on the entire volume. Metadata checksumming is
> still enabled.

Ugh. So I need to change my fstab file. I swear, some expert on IRC told me that this should work fine, which is why I did it. In fact, I think they recommended it on the basis that I wanted to put VM images on one of the subvolumes. This discussion occurred a long time ago, well before RAID5 was even partially implemented.

There is still data redundancy. Will a scrub at least notice that the copies differ?

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Tue, Aug 11, 2015 at 3:57 PM, Chris Murphy wrote: > On Tue, Aug 11, 2015 at 12:04 PM, Timothy Normand Miller > wrote: > >> https://bugzilla.kernel.org/show_bug.cgi?id=102691 > > [7.729124] BTRFS: device fsid ecdff84d-b4a2-4286-a1c1-cd7e5396901c > devid 2 transid 226237 /dev/sdd > [7.746115] BTRFS: device fsid ecdff84d-b4a2-4286-a1c1-cd7e5396901c > devid 4 transid 226237 /dev/sdb > [7.826493] BTRFS: device fsid ecdff84d-b4a2-4286-a1c1-cd7e5396901c > devid 3 transid 226237 /dev/sdc > > What do you get for 'btrfs fi show' # btrfs fi show Label: none uuid: 49ac9ad2-b529-4e6e-aef9-1c5b9e8a72f8 Total devices 1 FS bytes used 28.33GiB devid1 size 79.69GiB used 41.03GiB path /dev/sda3 Label: none uuid: ecdff84d-b4a2-4286-a1c1-cd7e5396901c Total devices 4 FS bytes used 1.46TiB devid2 size 931.51GiB used 767.00GiB path /dev/sdd devid3 size 931.51GiB used 760.03GiB path /dev/sdc devid4 size 931.51GiB used 767.00GiB path /dev/sdb *** Some devices missing Label: none uuid: f9331766-e50a-43d5-98dc-fabf5c68321d Total devices 1 FS bytes used 2.99TiB devid1 size 3.64TiB used 3.01TiB path /dev/sde1 btrfs-progs v4.1.2 > > I see devid 2, 3, 4 only for this volume UUID. So you definitely > appear to have a failed device and that's why it doesn't mount > automatically at boot time. You just need to use -o degraded, and that > should work assuming no problems with the other three devices. If it > does work, 'btrfs replace start...' is the ideal way to replace the > failed drive. It's missing because I physically disconnected it. Someone on IRC suggested I try this in case the drive with the bad sector was interfering. Of course, now that I've done this and mounted read/write, we can't reintegrate the failing drive. If I lose the array, I won't cry. The backup appears to be complete. But it would be convenient to avoid having to restore from scratch, and I'm hoping this might help you guys too in some way. 
I really like btrfs, and I would like provide you with whatever info might contribute something. > > Maybe someone else can say whether nodatacow as a subvolume mount > option will apply this to the entire volume. At the moment, I'm only trying to mount the whole volume, just so I could recover and scrub it, although as I mentioned in my earlier email, the scrub aborts with no report of why and with "0 errors." -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Tue, Aug 11, 2015 at 3:47 PM, Chris Murphy wrote: > > Huh. I thought nodatacow applies to an entire volume only, not per > subvolume unless you use chattr +C (in which case it can be per > subvolume, directory or per file). I could be confused, but I think > you have mutually exclusive mount options. Well, at the time I set up this system, I asked on IRC, and people said it should work. I've never seen any errors from this. >> >> [94312.091613] BTRFS info (device sdc): allowing degraded mounts >> [94312.091618] BTRFS info (device sdc): disk space caching is enabled >> [94312.194513] BTRFS: bdev (null) errs: wr 1724, rd 305, flush 45, >> corrupt 0, gen 2 >> [94319.824563] BTRFS: checking UUID tree > > I don't see any mount failure message. It worked then? Yes and no. It's mounted, but a scrub aborts silently: # btrfs scrub status /mnt/btrfs/ scrub status for ecdff84d-b4a2-4286-a1c1-cd7e5396901c scrub started at Tue Aug 11 13:56:36 2015 and was aborted after 01:31:55 total bytes scrubbed: 2.19TiB with 0 errors No new messages appeared in dmesg, so I can't tell why it aborted. It's also odd that it reports zero errors, given that it aborted. -- Timothy Normand Miller, PhD Assistant Professor of Computer Science, Binghamton University http://www.cs.binghamton.edu/~millerti/ Open Graphics Project -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Tue, Aug 11, 2015 at 1:56 PM, Timothy Normand Miller wrote:
> On Tue, Aug 11, 2015 at 12:21 AM, Chris Murphy wrote:
>> The entire dmesg is still useful because it should show libata errors
>> if these aren't fully failed drives. So you should file a bug and
>> include, literally, the entire unedited dmesg.
>
> Alright, I'll do that. Thanks!

Here you go:

https://bugzilla.kernel.org/show_bug.cgi?id=102691

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Tue, Aug 11, 2015 at 12:21 AM, Chris Murphy wrote:
> On Mon, Aug 10, 2015 at 7:23 PM, Timothy Normand Miller wrote:
>> On Mon, Aug 10, 2015 at 6:52 PM, Chris Murphy wrote:
>
>>> - complete dmesg for the failed mount
>>
>> It really doesn't say much. I have things like this:
>>
>> [8.643535] BTRFS info (device sdc): disk space caching is enabled
>> [8.643789] BTRFS: failed to read the system array on sdc
>> [8.706062] BTRFS: open_ctree failed
>> [8.707124] BTRFS info (device sdc): disk space caching is enabled
>> [8.710924] BTRFS: failed to read the system array on sdc
>> [8.766080] BTRFS: open_ctree failed
>> [8.766903] BTRFS info (device sdc): setting nodatacow, compression disabled
>> [8.766905] BTRFS info (device sdc): disk space caching is enabled
>> [8.767152] BTRFS: failed to read the system array on sdc
>> [8.936019] BTRFS: open_ctree failed
>> [8.936906] BTRFS info (device sdc): disk space caching is enabled
>> [8.939922] BTRFS: failed to read the system array on sdc
>> [8.995984] BTRFS: open_ctree failed
>> [8.996796] BTRFS info (device sdc): disk space caching is enabled
>> [8.997093] BTRFS: failed to read the system array on sdc
>> [9.125936] BTRFS: open_ctree failed
>
> It looks like there's not enough redundancy remaining to mount and in
> such a case there's really not much to be done.
>
> I don't see nodatacow in your fstab, so I don't know why that's
> happening. That means no checksumming for data.

Sorry. I was dumb. I only showed you the entry for what I was trying to mount manually.
I have subvolumes, and this is what is in my fstab:

UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /home btrfs compress=lzo,noatime,space_cache,subvol=home 0 2
UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /mnt/btrfs btrfs compress=lzo,noatime,space_cache 0 2
UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /mnt/vms btrfs noatime,nodatacow,space_cache,subvol=vms 0 2
UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /mnt/oldfiles btrfs compress=lzo,noatime,space_cache,subvol=oldfiles 0 2
UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /mnt/backup btrfs compress=lzo,noatime,space_cache,subvol=backup 0 2

>> Also, when I manually try to mount, I get things like this:
>>
>> # mount /mnt/btrfs
>> mount: wrong fs type, bad option, bad superblock on /dev/sdc,
>>        missing codepage or helper program, or other error
>
> Have you tried to mount with -o degraded?

Ooh! I can do that!

Mounting ro,degraded, I see this:

[94197.902443] BTRFS info (device sdc): allowing degraded mounts
[94197.902448] BTRFS info (device sdc): disk space caching is enabled
[94198.240621] BTRFS: bdev (null) errs: wr 1724, rd 305, flush 45, corrupt 0, gen 2

Mounting rw,degraded, I see this:

[94312.091613] BTRFS info (device sdc): allowing degraded mounts
[94312.091618] BTRFS info (device sdc): disk space caching is enabled
[94312.194513] BTRFS: bdev (null) errs: wr 1724, rd 305, flush 45, corrupt 0, gen 2
[94319.824563] BTRFS: checking UUID tree

> The entire dmesg is still useful because it should show libata errors
> if these aren't fully failed drives. So you should file a bug and
> include, literally, the entire unedited dmesg.

Alright, I'll do that. Thanks!
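[Editorial aside: the compress/nodatacow mix Chris flagged can be spotted mechanically. A rough sketch in awk, under the assumption that non-subvol mount options apply to the whole filesystem at first mount rather than per subvolume; the sample lines are abbreviated copies of the fstab quoted above:]

```shell
# Abbreviated copies of the fstab entries quoted above (sample data); on a
# live system you would read /etc/fstab instead.
fstab='UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /home btrfs compress=lzo,noatime,space_cache,subvol=home 0 2
UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /mnt/vms btrfs noatime,nodatacow,space_cache,subvol=vms 0 2'

# Report any btrfs filesystem (keyed by UUID) mounted with compress in one
# entry and nodatacow in another.
printf '%s\n' "$fstab" | awk '
$3 == "btrfs" {
    if ($4 ~ /compress/)  compress[$1] = 1
    if ($4 ~ /nodatacow/) nocow[$1]    = 1
}
END {
    for (u in compress)
        if (u in nocow) print "conflicting options for " u
}'
```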
Re: Damaged filesystem, can read, can't repair, error says to contact devs
On Mon, Aug 10, 2015 at 6:52 PM, Chris Murphy wrote:
> Four needed things:
> - kernel version

4.1.0-gentoo-r1, although I have also tried 4.1.4.

> - btrfs-progs version

4.1.2

> - complete dmesg for the failed mount

It really doesn't say much. I have things like this:

[8.643535] BTRFS info (device sdc): disk space caching is enabled
[8.643789] BTRFS: failed to read the system array on sdc
[8.706062] BTRFS: open_ctree failed
[8.707124] BTRFS info (device sdc): disk space caching is enabled
[8.710924] BTRFS: failed to read the system array on sdc
[8.766080] BTRFS: open_ctree failed
[8.766903] BTRFS info (device sdc): setting nodatacow, compression disabled
[8.766905] BTRFS info (device sdc): disk space caching is enabled
[8.767152] BTRFS: failed to read the system array on sdc
[8.936019] BTRFS: open_ctree failed
[8.936906] BTRFS info (device sdc): disk space caching is enabled
[8.939922] BTRFS: failed to read the system array on sdc
[8.995984] BTRFS: open_ctree failed
[8.996796] BTRFS info (device sdc): disk space caching is enabled
[8.997093] BTRFS: failed to read the system array on sdc
[9.125936] BTRFS: open_ctree failed

Also, when I manually try to mount, I get things like this:

# mount /mnt/btrfs
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so.

For this fstab entry:

UUID=ecdff84d-b4a2-4286-a1c1-cd7e5396901c /mnt/btrfs btrfs compress=lzo,noatime,space_cache 0 2

# mount -t btrfs /dev/sdd /mnt/btrfs
mount: wrong fs type, bad option, bad superblock on /dev/sdd,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so.

> - complete btrfs check output (you mostly have this but since the
> version isn't included, it's not clear this is the entire output)

I pasted it all.
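[Editorial aside: one quick way to see how little that log says is to count the mount attempts, since each failed attempt ends in exactly one "open_ctree failed" line. A sketch with an excerpt of the dmesg above inlined as sample data; on a live box you would pipe `dmesg` itself through the same filter:]

```shell
# Excerpt of the dmesg lines above, inlined as sample data.
log='[8.643535] BTRFS info (device sdc): disk space caching is enabled
[8.643789] BTRFS: failed to read the system array on sdc
[8.706062] BTRFS: open_ctree failed
[8.707124] BTRFS info (device sdc): disk space caching is enabled
[8.710924] BTRFS: failed to read the system array on sdc
[8.766080] BTRFS: open_ctree failed'

# Each failed mount attempt produces one "open_ctree failed" line.
failures=$(printf '%s\n' "$log" | grep -c 'open_ctree failed')
echo "failed mount attempts: $failures"
```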
> The last two can be included as attachments in a bugzilla.kernel.org
> bug report and the URL posted in this thread. Typically MUA wrapping
> nerfs the dmesg making it hard to read, so attachments to a bug report
> are better.

Well, if I get something lengthy, I'll attach it to my bug report. Did the information I reported help at all? I think that btrfs just isn't being informative about the problem. Are there other commands I can run to get more detailed reports?

BTW, I tried disconnecting the drive with the bad sector. I still get all the same errors and can't repair.

> Bugs get reported both in bugzilla and on the list.
> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#How_do_I_report_bugs_and_issues.3F
>
> Sometimes it takes a while for devs to respond, they also get worked
> on even without responses just because there's so many improvements
> each release.
Damaged filesystem, can read, can't repair, error says to contact devs
Hi, everyone,

I have a four-drive RAID1 array, and since yesterday, some problem has rendered it unmountable (read/write, anyhow). One drive reports a read error, so maybe the drive is failing, but I've had that happen before, and it was easy to swap in a new drive. This time, two more drives are reporting that they "failed to read the system array." I managed to mount it read-only (by specifying the node of the fourth drive) and rsync everything to a backup drive.

Now I'd like to try to repair it, and this is where I'm running into problems. Since I can't mount it read-write, I can't do a scrub, so I tried "btrfs check --repair", and this is what I got:

# btrfs check --repair /dev/sde
enabling repair mode
Checking filesystem on /dev/sde
UUID: ecdff84d-b4a2-4286-a1c1-cd7e5396901c
checking extents
ref mismatch on [1667931533312 524288] extent item 1, found 2
attempting to repair backref discrepency for bytenr 1667931533312
Ref doesn't match the record start and is compressed, please take a btrfs-image of this file system and send it to a btrfs developer so they can complete this functionality for bytenr 1667931639808
failed to repair damaged filesystem, aborting

Since this specifically told me to contact a developer, I figured this is something you guys want to know about. :) Also, I was wondering if perhaps someone can help me figure out how to repair it. There are only two files that appear to be unrecoverable when I rsync, and I can restore those from an earlier backup. But since I can't mount read/write, I can't go and delete those files, so I seem to be stuck.

BTRFS works beautifully for me in single-drive configurations: I run several of those, and I've never had a problem with them. On the other hand, I seem to have LOTS of trouble with this 4-drive RAID1. I get OOPSes regularly. I've tried reporting them on bugzilla.kernel.org, but it doesn't appear that btrfs devs actually use that. Is this list a better place to report those?

Thanks for the help!
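[Editorial aside: when quoting a `ref mismatch` like the one above in a bug report, the bytenr can be extracted mechanically rather than retyped. A small sed sketch, with the check output inlined as sample data:]

```shell
# The `btrfs check` complaint, inlined as sample data.
line='ref mismatch on [1667931533312 524288] extent item 1, found 2'

# Pull out the bytenr (first number inside the brackets) for the bug report.
bytenr=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9][0-9]*\) .*/\1/p')
echo "bytenr=$bytenr"
```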