Re: Unmountable Array After Drive Failure During Device Deletion
- Array is good. All drives are accounted for, btrfs scrub runs cleanly. btrfs fi show shows no missing drives and reasonable allocations.
- I start btrfs dev del to remove devid 9. It chugs along with no errors, until:
- Another drive in the array (NOT THE ONE I RAN DEV DEL ON) fails, and all reads and writes to it fail, causing the SCSI errors above.
- I attempt a clean shutdown. It takes too long because my drive controller card is buzzing loudly and the neighbors are sensitive to noise, so:
- I power down the machine uncleanly.
- I remove the failed drive, NOT the one I ran dev del on.
- I reboot and attempt to mount with various options, all of which cause the kernel to yell at me; the mount command returns failure.

devid 9 is device delete in-progress, and while that's occurring devid 15 fails completely. Is that correct?

Either devid 14 or devid 10 (from memory) dropped out; devid 15 is still working.

Because previously you reported, in part, this:

devid 15 size 1.82TB used 1.47TB path /dev/sdd
*** Some devices missing

And this:

sd 0:2:3:0: [sdd] Unhandled error code

Yeah, those two are from different boots. sdd is the one that dropped out, and after a reboot another (working) drive was renumbered to sdd. Sorry for the confusion. (Also note that if devid 15 was missing, it would not be reported in btrfs fi show.)

That's why I was confused. It looks like the dead/missing device is one devid, and then devid 15 /dev/sdd is also having hardware problems - because all of this was posted at the same time. But I take it they're different boots and the /dev/sdd's are actually two different devids. So devid 9 was deleted and then devid 14 failed. Right? Lovely when /dev/sdX changes between boots.

It never finished the deletion (it was probably about halfway through, based on previous dev dels), but otherwise yes.
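Since /dev/sdX names get reshuffled between boots, confusion like the above is easy to avoid by recording the devid-to-path mapping from btrfs fi show at each boot before comparing logs. A small sketch of that (my own helper, not a btrfs tool; the sample text follows the fi show format quoted later in this thread):

```python
import re

def devid_paths(fi_show_output):
    """Parse `btrfs fi show` output into {devid: path}."""
    pat = re.compile(r"devid\s*(\d+)\s+size\s+\S+\s+used\s+\S+\s+path\s+(\S+)")
    return {int(m.group(1)): m.group(2) for m in pat.finditer(fi_show_output)}

sample = """
Label: 'lake'  uuid: d5e17c49-d980-4bde-bd96-3c8bc95ea077
    Total devices 10 FS bytes used 7.43TB
    devid    9 size 1.82TB used 1.61TB path /dev/sdm1
    devid   12 size 1.82TB used 1.47TB path /dev/sdb
"""
print(devid_paths(sample))  # {9: '/dev/sdm1', 12: '/dev/sdb'}
```

Diffing two such snapshots from different boots makes it obvious which devid a recycled name like sdd currently belongs to.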
From what I understand, at all points there should be at least two copies of every extent during a dev del when all chunks are allocated RAID10 (and they are, according to btrfs fi df run before on the mounted fs). Because of this, I expect to be able to use the chunks from the (not successfully removed) devid 9, as I have done many times before due to other btrfs bugs that needed unclean shutdowns during dev del.

I haven't looked at the code or read anything this specific on the state of the file system during a device delete. But my expectation is that there are 1-2 chunks available for writes, and 2-3 chunks available for reads. Some writes must be only one copy because a chunk hasn't yet been replicated elsewhere, and presumably the device being deleted is not subject to writes, as the transid also implies. Whereas devid 9 is one set of chunks for reading, those chunks have pre-existing copies elsewhere in the file system, so that's two copies. And there's a replication in progress of the soon-to-be-removed chunks, so that's up to three copies.

The problem is that for sure you've lost some chunks due to the failed/missing device. With normal RAID10, it's unambiguous whether we've lost two mirrored sets. With btrfs that's not clear, as chunks are distributed. So it's possible that there are some chunks that don't exist at all for writes, and only 1 for reads. It may be that no chunks are in common between devid 9 and the dead one; it may be that only a couple of data or metadata chunks are in common.

Under the assumption devid 9 is good, even if slightly out of date on transid (which ALL data says is true), I should be able to completely recover all data, because data that was not modified during the deletion resides on devid 9, and data that was modified should be redundantly (RAID10) stored on the remaining drives, and thus should work given this case of a single drive failure. Is this not the case? Does btrfs not maintain redundancy during device removal?

Good questions.
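The copy-counting argument above can be put in a toy model. This is not btrfs code, just a sketch under the assumptions that each RAID10 chunk is mirrored on exactly two devices and that copies already relocated off the device being deleted still exist on it at an older transid:

```python
# Toy model: during `dev del` of devid 9, devid 14 dies.  For a chunk mirrored
# on the given pair of devices, count up-to-date copies and stale copies
# (the old-transid copy still sitting on devid 9).
def copies_after_failure(chunk_devs, failed=14, deleting=9):
    live = [d for d in chunk_devs if d != failed]
    current = [d for d in live if d != deleting]  # written at the latest transid
    stale = [d for d in live if d == deleting]    # readable, but out of date
    return len(current), len(stale)

print(copies_after_failure({9, 14}))   # (0, 1): only the stale devid-9 copy left
print(copies_after_failure({9, 12}))   # (1, 1)
print(copies_after_failure({14, 16}))  # (1, 0)
print(copies_after_failure({12, 16}))  # (2, 0)
```

Only chunks mirrored exactly across the deleting device and the failed one drop to zero current copies, which matches the point that it depends entirely on how many chunks devid 9 and the dead device had in common.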
I'm not certain, but the speculation seems reasonable, not accounting for the missing device. That's what makes this different.

btrfs read error corrected: ino 1 off 87601116364800 (dev /dev/sdf sector 62986400)
btrfs read error corrected: ino 1 off 87601116798976 (dev /dev/sdg sector 113318256)

I'm not sure what constitutes a btrfs read error; maybe the device it originally requested data from didn't have it where it was expected, but it was able to find it on these devices. If the drive itself has a problem reading a sector and ECC can't correct it, it reports the read error to libata. So kernel messages report this with a line that starts with the word exception, then a line with cmd that shows what command and LBAs were issued to the drive, and then a res line that should contain an error mask with the actual error - bus error, media error. Very often you don't see these and instead see link reset messages, which means the drive is hanging doing something (probably attempting ECC) but then the linux SCSI layer hits its 30 second timeout on the hung queued command and resets the drive instead of waiting any longer.
Unmountable Array After Drive Failure During Device Deletion
I'm using btrfs in data and metadata RAID10 on drives (not on md or any other fanciness.) I was removing a drive (btrfs dev del) and during that operation, a different drive in the array failed. Having not had this happen before, I shut down the machine immediately due to the extremely loud piezo buzzer on the drive controller card. I attempted to do so cleanly, but the buzzer cut through my patience and after 4 minutes I cut the power. Afterwards, I located and removed the failed drive from the system, and then got back to linux.

The array no longer mounts (failed to read the system array on sdc), with nearly identical messages when attempted with -o recovery and -o recovery,ro. btrfsck asserts and coredumps, as usual. The drive that was being removed is devid 9 in the array, and is /dev/sdm1 in the btrfs fi show seen below. Kernel 3.12.4-1-ARCH, btrfs-progs v0.20-rc1-358-g194aa4a-dirty (archlinux build.)

Can I recover the array?

== dmesg during failure ==
...
sd 0:2:3:0: [sdd] Unhandled error code
sd 0:2:3:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
sd 0:2:3:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 89 5b 00 00 00 80 00
end_request: I/O error, dev sdd, sector 646535936
btrfs_dev_stat_print_on_error: 7791 callbacks suppressed
btrfs: bdev /dev/sdd errs: wr 315858, rd 230194, flush 0, corrupt 0, gen 0
sd 0:2:3:0: [sdd] Unhandled error code
sd 0:2:3:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
sd 0:2:3:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 89 5b 80 00 00 80 00
end_request: I/O error, dev sdd, sector 646536064
...
== dmesg after new boot, mounting attempt ==
btrfs: device label lake devid 11 transid 4893967 /dev/sda
btrfs: disk space caching is enabled
btrfs: failed to read the system array on sdc
btrfs: open_ctree failed

== dmesg after new boot, mounting attempt with -o recovery,ro ==
btrfs: device label lake devid 11 transid 4893967 /dev/sda
btrfs: enabling auto recovery
btrfs: disk space caching is enabled
btrfs: failed to read the system array on sdc
btrfs: open_ctree failed

== btrfsck ==
deep# btrfsck /dev/sda
warning, device 14 is missing
warning devid 14 not found already
parent transid verify failed on 87601116364800 wanted 4893969 found 4893913
parent transid verify failed on 87601116364800 wanted 4893969 found 4893913
parent transid verify failed on 87601116381184 wanted 4893969 found 4893913
parent transid verify failed on 87601116381184 wanted 4893969 found 4893913
parent transid verify failed on 87601115320320 wanted 4893969 found 4893913
parent transid verify failed on 87601115320320 wanted 4893969 found 4893913
parent transid verify failed on 87601117097984 wanted 4893969 found 4892460
parent transid verify failed on 87601117097984 wanted 4893969 found 4892460
Ignoring transid failure
Checking filesystem on /dev/sda
UUID: d5e17c49-d980-4bde-bd96-3c8bc95ea077
checking extents
parent transid verify failed on 87601117159424 wanted 4893969 found 4893913
parent transid verify failed on 87601117159424 wanted 4893969 found 4893913
parent transid verify failed on 87601116368896 wanted 4893969 found 4893913
parent transid verify failed on 87601116368896 wanted 4893969 found 4893913
parent transid verify failed on 87601117163520 wanted 4893969 found 4893913
parent transid verify failed on 87601117163520 wanted 4893969 found 4893913
parent transid verify failed on 87601117638656 wanted 4893969 found 4893913
parent transid verify failed on 87601117638656 wanted 4893969 found 4893913
Ignoring transid failure
parent transid verify failed on 87601117171712 wanted 4893969 found 4893913
parent transid verify failed on 87601117171712 wanted 4893969 found 4893913
parent transid verify failed on 87601117175808 wanted 4893969 found 4893913
parent transid verify failed on 87601117175808 wanted 4893969 found 4893913
parent transid verify failed on 87601117188096 wanted 4893969 found 4893913
parent transid verify failed on 87601117188096 wanted 4893969 found 4893913
parent transid verify failed on 87601116807168 wanted 4893969 found 4893913
parent transid verify failed on 87601116807168 wanted 4893969 found 4893913
Ignoring transid failure
parent transid verify failed on 87601117642752 wanted 4893969 found 4893913
parent transid verify failed on 87601117642752 wanted 4893969 found 4893913
Ignoring transid failure
parent transid verify failed on 87601117650944 wanted 4893969 found 4893913
parent transid verify failed on 87601117650944 wanted 4893969 found 4893913
Ignoring transid failure
Couldn't map the block 5764607523034234880
btrfsck: volumes.c:1019: btrfs_num_copies: Assertion `!(!ce)' failed.
zsh: abort (core dumped)  btrfsck /dev/sda

== btrfs fi show ==
Label: 'lake'  uuid: d5e17c49-d980-4bde-bd96-3c8bc95ea077
	Total devices 10 FS bytes used 7.43TB
	devid    9 size 1.82TB used 1.61TB path /dev/sdm1
	devid   12 size 1.82TB used 1.47TB path /dev/sdb
	devid   16 size 1.82TB used 1.47TB path /dev/sde
	devid   13 size 1.82TB used 1.47TB path /dev/sdc
	devid   11
Re: Unmountable Array After Drive Failure During Device Deletion
I'm using btrfs in data and metadata RAID10 on drives (not on md or any other fanciness.) I was removing a drive (btrfs dev del) and during that operation, a different drive in the array failed. Having not had this happen before, I shut down the machine immediately due to the extremely loud piezo buzzer on the drive controller card. I attempted to do so cleanly, but the buzzer cut through my patience and after 4 minutes I cut the power. Afterwards, I located and removed the failed drive from the system, and then got back to linux. The array no longer mounts (failed to read the system array on sdc), with nearly identical messages when attempted with -o recovery and -o recovery,ro.

This may be a stupid question, but you're missing a drive so the filesystem will be degraded, but you didn't mention that in your mount options, so... Did you try mounting with -o degraded (possibly with recovery, etc, also, but just try -o degraded plus any normal options first)?

I did not try degraded because I didn't remember that there were two different options for handling broken btrfs volumes.
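For reference, this is the order I'd try the two option families in. A sketch only: device and mountpoint names are assumed, and the real mount line is commented out so nothing runs by accident.

```shell
#!/bin/sh
# Assumed names; adjust DEV and MNT for the actual system.
DEV=/dev/sda
MNT=/mnt/lake
# Read-only first; add recovery only if plain degraded fails.
for opts in "ro,degraded" "ro,degraded,recovery"; do
    echo "attempting: mount -o $opts $DEV $MNT"
    # mount -o "$opts" "$DEV" "$MNT" && break   # uncomment on the real machine
done
```

degraded tells btrfs to tolerate the missing device; recovery additionally lets it fall back to an older tree root, so trying them separately tells you which problem you actually have.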
mount -o degraded,ro yields:

btrfs: device label lake devid 11 transid 4893967 /dev/sda
btrfs: allowing degraded mounts
btrfs: disk space caching is enabled
parent transid verify failed on 87601116364800 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116364800 (dev /dev/sdf sector 62986400)
parent transid verify failed on 87601116381184 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116381184 (dev /dev/sdf sector 62986432)
parent transid verify failed on 87601115320320 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601115320320 (dev /dev/sdf sector 62985896)
parent transid verify failed on 87601116368896 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116368896 (dev /dev/sdf sector 62986408)
parent transid verify failed on 87601116377088 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116377088 (dev /dev/sdf sector 62986424)
btrfs: bdev (null) errs: wr 344288, rd 230234, flush 0, corrupt 0, gen 0
btrfs: bdev /dev/sdm1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
btrfs: bdev /dev/sdg errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
parent transid verify failed on 87601117097984 wanted 4893969 found 4892460
Failed to read block groups: -5
btrfs: open_ctree failed

mount -o degraded,recovery,ro yields:

btrfs: device label lake devid 11 transid 4893967 /dev/sda
btrfs: allowing degraded mounts
btrfs: enabling auto recovery
btrfs: disk space caching is enabled
parent transid verify failed on 87601116798976 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116798976 (dev /dev/sdg sector 113318256)
parent transid verify failed on 87601119379456 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601119379456 (dev /dev/sdg sector 113319456)
parent transid verify failed on 87601116774400 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116774400 (dev /dev/sdg sector 113318208)
parent transid verify failed on 87601119391744 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601119391744 (dev /dev/sdg sector 113319480)
parent transid verify failed on 87601116778496 wanted 4893969 found 4893913
btrfs read error corrected: ino 1 off 87601116778496 (dev /dev/sdg sector 113318216)
parent transid verify failed on 87601116786688 wanted 4893969 found 4893849
btrfs read error corrected: ino 1 off 87601116786688 (dev /dev/sdg sector 113318232)
btrfs: bdev (null) errs: wr 344288, rd 230234, flush 0, corrupt 0, gen 0
btrfs: bdev /dev/sdm1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
btrfs: bdev /dev/sdg errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
parent transid verify failed on 8760515136 wanted 4893968 found 4893913
btrfs read error corrected: ino 1 off 8760515136 (dev /dev/sdg sector 113315616)
parent transid verify failed on 8760523328 wanted 4893968 found 4893913
btrfs read error corrected: ino 1 off 8760523328 (dev /dev/sdg sector 113315632)
parent transid verify failed on 8760535616 wanted 4893968 found 4893913
btrfs read error corrected: ino 1 off 8760535616 (dev /dev/sdg sector 113315656)
parent transid verify failed on 8760556096 wanted 4893968 found 4893913
btrfs read error corrected: ino 1 off 8760556096 (dev /dev/sdg sector 113315696)
Failed to read block groups: -5
btrfs: open_ctree failed

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unmountable Array After Drive Failure During Device Deletion
I should also mention that the corrupt 4 errs on /dev/sdm1 and /dev/sdg are there from an earlier btrfs extent corruption bug, and do not exist on the filesystem anymore (a scrub hours before the device deletion completed with 0 errors.)
Re: Unmountable Array After Drive Failure During Device Deletion
On 12/19/2013 02:21 PM, Chris Murphy wrote:
On Dec 19, 2013, at 2:26 AM, Chris Kastorff encryp...@gmail.com wrote:

btrfs-progs v0.20-rc1-358-g194aa4a-dirty

Most of what you're using is in the kernel, so this is not urgent, but if it gets to needing btrfs check/repair, I'd upgrade to v3.12 progs: https://www.archlinux.org/packages/testing/x86_64/btrfs-progs/

Adding the testing repository is a bad idea for this machine; turning off the testing repository is extremely error prone. Instead, I am now using the btrfs tools from git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git's master (specifically 8cae184), which reports itself as:

deep# ./btrfs version
Btrfs v3.12

sd 0:2:3:0: [sdd] Unhandled error code
sd 0:2:3:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
sd 0:2:3:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 89 5b 00 00 00 80 00
end_request: I/O error, dev sdd, sector 646535936
btrfs_dev_stat_print_on_error: 7791 callbacks suppressed
btrfs: bdev /dev/sdd errs: wr 315858, rd 230194, flush 0, corrupt 0, gen 0
sd 0:2:3:0: [sdd] Unhandled error code
sd 0:2:3:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
sd 0:2:3:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 89 5b 80 00 00 80 00
end_request: I/O error, dev sdd, sector 646536064

These are hardware errors. And you have missing devices, or at least a message of missing devices. So if a device went bad, and a new one was added without deleting the missing one, then the new device only has new data. Data hasn't been recovered and replicated to the replacement. So it's possible, with a missing device that's not removed and a 2nd device failure, to lose some data.

This is not what happened, as I explained earlier; I shall explain again, with more verbosity:

- Array is good. All drives are accounted for, btrfs scrub runs cleanly. btrfs fi show shows no missing drives and reasonable allocations.
- I start btrfs dev del to remove devid 9. It chugs along with no errors, until:
- Another drive in the array (NOT THE ONE I RAN DEV DEL ON) fails, and all reads and writes to it fail, causing the SCSI errors above.
- I attempt a clean shutdown. It takes too long because my drive controller card is buzzing loudly and the neighbors are sensitive to noise, so:
- I power down the machine uncleanly.
- I remove the failed drive, NOT the one I ran dev del on.
- I reboot and attempt to mount with various options, all of which cause the kernel to yell at me; the mount command returns failure.

From what I understand, at all points there should be at least two copies of every extent during a dev del when all chunks are allocated RAID10 (and they are, according to btrfs fi df run before on the mounted fs). Because of this, I expect to be able to use the chunks from the (not successfully removed) devid 9, as I have done many times before due to other btrfs bugs that needed unclean shutdowns during dev del.

Under the assumption devid 9 is good, even if slightly out of date on transid (which ALL data says is true), I should be able to completely recover all data, because data that was not modified during the deletion resides on devid 9, and data that was modified should be redundantly (RAID10) stored on the remaining drives, and thus should work given this case of a single drive failure. Is this not the case? Does btrfs not maintain redundancy during device removal?

btrfs read error corrected: ino 1 off 87601116364800 (dev /dev/sdf sector 62986400)
btrfs read error corrected: ino 1 off 87601116798976 (dev /dev/sdg sector 113318256)

I'm not sure what constitutes a btrfs read error; maybe the device it originally requested data from didn't have it where it was expected, but it was able to find it on these devices. If the drive itself has a problem reading a sector and ECC can't correct it, it reports the read error to libata.
So kernel messages report this with a line that starts with the word exception, then a line with cmd that shows what command and LBAs were issued to the drive, and then a res line that should contain an error mask with the actual error - bus error, media error. Very often you don't see these and instead see link reset messages, which means the drive is hanging doing something (probably attempting ECC) but then the linux SCSI layer hits its 30 second timeout on the (hung) queued command and resets the drive instead of waiting any longer. And that's a problem also because it prevents bad sectors from being fixed by btrfs. So they just get worse, to the point where it can't do anything about the situation.

There was a single drive immediately failing all its writes and reads because that's how the controller card was configured. No ECC failures, no timeouts. I have hit those issues on other arrays, but the drive controller I'm using here correctly and immediately returned errors on requests when the drive failed. I am no stranger to SCSI error messages on both shitty drive interfaces (which behave as you
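On drives that do hang in ECC instead of erroring fast (not the case for this controller, per the above), the usual mitigation is to make the drive give up before the kernel's 30-second command timeout resets the link. A sketch; the smartctl lines assume smartmontools and are left commented since they touch real hardware:

```shell
#!/bin/sh
KERNEL_TIMEOUT=30   # Linux SCSI layer default, in seconds
# Show each disk's current per-command timeout (paths exist only on a real box).
for t in /sys/block/sd*/device/timeout; do
    [ -e "$t" ] && echo "$t = $(cat "$t")s"
done
echo "kernel default: ${KERNEL_TIMEOUT}s"
# smartctl -l scterc /dev/sdX         # query the drive's error recovery control
# smartctl -l scterc,70,70 /dev/sdX   # cap read/write recovery at 7.0s (< 30s)
```

The point is simply that the drive's internal recovery time must be shorter than the kernel timeout, so the drive reports a clean media error that btrfs can repair from the other mirror instead of getting reset mid-recovery.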
Re: BTRFS error in __btrfs_inc_extent_ref:1935: Object already exists
I have a (larger, 7x2TB at RAID10) filesystem that was recently hit by this. Same story: filesystem works normally, balance start, works for a while, then fails with similar stack traces and remounts read-only; after a reboot it does not mount at all, with similar error messages and stack traces. The FS is still in that state. I'll grab an image and mail a link privately. I don't need to do anything special for btrfs-image on a multi-device fs, right? Kernel version is 3.8.4-1-ARCH (archlinux.)

On Mon, Apr 1, 2013 at 6:31 AM, Josef Bacik jba...@fusionio.com wrote:
On Mon, Apr 01, 2013 at 02:12:07AM -0600, Roman Mamedov wrote:
On Mon, 1 Apr 2013 04:36:05 +0600 Roman Mamedov r...@romanrm.ru wrote:

Hello,

After a reboot the filesystem now does not mount at all, with similar messages. So thinking this was an isolated incident, I foolishly continued setting up scheduled balance on the other systems with btrfs that I have. And got into exactly the same situation on another machine!! Trying to balance this with -dusage=5, on kernel 3.8.5:

Data: total=215.01GB, used=141.76GB
System, DUP: total=32.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=9.38GB, used=1.09GB

Same messages, Object already exists. While I currently left the previously mentioned 2TB FS in an unmounted broken state, still waiting for any response from you on how to properly recover from this problem, in this new case I needed to restore the machine as soon as possible. I tried btrfsck --repair; it corrected a lot of errors, but in the end gave up with a message saying that it can't repair the filesystem; then I did btrfs-zero-log. After this the FS started mounting successfully again. Not sure if I got any data corruption as a result, but this is the root FS and /home, and the machine successfully booted up with no data lost in any of the apps that were active just before the crash (e.g. browser, IM and IRC clients), so probably not.
Can you capture an image of these broken file systems the next time it happens? You'll need to clone the progs here git://github.com/josefbacik/btrfs-progs.git and build, and then run btrfs-image -w /dev/whatever blah.img and then upload blah.img somewhere I can pull it down. You can use the -t and -c options too, but the -w is the most important since you have extent tree corruption.

Thanks,

Josef
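Putting those instructions together, the capture would look roughly like this. A sketch only: the image name and device are placeholders from the mail above, the build steps are commented out, and the command is echoed rather than run (it should be run against the unmounted filesystem).

```shell
#!/bin/sh
# git clone git://github.com/josefbacik/btrfs-progs.git
# cd btrfs-progs && make
IMG=blah.img
DEV=/dev/whatever
# -w walks all trees, which matters most for extent-tree corruption;
# -c sets compression and -t threads, both optional per Josef's mail.
echo "./btrfs-image -w -c9 -t4 $DEV $IMG"   # then upload $IMG somewhere fetchable
```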
Kernel Panic while defragging a large file
I have a btrfs volume spread over three 3TB disks, RAID1 data and metadata. The machine is old and underpowered; a 32-bit Atom box with 2GB of RAM. On it is a 1TB sparse file which is a dm-crypt volume containing an ext4 filesystem. For the past few months, I've been writing very slowly to the inner ext4 filesystem (~20KB/s.) I have not been running with autodefrag, so this file is very heavily fragmented (259627 extents according to filefrag.)

The box is running the latest archlinux kernel:

$ uname -a
Linux cracker 3.7.5-1-ARCH #1 SMP PREEMPT Mon Jan 28 10:38:12 CET 2013 i686 GNU/Linux

And the latest btrfs-progs in archlinux (forever v0.19 (ugh).)

Running:

btrfs fi defrag /media/lake/pu9

results in work for about 15 seconds, then several kernel BUGs over a short period, followed soon after by a kernel panic. There are several scattered "wrong amount of free space" messages before this, which I assume are the result of previous crashes and are harmless.

Note: this trace has some long lines truncated due to journalctl truncating by default. If desired, I can reproduce while telling journalctl not to truncate. Also, gmail might hard-wrap others (ugh.)

block group 8580959109120 has an wrong amount of free space
btrfs: failed to load free space cache for block group 8580959109120
BUG: unable to handle kernel paging request at 8829
IP: [c022f968] __kmalloc+0x58/0x160
*pde =
Oops: [#1] PREEMPT SMP
Modules linked in: nfsd auth_rpcgss nfs_acl tun ext4 crc16 jbd2 mbcache sha... i2c_a pata_acpi ata_piix uhci_hcd libata scsi_mod ehci_hcd usbcore usb_common
Pid: 1149, comm: btrfs-worker-4 Tainted: G O 3.7.5-1-ARCH #1 ASUS.../1000H
EIP: 0060:[c022f968] EFLAGS: 00010282 CPU: 1
EIP is at __kmalloc+0x58/0x160
EAX: EBX: ef638000 ECX: 8829 EDX: a341
ESI: c0723f50 EDI: f5802480 EBP: f035be88 ESP: f035be60
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 8829 CR3: 3015a000 CR4: 07d0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Process btrfs-worker-4 (pid: 1149, ti=f035a000 task=f0072530 task.ti=f035a000)
Stack: f035bec8 f871f909 f4c2e800 f86eb754 00e0 8050 8829 ef638000 f035bee4 f86eb754 e6f42780 eff90c00 f32b7d01 f1665dc4 eff90de0 f32b7c00 f4c2ef80 efc0c480 802a001f
Call Trace:
 [f871f909] ? btrfs_map_bio+0x179/0x240 [btrfs]
 [f86eb754] ? btrfs_csum_one_bio+0x54/0x2e0 [btrfs]
 [f86eb754] btrfs_csum_one_bio+0x54/0x2e0 [btrfs]
 [f86fa3df] __btrfs_submit_bio_start+0x2f/0x40 [btrfs]
 [f86ee1dd] run_one_async_start+0x3d/0x60 [btrfs]
 [f8722ac3] worker_loop+0xe3/0x480 [btrfs]
 [c0164365] ? __wake_up_common+0x45/0x70
 [f87229e0] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs]
 [c015b2f4] kthread+0x94/0xa0
 [c016] ? hrtimer_start+0x30/0x30
 [c04fdbf7] ret_from_kernel_thread+0x1b/0x28
 [c015b260] ? kthread_freezable_should_stop+0x50/0x50
Code: 89 c7 76 63 8b 4d 04 89 4d e4 8b 07 64 03 05 f4 e6 71 c0 8b 50 04 8b ... cb 8b
EIP: [c022f968] __kmalloc+0x58/0x160 SS:ESP 0068:f035be60 CR2: 8829
---[ end trace 8efd563dc8ae9b53 ]---

Several other kernel BUG lines and stack traces about "unable to handle paging request at %x" occur soon after, on various PIDs and various stack traces (including some from a writev to a socket, a fairly well-tested operation.) Eventually (~10 seconds) the kernel panics. My screen is too small to see the whole message, but I can probably scrounge it up with some effort if that's desired.

This feels like a kernel running out of ram problem.
I'm running rsync -avPS to defragment the file more manually, but will keep the old version around in case further testing is desired.

-Chris K
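The rsync -avPS approach works because -S (--sparse) rewrites the file through fresh allocations rather than touching extents in place. A self-contained toy of the same idea using only coreutils (tiny sizes stand in for the 1TB image; the temp directory is a sketch assumption):

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Fake "fragmented" sparse file: 1 MiB apparent size, a few bytes of real data.
truncate -s 1M "$tmp/frag"
printf 'data' | dd of="$tmp/frag" bs=1 seek=524288 conv=notrunc 2>/dev/null
# Rewriting through a sparse-aware copy lays the data out afresh.
cp --sparse=always "$tmp/frag" "$tmp/fresh"
ls -ls "$tmp"   # first column: blocks actually allocated for each copy
```

On the real volume the copy then replaces the original (after verifying it), which is exactly what the rsync run accomplishes without going through the defrag ioctl that triggers the crash.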