btrfs raid56 Was: "csum failed" that was not detected by scrub
Jaap Pieroen posted on Fri, 02 May 2014 17:48:13 + as excerpted:

> Duncan <1i5t5.duncan cox.net> writes:
>
>> To those that know the details, this tells the story.
>>
>> Btrfs raid5/6 modes are not yet code-complete, and scrub is one of the
>> incomplete bits. btrfs scrub doesn't know how to deal with raid5/6
>> properly just yet.
>>
>> The raid5/6 page (which I didn't otherwise see conveniently linked, I
>> dug it out of the recent changes list since I knew it was there from
>> on-list discussion):
>>
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>
> So raid5 is much more useless than I assumed. I read Marc's blog and
> figured that btrfs was ready enough.
>
> I'm really in trouble now. I tried to get rid of raid5 by doing a
> convert balance to raid1. But of course this triggered the same issue.
> And now I have a dead system, because the first thing btrfs does after
> mounting is continue the balance, which will crash the system and send
> me into a vicious loop.
>
> - How can I stop btrfs from continuing balancing?

That one's easy. See the Documentation/filesystems/btrfs.txt file in the
kernel tree, or the wiki, for btrfs mount options, one of which is
"skip_balance", added to address this very sort of problem! =:^)

Alternatively, mounting the filesystem read-only should prevent further
changes, including the balance, at least allowing you to get the data off
it.

> - How can I salvage this situation and convert to raid1?
>
> Unfortunately I have few spare drives left. Not enough to contain
> 4.7TiB of data.. :(

[OK, this goes a bit philosophical, but it's something to think about...]

If you've done your research and followed the advice of the warnings from
mkfs.btrfs and on the wiki, this is not a problem, since you know that
btrfs is still under heavy development and that, as a result, it's even
more critical to have current, tested backups for anything you value
anyway. Simply use those backups.
Which, by definition, means that if you don't have such backups, you
didn't consider the data all that valuable after all; your actions
perhaps give the lie to your claims. And there's no excuse for not doing
the research either, since if you really care about your data, you
research a filesystem you're not familiar with before trusting your data
to it. So again, if you didn't know btrfs was experimental and thus
didn't have those backups, by definition your actions say you didn't
really care about the data you put on it, no matter what your words
might say.

OTOH, there *IS* such a thing as not realizing the value of something
until you're in the process of losing it... that I do understand. But of
course try telling that to, for instance, someone who has just lost a
loved one they never actually /told/ that... Sometimes it's simply too
late. Though if it's going to happen, at least here I'd much rather it
happen to some data than to one of my own loved ones...

Anyway, at least for now you should still be able to recover most of the
data using skip_balance or read-only mounting. My guess is that if push
comes to shove you can either prioritize that data and give up a TiB or
two if it comes to that, or scrimp here and there, putting a few gigs on
the odd blank DVD you may have lying around, or downgrading a few meals
to ramen noodles to afford the $100 or so shipped that pricewatch says a
new 3 TB drive costs these days. I've been there, and have found that if
I think I need it badly enough, that $100 has a way of appearing, like I
said, even if I'm noodling it for a few meals to do it.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
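The skip_balance recovery sequence described above might look like this
(a sketch only; /dev/sdb and /mnt are placeholder names for the actual
device and mountpoint):

```shell
# Mount without resuming the interrupted balance, then cancel the
# paused balance for good (it stays paused, not cancelled, until told).
mount -o skip_balance /dev/sdb /mnt
btrfs balance cancel /mnt

# Or, if the only goal is to copy the data off, a read-only mount
# also prevents the balance from resuming:
mount -o ro /dev/sdb /mnt
```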
Re: Help with space
On May 2, 2014, at 3:08 PM, Hugo Mills wrote:
> On Fri, May 02, 2014 at 01:21:50PM -0600, Chris Murphy wrote:
>> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Something tells me btrfs replace (not device replace, simply
>>> replace) should be moved to btrfs device replace…
>>
>> The syntax for "btrfs device" is different though; replace is like
>> balance: btrfs balance start and btrfs replace start. And you can
>> also get a status on it. We don't (yet) have options to stop, start,
>> resume, which could maybe come in handy for long rebuilds when a
>> reboot is required (?), although maybe that just gets handled
>> automatically: set it to pause, then unmount, then reboot, then mount
>> and resume.
>>
>>> Well, I'd say two copies if it's only two devices in the raid1...
>>> would be true raid1. But if it's say four devices in the raid1, as
>>> is certainly possible with btrfs raid1, that if it's not mirrored
>>> 4-way across all devices, it's not true raid1, but rather some sort
>>> of hybrid raid, raid10 (or raid01) if the devices are so arranged,
>>> raid1+linear if arranged that way, or some form that doesn't nicely
>>> fall into a well defined raid level categorization.
>>
>> Well, md raid1 is always n-way. So if you use -n 3 and specify three
>> devices, you'll get 3-way mirroring (3 mirrors). But I don't know any
>> hardware raid that works this way; they all seem to treat raid1 as
>> strictly two devices. At 4 devices it's raid10, and only in pairs.
>>
>> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
>> something like raid1 (2 copies) + linear/concat, but the allocation
>> is round robin. I don't read code, but based on how a 3-disk raid1
>> volume grows VDI files as it's filled, it looks like 1GB chunks are
>> copied like this:
>>
>> Disk1  Disk2  Disk3
>> 134    124    235
>> 679    578    689
>>
>> So 1 through 9 each represent a 1GB chunk. Disks 1 and 2 each have a
>> chunk 1; disks 2 and 3 each have a chunk 2, and so on. Total of 9GB
>> of data taking up 18GB of space, 6GB on each drive. You can't do this
>> with any other raid1 as far as I know. You do definitely run out of
>> space on one disk first, though, because of uneven metadata to data
>> chunk allocation.
>
> The algorithm is that when the chunk allocator is asked for a block
> group (in pairs of chunks for RAID-1), it picks the number of chunks
> it needs, from different devices, in order of the device with the most
> free space. So, with disks of size 8, 4, 4, you get:
>
> Disk 1: 12345678
> Disk 2: 1357
> Disk 3: 2468
>
> and with 8, 8, 4, you get:
>
> Disk 1: 1234568A
> Disk 2: 1234579A
> Disk 3: 6789

Sure, in my example I was assuming equal-size disks. But it's a good
example to have uneven disks also, because it exemplifies all the more
the flexibility btrfs replication has over the alternatives, with odd
numbers *and* uneven sizes of disks.

Chris Murphy
Re: Help with space
On Fri, May 02, 2014 at 01:21:50PM -0600, Chris Murphy wrote:
> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Something tells me btrfs replace (not device replace, simply replace)
>> should be moved to btrfs device replace…
>
> The syntax for "btrfs device" is different though; replace is like
> balance: btrfs balance start and btrfs replace start. And you can also
> get a status on it. We don't (yet) have options to stop, start,
> resume, which could maybe come in handy for long rebuilds and a reboot
> is required (?) although maybe that just gets handled automatically:
> set it to pause, then unmount, then reboot, then mount and resume.
>
>> Well, I'd say two copies if it's only two devices in the raid1...
>> would be true raid1. But if it's say four devices in the raid1, as is
>> certainly possible with btrfs raid1, that if it's not mirrored 4-way
>> across all devices, it's not true raid1, but rather some sort of
>> hybrid raid, raid10 (or raid01) if the devices are so arranged,
>> raid1+linear if arranged that way, or some form that doesn't nicely
>> fall into a well defined raid level categorization.
>
> Well, md raid1 is always n-way. So if you use -n 3 and specify three
> devices, you'll get 3-way mirroring (3 mirrors). But I don't know any
> hardware raid that works this way. They all seem to be raid 1 is
> strictly two devices. At 4 devices it's raid10, and only in pairs.
>
> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
> something like raid1 (2 copies) + linear/concat. But that allocation
> is round robin. I don't read code but based on how a 3 disk raid1
> volume grows VDI files as it's filled it looks like 1GB chunks are
> copied like this
>
> Disk1  Disk2  Disk3
> 134    124    235
> 679    578    689
>
> So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a
> chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB of
> data taking up 18GB of space, 6GB on each drive. You can't do this
> with any other raid1 as far as I know. You do definitely run out of
> space on one disk first though because of uneven metadata to data
> chunk allocation.

The algorithm is that when the chunk allocator is asked for a block
group (in pairs of chunks for RAID-1), it picks the number of chunks
it needs, from different devices, in order of the device with the most
free space. So, with disks of size 8, 4, 4, you get:

Disk 1: 12345678
Disk 2: 1357
Disk 3: 2468

and with 8, 8, 4, you get:

Disk 1: 1234568A
Disk 2: 1234579A
Disk 3: 6789

   Hugo.

> Anyway I think we're off the rails with raid1 nomenclature as soon as
> we have 3 devices. It's probably better to call it replication, with
> an assumed default of 2 replicates unless otherwise specified.
>
> There's definitely a benefit to a 3 device volume with 2 replicates,
> efficiency wise. As soon as we go to four disks 2 replicates it makes
> more sense to do raid10, although I haven't tested odd device raid10
> setups so I'm not sure what happens.
>
> Chris Murphy

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Prisoner unknown: Return to Zenda. ---
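Hugo's allocation rule can be sketched as a small simulation (my reading
of the prose description above, not the actual kernel code; tie-breaking
between equally free devices is arbitrary here, so exact chunk labels
may differ from the real allocator for some size combinations):

```shell
#!/bin/bash
# Each new RAID-1 block group takes one chunk from each of the two
# devices that currently have the most free space. Sizes are counted
# in 1GB chunks; the example is Hugo's 8, 4, 4 case.
free=(8 4 4)
alloc=("" "" "")
chunk=1
while true; do
    # device indices ordered by remaining free space, descending
    # (ties broken by device index)
    mapfile -t order < <(
        for i in "${!free[@]}"; do echo "$i ${free[$i]}"; done |
            sort -k2,2nr -k1,1n | cut -d' ' -f1)
    a=${order[0]}; b=${order[1]}
    if [ "${free[$a]}" -eq 0 ] || [ "${free[$b]}" -eq 0 ]; then
        break   # fewer than two devices still have free chunks
    fi
    alloc[$a]+=$chunk
    alloc[$b]+=$chunk
    free[$a]=$(( free[$a] - 1 ))
    free[$b]=$(( free[$b] - 1 ))
    chunk=$(( chunk + 1 ))
done
for i in "${!alloc[@]}"; do
    echo "Disk $((i + 1)): ${alloc[$i]}"
done
```

For sizes 8, 4, 4 this reproduces the layout quoted above: disk 1 gets
chunks 1-8, disk 2 gets 1357, disk 3 gets 2468.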
Re: Unable to boot
On May 2, 2014, at 4:00 AM, George Pochiscan wrote:
> Hello,
>
> I have a problem with a server with Fedora 20 and BTRFS. This server
> had frequent hard restarts before the filesystem got corrupted, and we
> are unable to boot it.
>
> We have a HP Proliant server with 4 disks @ 1TB each and software
> RAID 5. It had Debian installed (I don't know the version) and right
> now I'm using Fedora 20 live to try to rescue the system.

Fedora 20 Live has kernel 3.11.10 and btrfs-progs
0.20.rc1.20131114git9f0c53f-1.fc20. So the general rule of thumb,
without knowing exactly what the problem and solution are, is to try a
much newer kernel and btrfs-progs, like a Fedora Rawhide live medium.
These are built daily, but don't always succeed, so you can go here to
find the latest of everything:

https://apps.fedoraproject.org/releng-dash/

Find Fedora Live Desktop or Live KDE and click on details. Click the
green link under descendants livecd, and then under Output listing
you'll see an ISO you can download; the one there right now is
Fedora-Live-Desktop-x86_64-rawhide-20140502.iso - but of course this
changes daily.

You might want to boot with the kernel parameter slub_debug=- (that's a
minus symbol), because all but the Monday-built Rawhide kernels have a
bunch of kernel debug options enabled, which makes them quite slow.
> When we try btrfsck /dev/md127 I have a lot of checksum errors, and
> the output is:
>
> Checking filesystem on /dev/md127
> UUID: e068faf0-2c16-4566-9093-e6d1e21a5e3c
> checking extents
> checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
> checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
> checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
> checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
> Csum didn't match
> checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
> checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
> checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
> checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
> Csum didn't match
> -
> extent buffer leak: start 1006686208 len 4096
> found 32039247396 bytes used err is -22
> total csum bytes: 41608612
> total tree bytes: 388857856
> total fs tree bytes: 310124544
> total extent tree bytes: 22016000
> btree space waste bytes: 126431234
> file data blocks allocated: 47227326464
> referenced 42595635200
> Btrfs v3.12

I suggest a recent Rawhide build. And I suggest just trying to mount the
filesystem normally first, and posting anything that appears in dmesg.
If the mount fails, then try mount option -o recovery, post any dmesg
messages from that too, and note whether or not it mounts. Finally, if
that doesn't work either, see whether -o ro,recovery works and what
kernel messages you get.

> When I attempt to repair I have the following error:
> -
> Backref 1005817856 parent 5 root 5 not found in extent tree
> backpointer mismatch on [1005817856 4096]
> owner ref check failed [1006686208 4096]
> repaired damaged extent references
> Failed to find [1000525824, 168, 4096]
> btrfs unable to find ref byte nr 1000525824 parent 0 root 1 owner 1 offset 0
> btrfsck: extent-tree.c:1752: write_one_cache_group: Assertion `!(ret)' failed.
> Aborted

You really shouldn't use --repair right off the bat; it's not a
recommended early step. You should try normal mounting with newer
kernels first, then the recovery mount options. Sometimes the repair
option makes things worse. I'm not sure what its safety status is as of
v3.14.

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ

Fedora includes btrfs-zero-log already, so depending on the kernel
messages you might try that before a btrfsck --repair.

Chris Murphy
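The escalation described above (normal mount, then -o recovery, then
read-only recovery) might look like this on George's system; a sketch
only, since each step is only needed if the previous one failed, and
/mnt is a placeholder mountpoint:

```shell
# 1) Try a normal mount first, and capture any btrfs kernel messages.
mount /dev/md127 /mnt
dmesg | tail -n 50

# 2) If that fails, let btrfs fall back to an older tree root.
mount -o recovery /dev/md127 /mnt

# 3) Last resort before considering btrfsck --repair: read-only
#    recovery, good enough to copy the data off.
mount -o ro,recovery /dev/md127 /mnt
```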
Re: Help with space
On May 2, 2014, at 2:23 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Something tells me btrfs replace (not device replace, simply replace)
> should be moved to btrfs device replace…

The syntax for "btrfs device" is different though; replace is like
balance: btrfs balance start and btrfs replace start. And you can also
get a status on it. We don't (yet) have options to stop, pause, resume,
which could maybe come in handy for long rebuilds when a reboot is
required (?), although maybe that just gets handled automatically: set
it to pause, then unmount, then reboot, then mount and resume.

> Well, I'd say two copies if it's only two devices in the raid1...
> would be true raid1. But if it's say four devices in the raid1, as is
> certainly possible with btrfs raid1, then if it's not mirrored 4-way
> across all devices, it's not true raid1, but rather some sort of
> hybrid raid: raid10 (or raid01) if the devices are so arranged,
> raid1+linear if arranged that way, or some form that doesn't nicely
> fall into a well-defined raid level categorization.

Well, md raid1 is always n-way. So if you use -n 3 and specify three
devices, you'll get 3-way mirroring (3 mirrors). But I don't know any
hardware raid that works this way; they all seem to treat raid1 as
strictly two devices. At 4 devices it's raid10, and only in pairs.

Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
something like raid1 (2 copies) + linear/concat, but the allocation is
round-robin. I don't read code, but based on how a 3-disk raid1 volume
grows VDI files as it's filled, it looks like 1GB chunks are copied
like this:

Disk1  Disk2  Disk3
134    124    235
679    578    689

So 1 through 9 each represent a 1GB chunk. Disks 1 and 2 each have a
chunk 1; disks 2 and 3 each have a chunk 2, and so on. Total of 9GB of
data taking up 18GB of space, 6GB on each drive. You can't do this with
any other raid1 as far as I know. You do definitely run out of space on
one disk first, though, because of uneven metadata to data chunk
allocation.

Anyway, I think we're off the rails with raid1 nomenclature as soon as
we have 3 devices. It's probably better to call it replication, with an
assumed default of 2 replicas unless otherwise specified.

There's definitely a benefit to a 3-device volume with 2 replicas,
efficiency-wise. As soon as we go to four disks and 2 replicas, it
makes more sense to do raid10, although I haven't tested odd-device
raid10 setups so I'm not sure what happens.

Chris Murphy
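The command shapes being compared above might look like this (device
names and mountpoint are placeholders):

```shell
# balance: start plus a status query on the running operation
btrfs balance start /mnt
btrfs balance status /mnt

# replace: the same start/status shape, swapping /dev/sdb for /dev/sdc
btrfs replace start /dev/sdb /dev/sdc /mnt
btrfs replace status /mnt
```

This is the parallelism the message argues for: both are long-running
operations driven through start/status subcommands, unlike the plain
one-shot commands under "btrfs device".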
Re: "csum failed" that was not detected by scrub
Shilong Wang gmail.com> writes:
>
> Hello,
>
> There is a known RAID5/6 bug, i sent a patch to address this problem.
> Could you please double check if your kernel source includes the
> following commit:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3b080b2564287be91605bfd1d5ee985696e61d3c
>
> RAID5/6 should detect checksum mismatch, it can not fix errors now.
>
> Thanks,
> Wang

Your patch seems to be in 3.15-rc1:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc1-trusty/CHANGES

I tried rc3, but that made my system crash on boot.. I'm having bad
luck.
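Checking whether a kernel tree contains Wang's fix could be done like
this (a sketch; it assumes a local clone of Linus's tree in ./linux):

```shell
cd linux
# List every tag whose history includes the commit from the thread;
# v3.15-rc1 should appear in the output if the fix is present.
git tag --contains 3b080b2564287be91605bfd1d5ee985696e61d3c
```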
Re:
Duncan <1i5t5.duncan cox.net> writes:
>
> To those that know the details, this tells the story.
>
> Btrfs raid5/6 modes are not yet code-complete, and scrub is one of the
> incomplete bits. btrfs scrub doesn't know how to deal with raid5/6
> properly just yet.
>
> While the operational bits of raid5/6 support are there, parity is
> calculated and written, scrub, and recovery from a lost device, are
> not yet code complete. Thus, it's effectively a slower, lower capacity
> raid0 without scrub support at this point, except that when the code
> is complete, you'll get an automatic "free" upgrade to full raid5 or
> raid6, because the operational bits have been working since they were
> introduced, just the recovery and scrub bits were bad, making it
> effectively a raid0 in reliability terms, lose one and you've lost
> them all.
>
> That's the big picture anyway. Marc Merlin recently did quite a bit of
> raid5/6 testing and there's a page on the wiki now with what he found.
> Additionally, I saw a scrub support for raid5/6 modes patch on the
> list recently, but while it may be in integration, I believe it's too
> new to have reached release yet.
>
> Wiki, for memory or bookmark: https://btrfs.wiki.kernel.org
>
> Direct user documentation link for bookmark (unwrap as necessary):
>
> https://btrfs.wiki.kernel.org/index.php/
> Main_Page#Guides_and_usage_information
>
> The raid5/6 page (which I didn't otherwise see conveniently linked, I
> dug it out of the recent changes list since I knew it was there from
> on-list discussion):
>
> https://btrfs.wiki.kernel.org/index.php/RAID56
>
> Marc or Hugo or someone with a wiki account: Can this be more visibly
> linked from the user-docs contents, added to the user docs category
> list, and probably linked from at least the multiple devices and (for
> now) the gotchas pages?

So raid5 is much more useless than I assumed. I read Marc's blog and
figured that btrfs was ready enough.

I'm really in trouble now. I tried to get rid of raid5 by doing a
convert balance to raid1. But of course this triggered the same issue.
And now I have a dead system, because the first thing btrfs does after
mounting is continue the balance, which will crash the system and send
me into a vicious loop.

- How can I stop btrfs from continuing balancing?
- How can I salvage this situation and convert to raid1?

Unfortunately I have few spare drives left. Not enough to contain
4.7TiB of data.. :(
Re: [PATCH] Btrfs: do not increment on bio_index one by one
On Tue, Apr 29, 2014 at 01:07:58PM +0800, Liu Bo wrote:
> 'bio_index' is just an index; it's really not necessary to do the
> increment one by one.
>
> Signed-off-by: Liu Bo

Reviewed-by: David Sterba
Re: [PATCH 6/6 v2] Btrfs: add send_stream_version attribute to sysfs
On Fri, May 2, 2014 at 4:46 PM, David Sterba wrote:
> On Sun, Apr 20, 2014 at 10:40:03PM +0100, Filipe David Borba Manana wrote:
>> So that applications can find out what's the highest send stream
>> version supported/implemented by the running kernel:
>>
>> $ cat /sys/fs/btrfs/send/stream_version
>> 2
>>
>> Signed-off-by: Filipe David Borba Manana
>> ---
>>
>> V2: Renamed /sys/fs/btrfs/send_stream_version to
>>     /sys/fs/btrfs/send/stream_version, as in the future it might be
>>     useful to add other sysfs attributes related to send (other ro
>>     information or tunables like internal buffer sizes, etc).
>
> Sounds good, I don't see any issue with the separate directory. Mixing
> it with /sys/fs/btrfs/features does not seem suitable for that if you
> intend adding more entries.

Yeah, I only didn't mix it with the features subdir because that
relates to features that are settable, plus there are 2 versions of it,
one global and one per-fs (uuid) subdirectory (and it felt odd to me to
add it to one of those subdirs and not the other).

Thanks David

> Reviewed-by: David Sterba

-- 
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
Re: [PATCH 6/6 v2] Btrfs: add send_stream_version attribute to sysfs
On Sun, Apr 20, 2014 at 10:40:03PM +0100, Filipe David Borba Manana wrote:
> So that applications can find out what's the highest send stream
> version supported/implemented by the running kernel:
>
> $ cat /sys/fs/btrfs/send/stream_version
> 2
>
> Signed-off-by: Filipe David Borba Manana
> ---
>
> V2: Renamed /sys/fs/btrfs/send_stream_version to
>     /sys/fs/btrfs/send/stream_version, as in the future it might be
>     useful to add other sysfs attributes related to send (other ro
>     information or tunables like internal buffer sizes, etc).

Sounds good, I don't see any issue with the separate directory. Mixing
it with /sys/fs/btrfs/features does not seem suitable for that if you
intend adding more entries.

Reviewed-by: David Sterba
Re: [PATCH 1/4] Btrfs-progs: send, bump stream version
On Tue, Apr 15, 2014 at 05:40:48PM +0100, Filipe David Borba Manana wrote:
> This increases the send stream version from version 1 to version 2,
> adding 2 new commands:
>
> 1) total data size - used to tell the receiver how much file data the
>    stream will add or update;
>
> 2) fallocate - used to pre-allocate space for files and to punch holes
>    in files.
>
> This is preparation work for subsequent changes that implement the new
> features (computing total data size and using fallocate for better
> performance).
>
> Signed-off-by: Filipe David Borba Manana

The changes in v2/3/4 look good, thanks. Patches added to the next
integration.
Re: [PATCH 07/14] btrfs-progs: Print more info about device sizes
On Wed, Apr 30, 2014 at 02:37:16PM +0100, David Taylor wrote:
> It makes more sense to me than 'Occupied' and seems cleaner than
> 'Resized To'. It sort of mirrors how LVM describes PV / VG / LV
> sizes, too.

Do you have a concrete example of how we could map the btrfs sizes onto
the LVM ones?
Re: [PATCH 07/14] btrfs-progs: Print more info about device sizes
On Wed, Apr 30, 2014 at 07:38:00PM +0200, Goffredo Baroncelli wrote:
> On 04/30/2014 03:37 PM, David Taylor wrote:
>> On Wed, 30 Apr 2014, Frank Kingswood wrote:
>>> On 30/04/14 13:11, David Sterba wrote:
>>>> On Wed, Apr 30, 2014 at 01:39:27PM +0200, Goffredo Baroncelli wrote:
>>>>> I found the "FS occupied" term a bit unclear.
>>>>
>>>> We're running out of terms to describe and distinguish the space
>>>> that the filesystem uses.
>>>>
>>>> 'occupied' seemed like a good choice to me, though it may be not
>>>> obvious
>>>
>>> The space that the filesystem uses in total seems to me is called
>>> the "size". It has nothing to do with utilization.
>>>
>>> /dev/sda6, ID: 2
>>>    Device size:     10.00GiB
>>>    Filesystem size:  5.00GiB
>>
>> FS size was what I was about to suggest, before I saw your reply.
>
> Pay attention that this value is not the filesystem size, but the
> maximum space of THE DEVICE that the filesystem is allowed to use.

I agree that plain 'Filesystem size' could be misleading; using a term
that has an established meaning can cause misunderstandings in bug
reports.
Re: "csum failed" that was not detected by scrub
Hello,

2014-05-02 17:42 GMT+08:00 Jaap Pieroen:
> Hi all,
>
> I completed a full scrub:
> root@nasbak:/home/jpieroen# btrfs scrub status /home/
> scrub status for 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
>   scrub started at Wed Apr 30 08:30:19 2014 and finished after 144131 seconds
>   total bytes scrubbed: 4.76TiB with 0 errors
>
> Then tried to remove a device:
> root@nasbak:/home/jpieroen# btrfs device delete /dev/sdb /home
>
> This triggered bug_on, with the following error in dmesg: csum failed
> ino 258 off 1395560448 csum 2284440321 expected csum 319628859
>
> How can there still be csum failures directly after a scrub?
> If I rerun the scrub it still won't find any errors. I know this,
> because I've had the same issue 3 times in a row. Each time running a
> scrub and still being unable to remove the device.

There is a known RAID5/6 bug, i sent a patch to address this problem.
Could you please double check if your kernel source includes the
following commit:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3b080b2564287be91605bfd1d5ee985696e61d3c

RAID5/6 should detect checksum mismatch, it can not fix errors now.

Thanks,
Wang

> Kind Regards,
> Jaap
>
> --
> Details:
>
> root@nasbak:/home/jpieroen# uname -a
> Linux nasbak 3.14.1-031401-generic #201404141220 SMP Mon Apr 14
> 16:21:48 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> root@nasbak:/home/jpieroen# btrfs --version
> Btrfs v3.14.1
>
> root@nasbak:/home/jpieroen# btrfs fi df /home
> Data, RAID5: total=4.57TiB, used=4.55TiB
> System, RAID1: total=32.00MiB, used=352.00KiB
> Metadata, RAID1: total=7.00GiB, used=5.59GiB
>
> root@nasbak:/home/jpieroen# btrfs fi show
> Label: 'btrfs_storage' uuid: 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
>   Total devices 6 FS bytes used 4.56TiB
>   devid 1 size 1.82TiB used 1.31TiB path /dev/sde
>   devid 2 size 1.82TiB used 1.31TiB path /dev/sdf
>   devid 3 size 1.82TiB used 1.31TiB path /dev/sdg
>   devid 4 size 931.51GiB used 25.00GiB path /dev/sdb
>   devid 6 size 2.73TiB used 994.03GiB path /dev/sdh
>   devid 7 size 2.73TiB used 994.03GiB path /dev/sdi
>
> Btrfs v3.14.1
>
> jpieroen@nasbak:~$ dmesg
> [227248.656438] BTRFS info (device sdi): relocating block group
> 9735225016320 flags 129
> [227261.713860] BTRFS info (device sdi): found 9 extents
> [227264.531019] BTRFS info (device sdi): found 9 extents
> [227265.011826] BTRFS info (device sdi): relocating block group
> 76265029632 flags 129
> [227274.052249] BTRFS info (device sdi): csum failed ino 258 off
> 1395560448 csum 2284440321 expected csum 319628859
> [227274.052354] BTRFS info (device sdi): csum failed ino 258 off
> 1395564544 csum 3646299263 expected csum 319628859
> [227274.052402] BTRFS info (device sdi): csum failed ino 258 off
> 1395568640 csum 281259278 expected csum 319628859
> [227274.052449] BTRFS info (device sdi): csum failed ino 258 off
> 1395572736 csum 2594807184 expected csum 319628859
> [227274.052492] BTRFS info (device sdi): csum failed ino 258 off
> 1395576832 csum 4288971971 expected csum 319628859
> [227274.052537] BTRFS info (device sdi): csum failed ino 258 off
> 1395580928 csum 752615894 expected csum 319628859
> [227274.052581] BTRFS info (device sdi): csum failed ino 258 off
> 1395585024 csum 3828951500 expected csum 319628859
> [227274.061279] [ cut here ]
> [227274.061354] kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116!
> [227274.061445] invalid opcode: [#1] SMP
> [227274.061509] Modules linked in: cuse deflate
> [227274.061573] BTRFS info (device sdi): csum failed ino 258 off
> 1395560448 csum 2284440321 expected csum 319628859
> [227274.061707] ctr twofish_generic twofish_x86_64_3way
> twofish_x86_64 twofish_common camellia_generic camellia_x86_64
> serpent_sse2_x86_64 xts serpent_generic lrw gf128mul glue_helper
> blowfish_generic blowfish_x86_64 blowfish_common cast5_generic
> cast_common ablk_helper cryptd des_generic cmac xcbc rmd160
> crypto_null af_key xfrm_algo nfsd auth_rpcgss nfs_acl nfs lockd sunrpc
> fscache dm_crypt ip6t_REJECT ppdev xt_hl ip6t_rt nf_conntrack_ipv6
> nf_defrag_ipv6 ipt_REJECT xt_comment xt_LOG kvm xt_recent microcode
> xt_multiport xt_limit xt_tcpudp psmouse serio_raw xt_addrtype k10temp
> edac_core ipt_MASQUERADE edac_mce_amd iptable_nat nf_nat_ipv4
> sp5100_tco nf_conntrack_ipv4 nf_defrag_ipv4 ftdi_sio i2c_piix4
> usbserial xt_conntrack ip6table_filter ip6_tables joydev
> nf_conntrack_netbios_ns nf_conntrack_broadcast snd_hda_codec_via
> nf_nat_ftp snd_hda_codec_hdmi nf_nat snd_hda_codec_generic
> nf_conntrack_ftp nf_conntrack snd_hda_intel iptable_filter
> ir_lirc_codec(OF) lirc_dev(OF) ip_tables snd_hda_codec
> ir_mce_kbd_decoder(OF) x_tables snd_hwdep ir_sony_decoder(OF)
> rc_tbs_nec(OF) ir_jvc_decoder(OF) snd_pcm ir_rc6_decoder(OF)
> ir_rc5_decoder(OF) saa716x_tbs_dvb(OF) tbs6982fe(POF) tbs6680fe(POF)
> ir_nec_decoder(OF) tbs6923fe(POF) tbs6985se(POF) t
Re: "csum failed" that was not detected by scrub
Jaap Pieroen posted on Fri, 02 May 2014 11:42:35 +0200 as excerpted: > I completed a full scrub: > root@nasbak:/home/jpieroen# btrfs scrub status /home/ > scrub status for 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d > scrub started at Wed Apr 30 08:30:19 2014 > and finished after 144131 seconds > total bytes scrubbed: 4.76TiB with 0 errors > > Then tried to remove a device: > root@nasbak:/home/jpieroen# btrfs device delete /dev/sdb /home > > This triggered bug_on, with the following error in dmesg: csum failed > ino 258 off 1395560448 csum 2284440321 expected csum 319628859 > > How can there still be csum failures directly after a scrub? Simple enough, really... > root@nasbak:/home/jpieroen# btrfs fi df /home > Data, RAID5: total=4.57TiB, used=4.55TiB > System, RAID1: total=32.00MiB, used=352.00KiB > Metadata, RAID1: total=7.00GiB, used=5.59GiB To those that know the details, this tells the story. Btrfs raid5/6 modes are not yet code-complete, and scrub is one of the incomplete bits. btrfs scrub doesn't know how to deal with raid5/6 properly just yet. While the operational bits of raid5/6 support are there, parity is calculated and written, scrub, and recovery from a lost device, are not yet code complete. Thus, it's effectively a slower, lower capacity raid0 without scrub support at this point, except that when the code is complete, you'll get an automatic "free" upgrade to full raid5 or raid6, because the operational bits have been working since they were introduced, just the recovery and scrub bits were bad, making it effectively a raid0 in reliability terms, lose one and you've lost them all. That's the big picture anyway. Marc Merlin recently did quite a bit of raid5/6 testing and there's a page on the wiki now with what he found. Additionally, I saw a scrub support for raid5/6 modes patch on the list recently, but while it may be in integration, I believe it's too new to have reached release yet. 
Wiki, for memory or bookmark: https://btrfs.wiki.kernel.org

Direct user documentation link for bookmark (unwrap as necessary):
https://btrfs.wiki.kernel.org/index.php/Main_Page#Guides_and_usage_information

The raid5/6 page (which I didn't otherwise see conveniently linked; I dug it out of the recent-changes list since I knew it was there from on-list discussion):
https://btrfs.wiki.kernel.org/index.php/RAID56

@ Marc or Hugo or someone with a wiki account: Can this be more visibly linked from the user-docs contents, added to the user-docs category list, and probably linked from at least the multiple-devices and (for now) the gotchas pages?

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
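Given the state of raid5/6 described above, a scrub that reports 0 errors says little if any chunks use those profiles. The sketch below checks `btrfs filesystem df` output for them; the helper name is my own, not part of btrfs-progs, and only the `btrfs filesystem df` command itself is real:

```shell
# Sketch: detect raid5/6 chunks from `btrfs fi df` output, since scrub
# results can't yet be trusted for those profiles. has_raid56 is a
# hypothetical helper, not a btrfs-progs command.
has_raid56() {
    # $1 = output of `btrfs filesystem df <mountpoint>`
    printf '%s\n' "$1" | grep -qiE 'raid5|raid6'
}

# Intended use (requires a mounted btrfs, e.g. at /home):
# if has_raid56 "$(btrfs filesystem df /home)"; then
#     echo "raid5/6 chunks present: scrub coverage is incomplete"
# fi
```

The grep is case-insensitive because the tool prints "RAID5" while the profile is usually discussed as "raid5".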
Unable to boot
Hello,

I have a problem with a server running Fedora 20 and btrfs. This server had frequent hard restarts before the filesystem got corrupted, and now we are unable to boot it.

We have an HP ProLiant server with 4 disks @ 1TB each and software RAID 5. It had Debian installed (I don't know the version) and right now I'm using a Fedora 20 live image to try to rescue the system.

When we run btrfsck /dev/md127 I get a lot of checksum errors. The output is:

Checking filesystem on /dev/md127
UUID: e068faf0-2c16-4566-9093-e6d1e21a5e3c
checking extents
checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
checksum verify failed on 1006686208 found 457560AC wanted 6B3ECE11
Csum didn't match
checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
checksum verify failed on 1001492480 found 74CC3F5D wanted C222A2C9
Csum didn't match
extent buffer leak: start 1006686208 len 4096
found 32039247396 bytes used err is -22
total csum bytes: 41608612
total tree bytes: 388857856
total fs tree bytes: 310124544
total extent tree bytes: 22016000
btree space waste bytes: 126431234
file data blocks allocated: 47227326464
 referenced 42595635200
Btrfs v3.12

When I attempt a repair, I get the following error:

Backref 1005817856 parent 5 root 5 not found in extent tree
backpointer mismatch on [1005817856 4096]
owner ref check failed [1006686208 4096]
repaired damaged extent references
Failed to find [1000525824, 168, 4096]
btrfs unable to find ref byte nr 1000525824 parent 0 root 1 owner 1 offset 0
btrfsck: extent-tree.c:1752: write_one_cache_group: Assertion `!(ret)' failed.
Aborted

I have installed btrfs version 3.12.

Linux localhost 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost liveuser]# btrfs fi show
Label: none  uuid: e068faf0-2c16-4566-9093-e6d1e21a5e3c
        Total devices 1 FS bytes used 40.04GiB
        devid    1 size 1.82TiB used 43.04GiB path /dev/md127

Btrfs v3.12

Please advise.

Thank you,
George Pochiscan
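Before any further --repair attempts on a filesystem in this state, the usual first resort is strictly read-only access. The sketch below is not from this thread; `rescue` is a hypothetical wrapper, while `mount -o ro`, `mount -o ro,recovery` (available in kernels of this era) and `btrfs restore` with `-D` (dry run) are real interfaces:

```shell
# Hypothetical helper (not from the thread): try progressively more
# aggressive *read-only* access to a damaged btrfs before repairing.
rescue() {
    dev="$1"; mnt="$2"
    mkdir -p "$mnt"
    # Plain read-only mount first; then ro,recovery, which asks the
    # kernel to fall back to an older tree root:
    mount -o ro "$dev" "$mnt" ||
    mount -o ro,recovery "$dev" "$mnt" || {
        # If mounting fails entirely, btrfs restore pulls files out of
        # an unmountable filesystem; -D is a dry run that only lists
        # what would be restored:
        btrfs restore -D "$dev" /tmp/restore-preview
    }
}

# Intended use on the array from this report:
# rescue /dev/md127 /mnt/rescue
```

Only once the data is copied off does a `btrfsck --repair` (which aborted above) become a reasonable gamble.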
"csum failed" that was not detected by scrub
Hi all,

I completed a full scrub:
root@nasbak:/home/jpieroen# btrfs scrub status /home/
scrub status for 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
        scrub started at Wed Apr 30 08:30:19 2014
        and finished after 144131 seconds
        total bytes scrubbed: 4.76TiB with 0 errors

Then tried to remove a device:
root@nasbak:/home/jpieroen# btrfs device delete /dev/sdb /home

This triggered bug_on, with the following error in dmesg:
csum failed ino 258 off 1395560448 csum 2284440321 expected csum 319628859

How can there still be csum failures directly after a scrub? If I rerun the scrub it still won't find any errors. I know this, because I've had the same issue 3 times in a row: each time running a scrub and still being unable to remove the device.

Kind Regards,
Jaap

--
Details:

root@nasbak:/home/jpieroen# uname -a
Linux nasbak 3.14.1-031401-generic #201404141220 SMP Mon Apr 14 16:21:48 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

root@nasbak:/home/jpieroen# btrfs --version
Btrfs v3.14.1

root@nasbak:/home/jpieroen# btrfs fi df /home
Data, RAID5: total=4.57TiB, used=4.55TiB
System, RAID1: total=32.00MiB, used=352.00KiB
Metadata, RAID1: total=7.00GiB, used=5.59GiB

root@nasbak:/home/jpieroen# btrfs fi show
Label: 'btrfs_storage'  uuid: 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
        Total devices 6 FS bytes used 4.56TiB
        devid    1 size 1.82TiB used 1.31TiB path /dev/sde
        devid    2 size 1.82TiB used 1.31TiB path /dev/sdf
        devid    3 size 1.82TiB used 1.31TiB path /dev/sdg
        devid    4 size 931.51GiB used 25.00GiB path /dev/sdb
        devid    6 size 2.73TiB used 994.03GiB path /dev/sdh
        devid    7 size 2.73TiB used 994.03GiB path /dev/sdi

Btrfs v3.14.1

jpieroen@nasbak:~$ dmesg
[227248.656438] BTRFS info (device sdi): relocating block group 9735225016320 flags 129
[227261.713860] BTRFS info (device sdi): found 9 extents
[227264.531019] BTRFS info (device sdi): found 9 extents
[227265.011826] BTRFS info (device sdi): relocating block group 76265029632 flags 129
[227274.052249] BTRFS info (device sdi): csum failed ino 258 off 1395560448 csum 2284440321 expected csum 319628859
[227274.052354] BTRFS info (device sdi): csum failed ino 258 off 1395564544 csum 3646299263 expected csum 319628859
[227274.052402] BTRFS info (device sdi): csum failed ino 258 off 1395568640 csum 281259278 expected csum 319628859
[227274.052449] BTRFS info (device sdi): csum failed ino 258 off 1395572736 csum 2594807184 expected csum 319628859
[227274.052492] BTRFS info (device sdi): csum failed ino 258 off 1395576832 csum 4288971971 expected csum 319628859
[227274.052537] BTRFS info (device sdi): csum failed ino 258 off 1395580928 csum 752615894 expected csum 319628859
[227274.052581] BTRFS info (device sdi): csum failed ino 258 off 1395585024 csum 3828951500 expected csum 319628859
[227274.061279] [ cut here ]
[227274.061354] kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116!
[227274.061445] invalid opcode: [#1] SMP
[227274.061509] Modules linked in: cuse deflate
[227274.061573] BTRFS info (device sdi): csum failed ino 258 off 1395560448 csum 2284440321 expected csum 319628859
[227274.061707] ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_x86_64 serpent_sse2_x86_64 xts serpent_generic lrw gf128mul glue_helper blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common ablk_helper cryptd des_generic cmac xcbc rmd160 crypto_null af_key xfrm_algo nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache dm_crypt ip6t_REJECT ppdev xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_comment xt_LOG kvm xt_recent microcode xt_multiport xt_limit xt_tcpudp psmouse serio_raw xt_addrtype k10temp edac_core ipt_MASQUERADE edac_mce_amd iptable_nat nf_nat_ipv4 sp5100_tco nf_conntrack_ipv4 nf_defrag_ipv4 ftdi_sio i2c_piix4 usbserial xt_conntrack ip6table_filter ip6_tables joydev nf_conntrack_netbios_ns nf_conntrack_broadcast snd_hda_codec_via nf_nat_ftp snd_hda_codec_hdmi nf_nat snd_hda_codec_generic nf_conntrack_ftp nf_conntrack snd_hda_intel iptable_filter ir_lirc_codec(OF) lirc_dev(OF) ip_tables snd_hda_codec ir_mce_kbd_decoder(OF) x_tables snd_hwdep ir_sony_decoder(OF) rc_tbs_nec(OF) ir_jvc_decoder(OF) snd_pcm ir_rc6_decoder(OF) ir_rc5_decoder(OF) saa716x_tbs_dvb(OF) tbs6982fe(POF) tbs6680fe(POF) ir_nec_decoder(OF) tbs6923fe(POF) tbs6985se(POF) tbs6928se(POF) tbs6982se(POF) tbs6991fe(POF) tbs6618fe(POF) saa716x_core(OF) tbs6922fe(POF) tbs6928fe(POF) tbs6991se(POF) stv090x(OF) dvb_core(OF) rc_core(OF) snd_timer snd soundcore asus_atk0110 parport_pc shpchp mac_hid lp parport btrfs xor raid6_pq pata_acpi hid_generic usbhid hid usb_storage radeon pata_atiixp r8169 mii i2c_algo_bit sata_sil24 ttm drm_kms_helper drm ahci libahci wmi
[227274.064118] CPU: 1 PID: 15543 Comm: btrfs-endio-4 Tainted: PF O 3.14.1-031401-generic #201404141220
[227274.064246] Hardware name: System manufacturer System Product Name/M4A78LT-M, BIOS
Re: Help with space
On 02/05/14 10:23, Duncan wrote:
> Russell Coker posted on Fri, 02 May 2014 11:48:07 +1000 as excerpted:
>> On Thu, 1 May 2014, Duncan <1i5t5.dun...@cox.net> wrote:
[snip]
>> http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
>>
>> Whether a true RAID-1 means just 2 copies or N copies is a matter of
>> opinion. Papers such as the above seem to clearly imply that RAID-1 is
>> strictly 2 copies of data.
>
> Thanks for that link. =:^)
>
> My position would be that reflects the original, but not the modern,
> definition. The paper seems to describe as raid1 what would later come
> to be called raid1+0, which quickly morphed into raid10, leaving the
> raid1 description only covering pure mirror-raid.

Personally I'm flexible on using the terminology in day-to-day operations and discussion, because the end result is "close enough". But ... the definition of "RAID 1" is still only a mirror of two devices. As far as I'm aware, Linux's mdraid is the only raid system in the world that allows N-way mirroring while still referring to it as "RAID1".

Due to the way it handles data in chunks, and also due to its "rampant layering violations", *technically* btrfs's "RAID-like" features are not "RAID". To differentiate from "RAID", we're already using lowercase "raid" and, in the long term, some of us are also looking to do away with "raid{x}" terms altogether, with what Hugo and I last termed "csp notation".

Changing the terminology is important, but it is particularly non-urgent.

-- 
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
Re: Negative qgroup sizes
Thanks for the response, Duncan.

On 01/05/14 17:58, Duncan wrote:
>
> Tho you are slightly outdated on your btrfs-progs version, 3.14.1 being
> current. But I think the code in question is kernel code and the progs
> simply report it, so I don't think that can be the problem in this case.

Yes, I'm aware that the 3.14 version of btrfs-progs was already out, but only for a couple of weeks, and I'm pretty sure that the kernel code (which does the real-time accounting) is broken.

> So if you are doing snapshots, you can try not doing them (switching to
> conventional backup if necessary) and see if that stabilizes your
> numbers. If so, you know there's still more problems in that area.
>
> Of course if the subvolumes involved aren't snapshotted, then the
> problem must be elsewhere, but I do know the snapshotting case /is/
> reasonably difficult to get right... while staying within a reasonable
> performance envelope at least.

I have already searched and found some patches around this issue, but I thought I'd also mention it on this mailing list in case I had somehow missed something. The subvolumes are very likely snapshotted, so this might indeed be the case.

Cheers,
Alin.
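Broken accounting of the kind discussed here shows up directly in `btrfs qgroup show` output: a counter that has gone negative in the kernel is printed either with a leading minus or as an absurdly large wrapped unsigned value. The helper below is a sketch of a crude detector; the function name and the size threshold are my own, and only `btrfs qgroup show <mount>` is a real command:

```shell
# Sketch: flag implausible qgroup counters, as produced when the
# kernel's accounting goes negative. suspicious_qgroup_size is a
# hypothetical helper, not part of btrfs-progs.
suspicious_qgroup_size() {
    # $1 = a referenced/exclusive byte count as printed by qgroup show
    case "$1" in
        -*) return 0 ;;   # printed as an outright negative number
    esac
    # A negative s64 wrapped to u64 is ~1.8e19; anything with 19 or
    # more digits (>= 10^18 bytes, about 1 EB) is taken as wrapped here.
    [ "${#1}" -ge 19 ]
}

# Intended use (needs a mounted btrfs with quota enabled):
# btrfs qgroup show /mnt
```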
Re: Help with space
Russell Coker posted on Fri, 02 May 2014 11:48:07 +1000 as excerpted:

> On Thu, 1 May 2014, Duncan <1i5t5.dun...@cox.net> wrote:
>
> Am I missing something or is it impossible to do a disk replace on
> BTRFS right now?
>
> I can delete a device, I can add a device, but I'd like to replace a
> device.

You're missing something... but it's easy to do, as I almost missed it too even tho I was sure it was there. Something tells me btrfs replace (not device replace, simply replace) should be moved to btrfs device replace...

> http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
>
> Whether a true RAID-1 means just 2 copies or N copies is a matter of
> opinion. Papers such as the above seem to clearly imply that RAID-1 is
> strictly 2 copies of data.

Thanks for that link. =:^)

My position would be that reflects the original, but not the modern, definition. The paper seems to describe as raid1 what would later come to be called raid1+0, which quickly morphed into raid10, leaving the raid1 description only covering pure mirror-raid.

And even then, the paper says mirrors in spots without specifically defining it as (only) two mirrors, while in others it seems to /assume/, without further explanation, just two mirrors. So I'd argue that even then the definition of raid1 allowed more than two mirrors, but that it just so happened that the examples and formulae given dealt with only two mirrors. Tho certainly I can see the room for differing opinions on the matter as well.

> I don't have a strong opinion on how many copies of data can be
> involved in a RAID-1, but I think that there's no good case to claim
> that only 2 copies means that something isn't "true RAID-1".

Well, I'd say two copies, if it's only two devices in the raid1, would be true raid1. But if it's say four devices in the raid1, as is certainly possible with btrfs raid1, then if it's not mirrored 4-way across all devices, it's not true raid1, but rather some sort of hybrid raid: raid10 (or raid01) if the devices are so arranged, raid1+linear if arranged that way, or some form that doesn't nicely fall into a well-defined raid level categorization.

But still, opinions can differ. Point well made... and taken. =:^)

>> Surprisingly, after shutting everything down, getting a new AC, and
>> letting the system cool for a few hours, it pretty much all came back
>> to life, including the CPU(s) (that was pre-multi-core, but I don't
>> remember whether it was my dual socket original Opteron, or
>> pre-dual-socket for me as well) which I had feared would be dead.
>
> CPUs have had thermal shutdown for a long time. When a CPU lacks such
> controls (as some buggy Opteron chips did a few years ago) it makes the
> IT news.

That was certainly some years ago, and I remember that for a while AMD Athlons didn't have thermal shutdown yet, while Intel CPUs of the time did. And that was an AMD CPU, as I've run mostly AMD (with only specific exceptions) for literally decades now. But what I don't recall for sure is whether it was my original AMD Athlon (500 MHz), or the Athlon C @ 1.2 GHz, or the dual Opteron 242s I ran for several years.

If it was the original Athlon, it wouldn't have had thermal shutdown. If it was the Opterons, I think they did, but I think the Athlon Cs were in the period when Intel had introduced thermal shutdown but AMD hadn't, and Tom's Hardware among others had dramatic videos of just exactly what happened if one actually tried to run the things without cooling, compared to running an Intel of the period.

But I remember being rather surprised that the CPU(s) was/were unharmed, which means it very well could have been the Athlon C era, and I had seen the dramatic videos and knew my CPU wasn't protected.
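For reference, the replace command mentioned at the top of this reply takes the source device (or its devid), the target device, and the mount point, and runs while the filesystem stays online. The wrapper below is only a sketch with hypothetical device names; the two `btrfs replace` subcommands themselves are real:

```shell
# Sketch of the online-replace flow; replace_disk is a hypothetical
# wrapper, while `btrfs replace start` and `btrfs replace status`
# are real btrfs-progs subcommands.
replace_disk() {
    src="$1"; target="$2"; mnt="$3"
    # Copies data from src (or rebuilds it from the other devices if
    # src is missing) onto target, without unmounting:
    btrfs replace start "$src" "$target" "$mnt" &&
    btrfs replace status "$mnt"
}

# e.g. replace_disk /dev/sdb /dev/sdj /home
```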
> I'd like to be able to run a combination of "dup" and RAID-1 for
> metadata. ZFS has a "copies" option, it would be good if we could do
> that.

Well, if N-way mirroring were possible, one could do more or less just that easily enough, with suitable partitioning and setting the data vs metadata number of mirrors as appropriate... but of course with only two-way mirroring and dup as the choices, the only way to do it would be layering btrfs atop something else, say md/raid. And without real-time checksumming verification at the md/raid level...

> I use BTRFS for all my backups too. I think that the chance of data
> patterns triggering filesystem bugs that break backups as well as
> primary storage is vanishingly small. The chance of such bugs being
> latent for long enough that I can't easily recreate the data isn't
> worth worrying about.

The fact that my primary filesystems and their first backups are btrfs raid1 on dual SSDs, while secondary backups are on spinning rust, does factor into my calculations here.

I ran reiserfs for many years, since I first switched to Linux full time in the early kernel 2.4 era in fact, and while it had its problems early on, since the introduction of orde