Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Fri, May 13, 2016 at 6:10 AM, Niccolò Belli wrote:

> On Friday 13 May 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
>> The fact that you're getting an OOPS involving core kernel threads (kswapd) is a pretty good indication that either there's a bug elsewhere in the kernel, or that something is wrong with your hardware. It's really difficult to be certain if you don't have a reliable test case though.
>
> Talking about reliable test cases, I forgot to say that I definitely found an interesting one. It doesn't lead to an OOPS but perhaps to something even more interesting. While running countless stress tests I tried running some games to stress the system in different ways. I chose openmw (an open source engine for Morrowind) and played it for a while on my second external monitor (while I watched some monitoring tools on my first monitor). I noticed that after playing for a while I *always* lose the internet connection (I use a USB3 Gigabit Ethernet adapter). This isn't the only thing that happens: even though the game keeps running flawlessly and the system *seems* to work fine (I can drag windows, open the terminal...), lots of commands simply stall (for example mounting a partition, unmounting it, rebooting...). I can reliably reproduce it; it ALWAYS happens.

Well, there are a bunch of kernel debug options. If your kernel was built with CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y, you can boot with the parameter slub_debug=1 to enable it, and maybe there'll be something more revealing about the problems you're having. More aggressive is CONFIG_DEBUG_PAGEALLOC=y, but it'll slow things down quite noticeably. And then there are some Btrfs debug options which are set at compile time and enabled with mount options. But I think the problem you're having isn't specific to Btrfs, or someone else would have run into it.
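A sketch of where those settings live (the /proc path assumes CONFIG_IKCONFIG_PROC is enabled; the flag letters are from the kernel's SLUB documentation):

```
# Build-time options to confirm in the kernel config,
# e.g.: zcat /proc/config.gz | grep -E 'SLUB|DEBUG_PAGEALLOC'
CONFIG_SLUB=y
CONFIG_SLUB_DEBUG=y
CONFIG_DEBUG_PAGEALLOC=y    # heavier option; noticeable slowdown

# Boot-time: append to the kernel command line in the bootloader entry
slub_debug=1
# or select individual checks: sanity checks (F), red zoning (Z),
# poisoning (P), user tracking (U)
slub_debug=FZPU
```

With SLUB debugging active, allocator poisoning and red-zone checks tend to surface stray writes in dmesg much closer to the offending code than a later crash would.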
-- Chris Murphy
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Friday 13 May 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:

> The fact that you're getting an OOPS involving core kernel threads (kswapd) is a pretty good indication that either there's a bug elsewhere in the kernel, or that something is wrong with your hardware. It's really difficult to be certain if you don't have a reliable test case though.

Talking about reliable test cases, I forgot to say that I definitely found an interesting one. It doesn't lead to an OOPS but perhaps to something even more interesting. While running countless stress tests I tried running some games to stress the system in different ways. I chose openmw (an open source engine for Morrowind) and played it for a while on my second external monitor (while I watched some monitoring tools on my first monitor). I noticed that after playing for a while I *always* lose the internet connection (I use a USB3 Gigabit Ethernet adapter). This isn't the only thing that happens: even though the game keeps running flawlessly and the system *seems* to work fine (I can drag windows, open the terminal...), lots of commands simply stall (for example mounting a partition, unmounting it, rebooting...). I can reliably reproduce it; it ALWAYS happens.

Niccolò
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 2016-05-13 07:07, Niccolò Belli wrote:

> On Thursday 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
>> That's probably a good indication of the CPU and the MB being OK, but not necessarily the RAM. There are two other possible options for testing the RAM that haven't been mentioned yet though (which I hadn't thought of myself until now):
>> 1. If you have access to Windows, try the Windows Memory Diagnostic. This runs yet another slightly different set of tests from memtest86 and memtest86+, so it may catch issues they don't. You can start this directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition.
>> 2. This is a Dell system. If you still have the utility partition which Dell ships all their pre-provisioned systems with, that should have a hardware diagnostics tool. I doubt that this will find anything (it's part of their QA procedure AFAICT), but it's probably worth trying, as the memory testing in that uses yet another slightly different implementation of the typical tests. You can usually find this in the boot interrupt menu accessed by hitting F12 before the boot-loader loads.
>
> I tried the Dell System Test, including the enhanced optional RAM tests, and it was fine. I also tried the Microsoft one, which passed. BUT if I select the advanced test in the Microsoft one, it always stops at 21% of the first test. The test menus still work, but the fans get quiet and it keeps writing "test running... 21%" forever. I tried it many times and it always got stuck at 21%, so I suspect a test suite bug rather than a RAM failure.

I've actually seen this before on other systems (a different completion percentage on each system, but otherwise the same); all of them ended up actually having a bad CPU or MB, although the ones with CPU issues were fine after BIOS updates which included newer microcode.
> I also noticed some other interesting behaviours: while I was running the usual scrub+check (both were fine) from the livecd I noticed this in dmesg:
> [ 261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> Corrupt? But both scrub and check were fine... I double checked scrub and check and they were still fine.

It's worth noting that these are running counts of errors since the last time the stats were reset (and they only get reset manually). If you haven't reset the stats, then this isn't all that surprising.

> This is what happened another time: https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
> I was making a backup of my partition USING DD from the livecd. It wasn't even mounted, if I recall correctly!

The fact that you're getting an OOPS involving core kernel threads (kswapd) is a pretty good indication that either there's a bug elsewhere in the kernel, or that something is wrong with your hardware. It's really difficult to be certain if you don't have a reliable test case though.

> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> That's what a RAM corruption problem looks like when you run btrfs scrub. Maybe the RAM itself is OK, but *something* is scribbling on it. Does the Arch live usb use the same kernel as your normal system?
>
> Yes, except for the point release (the system is slightly ahead of the liveusb).
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> Did you try an older (or newer) kernel? I've been running 4.5.x on a few canary systems, but so far none of them have survived more than a day.
>
> No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.

FWIW, I've been running 4.5 with almost no issues on my laptop since it came out (the few issues I have had are not unique to 4.5, and are all ultimately firmware issues (Lenovo has been getting _really_ bad recently about having broken ACPI and EFI implementations...)).
Of course, I'm also running Gentoo, so everything is built locally, but I doubt that that has much impact on stability.

> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> It's possible there's a problem that affects only very specific chipsets. You seem to have eliminated RAM in isolation, but there could be a problem in the kernel that affects only your chipset.
>
> Funny, considering it is sold as a Linux laptop. Unfortunately they only tested it with the ancient Ubuntu 14.04.

Sadly, this is pretty typical for anything sold as a 'Linux' system that isn't a server. Even for the servers sold as such, it's not unusual for them to only be tested with old versions of CentOS. Now, I hadn't thought of this before, but it's a Dell system, so you're trapping out to SMBIOS for everything under the sun, and if they don't pass a correct memory map (or correct ACPI tables) to the OS during boot, then there may be some sections of RAM that both Linux and the firmware think they can use, which could definitely result in symptoms like bad RAM while still consistently passing memory tests (because they don't
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Thursday 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:

> That's probably a good indication of the CPU and the MB being OK, but not necessarily the RAM. There are two other possible options for testing the RAM that haven't been mentioned yet though (which I hadn't thought of myself until now):
> 1. If you have access to Windows, try the Windows Memory Diagnostic. This runs yet another slightly different set of tests from memtest86 and memtest86+, so it may catch issues they don't. You can start this directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition.
> 2. This is a Dell system. If you still have the utility partition which Dell ships all their pre-provisioned systems with, that should have a hardware diagnostics tool. I doubt that this will find anything (it's part of their QA procedure AFAICT), but it's probably worth trying, as the memory testing in that uses yet another slightly different implementation of the typical tests. You can usually find this in the boot interrupt menu accessed by hitting F12 before the boot-loader loads.

I tried the Dell System Test, including the enhanced optional RAM tests, and it was fine. I also tried the Microsoft one, which passed. BUT if I select the advanced test in the Microsoft one, it always stops at 21% of the first test. The test menus still work, but the fans get quiet and it keeps writing "test running... 21%" forever. I tried it many times and it always got stuck at 21%, so I suspect a test suite bug rather than a RAM failure.

I also noticed some other interesting behaviours: while I was running the usual scrub+check (both were fine) from the livecd I noticed this in dmesg:

[ 261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

Corrupt? But both scrub and check were fine... I double checked scrub and check and they were still fine.
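The counters in that dmesg line are btrfs's persistent per-device error statistics, which survive reboots until reset by hand. A sketch of inspecting and clearing them (the mount point is an assumption):

```
# btrfs device stats /
[/dev/mapper/cryptroot].write_io_errs    0
[/dev/mapper/cryptroot].read_io_errs     0
[/dev/mapper/cryptroot].flush_io_errs    0
[/dev/mapper/cryptroot].corruption_errs  4
[/dev/mapper/cryptroot].generation_errs  0

# btrfs device stats -z /    (prints the stats, then resets them to zero)
```

Because the counters are cumulative, a nonzero corruption_errs can record an error from days before a clean scrub; resetting with -z makes any new increment meaningful.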
This is what happened another time: https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU

I was making a backup of my partition USING DD from the livecd. It wasn't even mounted, if I recall correctly!

On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:

> That's what a RAM corruption problem looks like when you run btrfs scrub. Maybe the RAM itself is OK, but *something* is scribbling on it. Does the Arch live usb use the same kernel as your normal system?

Yes, except for the point release (the system is slightly ahead of the liveusb).

On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:

> Did you try an older (or newer) kernel? I've been running 4.5.x on a few canary systems, but so far none of them have survived more than a day.

No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.

On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:

> It's possible there's a problem that affects only very specific chipsets. You seem to have eliminated RAM in isolation, but there could be a problem in the kernel that affects only your chipset.

Funny, considering it is sold as a Linux laptop. Unfortunately they only tested it with the ancient Ubuntu 14.04.

Niccolò
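The hourly scrub timer discussed in this thread can be sketched as a pair of systemd units (unit names and the filesystem path are assumptions; -B keeps scrub in the foreground so the oneshot service reflects its exit status):

```
# /etc/systemd/system/btrfs-scrub.service
[Unit]
Description=btrfs scrub of the root filesystem

[Service]
Type=oneshot
ExecStart=/usr/bin/btrfs scrub start -B /

# /etc/systemd/system/btrfs-scrub.timer
[Unit]
Description=run btrfs scrub hourly

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now btrfs-scrub.timer`. An hourly cadence matches what was done here for debugging; weekly or monthly is the more common production choice.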
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Thu, May 12, 2016 at 04:35:24PM +0200, Niccolò Belli wrote:
> When doing the btrfs check I also always do a btrfs scrub and it never found any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad: block=670597120, root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Talking about scrub, I created a systemd timer to run scrub hourly and I noticed 2 *uncorrectable* errors suddenly appeared on my system. So I immediately re-ran the scrub just to confirm it, and then I rebooted into the Arch live usb and ran btrfs check: the metadata were perfect. So I ran btrfs scrub from the live usb and there were no errors at all! I rebooted into my system and ran scrub once again and the uncorrectable errors were really gone! It happened twice in the past few days.

That's what a RAM corruption problem looks like when you run btrfs scrub. Maybe the RAM itself is OK, but *something* is scribbling on it. Does the Arch live usb use the same kernel as your normal system?

> Almost no patches get applied by the Arch kernel team: https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless "change-default-console-loglevel.patch".

Did you try an older (or newer) kernel? I've been running 4.5.x on a few canary systems, but so far none of them have survived more than a day. Contrast with 4.1.x and 4.4.x, which run for months between reboots for me. Maybe there's a regression in 4.5.x, maybe I did something wrong in my config or build, or maybe I just have too few data points to draw any conclusions, but my data so far is telling me to stay on 4.4.x until something changes (i.e. wait for a 4.5.x stable update or skip directly to 4.6.x). :-/ It's always worth trying this, if only to eliminate a regression as a possible root cause early.
In practice, every mainline kernel release has a regression that affects at least one combination of config options and hardware. btrfs is stable enough now that you can afford to run one or two releases behind to avoid a problem elsewhere in the kernel.

> Another option will be crashing it with my car's wheels, hoping that because of my comprehensive insurance policy Dell will give me the next model (the Skylake one) as a replacement (and hoping that it will not suffer from the same issue as the Broadwell one).

The first rule of Insurance Fraud Club: don't talk about Insurance Fraud Club. ;)

It's possible there's a problem that affects only very specific chipsets. You seem to have eliminated RAM in isolation, but there could be a problem in the kernel that affects only your chipset.
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 2016-05-12 10:35, Niccolò Belli wrote:
> On Monday 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:
>> Did you also check the data matches the backup? btrfs check will only look at the metadata, which is 0.1% of what you've copied. From what you've written, there should be a lot of errors in the data too. If you have incorrect data but btrfs scrub finds no incorrect checksums, then your storage layer is probably fine and we have to look at CPU, host RAM, and software as possible culprits.
>>
>> The logs you've posted so far indicate that bad metadata (e.g. negative item lengths, nonsense transids in metadata references but sane transids in the referred pages) is getting into otherwise valid and well-formed btrfs metadata pages. Since these pages are protected by checksums, the corruption can't be originating in the storage layer--if it was, the pages would be rejected as they are read from disk, before btrfs even looks at them, and the insane transid would be the "found" one, not the "expected" one. That suggests there is either RAM corruption happening _after_ the data is read from disk (i.e. while the pages are cached in RAM), or a severe software bug in the kernel you're running.
>
> When doing the btrfs check I also always do a btrfs scrub and it never found any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad: block=670597120, root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Talking about scrub, I created a systemd timer to run scrub hourly and I noticed 2 *uncorrectable* errors suddenly appeared on my system. So I immediately re-ran the scrub just to confirm it, and then I rebooted into the Arch live usb and ran btrfs check: the metadata were perfect. So I ran btrfs scrub from the live usb and there were no errors at all! I rebooted into my system and ran scrub once again and the uncorrectable errors were really gone! It happened twice in the past few days.
This would indicate to me that you've either got bad RAM (most likely) or some other hardware component that is not working correctly. It's not unusual for hardware issues to be intermittent.

>> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever maintains your kernel had a bad day and merged a patch they should not have.
>
> Almost no patches get applied by the Arch kernel team: https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless "change-default-console-loglevel.patch".
>
>> Try a minimal configuration with as few drivers as possible loaded, especially GPU drivers and anything from the staging subdirectory--when these drivers have bugs, they ruin everything.
>
> The Arch kernel team is quite conservative regarding staging/experimental features; I remember they rejected some config patches I submitted because of this. Anyway, I will try to blacklist as many kernel modules as I can. Maybe blacklisting the GPU is too much, because if I can't actually use my laptop it will be much more difficult to reproduce the issue.

Disable the GPU driver, but make sure you have the VGA_CONSOLE config option enabled, and you should be fine (you'll just get an 80x25 text-mode console instead of a high-resolution one).

>> Try memtest86+, which has a few more/different tests than memtest86. I have encountered RAM modules that pass memtest86 but fail memtest86+ and vice versa. Try memtester, a memory tester that runs as a Linux process, so it can detect corruption caused when device drivers spray data randomly into RAM, or when the CPU thermal controls are influenced by Linux (an overheating CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop designs rely on the OS for thermal management). Try running more than one memory testing process, in case there is a bug in your hardware that affects interactions between multiple cores (memtest is single-threaded). You can run memtest86 inside a kvm (e.g. kvm -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues. Kernel compiles are a bad way to test RAM. I've successfully built kernels on hosts with known RAM failures. The kernels don't always work properly, but it's quite rare to see a build fail outright.
>
> I didn't use memtest86+ because of the lack of EFI support, but I just tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours without issues. Also I ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4 -turns 10" together for 12 hours without any issue, so I think both my RAM and CPU are OK.

That's probably a good indication of the CPU and the MB being OK, but not necessarily the RAM. There are two other possible options for testing the RAM that haven't been mentioned yet though (which I hadn't thought of myself until now): 1. If you have access to Windows, try the Windows Memory Diagnostic. This runs yet another slightly different set of tests from memtest86 and memtest86+,
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Monday 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:

> Did you also check the data matches the backup? btrfs check will only look at the metadata, which is 0.1% of what you've copied. From what you've written, there should be a lot of errors in the data too. If you have incorrect data but btrfs scrub finds no incorrect checksums, then your storage layer is probably fine and we have to look at CPU, host RAM, and software as possible culprits.
>
> The logs you've posted so far indicate that bad metadata (e.g. negative item lengths, nonsense transids in metadata references but sane transids in the referred pages) is getting into otherwise valid and well-formed btrfs metadata pages. Since these pages are protected by checksums, the corruption can't be originating in the storage layer--if it was, the pages would be rejected as they are read from disk, before btrfs even looks at them, and the insane transid would be the "found" one, not the "expected" one. That suggests there is either RAM corruption happening _after_ the data is read from disk (i.e. while the pages are cached in RAM), or a severe software bug in the kernel you're running.

When doing the btrfs check I also always do a btrfs scrub and it never found any error. Once it didn't manage to finish the scrub because of:

BTRFS critical (device dm-0): corrupt leaf, slot offset bad: block=670597120, root=1, slot=6

and btrfs scrub status reported "was aborted after 00:00:10".

Talking about scrub, I created a systemd timer to run scrub hourly and I noticed 2 *uncorrectable* errors suddenly appeared on my system. So I immediately re-ran the scrub just to confirm it, and then I rebooted into the Arch live usb and ran btrfs check: the metadata were perfect. So I ran btrfs scrub from the live usb and there were no errors at all! I rebooted into my system and ran scrub once again and the uncorrectable errors were really gone! It happened twice in the past few days.

> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever maintains your kernel had a bad day and merged a patch they should not have.

Almost no patches get applied by the Arch kernel team: https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
At the moment the only one is a harmless "change-default-console-loglevel.patch".

> Try a minimal configuration with as few drivers as possible loaded, especially GPU drivers and anything from the staging subdirectory--when these drivers have bugs, they ruin everything.

The Arch kernel team is quite conservative regarding staging/experimental features; I remember they rejected some config patches I submitted because of this. Anyway, I will try to blacklist as many kernel modules as I can. Maybe blacklisting the GPU is too much, because if I can't actually use my laptop it will be much more difficult to reproduce the issue.

> Try memtest86+, which has a few more/different tests than memtest86. I have encountered RAM modules that pass memtest86 but fail memtest86+ and vice versa. Try memtester, a memory tester that runs as a Linux process, so it can detect corruption caused when device drivers spray data randomly into RAM, or when the CPU thermal controls are influenced by Linux (an overheating CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop designs rely on the OS for thermal management). Try running more than one memory testing process, in case there is a bug in your hardware that affects interactions between multiple cores (memtest is single-threaded). You can run memtest86 inside a kvm (e.g. kvm -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues. Kernel compiles are a bad way to test RAM. I've successfully built kernels on hosts with known RAM failures. The kernels don't always work properly, but it's quite rare to see a build fail outright.

I didn't use memtest86+ because of the lack of EFI support, but I just tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours without issues. Also I ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4 -turns 10" together for 12 hours without any issue, so I think both my RAM and CPU are OK.

I can think of only two possible culprits now (correct me if I'm wrong):
1) A btrfs bug
2) Another module screwing things around

I can do nothing about btrfs bugs, so I will try to hunt the second option. This is the list of modules I'm running:

lsmod | awk '$4 == ""' | awk '{print $1}' | sort

8250_dw ac acpi_als acpi_pad aesni_intel ahci algif_skcipher ansi_cprng arc4 atkbd battery bnep btrfs btusb cdc_ether cmac coretemp crc32c_intel crc32_pclmul crct10dif_pclmul dell_laptop dell_wmi dm_crypt drbg ecb elan_i2c evdev ext4 fan fjes ghash_clmulni_intel gpio_lynxpoint hid_generic hid_multitouch hmac i2c_designware_platform i2c_hid i2c_i801 i915 input_leds int3400_thermal int3402_thermal int3403_thermal intel_hid intel_pch_thermal intel_powerclamp intel_rapl ip_tables
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Mon, May 9, 2016 at 8:53 AM, Niccolò Belli wrote:

> I cannot manage to survive such an annoying workflow for long, so I really hope someone will manage to track the bug down soon.

I suggest perseverance :) despite how tedious this is. Btrfs is more aware of its state than other file systems, so if you give up and go to ext4 it's entirely possible the corruption is still happening, but you won't know it until there's a lot more damage. If you do have to give up, at the least I'd suggest XFS, and make sure you're using xfsprogs no older than 3.2.3, which will make a V5 filesystem that uses metadata checksumming by default.

-- Chris Murphy
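For reference, a sketch of creating such a filesystem (the device name is an assumption; with xfsprogs 3.2.3+ the crc=1 V5 format is the default, so the explicit flag only documents the intent):

```
# mkfs.xfs -m crc=1 /dev/sdX2
# xfs_info /mnt/point | grep crc      (a V5 filesystem reports crc=1)
```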
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Hi,

On 09/05/2016 16:53, Niccolò Belli wrote:

> On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug even without TLP. Also, in the past week I've always been on AC.
>
> On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well. My usual testing for RAM involves not just memtest, but also booting into a LiveCD (usually SystemRescueCD), pulling down a copy of the kernel source, and then running as many concurrent kernel builds as cores, each with as many make jobs as cores (so if you've got a quad core CPU (or a dual core with hyperthreading), it would be running 4 builds with -j4 passed to make). GCC seems to have memory usage patterns that reliably trigger memory errors that aren't caught by memtest, so this generally gives good results.
>
> Building the kernel with 4 concurrent threads is not an issue for my system; in fact I compile a lot and I have never had any issue.

Note: I once had a server which would pass memtest86 and repeated kernel compilations maxing out the CPU threads, but couldn't at the same time reliably compile a kernel and copy large amounts of data.
I think I lost my little automated test suite (I should definitely look for it again or code it from scratch), but what I did on new servers since that time was:

1/ Create a file larger than the system's RAM (this makes sure you will read and write all data from disk and not only from caches, and might catch controller hardware problems too) with dd if=/dev/urandom (several gigabytes of random data exercise many different patterns, far more than what memtest86 would test), and compute its md5 checksum.

2/ Launch a subprocess repeatedly compiling the kernel with more jobs than available CPU threads, stopping as soon as the make exit code was != 0.

3/ Launch another subprocess repeatedly copying the random file to another location, exiting when the md5 checksum didn't match the source.

Let it run as a burn-in test for as long as you can afford (from experience, after 24 hours of running the probability that the test will still find a problem becomes negligible). If one of the subprocesses stops by itself, your hardware is not stable. This actually caught a few unstable systems for me before they could go into production.

Lionel
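A minimal sketch of those three steps as a shell script (sizes, paths, and the kernel tree location are assumptions; this is not Lionel's original suite):

```shell
#!/bin/sh
# Burn-in sketch of the three steps above. RAM_MB, SRC/DST and KSRC are
# assumptions -- set RAM_MB comfortably above your installed RAM.
set -u
RAM_MB=8192
SRC=/var/tmp/burnin.rand
DST=/var/tmp/burnin.copy
KSRC=/usr/src/linux

# 1/ Random reference file larger than RAM, plus its checksum.
make_reference() {
    dd if=/dev/urandom of="$SRC" bs=1M count="$RAM_MB" 2>/dev/null
    md5sum < "$SRC" > "$SRC.md5"
}

# 2/ Compile the kernel in a loop, over-committing the CPU; stop on failure.
compile_loop() {
    while make -C "$KSRC" -j"$(( $(nproc) * 2 ))" >/dev/null 2>&1; do
        make -C "$KSRC" clean >/dev/null 2>&1
    done
    echo "COMPILE FAILED"
}

# 3/ Copy the reference file in a loop; stop when the checksum mismatches.
copy_loop() {
    while cp "$SRC" "$DST" && [ "$(md5sum < "$DST")" = "$(cat "$SRC.md5")" ]; do
        rm -f "$DST"
    done
    echo "COPY MISMATCH"
}

# Typical use (run both loops concurrently for ~24 hours):
#   make_reference; compile_loop & copy_loop & wait
```

If either loop prints its failure message before you stop the test, the hardware is suspect, exactly as in the procedure described above.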
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Austin S. Hemmelgarn posted on Mon, 09 May 2016 14:21:57 -0400 as excerpted: > This practice evolved out of the fact that the only bad RAM I've ever > dealt with either completely failed to POST (which can have all kinds of > interesting symptoms if it's just one module, some MB's refuse to boot, > some report the error, others just disable the module and act like > nothing happened), or passed all the memory testing tools I threw at it > (memtest86, memtest86+, memtester, concurrent memtest86 invocations from > Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed > under heavy concurrent random access, which can be reliably produced by > running a bunch of big software builds at the same time with the CPU > insanely over-committed. My (likely much more limited) experience matches yours. Tho FWIW, in my case I did find that one of the more common memory failure indicators was bz2-ed tarball decompression, where the tarball would fail its decompression checksum safety checks. However, that most reliably happened in the context of a heavily loaded system doing other package builds in parallel to the package tarball extraction that failed. In my case, I even had ECC RAM, but it was apparently just slightly out of spec for its labeled and internally configured memory speeds (PC3200 DDR1 at the time), at least on my hardware. Once I got a BIOS update that let me, I slightly downclocked the memory (to PC3000, IIRC), and it was absolutely solid, no more errors, even with tightened up wait-state timings. Later I upgraded RAM, and the new RAM worked just fine at the same PC3200 speeds that were a problem for the older RAM. 
The problem was apparently that while the RAM cells that memtest checks were fine, it was testing in an otherwise calm environment (not much choice, since you can only boot to the test directly and can't do anything else at the same time), without all the other stuff going on in the hectic environment of a multi-package parallel build, which apparently happened to occasionally trigger the edge case that would corrupt things.

And FWIW, I still have major respect for how well reiserfs behaved under those conditions. No filesystem can be expected to be 100% reliable when it's getting corrupted data due to bad memory, but reiserfs held up remarkably well, far better than btrfs did under similar conditions (but then with the PCI and SATA bus) a few years later, forcing me back to reiserfs for a time, which again, continued to work like a champ, even under hardware conditions that were absolutely unworkable with btrfs. I had a heat-related head crash on a disk too (the AC went out, in Phoenix, in the summer, 40+ C outside, 50+ C inside, who knows what the disks were!?), where the partitions that were mounted and likely had the head flying over them were damaged beyond (easy) recovery, but other partitions on the same disk were absolutely fine, and I actually continued to run off them for a few months after cooling everything back down.

That sort of experience is the reason I still use reiserfs on spinning rust, including my second- and third-level backups, even while I'm running btrfs on the ssds for the working system and primary backup. It's also the reason I continue to use a partitioned system with multiple independent filesystems (btrfs raid1 on a pair of ssds for most of the working btrfs and primary backups, individual ssd btrfs in dup mode for /boot, and its backup on the other ssd), instead of putting my data eggs all in the same filesystem basket with subvolumes, where if the filesystem goes out, all the subvolumes go with it!

-- Duncan - List replies preferred.
No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 2016-05-09 12:29, Zygo Blaxell wrote:

> On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
>
>> While trying to find a common denominator for my issue I did lots of
>> backups of /dev/mapper/cryptroot and I restored them into
>> /dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data
>> write every time), without any issue (after restoring the backup I always
>> check the partition with btrfs check). So disk doesn't seem to be the
>> culprit.
>
> Did you also check the data matches the backup? btrfs check will only look
> at the metadata, which is 0.1% of what you've copied. From what you've
> written, there should be a lot of errors in the data too. If you have
> incorrect data but btrfs scrub finds no incorrect checksums, then your
> storage layer is probably fine and we have to look at CPU, host RAM, and
> software as possible culprits.

This is a good point.

> The logs you've posted so far indicate that bad metadata (e.g. negative
> item lengths, nonsense transids in metadata references but sane transids in
> the referred pages) is getting into otherwise valid and well-formed btrfs
> metadata pages. Since these pages are protected by checksums, the
> corruption can't be originating in the storage layer--if it was, the pages
> should be rejected as they are read from disk, before btrfs even looks at
> them, and the insane transid should be the "found" one, not the "expected"
> one.
>
> That suggests there is either RAM corruption happening _after_ the data is
> read from disk (i.e. while the pages are cached in RAM), or a severe
> software bug in the kernel you're running. Try different kernel versions
> (e.g. 4.4.9 or 4.1.23) in case whoever maintains your kernel had a bad day
> and merged a patch they should not have. Try a minimal configuration with
> as few drivers as possible loaded, especially GPU drivers and anything from
> the staging subdirectory--when these drivers have bugs, they ruin
> everything. Try memtest86+ which has a few more/different tests than
> memtest86.
> I have encountered RAM modules that pass memtest86 but fail memtest86+ and
> vice versa. Try memtester, a memory tester that runs as a Linux process, so
> it can detect corruption caused when device drivers spray data randomly
> into RAM, or when the CPU thermal controls are influenced by Linux (an
> overheating CPU-to-RAM bridge can really ruin your day, and some of the
> dumber laptop designs rely on the OS for thermal management). Try running
> more than one memory testing process, in case there is a bug in your
> hardware that affects interactions between multiple cores (memtest is
> single-threaded). You can run memtest86 inside a kvm (e.g. kvm -m 3072
> -kernel /boot/memtest86.bin) to detect these kinds of issues.
>
> Kernel compiles are a bad way to test RAM. I've successfully built kernels
> on hosts with known RAM failures. The kernels don't always work properly,
> but it's quite rare to see a build fail outright.

My original suggestion that prompted that part of the comment was to run a bunch of concurrent kernel builds (I only use kernel builds myself because it's a big project with essentially zero build dependencies; if I had the patience and space (and a LiveCD with the right tools and packages installed), I'd probably be using something like LibreOffice or Chromium instead), each run with as many jobs as CPUs (so on a quad-core system, run a dozen or so concurrently with make -j4). I don't use this as my sole test (I also use multiple other tools), but I find that this does a particularly good job of exercising things that memtest doesn't, and I don't just make sure the builds succeed, but also that the compiled kernel images all match, because if there's bad RAM, the resultant images will often be different in some way (and I had forgotten to mention this bit).
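The "run N identical builds concurrently, then check the artifacts are bit-identical" idea can be sketched as a small POSIX shell function. Everything here is generic and illustrative: you pass the number of copies and a command whose stdout is treated as the build artifact. For real RAM testing you would instead run make -j4 in several copies of a kernel tree and hash the resulting kernel images.

```shell
# Sketch of the concurrent-build RAM check.  The function name and the
# stdout-as-artifact convention are illustrative placeholders.
parallel_build_check() {
    n=$1; shift                     # number of concurrent copies
    work=$(mktemp -d) || return 1
    for i in $(seq "$n"); do
        mkdir "$work/$i"
        # Each copy runs concurrently, over-committing the CPU like
        # the parallel kernel builds described above.
        ( cd "$work/$i" && "$@" > artifact ) &
    done
    wait
    first=$(sha256sum "$work/1/artifact" | cut -d' ' -f1)
    status="all $n builds identical"
    for i in $(seq "$n"); do
        h=$(sha256sum "$work/$i/artifact" | cut -d' ' -f1)
        [ "$h" = "$first" ] || status="build $i differs: possible bad RAM"
    done
    echo "$status"
    rm -rf "$work"
}
```

With a real kernel tree, each copy would be its own checkout running make -j4, and you would hash arch/x86/boot/bzImage rather than stdout; that plumbing is left out of the sketch.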
This practice evolved out of the fact that the only bad RAM I've ever dealt with either completely failed to POST (which can have all kinds of interesting symptoms if it's just one module, some MB's refuse to boot, some report the error, others just disable the module and act like nothing happened), or passed all the memory testing tools I threw at it (memtest86, memtest86+, memtester, concurrent memtest86 invocations from Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed under heavy concurrent random access, which can be reliably produced by running a bunch of big software builds at the same time with the CPU insanely over-committed.

I could probably produce a similar workload with tmpfs and FIO, but it's a lot quicker and easier to remember how to do a kernel build than it is to remember the complex incantations needed to get FIO to do anything interesting.
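For reference, one plausible shape for such an FIO incantation is the job file below: many workers doing random writes against tmpfs with data verification enabled. This is a guess at a reasonable configuration, not a tested recipe; the /dev/shm path, sizes, and job count are assumptions to tune per machine (leave some RAM free, or the OOM killer will cut the test short).

```ini
; memstress.fio -- hypothetical tmpfs stress job (untested sketch)
[global]
; tmpfs, so all I/O actually exercises RAM
directory=/dev/shm
; per-job size; 8 jobs x 256m = ~2GB, adjust to your RAM
size=256m
ioengine=psync
rw=randwrite
blocksize=4k
; concurrent workers hammering memory
numjobs=8
runtime=600
time_based=1
; fail loudly if what is read back differs from what was written
verify=crc32c
verify_fatal=1

[memstress]
```

Run with `fio memstress.fio`; a verification failure here with healthy storage points at RAM or the path to it.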
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
> While trying to find a common denominator for my issue I did lots of backups
> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
> dozens of times (triggering a 150GB+ random data write every time), without
> any issue (after restoring the backup I always check the partition with
> btrfs check). So disk doesn't seem to be the culprit.

Did you also check the data matches the backup? btrfs check will only look at the metadata, which is 0.1% of what you've copied. From what you've written, there should be a lot of errors in the data too. If you have incorrect data but btrfs scrub finds no incorrect checksums, then your storage layer is probably fine and we have to look at CPU, host RAM, and software as possible culprits.

The logs you've posted so far indicate that bad metadata (e.g. negative item lengths, nonsense transids in metadata references but sane transids in the referred pages) is getting into otherwise valid and well-formed btrfs metadata pages. Since these pages are protected by checksums, the corruption can't be originating in the storage layer--if it was, the pages should be rejected as they are read from disk, before btrfs even looks at them, and the insane transid should be the "found" one, not the "expected" one.

That suggests there is either RAM corruption happening _after_ the data is read from disk (i.e. while the pages are cached in RAM), or a severe software bug in the kernel you're running. Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever maintains your kernel had a bad day and merged a patch they should not have. Try a minimal configuration with as few drivers as possible loaded, especially GPU drivers and anything from the staging subdirectory--when these drivers have bugs, they ruin everything. Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+ and vice versa. Try memtester, a memory tester that runs as a Linux process, so it can detect corruption caused when device drivers spray data randomly into RAM, or when the CPU thermal controls are influenced by Linux (an overheating CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop designs rely on the OS for thermal management).

Try running more than one memory testing process, in case there is a bug in your hardware that affects interactions between multiple cores (memtest is single-threaded). You can run memtest86 inside a kvm (e.g. kvm -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.

Kernel compiles are a bad way to test RAM. I've successfully built kernels on hosts with known RAM failures. The kernels don't always work properly, but it's quite rare to see a build fail outright.

> [...] I have the feeling that "autodefrag" enhances the
> chances to get corruption, but I'm not 100% sure about it. Anyway,
> triggering a whole-packages reinstall with "pacaur -S $(pacman -Qe)" gives
> high chances to get irrecoverable corruption. When running such a command it
> simply extracts the tarballs from the cache and overwrites the already
> installed files. It doesn't write lots of data (after reinstallation my
> system is still quite small, just a few GBs) but it seems to be enough to
> displease the filesystem.

pacman probably does a lot of fsync() which will do a lot of metadata tree updates. autodefrag triples the I/O load for fragmented files and most of that extra load is metadata tree writes. Both will make the symptoms of your problem worse.
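The "check the data, not just the metadata" point above boils down to comparing a checksum of every file against the backup. A minimal sketch, assuming GNU coreutils (sha256sum) and that both trees are mounted somewhere; the function name and paths are illustrative:

```shell
# verify_restore BACKUP_DIR RESTORED_DIR
# Hashes every file under BACKUP_DIR, then re-checks the same relative
# paths under RESTORED_DIR.  Catches exactly the data corruption that
# btrfs check (metadata only) cannot see.
verify_restore() {
    backup=$1
    restored=$2
    sums=$(mktemp)
    # -r: don't run sha256sum at all if find matches nothing
    ( cd "$backup" && find . -type f -print0 | xargs -0 -r sha256sum ) > "$sums"
    if ( cd "$restored" && sha256sum --quiet -c "$sums" >/dev/null 2>&1 ); then
        echo "restore matches backup"
    else
        echo "data mismatch detected"
    fi
    rm -f "$sums"
}
```

Note this only catches files that differ or are missing from the restore; files that exist only in the restored tree would need a reverse pass (or something like `rsync -rcn` for a dry-run content comparison).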
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
> Are you using any power management tweaks?

Yes, as stated in my very first post I use TLP with SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug even without TLP. Also in the past week I've always been on AC.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> Memtest doesn't replicate typical usage patterns very well. My usual
> testing for RAM involves not just memtest, but also booting into a LiveCD
> (usually SystemRescueCD), pulling down a copy of the kernel source, and
> then running as many concurrent kernel builds as cores, each with as many
> make jobs as cores (so if you've got a quad core CPU (or a dual core with
> hyperthreading), it would be running 4 builds with -j4 passed to make).
> GCC seems to have memory usage patterns that reliably trigger memory
> errors that aren't caught by memtest, so this generally gives good results.

Building a kernel with 4 concurrent threads is not an issue for my system; in fact I do compile a lot and I never had any issue.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> On a similar note, badblocks doesn't replicate filesystem-like access
> patterns, it just runs sequentially through the entire disk. This isn't as
> likely to give bad results, but it's still important to know. In
> particular, try running it over a dmcrypt volume a couple of times
> (preferably with a different key each time, pulling keys from /dev/urandom
> works well for this), as that will result in writing different data. For
> what it's worth, when I'm doing initial testing of new disks, I always use
> ddrescue to copy /dev/zero over the whole disk, then do it twice through
> dmcrypt with different keys, copying from the disk to /dev/null after each
> pass. This gives random data on disk as a starting point (which is good if
> you're going to use dmcrypt), and usually triggers reallocation of any bad
> sectors as early as possible.
While trying to find a common denominator for my issue I did lots of backups of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data write every time), without any issue (after restoring the backup I always check the partition with btrfs check). So disk doesn't seem to be the culprit.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 1. If you have an eSATA port, try plugging your hard disk in there and see
> if things work. If that works but having the hard drive plugged in
> internally doesn't, then the issue is probably either that specific SATA
> port (in which case your chip-set is bad and you should get a new system),
> or the SATA connector itself (or the wiring, but that's not as likely when
> it's traces on a PCB). Normally I'd suggest just swapping cables and SATA
> ports, but that's not really possible with a laptop.
> 2. If you have access to a reasonably large flash drive, or to a USB to
> SATA adapter, try that as well, if it works on that but not internally (or
> on an eSATA port), you've probably got a bad SATA controller, and should
> get a new system.

My laptop doesn't have an eSATA port and my only big enough external drive is currently used for daily backups, since I fear for data loss.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 3. Try things without dmcrypt. Adding extra layers makes it harder to
> determine what is actually wrong. If it works without dmcrypt, try using
> different parameters for the encryption (different ciphers is what I would
> try first). If it works reliably without dmcrypt, then it's either a bug
> in dmcrypt (which I don't think is very likely), or it's bad interaction
> between dmcrypt and BTRFS. If it works with some encryption parameters but
> not others, then that will help narrow down where the issue is.
On Sunday 8 May 2016 01:35:16 CEST, Chris Murphy wrote:
> You're making the troubleshooting unnecessarily difficult by continuing to
> use non-default options. *shrug* Every single layer you add complicates
> the setup and troubleshooting. Of course all of it should work together,
> many people do. But you're the one having the problem so in order to
> demonstrate whether this is a software bug or hardware problem, you need
> to test it with the most basic setup possible --> btrfs on plain
> partitions and default mount options.

I will try to recap because you obviously missed my previous e-mail: I managed to replicate the irrecoverable corruption bug even with default options and no dmcrypt at all. Somehow it was a bit more difficult to replicate with default options, and so I started to play with different combinations to find out if there was something which increased the chances of getting corruption. I have the feeling that "autodefrag" enhances the chances to get corruption, but I'm not 100% sure about it. Anyway, triggering a whole packages reinstall with "pacaur -S $(pacman
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 2016-05-07 12:11, Niccolò Belli wrote: Il 2016-05-07 17:58 Clemens Eisserer ha scritto: Hi Niccolo, btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot Just to be curious - couldn't it be a hardware issue? I use almost the same setup (compress-force=lzo instead of compress-force=lzo) on my laptop for 2-3 years and haven't experienced any issues since ~kernel-3.14 or so. Br, Clemens Eisserer Hi, Which kind of hardware issue? I did a full memtest86 check, a full smartmontools extended check and even a badblocks -wsv. If this is really an hardware issue that we can identify I would be more than happy because Dell will replace my laptop and this nightmare will be finally over. I'm open to suggestions. First, some general advice: 1. It is fully possible to have bad RAM that still passes memtest86 consistently, and in fact, most of the time this will be the case (if you're seeing any thing other than the bit-fade test in memtest86 fail, then your system probably won't boot fully). Memtest doesn't replicate typical usage patterns very well. My usual testing for RAM involves not just memtest, but also booting into a LiveCD (usually SystemRescueCD), pulling down a copy of the kernel source, and then running as many concurrent kernel builds as cores, each with as many make jobs as cores (so if you've got a quad core CPU (or a dual core with hyperthreading), it would be running 4 builds with -j4 passed to make). GCC seems to have memory usage patterns that reliably trigger memory errors that aren't caught by memtest, so this generally gives good results. Secondarily, if it's a big system and I am not pressed for time, I do a quick Gentoo install with Xen, and then spin up twice as many Xen VM's as cores and run memtest in those concurrently (this seems to catch things a bit more reliably than just a plain memtest). 2. On a similar note, badblocks doesn't replicate filesystem like access patterns, it just runs sequentially through the entire disk. 
This isn't as likely to give bad results, but it's still important to know. In particular, try running it over a dmcrypt volume a couple of times (preferably with a different key each time, pulling keys from /dev/urandom works well for this), as that will result in writing different data. For what it's worth, when I'm doing initial testing of new disks, I always use ddrescue to copy /dev/zero over the whole disk, then do it twice through dmcrypt with different keys, copying from the disk to /dev/null after each pass. This gives random data on disk as a starting point (which is good if you're going to use dmcrypt), and usually triggers reallocation of any bad sectors as early as possible. If I have time and access to an existing system I can connect the disk to, I often do testing with fio as well.

Now, to slightly more specific advice:

1. If you have an eSATA port, try plugging your hard disk in there and see if things work. If that works but having the hard drive plugged in internally doesn't, then the issue is probably either that specific SATA port (in which case your chip-set is bad and you should get a new system), or the SATA connector itself (or the wiring, but that's not as likely when it's traces on a PCB). Normally I'd suggest just swapping cables and SATA ports, but that's not really possible with a laptop.

2. If you have access to a reasonably large flash drive, or to a USB to SATA adapter, try that as well; if it works on that but not internally (or on an eSATA port), you've probably got a bad SATA controller, and should get a new system.

3. Try things without dmcrypt. Adding extra layers makes it harder to determine what is actually wrong. If it works without dmcrypt, try using different parameters for the encryption (different ciphers is what I would try first). If it works reliably without dmcrypt, then it's either a bug in dmcrypt (which I don't think is very likely), or it's bad interaction between dmcrypt and BTRFS.
If it works with some encryption parameters but not others, then that will help narrow down where the issue is.
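The burn-in procedure described above (a zero pass over the raw device, then two passes through dm-crypt with throwaway keys, reading everything back) could look roughly like the outline below. This is a hedged, untested sketch, not a drop-in script: it must run as root, /dev/sdX is a placeholder, and it destroys everything on that device.

```shell
# DANGER: wipes $disk completely.  Untested outline of the burn-in idea.
disk=/dev/sdX

# Pass 1: zeros over the raw device.
ddrescue --force /dev/zero "$disk"

# Passes 2-3: zeros written through dm-crypt with a random throwaway key,
# which lands on the platters as unique pseudo-random data each time;
# read it all back afterwards so bad sectors get detected (and
# reallocated by the drive) as early as possible.
for pass in 1 2; do
    head -c 32 /dev/urandom > /tmp/burnin.key
    cryptsetup open --type plain --key-file /tmp/burnin.key "$disk" burnin
    ddrescue --force /dev/zero /dev/mapper/burnin
    ddrescue --force /dev/mapper/burnin /dev/null
    cryptsetup close burnin
    rm -f /tmp/burnin.key
done
```

Afterwards the disk holds random-looking data, which is the starting point you want before layering dmcrypt on it anyway.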
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 7 May 2016 at 18:11, Niccolò Belli wrote:
> Which kind of hardware issue? I did a full memtest86 check, a full
> smartmontools extended check and even a badblocks -wsv.
> If this is really a hardware issue that we can identify I would be more
> than happy because Dell will replace my laptop and this nightmare will be
> finally over. I'm open to suggestions.

Well, your hardware differs from a lot of successful installations. Are you using any power management tweaks?
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Sat, May 7, 2016 at 9:45 AM, Niccolò Belli wrote:
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.

You're making the troubleshooting unnecessarily difficult by continuing to use non-default options. *shrug* Every single layer you add complicates the setup and troubleshooting. Of course all of it should work together, many people do. But you're the one having the problem so in order to demonstrate whether this is a software bug or hardware problem, you need to test it with the most basic setup possible --> btrfs on plain partitions and default mount options.

-- Chris Murphy
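As a side note, the default-options baseline can be rehearsed cheaply on a file-backed image before touching a real partition. A sketch, assuming btrfs-progs is installed (mkfs.btrfs and btrfs check work on a plain file without root; only mounting needs a real block device or a root-created loop device):

```shell
# Untested sketch: exercise mkfs/check with pure defaults on a file image.
truncate -s 1G /tmp/btrfs-test.img
mkfs.btrfs -f /tmp/btrfs-test.img   # no special options: all defaults
btrfs check /tmp/btrfs-test.img     # baseline: a fresh fs should pass clean
# The real test is the same idea on a plain partition, e.g.:
#   mkfs.btrfs /dev/sdXN && mount /dev/sdXN /mnt     (defaults only)
```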
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 2016-05-07 17:58, Clemens Eisserer wrote:
> Hi Niccolo,
>
>> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
>
> Just to be curious - couldn't it be a hardware issue? I use almost the
> same setup (compress-force=lzo instead of compress=lzo) on my laptop for
> 2-3 years and haven't experienced any issues since ~kernel-3.14 or so.
>
> Br, Clemens Eisserer

Hi,

Which kind of hardware issue? I did a full memtest86 check, a full smartmontools extended check and even a badblocks -wsv. If this is really a hardware issue that we can identify I would be more than happy because Dell will replace my laptop and this nightmare will be finally over. I'm open to suggestions.

Niccolò
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Hi Niccolo,

> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot

Just to be curious - couldn't it be a hardware issue? I use almost the same setup (compress-force=lzo instead of compress=lzo) on my laptop for 2-3 years and haven't experienced any issues since ~kernel-3.14 or so.

Br, Clemens Eisserer

2016-05-07 17:45 GMT+02:00 Niccolò Belli:
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.
>
> [ 748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431 move
> len 4294962894 len 16384
> [ 748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
> [rest of the oops snipped; the full log is in Niccolò's original message
> below]
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot

So discard is not the culprit. Will try to remove compress=lzo and autodefrag and see if it still happens.

[ 748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431 move len 4294962894 len 16384
[ 748.226206] [ cut here ]
[ 748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
[ 748.229498] invalid opcode: [#1] PREEMPT SMP
[ 748.231161] Modules linked in: ext4 mbcache jbd2 nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_hdmi dell_laptop dcdbas dell_wmi iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel arc4 kvm irqbypass psmouse serio_raw pcspkr elan_i2c snd_soc_ssm4567 snd_soc_rt286 snd_soc_rl6347a snd_soc_core i2c_hid iwlmvm snd_compress snd_pcm_dmaengine ac97_bus mac80211 uvcvideo videobuf2_vmalloc btusb videobuf2_memops cdc_ether btrtl usbnet iwlwifi btbcm videobuf2_v4l2 btintel intel_pch_thermal videobuf2_core i2c_i801 videodev r8152 rtsx_pci_ms cfg80211 bluetooth visor media mii memstick joydev evdev mousedev input_leds rfkill mac_hid crc16 i915 fan thermal wmi dw_dmac int3403_thermal video dw_dmac_core drm_kms_helper snd_soc_sst_acpi i2c_designware_platform snd_soc_sst_match
[ 748.237203] snd_hda_intel 8250_dw i2c_designware_core gpio_lynxpoint spi_pxa2xx_platform drm int3402_thermal snd_hda_codec battery tpm_crb intel_hid snd_hda_core sparse_keymap fjes snd_hwdep int3400_thermal acpi_thermal_rel tpm_tis snd_pcm intel_gtt tpm acpi_als syscopyarea sysfillrect snd_timer sysimgblt fb_sys_fops mei_me i2c_algo_bit processor_thermal_device kfifo_buf processor snd industrialio acpi_pad ac int340x_thermal_zone mei intel_soc_dts_iosf button lpc_ich soundcore shpchp sch_fq_codel ip_tables x_tables btrfs xor raid6_pq jitterentropy_rng sha256_ssse3 sha256_generic hmac drbg ansi_cprng algif_skcipher af_alg uas usb_storage dm_crypt dm_mod sd_mod rtsx_pci_sdmmc atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper
[ 748.244176] ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci xhci_hcd i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
[ 748.246662] CPU: 0 PID: 2316 Comm: pacman Not tainted 4.5.1-1-ARCH #1
[ 748.249123] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07 11/11/2015
[ 748.251576] task: 8800d9d98e40 ti: 8800cec1 task.ti: 8800cec1
[ 748.254064] RIP: 0010:[] [] memmove_extent_buffer+0x10c/0x110 [btrfs]
[ 748.256600] RSP: 0018:8800cec13c18 EFLAGS: 00010246
[ 748.259120] RAX: RBX: 88020c01ba40 RCX: 0056
[ 748.261631] RDX: RSI: 88021e40db38 RDI: 88021e40db38
[ 748.264166] RBP: 8800cec13c48 R08: R09: 033b
[ 748.266716] R10: R11: 033b R12: eece
[ 748.269267] R13: 00010405 R14: 000104c9 R15: 88020c01ba40
[ 748.271818] FS: 7f14d4271740() GS:88021e40() knlGS:
[ 748.274392] CS: 0010 DS: ES: CR0: 80050033
[ 748.276987] CR2: 01630008 CR3: cffc8000 CR4: 003406f0
[ 748.279603] DR0: DR1: DR2:
[ 748.282220] DR3: DR6: fffe0ff0 DR7: 0400
[ 748.284815] Stack:
[ 748.287422] e3438cd2 88020c01ba40 00c4 002a
[ 748.290082] 006b 03a0 8800cec13ce8 a02b612c
[ 748.292754] a02b433d 8800da9ca820 0028 8800daa78bd0
[ 748.295441] Call Trace:
[ 748.298104] [] btrfs_del_items+0x33c/0x4a0 [btrfs]
[ 748.300827] [] ? btrfs_search_slot+0x90d/0x990 [btrfs]
[ 748.303564] [] ? btrfs_get_token_8+0x6c/0x130 [btrfs]
[ 748.306311] [] btrfs_truncate_inode_items+0x649/0xd20 [btrfs]
[ 748.309071] [] ? btrfs_delayed_inode_release_metadata.isra.1+0x4e/0xf0 [btrfs]
[ 748.311860] [] btrfs_evict_inode+0x485/0x5d0 [btrfs]
[ 748.314627] [] evict+0xc5/0x190
[ 748.317412] [] iput+0x1d9/0x260
[ 748.320199] [] do_unlinkat+0x199/0x2d0
[ 748.322988] [] SyS_unlink+0x16/0x20
[ 748.325781] [] entry_SYSCALL_64_fastpath+0x12/0x6d
[ 748.328584] Code: 41 5e 41 5f 5d c3 48 8b 7f 18 48 89 f2 48 c7 c6 40 44 36 a0 e8 06 90 fa ff 0f 0b 48 8b 7f 18 48 c7 c6 08 44 36 a0 e8 f4 8f fa ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb
[ 748.331558] RIP [] memmove_extent_buffer+0x10c/0x110 [btrfs]
[ 748.334473] RSP
[ 748.356077] ---[ end trace 9bfb28800ab52273 ]---
[ 748.359042] note: pacman[2316] exited with preempt_count 2
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
I formatted the partition and copied the content of my previous rootfs to it. There is no dmcrypt now and mount options are defaults, except for noatime. After a single boot I got the very same problem as before (fs corrupted and an infinite loop when doing btrfs check --repair). I wanted to replicate results and so I tried once again, and since then I have only experienced minor corruption, correctly resolved by repair.

But during a pacman upgrade, which triggered snapper pre-post snapshots, the system hung and I found this in the logs:

mag 06 10:31:15 arch-laptop plasmashell[873]: requesting unexisting screen 2
mag 06 10:31:18 arch-laptop dbus[418]: [system] Activating service name='org.opensuse.Snapper' (using servicehelper)
mag 06 10:31:18 arch-laptop dbus[418]: [system] Successfully activated service 'org.opensuse.Snapper'
mag 06 10:31:20 arch-laptop kernel: [ cut here ]
mag 06 10:31:20 arch-laptop kernel: kernel BUG at fs/btrfs/ctree.h:2693!

Still no major corruption found since my second attempt.

Niccolò
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Thu, May 05, 2016 at 12:36:52PM +0200, Niccolò Belli wrote:
> On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> > I suggest using defaults for starters. The only thing in that list
> > that needs to be there is either subvolid or subvol, not both. Add in
> > the non-default options once you've proven the defaults are working,
> > and add them one at a time.
>
> Yes I read your previous suggestion and I already dropped subvolid, but
> since the problem already happened I left it in the mail for completeness.
> Anyway the culprit here is genfstab and that's probably what a beginner is
> going to use when installing a distro:
> https://wiki.archlinux.org/index.php/beginners'_guide#fstab

The redundant subvolid doesn't hurt; the kernel will just check that it matches the passed subvol (see [1]). genfstab probably just pulls the options out of /proc/mounts or /proc/self/mountinfo, and since we show both, that's how it gets in fstab. If it was actually a problem, there would be a clear message in dmesg.

1: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb289b7be62db84b9630ce00367444c810cada2c

-- Omar
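To make the genfstab behaviour concrete, here is a hypothetical fstab entry of the kind it emits; the UUID and subvolume values are placeholders for illustration, not taken from this thread:

```
# Both subvol= and subvolid= present, as copied from /proc/self/mountinfo;
# per the commit in [1], the kernel only verifies that the two agree.
UUID=<fs-uuid>  /  btrfs  rw,noatime,compress=lzo,subvol=/root,subvolid=257  0  0
```

Dropping either one of subvol= or subvolid= from such a line leaves the mount unchanged.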
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On giovedì 5 maggio 2016 03:07:37 CEST, Chris Murphy wrote:
> I suggest using defaults for starters. The only thing in that list
> that needs to be there is either subvolid or subvol, not both. Add in
> the non-default options once you've proven the defaults are working,
> and add them one at a time.

Yes, I read your previous suggestion and I already dropped subvolid, but since the problem already happened I left it in the mail for completeness. Anyway the culprit here is genfstab, and that's probably what a beginner is going to use when installing a distro:
https://wiki.archlinux.org/index.php/beginners'_guide#fstab

> > Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
>
> The firmware is old if I understand the naming scheme used by Dell. It
> says EXT49D0Q is current.
> http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH

According to this (http://forum.notebookreview.com/threads/2015-xps-13-ssd-fw-problem-with-m-2-samsung-pm851.770501/) the firmware you linked is for the mSATA version of the drive, not the M.2 one. EXT25D0Q seems to be the very latest one for my drive.

> I advise using all defaults for everything for now, otherwise it's
> anyone's guess what you're running into.

On giovedì 5 maggio 2016 06:12:28 CEST, Qu Wenruo wrote:
> Would it be OK for you to test your btrfs on a plain ssd, without
> encryption? And just as Chris Murphy said, reducing the mount options is
> also a pretty good debugging starting point.

Ok, I will remove dmcrypt, discard, compress=lzo, nodefrag and see what happens.

> > I made a copy of /dev/mapper/cryptroot with dd on an external drive and
> > I ran btrfs check on it (btrfs-progs 4.5.2):
> > https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
>
> Checked, but seems the output is truncated?

No, I didn't truncate the btrfs check output, because it wasn't endless. I only truncated the repair output. I also have something new to report.
Do you remember when I said that my screen was black and so I had to forcibly power off the system? Something similar happened today, and since in the meantime I enabled the magic SysRq keys, I have been able to recover this from the logs:

mag 05 11:55:51 arch-laptop kdeinit5[960]: Registering "org.kde.StatusNotifierItem-1060-1/StatusNotifierItem" to system tray
mag 05 11:55:51 arch-laptop obexd[1098]: OBEX daemon 5.39
mag 05 11:55:51 arch-laptop dbus-daemon[920]: Successfully activated service 'org.bluez.obex'
mag 05 11:55:51 arch-laptop systemd[898]: Started Bluetooth OBEX service.
mag 05 11:55:51 arch-laptop korgac[1044]: log_kidentitymanagement: IdentityManager: There was no default identity. Marking first one as default.
mag 05 11:55:51 arch-laptop kernel: BUG: unable to handle kernel paging request at 00017d11
mag 05 11:55:51 arch-laptop kernel: IP: [] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: PGD 0
mag 05 11:55:51 arch-laptop kernel: Oops: [#1] PREEMPT SMP
mag 05 11:55:51 arch-laptop kernel: Modules linked in: rfcomm(+) visor bnep uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152 crc16 mii joydev mousedev nvr
mag 05 11:55:51 arch-laptop kernel: mei_me syscopyarea sysfillrect snd sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan intel_hid sparse_keymap int3403_thermal video processor_thermal_device dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:51 arch-laptop kernel: lrw gf128mul glue_helper ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM TTY layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM socket layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM ver 1.11
mag 05 11:55:51 arch-laptop kernel: xhci_hcd
mag 05 11:55:51 arch-laptop kernel: i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
mag 05 11:55:51 arch-laptop kernel: CPU: 0 PID: 351 Comm: systemd-udevd Not tainted 4.5.1-1-ARCH #1
mag 05 11:55:51 arch-laptop kernel: Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:51 arch-laptop kernel: task: 88021347d580 ti: 880211f8c000 task.ti: 880211f8c000
mag 05 11:55:51 arch-laptop kernel: RIP: 0010:[] [] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: RSP: 0018:880211f8fd68 EFLAGS: 00010206
mag 05 11:55:51 arch-laptop kernel: RAX: 8800da2f4820 RBX: 8800bb59ce40 RCX: 8800da2f4830
mag 05 11:55:51 arch-laptop kernel: RDX: 8800da2f4828 RSI: 8800374404a0 RDI: 8800c58dfa40
mag 05 11:55:51 arch-laptop kernel: RBP: 880211f8fdb8 R08: 00017c79 R09: 0007f55e2059
mag 05 11:55:51 arch-laptop kernel: R10: 0007f55e2053 R11: 8800c58dfa40 R12: 880037440460
mag 05 11:55:51 arch-laptop kernel:
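On the dd copy mentioned earlier in this message: imaging the device and only ever pointing btrfs check (and especially --repair) at the copy is the safe order of operations, since a bad repair can make a damaged filesystem worse. Here is a minimal runnable sketch of the copy-and-verify step, using throwaway files in place of /dev/mapper/cryptroot and the external drive (all paths are stand-ins; on the real system the final step would be `btrfs check "$img"`):

```shell
# Stand-ins for the real source device and the image on the external
# drive (hypothetical paths; mktemp is used so the sketch is runnable).
src=$(mktemp)
img=$(mktemp)
printf 'pretend this is a btrfs superblock' > "$src"

# Image the source. conv=sparse keeps the copy small when the source
# contains long runs of zeroes (useful for a mostly-empty 256GB SSD).
dd if="$src" of="$img" bs=4M conv=sparse 2>/dev/null

# Verify the image is bit-identical before trusting any check run on it.
cmp -s "$src" "$img" && echo "image matches source"
```

On the real system you would then run the read-only `btrfs check` against the image first, and reach for `--repair` only once the reported damage is understood.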
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Wed, May 4, 2016 at 5:21 PM, Niccolò Belli wrote:
> rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@

I suggest using defaults for starters. The only thing in that list that needs to be there is either subvolid or subvol, not both. Add in the non-default options once you've proven the defaults are working, and add them one at a time.

> I have the whole rootfs encrypted, including boot. I followed these steps:
> https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
>
> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).

The firmware is old if I understand the naming scheme used by Dell. It says EXT49D0Q is current.
http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH

If you need to update, you may be best off doing a whole-device trim, which is most easily done by pointing mkfs.btrfs at the whole device. I wouldn't trust any data on the drive after a firmware update, so I'd start over entirely from scratch: new partition map, new everything. So the way to do this is:

mkfs.btrfs /dev/sda
wipefs -a /dev/sda

That way the btrfs magic is removed, and now you can partition it, set up dmcrypt, etc. I advise using all defaults for everything for now, otherwise it's anyone's guess what you're running into.

Off topic, but at least gmail users see your posts go to spam, because your domain is configured to disallow relaying. Most mail services ignore this request by the domain, but Google honors it, so no amount of training will make your email not spam.
This is what's in your emails that's causing the problem:

dmarc=fail (p=QUARANTINE dis=NONE) header.from=linuxsystems.it

http://webmasters.stackexchange.com/questions/76765/sent-emails-pass-spf-and-dkim-but-fail-dmarc-when-received-by-gmail
http://www.pcworld.com/article/2141120/yahoo-email-antispoofing-policy-breaks-mailing-lists.html

--
Chris Murphy
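On the mkfs.btrfs-then-wipefs sequence Chris describes: it works because wipefs erases the filesystem signature that mkfs wrote. A minimal sketch of that idea on a throwaway file instead of /dev/sda (the offset is where the btrfs magic lives; zeroing it by hand here is a simplified stand-in for what `wipefs -a` does through its signature tables, which is an assumption about its mechanism):

```shell
# Demonstrated on a small image file instead of a real disk. The btrfs
# superblock starts at 64KiB, and its magic string "_BHRfS_M" sits 0x40
# bytes into it, i.e. at absolute byte offset 65600.
img=$(mktemp)
truncate -s 1M "$img"

# Simulate mkfs.btrfs writing its signature:
printf '_BHRfS_M' | dd of="$img" bs=1 seek=65600 conv=notrunc 2>/dev/null

# Simplified stand-in for `wipefs -a`: blank the signature bytes so no
# tool mis-detects a stale filesystem on the device later.
dd if=/dev/zero of="$img" bs=1 seek=65600 count=8 conv=notrunc 2>/dev/null

# Only zero bytes remain at that offset now.
dd if="$img" bs=1 skip=65600 count=8 2>/dev/null | od -An -c
```

This is also why the order matters: mkfs.btrfs first (to issue the whole-device trim), then wipefs (so the leftover signature doesn't confuse the subsequent partitioning and dmcrypt setup).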