On Thu, May 12, 2016 at 04:35:24PM +0200, Niccolò Belli wrote:
> When doing the btrfs check I also always do a btrfs scrub and it never found
> any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Talking about scrub I created a systemd timer to run scrub hourly and I
> noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-run the scrub just to confirm it and then I rebooted into the
> Arch live usb and runned btrfs check: the metadata were perfect. So I runned
> btrfs scrub from the live usb and there were no errors at all! I rebooted
> into my system and runned scrub once again and the uncorrectable errors
> where really gone! It happened two times in the past few days.

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.

Does the Arch live usb use the same kernel as your normal system?

> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is an harmless
> "change-default-console-loglevel.patch".

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.
Contrast with 4.1.x and 4.4.x, which run for months between reboots for
me.  Maybe there's a regression in 4.5.x, maybe I did something wrong in
my config or build, or maybe I just have too few data points to draw any
conclusions, but my data so far is telling me to stay on 4.4.x until
something changes (i.e. wait for a 4.5.x stable update or skip directly
to 4.6.x).  :-/

Trying a different kernel is always worth it, if only to eliminate a
regression as a possible root cause early.  In practice, every mainline
kernel release has a regression that affects at least one combination of
config options and hardware.  btrfs is stable enough now that you can run
one or two releases behind to avoid a problem elsewhere in the kernel.

> Another option will be crashing it with my car's wheels hoping that because
> of my comprehensive insurance policy Dell will give me the next model (the
> Skylake one) as a replacement (hoping that it will not suffer from the same
> issue of the Broadwell one).

The first rule of Insurance Fraud Club:  don't talk about Insurance Fraud
Club.  ;)

It's possible there's a problem that affects only very specific chipsets.
You seem to have eliminated RAM in isolation, but there could be a problem
in the kernel that affects only your chipset.
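
The reasoning about the scrub results above can be written down as a small
sketch. This is only an illustration, not a btrfs tool: the function name,
the labels, and the input format (a list of uncorrectable-error counts, one
per scrub run) are all assumptions made for the example, and it assumes the
runs cover essentially the same data. The sample counts mirror the reported
sequence (2 errors, 2 on the re-run, then none from the live usb and none
after rebooting).

```python
def classify_scrub_runs(uncorrectable_counts):
    """Classify repeated scrub results (hypothetical helper).

    uncorrectable_counts: the uncorrectable-error count reported by each
    scrub run, in the order the runs were made. Assumed input format.
    """
    if not uncorrectable_counts:
        return "no data"
    if all(count == 0 for count in uncorrectable_counts):
        return "clean"
    if all(count > 0 for count in uncorrectable_counts):
        # Errors show up on every run: more consistent with damage on disk.
        return "persistent"
    # Errors appear and then vanish without a repair: more consistent with
    # something corrupting memory (bad RAM, or a kernel/driver bug).
    return "transient"


# Illustrative counts only, modelled on the sequence described in the report.
runs = [2, 2, 0, 0]
print(classify_scrub_runs(runs))
```

A "transient" result does not say what is doing the corrupting, which is why
comparing kernel versions between the live usb and the installed system is
the next step.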