Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Chris Murphy
On Fri, May 13, 2016 at 6:10 AM, Niccolò Belli
 wrote:
> On venerdì 13 maggio 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
>>
>> The fact that you're getting an OOPS involving core kernel threads
>> (kswapd) is a pretty good indication that either there's a bug elsewhere in
>> the kernel, or that something is wrong with your hardware.  It's really
>> difficult to be certain if you don't have a reliable test case though.
>
>
> Talking about reliable test cases, I forgot to say that I definitely found
> an interesting one. It doesn't lead to an OOPS but perhaps to something even
> more interesting. While running countless stress tests I tried running some
> games to stress the system in different ways. I chose openmw (an open source
> engine for Morrowind) and played it for a while on my second external
> monitor (while watching some monitoring tools on my first monitor). I
> noticed that after playing for a while I *always* lose the internet
> connection (I use a USB3 Gigabit Ethernet adapter). This isn't the only
> thing that happens: even though the game keeps running flawlessly and the
> system *seems* to work fine (I can drag windows, open the terminal...), lots
> of commands simply stall (for example mounting a partition, unmounting it,
> rebooting...). I can reliably reproduce it; it ALWAYS happens.

Well, there are a bunch of kernel debug options. If your kernel was built
with CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y, you can boot with the
parameter slub_debug=1 to enable it, and maybe there'll be something more
revealing about the problems you're having. More aggressive is
CONFIG_DEBUG_PAGEALLOC=y, but it'll slow things down quite noticeably.
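A minimal sketch of checking for those options and adding the parameter on an
Arch-style setup (the GRUB paths are assumptions; adjust for your bootloader):

# confirm the running kernel was built with SLUB debugging support
zgrep -E 'CONFIG_SLUB(_DEBUG)?=' /proc/config.gz   # or grep /boot/config-$(uname -r) if available
# append slub_debug=1 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
grub-mkconfig -o /boot/grub/grub.cfg
# after rebooting, verify it took effect
cat /proc/cmdline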

And then there are some Btrfs debug options that are set at compile time and
enabled with mount options. But I think the problem you're having
isn't specific to Btrfs, or someone else would have run into it.




-- 
Chris Murphy


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Niccolò Belli

On venerdì 13 maggio 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
The fact that you're getting an OOPS involving core kernel 
threads (kswapd) is a pretty good indication that either there's 
a bug elsewhere in the kernel, or that something is wrong with 
your hardware.  It's really difficult to be certain if you don't 
have a reliable test case though.


Talking about reliable test cases, I forgot to say that I definitely found 
an interesting one. It doesn't lead to an OOPS but perhaps to something even 
more interesting. While running countless stress tests I tried running some 
games to stress the system in different ways. I chose openmw (an open 
source engine for Morrowind) and played it for a while on my second 
external monitor (while watching some monitoring tools on my first 
monitor). I noticed that after playing for a while I *always* lose the 
internet connection (I use a USB3 Gigabit Ethernet adapter). This isn't the 
only thing that happens: even though the game keeps running flawlessly and 
the system *seems* to work fine (I can drag windows, open the terminal...), 
lots of commands simply stall (for example mounting a partition, unmounting 
it, rebooting...). I can reliably reproduce it; it ALWAYS happens.


Niccolò


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Austin S. Hemmelgarn

On 2016-05-13 07:07, Niccolò Belli wrote:

On giovedì 12 maggio 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:

That's probably a good indication of the CPU and the MB being OK, but
not necessarily the RAM.  There are two other possible options for
testing the RAM that haven't been mentioned yet though (which I hadn't
thought of myself until now):
1. If you have access to Windows, try the Windows Memory Diagnostic.
This runs yet another slightly different set of tests from memtest86
and memtest86+, so it may catch issues they don't.  You can start this
directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
from the EFI system partition.
2. This is a Dell system.  If you still have the utility partition
which Dell ships all their pre-provisioned systems with, that should
have a hardware diagnostics tool.  I doubt that this will find
anything (it's part of their QA procedure AFAICT), but it's probably
worth trying, as the memory testing in that uses yet another slightly
different implementation of the typical tests.  You can usually find
this in the boot interrupt menu accessed by hitting F12 before the
boot-loader loads.


I tried the Dell System Test, including the enhanced optional RAM tests
and it was fine. I also tried the Microsoft one, which passed. BUT if I
select the advanced test in the Microsoft one it always stops at 21% of
the first test. The test menus are still working, but the fans get quiet and
it keeps writing "test running... 21%" forever. I tried it many times and
it always got stuck at 21%, so I suspect a test suite bug rather than a
RAM failure.
I've actually seen this before on other systems (different completion 
percentage on each system, but otherwise the same); all of them ended up 
actually having a bad CPU or MB, although the ones with CPU issues were 
fine after BIOS updates which included newer microcode.


I also noticed some other interesting behaviours: while I was running
the usual scrub+check (both were fine) from the livecd I noticed this in
dmesg:
[  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Corrupt? But both scrub and check were fine... I double checked scrub
and check and they were still fine.
It's worth noting that these are running counts of errors since the last 
time the stats were reset (and they only get reset manually).  If you 
haven't reset the stats, then this isn't all that surprising.
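For reference, a quick way to inspect and clear those counters (the mountpoint
is a placeholder; point it at your filesystem):

btrfs device stats /mnt       # show the per-device running error counters
btrfs device stats -z /mnt    # print the counters and then reset them to zero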


This is what happened another time:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
I was making a backup of my partition USING DD from the livecd. It
wasn't even mounted if I recall correctly!
The fact that you're getting an OOPS involving core kernel threads 
(kswapd) is a pretty good indication that either there's a bug elsewhere 
in the kernel, or that something is wrong with your hardware.  It's 
really difficult to be certain if you don't have a reliable test case 
though.


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.

Does the Arch live usb use the same kernel as your normal system?


Yes, except for the point release (the system is slightly ahead of the
liveusb).

On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.


No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.
FWIW, I've been running 4.5 with almost no issues on my laptop since it 
came out (the few issues I have had are not unique to 4.5, and are all 
ultimately firmware issues (Lenovo has been getting _really_ bad 
recently about having broken ACPI and EFI implementations...)).  Of 
course, I'm also running Gentoo, so everything is built locally, but I 
doubt that that has much impact on stability.


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

It's possible there's a problem that affects only very specific chipsets.
You seem to have eliminated RAM in isolation, but there could be a
problem in the kernel that affects only your chipset.


Funny, considering it is sold as a Linux laptop. Unfortunately they only
tested it with the ancient Ubuntu 14.04.
Sadly, this is pretty typical for anything sold as a 'Linux' system that 
isn't a server.  Even for the servers sold as such, it's not unusual for 
it to only be tested with old versions of CentOS.


Now, I hadn't thought of this before, but it's a Dell system, so you're 
trapping out to SMBIOS for everything under the sun, and if they don't 
pass a correct memory map (or correct ACPI tables) to the OS during 
boot, then there may be some sections of RAM that both Linux and the 
firmware think they can use, which could definitely result in symptoms 
like bad RAM while still consistently passing memory tests (because they 
don't 

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Niccolò Belli

On giovedì 12 maggio 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
That's probably a good indication of the CPU and the MB being 
OK, but not necessarily the RAM.  There are two other possible 
options for testing the RAM that haven't been mentioned yet 
though (which I hadn't thought of myself until now):
1. If you have access to Windows, try the Windows Memory 
Diagnostic. This runs yet another slightly different set of 
tests from memtest86 and memtest86+, so it may catch issues they 
don't.  You can start this directly on an EFI system by loading 
/EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition.
2. This is a Dell system.  If you still have the utility 
partition which Dell ships all their pre-provisioned systems 
with, that should have a hardware diagnostics tool.  I doubt 
that this will find anything (it's part of their QA procedure 
AFAICT), but it's probably worth trying, as the memory testing 
in that uses yet another slightly different implementation of 
the typical tests.  You can usually find this in the boot 
interrupt menu accessed by hitting F12 before the boot-loader 
loads.


I tried the Dell System Test, including the enhanced optional RAM tests and 
it was fine. I also tried the Microsoft one, which passed. BUT if I select 
the advanced test in the Microsoft one it always stops at 21% of the first 
test. The test menus are still working, but the fans get quiet and it keeps 
writing "test running... 21%" forever. I tried it many times and it always 
got stuck at 21%, so I suspect a test suite bug rather than a RAM failure.


I also noticed some other interesting behaviours: while I was running the 
usual scrub+check (both were fine) from the livecd I noticed this in dmesg:
[  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: 
wr 0, rd 0, flush 0, corrupt 4, gen 0
Corrupt? But both scrub and check were fine... I double checked scrub and 
check and they were still fine.


This is what happened another time: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
I was making a backup of my partition USING DD from the livecd. It wasn't 
even mounted if I recall correctly!


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.

Does the Arch live usb use the same kernel as your normal system?


Yes, except for the point release (the system is slightly ahead of the 
liveusb).


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.


No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.

On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

It's possible there's a problem that affects only very specific chipsets.
You seem to have eliminated RAM in isolation, but there could be a problem
in the kernel that affects only your chipset.


Funny, considering it is sold as a Linux laptop. Unfortunately they only 
tested it with the ancient Ubuntu 14.04.


Niccolò


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-12 Thread Zygo Blaxell
On Thu, May 12, 2016 at 04:35:24PM +0200, Niccolò Belli wrote:
> When doing the btrfs check I also always do a btrfs scrub and it never found
> any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
> 
> Speaking of scrub, I created a systemd timer to run scrub hourly and I
> noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-ran the scrub just to confirm it and then I rebooted into the
> Arch live usb and ran btrfs check: the metadata were perfect. So I ran
> btrfs scrub from the live usb and there were no errors at all! I rebooted
> into my system and ran scrub once again and the uncorrectable errors
> were really gone! It has happened twice in the past few days.

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.

Does the Arch live usb use the same kernel as your normal system?

> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless
> "change-default-console-loglevel.patch".

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.
Contrast with 4.1.x and 4.4.x, which run for months between reboots
for me.  Maybe there's a regression in 4.5.x, maybe I did something
wrong in my config or build, or maybe I just have too few data points
to draw any conclusions, but my data so far is telling me to stay on
4.4.x until something changes (i.e. wait for a 4.5.x stable update or
skip directly to 4.6.x).  :-/

It's always worth trying this if only to eliminate regression as a
possible root cause early.  In practice, every mainline kernel release
has a regression that affects at least one combination of config options
and hardware.  btrfs is stable enough now that you can be running one
or two releases behind to avoid a problem elsewhere in the kernel.

> Another option will be crashing it with my car's wheels hoping that because
> of my comprehensive insurance policy Dell will give me the next model (the
> Skylake one) as a replacement (hoping that it will not suffer from the same
> issue of the Broadwell one).

The first rule of Insurance Fraud Club:  don't talk about Insurance
Fraud Club.  ;)

It's possible there's a problem that affects only very specific chipsets.
You seem to have eliminated RAM in isolation, but there could be a problem
in the kernel that affects only your chipset.





Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-12 Thread Austin S. Hemmelgarn

On 2016-05-12 10:35, Niccolò Belli wrote:

On lunedì 9 maggio 2016 18:29:41 CEST, Zygo Blaxell wrote:

Did you also check the data matches the backup?  btrfs check will only
look at the metadata, which is 0.1% of what you've copied.  From what
you've written, there should be a lot of errors in the data too.  If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.

The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages.  Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one.  That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.


When doing the btrfs check I also always do a btrfs scrub and it never
found any error. Once it didn't manage to finish the scrub because of:
BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
block=670597120,root=1, slot=6
and btrfs scrub status reported "was aborted after 00:00:10".

Speaking of scrub, I created a systemd timer to run scrub hourly and I
noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
immediately re-ran the scrub just to confirm it and then I rebooted into
the Arch live usb and ran btrfs check: the metadata were perfect. So
I ran btrfs scrub from the live usb and there were no errors at all!
I rebooted into my system and ran scrub once again and the
uncorrectable errors were really gone! It has happened twice in the
past few days.
This would indicate to me that you've either got bad RAM (most likely), 
or some other hardware component is not working correctly.  It's not 
unusual for hardware issues to be intermittent.



Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.


Almost no patches get applied by the Arch kernel team:
https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
At the moment the only one is a harmless
"change-default-console-loglevel.patch".


Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.


The Arch kernel team is quite conservative regarding staging/experimental
features; I remember they rejected some config patches I submitted
because of this.
Anyway, I will try to blacklist as many kernel modules as I can. Maybe
blacklisting the GPU driver is too much, because if I can't actually use my
laptop it will be much more difficult to reproduce the issue.
Disable the GPU driver, but make sure you have the VGA_CONSOLE config 
enabled, and you should be fine (you'll just get an 80x25 text-mode 
console instead of a high-resolution one).



Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.

Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into
RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).

Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores
(memtest
is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.

Kernel compiles are a bad way to test RAM.  I've successfully built
kernels on hosts with known RAM failures.  The kernels don't always work
properly, but it's quite rare to see a build fail outright.


I didn't use memtest86+ because of the lack of EFI support, but I just
tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours
without issues.
I also ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4
-turns 10" together for 12 hours without any issue, so I think both
my RAM and CPU are OK.
That's probably a good indication of the CPU and the MB being OK, but 
not necessarily the RAM.  There are two other possible options for testing 
the RAM that haven't been mentioned yet though (which I hadn't thought 
of myself until now):
1. If you have access to Windows, try the Windows Memory Diagnostic. 
This runs yet another slightly different set of tests from memtest86 and 
memtest86+, 

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-12 Thread Niccolò Belli

On lunedì 9 maggio 2016 18:29:41 CEST, Zygo Blaxell wrote:

Did you also check the data matches the backup?  btrfs check will only
look at the metadata, which is 0.1% of what you've copied.  From what
you've written, there should be a lot of errors in the data too.  If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.

The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages.  Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one.  That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.


When doing the btrfs check I also always do a btrfs scrub and it never 
found any error. Once it didn't manage to finish the scrub because of:
BTRFS critical (device dm-0): corrupt leaf, slot offset bad: 
block=670597120,root=1, slot=6

and btrfs scrub status reported "was aborted after 00:00:10".

Speaking of scrub, I created a systemd timer to run scrub hourly and I 
noticed 2 *uncorrectable* errors suddenly appeared on my system. So I 
immediately re-ran the scrub just to confirm it and then I rebooted into 
the Arch live usb and ran btrfs check: the metadata were perfect. So I 
ran btrfs scrub from the live usb and there were no errors at all! I 
rebooted into my system and ran scrub once again and the uncorrectable 
errors were really gone! It has happened twice in the past few days.
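For reference, a minimal sketch of such an hourly scrub setup; the unit names
and the mountpoint are assumptions, adjust to taste:

# /etc/systemd/system/btrfs-scrub.service
[Unit]
Description=btrfs scrub of /

[Service]
Type=oneshot
ExecStart=/usr/bin/btrfs scrub start -B /

# /etc/systemd/system/btrfs-scrub.timer
[Unit]
Description=Hourly btrfs scrub

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

Enable it with "systemctl enable --now btrfs-scrub.timer".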



Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.


Almost no patches get applied by the Arch kernel team: 
https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
At the moment the only one is a harmless 
"change-default-console-loglevel.patch".



Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.


The Arch kernel team is quite conservative regarding staging/experimental 
features; I remember they rejected some config patches I submitted because 
of this.
Anyway, I will try to blacklist as many kernel modules as I can. Maybe 
blacklisting the GPU driver is too much, because if I can't actually use my 
laptop it will be much more difficult to reproduce the issue.



Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.

Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).

Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores (memtest
is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.

Kernel compiles are a bad way to test RAM.  I've successfully built
kernels on hosts with known RAM failures.  The kernels don't always work
properly, but it's quite rare to see a build fail outright.


I didn't use memtest86+ because of the lack of EFI support, but I just 
tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours 
without issues.
I also ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4 
-turns 10" together for 12 hours without any issue, so I think both my 
RAM and CPU are OK.


I can think of only two possible culprits now (correct me if I'm wrong):
1) A btrfs bug
2) Another module screwing things around

I can do nothing about btrfs bugs, so I will try to hunt down the second option. 
This is the list of modules I'm running:


lsmod | awk '$4 == ""' | awk '{print $1}' | sort

8250_dw
ac
acpi_als
acpi_pad
aesni_intel
ahci
algif_skcipher
ansi_cprng
arc4
atkbd
battery
bnep
btrfs
btusb
cdc_ether
cmac
coretemp
crc32c_intel
crc32_pclmul
crct10dif_pclmul
dell_laptop
dell_wmi
dm_crypt
drbg
ecb
elan_i2c
evdev
ext4
fan
fjes
ghash_clmulni_intel
gpio_lynxpoint
hid_generic
hid_multitouch
hmac
i2c_designware_platform
i2c_hid
i2c_i801
i915
input_leds
int3400_thermal
int3402_thermal
int3403_thermal
intel_hid
intel_pch_thermal
intel_powerclamp
intel_rapl
ip_tables

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Chris Murphy
On Mon, May 9, 2016 at 8:53 AM, Niccolò Belli  wrote:

> I cannot manage to survive such annoying workflow for long, so I really hope
> someone will manage to track the bug down soon.

I suggest perseverance :) despite how tedious this is. Btrfs is more
aware of its state than other file systems, so if you give up and go
to ext4 it's entirely possible corruption is still happening but you
won't know it until there's a lot more damage. At the least if you
have to give up I'd suggest XFS, and make sure you're using xfsprogs no
older than 3.2.3, which will make a V5 filesystem that uses
metadata checksumming by default.
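A sketch of what that looks like, assuming xfsprogs >= 3.2.3 and a
hypothetical /dev/sdX1:

mkfs.xfs -m crc=1 /dev/sdX1   # crc=1 (the V5 format) is the default on 3.2.3+, shown explicitly here
xfs_info /mnt                 # after mounting, the output should show crc=1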


-- 
Chris Murphy


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Lionel Bouton
Hi,

Le 09/05/2016 16:53, Niccolò Belli a écrit :
> On domenica 8 maggio 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with
> SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the
> bug even without TLP. Also in the past week I've always been on AC.
>
> On lunedì 9 maggio 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well.  My usual
>> testing for RAM involves not just memtest, but also booting into a
>> LiveCD (usually SystemRescueCD), pulling down a copy of the kernel
>> source, and then running as many concurrent kernel builds as cores,
>> each with as many make jobs as cores (so if you've got a quad core
>> CPU (or a dual core with hyperthreading), it would be running 4
>> builds with -j4 passed to make).  GCC seems to have memory usage
>> patterns that reliably trigger memory errors that aren't caught by
>> memtest, so this generally gives good results.
>
> Building kernel with 4 concurrent threads is not an issue for my
> system, in fact I do compile a lot and I never had any issue.

Note: I once had a server which would pass memtest86 and repeated
kernel compilations maxing out the CPU threads, but couldn't at the same
time reliably compile a kernel and copy large amounts of data.
I think I lost my little automated test suite (I should definitely look
for it again or code it from scratch), but what I did on new servers
since that time was:

1/ create a file larger than the system's RAM (this makes sure you will
read and write all data from disk and not only caches and might catch
controller hardware problems too) with dd if=/dev/urandom (several
gigabytes of random data exercise many different patterns, far more than
what memtest86 would test), compute its md5 checksum
2/ launch a subprocess repeatedly compiling the kernel with more jobs
than available CPU threads and stopping as soon as the make exit code
was != 0.
3/ launch another subprocess repeatedly copying the random file to
another location and exiting when the md5 checksum didn't match the source.

Let it run as a burn-in test for as long as you can afford (from
experience, if it's still running after 24 hours the probability that the
test will find a problem becomes negligible).
If one of the subprocesses stops by itself, your hardware is not stable.
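A minimal sketch of that burn-in loop, assuming bash, md5sum, and a kernel
tree that is already unpacked and configured in ./linux:

# 1/ random file larger than RAM, plus its checksum
dd if=/dev/urandom of=random.dat bs=1M count=20480      # ~20 GiB, size it above your RAM
orig=$(md5sum random.dat | cut -d' ' -f1)

# 2/ rebuild the kernel in a loop with more jobs than CPU threads, stop on the first failure
( cd linux && while make -j"$(( $(nproc) * 2 ))" >/dev/null 2>&1; do
      make clean >/dev/null
  done; echo "BUILD FAILED" ) &

# 3/ copy the random file in a loop, stop when the copy's checksum no longer matches
( while cp random.dat copy.dat &&
        [ "$(md5sum copy.dat | cut -d' ' -f1)" = "$orig" ]; do
      :
  done; echo "COPY MISMATCH" ) &

wait   # any message printed above means the hardware is not stable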

This actually caught a few unstable systems before they could go into
production for me.

Lionel


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 09 May 2016 14:21:57 -0400 as
excerpted:

> This practice evolved out of the fact that the only bad RAM I've ever
> dealt with either completely failed to POST (which can have all kinds of
interesting symptoms if it's just one module: some MBs refuse to boot,
> some report the error, others just disable the module and act like
> nothing happened), or passed all the memory testing tools I threw at it
> (memtest86, memtest86+, memtester, concurrent memtest86 invocations from
> Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed
> under heavy concurrent random access, which can be reliably produced by
> running a bunch of big software builds at the same time with the CPU
> insanely over-committed.

My (likely much more limited) experience matches yours.

Tho FWIW, in my case I did find that one of the more common memory 
failure indicators was bz2-ed tarball decompression, where the tarball 
would fail its decompression checksum safety checks.  However, that most 
reliably happened in the context of a heavily loaded system doing other 
package builds in parallel to the package tarball extraction that failed.

In my case, I even had ECC RAM, but it was apparently just slightly out 
of spec for its labeled and internally configured memory speeds (PC3200 
DDR1 at the time), at least on my hardware.  Once I got a BIOS update 
that let me, I slightly downclocked the memory (to PC3000, IIRC), and it 
was absolutely solid, no more errors, even with tightened up wait-state 
timings.  Later I upgraded RAM, and the new RAM worked just fine at the 
same PC3200 speeds that were a problem for the older RAM.

The problem was apparently that while the RAM cells that memcheck checks 
were fine, it was testing in an otherwise calm environment (not much 
choice, since you can only boot to the test directly and can't do anything 
else at the same time), without all the other stuff going on in the 
hectic environment of a multi-package parallel build, which apparently 
happened to occasionally trigger the edge case that would corrupt things.

And FWIW, I still have major respect for how well reiserfs behaved under 
those conditions.  No filesystem can be expected to be 100% reliable when 
it's getting corrupted data due to bad memory, but reiserfs held up 
remarkably well, far better than btrfs did under similar conditions (but 
then with the PCI and SATA bus) a few years later, forcing me back to 
reiserfs for a time, which again, continued to work like a champ, even 
under hardware conditions that were absolutely unworkable with btrfs.  I 
had a heat-related (AC went out, in Phoenix, in the summer, 40+ C 
outside, 50+C inside, who knows what the disks were!?) head crash on a 
disk too, where the partitions that were mounted and likely had the head 
flying over them were damaged beyond (easy) recovery, but other 
partitions on the same disk were absolutely fine, and I actually 
continued to run off them for a few months after cooling everything back 
down.  That sort of experience is the reason I still use reiserfs on 
spinning rust, including my second and third level backups, even while 
I'm running btrfs on the ssds for the working system and primary backup.  
It's also the reason I continue to use a partitioned system with multiple 
independent filesystems (btrfs raid1 on a pair of ssds for most of the 
working btrfs and primary backups, individual ssd btrfs in dup mode for 
/boot, and its backup on the other ssd), instead of putting my data eggs 
all in the same filesystem basket with subvolumes, where if the 
filesystem goes out all the subvolumes go with it!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Austin S. Hemmelgarn

On 2016-05-09 12:29, Zygo Blaxell wrote:

On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:

While trying to find a common denominator for my issue I did lots of backups
of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
dozens of times (triggering a 150GB+ random data write every time), without
any issue (after restoring the backup I always check the partition with btrfs
check). So disk doesn't seem to be the culprit.


Did you also check the data matches the backup?  btrfs check will only
look at the metadata, which is 0.1% of what you've copied.  From what
you've written, there should be a lot of errors in the data too.  If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.

This is a good point.


The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages.  Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one.  That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.

Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.

Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.

Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.

Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).

Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores (memtest
is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.

Kernel compiles are a bad way to test RAM.  I've successfully built
kernels on hosts with known RAM failures.  The kernels don't always work
properly, but it's quite rare to see a build fail outright.
My original suggestion that prompted that part of the comment was to run 
a bunch of concurrent kernel builds (I only use kernel builds myself 
because it's a big project with essentially zero build dependencies, if 
I had the patience and space (and a LiveCD with the right tools and 
packages installed), I'd probably be using something like LibreOffice or 
Chromium instead), each run with as many jobs as CPU's (so on a 
quad-core system, run a dozen or so concurrently with make -j4).  I 
don't use this as my sole test (I also use multiple other tools), but I 
find that this does a particularly good job of exercising things that 
memtest doesn't, and I don't just make sure the builds succeed, but 
also that the compiled kernel images all match, because if there's bad 
RAM, the resultant images will often be different in some way (and I had 
forgotten to mention this bit).
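A rough sketch of that test, assuming a configured kernel tree in ./linux and
a quad-core machine (adjust the counts to your CPU):

# run a dozen concurrent builds (-j4 each) from separate copies of the tree
for i in $(seq 12); do
    cp -a linux build-$i
    ( KBUILD_BUILD_TIMESTAMP='burn-in' make -C build-$i -j4 >/dev/null 2>&1 \
          || echo "build $i FAILED" ) &
done
wait
# pinning the timestamp above keeps the images comparable; with good RAM they should normally all match
sha256sum build-*/arch/x86/boot/bzImage | sort | uniq -c -w64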


This practice evolved out of the fact that the only bad RAM I've ever 
dealt with either completely failed to POST (which can have all kinds of 
interesting symptoms if it's just one module: some MBs refuse to boot, 
some report the error, others just disable the module and act like 
nothing happened), or passed all the memory testing tools I threw at it 
(memtest86, memtest86+, memtester, concurrent memtest86 invocations from 
Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed 
under heavy concurrent random access, which can be reliably produced by 
running a bunch of big software builds at the same time with the CPU 
insanely over-committed.  I could probably produce a similar workload 
with tmpfs and FIO, but it's a lot quicker and easier to remember how to 
do a kernel build than it is to remember the complex incantations needed 
to get FIO to do anything interesting.



Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Zygo Blaxell
On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
> While trying to find a common denominator for my issue I did lots of backups
> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
> dozens of times (triggering a 150GB+ random data write every time), without
> any issue (after restoring the backup I always check the partition with btrfs
> check). So disk doesn't seem to be the culprit.

Did you also check the data matches the backup?  btrfs check will only
look at the metadata, which is 0.1% of what you've copied.  From what
you've written, there should be a lot of errors in the data too.  If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.

The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages.  Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one.  That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.

Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.

Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.

Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.

Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).

Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores (memtest
is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
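A hedged sketch of that multi-process approach using memtester (the sizes are
assumptions; leave enough free RAM for the rest of the system, and run as root
so the memory can be locked):

# one memtester per CPU thread, each locking 1 GiB and looping until interrupted
for i in $(seq "$(nproc)"); do
    memtester 1024M &
done
wait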

Kernel compiles are a bad way to test RAM.  I've successfully built
kernels on hosts with known RAM failures.  The kernels don't always work
properly, but it's quite rare to see a build fail outright.

> [...]I have the feeling that "autodefrag" increases the
> chances of corruption, but I'm not 100% sure about it. Anyway,
> triggering a reinstall of all packages with "pacaur -S $(pacman -Qe)" gives a
> high chance of irrecoverable corruption. When running such a command it
> simply extracts the tarballs from the cache and overwrites the already
> installed files. It doesn't write lots of data (after reinstallation my
> system is still quite small, just a few GBs) but it seems to be enough to
> displease the filesystem.

pacman probably does a lot of fsync() which will do a lot of metadata
tree updates.  autodefrag triples the I/O load for fragmented files and
most of that extra load is metadata tree writes.  Both will make the
symptoms of your problem worse.





Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Niccolò Belli

On domenica 8 maggio 2016 20:27:55 CEST, Patrik Lundquist wrote:

Are you using any power management tweaks?


Yes, as stated in my very first post I use TLP with 
SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug 
even without TLP. Also in the past week I've always been on AC.


On lunedì 9 maggio 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
Memtest doesn't replicate typical usage patterns very well.  My 
usual testing for RAM involves not just memtest, but also 
booting into a LiveCD (usually SystemRescueCD), pulling down a 
copy of the kernel source, and then running as many concurrent 
kernel builds as cores, each with as many make jobs as cores (so 
if you've got a quad core CPU (or a dual core with 
hyperthreading), it would be running 4 builds with -j4 passed to 
make).  GCC seems to have memory usage patterns that reliably 
trigger memory errors that aren't caught by memtest, so this 
generally gives good results.


Building the kernel with 4 concurrent threads is not an issue for my system; in 
fact I compile a lot and I have never had any issue.


On lunedì 9 maggio 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
On a similar note, badblocks doesn't replicate filesystem-like 
access patterns; it just runs sequentially through the entire 
disk.  This isn't as likely to give bad results, but it's still 
important to know.  In particular, try running it over a dmcrypt 
volume a couple of times (preferably with a different key each 
time, pulling keys from /dev/urandom works well for this), as 
that will result in writing different data.  For what it's 
worth, when I'm doing initial testing of new disks, I always use 
ddrescue to copy /dev/zero over the whole disk, then do it twice 
through dmcrypt with different keys, copying from the disk to 
/dev/null after each pass.  This gives random data on disk as a 
starting point (which is good if you're going to use dmcrypt), 
and usually triggers reallocation of any bad sectors as early as 
possible.


While trying to find a common denominator for my issue I did lots of 
backups of /dev/mapper/cryptroot and I restored them into 
/dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data 
write every time), without any issue (after restoring the backup I always 
check the partition with btrfs check). So disk doesn't seem to be the 
culprit.


On lunedì 9 maggio 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
1. If you have an eSATA port, try plugging your hard disk in 
there and see if things work.  If that works but having the hard 
drive plugged in internally doesn't, then the issue is probably 
either that specific SATA port (in which case your chip-set is 
bad and you should get a new system), or the SATA connector 
itself (or the wiring, but that's not as likely when it's traces 
on a PCB).  Normally I'd suggest just swapping cables and SATA 
ports, but that's not really possible with a laptop.
2. If you have access to a reasonably large flash drive, or to 
a USB to SATA adapter, try that as well, if it works on that but 
not internally (or on an eSATA port), you've probably got a bad 
SATA controller, and should get a new system.


My laptop doesn't have an eSATA port and my only big enough external drive 
is currently used for daily backups, since I fear for data loss.


On lunedì 9 maggio 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
3. Try things without dmcrypt.  Adding extra layers makes it 
harder to determine what is actually wrong.  If it works without 
dmcrypt, try using different parameters for the encryption 
(different ciphers is what I would try first).  If it works 
reliably without dmcrypt, then it's either a bug in dmcrypt 
(which I don't think is very likely), or it's bad interaction 
between dmcrypt and BTRFS.  If it works with some encryption 
parameters but not others, then that will help narrow down where 
the issue is.


On domenica 8 maggio 2016 01:35:16 CEST, Chris Murphy wrote:

You're making the troubleshooting unnecessarily difficult by
continuing to use non-default options. *shrug*

Every single layer you add complicates the setup and troubleshooting.
Of course all of it should work together, many people do. But you're
the one having the problem so in order to demonstrate whether this is
a software bug or hardware problem, you need to test it with the most
basic setup possible --> btrfs on plain partitions and default mount
options.


I will try to recap because you obviously missed my previous e-mail: I 
managed to replicate the irrecoverable corruption bug even with default 
options and no dmcrypt at all. Somehow it was a bit more difficult to 
replicate with default options, so I started to play with different 
combinations to see if there was something that increased the chances of 
corruption. I have the feeling that "autodefrag" increases the 
chances of corruption, but I'm not 100% sure about it. Anyway, 
triggering a reinstall of all packages with "pacaur -S $(pacman 

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-09 Thread Austin S. Hemmelgarn

On 2016-05-07 12:11, Niccolò Belli wrote:

Il 2016-05-07 17:58 Clemens Eisserer ha scritto:

Hi Niccolo,


btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot


Just to be curious - couldn't it be a hardware issue? I use almost the
same setup (compress-force=lzo instead of compress=lzo) on my
laptop for 2-3 years and haven't experienced any issues since
~kernel-3.14 or so.

Br, Clemens Eisserer


Hi,
Which kind of hardware issue? I did a full memtest86 check, a full
smartmontools extended check and even a badblocks -wsv.
If this is really a hardware issue that we can identify I would be more
than happy, because Dell will replace my laptop and this nightmare will
finally be over. I'm open to suggestions.

First, some general advice:
1. It is fully possible to have bad RAM that still passes memtest86 
consistently, and in fact, most of the time this will be the case (if 
you're seeing anything other than the bit-fade test in memtest86 fail, 
then your system probably won't boot fully).  Memtest doesn't replicate 
typical usage patterns very well.  My usual testing for RAM involves not 
just memtest, but also booting into a LiveCD (usually SystemRescueCD), 
pulling down a copy of the kernel source, and then running as many 
concurrent kernel builds as cores, each with as many make jobs as cores 
(so if you've got a quad core CPU (or a dual core with hyperthreading), 
it would be running 4 builds with -j4 passed to make).  GCC seems to 
have memory usage patterns that reliably trigger memory errors that 
aren't caught by memtest, so this generally gives good results. 
Secondarily, if it's a big system and I am not pressed for time, I do a 
quick Gentoo install with Xen, and then spin up twice as many Xen VMs 
as cores and run memtest in those concurrently (this seems to catch 
things a bit more reliably than just a plain memtest).
2. On a similar note, badblocks doesn't replicate filesystem-like access 
patterns; it just runs sequentially through the entire disk.  This isn't 
as likely to give bad results, but it's still important to know.  In 
particular, try running it over a dmcrypt volume a couple of times 
(preferably with a different key each time, pulling keys from 
/dev/urandom works well for this), as that will result in writing 
different data.  For what it's worth, when I'm doing initial testing of 
new disks, I always use ddrescue to copy /dev/zero over the whole disk, 
then do it twice through dmcrypt with different keys, copying from the 
disk to /dev/null after each pass.  This gives random data on disk as a 
starting point (which is good if you're going to use dmcrypt), and 
usually triggers reallocation of any bad sectors as early as possible. 
If I have time and access to an existing system I can connect the disk 
to, I often do testing with fio as well.
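A sketch of that disk-preparation routine (the device name is a placeholder,
and this destroys everything on the disk):

size=$(blockdev --getsize64 /dev/sdX)
# pass 1: zero the raw disk
ddrescue --force -s "$size" /dev/zero /dev/sdX
# passes 2 and 3: write zeros through dm-crypt with throwaway random keys
# (the disk ends up holding random-looking data), reading everything back after each pass
for pass in 1 2; do
    cryptsetup open --type plain -d /dev/urandom /dev/sdX wipe
    ddrescue --force -s "$size" /dev/zero /dev/mapper/wipe
    cryptsetup close wipe
    dd if=/dev/sdX of=/dev/null bs=1M
done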


Now, to slightly more specific advice:
1. If you have an eSATA port, try plugging your hard disk in there and 
see if things work.  If that works but having the hard drive plugged in 
internally doesn't, then the issue is probably either that specific SATA 
port (in which case your chip-set is bad and you should get a new 
system), or the SATA connector itself (or the wiring, but that's not as 
likely when it's traces on a PCB).  Normally I'd suggest just swapping 
cables and SATA ports, but that's not really possible with a laptop.
2. If you have access to a reasonably large flash drive, or to a USB to 
SATA adapter, try that as well, if it works on that but not internally 
(or on an eSATA port), you've probably got a bad SATA controller, and 
should get a new system.
3. Try things without dmcrypt.  Adding extra layers makes it harder to 
determine what is actually wrong.  If it works without dmcrypt, try 
using different parameters for the encryption (different ciphers is what 
I would try first).  If it works reliably without dmcrypt, then it's 
either a bug in dmcrypt (which I don't think is very likely), or it's 
bad interaction between dmcrypt and BTRFS.  If it works with some 
encryption parameters but not others, then that will help narrow down 
where the issue is.
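If you get as far as trying different encryption parameters, a hedged example
of what to vary (the device name is a placeholder):

cryptsetup benchmark                 # compare cipher throughput on this CPU
# try a different cipher/key size than the usual aes-xts-plain64 default, for example:
cryptsetup luksFormat --cipher serpent-xts-plain64 --key-size 512 /dev/sdX2
cryptsetup open /dev/sdX2 cryptroot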



Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-08 Thread Patrik Lundquist
On 7 May 2016 at 18:11, Niccolò Belli  wrote:

> Which kind of hardware issue? I did a full memtest86 check, a full 
> smartmontools extended check and even a badblocks -wsv.
> If this is really a hardware issue that we can identify I would be more than 
> happy, because Dell will replace my laptop and this nightmare will finally 
> be over. I'm open to suggestions.


Well, your hardware differs from a lot of successful installations.
Are you using any power management tweaks?


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-07 Thread Chris Murphy
On Sat, May 7, 2016 at 9:45 AM, Niccolò Belli  wrote:
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.

You're making the troubleshooting unnecessarily difficult by
continuing to use non-default options. *shrug*

Every single layer you add complicates the setup and troubleshooting.
Of course all of it should work together, many people do. But you're
the one having the problem so in order to demonstrate whether this is
a software bug or hardware problem, you need to test it with the most
basic setup possible --> btrfs on plain partitions and default mount
options.



-- 
Chris Murphy


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-07 Thread Niccolò Belli

Il 2016-05-07 17:58 Clemens Eisserer ha scritto:

Hi Niccolo,


btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot


Just to be curious - couldn't it be a hardware issue? I use almost the
same setup (compress-force=lzo instead of compress=lzo) on my
laptop for 2-3 years and haven't experienced any issues since
~kernel-3.14 or so.

Br, Clemens Eisserer


Hi,
Which kind of hardware issue? I did a full memtest86 check, a full 
smartmontools extended check and even a badblocks -wsv.
If this is really a hardware issue that we can identify I would be more 
than happy, because Dell will replace my laptop and this nightmare will 
finally be over. I'm open to suggestions.


Niccolò


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-07 Thread Clemens Eisserer
Hi Niccolo,

> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot

Just to be curious - couldn't it be a hardware issue? I use almost the
same setup (compress-force=lzo instead of compress=lzo) on my
laptop for 2-3 years and haven't experienced any issues since
~kernel-3.14 or so.

Br, Clemens Eisserer


2016-05-07 17:45 GMT+02:00 Niccolò Belli :
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.
>
> [  748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431 move
> len 4294962894 len 16384
> [  748.226206] [ cut here ]
> [  748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
> [  748.229498] invalid opcode:  [#1] PREEMPT SMP
> [  748.231161] Modules linked in: ext4 mbcache jbd2 nls_iso8859_1 nls_cp437
> vfat fat snd_hda_codec_hdmi dell_laptop dcdbas dell_wmi iTCO_wdt
> iTCO_vendor_support intel_rapl x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel arc4 kvm irqbypass psmouse serio_raw pcspkr elan_i2c
> snd_soc_ssm4567 snd_soc_rt286 snd_soc_rl6347a snd_soc_core i2c_hid iwlmvm
> snd_compress snd_pcm_dmaengine ac97_bus mac80211 uvcvideo videobuf2_vmalloc
> btusb videobuf2_memops cdc_ether btrtl usbnet iwlwifi btbcm videobuf2_v4l2
> btintel intel_pch_thermal videobuf2_core i2c_i801 videodev r8152 rtsx_pci_ms
> cfg80211 bluetooth visor media mii memstick joydev evdev mousedev input_leds
> rfkill mac_hid crc16 i915 fan thermal wmi dw_dmac int3403_thermal video
> dw_dmac_core drm_kms_helper snd_soc_sst_acpi i2c_designware_platform
> snd_soc_sst_match
> [  748.237203]  snd_hda_intel 8250_dw i2c_designware_core gpio_lynxpoint
> spi_pxa2xx_platform drm int3402_thermal snd_hda_codec battery tpm_crb
> intel_hid snd_hda_core sparse_keymap fjes snd_hwdep int3400_thermal
> acpi_thermal_rel tpm_tis snd_pcm intel_gtt tpm acpi_als syscopyarea
> sysfillrect snd_timer sysimgblt fb_sys_fops mei_me i2c_algo_bit
> processor_thermal_device kfifo_buf processor snd industrialio acpi_pad ac
> int340x_thermal_zone mei intel_soc_dts_iosf button lpc_ich soundcore shpchp
> sch_fq_codel ip_tables x_tables btrfs xor raid6_pq jitterentropy_rng
> sha256_ssse3 sha256_generic hmac drbg ansi_cprng algif_skcipher af_alg uas
> usb_storage dm_crypt dm_mod sd_mod rtsx_pci_sdmmc atkbd libps2
> crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
> aes_x86_64 lrw gf128mul glue_helper
> [  748.244176]  ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci
> rtsx_pci xhci_hcd i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303
> mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
> [  748.246662] CPU: 0 PID: 2316 Comm: pacman Not tainted 4.5.1-1-ARCH #1
> [  748.249123] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07
> 11/11/2015
> [  748.251576] task: 8800d9d98e40 ti: 8800cec1 task.ti:
> 8800cec1
> [  748.254064] RIP: 0010:[]  []
> memmove_extent_buffer+0x10c/0x110 [btrfs]
> [  748.256600] RSP: 0018:8800cec13c18  EFLAGS: 00010246
> [  748.259120] RAX:  RBX: 88020c01ba40 RCX:
> 0056
> [  748.261631] RDX:  RSI: 88021e40db38 RDI:
> 88021e40db38
> [  748.264166] RBP: 8800cec13c48 R08:  R09:
> 033b
> [  748.266716] R10:  R11: 033b R12:
> eece
> [  748.269267] R13: 00010405 R14: 000104c9 R15:
> 88020c01ba40
> [  748.271818] FS:  7f14d4271740() GS:88021e40()
> knlGS:
> [  748.274392] CS:  0010 DS:  ES:  CR0: 80050033
> [  748.276987] CR2: 01630008 CR3: cffc8000 CR4:
> 003406f0
> [  748.279603] DR0:  DR1:  DR2:
> 
> [  748.282220] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  748.284815] Stack:
> [  748.287422]  e3438cd2 88020c01ba40 00c4
> 002a
> [  748.290082]  006b 03a0 8800cec13ce8
> a02b612c
> [  748.292754]  a02b433d 8800da9ca820 0028
> 8800daa78bd0
> [  748.295441] Call Trace:
> [  748.298104]  [] btrfs_del_items+0x33c/0x4a0 [btrfs]
> [  748.300827]  [] ? btrfs_search_slot+0x90d/0x990 [btrfs]
> [  748.303564]  [] ? btrfs_get_token_8+0x6c/0x130 [btrfs]
> [  748.306311]  [] btrfs_truncate_inode_items+0x649/0xd20
> [btrfs]
> [  748.309071]  [] ?
> btrfs_delayed_inode_release_metadata.isra.1+0x4e/0xf0 [btrfs]
> [  748.311860]  [] btrfs_evict_inode+0x485/0x5d0 [btrfs]
> [  748.314627]  [] evict+0xc5/0x190
> [  748.317412]  [] iput+0x1d9/0x260
> [  748.320199]  [] do_unlinkat+0x199/0x2d0
> [  748.322988]  [] SyS_unlink+0x16/0x20
> [  748.325781]  [] entry_SYSCALL_64_fastpath+0x12/0x6d
> [  748.328584] Code: 41 5e 41 5f 5d c3 48 8b 7f 18 48 89 f2 48 c7 c6 40 44

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-07 Thread Niccolò Belli

btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
So discard is not the culprit. Will try to remove compress=lzo and 
autodefrag and see if it still happens.


[  748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431 
move len 4294962894 len 16384

[  748.226206] [ cut here ]
[  748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
[  748.229498] invalid opcode:  [#1] PREEMPT SMP
[  748.231161] Modules linked in: ext4 mbcache jbd2 nls_iso8859_1 
nls_cp437 vfat fat snd_hda_codec_hdmi dell_laptop dcdbas dell_wmi 
iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel arc4 kvm irqbypass psmouse serio_raw 
pcspkr elan_i2c snd_soc_ssm4567 snd_soc_rt286 snd_soc_rl6347a 
snd_soc_core i2c_hid iwlmvm snd_compress snd_pcm_dmaengine ac97_bus 
mac80211 uvcvideo videobuf2_vmalloc btusb videobuf2_memops cdc_ether 
btrtl usbnet iwlwifi btbcm videobuf2_v4l2 btintel intel_pch_thermal 
videobuf2_core i2c_i801 videodev r8152 rtsx_pci_ms cfg80211 bluetooth 
visor media mii memstick joydev evdev mousedev input_leds rfkill mac_hid 
crc16 i915 fan thermal wmi dw_dmac int3403_thermal video dw_dmac_core 
drm_kms_helper snd_soc_sst_acpi i2c_designware_platform 
snd_soc_sst_match
[  748.237203]  snd_hda_intel 8250_dw i2c_designware_core gpio_lynxpoint 
spi_pxa2xx_platform drm int3402_thermal snd_hda_codec battery tpm_crb 
intel_hid snd_hda_core sparse_keymap fjes snd_hwdep int3400_thermal 
acpi_thermal_rel tpm_tis snd_pcm intel_gtt tpm acpi_als syscopyarea 
sysfillrect snd_timer sysimgblt fb_sys_fops mei_me i2c_algo_bit 
processor_thermal_device kfifo_buf processor snd industrialio acpi_pad 
ac int340x_thermal_zone mei intel_soc_dts_iosf button lpc_ich soundcore 
shpchp sch_fq_codel ip_tables x_tables btrfs xor raid6_pq 
jitterentropy_rng sha256_ssse3 sha256_generic hmac drbg ansi_cprng 
algif_skcipher af_alg uas usb_storage dm_crypt dm_mod sd_mod 
rtsx_pci_sdmmc atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel 
ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper
[  748.244176]  ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci 
rtsx_pci xhci_hcd i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303 
mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common

[  748.246662] CPU: 0 PID: 2316 Comm: pacman Not tainted 4.5.1-1-ARCH #1
[  748.249123] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07 
11/11/2015
[  748.251576] task: 8800d9d98e40 ti: 8800cec1 task.ti: 
8800cec1
[  748.254064] RIP: 0010:[]  [] 
memmove_extent_buffer+0x10c/0x110 [btrfs]

[  748.256600] RSP: 0018:8800cec13c18  EFLAGS: 00010246
[  748.259120] RAX:  RBX: 88020c01ba40 RCX: 
0056
[  748.261631] RDX:  RSI: 88021e40db38 RDI: 
88021e40db38
[  748.264166] RBP: 8800cec13c48 R08:  R09: 
033b
[  748.266716] R10:  R11: 033b R12: 
eece
[  748.269267] R13: 00010405 R14: 000104c9 R15: 
88020c01ba40
[  748.271818] FS:  7f14d4271740() GS:88021e40() 
knlGS:

[  748.274392] CS:  0010 DS:  ES:  CR0: 80050033
[  748.276987] CR2: 01630008 CR3: cffc8000 CR4: 
003406f0
[  748.279603] DR0:  DR1:  DR2: 

[  748.282220] DR3:  DR6: fffe0ff0 DR7: 
0400

[  748.284815] Stack:
[  748.287422]  e3438cd2 88020c01ba40 00c4 
002a
[  748.290082]  006b 03a0 8800cec13ce8 
a02b612c
[  748.292754]  a02b433d 8800da9ca820 0028 
8800daa78bd0

[  748.295441] Call Trace:
[  748.298104]  [] btrfs_del_items+0x33c/0x4a0 [btrfs]
[  748.300827]  [] ? btrfs_search_slot+0x90d/0x990 
[btrfs]
[  748.303564]  [] ? btrfs_get_token_8+0x6c/0x130 
[btrfs]
[  748.306311]  [] 
btrfs_truncate_inode_items+0x649/0xd20 [btrfs]
[  748.309071]  [] ? 
btrfs_delayed_inode_release_metadata.isra.1+0x4e/0xf0 [btrfs]
[  748.311860]  [] btrfs_evict_inode+0x485/0x5d0 
[btrfs]

[  748.314627]  [] evict+0xc5/0x190
[  748.317412]  [] iput+0x1d9/0x260
[  748.320199]  [] do_unlinkat+0x199/0x2d0
[  748.322988]  [] SyS_unlink+0x16/0x20
[  748.325781]  [] entry_SYSCALL_64_fastpath+0x12/0x6d
[  748.328584] Code: 41 5e 41 5f 5d c3 48 8b 7f 18 48 89 f2 48 c7 c6 40 
44 36 a0 e8 06 90 fa ff 0f 0b 48 8b 7f 18 48 c7 c6 08 44 36 a0 e8 f4 8f 
fa ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb
[  748.331558] RIP  [] 
memmove_extent_buffer+0x10c/0x110 [btrfs]

[  748.334473]  RSP 
[  748.356077] ---[ end trace 9bfb28800ab52273 ]---
[  748.359042] note: pacman[2316] exited with preempt_count 2
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-06 Thread Niccolò Belli
I formatted the partition and copied the content of my previous rootfs to 
it. There is no dmcrypt now and the mount options are the defaults, except 
for noatime. After a single boot I got the very same problem as before (fs 
corrupted and an infinite loop when doing btrfs check --repair).


I wanted to replicate the results, so I tried once again, and since then I 
have only experienced minor corruption, correctly resolved by repair. But 
during a pacman upgrade, which triggered snapper pre/post snapshots, the 
system hung and I found this in the logs:


mag 06 10:31:15 arch-laptop plasmashell[873]: requesting unexisting screen 
2
mag 06 10:31:18 arch-laptop dbus[418]: [system] Activating service 
name='org.opensuse.Snapper' (using servicehelper)
mag 06 10:31:18 arch-laptop dbus[418]: [system] Successfully activated 
service 'org.opensuse.Snapper'

mag 06 10:31:20 arch-laptop kernel: [ cut here ]
mag 06 10:31:20 arch-laptop kernel: kernel BUG at fs/btrfs/ctree.h:2693!

Still no major corruption found since my second attempt.

Niccolò
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-05 Thread Omar Sandoval
On Thu, May 05, 2016 at 12:36:52PM +0200, Niccolò Belli wrote:
> On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> > I suggest using defaults for starters. The only thing in that list
> > that needs to be there is either subvolid or subvol, not both. Add in
> > the non-default options once you've proven the defaults are working,
> > and add them one at a time.
> 
> Yes I read your previous suggestion and I already dropped subvolid, but
> since the problem already happened I left it in the mail for completeness.
> Anyway the culprit here is genfstab and that's probably what a beginner is
> going to use when installing a distro:
> https://wiki.archlinux.org/index.php/beginners'_guide#fstab
> 

The redundant subvolid doesn't hurt, the kernel will just check that it
matches the passed subvol (see [1]). genfstab probably just pulls the
options out of /proc/mounts or /proc/self/mountinfo, and since we show
both, that's how it gets in fstab. If it was actually a problem, there
would be a clear message in dmesg.

1: 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb289b7be62db84b9630ce00367444c810cada2c
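
For illustration, a generated entry carrying both options might look like
this (hypothetical line, values taken from the mount options quoted earlier
in the thread):

/dev/mapper/cryptroot  /  btrfs  rw,noatime,subvolid=257,subvol=/@  0 0

As long as subvolid=257 really is the id of /@, the two agree and the mount
proceeds as usual.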

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-05 Thread Niccolò Belli

On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:

I suggest using defaults for starters. The only thing in that list
that needs to be there is either subvolid or subvol, not both. Add in
the non-default options once you've proven the defaults are working,
and add them one at a time.


Yes I read your previous suggestion and I already dropped subvolid, but 
since the problem already happened I left it in the mail for completeness.
Anyway the culprit here is genfstab and that's probably what a beginner is 
going to use when installing a distro: 
https://wiki.archlinux.org/index.php/beginners'_guide#fstab



Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).


The firmware is old if I understand the naming scheme used by Dell. It
says EXT49D0Q is current.

http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH


According to this 
(http://forum.notebookreview.com/threads/2015-xps-13-ssd-fw-problem-with-m-2-samsung-pm851.770501/) 
the firmware you linked is for the mSATA version of the drive, not the M.2 
one. EXT25D0Q seems to be the very latest one for my drive.
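
For what it's worth, the firmware revision a drive is actually running can be
double-checked with smartctl (device name is a placeholder):

smartctl -i /dev/sda | grep -i firmware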



I advise using all defaults for everything for
now, otherwise it's anyone's guess what you're running into.


On Thursday 5 May 2016 06:12:28 CEST, Qu Wenruo wrote:
Would it be OK for you to test your btrfs on a plain ssd, 
without encryption?
And just as Chris Murphy said, reducing the mount options is also a 
pretty good debugging starting point.


Ok, I will remove dmcrypt, discard, compress=lzo, autodefrag and see what 
happens.
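
For example, the resulting stripped-down fstab entry could look something
like this (device and subvolume names are placeholders, everything else left
at the defaults):

/dev/sdX2  /  btrfs  defaults,subvol=/@  0 0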



I made a copy of /dev/mapper/cryptroot with dd on an external drive and
I ran btrfs check on it (btrfs-progs 4.5.2):
https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)


Checked, but seems the output is truncated?


No, I didn't truncate the btrfs check output because it wasn't endless. I 
just truncated the repair output.


I also have something new to report. Do you remember when I said that my 
screen went black and so I had to forcibly power off the system? Something 
similar happened today, and since I have enabled the magic SysRq keys in the 
meantime I have been able to recover this from the logs:


mag 05 11:55:51 arch-laptop kdeinit5[960]: Registering 
"org.kde.StatusNotifierItem-1060-1/StatusNotifierItem" to system tray

mag 05 11:55:51 arch-laptop obexd[1098]: OBEX daemon 5.39
mag 05 11:55:51 arch-laptop dbus-daemon[920]: Successfully activated 
service 'org.bluez.obex'

mag 05 11:55:51 arch-laptop systemd[898]: Started Bluetooth OBEX service.
mag 05 11:55:51 arch-laptop korgac[1044]: log_kidentitymanagement: 
IdentityManager: There was no default identity. Marking first one as 
default.
mag 05 11:55:51 arch-laptop kernel: BUG: unable to handle kernel paging 
request at 00017d11
mag 05 11:55:51 arch-laptop kernel: IP: [] 
anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: PGD 0 
mag 05 11:55:51 arch-laptop kernel: Oops:  [#1] PREEMPT SMP 
mag 05 11:55:51 arch-laptop kernel: Modules linked in: rfcomm(+) visor bnep 
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core 
videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152 
crc16 mii joydev mousedev nvr
mag 05 11:55:51 arch-laptop kernel:  mei_me syscopyarea sysfillrect snd 
sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan 
intel_hid sparse_keymap int3403_thermal video processor_thermal_device 
dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:51 arch-laptop kernel:  lrw gf128mul glue_helper ablk_helper 
cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci

mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM TTY layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM socket layer 
initialized

mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM ver 1.11
mag 05 11:55:51 arch-laptop kernel:  xhci_hcd
mag 05 11:55:51 arch-laptop kernel:  i8042 serio sdhci_acpi sdhci led_class 
mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore 
usb_common
mag 05 11:55:51 arch-laptop kernel: CPU: 0 PID: 351 Comm: systemd-udevd Not 
tainted 4.5.1-1-ARCH #1
mag 05 11:55:51 arch-laptop kernel: Hardware name: Dell Inc. XPS 13 
9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:51 arch-laptop kernel: task: 88021347d580 ti: 
880211f8c000 task.ti: 880211f8c000
mag 05 11:55:51 arch-laptop kernel: RIP: 0010:[]  
[] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: RSP: 0018:880211f8fd68  EFLAGS: 
00010206
mag 05 11:55:51 arch-laptop kernel: RAX: 8800da2f4820 RBX: 
8800bb59ce40 RCX: 8800da2f4830
mag 05 11:55:51 arch-laptop kernel: RDX: 8800da2f4828 RSI: 
8800374404a0 RDI: 8800c58dfa40
mag 05 11:55:51 arch-laptop kernel: RBP: 880211f8fdb8 R08: 
00017c79 R09: 0007f55e2059
mag 05 11:55:51 arch-laptop kernel: R10: 0007f55e2053 R11: 
8800c58dfa40 R12: 880037440460
mag 05 11:55:51 arch-laptop kernel: 

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-04 Thread Chris Murphy
On Wed, May 4, 2016 at 5:21 PM, Niccolò Belli  wrote:

> rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@

I suggest using defaults for starters. The only thing in that list
that needs to be there is either subvolid or subvol, not both. Add in
the non-default options once you've proven the defaults are working,
and add them one at a time.



> I have the whole rootfs encrypted, including boot. I followed these steps:
> https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
>
> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).

The firmware is old if I understand the naming scheme used by Dell. It
says EXT49D0Q is current.

http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH

If you need to update, you may be best off doing a whole device trim,
which is easiest done with mkfs.btrfs pointed at the whole device. I
wouldn't trust any data on the drive after a firmware update so I'd
start over entirely from scratch, new partition map, new everything.
So the way to do this is:

mkfs.btrfs /dev/sda
wipefs -a /dev/sda

That way the btrfs magic is removed, and now you can partition it, set up
dmcrypt, etc. I advise using all defaults for everything for now, otherwise
it's anyone's guess what you're running into.
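
A rough sketch of the rebuild after that, assuming a single encrypted root
partition (device and mapping names are placeholders, not a prescription):

cryptsetup luksFormat /dev/sdX2       # new LUKS container on the fresh partition
cryptsetup open /dev/sdX2 cryptroot   # map it
mkfs.btrfs /dev/mapper/cryptroot      # default mkfs options
mount /dev/mapper/cryptroot /mnt      # default mount options, no fstab extras yet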


Off topic, but at least gmail users see your posts go to spam because
your domain is configured to disallow relaying. Most mail services
ignore this request by the domain, but Google honors it, so no amount of
training will keep your email out of spam. This is what's in your emails
that's causing the problem:

   dmarc=fail (p=QUARANTINE dis=NONE) header.from=linuxsystems.it

http://webmasters.stackexchange.com/questions/76765/sent-emails-pass-spf-and-dkim-but-fail-dmarc-when-received-by-gmail
http://www.pcworld.com/article/2141120/yahoo-email-antispoofing-policy-breaks-mailing-lists.html



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html