Re: Kernel lockup, might be helpful log.

2015-12-14 Thread Birdsarenice
I've no need for a fix. I know exactly what the underlying cause is: 
Those Seagate 8TB Archive drives and their known compatibility issues 
with some kernel versions. I just shared the log because it's a 
situation that btrfs handles very, very poorly, and the error handling 
could be improved. If a drive is unresponsive, btrfs really should be 
able to just cease using it and treat it as failed, or even unmount the 
entire filesystem - either would be preferable to what actually happens 
(at least for me), a system hang that leaves nothing functional whatsoever.


I've 'solved' it by removing all drives of that model. It's been running 
without issue since I did that.


On 14/12/15 07:36, Chris Murphy wrote:

I can't help with the call traces. But several (not all) of the hard
resetting link messages are hallmark cases where the SCSI command
timer default of 30 seconds looks like it's being hit while the drive
itself is hung up doing a sector read recovery (multiple attempts).
It's worth seeing if 'smartctl -l scterc <device>' will report back that
SCT ERC is supported and that it's just disabled, meaning you can change
this to something sane with 'smartctl -l scterc,70,70 <device>' (the
values are in tenths of a second, so 7 seconds), which will
make the drive time out before the Linux kernel command timer. That'll
let Btrfs do the right thing, rather than constantly getting poked in
both eyes by link resets.
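
The timing relationship above can be sketched in a few lines of Python. This is only an illustration, not part of the original advice: the function names are made up for the example, and it assumes the kernel command timer is exposed at /sys/block/<dev>/device/timeout (in seconds) while smartctl reports SCT ERC in tenths of a second.

```python
from pathlib import Path


def read_command_timeout(dev: str, sysfs_root: str = "/sys") -> int:
    """Return the kernel SCSI command timer for a disk, in seconds.

    `sysfs_root` is a parameter only so the function can be exercised
    against a fake directory tree.
    """
    node = Path(sysfs_root, "block", dev, "device", "timeout")
    return int(node.read_text().strip())


def drive_gives_up_first(erc_deciseconds: int, command_timeout_s: int) -> bool:
    """True if the drive's SCT ERC limit expires before the kernel timer.

    A value of 0 is treated as "ERC disabled": the drive may keep
    retrying for an unbounded time, so the kernel timer fires first.
    """
    if erc_deciseconds <= 0:
        return False
    return erc_deciseconds / 10 < command_timeout_s
```

With the values from this message, drive_gives_up_first(70, 30) is true: the drive reports the read error after 7 seconds, well inside the 30 second command timer, so the filesystem sees an I/O error rather than a link reset.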


Chris Murphy



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Kernel lockup, might be helpful log.

2015-12-13 Thread Birdsarenice
I've finally finished deleting all those nasty unreliable Seagate drives 
from my array. During the process I crashed my server - over, and over, 
and over. Completely gone - screen blank, controls unresponsive, no 
network activity (no, I don't have root on btrfs - data only). Most 
annoying, but I think btrfs survived it all somehow - it's scrubbing now.


Meanwhile, I did get lucky: At one crash I happened to be logged in and 
was able to hit dmesg seconds before it went completely. So what I have 
here is information that looks like it'll help you track down a 
rarely-encountered and hard-to-reproduce bug which can cause the system 
to lock up completely in the event of certain types of hard drive failure.
It might be nothing, but perhaps someone will find it of use, because
this would be a tricky one both to reproduce and to get a good error
report for if it did occur.


I see an 'invalid opcode' error in here, that's pretty unusual - and 
again it even gives a file name and line number to look at. The root 
cause of all my issues is the NCQ issue with Seagate 8TB archive drives, 
which is Someone Else's Problem - but I think some good can come of 
this, as these exotic forms of corruption and weird drive semi-failures 
have revealed ways in which btrfs's error handling could be made more 
graceful.


Meanwhile I remain impressed that btrfs appears to have kept all my data 
intact even through all these issues.

[11668.697976] BTRFS info (device sde1): relocating block group 5932520046592 flags 17
[11676.977183] BTRFS info (device sde1): found 20 extents
[11686.138376] BTRFS info (device sde1): found 20 extents
[11686.567242] BTRFS info (device sde1): relocating block group 5935741272064 flags 17
[11695.452025] BTRFS info (device sde1): found 17 extents
[11704.627191] BTRFS info (device sde1): found 17 extents
[11705.966792] BTRFS info (device sde1): relocating block group 5938962497536 flags 17
[11715.343790] BTRFS info (device sde1): found 15 extents
[11724.219660] BTRFS info (device sde1): found 15 extents
[11724.910970] BTRFS info (device sde1): relocating block group 5940036239360 flags 17
[11733.289804] BTRFS info (device sde1): found 22 extents
[11741.538676] BTRFS info (device sde1): found 22 extents
[11742.019752] BTRFS info (device sde1): relocating block group 5941109981184 flags 17
[11751.676514] BTRFS info (device sde1): found 14 extents
[11759.404371] [ cut here ]
[11759.404439] kernel BUG at ../fs/btrfs/extent-tree.c:1832!
[11759.404514] invalid opcode:  [#1] PREEMPT SMP 
[11759.404600] Modules linked in: xt_nat nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables xt_conntrack xt_tcpudp ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_filter iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables af_packet 
bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul 
crc32c_intel raid6_pq aesni_intel aes_x86_64 lrw gf128mul iTCO_wdt glue_helper 
ablk_helper iTCO_vendor_support cryptd pcspkr i2c_i801 ib_mthca lpc_ich tpm_tis 
8250_fintek ie31200_edac mfd_core shpchp battery edac_core thermal tpm video 
fan button processor hid_generic usbhid uas usb_storage amdkfd amd_iommu_v2 
radeon igb dca i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
[11759.405914]  fb_sys_fops ttm drm xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore 
usb_common e1000e ptp pps_core fjes vhost_net tun vhost macvtap macvlan sg 
rpcrdma sunrpc rdma_cm iw_cm ib_ipoib ib_cm ib_sa ib_umad ib_mad ib_core ib_addr
[11759.406328] CPU: 2 PID: 2060 Comm: btrfs Not tainted 4.3.0-2-default #1
[11759.406414] Hardware name: FUJITSU PRIMERGY TX100 S3P/D3009-B1, BIOS 
V4.6.5.3 R1.10.0 for D3009-B1x 12/18/2012
[11759.406555] task: 88042f832040 ti: 88041cae4000 task.ti: 
88041cae4000
[11759.406659] RIP: 0010:[]  [] 
insert_inline_extent_backref+0xc6/0xd0 [btrfs]
[11759.406815] RSP: 0018:88041cae7830  EFLAGS: 00010293
[11759.406889] RAX:  RBX:  RCX: 0001
[11759.406986] RDX: 8800 RSI: 0001 RDI: 
[11759.407085] RBP: 88041cae7890 R08: 4000 R09: 88041cae7748
[11759.407184] R10:  R11: 0003 R12: 880412615800
[11759.407283] R13:  R14:  R15: 8800c92aef50
[11759.407383] FS:  7f2e3b1678c0() GS:88042fd0() 
knlGS:
[11759.407497] CS:  0010 DS:  ES:  CR0: 80050033
[11759.407576] CR2: 55f473f59f28 CR3: 0004180be000 CR4: 001406e0
[11759.407675] Stack:
[11759.407706]   0102  

[11759.407831]  0001 88041170d800 32b6 
88041170d800
[11759.407949]  88030f0203b0 8800c92aef50 0102 
88040b22e000
[11759.408069] Call Trace:
[11759.408127]  

Re: btrfs crashing the kernel with Seagate 8TB SMR drives.

2015-12-04 Thread Birdsarenice
I did suspect that NCQ may be involved, but I had no clear evidence - 
until I noticed that my drives had also incremented the 'end to end 
error' count in SMART, which does match accounts of the NCQ issue. That 
suggests there are two interlinked issues: The issue with those Seagate 
drives and NCQ, combined with btrfs causing a kernel lockup under certain 
error circumstances when it would be more appropriate to remount ro. 
Looks like the NCQ issue is already being addressed, but I did uncover a 
new and unusual error condition that btrfs needs to handle - and looking 
at the patch, it's a trivial thing to fix, so bothering the mailing list 
with it has made btrfs better in a tiny way. I don't usually report 
errors, assuming that people far more capable than I are already on top 
of them, but when I saw one that gave a description right down to the 
line number I thought it might be something that could be looked into 
very easily.


I'm still impressed with the resilience of btrfs though - after all this 
abuse of crashing during rebalancing, corrupted filesystem structures 
and out-of-order commands, all my data is still undamaged. No 
conventional RAID could have endured that.


Thanks for the patch, but I'd rather not fiddle with the kernel and have 
to repeat it every time a new version comes out. I'll just disable NCQ 
until the fix is mainlined and SUSE incorporates it.
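
For anyone wanting to do the same, here is a minimal sketch of one way to disable NCQ at runtime. It assumes the usual libata sysfs layout, where writing 1 to /sys/block/<dev>/device/queue_depth turns off command queueing for that disk. The function name is hypothetical, the write needs root, and the setting does not survive a reboot (the libata.force=noncq boot parameter is the persistent alternative).

```python
from pathlib import Path


def disable_ncq(dev: str, sysfs_root: str = "/sys") -> int:
    """Set the queue depth of `dev` to 1 and return the previous depth.

    `sysfs_root` is a parameter only so the function can be tried
    against a scratch directory instead of the real sysfs.
    """
    node = Path(sysfs_root, "block", dev, "device", "queue_depth")
    previous = int(node.read_text().strip())
    node.write_text("1\n")  # a depth of 1 means no queued commands
    return previous
```

Returning the previous depth makes it easy to restore the original value once a fixed kernel is installed.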


On 04/12/15 15:21, Robert Krig wrote:

As Chris mentioned, check out the Bug report here:
https://bugzilla.kernel.org/show_bug.cgi?id=93581


I have an 8TB SMR drive and the kernel was reporting drive errors.
Switching to kernel 3.16 (standard Debian Jessie kernel) fixed it for me
(for the moment).

From what I read in that kernel bug report, the patch has been submitted
for kernel 4.4.

On 03.12.2015 19:07, Codebird wrote:

I've got a nice bug for you - because I can offer you what everyone
likes to see, a precise error message.

I've got a btrfs filesystem spread over six devices, RAID1 mode. Four
of these are Seagate 8TB archive drives - those SMR ones that a few
others have reported failing when used with btrfs. I've had that issue
too, and I just can't explain why, other than to say that it only
occurs when using them on my mainboard SATA ports, not via USB dock.
But that's not what I'm reporting - that's just the source of the
problem that causes the crash I am reporting.

The crash occurs when scrubbing, after some time and some terabytes -
or possibly just when reading a certain place, I'm not sure - and it
gives this helpful error left on the screen along with a system so
unresponsive numlock won't flash:

BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO
failure
BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO
failure
BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
IO failure
BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
IO failure
BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
IO failure
 BTRFS: assertion failed:
f(fs_info->sb->s_flags & MS  
---[ cut here ]
kernel BUG at ../fs/btrfs/ctree.h:4057!

Not sure if some of those 5 might be 6, as I was in a hurry to get it
back up both times and just got a blurry photo. But it looks to me
like there might be a chunk of code that doesn't handle a hardware
fault - rather than cleanly return an error it's causing the kernel to
hang entirely. I've managed to get this to happen twice now, so it's
certainly something worth looking into. This is on SUSE tumbleweed,
with kernel 4.3.0-2-default.
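
On the doubt about whether those digits are 5 or 6: on Linux, errno 5 is EIO, which matches the "IO failure" text, whereas 6 would be ENXIO. A small sketch for tallying such lines from a saved log follows. The regular expression is written against the lines quoted above and is an assumption about their format, not something taken from the btrfs sources.

```python
import errno
import re
from collections import Counter

# Matches lines such as:
#   BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO failure
BTRFS_ERROR = re.compile(
    r"BTRFS: Error \(device (?P<dev>\S+)\) in\s+(?P<func>\w+):(?P<line>\d+): "
    r"errno=-(?P<errno>\d+)"
)


def summarize(log: str) -> Counter:
    """Count btrfs error lines by (device, function, symbolic errno)."""
    counts = Counter()
    for m in BTRFS_ERROR.finditer(log):
        name = errno.errorcode.get(int(m["errno"]), "unknown")
        counts[(m["dev"], m["func"], name)] += 1
    return counts
```

Because the pattern stops at the errno value, it still matches when a mail client has wrapped "IO failure" onto the next line.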