Re: Kernel crash on mount after SMR disk trouble

2016-06-11 Thread Jukka Larja

11.6.2016, 19.30, Chris Murphy wrote:

> On Sat, Jun 11, 2016 at 6:40 AM, Jukka Larja  wrote:
>
>> 11.6.2016, 15.30, Chris Murphy wrote:
>>
>>> On Fri, Jun 10, 2016 at 9:11 PM, Jukka Larja wrote:
>>>
>>>> I understand that usebackuproot requires kernel >= 4.6. I probably
>>>> won't be installing a custom kernel, but if I still have the array
>>>> in its current state when 4.6 becomes available in Debian Stretch,
>>>> I'll give it a try.
>>>
>>> It's the "recovery" mount option in older kernels.
>>
>> That didn't work, one of the first things I tried. It crashes just
>> like without it.
>
> -o ro,recovery is quite a bit more tolerant in my experience. While
> it's not great to in effect have a read-only file system, it's a lot
> easier to get data off of if necessary, rather than resorting to
> 'btrfs restore'.


Read-only mounting works even without recovery. My current plan is to 
copy most of the data (I'll probably skip snapshots, even though that 
defeats part of the purpose of backups) once I get new disks. I have 
also run --repair, but that didn't have any effect.
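
Roughly what I have in mind, as a sketch (the device, target and 
snapshot paths are placeholders):

  mount -o ro /dev/sda1 /mnt/Allosaurus
  # copy everything except the snapshot directory; -a keeps ownership
  # and timestamps, though not btrfs-specific properties
  rsync -a --exclude='/snapshots/' /mnt/Allosaurus/ /mnt/newdisks/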


--
 ...Elämälle vierasta toimintaa...
 Jukka Larja, jla...@iki.fi, 0407679919

"Those who fail to learn history are doomed to repeat it; those who fail to 
learn it correctly -- why they are simply doomed."

- Andromeda -


Re: Kernel crash on mount after SMR disk trouble

2016-06-11 Thread Chris Murphy
On Sat, Jun 11, 2016 at 6:40 AM, Jukka Larja  wrote:
> 11.6.2016, 15.30, Chris Murphy wrote:
>
>> On Fri, Jun 10, 2016 at 9:11 PM, Jukka Larja wrote:
>>
>>> I understand that usebackuproot requires kernel >= 4.6. I probably
>>> won't be installing a custom kernel, but if I still have the array
>>> in its current state when 4.6 becomes available in Debian Stretch,
>>> I'll give it a try.
>>
>> It's the "recovery" mount option in older kernels.
>
> That didn't work, one of the first things I tried. It crashes just
> like without it.

-o ro,recovery is quite a bit more tolerant in my experience. While
it's not great to in effect have a read-only file system, it's a lot
easier to get data off of if necessary, rather than resorting to
'btrfs restore'.
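
As a rough sketch (device and rescue path are placeholders):

  # the more tolerant read-only mount
  mount -o ro,recovery /dev/sda1 /mnt
  # last resort: pull files out without mounting at all;
  # -v lists files as they are recovered
  btrfs restore -v /dev/sda1 /rescue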


-- 
Chris Murphy


Re: Kernel crash on mount after SMR disk trouble

2016-06-11 Thread Jukka Larja

11.6.2016, 15.30, Chris Murphy wrote:


> On Fri, Jun 10, 2016 at 9:11 PM, Jukka Larja  wrote:
>
>> I understand that usebackuproot requires kernel >= 4.6. I probably
>> won't be installing a custom kernel, but if I still have the array in
>> its current state when 4.6 becomes available in Debian Stretch, I'll
>> give it a try.
>
> It's the "recovery" mount option in older kernels.


That didn't work, one of the first things I tried. It crashes just like without it.

--
 ...Elämälle vierasta toimintaa...
 Jukka Larja, jla...@iki.fi, 0407679919

"BTW: You won't get that extra point if you plagiate the feedback. (We will 
run automatic plagiation checkers... ;)"

- Aki Hiisilä, news.tky.hut.fi: opinnot.as.as0101 -


Re: Kernel crash on mount after SMR disk trouble

2016-06-11 Thread Chris Murphy
On Fri, Jun 10, 2016 at 9:11 PM, Jukka Larja  wrote:


> I understand that usebackuproot requires kernel >= 4.6. I probably won't be
> installing a custom kernel, but if I still have the array in its current
> state when 4.6 becomes available in Debian Stretch, I'll give it a try.

It's the "recovery" mount option in older kernels.


-- 
Chris Murphy


Re: Kernel crash on mount after SMR disk trouble

2016-06-10 Thread Jukka Larja

10.6.2016, 23.20, Henk Slager wrote:

> On Sat, May 14, 2016 at 10:19 AM, Jukka Larja  wrote:
>
>> In short:
>>
>> I added two 8TB Seagate Archive SMR disks to a btrfs pool and tried
>> to delete one of the old disks. After some errors I ended up with a
>> file system that can be mounted read-only, but crashes the kernel if
>> mounted normally. Tried btrfs check --repair (which noted that the
>> space cache needs to be zeroed) and zeroing the space cache (via a
>> mount parameter), but that didn't change anything.
>>
>> Longer version:
>>
>> I was originally running Debian Jessie with some pretty recent
>> kernel (maybe 4.4), but somewhat older btrfs tools. After the
>> trouble started, I tried
>
> You should have at least kernel 4.4; the critical patch for
> supporting this drive was added in 4.4-rc3 or 4.4-rc4, I don't
> remember exactly. It might only work if you somehow disable NCQ
> completely in your Linux system (kernel and more) or use a HW
> chipset/bridge that does that for you.


After the crash I tracked the issue somewhat and found a discussion 
about a very similar issue (starting with drives failing with dd or 
badblocks and ending, after several patches, with the drives working 
everywhere except maybe in Btrfs in certain cases). As far as I could 
tell, the 4.5 kernel has all the patches from that discussion, but I may 
have missed something that wasn't mentioned there.



>> updating (now running kernel 4.5.1 and tools 4.4.1). I checked the
>> new disks with badblocks (no problems found), but based on some
>> googling, Seagate's SMR disks seem to have various problems, so the
>> root cause is probably one type or another of disk errors.
>
> Seagate provides a special variant of the Linux ext4 file system
> that should play well with their SMR drives. The advice is also not
> to use this drive in an array setup; the risk is way too high that
> they can't keep up with the demands of the higher layers and then
> get resets or their FW crashes. You should also have had a look at
> your system's and the drive's timeouts (see SCT ERC). To summarize:
> adding those drives to a btrfs RAID array is asking for trouble.


Increasing timeouts didn't help with the drive. The array freezes when 
the drive drops out, then there's a crash when the timeout occurs. It 
doesn't matter if the drive has come back in the meantime (the drive 
doesn't return with the same /dev/sdX, though I don't know if that 
matters for Btrfs).


I always thought that the problem with these drives was supposed to be 
bad performance and a worse-than-usual ability to handle power going 
out. My use case is quite light from a bytes-written point of view, so I 
didn't expect trouble. Of course, doing the initial add + balance isn't 
light at all.

What I didn't expect is what are essentially write errors. A pity, since 
the disks are dirt cheap compared to the alternatives and I really don't 
care about performance.



> I am using one such drive with an Intel J1900 SoC (Atom, SATA2) and it
> works, although I still get the typical error occasionally. As it is
> just a btrfs receive target, just one fs (dup/dup/single) for the
> whole drive, all CoW, it survives those lockups or crashes; I just
> restart the board+drive. In general, reading back multi-TB ro
> snapshots works fine and is on par with Gbps LAN speeds.


I'll probably test those drives as a target for DVR backups once I get 
them out of the array (still waiting for new drives with which to start 
over; then I'll just tear down the old array).



> Indeed, the kernel should not crash in such a case. It is not clear
> whether you run a 4.5.1 or a 4.5.0 kernel in kernel.org terminology,
> but newer than 4.5.x probably does not help in this case.
> You could try to mount with usebackuproot and then see if you can
> get it writable, after setting long timeout values for the drive. If
> that works, remove those two SMRs from the array ASAP.


I understand that usebackuproot requires kernel >= 4.6. I probably won't be 
installing a custom kernel, but if I still have the array in its current 
state when 4.6 becomes available in Debian Stretch, I'll give it a try.


--
 ...Elämälle vierasta toimintaa...
 Jukka Larja, jla...@iki.fi, 0407679919

"... on paper looked like a great chip (10 GFs at 1.2 GHZ whith 35W"
"It's a mystery to me why people continue to use silicon - processors on 
paper are always faster and cooler :-)"

- lubemark and Richard Cownie on RWT forums -



Re: Kernel crash on mount after SMR disk trouble

2016-06-10 Thread Henk Slager
On Sat, May 14, 2016 at 10:19 AM, Jukka Larja  wrote:
> In short:
>
> I added two 8TB Seagate Archive SMR disks to a btrfs pool and tried to
> delete one of the old disks. After some errors I ended up with a file
> system that can be mounted read-only, but crashes the kernel if
> mounted normally. Tried btrfs check --repair (which noted that the
> space cache needs to be zeroed) and zeroing the space cache (via a
> mount parameter), but that didn't change anything.
>
> Longer version:
>
> I was originally running Debian Jessie with some pretty recent kernel (maybe
> 4.4), but somewhat older btrfs tools. After the trouble started, I tried

You should have at least kernel 4.4; the critical patch for supporting
this drive was added in 4.4-rc3 or 4.4-rc4, I don't remember exactly.
It might only work if you somehow disable NCQ completely in your Linux
system (kernel and more) or use a HW chipset/bridge that does that for
you.
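
A sketch of the NCQ knobs I mean (sdX is a placeholder):

  # per drive, at runtime: queue depth 1 effectively turns NCQ off
  echo 1 > /sys/block/sdX/device/queue_depth
  # or for all ATA devices, via the kernel command line
  libata.force=noncq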

> updating (now running kernel 4.5.1 and tools 4.4.1). I checked the new disks
> with badblocks (no problems found), but based on some googling, Seagate's
> SMR disks seem to have various problems, so the root cause is probably one
> type or another of disk errors.

Seagate provides a special variant of the Linux ext4 file system that
should play well with their SMR drives. The advice is also not to use
this drive in an array setup; the risk is way too high that they can't
keep up with the demands of the higher layers and then get resets or
their FW crashes. You should also have had a look at your system's and
the drive's timeouts (see SCT ERC). To summarize: adding those drives
to a btrfs RAID array is asking for trouble.
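
For example, roughly (sdX is a placeholder):

  # query the drive's error recovery timeout (SCT ERC)
  smartctl -l scterc /dev/sdX
  # set it to 7 seconds (70 deciseconds) for reads and writes
  smartctl -l scterc,70,70 /dev/sdX
  # and raise the kernel's SCSI command timeout (in seconds) to match
  echo 180 > /sys/block/sdX/device/timeout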

I am using one such drive with an Intel J1900 SoC (Atom, SATA2) and it
works, although I still get the typical error occasionally. As it is
just a btrfs receive target, just one fs (dup/dup/single) for the whole
drive, all CoW, it survives those lockups or crashes; I just restart
the board+drive. In general, reading back multi-TB ro snapshots works
fine and is on par with Gbps LAN speeds.
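
Roughly how such a receive target is set up (device, paths and snapshot
names are placeholders):

  # one fs for the whole drive: metadata/system dup, data single
  mkfs.btrfs -m dup -d single /dev/sdX
  # then stream read-only snapshots over from the source machine
  btrfs send /pool/snaps/backup-20160610 | \
      ssh target btrfs receive /mnt/archive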

> Here's the output of btrfs fi show:
>
> Label: none  uuid: 8b65962d-0982-449b-ac6f-1acc8397ceb9
> Total devices 12 FS bytes used 13.15TiB
> devid1 size 3.64TiB used 3.36TiB path /dev/sde1
> devid2 size 3.64TiB used 3.36TiB path /dev/sdg1
> devid3 size 3.64TiB used 3.36TiB path /dev/sdh1
> devid4 size 3.64TiB used 3.34TiB path /dev/sdf1
> devid5 size 1.82TiB used 1.44TiB path /dev/sdi1
> devid6 size 1.82TiB used 1.54TiB path /dev/sdl1
> devid7 size 1.82TiB used 1.51TiB path /dev/sdk1
> devid8 size 1.82TiB used 1.54TiB path /dev/sdj1
> devid9 size 3.64TiB used 3.31TiB path /dev/sdb1
> devid   10 size 3.64TiB used 3.36TiB path /dev/sda1
> devid   11 size 7.28TiB used 168.00GiB path /dev/sdc1
> devid   12 size 7.28TiB used 168.00GiB path /dev/sdd1
>
> The last two devices (11 and 12) are the new disks. After adding them,
> I first copied some new data in (about 130 GB), which seemed to go
> fine. Then I tried to remove disk 5. After some time (about 30 GiB
> written to 11 and 12), there were some errors, and disk 11 or 12
> dropped out and the fs went read-only. After some trouble-shooting
> (googling), I decided the new disks were too iffy to trust and tried
> to remove them.
>
> I don't remember exactly what errors I got, but the device delete
> operation was interrupted due to errors at least once or twice before
> more serious trouble began. In between the attempts I updated the
> HBA's (an LSI 9300) firmware. After the final device delete attempt,
> the end result was that attempting to mount causes the kernel to
> crash. I then tried updating the kernel and running check --repair,
> but that hasn't helped. Mounting read-only seems to work perfectly,
> but I haven't tried copying everything to /dev/null or anything like
> that (just a few files).
>
> The log of the crash (it is very repeatable) can be seen here:
> http://jane.aarghimedes.fi/~jlarja/tempe/btrfs-trouble/btrfs_crash_log.txt
>
> Snipped from start of that:
>
> touko 12 06:41:22 jane kernel: BTRFS info (device sda1): disk space caching
> is enabled
> touko 12 06:41:24 jane kernel: BTRFS info (device sda1): bdev /dev/sdd1
> errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> touko 12 06:41:39 jane kernel: BUG: unable to handle kernel NULL pointer
> dereference at 01f0
> touko 12 06:41:39 jane kernel: IP: []
> can_overcommit+0x1e/0xf0 [btrfs]
> touko 12 06:41:39 jane kernel: PGD 0
> touko 12 06:41:39 jane kernel: Oops:  [#1] SMP
>
> My dmesg log is here:
> http://jane.aarghimedes.fi/~jlarja/tempe/btrfs-trouble/dmesg.log
>
> Other information:
> Linux jane 4.5.0-1-amd64 #1 SMP Debian 4.5.1-1 (2016-04-14) x86_64 GNU/Linux
> btrfs-progs v4.4.1
>
> btrfs fi df /mnt/Allosaurus/
> Data, RAID1: total=13.13TiB, used=13.07TiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID1: total=8.00MiB, used=1.94MiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=87.00GiB, used=85.24GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B

Indeed, the kernel should not crash in such a case. It is not clear
whether you run a 4.5.1 or a 4.5.0 kernel in kernel.org terminology,
but newer than 4.5.x probably does not help in this case.
You could try to mount with usebackuproot and then see if you can get
it writable, after setting long timeout values for the drive. If that
works, remove those two SMRs from the array ASAP.

Kernel crash on mount after SMR disk trouble

2016-05-14 Thread Jukka Larja

In short:

I added two 8TB Seagate Archive SMR disks to a btrfs pool and tried to 
delete one of the old disks. After some errors I ended up with a file 
system that can be mounted read-only, but crashes the kernel if mounted 
normally. Tried btrfs check --repair (which noted that the space cache 
needs to be zeroed) and zeroing the space cache (via a mount parameter), 
but that didn't change anything.
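
Roughly what I ran, from memory (device and mount point are 
placeholders; "zeroing the space cache" means the clear_cache mount 
option):

  # on the unmounted filesystem
  btrfs check --repair /dev/sda1
  # rebuild the free space cache on the next writable mount
  mount -o clear_cache /dev/sda1 /mnt/Allosaurus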


Longer version:

I was originally running Debian Jessie with some pretty recent kernel 
(maybe 4.4), but somewhat older btrfs tools. After the trouble started, 
I tried updating (now running kernel 4.5.1 and tools 4.4.1). I checked 
the new disks with badblocks (no problems found), but based on some 
googling, Seagate's SMR disks seem to have various problems, so the root 
cause is probably one type or another of disk errors.


Here's the output of btrfs fi show:

Label: none  uuid: 8b65962d-0982-449b-ac6f-1acc8397ceb9
Total devices 12 FS bytes used 13.15TiB
devid1 size 3.64TiB used 3.36TiB path /dev/sde1
devid2 size 3.64TiB used 3.36TiB path /dev/sdg1
devid3 size 3.64TiB used 3.36TiB path /dev/sdh1
devid4 size 3.64TiB used 3.34TiB path /dev/sdf1
devid5 size 1.82TiB used 1.44TiB path /dev/sdi1
devid6 size 1.82TiB used 1.54TiB path /dev/sdl1
devid7 size 1.82TiB used 1.51TiB path /dev/sdk1
devid8 size 1.82TiB used 1.54TiB path /dev/sdj1
devid9 size 3.64TiB used 3.31TiB path /dev/sdb1
devid   10 size 3.64TiB used 3.36TiB path /dev/sda1
devid   11 size 7.28TiB used 168.00GiB path /dev/sdc1
devid   12 size 7.28TiB used 168.00GiB path /dev/sdd1

The last two devices (11 and 12) are the new disks. After adding them, I 
first copied some new data in (about 130 GB), which seemed to go fine. 
Then I tried to remove disk 5. After some time (about 30 GiB written to 
11 and 12), there were some errors, and disk 11 or 12 dropped out and 
the fs went read-only. After some trouble-shooting (googling), I decided 
the new disks were too iffy to trust and tried to remove them.


I don't remember exactly what errors I got, but the device delete 
operation was interrupted due to errors at least once or twice before 
more serious trouble began. In between the attempts I updated the HBA's 
(an LSI 9300) firmware. After the final device delete attempt, the end 
result was that attempting to mount causes the kernel to crash. I then 
tried updating the kernel and running check --repair, but that hasn't 
helped. Mounting read-only seems to work perfectly, but I haven't tried 
copying everything to /dev/null or anything like that (just a few files).
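
Verifying that would be something like this (the mount point is a 
placeholder):

  # read every file once, discarding the data; read errors should
  # show up in dmesg
  tar cf - /mnt/Allosaurus > /dev/null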


The log of the crash (it is very repeatable) can be seen here: 
http://jane.aarghimedes.fi/~jlarja/tempe/btrfs-trouble/btrfs_crash_log.txt


Snipped from start of that:

touko 12 06:41:22 jane kernel: BTRFS info (device sda1): disk space caching 
is enabled
touko 12 06:41:24 jane kernel: BTRFS info (device sda1): bdev /dev/sdd1 
errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
touko 12 06:41:39 jane kernel: BUG: unable to handle kernel NULL pointer 
dereference at 01f0
touko 12 06:41:39 jane kernel: IP: [] 
can_overcommit+0x1e/0xf0 [btrfs]

touko 12 06:41:39 jane kernel: PGD 0
touko 12 06:41:39 jane kernel: Oops:  [#1] SMP

My dmesg log is here: 
http://jane.aarghimedes.fi/~jlarja/tempe/btrfs-trouble/dmesg.log


Other information:
Linux jane 4.5.0-1-amd64 #1 SMP Debian 4.5.1-1 (2016-04-14) x86_64 GNU/Linux
btrfs-progs v4.4.1

btrfs fi df /mnt/Allosaurus/
Data, RAID1: total=13.13TiB, used=13.07TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=1.94MiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=87.00GiB, used=85.24GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


The data is either backups or media data duplicated elsewhere, so I'm in 
no great hurry and could fix everything with enough new disks and cp -R. 
However, it would save me a lot of trouble (and some money) if I could 
get this fixed otherwise. Of course, it would also be nice in general 
for future kernels not to crash when mounting a corrupted file system :).


--
 ...Elämälle vierasta toimintaa...
 Jukka Larja, jla...@iki.fi, 0407679919

"Our own Charlie D reckons that 18.2 per cent of Internet traffic is now 
pr0n, and if Intel's Netbust can make the Internet faster, can the sempr0n 
make pr0n faster?"

- The Inquirer, http://www.theinquirer.net/?article=16447 -
