Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo


On 2018/12/7 7:15 AM, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array I
> seem to be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your super block shows it has a log tree, which means
either you hit a kernel panic/transaction abort, or an unexpected power
loss occurred.

> The btrfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, so it's no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376
" and paste the output.

But I don't have any clue. My guess is some kernel problem related to
new chunk allocation, or the chunk root node itself is already seriously
corrupted.

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such an old kernel, unless it has been well maintained with tons
of backported btrfs fixes.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade  wrote:
>>
>> I have let the find root command run for 14+ days; it's produced a
>> pretty huge log file (1.6 GB) but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018-05-05 00:18, Michael Wade wrote:
 Hi Qu,

 The tool is still running and the log file is now ~300 MB. I guess it
 shouldn't normally take this long... Is there anything else worth
 trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although there is a possibility to modify btrfs-find-root to do a much
>>> faster but more limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and there is
>>> not much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards
 Michael

 On 2 May 2018 at 06:29, Michael Wade  wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools is still running and has produced a log
> file of ~85mb filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>
>>
>> On 2018-05-01 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root suppresses csum error output, such basic
>> tree validation checks are not suppressed, thus you get this message.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed with the same result. Is there a CLI utility I could
>>> use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:


 On 2018-04-29 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K skip=266359275520

 Unfortunately, both dumps are corrupted and contain mostly garbage.
 I think the underlying stack (mdraid) has something wrong or failed
 to recover its data.

 This means your last 

Re: btrfs progs always assume devid 1?

2018-12-05 Thread Austin S. Hemmelgarn

On 2018-12-05 14:50, Roman Mamedov wrote:

Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
        Total devices 1 FS bytes used 36.63GiB
        devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for hard-coded devid 1.



I've been meaning to try and write up a patch to special-case this for a 
while now, but have not gotten around to it yet.


FWIW, this is one of multiple reasons that it's highly recommended to 
use `btrfs replace` instead of adding a new device and deleting the old 
one when replacing a device.  Other benefits include:


* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice 
indication of the completion percentage.


The only disadvantages are that the new device has to be at least as large 
as the old one (though you can get around this to a limited degree by 
shrinking the old device), and that the old and new devices need to be 
plugged in at the same time (add/delete doesn't need that, if you flip the 
order of the add and delete commands).
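
For reference, the replace workflow is roughly the following (the device
paths and mount point below are just placeholders for your actual setup):

# btrfs replace start /dev/old_disk /dev/new_disk /mnt
# btrfs replace status /mnt

and, if the new device is larger than the old one, follow up with:

# btrfs fi resize <devid>:max /mnt

The -r option to 'replace start' is also worth knowing about: it only reads
from the old device when no other good copy of the data exists, which helps
when that device is unreliable.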


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 22:14 ч., Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasing problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing fingers I'd like concrete
>> evidence pointing at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is:

perf record --all-kernel -g mount /dev/vdc /media/scratch/

Of course, replace the device/mount path appropriately. This will result in a
perf.data file which contains stacktraces of the hottest paths executed
during the invocation of mount. It would be appreciated if you could send
this file to the mailing list or upload it somewhere for interested people
(me and perhaps Qu) to inspect.

If the file turns out to be way too big, you can use

perf report --stdio

to create a text output and send that instead.

> 
> Best,
> 
> ellis
> 


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Wilson, Ellis
On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful of
>> drives to result in a filesystem that takes around a minute to mount.
>> As a result of this, I suspect this will become an increasing problem
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>> not a contributor so I have no room to do so -- just shedding some light
>> on a problem that may deserve attention as filesystem sizes continue to
>> grow.
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

I am glad to collect such traces -- please advise with commands that 
would achieve that.  If you just mean block traces, I can do that, but I 
suspect you mean something more BTRFS-specific.

Best,

ellis



Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
Le 04/12/2018 à 03:52, Chris Murphy a écrit :
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
>  wrote:
>> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
>>> [...]
>>> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
>>> in mount time (I managed to reduce the amount of IO requests
>> Sent too quickly: I meant to write "managed to reduce by half the number
>> of IO write requests for the same amount of data written"
>>
>>>  by half on
>>> one server in production though, although more tests are needed to
>>> isolate the cause).
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that
> reduces the writes by half, or by how much for each? That's a major
> reduction in writes, and suggests it might be possible for further
> optimization, to help mitigate the wandering trees impact.

Note, the other major changes were:
- the 4.9 to 4.14 kernel upgrade,
- using multi-queue aware bfq instead of noop.

If BTRFS IO patterns in our case allow bfq to merge io-requests, this
could be another explanation.

Lionel



Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Qu Wenruo



On 2018/12/4 9:07 PM, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful of 
>> drives to result in a filesystem that takes around a minute to mount. 
>> As a result of this, I suspect this will become an increasing problem 
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
>> not a contributor so I have no room to do so -- just shedding some light 
>> on a problem that may deserve attention as filesystem sizes continue to 
>> grow.
> 
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

IIRC I submitted such an analysis years ago.

Nowadays it may have changed due to chunk <-> bg <-> dev_extents cross checking.
So yes, it would be a good idea to show such a percentage.

Thanks,
Qu

> 
>>
>> Best,
>>
>> ellis
>>


Re: btrfs development - question about crypto api integration

2018-12-04 Thread David Sterba
On Fri, Nov 30, 2018 at 06:27:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 30.11.18 г. 17:22 ч., Chris Mason wrote:
> > On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
> > 
> >> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
> >>> Hi all,
> >>> I've been studying LZ4 and other compression algorithms on the
> >>> kernel, and seen other projects such as zram and ubifs using the
> >>> crypto api. Is there a technical reason for not using the crypto api
> >>> for compression (and possibly for encryption) in btrfs?
> >>> I did not find any design/technical implementation choices in
> >>> btrfs development in the developer's FAQ on the wiki. If I completely
> >>> missed it, could someone point me in the right direction?
> >>> Lastly, if there is no technical reason for this, would it be
> >>> something interesting to have implemented?
> >>
> >> I personally think it would be better if btrfs exploited the generic
> >> framework. And in fact when you look at zstd, btrfs does use the
> >> generic, low-level ZSTD routines but not the crypto library wrappers.
> >> If I were you, I'd try to convert zstd (since it's the most recently added
> >> algorithm) to using the crypto layer to see if there are any lurking
> >> problems.
> > 
> > Back when I first added the zlib support, the zlib API was both easier 
> > to use and a better fit for our async worker threads.  That doesn't mean 
> > we shouldn't switch, it's just how we got to step one ;)
> 
> And what about zstd? Why is it also using the low-level API and not the
> crypto wrappers?

I think because it just copied the way things already were.

Here's fs/ubifs/compress.c as an example use of the API in a filesystem:

ubifs_compress
  crypto_comp_compress
    crypto_comp_crt -- address of, dereference
    ->cot_compress  -- takes another address of something, indirect
                       function call with value 'crypto_compress'
      -- this does 2 pointer dereferences and an indirect call to
         coa_compress
         coa_compress = lzo_compress
           lzo_compress -- pointer dereferences for crypto context
             lzo1x_1_compress -- through static __lzo_compress (no overhead)

while in btrfs:

btrfs_compress_pages
  ->compress_pages
    lzo1x_1_compress

The crypto API adds several pointer and function call indirections, so I'm
not sure I want to get rid of the direct calls just yet.
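
For comparison, here is a rough sketch of what a caller using the (old,
synchronous) crypto compression interface looks like. This is illustrative
only, not actual btrfs or ubifs code:

/*
 * Minimal sketch: compress a buffer through the crypto API instead of
 * calling lzo1x_1_compress() directly. Every call goes through the
 * crt/cot_compress indirection shown above.
 */
#include <linux/crypto.h>
#include <linux/err.h>

static int compress_via_crypto_api(const u8 *src, unsigned int slen,
				   u8 *dst, unsigned int *dlen)
{
	struct crypto_comp *tfm;
	int ret;

	tfm = crypto_alloc_comp("lzo", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	ret = crypto_comp_compress(tfm, src, slen, dst, dlen);

	crypto_free_comp(tfm);
	return ret;
}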


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

Would it be possible to provide perf traces of the longer-running mount
time? Everyone seems to be fixated on reading block groups (which is
likely to be the culprit) but before pointing fingers I'd like concrete
evidence pointing at the offender.

> 
> Best,
> 
> ellis
> 


Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Chris Murphy
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
 wrote:
>
> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the amount of IO requests
>
> Sent too quickly: I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data written"
>
> >  by half on
> > one server in production though, although more tests are needed to
> > isolate the cause).

Interesting. I wonder if it's ssd_spread or space_cache=v2 that
reduces the writes by half, or by how much for each? That's a major
reduction in writes, and suggests it might be possible for further
optimization, to help mitigate the wandering trees impact.


-- 
Chris Murphy


Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Qu Wenruo


On 2018/12/4 2:20 AM, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which will try to
read *ALL* block group items.
And to no one's surprise, the larger the fs gets, the more block group
items need to be read from disk.

We need to delay such reads to improve this case.

Thanks,
Qu

> 
> Best,
> 
> ellis
> 





Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Hans van Kranenburg
Hi,

On 12/3/18 8:56 PM, Lionel Bouton wrote:
> 
> Le 03/12/2018 à 19:20, Wilson, Ellis a écrit :
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.

Probably yes. The thing that is happening is that all block group items
are read from the extent tree. And, instead of being nicely grouped
together, they are scattered all over the place, at their virtual
address, in between all normal extent items.

So, mount time depends on the cold random read iops your storage can do, and
on the size of the extent tree and the number of block groups. And your extent
tree has more items in it if you have more extents. So, yes, writing a
lot of 4kiB files should have a similar effect, I think, as a lot of
128MiB files that are still stored in 1 extent per file.
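
(One way to get a feeling for this on an existing filesystem, assuming a
reasonably recent btrfs-progs, is to dump the extent tree and roughly count
the block group items that sit interleaved with the ordinary extent items:

# btrfs inspect-internal dump-tree -t extent /dev/md0 | grep -c BLOCK_GROUP_ITEM

Each of those items has to be located in the extent tree during mount.)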

>  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
> 
> We are hosting some large BTRFS filesystems on Ceph (RBD used by
> QEMU/KVM). I believe the delay is heavily linked to the number of files
> (I didn't check if snapshots matter and I suspect it does but not as
> much as the number of "original" files at least if you don't heavily
> modify existing files but mostly create new ones as we do).
> As an example, we have a filesystem with 20TB used space with 4
> subvolumes hosting multiple millions of files/directories (probably 10-20
> million total; I didn't check the exact number recently as simply
> counting files is a very long process) and 40 snapshots for each volume.
> Mount takes about 15 minutes.
> We have virtual machines that we don't reboot as often as we would like
> because of these slow mount times.
> 
> If you want to study this, you could:
> - graph the delay for various individual file sizes (instead of 25x10GB,
> create 2,500 x 100MB and 250,000 x 1MB files between each run and
> compare to the original result)
> - graph the delay vs the number of snapshots (probably starting with a
> large number of files in the initial subvolume to start with a
> non-trivial mount delay)
> You may want to study the impact of the differences between snapshots by
> comparing snapshotting without modifications and snapshots made at
> various stages of your subvolume growth.
> 
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the amount of IO requests by half on
> one server in production though, although more tests are needed to
> isolate the cause).
> I didn't expect much for the mount times, it seems to me that mount is
> mostly constrained by the BTRFS on disk structures needed at mount time
> and how the filesystem reads them (for example it doesn't benefit at all
> from large IO queue depths which probably means that each read depends
> on previous ones which prevents io-schedulers from optimizing anything).

Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code is doing here is starting at the beginning of the extent
tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
is not that far away), and then based on the information in it, computes

Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Lionel Bouton
Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the amount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written"

>  by half on
> one server in production though, although more tests are needed to
> isolate the cause).




Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Lionel Bouton
Hi,

Le 03/12/2018 à 19:20, Wilson, Ellis a écrit :
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.

We are hosting some large BTRFS filesystems on Ceph (RBD used by
QEMU/KVM). I believe the delay is heavily linked to the number of files
(I didn't check if snapshots matter and I suspect it does but not as
much as the number of "original" files at least if you don't heavily
modify existing files but mostly create new ones as we do).
As an example, we have a filesystem with 20TB used space with 4
subvolumes hosting multiple millions of files/directories (probably 10-20
million total; I didn't check the exact number recently as simply
counting files is a very long process) and 40 snapshots for each volume.
Mount takes about 15 minutes.
We have virtual machines that we don't reboot as often as we would like
because of these slow mount times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
create 2,500 x 100MB and 250,000 x 1MB files between each run and
compare to the original result)
- graph the delay vs the number of snapshots (probably starting with a
large number of files in the initial subvolume to start with a
non-trivial mount delay)
You may want to study the impact of the differences between snapshots by
comparing snapshotting without modifications and snapshots made at
various stages of your subvolume growth.

Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread) but there wasn't any measurable improvement
in mount time (I managed to reduce the amount of IO requests by half on
one server in production though, although more tests are needed to
isolate the cause).
I didn't expect much for the mount times, it seems to me that mount is
mostly constrained by the BTRFS on disk structures needed at mount time
and how the filesystem reads them (for example it doesn't benefit at all
from large IO queue depths which probably means that each read depends
on previous ones which prevents io-schedulers from optimizing anything).

Best regards,

Lionel


Re: btrfs development - question about crypto api integration

2018-11-30 Thread Nikolay Borisov



On 30.11.18 г. 17:22 ч., Chris Mason wrote:
> On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
> 
>> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
>>> Hi all,
>>> I've been studying LZ4 and other compression algorithms on the
>>> kernel, and seen other projects such as zram and ubifs using the
>>> crypto api. Is there a technical reason for not using the crypto api
>>> for compression (and possibly for encryption) in btrfs?
>>> I did not find any design/technical implementation choices in
>>> btrfs development in the developer's FAQ on the wiki. If I completely
>>> missed it, could someone point me in the right direction?
>>> Lastly, if there is no technical reason for this, would it be
>>> something interesting to have implemented?
>>
>> I personally think it would be better if btrfs exploited the generic
>> framework. And in fact when you look at zstd, btrfs does use the
>> generic, low-level ZSTD routines but not the crypto library wrappers.
>> If I were you, I'd try to convert zstd (since it's the most recently added
>> algorithm) to using the crypto layer to see if there are any lurking
>> problems.
> 
> Back when I first added the zlib support, the zlib API was both easier 
> to use and a better fit for our async worker threads.  That doesn't mean 
> we shouldn't switch, it's just how we got to step one ;)

And what about zstd? Why is it also using the low-level API and not the
crypto wrappers?

> 
> -chris
> 


Re: btrfs development - question about crypto api integration

2018-11-30 Thread Chris Mason
On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:

> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
>> Hi all,
>> I've been studying LZ4 and other compression algorithms on the
>> kernel, and seen other projects such as zram and ubifs using the
>> crypto api. Is there a technical reason for not using the crypto api
>> for compression (and possibly for encryption) in btrfs?
>> I did not find any design/technical implementation choices in
>> btrfs development in the developer's FAQ on the wiki. If I completely
>> missed it, could someone point me in the right direction?
>> Lastly, if there is no technical reason for this, would it be
>> something interesting to have implemented?
>
> I personally think it would be better if btrfs exploited the generic
> framework. And in fact when you look at zstd, btrfs does use the
> generic, low-level ZSTD routines but not the crypto library wrappers.
> If I were you, I'd try to convert zstd (since it's the most recently added
> algorithm) to using the crypto layer to see if there are any lurking
> problems.

Back when I first added the zlib support, the zlib API was both easier 
to use and a better fit for our async worker threads.  That doesn't mean 
we shouldn't switch, it's just how we got to step one ;)

-chris


Re: btrfs development - question about crypto api integration

2018-11-29 Thread Nikolay Borisov



On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
> Hi all,
> I've been studying LZ4 and other compression algorithms on the
> kernel, and seen other projects such as zram and ubifs using the
> crypto api. Is there a technical reason for not using the crypto api
> for compression (and possibly for encryption) in btrfs?
> I did not find any design/technical implementation choices in
> btrfs development in the developer's FAQ on the wiki. If I completely
> missed it, could someone point me in the right direction?
> Lastly, if there is no technical reason for this, would it be
> something interesting to have implemented?

I personally think it would be better if btrfs exploited the generic
framework. And in fact when you look at zstd, btrfs does use the
generic, low-level ZSTD routines but not the crypto library wrappers. If
I were you, I'd try to convert zstd (since it's the most recently added
algorithm) to using the crypto layer to see if there are any lurking
problems.

> 
> Best regards
> 


Re: Btrfs and fanotify filesystem watches

2018-11-27 Thread Jan Kara
On Fri 23-11-18 19:53:11, Amir Goldstein wrote:
> On Fri, Nov 23, 2018 at 3:34 PM Amir Goldstein  wrote:
> > > So open_by_handle() should work fine even if we get mount_fd of /subvol1
> > > and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> > > only used to get the superblock and that is the same.
> >
> > I don't think it will work fine.
> > do_handle_to_path() will compose a path from /subvol1 mnt and dentry from
> > /subvol2. This may resolve to a full path that does not really exist,
> > so the application cannot match it to anything that it can use
> > name_to_handle_at() to identify.
> >
> > The whole thing just sounds too messy. At the very least we need more time
> > to communicate with btrfs developers and figure this out, so I am going to
> > go with -EXDEV for any attempt to set *any* mark on a group with
> > FAN_REPORT_FID if fsid of fanotify_mark() path argument is different
> > from fsid of path->dentry->d_sb->s_root.
> >
> > We can relax that later if we figure out a better way.
> >
> > BTW, I am also going to go with -ENODEV for zero fsid (e.g. tmpfs).
> > tmpfs can be easily fixed to have non zero fsid if filesystem watch on 
> > tmpfs is
> > important.
> >
> 
> Well, this is interesting... I have implemented the -EXDEV logic and it
> works as expected. I can set a filesystem global watch on the main
> btrfs mount but am not allowed to set a global watch on a subvolume.
> 
> The interesting part is that the global watch on the main btrfs volume is
> more useful than I thought it would be. The file handles reported by the
> main volume global watch are resolved to correct paths in the main volume.
> I guess this is because a btrfs subvolume looks like a directory tree
> in the global namespace to vfs. See below.
> 
> So I will continue based on this working POC:
> https://github.com/amir73il/linux/commits/fanotify_fid
> 
> Note that in the POC, fsid is cached in mark connector as you suggested.
> It is still buggy, but my prototype always decodes file handles from the first
> path argument given to the program, so it just goes to show that by getting
> fsid of the main btrfs volume, the application will be able to properly decode
> file handles and resolve correct paths.
> 
> The bottom line is that btrfs will have the full functionality of super block
> monitoring with no ability to watch (or ignore) a single subvolume
> (unless by using a mount mark).

Sounds good. I'll check the new version of your series.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: Btrfs and fanotify filesystem watches

2018-11-23 Thread Amir Goldstein
On Fri, Nov 23, 2018 at 3:34 PM Amir Goldstein  wrote:
>
> On Fri, Nov 23, 2018 at 2:52 PM Jan Kara  wrote:
> >
> > Changed subject to better match what we discuss and added btrfs list to CC.
> >
> > On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> > > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> > > >
> > > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is 
> > > > > > > associated with
> > > > > > > the single super block struct, so all dentries in all subvolumes 
> > > > > > > will
> > > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > > > >
> > > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > > > >
> > > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > > > /* Mask in the root object ID too, to disambiguate subvols */
> > > > > > buf->f_fsid.val[0] ^= BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > > > buf->f_fsid.val[1] ^= BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > > > >
> > > > > > So subvolume root is xored into the FSID. Thus dentries from 
> > > > > > different
> > > > > > subvolumes are going to return different fsids...
> > > > > >
> > > > >
> > > > > Right... how could I have missed that :-/
> > > > >
> > > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > > > Instead, I will use:
> > > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,...
> > > >
> > > > So what about my proposal to store fsid in the notification mark when 
> > > > it is
> > > > created and then use it when that mark results in event being generated?
> > > > When mark is created, we have full path available, so getting proper 
> > > > fsid
> > > > is trivial. Furthermore if the behavior is documented like that, it is
> > > > fairly easy for userspace to track fsids it should care about and 
> > > > translate
> > > > them to proper file descriptors for open_by_handle().
> > > >
> > > > This would get rid of statfs() on every event creation (which I don't 
> > > > like
> > > > much) and also avoids these problems "how to get fsid for inode". What 
> > > > do
> > > > you think?
> > > >
> > >
> > > That's interesting. I like the simplicity.
> > > What happens when there are 2 btrfs subvols /subvol1 /subvol2
> > > and the application obviously doesn't know about this and does:
> > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> > > statfs("/subvol1",...);
> > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> > > statfs("/subvol2",...);
> > >
> > > Now the second fanotify_mark() call just updates the existing super block
> > > mark mask, but does not change the fsid on the mark, so all events will 
> > > have
> > > fsid of subvol1 that was stored when first creating the mark.
> >
> > Yes.
> >
> > > fanotify-watch application (for example) hashes the watches (paths) under
> > > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> > > "watch" (i.e. path).
> >
> > I agree this can be confusing... but with btrfs fanotify-watch will be
> > confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
> > on /subvol1 (with fsid A) is going to return also events on inodes from
> > /subvol2 (with fsid B). So your current code will return event with fsid B
> > which fanotify-watch has no way to match back and can get confused.
> >
> > So currently application can get events with fsid it has never seen, with
> > the code as I suggest it can get "wrong" fsid. That is confusing but still
> > somewhat better?
>
> It's not better IMO because the purpose of the fid in event is a unique key
> that application can use to match a path it already indexed.
> wrong fsid => no match.
> Using open_by_handle_at() to resolve unknown fid is a limited solution
> and I don't think it is going to work in this case (see below).
>
> >
> > The core of the problem is that btrfs does not have "the superblock
> > identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
> > that we could use.
> >
> > > And when trying to open_by_handle fid with fhandle from /subvol2
> > > using mount_fd of /subvol1, I suppose we can either get ESTALE
> > > or a disconnected dentry, because object from /subvol2 cannot
> > > have a path inside /subvol1
> >
> > So open_by_handle() should work fine even if we get mount_fd of /subvol1
> > and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> > only used to get the superblock and that is the same.
>
> I don't think it will work fine.
> do_handle_to_path() will compose a path from /subvol1 mnt and dentry 

Re: Btrfs and fanotify filesystem watches

2018-11-23 Thread Amir Goldstein
On Fri, Nov 23, 2018 at 2:52 PM Jan Kara  wrote:
>
> Changed subject to better match what we discuss and added btrfs list to CC.
>
> On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> > >
> > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is 
> > > > > > associated with
> > > > > > the single super block struct, so all dentries in all subvolumes 
> > > > > > will
> > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > > >
> > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > > >
> > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > > /* Mask in the root object ID too, to disambiguate subvols */
> > > > > buf->f_fsid.val[0] ^= BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > > buf->f_fsid.val[1] ^= BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > > >
> > > > > So subvolume root is xored into the FSID. Thus dentries from different
> > > > > subvolumes are going to return different fsids...
> > > > >
> > > >
> > > > Right... how could I have missed that :-/
> > > >
> > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > > Instead, I will use:
> > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,...
> > >
> > > So what about my proposal to store fsid in the notification mark when it 
> > > is
> > > created and then use it when that mark results in event being generated?
> > > When mark is created, we have full path available, so getting proper fsid
> > > is trivial. Furthermore if the behavior is documented like that, it is
> > > fairly easy for userspace to track fsids it should care about and 
> > > translate
> > > them to proper file descriptors for open_by_handle().
> > >
> > > This would get rid of statfs() on every event creation (which I don't like
> > > much) and also avoids these problems "how to get fsid for inode". What do
> > > you think?
> > >
> >
> > That's interesting. I like the simplicity.
> > What happens when there are 2 btrfs subvols /subvol1 /subvol2
> > and the application obviously doesn't know about this and does:
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> > statfs("/subvol1",...);
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> > statfs("/subvol2",...);
> >
> > Now the second fanotify_mark() call just updates the existing super block
> > mark mask, but does not change the fsid on the mark, so all events will have
> > fsid of subvol1 that was stored when first creating the mark.
>
> Yes.
>
> > fanotify-watch application (for example) hashes the watches (paths) under
> > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> > "watch" (i.e. path).
>
> I agree this can be confusing... but with btrfs fanotify-watch will be
> confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
> on /subvol1 (with fsid A) is going to return also events on inodes from
> /subvol2 (with fsid B). So your current code will return event with fsid B
> which fanotify-watch has no way to match back and can get confused.
>
> So currently application can get events with fsid it has never seen, with
> the code as I suggest it can get "wrong" fsid. That is confusing but still
> somewhat better?

It's not better IMO because the purpose of the fid in event is a unique key
that application can use to match a path it already indexed.
wrong fsid => no match.
Using open_by_handle_at() to resolve unknown fid is a limited solution
and I don't think it is going to work in this case (see below).
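
For context, the kind of indexing I mean is roughly the following (a
userspace sketch, not the actual fanotify-watch code): the application
records a (f_fsid, file_handle) pair per watched path and later looks
events up by that pair, so a "wrong" fsid in the event means the lookup
simply misses.

/* Sketch: build the (fsid, handle) key a watching application would
 * index a path by. Error handling is mostly omitted for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/statfs.h>

struct watch_key {
	fsid_t fsid;                /* from statfs() on the watched path */
	struct file_handle *fh;     /* from name_to_handle_at() */
};

static int build_key(const char *path, struct watch_key *key)
{
	struct statfs st;
	struct file_handle *fh;
	int mount_id;

	if (statfs(path, &st))
		return -1;
	key->fsid = st.f_fsid;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -1;
	fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0)) {
		free(fh);
		return -1;
	}
	key->fh = fh;
	return 0;
}

On btrfs, statfs() on /subvol1 and /subvol2 returns different f_fsid values
(because of the root objectid xor shown above), so an event carrying the fsid
stored in the /subvol1 mark will not match a key built for a path under
/subvol2.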

>
> The core of the problem is that btrfs does not have "the superblock
> identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
> that we could use.
>
> > And when trying to open_by_handle fid with fhandle from /subvol2
> > using mount_fd of /subvol1, I suppose we can either get ESTALE
> > or a disconnected dentry, because object from /subvol2 cannot
> > have a path inside /subvol1
>
> So open_by_handle() should work fine even if we get mount_fd of /subvol1
> and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> only used to get the superblock and that is the same.

I don't think it will work fine.
do_handle_to_path() will compose a path from /subvol1 mnt and dentry from
/subvol2. This may resolve to a full path that does not really exist,
so the application cannot match it to anything that it can use
name_to_handle_at() to identify.

The whole thing just sounds too messy. At the very least we need more time
to communicate with btrfs developers and figure this out, so I am going to
go 

Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Chris Murphy
On Thu, Nov 22, 2018 at 6:07 AM Tomasz Chmielewski  wrote:
>
> On 2018-11-22 21:46, Nikolay Borisov wrote:
>
> >> # echo w > /proc/sysrq-trigger
> >>
> >> # dmesg -c
> >> [  931.585611] sysrq: SysRq : Show Blocked State
> >> [  931.585715]   task    PC stack   pid father
> >> [  931.590168] btrfs-cleaner   D    0  1340  2 0x8000
> >> [  931.590175] Call Trace:
> >> [  931.590190]  __schedule+0x29e/0x840
> >> [  931.590195]  schedule+0x2c/0x80
> >> [  931.590199]  schedule_timeout+0x258/0x360
> >> [  931.590204]  io_schedule_timeout+0x1e/0x50
> >> [  931.590208]  wait_for_completion_io+0xb7/0x140
> >> [  931.590214]  ? wake_up_q+0x80/0x80
> >> [  931.590219]  submit_bio_wait+0x61/0x90
> >> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> >> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> >> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> >> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> >> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> >> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> >> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> >> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> >> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> >> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> >> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> >> [  931.590612]  kthread+0x120/0x140
> >> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> >> [  931.590658]  ? kthread_bind+0x40/0x40
> >> [  931.590661]  ret_from_fork+0x22/0x40
> >>
> >
> > It seems your filesystem is mounted with the DISCARD option, meaning
> > every delete will result in a discard; this is highly suboptimal for
> > SSDs.
> > Try remounting the fs without the discard option and see if it helps.
> > Generally for discard you want to submit it in big batches (what fstrim
> > does) so that the FTL on the SSD can apply any optimisations it might
> > have up its sleeve.
>
> Spot on!
>
> Removed "discard" from fstab and added "ssd", rebooted - no more
> btrfs-cleaner running.
>
> Do you know if the issue you described ("discard this is highly
> suboptimal for ssd") affects other filesystems as well to a similar
> extent? I.e. if using ext4 on ssd?

Quite a lot of activity on ext4 and XFS is overwrites, so discard
isn't needed. And it might be that discard is subject to delays. On Btrfs,
it's almost immediate, to the degree that on a couple of SSDs I've
tested, stale trees referenced exclusively by the most recent backup
tree entries in the superblock are already zeros. That functionally
means no automatic recoveries at mount time if there's a problem with
any of the current trees.

I was using it for about a year to no ill effect, BUT not a lot of
file deletions either. I wouldn't recommend it, and instead suggest
enabling the fstrim.timer which by default runs fstrim.service once a
week (which in turn issues fstrim, I think on all mounted volumes.)
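
On a systemd distro that ships the util-linux timer, that is just something
like:

# systemctl enable --now fstrim.timer

or, for a one-off manual trim of a mounted filesystem:

# fstrim -v /path/to/mountpoint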

I am a bit more concerned about the read errors you had that were
being corrected automatically? The corruption suggests a firmware bug
related to trim. I'd check the affected SSD firmware revision and
consider updating it (only after a backup, it's plausible the firmware
update is not guaranteed to be data safe). Does the volume use DUP or
raid1 metadata? I'm not sure how it's correcting for these problems
otherwise.


-- 
Chris Murphy


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Qu Wenruo



On 2018/11/22 10:03 PM, Roman Mamedov wrote:
> On Thu, 22 Nov 2018 22:07:25 +0900
> Tomasz Chmielewski  wrote:
> 
>> Spot on!
>>
>> Removed "discard" from fstab and added "ssd", rebooted - no more 
>> btrfs-cleaner running.
> 
> Recently there has been a bugfix for TRIM in Btrfs:
>   
>   btrfs: Ensure btrfs_trim_fs can trim the whole fs
>   https://patchwork.kernel.org/patch/10579539/
> 
> Perhaps your upgraded kernel is the first one to contain it, and for the first
> time you're seeing TRIM actually *work*, with the actual performance impact
> of it on a large fragmented FS, instead of a few contiguous unallocated areas.
> 
That only affects btrfs_trim_fs(), and you can see it's only called from
the ioctl interface, so it's definitely not the case here.

Thanks,
Qu


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Roman Mamedov
On Thu, 22 Nov 2018 22:07:25 +0900
Tomasz Chmielewski  wrote:

> Spot on!
> 
> Removed "discard" from fstab and added "ssd", rebooted - no more 
> btrfs-cleaner running.

Recently there has been a bugfix for TRIM in Btrfs:
  
  btrfs: Ensure btrfs_trim_fs can trim the whole fs
  https://patchwork.kernel.org/patch/10579539/

Perhaps your upgraded kernel is the first one to contain it, and for the first
time you're seeing TRIM actually *work*, with the actual performance impact
of it on a large fragmented FS, instead of a few contiguous unallocated areas.

-- 
With respect,
Roman


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Tomasz Chmielewski

On 2018-11-22 21:46, Nikolay Borisov wrote:


# echo w > /proc/sysrq-trigger

# dmesg -c
[  931.585611] sysrq: SysRq : Show Blocked State
[  931.585715]   task    PC stack   pid father
[  931.590168] btrfs-cleaner   D    0  1340  2 0x8000
[  931.590175] Call Trace:
[  931.590190]  __schedule+0x29e/0x840
[  931.590195]  schedule+0x2c/0x80
[  931.590199]  schedule_timeout+0x258/0x360
[  931.590204]  io_schedule_timeout+0x1e/0x50
[  931.590208]  wait_for_completion_io+0xb7/0x140
[  931.590214]  ? wake_up_q+0x80/0x80
[  931.590219]  submit_bio_wait+0x61/0x90
[  931.590225]  blkdev_issue_discard+0x7a/0xd0
[  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
[  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
[  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
[  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
[  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
[  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
[  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
[  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
[  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
[  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
[  931.590612]  kthread+0x120/0x140
[  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
[  931.590658]  ? kthread_bind+0x40/0x40
[  931.590661]  ret_from_fork+0x22/0x40



It seems your filesystem is mounted with the DISCARD option, meaning
every delete will result in a discard; this is highly suboptimal for
SSDs.

Try remounting the fs without the discard option and see if it helps.
Generally for discard you want to submit it in big batches (what fstrim
does) so that the FTL on the SSD can apply any optimisations it might
have up its sleeve.


Spot on!

Removed "discard" from fstab and added "ssd", rebooted - no more 
btrfs-cleaner running.


Do you know if the issue you described ("discard this is highly 
suboptimal for ssd") affects other filesystems as well to a similar 
extent? I.e. if using ext4 on ssd?




Would you finally care to share the smart data + the model and make of
the ssd?


2x these:

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LM1T9HCJM-5
Firmware Version: GXT1103Q
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device

1x this:

Device Model:     Micron_5200_MTFDDAK1T9TDC
Firmware Version: D1MU004
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:  2.5 inches


But - seems the issue was unneeded discard option, so not pasting 
unnecessary SMART data, thanks for finding this out.



Tomasz Chmielewski


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 14:31 ч., Tomasz Chmielewski wrote:
> Yet another system upgraded to 4.19 and showing strange issues.
> 
> btrfs-cleaner is showing as ~90-100% busy in iotop:
> 
>    TID  PRIO  USER     DISK READ   DISK WRITE   SWAPIN     IO>   COMMAND
>   1340  be/4  root     0.00 B/s    0.00 B/s     0.00 %  92.88 %  [btrfs-cleaner]
> 
> 
> Note disk read, disk write are 0.00 B/s.
> 
> 
> iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0
> writes to them:
> 
> Device   r/s   w/s  rMB/s  wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> sda     0.00  0.00   0.00   0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.91      0.00      0.00   0.00  90.80
> sdb     0.00  0.00   0.00   0.00    0.00    0.00   0.00   0.00     0.00     0.00    1.00      0.00      0.00   0.00 100.00
> sdc     0.00  0.00   0.00   0.00    0.00    0.00   0.00   0.00     0.00     0.00    1.00      0.00      0.00   0.00 100.00
> 
> 
> 
> btrfs-cleaner persists 100% busy after reboots. The system is generally
> not very responsive.
> 
> 
> 
> # echo w > /proc/sysrq-trigger
> 
> # dmesg -c
> [  931.585611] sysrq: SysRq : Show Blocked State
> [  931.585715]   task    PC stack   pid father
> [  931.590168] btrfs-cleaner   D    0  1340  2 0x8000
> [  931.590175] Call Trace:
> [  931.590190]  __schedule+0x29e/0x840
> [  931.590195]  schedule+0x2c/0x80
> [  931.590199]  schedule_timeout+0x258/0x360
> [  931.590204]  io_schedule_timeout+0x1e/0x50
> [  931.590208]  wait_for_completion_io+0xb7/0x140
> [  931.590214]  ? wake_up_q+0x80/0x80
> [  931.590219]  submit_bio_wait+0x61/0x90
> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> [  931.590612]  kthread+0x120/0x140
> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> [  931.590658]  ? kthread_bind+0x40/0x40
> [  931.590661]  ret_from_fork+0x22/0x40
> 



It seems your filesystem is mounted with the DISCARD option, meaning
every delete will result in a discard; this is highly suboptimal for SSDs.
Try remounting the fs without the discard option and see if it helps.
Generally you want to submit discards in big batches (what fstrim
does) so that the FTL on the SSD can apply any optimisations it might
have up its sleeve.


Would you finally care to share the smart data + the model and make of
the ssd?

> 
> 
> 
> After rebooting to 4.19.3, I've started seeing read errors. There were
> no errors with a previous kernel (4.17.14); disks are healthy according
> to SMART; no errors reported when we read the whole surface with dd.
> 
> # grep -i btrfs.*corrected /var/log/syslog|wc -l
> 156
> 
> Things like:
> 
> Nov 22 12:17:43 lxd05 kernel: [  711.538836] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector
> 101088384)
> Nov 22 12:17:43 lxd05 kernel: [  711.538905] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector
> 101088392)
> Nov 22 12:17:43 lxd05 kernel: [  711.538958] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector
> 101088400)
> Nov 22 12:17:43 lxd05 kernel: [  711.539006] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector
> 101088408)
> 
> 
> Yet - 0 errors, according to stats, not sure if it's expected or not:

Since btrfs was able to fix the read failure on the fly it doesn't
record the error.  IMO you have problems with your storage below the
filesystem level.
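
If you want to double-check that nothing is persistently wrong, a full scrub
followed by re-reading the counters is a reasonable sanity check (the mount
point is taken from your stats output below):

# btrfs scrub start -Bd /data/lxd
# btrfs device stats /data/lxd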

> 
> # btrfs device stats /data/lxd
> [/dev/sda2].write_io_errs    0
> [/dev/sda2].read_io_errs 0
> [/dev/sda2].flush_io_errs    0
> [/dev/sda2].corruption_errs  0
> [/dev/sda2].generation_errs  0
> [/dev/sdb2].write_io_errs    0
> [/dev/sdb2].read_io_errs 0
> [/dev/sdb2].flush_io_errs    0
> [/dev/sdb2].corruption_errs  0
> [/dev/sdb2].generation_errs  0
> [/dev/sdc2].write_io_errs    0
> [/dev/sdc2].read_io_errs 0
> [/dev/sdc2].flush_io_errs    0
> [/dev/sdc2].corruption_errs  0
> [/dev/sdc2].generation_errs  0
> 
> 




Re: btrfs convert problem

2018-11-17 Thread Qu Wenruo


On 2018/11/17 下午5:49, Serhat Sevki Dincer wrote:
> Hi,
> 
> On my second attempt to convert my 698 GiB usb HDD from ext4 to btrfs
> with btrfs-progs 4.19 from Manjaro (kernel 4.14.80):
> 
> I identified bad files with
> $ find . -type f -exec cat {} > /dev/null \;
> This revealed 6 corrupted files, I deleted them. Tried it again with
> no error message.
> 
> Then I checked HDD with
> $ sudo fsck.ext4 -f /dev/sdb1
> e2fsck 1.44.4 (18-Aug-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> HDD: 763513/45744128 files (0.3% non-contiguous), 131674747/182970301 blocks
> It is mountable & usable.
> 
> Now my attempt to convert
> $ LC_ALL=en_US.utf8 sudo strace -f -s 10 -a 4 -o convert-strace.log
> btrfs-convert /dev/sdb1
> create btrfs filesystem:
> blocksize: 4096
> nodesize:  16384
> features:  extref, skinny-metadata (default)
> creating ext2 image file
> Unable to find block group for 0
> Unable to find block group for 0
> Unable to find block group for 0

It's ENOSPC.
Since it won't damage your original ext* fs, you could check whether
there is enough *contiguous* free space for btrfs to use.

Please keep in mind that, due to the nature of the ext* data layout, it may
contain a lot of small free-space fragments, and in that case
btrfs-convert may not be able to make use of them.

Or you could try disabling data csums so that btrfs needs less space and may
have a chance to complete the conversion.
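
For example, something along these lines (assuming /dev/sdb1 is still the
unconverted ext4 filesystem, and a btrfs-progs recent enough to have the
-d/--no-datasum option; e2freefrag is part of e2fsprogs):

# e2freefrag /dev/sdb1        # shows how fragmented the free space is
# btrfs-convert -d /dev/sdb1  # convert without data checksums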

Thanks,
Qu
> ctree.c:2244: split_leaf: BUG_ON `1` triggered, value 1
> btrfs-convert(+0x162d6)[0x561fbd26b2d6]
> btrfs-convert(btrfs_search_slot+0xf21)[0x561fbd26c881]
> btrfs-convert(btrfs_csum_file_block+0x499)[0x561fbd27e8e9]
> btrfs-convert(+0xe6f5)[0x561fbd2636f5]
> btrfs-convert(main+0x194f)[0x561fbd26296f]
> /usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f658a93a223]
> btrfs-convert(_start+0x2e)[0x561fbd26328e]
> Aborted
> 
> It crashed :) The log is 2.4G, I compressed it and put it at
> https://drive.google.com/drive/folders/0B5oVWFBM47D9aVpJY2s4UTdMdUU
> 
> What else can I do to debug further?
> Thanks..
> 



signature.asc
Description: OpenPGP digital signature


Re: btrfs convert problem

2018-11-17 Thread Serhat Sevki Dincer
Hi,

On my second attempt to convert my 698 GiB usb HDD from ext4 to btrfs
with btrfs-progs 4.19 from Manjaro (kernel 4.14.80):

I identified bad files with
$ find . -type f -exec cat {} > /dev/null \;
This revealed 6 corrupted files, I deleted them. Tried it again with
no error message.

Then I checked HDD with
$ sudo fsck.ext4 -f /dev/sdb1
e2fsck 1.44.4 (18-Aug-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
HDD: 763513/45744128 files (0.3% non-contiguous), 131674747/182970301 blocks
It is mountable & usable.

Now my attempt to convert
$ LC_ALL=en_US.utf8 sudo strace -f -s 10 -a 4 -o convert-strace.log
btrfs-convert /dev/sdb1
create btrfs filesystem:
blocksize: 4096
nodesize:  16384
features:  extref, skinny-metadata (default)
creating ext2 image file
Unable to find block group for 0
Unable to find block group for 0
Unable to find block group for 0
ctree.c:2244: split_leaf: BUG_ON `1` triggered, value 1
btrfs-convert(+0x162d6)[0x561fbd26b2d6]
btrfs-convert(btrfs_search_slot+0xf21)[0x561fbd26c881]
btrfs-convert(btrfs_csum_file_block+0x499)[0x561fbd27e8e9]
btrfs-convert(+0xe6f5)[0x561fbd2636f5]
btrfs-convert(main+0x194f)[0x561fbd26296f]
/usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f658a93a223]
btrfs-convert(_start+0x2e)[0x561fbd26328e]
Aborted

It crashed :) The log is 2.4G, I compressed it and put it at
https://drive.google.com/drive/folders/0B5oVWFBM47D9aVpJY2s4UTdMdUU

What else can I do to debug further?
Thanks..


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-16 Thread Chris Murphy
On Thu, Nov 15, 2018 at 10:39 AM Juan Alberto Cirez
 wrote:
>
> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?
>
> Based on our limited experience with BTRFS (1+ year) under the above
> scenario the answer seems to be no; but I wanted to ask the community
> at large for their experience before making a final decision to hold off
> on deploying BTRFS on production systems.
>
> Let us be clear: We think BTRFS has great potential, and as it matures
> we will continue to watch its progress, so that at some future point we
> can return to using it.
>
> The issue has been the myriad of problems we have encountered when
> deploying BTRFS as the storage fs for the NVR/VMS in cases were the
> camera count exceeds 10: Corrupted file systems, sudden read-only file
> system, re-balance kernel panics, broken partitions, etc.

Performance problems are separate from reliability problems. No matter
what, there shouldn't be corruptions or failures when your process is
writing through the Btrfs kernel driver. Period. So you've either got
significant hardware/firmware problems as the root cause, or your use
case is exposing Btrfs bugs.

But the burden is on you to provide sufficient details about the
hardware and storage stack configuration, including kernel and btrfs-progs
versions, mkfs options, and the mount options being used. Without a way
for a developer to reproduce your problem it's unlikely the source of
the problem can be discovered and fixed.
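
A minimal set of commands that captures most of that (mount point and device
names below are placeholders):

$ uname -r
$ btrfs --version
$ sudo btrfs filesystem show
$ sudo btrfs filesystem usage /mnt/nvr
$ grep btrfs /proc/mounts
$ sudo smartctl -a /dev/sdX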


> So, again, the question is: is BTRFS mature enough to be used in such
> use case and if so, what approach can be used to mitigate such issues.

What format are the cameras writing out in? It matters if this is a
continuous appending format, or if it's writing them out as individual
JPEG files, one per frame, or whatever. What rate, what size, and any
other concurrent operations, etc.

-- 
Chris Murphy


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-16 Thread Anand Jain




On 11/16/2018 03:51 AM, Nikolay Borisov wrote:



On 15.11.18 г. 20:39 ч., Juan Alberto Cirez wrote:

Is BTRFS mature enough to be deployed on a production system to underpin
the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?

Based on our limited experience with BTRFS (1+ year) under the above
scenario the answer seems to be no; but I wanted to ask the community
at large for their experience before making a final decision to hold off
on deploying BTRFS on production systems.

Let us be clear: We think BTRFS has great potential, and as it matures
we will continue to watch its progress, so that at some future point we
can return to using it.

The issue has been the myriad of problems we have encountered when
deploying BTRFS as the storage fs for the NVR/VMS in cases where the
camera count exceeds 10: Corrupted file systems, sudden read-only file
system, re-balance kernel panics, broken partitions, etc.


What version of the file system, how many of those issues (if any) have
you reported to the upstream mailing list? I don't remember seeing any
reports from you.



One specific case saw the btrfs drive pool mounted under the /var
partition so that upon installation the btrfs pool contained all files
under /var; /lib/mysql as well as the video storage. Needless to say
this was a catastrophe...


BTRFS is not suitable for random workloads, such as databases for
example. In those cases it's recommended, at the very least, to disable
COW on the database files. Note, disabling CoW doesn't preclude you from
creating snapshots, it just means that not every 4k write (for example)
will result in a CoW operation.



At the other end of the spectrum is a use case where ONLY the video
storage was on the btrfs pool; but in that case, the btrfs pool became
read-only suddenly and would not re-mount as rw despite all the recovery
trick btrfs-tools could throw at it. This, of course prevented the
NVR/VMS from recording any footage.


Again, you give absolutely no information about how you have configured
the filesystem or versions of the software used.



So, again, the question is: is BTRFS mature enough to be used in such
use case and if so, what approach can be used to mitigate such issues.


 And..

  What is the blocksize used by the application to write into btrfs?
  And what type of write IO? async, directio or fsync?

Thanks, Anand




Thank you all for your assistance

-Ryan-


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-16 Thread Austin S. Hemmelgarn

On 2018-11-15 13:39, Juan Alberto Cirez wrote:
Is BTRFS mature enough to be deployed on a production system to underpin 
the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?
For NVR, I'd say no.  BTRFS does pretty horribly with append-only 
workloads, even if they are WORM style.  It also does a really bad job 
with most relational database systems that you would likely use for 
indexing.


If you can explain your reasoning for wanting to use BTRFS though, I can
probably point you at alternatives that would work more reliably for
your use case.




Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-15 Thread Roman Mamedov
On Thu, 15 Nov 2018 11:39:58 -0700
Juan Alberto Cirez  wrote:

> Is BTRFS mature enough to be deployed on a production system to underpin 
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?

What are you looking to gain from using Btrfs on an NVR system? It doesn't
sound like any of its prime-time features -- such as snapshots, checksumming,
compression or reflink copying -- are a must for a bulk video recording
scenario.

And even if you don't need or use those features, you still pay the price for
having them: Btrfs can be 2x to 10x slower than its simpler competitors:
https://www.phoronix.com/scan.php?page=article=linux418-nvme-raid=2

If you just meant to use its multi-device support instead of a separate RAID
layer, IMO that can't be considered prime-time or depended upon in production
(yes, not even RAID 1 or 10).

-- 
With respect,
Roman


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-15 Thread Nikolay Borisov



On 15.11.18 г. 20:39 ч., Juan Alberto Cirez wrote:
> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?
> 
> Based on our limited experience with BTRFS (1+ year) under the above
> scenario the answer seems to be no; but I wanted to ask the community
> at large for their experience before making a final decision to hold off
> on deploying BTRFS on production systems.
> 
> Let us be clear: We think BTRFS has great potential, and as it matures
> we will continue to watch its progress, so that at some future point we
> can return to using it.
> 
> The issue has been the myriad of problems we have encountered when
> deploying BTRFS as the storage fs for the NVR/VMS in cases where the
> camera count exceeds 10: Corrupted file systems, sudden read-only file
> system, re-balance kernel panics, broken partitions, etc.

What version of the file system, how many of those issues (if any) have
you reported to the upstream mailing list? I don't remember seeing any
reports from you.

> 
> One specific case saw the btrfs drive pool mounted under the /var
> partition so that upon installation the btrfs pool contained all files
> under /var; /lib/mysql as well as the video storage. Needless to say
> this was a catastrophe...

BTRFS is not suitable for random workloads, such as databases for
example. In those cases it's recommended, at the very least, to disable
COW on the database files. Note, disabling CoW doesn't preclude you from
creating snapshots, it just means that not every 4k write (for example)
will result in a CoW operation.
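
For example (the path is illustrative; the attribute only takes effect for
newly created or empty files, so it is typically set on the directory before
the database creates its files):

# chattr +C /var/lib/mysql
# lsattr -d /var/lib/mysql    # should now show the 'C' (No_COW) flag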

> 
> At the other end of the spectrum is a use case where ONLY the video
> storage was on the btrfs pool; but in that case, the btrfs pool became
> read-only suddenly and would not re-mount as rw despite all the recovery
> trick btrfs-tools could throw at it. This, of course prevented the
> NVR/VMS from recording any footage.

Again, you give absolutely no information about how you have configured
the filesystem or versions of the software used.

> 
> So, again, the question is: is BTRFS mature enough to be used in such
> use case and if so, what approach can be used to mitigate such issues.
> 
> 
> Thank you all for your assistance
> 
> -Ryan-


Re: btrfs partition is broken, cannot restore anything

2018-11-06 Thread Attila Vangel
Hi Qu,

I created an issue for this feature request, so that it is not lost:
https://github.com/kdave/btrfs-progs/issues/153

Thanks for the help again! Now unsubscribing from this list.

Regards,
Attila

On Tue, Nov 6, 2018 at 9:23 AM Qu Wenruo  wrote:
> On 2018/11/6 下午4:17, Attila Vangel wrote:
[snip]
> > So I propose that "btrfs restore" should always print a summary line at
> > the end to make it more user friendly (e.g. for first-time users).
> > Something like this:
> >
> > x files restored in y directories[, total size of files is z <human
> > readable unit e.g. GB>]. Use the -v command line option for more
> > information.
> >
> > The part in square bracket is optional / nice to have, might be useful
> > e.g. to estimate the required free space in the target filesystem.
>
> Nice idea. Maybe we could enhance btrfs-restore to that direction.
>
> Thanks,
> Qu
[snip]


Re: btrfs partition is broken, cannot restore anything

2018-11-06 Thread Qu Wenruo


On 2018/11/6 下午4:17, Attila Vangel wrote:
> Hi Qu,
> 
> Thanks again for the help.
> In my case:
> 
> $ sudo btrfs check /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 18811453440 found E4E3BDB6 wanted 
> checksum verify failed on 18811453440 found E4E3BDB6 wanted 
> bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
> ERROR: cannot open file system

As expected, since the extent tree is corrupted, btrfs check can't help much.

This is definitely something we could enhance, although it will take a
long time for btrfs-progs to handle such a case, and then some kind of
hardcore kernel mount option for the kernel to do a pure RO mount.

> 
> (same with --mode=lowmem).
> 
> However "btrfs restore" seems to work! (did not have time for the
> actual run yet)
> 
> What confused me about "btrfs restore" is that when I ran a dry run, it
> just output about a dozen errors about some files, and I thought
> that was it, no files could be restored.
> Today I tried it with other options; in particular -v showed me that the
> majority of the files can be restored. I consider myself an
> intermediate GNU plus Linux user, yet given the stressful situation
> and being a first-time user (btrfs n00b) I missed experimenting with
> the -v option.
> 
> So I propose that "btrfs restore" should always print a summary line at
> the end to make it more user friendly (e.g. for first-time users).
> Something like this:
> 
> x files restored in y directories[, total size of files is z <human
> readable unit e.g. GB>]. Use the -v command line option for more
> information.
> 
> The part in square bracket is optional / nice to have, might be useful
> e.g. to estimate the required free space in the target filesystem.

Nice idea. Maybe we could enhance btrfs-restore to that direction.

Thanks,
Qu

> 
> Cheers,
> Attila
> 
> On Tue, Nov 6, 2018 at 2:28 AM Qu Wenruo  wrote:
>>
>>
>>
>> On 2018/11/6 上午2:01, Attila Vangel wrote:
>>> Hi,
>>>
>>> TL;DR: I want to save data from my unmountable btrfs partition.
>>> I saw some commands in another thread "Salvage files from broken btrfs".
>>> I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
>>> btrfs-progs 4.17.1-1) to execute these commands.
>>>
>>> $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
>>> mount: /mnt: wrong fs type, bad option, bad superblock on
>>> /dev/nvme0n1p2, missing codepage or helper program, or other error.
>>>
>>> Corresponding lines from dmesg:
>>>
>>> [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount 
>>> time
>>> [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
>>> [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
>>> [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
>>> want 18811453440 have 0
>>> [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: 
>>> -5
>>> [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed
>>
>> Extent tree corrupted.
>>
>> If it's the only problem, btrfs-restore should be able to salvage data.
>>
>>>
>>> $ sudo btrfs-find-root /dev/nvme0n1p2
>>
>> No, that's not what you need.
>>
>>> Superblock thinks the generation is 220524
>>> Superblock thinks the level is 1
>>> Found tree root at 25018368 gen 220524 level 1
>>> Well block 4243456(gen: 220520 level: 1) seems good, but
>>> generation/level doesn't match, want gen: 220524 level: 1
>>> Well block 5259264(gen: 220519 level: 1) seems good, but
>>> generation/level doesn't match, want gen: 220524 level: 1
>>> Well block 4866048(gen: 220518 level: 0) seems good, but
>>> generation/level doesn't match, want gen: 220524 level: 1
>>>
>>> $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
>>> superblock: bytenr=65536, device=/dev/nvme0n1p2
>> [snip]
>>>
>>> If I understood correctly, somehow it is possible to use this data to
>>> parametrize btrfs restore to save the files from the partition.
>>
>> None of the output is really helpful.
>>
>> In your case, your extent tree is corrupted, thus kernel will refuse to
>> mount (even RO).
>>
>> You should run "btrfs check" on the fs to see if btrfs can check fs tree.
>> If not, then go directly to "btrfs restore".
>>
>> Thanks,
>> Qu
>>
>>> Could you please help how to do it in this case? I am not familiar
>>> with these technical terms in the outputs.
>>> Thanks in advance!
>>>
>>> Cheers,
>>> Attila
>>>
>>> On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel  
>>> wrote:

 Hi,

 Somehow my btrfs partition got broken. I use Arch, so my kernel is
 quite new (4.18.x).
 I don't remember exactly the sequence of events. At some point it was
 accessible in read-only, but unfortunately I did not take backup
 immediately. dmesg log from that time:

 [ 62.602388] BTRFS warning (device nvme0n1p2): block group
 103923318784 has wrong amount of free space
 [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free
 space cache for block group 103923318784, 

Re: btrfs partition is broken, cannot restore anything

2018-11-06 Thread Attila Vangel
Hi Qu,

Thanks again for the help.
In my case:

$ sudo btrfs check /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
ERROR: cannot open file system

(same with --mode=lowmem).

However "btrfs restore" seems to work! (did not have time for the
actual run yet)

What confused me about "btrfs restore" is that when I ran a dry run, it
just output about a dozen errors about some files, and I thought
that was it, no files could be restored.
Today I tried it with other options; in particular -v showed me that the
majority of the files can be restored. I consider myself an
intermediate GNU plus Linux user, yet given the stressful situation
and being a first-time user (btrfs n00b) I missed experimenting with
the -v option.

So I propose that "btrfs restore" should always print a summary line at
the end to make it more user friendly (e.g. for first-time users).
Something like this:

x files restored in y directories[, total size of files is z <human
readable unit e.g. GB>]. Use the -v command line option for more
information.

The part in square bracket is optional / nice to have, might be useful
e.g. to estimate the required free space in the target filesystem.

Cheers,
Attila

On Tue, Nov 6, 2018 at 2:28 AM Qu Wenruo  wrote:
>
>
>
> On 2018/11/6 上午2:01, Attila Vangel wrote:
> > Hi,
> >
> > TL;DR: I want to save data from my unmountable btrfs partition.
> > I saw some commands in another thread "Salvage files from broken btrfs".
> > I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
> > btrfs-progs 4.17.1-1) to execute these commands.
> >
> > $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
> > mount: /mnt: wrong fs type, bad option, bad superblock on
> > /dev/nvme0n1p2, missing codepage or helper program, or other error.
> >
> > Corresponding lines from dmesg:
> >
> > [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount 
> > time
> > [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
> > [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
> > [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
> > want 18811453440 have 0
> > [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: 
> > -5
> > [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed
>
> Extent tree corrupted.
>
> If it's the only problem, btrfs-restore should be able to salvage data.
>
> >
> > $ sudo btrfs-find-root /dev/nvme0n1p2
>
> No, that's not what you need.
>
> > Superblock thinks the generation is 220524
> > Superblock thinks the level is 1
> > Found tree root at 25018368 gen 220524 level 1
> > Well block 4243456(gen: 220520 level: 1) seems good, but
> > generation/level doesn't match, want gen: 220524 level: 1
> > Well block 5259264(gen: 220519 level: 1) seems good, but
> > generation/level doesn't match, want gen: 220524 level: 1
> > Well block 4866048(gen: 220518 level: 0) seems good, but
> > generation/level doesn't match, want gen: 220524 level: 1
> >
> > $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
> > superblock: bytenr=65536, device=/dev/nvme0n1p2
> [snip]
> >
> > If I understood correctly, somehow it is possible to use this data to
> > parametrize btrfs restore to save the files from the partition.
>
> None of the output is really helpful.
>
> In your case, your extent tree is corrupted, thus kernel will refuse to
> mount (even RO).
>
> You should run "btrfs check" on the fs to see if btrfs can check fs tree.
> If not, then go directly to "btrfs restore".
>
> Thanks,
> Qu
>
> > Could you please help how to do it in this case? I am not familiar
> > with these technical terms in the outputs.
> > Thanks in advance!
> >
> > Cheers,
> > Attila
> >
> > On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel  
> > wrote:
> >>
> >> Hi,
> >>
> >> Somehow my btrfs partition got broken. I use Arch, so my kernel is
> >> quite new (4.18.x).
> >> I don't remember exactly the sequence of events. At some point it was
> >> accessible in read-only, but unfortunately I did not take backup
> >> immediately. dmesg log from that time:
> >>
> >> [ 62.602388] BTRFS warning (device nvme0n1p2): block group
> >> 103923318784 has wrong amount of free space
> >> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free
> >> space cache for block group 103923318784, rebuilding it now
> >> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 
> >> 18812026880
> >> [ 108.039227] BTRFS: error (device nvme0n1p2) in
> >> __btrfs_free_extent:7010: errno=-5 IO failure
> >> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly
> >> [ 108.039250] BTRFS: error (device nvme0n1p2) in
> >> btrfs_run_delayed_refs:3076: errno=-5 IO failure
> >>
> >> At the next reboot it failed to mount. Problem may have been that at
> >> some point I booted to another distro with older kernel 

Re: btrfs partition is broken, cannot restore anything

2018-11-05 Thread Qu Wenruo


On 2018/11/6 上午2:01, Attila Vangel wrote:
> Hi,
> 
> TL;DR: I want to save data from my unmountable btrfs partition.
> I saw some commands in another thread "Salvage files from broken btrfs".
> I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
> btrfs-progs 4.17.1-1) to execute these commands.
> 
> $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
> mount: /mnt: wrong fs type, bad option, bad superblock on
> /dev/nvme0n1p2, missing codepage or helper program, or other error.
> 
> Corresponding lines from dmesg:
> 
> [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount 
> time
> [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
> [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
> [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
> want 18811453440 have 0
> [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: -5
> [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed

Extent tree corrupted.

If it's the only problem, btrfs-restore should be able to salvage data.

> 
> $ sudo btrfs-find-root /dev/nvme0n1p2

No, that's not what you need.

> Superblock thinks the generation is 220524
> Superblock thinks the level is 1
> Found tree root at 25018368 gen 220524 level 1
> Well block 4243456(gen: 220520 level: 1) seems good, but
> generation/level doesn't match, want gen: 220524 level: 1
> Well block 5259264(gen: 220519 level: 1) seems good, but
> generation/level doesn't match, want gen: 220524 level: 1
> Well block 4866048(gen: 220518 level: 0) seems good, but
> generation/level doesn't match, want gen: 220524 level: 1
> 
> $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
> superblock: bytenr=65536, device=/dev/nvme0n1p2
[snip]
> 
> If I understood correctly, somehow it is possible to use this data to
> parametrize btrfs restore to save the files from the partition.

None of the output is really helpful.

In your case, your extent tree is corrupted, thus the kernel will refuse to
mount it (even RO).

You should run "btrfs check" on the fs to see if btrfs can check the fs tree.
If not, then go directly to "btrfs restore".
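
A rough sketch of that sequence (the restore target is a placeholder and must
be on a different, healthy filesystem with enough free space):

# btrfs check /dev/nvme0n1p2
# btrfs check --mode=lowmem /dev/nvme0n1p2
# btrfs restore -D -v /dev/nvme0n1p2 /mnt/rescue   # dry run, list what is recoverable
# btrfs restore -v /dev/nvme0n1p2 /mnt/rescue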

Thanks,
Qu

> Could you please help how to do it in this case? I am not familiar
> with these technical terms in the outputs.
> Thanks in advance!
> 
> Cheers,
> Attila
> 
> On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel  wrote:
>>
>> Hi,
>>
>> Somehow my btrfs partition got broken. I use Arch, so my kernel is
>> quite new (4.18.x).
>> I don't remember exactly the sequence of events. At some point it was
>> accessible in read-only, but unfortunately I did not take backup
>> immediately. dmesg log from that time:
>>
>> [ 62.602388] BTRFS warning (device nvme0n1p2): block group
>> 103923318784 has wrong amount of free space
>> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free
>> space cache for block group 103923318784, rebuilding it now
>> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 
>> 18812026880
>> [ 108.039227] BTRFS: error (device nvme0n1p2) in
>> __btrfs_free_extent:7010: errno=-5 IO failure
>> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly
>> [ 108.039250] BTRFS: error (device nvme0n1p2) in
>> btrfs_run_delayed_refs:3076: errno=-5 IO failure
>>
>> At the next reboot it failed to mount. Problem may have been that at
>> some point I booted to another distro with older kernel (4.15.x,
>> 4.14.52) and unfortunately attempted some checks/repairs (?) e.g. from
>> gparted, and at that time I did not know it could be destructive.
>>
>> Anyway, currently it fails to mount (even with ro and/or recovery),
>> btrfs check results in "checksum verify failed" and "bad tree block"
>> errors, btrfs restore resulted in "We have looped trying to restore
>> files in" errors for a dozen of paths then exit.
>>
>> Is there some hope to save data from the filesystem, and if so, how?
>>
>> BTW I checked some diagnostics commands regarding my SSD with the nvme
>> client and from that it seems there are no hardware problems.
>>
>> Your help is highly appreciated.
>>
>> Cheers,
>> Attila



signature.asc
Description: OpenPGP digital signature


Re: BTRFS did it's job nicely (thanks!)

2018-11-05 Thread Chris Murphy
On Mon, Nov 5, 2018 at 6:27 AM, Austin S. Hemmelgarn
 wrote:
> On 11/4/2018 11:44 AM, waxhead wrote:
>>
>> Sterling Windmill wrote:
>>>
>>> Out of curiosity, what led to you choosing RAID1 for data but RAID10
>>> for metadata?
>>>
>>> I've flip flipped between these two modes myself after finding out
>>> that BTRFS RAID10 doesn't work how I would've expected.
>>>
>>> Wondering what made you choose your configuration.
>>>
>>> Thanks!
>>> Sure,
>>
>>
>> The "RAID"1 profile for data was chosen to maximize disk space utilization
>> since I got a lot of mixed size devices.
>>
>> The "RAID"10 profile for metadata was chosen simply because it *feels* a
>> bit faster for some of my (previous) workload which was reading a lot of
>> small files (which I guess was embedded in the metadata). While I never
>> remembered that I got any measurable performance increase the system simply
>> felt smoother (which is strange since "RAID"10 should hog more disks at
>> once).
>>
>> I would love to try "RAID"10 for both data and metadata, but I have to
>> delete some files first (or add yet another drive).
>>
>> Would you like to elaborate a bit more yourself about how BTRFS "RAID"10
>> does not work as you expected?
>>
>> As far as I know BTRFS' version of "RAID"10 means it ensure 2 copies (1
>> replica) is striped over as many disks it can (as long as there is free
>> space).
>>
>> So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe
>> over (20/2) x 2 and if you run out of space on 10 of the devices it will
>> continue to stripe over (5/2) x 2. So your stripe width varies with the
>> available space essentially... I may be terribly wrong about this (until
>> someone corrects me, that is...)
>
> He's probably referring to the fact that instead of there being a roughly
> 50% chance of it surviving the failure of at least 2 devices like classical
> RAID10 is technically able to do, it's currently functionally 100% certain
> it won't survive more than one device failing.

Right. Classic RAID10 is *two block device* copies, where you have
mirror1 drives and mirror2 drives, and each mirror pair becomes a
single virtual block device that are then striped across. If you lose
a single mirror1 drive, its mirror2 data is available and
statistically unlikely to also go away.

Whereas with Btrfs raid10, it's *two block group* copies. And it is
the block group that's striped. That means block group copy 1 is
striped across 1/2 the available drives (at the time the bg is
allocated), and block group copy 2 is striped across the other drives.
When a drive dies, there is no single remaining drive that contains
all the missing copies; they're distributed. Which means that in a
two-drive failure you've got a very good chance of losing both copies of
some metadata or data, or both. While I'm not certain it's 100% not
survivable, the real gotcha is that it's possible, maybe even likely, that
it'll mount and seem to work fine, but as soon as it runs into two
missing bg's it'll face plant.


-- 
Chris Murphy


Re: btrfs partition is broken, cannot restore anything

2018-11-05 Thread Attila Vangel
Hi,

Stupid gmail put my email (or Qu's reply?) into spam, so I only saw
the reply after I sent my reply (gmail asked me whether to remove it
from spam).

Anyway here is the requested output. Thanks for the help!

$ sudo btrfs check /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
ERROR: cannot open file system

$ sudo btrfs check --mode=lowmem /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
ERROR: cannot open file system

Regards,
Attila

On Mon, Nov 5, 2018 at 6:01 PM Attila Vangel  wrote:
>
> Hi,
>
> TL;DR: I want to save data from my unmountable btrfs partition.
> I saw some commands in another thread "Salvage files from broken btrfs".
> I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
> btrfs-progs 4.17.1-1) to execute these commands.
>
> $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
> mount: /mnt: wrong fs type, bad option, bad superblock on
> /dev/nvme0n1p2, missing codepage or helper program, or other error.
>
> Corresponding lines from dmesg:
>
> [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount 
> time
> [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
> [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
> [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
> want 18811453440 have 0
> [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: -5
> [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed
>
> $ sudo btrfs-find-root /dev/nvme0n1p2
> Superblock thinks the generation is 220524
> Superblock thinks the level is 1
> Found tree root at 25018368 gen 220524 level 1
> Well block 4243456(gen: 220520 level: 1) seems good, but
> generation/level doesn't match, want gen: 220524 level: 1
> Well block 5259264(gen: 220519 level: 1) seems good, but
> generation/level doesn't match, want gen: 220524 level: 1
> Well block 4866048(gen: 220518 level: 0) seems good, but
> generation/level doesn't match, want gen: 220524 level: 1
>
> $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
> superblock: bytenr=65536, device=/dev/nvme0n1p2
> -
> csum_type                0 (crc32c)
> csum_size                4
> csum                     0x7956a931 [match]
> bytenr                   65536
> flags                    0x1
>                          ( WRITTEN )
> magic                    _BHRfS_M [match]
> fsid                     014c9d24-339c-482e-8f06-9284e4a7bc40
> label                    newhome
> generation               220524
> root                     25018368
> sys_array_size           97
> chunk_root_generation    219209
> root_level               1
> chunk_root               131072
> chunk_root_level         1
> log_root                 86818816
> log_root_transid         0
> log_root_level           0
> total_bytes              355938074624
> bytes_used               344504737792
> sectorsize               4096
> nodesize                 16384
> leafsize (deprecated)    16384
> stripesize               4096
> root_dir                 6
> num_devices              1
> compat_flags             0x0
> compat_ro_flags          0x0
> incompat_flags           0x161
> ( MIXED_BACKREF |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   SKINNY_METADATA )
> cache_generation         220524
> uuid_tree_generation     220524
> dev_item.uuid            05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3
> dev_item.fsid            014c9d24-339c-482e-8f06-9284e4a7bc40 [match]
> dev_item.type            0
> dev_item.total_bytes     355938074624
> dev_item.bytes_used      355792322560
> dev_item.io_align        4096
> dev_item.io_width        4096
> dev_item.sector_size     4096
> dev_item.devid           1
> dev_item.dev_group       0
> dev_item.seek_speed      0
> dev_item.bandwidth       0
> dev_item.generation      0
> sys_chunk_array[2048]:
> item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
> length 4194304 owner 2 stripe_len 65536 type SYSTEM
> io_align 4096 io_width 4096 sector_size 4096
> num_stripes 1 sub_stripes 0
> stripe 0 devid 1 offset 0
> dev_uuid 05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3
> backup_roots[4]:
> backup 0:
> backup_tree_root:    42598400      gen: 220522    level: 1
> backup_chunk_root:   131072        gen: 219209    level: 1
> backup_extent_root:  26460160      gen: 220522    level: 2
> backup_fs_root:      51347456      gen: 220523    level: 2
> backup_dev_root:     4472832       gen: 220520    level: 1
> backup_csum_root:    26558464      gen: 220522    level: 2
> backup_total_bytes:  355938074624
> backup_bytes_used:   344504741888
> backup_num_devices:  1
>
> backup 1:
> backup_tree_root:   

Re: btrfs partition is broken, cannot restore anything

2018-11-05 Thread Attila Vangel
Hi,

TL;DR: I want to save data from my unmountable btrfs partition.
I saw some commands in another thread "Salvage files from broken btrfs".
I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
btrfs-progs 4.17.1-1) to execute these commands.

$ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on
/dev/nvme0n1p2, missing codepage or helper program, or other error.

Corresponding lines from dmesg:

[ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount time
[ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
[ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
[ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
want 18811453440 have 0
[ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: -5
[ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed

$ sudo btrfs-find-root /dev/nvme0n1p2
Superblock thinks the generation is 220524
Superblock thinks the level is 1
Found tree root at 25018368 gen 220524 level 1
Well block 4243456(gen: 220520 level: 1) seems good, but
generation/level doesn't match, want gen: 220524 level: 1
Well block 5259264(gen: 220519 level: 1) seems good, but
generation/level doesn't match, want gen: 220524 level: 1
Well block 4866048(gen: 220518 level: 0) seems good, but
generation/level doesn't match, want gen: 220524 level: 1

$ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
superblock: bytenr=65536, device=/dev/nvme0n1p2
-
csum_type                0 (crc32c)
csum_size                4
csum                     0x7956a931 [match]
bytenr                   65536
flags                    0x1
                         ( WRITTEN )
magic                    _BHRfS_M [match]
fsid                     014c9d24-339c-482e-8f06-9284e4a7bc40
label                    newhome
generation               220524
root                     25018368
sys_array_size           97
chunk_root_generation    219209
root_level               1
chunk_root               131072
chunk_root_level         1
log_root                 86818816
log_root_transid         0
log_root_level           0
total_bytes              355938074624
bytes_used               344504737792
sectorsize               4096
nodesize                 16384
leafsize (deprecated)    16384
stripesize               4096
root_dir                 6
num_devices              1
compat_flags             0x0
compat_ro_flags          0x0
incompat_flags           0x161
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  SKINNY_METADATA )
cache_generation         220524
uuid_tree_generation     220524
dev_item.uuid            05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3
dev_item.fsid            014c9d24-339c-482e-8f06-9284e4a7bc40 [match]
dev_item.type            0
dev_item.total_bytes     355938074624
dev_item.bytes_used      355792322560
dev_item.io_align        4096
dev_item.io_width        4096
dev_item.sector_size     4096
dev_item.devid           1
dev_item.dev_group       0
dev_item.seek_speed      0
dev_item.bandwidth       0
dev_item.generation      0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
length 4194304 owner 2 stripe_len 65536 type SYSTEM
io_align 4096 io_width 4096 sector_size 4096
num_stripes 1 sub_stripes 0
stripe 0 devid 1 offset 0
dev_uuid 05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3
backup_roots[4]:
backup 0:
backup_tree_root:    42598400      gen: 220522    level: 1
backup_chunk_root:   131072        gen: 219209    level: 1
backup_extent_root:  26460160      gen: 220522    level: 2
backup_fs_root:      51347456      gen: 220523    level: 2
backup_dev_root:     4472832       gen: 220520    level: 1
backup_csum_root:    26558464      gen: 220522    level: 2
backup_total_bytes:  355938074624
backup_bytes_used:   344504741888
backup_num_devices:  1

backup 1:
backup_tree_root:    52363264      gen: 220523    level: 1
backup_chunk_root:   131072        gen: 219209    level: 1
backup_extent_root:  51806208      gen: 220523    level: 2
backup_fs_root:      51347456      gen: 220523    level: 2
backup_dev_root:     4472832       gen: 220520    level: 1
backup_csum_root:    52461568      gen: 220523    level: 2
backup_total_bytes:  355938074624
backup_bytes_used:   344504729600
backup_num_devices:  1

backup 2:
backup_tree_root:    25018368      gen: 220524    level: 1
backup_chunk_root:   131072        gen: 219209    level: 1
backup_extent_root:  21479424      gen: 220524    level: 2
backup_fs_root:      53084160      gen: 220524    level: 2
backup_dev_root:     4472832       gen: 220520    level: 1
backup_csum_root:    53379072      gen: 220524    level: 2
backup_total_bytes:  355938074624
backup_bytes_used:   344504737792
backup_num_devices:  1

backup 3:
backup_tree_root:    21921792      gen: 220521    level: 1
backup_chunk_root:

Re: BTRFS did it's job nicely (thanks!)

2018-11-05 Thread Austin S. Hemmelgarn

On 11/4/2018 11:44 AM, waxhead wrote:

Sterling Windmill wrote:

Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip flipped between these two modes myself after finding out
that BTRFS RAID10 doesn't work how I would've expected.

Wondering what made you choose your configuration.

Thanks!
Sure,


The "RAID"1 profile for data was chosen to maximize disk space 
utilization since I got a lot of mixed size devices.


The "RAID"10 profile for metadata was chosen simply because it *feels* a 
bit faster for some of my (previous) workload which was reading a lot of 
small files (which I guess was embedded in the metadata). While I never 
remembered that I got any measurable performance increase the system 
simply felt smoother (which is strange since "RAID"10 should hog more 
disks at once).


I would love to try "RAID"10 for both data and metadata, but I have to 
delete some files first (or add yet another drive).


Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 
does not work as you expected?


As far as I know BTRFS' version of "RAID"10 means it ensure 2 copies (1 
replica) is striped over as many disks it can (as long as there is free 
space).


So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe
over (20/2) x 2 and if you run out of space on 10 of the devices it will
continue to stripe over (5/2) x 2. So your stripe width varies with the
available space essentially... I may be terribly wrong about this (until
someone corrects me, that is...)
He's probably referring to the fact that instead of there being a 
roughly 50% chance of it surviving the failure of at least 2 devices 
like classical RAID10 is technically able to do, it's currently 
functionally 100% certain it won't survive more than one device failing.




Re: BTRFS did it's job nicely (thanks!)

2018-11-04 Thread waxhead

Sterling Windmill wrote:

Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip flipped between these two modes myself after finding out
that BTRFS RAID10 doesn't work how I would've expected.

Wondering what made you choose your configuration.

Thanks!
Sure,


The "RAID"1 profile for data was chosen to maximize disk space 
utilization since I got a lot of mixed size devices.


The "RAID"10 profile for metadata was chosen simply because it *feels* a 
bit faster for some of my (previous) workload which was reading a lot of 
small files (which I guess was embedded in the metadata). While I never 
remembered that I got any measurable performance increase the system 
simply felt smoother (which is strange since "RAID"10 should hog more 
disks at once).


I would love to try "RAID"10 for both data and metadata, but I have to 
delete some files first (or add yet another drive).


Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 
does not work as you expected?


As far as I know BTRFS' version of "RAID"10 means it ensure 2 copies (1 
replica) is striped over as many disks it can (as long as there is free 
space).


So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe
over (20/2) x 2 and if you run out of space on 10 of the devices it will
continue to stripe over (5/2) x 2. So your stripe width varies with the
available space essentially... I may be terribly wrong about this (until
someone corrects me, that is...)




Re: BTRFS did it's job nicely (thanks!)

2018-11-04 Thread Sterling Windmill
Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip flipped between these two modes myself after finding out
that BTRFS RAID10 doesn't work how I would've expected.

Wondering what made you choose your configuration.

Thanks!

On Fri, Nov 2, 2018 at 3:55 PM waxhead  wrote:
>
> Hi,
>
> my main computer runs on a 7x SSD BTRFS as rootfs with
> data:RAID1 and metadata:RAID10.
>
> One SSD is probably about to fail, and it seems that BTRFS fixed it
> nicely (thanks everyone!)
>
> I decided to just post the ugly details in case someone just wants to
> have a look. Note that I tend to interpret the btrfs de st / output as
> if the error was NOT fixed even if (seems clearly that) it was, so I
> think the output is a bit misleading... just saying...
>
>
>
> -- below are the details for those curious (just for fun) ---
>
> scrub status for [YOINK!]
>  scrub started at Fri Nov  2 17:49:45 2018 and finished after
> 00:29:26
>  total bytes scrubbed: 1.15TiB with 1 errors
>  error details: csum=1
>  corrected errors: 1, uncorrectable errors: 0, unverified errors: 0
>
>   btrfs fi us -T /
> Overall:
>  Device size:   1.18TiB
>  Device allocated:  1.17TiB
>  Device unallocated:9.69GiB
>  Device missing:  0.00B
>  Used:  1.17TiB
>  Free (estimated):  6.30GiB  (min: 6.30GiB)
>  Data ratio:   2.00
>  Metadata ratio:   2.00
>  Global reserve:  512.00MiB  (used: 0.00B)
>
>   Data  Metadata  System
> Id Path  RAID1 RAID10RAID10Unallocated
> -- - - - - ---
>   6 /dev/sda1 236.28GiB 704.00MiB  32.00MiB   485.00MiB
>   7 /dev/sdb1 233.72GiB   1.03GiB  32.00MiB 2.69GiB
>   2 /dev/sdc1 110.56GiB 352.00MiB -   904.00MiB
>   8 /dev/sdd1 234.96GiB   1.03GiB  32.00MiB 1.45GiB
>   1 /dev/sde1 164.90GiB   1.03GiB  32.00MiB 1.72GiB
>   9 /dev/sdf1 109.00GiB   1.03GiB  32.00MiB   744.00MiB
> 10 /dev/sdg1 107.98GiB   1.03GiB  32.00MiB 1.74GiB
> -- - - - - ---
> Total 598.70GiB   3.09GiB  96.00MiB 9.69GiB
> Used  597.25GiB   1.57GiB 128.00KiB
>
>
>
> uname -a
> Linux main 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64
> GNU/Linux
>
> btrfs --version
> btrfs-progs v4.17
>
>
> dmesg | grep -i btrfs
> [7.801817] Btrfs loaded, crc32c=crc32c-generic
> [8.163288] BTRFS: device label btrfsroot devid 10 transid 669961
> /dev/sdg1
> [8.163433] BTRFS: device label btrfsroot devid 9 transid 669961
> /dev/sdf1
> [8.163591] BTRFS: device label btrfsroot devid 1 transid 669961
> /dev/sde1
> [8.163734] BTRFS: device label btrfsroot devid 8 transid 669961
> /dev/sdd1
> [8.163974] BTRFS: device label btrfsroot devid 2 transid 669961
> /dev/sdc1
> [8.164117] BTRFS: device label btrfsroot devid 7 transid 669961
> /dev/sdb1
> [8.164262] BTRFS: device label btrfsroot devid 6 transid 669961
> /dev/sda1
> [8.206174] BTRFS info (device sde1): disk space caching is enabled
> [8.206236] BTRFS info (device sde1): has skinny extents
> [8.348610] BTRFS info (device sde1): enabling ssd optimizations
> [8.854412] BTRFS info (device sde1): enabling free space tree
> [8.854471] BTRFS info (device sde1): using free space tree
> [   68.170580] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.185973] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.185991] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186003] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186015] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186028] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186041] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186052] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186063] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.186075] BTRFS warning (device sde1): csum failed root 3760 ino
> 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2
> [   68.199237] 

Re: BTRFS did it's job nicely (thanks!)

2018-11-04 Thread waxhead

Duncan wrote:

waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:


Note that I tend to interpret the btrfs de st / output as if the error
was NOT fixed even if (seems clearly that) it was, so I think the output
is a bit misleading... just saying...


See the btrfs-device manpage, stats subcommand, -z|--reset option, and
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you
specifically reset it, *NOT* when the error is fixed.

Yes, I am perfectly aware of all that. The issue I have is that the
manpage describes corruption errors as "A block checksum mismatched or
corrupted metadata header was found". This does not tell me if this was
a permanent corruption or if it was fixed. That is why I think the
output is a bit misleading (and I should have said that more clearly).


My point being that btrfs device stats /mnt would have been a lot easier 
to read and understand if it distinguished between permanent corruption 
e.g. unfixable errors vs fixed errors.



(There's actually a recent patch, I believe in the current dev kernel
4.20/5.0, that will reset a device's stats automatically for the btrfs
replace case when it's actually a different device afterward anyway.
Apparently, it doesn't even do /that/ automatically yet.  Keep that in
mind if you replace that device.)

Oh, thanks for the heads up. I was under the impression that the device
stats were tracked by btrfs devid, but apparently they are (were) not. Good
to know!


Re: BTRFS did it's job nicely (thanks!)

2018-11-03 Thread Duncan
waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:

> Note that I tend to interpret the btrfs de st / output as if the error
> was NOT fixed even if (seems clearly that) it was, so I think the output
> is a bit misleading... just saying...

See the btrfs-device manpage, stats subcommand, -z|--reset option, and 
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related 
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you 
specifically reset it, *NOT* when the error is fixed.
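
So once you're satisfied the error was transient and has been repaired (a
clean scrub is a good indicator), something like the following prints the
counters one last time and zeroes them:

# btrfs device stats -z /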

(There's actually a recent patch, I believe in the current dev kernel 
4.20/5.0, that will reset a device's stats automatically for the btrfs 
replace case when it's actually a different device afterward anyway.  
Apparently, it doesn't even do /that/ automatically yet.  Keep that in 
mind if you replace that device.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs partition is broken, cannot restore anything

2018-11-01 Thread Qu Wenruo


On 2018/11/2 上午4:40, Attila Vangel wrote:
> Hi,
> 
> Somehow my btrfs partition got broken. I use Arch, so my kernel is
> quite new (4.18.x).
> I don't remember exactly the sequence of events. At some point it was
> accessible in read-only, but unfortunately I did not take backup
> immediately. dmesg log from that time:
> 
> [ 62.602388] BTRFS warning (device nvme0n1p2): block group
> 103923318784 has wrong amount of free space
> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free
> space cache for block group 103923318784, rebuilding it now
> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 
> 18812026880
> [ 108.039227] BTRFS: error (device nvme0n1p2) in
> __btrfs_free_extent:7010: errno=-5 IO failure

Normally we shouldn't call __btrfs_free_extent() during mount.
It looks like it's caused by log tree replay.

You could try to mount the fs with -o ro,nologreplay to see if you could
still mount the fs RO without log replay.

If it goes well, at least you could access the files.

> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly
> [ 108.039250] BTRFS: error (device nvme0n1p2) in
> btrfs_run_delayed_refs:3076: errno=-5 IO failure
> 
> At the next reboot it failed to mount. Problem may have been that at
> some point I booted to another distro with older kernel (4.15.x,
> 4.14.52) and unfortunately attempted some checks/repairs (?) e.g. from
> gparted, and at that time I did not know it could be destructive.
> Anyway, currently it fails to mount (even with ro and/or recovery),
> btrfs check results in "checksum verify failed" and "bad tree block"
> errors,

Full btrfs check result please.

> btrfs restore resulted in "We have looped trying to restore
> files in" errors for a dozen of paths then exit.

So at least btrfs-restore is working.

This normally means the fs is not completely corrupted, and essential
trees are at least OK.

Still need that "btrfs check" output (along with "btrfs check
--mode=lowmem") to determine what to do.
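
For example, run both against the unmounted device and keep the output
(file names are just examples):

  # btrfs check /dev/nvme0n1p2 2>&1 | tee check.log
  # btrfs check --mode=lowmem /dev/nvme0n1p2 2>&1 | tee check-lowmem.log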

Thanks,
Qu
> 
> Is there some hope to save data from the filesystem, and if so, how?
> 
> BTW I checked some diagnostics commands regarding my SSD with the nvme
> client and from that it seems there are no hardware problems.
> 
> Your help is highly appreciated.
> 
> Cheers,
> Attila
> 





Re: btrfs-qgroup-rescan using 100% CPU

2018-10-27 Thread Qu Wenruo


On 2018/10/28 上午6:58, Dave wrote:
> I'm using btrfs and snapper on a system with an SSD. On this system
> when I run `snapper -c root ls` (where `root` is the snapper config
> for /), the process takes a very long time and top shows the following
> process using 100% of the CPU:
> 
> kworker/u8:6+btrfs-qgroup-rescan

Not sure what snapper is doing, but it looks like snapper needs to
use btrfs qgroups.

Enabling btrfs qgroups will then trigger an initial qgroup rescan.

If you have a lot of snapshots or a lot of files, it will take a long
time to do the initial rescan.

That's the designed behavior.
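
You can confirm whether qgroups are enabled on that filesystem with
something like (mount point is just an example):

  # btrfs qgroup show /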

> 
> I have multiple computers (also with SSD's) set up the same way with
> snapper and btrfs. On the other computers, `snapper -c root ls`
> completes almost instantly, even on systems with many more snapshots.

The size of each subvolume also counts.

The time consumed by a qgroup rescan depends on the number of references.
Snapshots and reflinks all contribute to that number.

Also, large files contribute less than small files, as one large file
(128M) may contain only one reference, while 128 small files (1M each)
contain 128 references.

Snapshots are one of the heaviest workloads in this respect.
If one extent is shared between 4 snapshots, then it will cost 4 times
the CPU resources to do the rescan.

> This system has 20 total snapshots on `/`.

That already sounds like a lot, especially considering the size of the fs.

> 
> System info:
> 
> 4.18.16-arch1-1-ARCH (Arch Linux)
> btrfs-progs v4.17.1
> scrub started at Sat Oct 27 18:37:21 2018 and finished after 00:04:02
> total bytes scrubbed: 75.97GiB with 0 errors
> 
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/mapper/cryptdv  116G   77G   38G  67% /
> 
> Data, single: total=72.01GiB, used=71.38GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=3.50GiB, used=2.22GiB

From a quick glance, it indeed needs some time to do the rescan.


Also, to make sure it's not some deadlock, you could check the rescan
progress by the following command:

# btrfs ins dump-tree -t quota /dev/mapper/cryptdv

And look for the following item:

item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
version 1 generation 7 flags ON|RESCAN scan 1024

That scan number should change with rescan progress.
(That number is only updated after each transaction, so you need to wait
some time to see that change).
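
To avoid scrolling through the whole dump, something like this should
be enough to show just that item:

  # btrfs ins dump-tree -t quota /dev/mapper/cryptdv | grep -A1 QGROUP_STATUS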

Thanks,
Qu

> 
> What other info would be helpful? What troubleshooting steps should I try?
> 





Re: Btrfs resize seems to deadlock

2018-10-22 Thread Liu Bo
On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
>
> On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> >
> > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> > wrote:
> > >
> > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > forcing the system to be reset. I am not sure what the state of the
> > > filesystem really is at this point. Btrfs usage does report the
> > > correct size for after resizing. Details below:
> > >
> >
> > Thanks for the report, the stack is helpful, but this needs a few
> > deeper debugging, may I ask you to post "btrfs inspect-internal
> > dump-tree -t 1 /dev/your_btrfs_disk"?
>
> I believe it's actually easy to understand from the trace alone and
> it's kind of a bad luck scenario.
> I made this fix a few hours ago:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
>
> But haven't done full testing yet and might have missed something.
> Bo, can you take a look and let me know what you think?
>

The patch looks OK to me; it should fix the problem.

Since load_free_space_cache() is touching the root tree with
path->commit_root and skip_locking, I was wondering if we could just
provide the same approach for the free space cache inode's iget().

thanks,
liubo

> Thanks.
>
> >
> > So I'd like to know what's the height of your tree "1" which refers to
> > root tree in btrfs.
> >
> > thanks,
> > liubo
> >
> > > $ sudo btrfs filesystem usage $MOUNT
> > > Overall:
> > > Device size:  90.96TiB
> > > Device allocated: 72.62TiB
> > > Device unallocated:   18.33TiB
> > > Device missing:  0.00B
> > > Used: 72.62TiB
> > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > Data ratio:   1.00
> > > Metadata ratio:   2.00
> > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > >
> > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > $MOUNT72.46TiB
> > >
> > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > $MOUNT   172.00GiB
> > >
> > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > >$MOUNT80.00MiB
> > >
> > > Unallocated:
> > > $MOUNT18.33TiB
> > >
> > > $ uname -a
> > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > btrfs-transacti D0  2501  2 0x8000
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  transaction_kthread+0x155/0x170 [btrfs]
> > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > >  kthread+0x112/0x130
> > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > >  ret_from_fork+0x35/0x40
> > > btrfs   D0  2504   2502 0x0002
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > >  ? inode_insert5+0x119/0x190
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_iget+0x113/0x690 [btrfs]
> > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  find_free_extent+0x799/0x1010 [btrfs]
> > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > >  ? btrfs_release_path+0x13/0x80 [btrfs]
> > >  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
> > >  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
> > >  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
> > >  ? tty_write+0x1fc/0x330
> > >  ? do_vfs_ioctl+0xa4/0x620
> > >  do_vfs_ioctl+0xa4/0x620
> > >  ksys_ioctl+0x60/0x90
> > >  ? ksys_write+0x4f/0xb0
> > >  __x64_sys_ioctl+0x16/0x20
> > >  do_syscall_64+0x5b/0x160
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > RIP: 0033:0x7fcdc0d78c57
> > > Code: Bad RIP value.
> > > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 

Re: Btrfs resize seems to deadlock

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:06 AM Andrew Nelson
 wrote:
>
> OK, an update: After unmounting and running btrfs check, the drive
> reverted to reporting the old size. Not sure if this was due to
> unmounting / mounting or doing btrfs check. Btrfs check should have
> been running in readonly mode.

It reverted to the old size because the transaction used for the
resize operation never got committed due to the deadlock, not because
of 'btrfs check'.

>  Since it looked like something was
> wrong with the resize process, I patched my kernel with the posted
> patch. This time the resize operation finished successfully.

Great, thanks for testing!

> On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana  wrote:
> >
> > On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  
> > wrote:
> > >
> > > Also, is the drive in a safe state to use? Is there anything I should
> > > run on the drive to check consistency?
> >
> > It should be in a safe state. You can verify it running "btrfs check
> > /dev/" (it's a readonly operation).
> >
> > If you are able to patch and build a kernel, you can also try the
> > patch. I left it running tests overnight and haven't got any
> > regressions.
> >
> > Thanks.
> >
> > > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
> > >  wrote:
> > > >
> > > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but
> > > > the output is ~55mb. Is there something in particular you are looking
> > > > for in this?
> > > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  
> > > > wrote:
> > > > >
> > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > > > >
> > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > > > >  wrote:
> > > > > > >
> > > > > > > I am having an issue with btrfs resize in Fedora 28. I am 
> > > > > > > attempting
> > > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > > > resize max $MOUNT", the command runs for a few minutes and then 
> > > > > > > hangs
> > > > > > > forcing the system to be reset. I am not sure what the state of 
> > > > > > > the
> > > > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > > > correct size for after resizing. Details below:
> > > > > > >
> > > > > >
> > > > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > > > >
> > > > > I believe it's actually easy to understand from the trace alone and
> > > > > it's kind of a bad luck scenario.
> > > > > I made this fix a few hours ago:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > > > >
> > > > > But haven't done full testing yet and might have missed something.
> > > > > Bo, can you take a look and let me know what you think?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > So I'd like to know what's the height of your tree "1" which refers 
> > > > > > to
> > > > > > root tree in btrfs.
> > > > > >
> > > > > > thanks,
> > > > > > liubo
> > > > > >
> > > > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > > > Overall:
> > > > > > > Device size:  90.96TiB
> > > > > > > Device allocated: 72.62TiB
> > > > > > > Device unallocated:   18.33TiB
> > > > > > > Device missing:  0.00B
> > > > > > > Used: 72.62TiB
> > > > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > > > Data ratio:   1.00
> > > > > > > Metadata ratio:   2.00
> > > > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > > > >
> > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > > > > $MOUNT72.46TiB
> > > > > > >
> > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > > > > $MOUNT   172.00GiB
> > > > > > >
> > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > > > > >$MOUNT80.00MiB
> > > > > > >
> > > > > > > Unallocated:
> > > > > > > $MOUNT18.33TiB
> > > > > > >
> > > > > > > $ uname -a
> > > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon 
> > > > > > > Oct 15
> > > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > >
> > > > > > > btrfs-transacti D0  2501  2 0x8000
> > > > > > > Call Trace:
> > > > > > >  ? __schedule+0x253/0x860
> > > > > > >  schedule+0x28/0x80
> > > > > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > > > > >  ? finish_wait+0x80/0x80
> > > > > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > > > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > > > > >  kthread+0x112/0x130
> > > > > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > >  ret_from_fork+0x35/0x40
> > > > > > 

Re: Btrfs resize seems to deadlock

2018-10-22 Thread Andrew Nelson
OK, an update: After unmounting and running btrfs check, the drive
reverted to reporting the old size. Not sure if this was due to
unmounting / mounting or doing btrfs check. Btrfs check should have
been running in readonly mode. Since it looked like something was
wrong with the resize process, I patched my kernel with the posted
patch. This time the resize operation finished successfully.
On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana  wrote:
>
> On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  
> wrote:
> >
> > Also, is the drive in a safe state to use? Is there anything I should
> > run on the drive to check consistency?
>
> It should be in a safe state. You can verify it running "btrfs check
> /dev/" (it's a readonly operation).
>
> If you are able to patch and build a kernel, you can also try the
> patch. I left it running tests overnight and haven't got any
> regressions.
>
> Thanks.
>
> > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
> >  wrote:
> > >
> > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but
> > > the output is ~55mb. Is there something in particular you are looking
> > > for in this?
> > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
> > > >
> > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > > >
> > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > > >  wrote:
> > > > > >
> > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > > resize max $MOUNT", the command runs for a few minutes and then 
> > > > > > hangs
> > > > > > forcing the system to be reset. I am not sure what the state of the
> > > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > > correct size for after resizing. Details below:
> > > > > >
> > > > >
> > > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > > >
> > > > I believe it's actually easy to understand from the trace alone and
> > > > it's kind of a bad luck scenario.
> > > > I made this fix a few hours ago:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > > >
> > > > But haven't done full testing yet and might have missed something.
> > > > Bo, can you take a look and let me know what you think?
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > So I'd like to know what's the height of your tree "1" which refers to
> > > > > root tree in btrfs.
> > > > >
> > > > > thanks,
> > > > > liubo
> > > > >
> > > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > > Overall:
> > > > > > Device size:  90.96TiB
> > > > > > Device allocated: 72.62TiB
> > > > > > Device unallocated:   18.33TiB
> > > > > > Device missing:  0.00B
> > > > > > Used: 72.62TiB
> > > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > > Data ratio:   1.00
> > > > > > Metadata ratio:   2.00
> > > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > > >
> > > > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > > > $MOUNT72.46TiB
> > > > > >
> > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > > > $MOUNT   172.00GiB
> > > > > >
> > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > > > >$MOUNT80.00MiB
> > > > > >
> > > > > > Unallocated:
> > > > > > $MOUNT18.33TiB
> > > > > >
> > > > > > $ uname -a
> > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 
> > > > > > 15
> > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > > > >
> > > > > > btrfs-transacti D0  2501  2 0x8000
> > > > > > Call Trace:
> > > > > >  ? __schedule+0x253/0x860
> > > > > >  schedule+0x28/0x80
> > > > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > > > >  ? finish_wait+0x80/0x80
> > > > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > > > >  kthread+0x112/0x130
> > > > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > >  ret_from_fork+0x35/0x40
> > > > > > btrfs   D0  2504   2502 0x0002
> > > > > > Call Trace:
> > > > > >  ? __schedule+0x253/0x860
> > > > > >  schedule+0x28/0x80
> > > > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > > > >  ? finish_wait+0x80/0x80
> > > > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > > > >  ? inode_insert5+0x119/0x190
> > > > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > > > >  ? 

Re: Btrfs resize seems to deadlock

2018-10-21 Thread Filipe Manana
On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  wrote:
>
> Also, is the drive in a safe state to use? Is there anything I should
> run on the drive to check consistency?

It should be in a safe state. You can verify it running "btrfs check
/dev/" (it's a readonly operation).

If you are able to patch and build a kernel, you can also try the
patch. I left it running tests overnight and haven't got any
regressions.

Thanks.

> On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
>  wrote:
> >
> > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but
> > the output is ~55mb. Is there something in particular you are looking
> > for in this?
> > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
> > >
> > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > >
> > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > >  wrote:
> > > > >
> > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > > > forcing the system to be reset. I am not sure what the state of the
> > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > correct size for after resizing. Details below:
> > > > >
> > > >
> > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > >
> > > I believe it's actually easy to understand from the trace alone and
> > > it's kind of a bad luck scenario.
> > > I made this fix a few hours ago:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > >
> > > But haven't done full testing yet and might have missed something.
> > > Bo, can you take a look and let me know what you think?
> > >
> > > Thanks.
> > >
> > > >
> > > > So I'd like to know what's the height of your tree "1" which refers to
> > > > root tree in btrfs.
> > > >
> > > > thanks,
> > > > liubo
> > > >
> > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > Overall:
> > > > > Device size:  90.96TiB
> > > > > Device allocated: 72.62TiB
> > > > > Device unallocated:   18.33TiB
> > > > > Device missing:  0.00B
> > > > > Used: 72.62TiB
> > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > Data ratio:   1.00
> > > > > Metadata ratio:   2.00
> > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > >
> > > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > > $MOUNT72.46TiB
> > > > >
> > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > > $MOUNT   172.00GiB
> > > > >
> > > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > > >$MOUNT80.00MiB
> > > > >
> > > > > Unallocated:
> > > > > $MOUNT18.33TiB
> > > > >
> > > > > $ uname -a
> > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > > >
> > > > > btrfs-transacti D0  2501  2 0x8000
> > > > > Call Trace:
> > > > >  ? __schedule+0x253/0x860
> > > > >  schedule+0x28/0x80
> > > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > > >  ? finish_wait+0x80/0x80
> > > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > > >  kthread+0x112/0x130
> > > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > >  ret_from_fork+0x35/0x40
> > > > > btrfs   D0  2504   2502 0x0002
> > > > > Call Trace:
> > > > >  ? __schedule+0x253/0x860
> > > > >  schedule+0x28/0x80
> > > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > > >  ? finish_wait+0x80/0x80
> > > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > > >  ? inode_insert5+0x119/0x190
> > > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > >  btrfs_iget+0x113/0x690 [btrfs]
> > > > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > > > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > > > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > > > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > > > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > > > >  ? finish_wait+0x80/0x80
> > > > >  find_free_extent+0x799/0x1010 [btrfs]
> > > > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > > > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > > > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > > > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > > > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > > > >  btrfs_lookup_inode+0x3a/0xc0 

Re: Btrfs resize seems to deadlock

2018-10-20 Thread Andrew Nelson
Also, is the drive in a safe state to use? Is there anything I should
run on the drive to check consistency?
On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
 wrote:
>
> I have ran the "btrfs inspect-internal dump-tree -t 1" command, but
> the output is ~55mb. Is there something in particular you are looking
> for in this?
> On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
> >
> > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > >
> > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> > > wrote:
> > > >
> > > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > > forcing the system to be reset. I am not sure what the state of the
> > > > filesystem really is at this point. Btrfs usage does report the
> > > > correct size for after resizing. Details below:
> > > >
> > >
> > > Thanks for the report, the stack is helpful, but this needs a few
> > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > dump-tree -t 1 /dev/your_btrfs_disk"?
> >
> > I believe it's actually easy to understand from the trace alone and
> > it's kind of a bad luck scenario.
> > I made this fix a few hours ago:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> >
> > But haven't done full testing yet and might have missed something.
> > Bo, can you take a look and let me know what you think?
> >
> > Thanks.
> >
> > >
> > > So I'd like to know what's the height of your tree "1" which refers to
> > > root tree in btrfs.
> > >
> > > thanks,
> > > liubo
> > >
> > > > $ sudo btrfs filesystem usage $MOUNT
> > > > Overall:
> > > > Device size:  90.96TiB
> > > > Device allocated: 72.62TiB
> > > > Device unallocated:   18.33TiB
> > > > Device missing:  0.00B
> > > > Used: 72.62TiB
> > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > Data ratio:   1.00
> > > > Metadata ratio:   2.00
> > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > >
> > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > $MOUNT72.46TiB
> > > >
> > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > $MOUNT   172.00GiB
> > > >
> > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > >$MOUNT80.00MiB
> > > >
> > > > Unallocated:
> > > > $MOUNT18.33TiB
> > > >
> > > > $ uname -a
> > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > >
> > > > btrfs-transacti D0  2501  2 0x8000
> > > > Call Trace:
> > > >  ? __schedule+0x253/0x860
> > > >  schedule+0x28/0x80
> > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > >  kthread+0x112/0x130
> > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > >  ret_from_fork+0x35/0x40
> > > > btrfs   D0  2504   2502 0x0002
> > > > Call Trace:
> > > >  ? __schedule+0x253/0x860
> > > >  schedule+0x28/0x80
> > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > >  ? inode_insert5+0x119/0x190
> > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  btrfs_iget+0x113/0x690 [btrfs]
> > > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  find_free_extent+0x799/0x1010 [btrfs]
> > > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > > >  ? btrfs_release_path+0x13/0x80 [btrfs]
> > > >  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
> > > >  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
> > > >  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
> > > >  ? tty_write+0x1fc/0x330
> > > >  ? do_vfs_ioctl+0xa4/0x620
> > > >  

Re: Btrfs resize seems to deadlock

2018-10-20 Thread Andrew Nelson
I have run the "btrfs inspect-internal dump-tree -t 1" command, but
the output is ~55 MB. Is there something in particular you are looking
for in this?
On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
>
> On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> >
> > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> > wrote:
> > >
> > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > forcing the system to be reset. I am not sure what the state of the
> > > filesystem really is at this point. Btrfs usage does report the
> > > correct size for after resizing. Details below:
> > >
> >
> > Thanks for the report, the stack is helpful, but this needs a few
> > deeper debugging, may I ask you to post "btrfs inspect-internal
> > dump-tree -t 1 /dev/your_btrfs_disk"?
>
> I believe it's actually easy to understand from the trace alone and
> it's kind of a bad luck scenario.
> I made this fix a few hours ago:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
>
> But haven't done full testing yet and might have missed something.
> Bo, can you take a look and let me know what you think?
>
> Thanks.
>
> >
> > So I'd like to know what's the height of your tree "1" which refers to
> > root tree in btrfs.
> >
> > thanks,
> > liubo
> >
> > > $ sudo btrfs filesystem usage $MOUNT
> > > Overall:
> > > Device size:  90.96TiB
> > > Device allocated: 72.62TiB
> > > Device unallocated:   18.33TiB
> > > Device missing:  0.00B
> > > Used: 72.62TiB
> > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > Data ratio:   1.00
> > > Metadata ratio:   2.00
> > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > >
> > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > $MOUNT72.46TiB
> > >
> > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > $MOUNT   172.00GiB
> > >
> > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > >$MOUNT80.00MiB
> > >
> > > Unallocated:
> > > $MOUNT18.33TiB
> > >
> > > $ uname -a
> > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > btrfs-transacti D0  2501  2 0x8000
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  transaction_kthread+0x155/0x170 [btrfs]
> > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > >  kthread+0x112/0x130
> > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > >  ret_from_fork+0x35/0x40
> > > btrfs   D0  2504   2502 0x0002
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > >  ? inode_insert5+0x119/0x190
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_iget+0x113/0x690 [btrfs]
> > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  find_free_extent+0x799/0x1010 [btrfs]
> > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > >  ? btrfs_release_path+0x13/0x80 [btrfs]
> > >  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
> > >  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
> > >  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
> > >  ? tty_write+0x1fc/0x330
> > >  ? do_vfs_ioctl+0xa4/0x620
> > >  do_vfs_ioctl+0xa4/0x620
> > >  ksys_ioctl+0x60/0x90
> > >  ? ksys_write+0x4f/0xb0
> > >  __x64_sys_ioctl+0x16/0x20
> > >  do_syscall_64+0x5b/0x160
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > RIP: 0033:0x7fcdc0d78c57
> > > Code: Bad RIP value.
> > > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010
> > > RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57
> > > RDX: 

Re: Btrfs resize seems to deadlock

2018-10-20 Thread Filipe Manana
On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
>
> On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> wrote:
> >
> > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > resize max $MOUNT", the command runs for a few minutes and then hangs
> > forcing the system to be reset. I am not sure what the state of the
> > filesystem really is at this point. Btrfs usage does report the
> > correct size for after resizing. Details below:
> >
>
> Thanks for the report, the stack is helpful, but this needs a few
> deeper debugging, may I ask you to post "btrfs inspect-internal
> dump-tree -t 1 /dev/your_btrfs_disk"?

I believe it's actually easy to understand from the trace alone and
it's kind of a bad luck scenario.
I made this fix a few hours ago:

https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock

But haven't done full testing yet and might have missed something.
Bo, can you take a look and let me know what you think?

Thanks.

>
> So I'd like to know what's the height of your tree "1" which refers to
> root tree in btrfs.
>
> thanks,
> liubo
>
> > $ sudo btrfs filesystem usage $MOUNT
> > Overall:
> > Device size:  90.96TiB
> > Device allocated: 72.62TiB
> > Device unallocated:   18.33TiB
> > Device missing:  0.00B
> > Used: 72.62TiB
> > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > Data ratio:   1.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 24.11MiB)
> >
> > Data,single: Size:72.46TiB, Used:72.45TiB
> > $MOUNT72.46TiB
> >
> > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > $MOUNT   172.00GiB
> >
> > System,DUP: Size:40.00MiB, Used:7.53MiB
> >$MOUNT80.00MiB
> >
> > Unallocated:
> > $MOUNT18.33TiB
> >
> > $ uname -a
> > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > btrfs-transacti D0  2501  2 0x8000
> > Call Trace:
> >  ? __schedule+0x253/0x860
> >  schedule+0x28/0x80
> >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  ? join_transaction+0x22/0x3e0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  transaction_kthread+0x155/0x170 [btrfs]
> >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> >  kthread+0x112/0x130
> >  ? kthread_create_worker_on_cpu+0x70/0x70
> >  ret_from_fork+0x35/0x40
> > btrfs   D0  2504   2502 0x0002
> > Call Trace:
> >  ? __schedule+0x253/0x860
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >  ? btrfs_release_path+0x13/0x80 [btrfs]
> >  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
> >  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
> >  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
> >  ? tty_write+0x1fc/0x330
> >  ? do_vfs_ioctl+0xa4/0x620
> >  do_vfs_ioctl+0xa4/0x620
> >  ksys_ioctl+0x60/0x90
> >  ? ksys_write+0x4f/0xb0
> >  __x64_sys_ioctl+0x16/0x20
> >  do_syscall_64+0x5b/0x160
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x7fcdc0d78c57
> > Code: Bad RIP value.
> > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010
> > RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57
> > RDX: 7ffdd1ee6da0 RSI: 50009403 RDI: 0003
> > RBP: 7ffdd1ee6da0 R08:  R09: 7ffdd1ee67e0
> > R10:  R11: 0246 R12: 7ffdd1ee888e
> > R13: 0003 R14:  R15: 
> > kworker/u48:1   D0  2505  2 0x8000
> > Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs]
> > Call Trace:
> >  ? __schedule+0x253/0x860
> >  

Re: Btrfs resize seems to deadlock

2018-10-20 Thread Liu Bo
On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  wrote:
>
> I am having an issue with btrfs resize in Fedora 28. I am attempting
> to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> resize max $MOUNT", the command runs for a few minutes and then hangs
> forcing the system to be reset. I am not sure what the state of the
> filesystem really is at this point. Btrfs usage does report the
> correct size for after resizing. Details below:
>

Thanks for the report.  The stack is helpful, but this needs some
deeper debugging; may I ask you to post the output of "btrfs
inspect-internal dump-tree -t 1 /dev/your_btrfs_disk"?

I'd like to know the height of your tree "1", which refers to the
root tree in btrfs.
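
If the full dump turns out to be too big to post, the level of the top
node should already show up within the first few lines, so something
like this (the device path is just a placeholder) would be enough for
this question:

  # btrfs inspect-internal dump-tree -t 1 /dev/your_btrfs_disk | head -n 5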

thanks,
liubo

> $ sudo btrfs filesystem usage $MOUNT
> Overall:
> Device size:  90.96TiB
> Device allocated: 72.62TiB
> Device unallocated:   18.33TiB
> Device missing:  0.00B
> Used: 72.62TiB
> Free (estimated): 18.34TiB  (min: 9.17TiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 24.11MiB)
>
> Data,single: Size:72.46TiB, Used:72.45TiB
> $MOUNT72.46TiB
>
> Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> $MOUNT   172.00GiB
>
> System,DUP: Size:40.00MiB, Used:7.53MiB
>$MOUNT80.00MiB
>
> Unallocated:
> $MOUNT18.33TiB
>
> $ uname -a
> Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> btrfs-transacti D0  2501  2 0x8000
> Call Trace:
>  ? __schedule+0x253/0x860
>  schedule+0x28/0x80
>  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  ? join_transaction+0x22/0x3e0 [btrfs]
>  ? finish_wait+0x80/0x80
>  transaction_kthread+0x155/0x170 [btrfs]
>  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
>  kthread+0x112/0x130
>  ? kthread_create_worker_on_cpu+0x70/0x70
>  ret_from_fork+0x35/0x40
> btrfs   D0  2504   2502 0x0002
> Call Trace:
>  ? __schedule+0x253/0x860
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>  ? btrfs_release_path+0x13/0x80 [btrfs]
>  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
>  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
>  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
>  ? tty_write+0x1fc/0x330
>  ? do_vfs_ioctl+0xa4/0x620
>  do_vfs_ioctl+0xa4/0x620
>  ksys_ioctl+0x60/0x90
>  ? ksys_write+0x4f/0xb0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x5b/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fcdc0d78c57
> Code: Bad RIP value.
> RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57
> RDX: 7ffdd1ee6da0 RSI: 50009403 RDI: 0003
> RBP: 7ffdd1ee6da0 R08:  R09: 7ffdd1ee67e0
> R10:  R11: 0246 R12: 7ffdd1ee888e
> R13: 0003 R14:  R15: 
> kworker/u48:1   D0  2505  2 0x8000
> Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs]
> Call Trace:
>  ? __schedule+0x253/0x860
>  schedule+0x28/0x80
>  btrfs_tree_lock+0x130/0x210 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_search_slot+0x775/0x9f0 [btrfs]
>  btrfs_mark_extent_written+0xb0/0xb20 [btrfs]
>  ? join_transaction+0x22/0x3e0 [btrfs]
>  btrfs_finish_ordered_io+0x2e0/0x7e0 [btrfs]
>  ? syscall_return_via_sysret+0x13/0x83
>  ? __switch_to_asm+0x34/0x70
>  normal_work_helper+0xd3/0x2f0 [btrfs]
>  process_one_work+0x1a1/0x350
>  worker_thread+0x30/0x380
>  ? pwq_unbound_release_workfn+0xd0/0xd0
>  kthread+0x112/0x130
>  ? kthread_create_worker_on_cpu+0x70/0x70
>  ret_from_fork+0x35/0x40


Re: btrfs send receive: No space left on device

2018-10-17 Thread Henk Slager
On Wed, Oct 17, 2018 at 10:29 AM Libor Klepáč  wrote:
>
> Hello,
> i have new 32GB SSD in my intel nuc, installed debian9 on it, using btrfs as 
> a rootfs.
> Then i created subvolumes /system and /home and moved system there.
>
> System was installed using kernel 4.9.x and filesystem created using 
> btrfs-progs 4.7.x
> Details follow:
> main filesystem
>
> # btrfs filesystem usage /mnt/btrfs/ssd/
> Overall:
> Device size:  29.08GiB
> Device allocated:  4.28GiB
> Device unallocated:   24.80GiB
> Device missing:  0.00B
> Used:  2.54GiB
> Free (estimated): 26.32GiB  (min: 26.32GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:   16.00MiB  (used: 0.00B)
>
> Data,single: Size:4.00GiB, Used:2.48GiB
>/dev/sda3   4.00GiB
>
> Metadata,single: Size:256.00MiB, Used:61.05MiB
>/dev/sda3 256.00MiB
>
> System,single: Size:32.00MiB, Used:16.00KiB
>/dev/sda3  32.00MiB
>
> Unallocated:
>/dev/sda3  24.80GiB
>
> #/etc/fstab
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /mnt/btrfs/ssd  btrfs 
> noatime,space_cache=v2,compress=lzo,commit=300,subvolid=5 0   0
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /   btrfs 
>   noatime,space_cache=v2,compress=lzo,commit=300,subvol=/system 0 
>   0
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /home   btrfs 
>   noatime,space_cache=v2,compress=lzo,commit=300,subvol=/home 0   0
>
> -
> Then i installed kernel from backports:
> 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1
> and btrfs-progs 4.17
>
> For backups , i have created 16GB iscsi device on my qnap and mounted it, 
> created filesystem, mounted like this:
> LABEL=backup/mnt/btrfs/backup   btrfs 
>   noatime,space_cache=v2,compress=lzo,subvolid=5,nofail,noauto 0  
>  0
>
> After send-receive operation on /home subvolume, usage looks like this:
>
> # btrfs filesystem usage /mnt/btrfs/backup/
> Overall:
> Device size:  16.00GiB
> Device allocated:  1.27GiB
> Device unallocated:   14.73GiB
> Device missing:  0.00B
> Used:844.18MiB
> Free (estimated): 14.92GiB  (min: 14.92GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:   16.00MiB  (used: 0.00B)
>
> Data,single: Size:1.01GiB, Used:833.36MiB
>/dev/sdb1.01GiB
>
> Metadata,single: Size:264.00MiB, Used:10.80MiB
>/dev/sdb  264.00MiB
>
> System,single: Size:4.00MiB, Used:16.00KiB
>/dev/sdb4.00MiB
>
> Unallocated:
>/dev/sdb   14.73GiB
>
>
> Problem is, during send-receive of system subvolume, it runs out of space:
>
> # btrbk run /mnt/btrfs/ssd/system/ -v
> btrbk command line client, version 0.26.1  (Wed Oct 17 09:51:20 2018)
> Using configuration: /etc/btrbk/btrbk.conf
> Using transaction log: /var/log/btrbk.log
> Creating subvolume snapshot for: /mnt/btrfs/ssd/system
> [snapshot] source: /mnt/btrfs/ssd/system
> [snapshot] target: /mnt/btrfs/ssd/_snapshots/system.20181017T0951
> Checking for missing backups of subvolume "/mnt/btrfs/ssd/system" in 
> "/mnt/btrfs/backup/"
> Creating subvolume backup (send-receive) for: 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> No common parent subvolume present, creating full backup...
> [send/receive] source: /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> [send/receive] target: /mnt/btrfs/backup/system.20181016T2034
> mbuffer: error: outputThread: error writing to  at offset 0x4b5bd000: 
> Broken pipe
> mbuffer: warning: error during output to : Broken pipe
> WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
> receive=/mnt/btrfs/backup) At subvol 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
> receive=/mnt/btrfs/backup) At subvol system.20181016T2034
> ERROR: rename o77417-5519-0 -> 
> lib/modules/4.18.0-0.bpo.1-amd64/kernel/drivers/watchdog/pcwd_pci.ko failed: 
> No space left on device
> ERROR: Failed to send/receive btrfs subvolume: 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034  -> /mnt/btrfs/backup
> [delete] options: commit-after
> [delete] target: /mnt/btrfs/backup/system.20181016T2034
> WARNING: Deleted partially received (garbled) subvolume: 
> /mnt/btrfs/backup/system.20181016T2034
> ERROR: Error while resuming backups, aborting
> Created 0/2 missing backups
> WARNING: Skipping cleanup of snapshots for subvolume "/mnt/btrfs/ssd/system", 
> as at least one target aborted earlier
> Completed within: 116s  (Wed Oct 17 09:53:16 2018)
> 

Re: btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Qu Wenruo


On 2018/10/16 下午11:25, Anton Shepelev wrote:
> Qu Wenruo to Anton Shepelev:
> 
>>> On all our servers with BTRFS, which are otherwise working
>>> normally, `btrfs check /' complains that
>>>
>>> Superblock bytenr is larger than device size
>>> Couldn't open file system
>>>
>> Please try latest btrfs-progs and see if btrfs check
>> reports any error.
> 
> It is SUSE Linux Enterprise with its native repository, so I
> cannot update btrfs easily.  I will, though.

Then I recommend using the latest openSUSE Tumbleweed rescue ISO to do the
mount and check.
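
From that rescue environment, something like this should be enough
(the device name is just a placeholder):

  # btrfs version
  # btrfs check /dev/sdX2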

Thanks,
Qu

> 





Re: btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Anton Shepelev
Qu Wenruo to Anton Shepelev:

>>On all our servers with BTRFS, which are otherwise working
>>normally, `btrfs check /' complains that
>>
>>Superblock bytenr is larger than device size
>>Couldn't open file system
>>
>Please try latest btrfs-progs and see if btrfs check
>reports any error.

It is SUSE Linux Enterprise with its native repository, so I
cannot update btrfs easily.  I will, though.

-- 
()  ascii ribbon campaign - against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]


Re: btrfs check: Superblock bytenr is larger than device size

2018-10-16 Thread Qu Wenruo


On 2018/10/16 下午10:05, Anton Shepelev wrote:
> Hello, all
> 
> On all our servers with BTRFS, which are otherwise working
> normally, `btrfs check /' complains that

Btrfs check shouldn't continue when given a mount point.

Latest one would report error like:

  Opening filesystem to check...
  ERROR: not a regular file or block device: /mnt/btrfs
  ERROR: cannot open file system


> 
>Superblock bytenr is larger than device size

This shouldn't be a big problem; it's normally related to unaligned numbers
and an older kernel/btrfs-progs.

Please try latest btrfs-progs and see if btrfs check reports any error.

>Couldn't open file system
> 
> Since am not using any "dangerous" options and want merely
> to analyse the system for errors without any modifications,
> I don't think I must unount my FS, which in my case would
> mean booting from a live CD.  What may be causeing this
> error?

Normally old mkfs or old kernel.

The latest btrfs check result would definitely help to solve the problem.
And the following results will also help (they can be dumped even with the
fs mounted, but also need the latest btrfs-progs):

  # btrfs ins dump-super -FfA 
  # btrfs ins dump-tree -t chunk 

Thanks,
Qu

> 





Re: BTRFS bad block management. Does it exist?

2018-10-16 Thread Anand Jain





On 10/14/2018 07:08 PM, waxhead wrote:

In case BTRFS fails to WRITE to a disk. What happens?



Does the bad area get mapped out somehow?


There was a proposed patch, but it's not convincing: disks do the bad 
block relocation transparently to the host, and if a disk runs out of its 
reserved list then it's probably time to replace the disk, as in my 
experience the disk will have failed for other, non-media errors before 
it runs out of the reserved list, in which case host-performed relocation 
won't help. Furthermore, being at the file-system level you won't be able 
to accurately determine whether a block write failed because of a bad 
media error rather than, say, a fault in the target circuitry.


Does it try again until it 
succeed or



until it "times out" or reach a threshold counter?


Block IO timeout and retry are properties of the block layer; depending 
on the type of error, the retry should happen there.


The SD module already does 5 retries (when failfast is not set); it 
should be tune-able. And I think there was a patch for that in the ML.
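
(As an aside, the block-layer command timeout is already tunable at 
runtime via sysfs; a rough example, assuming a SCSI disk at /dev/sda:

  # cat /sys/block/sda/device/timeout
  # echo 60 > /sys/block/sda/device/timeout

The retry count itself, though, is a compile-time constant in the sd 
driver.)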


We had few discussion on the retry part in the past. [1]
[1]
https://www.spinics.net/lists/linux-btrfs/msg70240.html
https://www.spinics.net/lists/linux-btrfs/msg71779.html


Does it eventually try to write to a different disk (in case of using 
the raid1/10 profile?)


When there is a mirror copy it does not go into RO mode, and it leaves 
write hole(s) scattered across transactions, as we don't fail the disk at 
the first failed transaction. That means that if a disk is at the nth 
transaction per its super-block, it's not guaranteed that all previous 
transactions have made it to that disk successfully in mirrored configs. I 
consider this a bug. And there is a danger that it may read junk data, 
which is hard but not impossible to hit due to our unreasonable (there is 
a patch in the ML to address that as well) hard-coded pid-based 
read-mirror policy.


I sent a patch to fail the disk when the first write fails, so that we 
know the last good integrity of the FS based on the transaction id. That 
was a long time back; I still believe it's an important patch. There 
weren't enough comments, I guess, for it to go into the next step.


The current solution is to replace the offending disk _without_ reading 
from it, to get a good recovery from the failed disk. As data centers 
can't rely on admin-initiated manual recovery, there is also a patch to do 
this automatically using the auto-replace feature; patches are in the ML. 
Again, there weren't enough comments, I guess, for it to go into the next 
step.
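
For reference, a replace that avoids reading from the failing disk can 
already be done manually today, roughly like this (devid and device names 
are placeholders):

  # btrfs replace start -r 2 /dev/sdnew /mnt

where -r tells replace to read from the source device only if no other 
good mirror exists.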


Thanks, Anand


Re: BTRFS bad block management. Does it exist?

2018-10-15 Thread Austin S. Hemmelgarn

On 2018-10-14 07:08, waxhead wrote:

In case BTRFS fails to WRITE to a disk. What happens?
Does the bad area get mapped out somehow? Does it try again until it 
succeed or until it "times out" or reach a threshold counter?
Does it eventually try to write to a different disk (in case of using 
the raid1/10 profile?)


Building on Qu's answer (which is absolutely correct), BTRFS makes the 
perfectly reasonable assumption that you're not trying to use known bad 
hardware.  It's not alone in this respect either, pretty much every 
Linux filesystem makes the exact same assumption (and almost all 
non-Linux ones too), because it really is a perfectly reasonable 
assumption.  The only exception is ext[234], but they only support it 
statically (you can set the bad block list at mkfs time, but not 
afterwards, and they don't update it at runtime), and it's a holdover 
from earlier filesystems which originated at a time when storage was 
sufficiently expensive _and_ unreliable that you kept using disks until 
they were essentially completely dead.


The reality is that with modern storage hardware, if you have 
persistently bad sectors the device is either defective (and should be 
returned under warranty), or it's beyond expected EOL (and should just 
be replaced).  Most people know about SSD's doing block remapping to 
avoid bad blocks, but hard drives do it too, and they're actually rather 
good at it.  In both cases, enough spare blocks are provided that the 
device can handle average rates of media errors through the entirety of 
its average life expectancy without running out of spare blocks.


On top of all of that though, it's fully possible to work around bad 
blocks in the block layer if you take the time to actually do it.  With 
a bit of reasonably simple math, you can easily set up an LVM volume 
that actively avoids all the bad blocks on a disk while still fully 
utilizing the rest of the volume.  Similarly, with a bit of work (and a 
partition table that supports _lots_ of partitions) you can work around 
bad blocks with an MD concatenated device.
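
A rough sketch of the LVM approach (device name and extent numbers are 
made up; assume the bad area maps to physical extents 500-519):

  # pvcreate /dev/sdb
  # vgcreate vg_scratch /dev/sdb
  # lvcreate -n lv_data -l 1000 vg_scratch /dev/sdb:0-499 /dev/sdb:520-1019

That is, you hand lvcreate explicit physical extent ranges that skip the 
bad region and then put the filesystem on the resulting LV.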


Re: BTRFS bad block management. Does it exist?

2018-10-14 Thread Qu Wenruo


On 2018/10/14 下午7:08, waxhead wrote:
> In case BTRFS fails to WRITE to a disk. What happens?

Normally it should return an error when we flush the disk.
And in that case, the error leads to a transaction abort and the fs goes
RO to prevent further corruption.

> Does the bad area get mapped out somehow?

No.

> Does it try again until it
> succeed or until it "times out" or reach a threshold counter?

Unless it's done by the block layer, btrfs doesn't retry.

> Does it eventually try to write to a different disk (in case of using
> the raid1/10 profile?)

No. That's not what RAID is designed to do.

Flush errors are only tolerated if using the "degraded" mount option and
the errors are within tolerance.
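
A hedged example of a degraded mount (device and mount point are
placeholders):

  # mount -o degraded /dev/sdb /mnt

The writable mount is refused if the number of missing devices exceeds
what the profile can tolerate.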

Thanks,
Qu





Re: btrfs send receive ERROR: chown failed: No such file or directory

2018-10-11 Thread Leonard Lausen
Hey Filipe,

thanks for the feedback. I ran the command again with -vv. Below are
the last commands logged by btrfs receive to stderr:

  mkfile o2138798-5016457-0
  rename 
leonard/mail/lists/emacs-orgmode/new/1530428589.M675528862P21583Q6R28ec1af3.leonard-xps13
 -> o2138802-5207521-0
  rename o2138798-5016457-0 -> 
leonard/mail/lists/emacs-orgmode/new/1530428589.M675528862P21583Q6R28ec1af3.leonard-xps13
  utimes leonard/mail/lists/emacs-orgmode/new
  ERROR: cannot open 
/mnt/_btrbk_backups/leonard-xps13/@home.20180902T0045/o2138798-5016457-0: No 
such file or directory

The file mentioned above has in fact not changed between the parent and
current snapshot. I further computed a hash of the file on the source
volume for both the parent and the current snapshot, and on the target
volume for the parent snapshot. I get the same hash value in all 3 cases,
so the parent snapshot was apparently transferred correctly.

Note that the folder in which this file is placed contains 96558 files
whose sizes sum to 774M.

I further observed that the snapshot that previously failed to transfer
via send receive could now be sent / received without error. However,
the error then occurred again for the next snapshot (which has the now
successfully transferred one as parent). There may be a race condition
that leads to non-deterministic failures.

Please let me know if I can help with any further information.

Best regards
Leonard

Filipe Manana  writes:

>> > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs
>> >btrfs send ... | ssh user@machine2 "btrfs receive /path1"
>> > 2) Machine 2 backups all subvolumes stored at /path1 to a second
>> >independent btrfs filesystem. Let /path1/rootsnapshot be the first
>> >snapshot stored at /path1 (ie. it has no Parent UUID). Let
>> >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot
>> >as a parent. Then
>> >btrfs send -v /path1/rootsnapshot | btrfs receive /path2
>> >works without issues, but
>> >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs 
>> > receive /path2
>
> -v is useless. Use -vv, which will dump all commands.
>
>> >fails as follows:
>> >ERROR: chown o257-4639416-0 failed: No such file or directory
>> >
>> > No error is shown in dmesg. /path1 and /path2 denote two independent
>> > btrfs filesystems.
>> >
>> > Note that there was no issue with transferring incrementalsnapshot from
>> > machine 1 to machine 2. No error is shown in dmesg.


Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Martin Steigerwald
Filipe Manana - 05.10.18, 17:21:
> On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald 
 wrote:
> > Hello!
> > 
> > On ThinkPad T520 after battery was discharged and machine just
> > blacked out.
> > 
> > Is that some sign of regular consistency check / replay or something
> > to investigate further?
> 
> I think it's harmless, if anything were messed up with link counts or
> mismatches between those and dir entries, fsck (btrfs check) should
> have reported something.
> I'll dig a bit further and remove the warning if it's really harmless.

I just scrubbed the filesystem. I did not run btrfs check on it.
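
(Should it become useful, a hedged sketch of running the check; btrfs
check must run on an unmounted filesystem and is read-only by default,
and the device path below is a placeholder:)

  # from a rescue/live environment, with the filesystem unmounted
  btrfs check /dev/mapper/<volume>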

> > I already scrubbed all data and there are no errors. Also btrfs
> > device stats reports no errors. SMART status appears to be okay as
> > well on both SSD.
> > 
> > [4.524355] BTRFS info (device dm-4): disk space caching is
> > enabled [… backtrace …]
-- 
Martin




Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Filipe Manana
On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald  wrote:
>
> Hello!
>
> On ThinkPad T520 after battery was discharged and machine just blacked
> out.
>
> Is that some sign of regular consistency check / replay or something to
> investigate further?

I think it's harmless, if anything were messed up with link counts or
mismatches between those and dir entries, fsck (btrfs check) should
have reported something.
I'll dig a bit further and remove the warning if it's really harmless.

Thanks.

>
> I already scrubbed all data and there are no errors. Also btrfs device stats
> reports no errors. SMART status appears to be okay as well on both SSD.
>
> [4.524355] BTRFS info (device dm-4): disk space caching is enabled
> [4.524356] BTRFS info (device dm-4): has skinny extents
> [4.563950] BTRFS info (device dm-4): enabling ssd optimizations
> [5.463085] Console: switching to colour frame buffer device 240x67
> [5.492236] i915 :00:02.0: fb0: inteldrmfb frame buffer device
> [5.882661] BTRFS info (device dm-3): disk space caching is enabled
> [5.882664] BTRFS info (device dm-3): has skinny extents
> [5.918579] SGI XFS with ACLs, security attributes, realtime, scrub, no 
> debug enabled
> [5.927421] Adding 20971516k swap on /dev/mapper/sata-swap.  Priority:-2 
> extents:1 across:20971516k SSDsc
> [5.935051] XFS (sdb1): Mounting V5 Filesystem
> [5.935218] XFS (sda1): Mounting V5 Filesystem
> [5.961100] XFS (sda1): Ending clean mount
> [5.970857] BTRFS info (device dm-3): enabling ssd optimizations
> [5.972358] XFS (sdb1): Ending clean mount
> [5.975955] WARNING: CPU: 1 PID: 1104 at fs/inode.c:342 inc_nlink+0x28/0x30
> [5.978271] Modules linked in: xfs msr pktcdvd intel_rapl 
> x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass 
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc 
> arc4 snd_hda_codec_conexant snd_hda_codec_generic iwldvm mac80211 iwlwifi 
> aesni_intel snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd cryptd 
> snd_hda_core glue_helper intel_cstate snd_hwdep intel_rapl_perf snd_pcm 
> pcspkr input_leds i915 sg cfg80211 snd_timer thinkpad_acpi nvram 
> drm_kms_helper snd soundcore tpm_tis tpm_tis_core drm rfkill ac tpm 
> i2c_algo_bit fb_sys_fops battery rng_core video syscopyarea sysfillrect 
> sysimgblt button evdev sbs sbshc coretemp bfq hdaps(O) tp_smapi(O) 
> thinkpad_ec(O) loop ecryptfs cbc sunrpc mcryptd sha256_ssse3 sha256_generic 
> encrypted_keys ip_tables x_tables autofs4 dm_mod
> [5.990499]  btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate 
> raid6_pq libcrc32c crc32c_generic sr_mod cdrom sd_mod hid_lenovo hid_generic 
> usbhid hid ahci libahci libata ehci_pci crc32c_intel psmouse i2c_i801 
> sdhci_pci cqhci lpc_ich sdhci ehci_hcd e1000e scsi_mod i2c_core mfd_core 
> mmc_core usbcore usb_common thermal
> [5.990529] CPU: 1 PID: 1104 Comm: mount Tainted: G   O  
> 4.18.7-tp520 #63
> [5.990532] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET69WW (1.49 ) 
> 06/14/2018
> [6.000153] RIP: 0010:inc_nlink+0x28/0x30
> [6.000154] Code: 00 00 8b 47 48 85 c0 74 07 83 c0 01 89 47 48 c3 f6 87 a1 
> 00 00 00 04 74 11 48 8b 47 28 f0 48 ff 88 98 04 00 00 8b 47 48 eb df <0f> 0b 
> eb eb 0f 1f 40 00 41 54 8b 0d 70 3f aa 00 48 ba eb 83 b5 80
> [6.008573] RSP: 0018:c90002283828 EFLAGS: 00010246
> [6.008575] RAX:  RBX: 8804018bed58 RCX: 
> 00022261
> [6.008576] RDX: 00022251 RSI:  RDI: 
> 8804018bed58
> [6.008577] RBP: c90002283a50 R08: 0002a330 R09: 
> a02f3873
> [6.008578] R10: 30ff R11: 7763 R12: 
> 0011
> [6.008579] R13: 3d5f R14: 880403e19800 R15: 
> 88040a3c69a0
> [6.008580] FS:  7f071598f100() GS:88041e24() 
> knlGS:
> [6.008581] CS:  0010 DS:  ES:  CR0: 80050033
> [6.008589] CR2: 7fda4fbf8218 CR3: 000403e42001 CR4: 
> 000606e0
> [6.008590] Call Trace:
> [6.008614]  replay_one_buffer+0x80e/0x890 [btrfs]
> [6.008632]  walk_up_log_tree+0x1dc/0x260 [btrfs]
> [6.046858]  walk_log_tree+0xaf/0x1e0 [btrfs]
> [6.046872]  btrfs_recover_log_trees+0x21c/0x410 [btrfs]
> [6.046885]  ? btree_read_extent_buffer_pages+0xcd/0x210 [btrfs]
> [6.055941]  ? fixup_inode_link_counts+0x170/0x170 [btrfs]
> [6.055953]  open_ctree+0x1a0d/0x1b60 [btrfs]
> [6.055965]  btrfs_mount_root+0x67b/0x760 [btrfs]
> [6.065039]  ? pcpu_alloc_area+0xdd/0x120
> [6.065040]  ? pcpu_next_unpop+0x32/0x40
> [6.065052]  mount_fs+0x36/0x162
> [6.065055]  vfs_kern_mount.part.34+0x4f/0x120
> [6.065064]  btrfs_mount+0x15f/0x890 [btrfs]
> [6.065067]  ? pcpu_cnt_pop_pages+0x40/0x50
> [6.065069]  ? pcpu_alloc_area+0xdd/0x120
> [6.065071]  ? pcpu_next_unpop+0x32/0x40
> [6.065073]  ? cpumask_next+0x16/0x20
> [ 

Re: btrfs send receive ERROR: chown failed: No such file or directory

2018-10-02 Thread Filipe Manana
On Tue, Oct 2, 2018 at 7:02 AM Leonard Lausen  wrote:
>
> Hello,
>
> does anyone have an idea about below issue? It is a severe issue as it
> renders btrfs send / receive dysfunctional and it is not clear if there
> may be a data corruption issue hiding in the current send / receive
> code.
>
> Thank you.
>
> Best regards
> Leonard
>
> Leonard Lausen  writes:
> > Hello!
> >
> > I observe the following issue with btrfs send | btrfs receive in a setup
> > with 2 machines and 3 btrfs file-systems. All machines run Linux 4.18.9.
> > Machine 1 runs btrfs-progs 4.17.1, machine 2 runs btrfs-progs 4.17 (via
> > https://packages.debian.org/stretch-backports/btrfs-progs).
> >
> > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs
> >btrfs send ... | ssh user@machine2 "btrfs receive /path1"
> > 2) Machine 2 backups all subvolumes stored at /path1 to a second
> >independent btrfs filesystem. Let /path1/rootsnapshot be the first
> >snapshot stored at /path1 (ie. it has no Parent UUID). Let
> >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot
> >as a parent. Then
> >btrfs send -v /path1/rootsnapshot | btrfs receive /path2
> >works without issues, but
> >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs 
> > receive /path2

-v is useless. Use -vv, which will dump all commands.

> >fails as follows:
> >ERROR: chown o257-4639416-0 failed: No such file or directory
> >
> > No error is shown in dmesg. /path1 and /path2 denote two independent
> > btrfs filesystems.
> >
> > Note that there was no issue with transferring incrementalsnapshot from
> > machine 1 to machine 2. No error is shown in dmesg.
> >
> > Best regards
> > Leonard



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: btrfs send receive ERROR: chown failed: No such file or directory

2018-10-02 Thread Leonard Lausen
Hello,

does anyone have an idea about below issue? It is a severe issue as it
renders btrfs send / receive dysfunctional and it is not clear if there
may be a data corruption issue hiding in the current send / receive
code.

Thank you.

Best regards
Leonard

Leonard Lausen  writes:
> Hello!
>
> I observe the following issue with btrfs send | btrfs receive in a setup
> with 2 machines and 3 btrfs file-systems. All machines run Linux 4.18.9.
> Machine 1 runs btrfs-progs 4.17.1, machine 2 runs btrfs-progs 4.17 (via
> https://packages.debian.org/stretch-backports/btrfs-progs).
>
> 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs 
>btrfs send ... | ssh user@machine2 "btrfs receive /path1"
> 2) Machine 2 backups all subvolumes stored at /path1 to a second
>independent btrfs filesystem. Let /path1/rootsnapshot be the first
>snapshot stored at /path1 (ie. it has no Parent UUID). Let
>/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot
>as a parent. Then
>btrfs send -v /path1/rootsnapshot | btrfs receive /path2
>works without issues, but
>btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs 
> receive /path2
>fails as follows:
>ERROR: chown o257-4639416-0 failed: No such file or directory
>
> No error is shown in dmesg. /path1 and /path2 denote two independent
> btrfs filesystems.
>
> Note that there was no issue with transferring incrementalsnapshot from
> machine 1 to machine 2. No error is shown in dmesg.
>
> Best regards
> Leonard


Re: btrfs panic problem

2018-09-25 Thread sunny.s.zhang




On 2018-09-25 16:31, Nikolay Borisov wrote:


On 25.09.2018 11:20, sunny.s.zhang wrote:

On 2018-09-20 02:36, Liu Bo wrote:

On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang
 wrote:

Hi All,

My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
    freelist = 0x2026,   <<< the value is id of one inode
    tid = 29567861,
    page = 0xea0132168d00,
    partial = 0x0
}

And, I found there are two different btrfs inodes pointing
delayed_node. It
means that the same slub is used twice.

I think this slub is freed twice, and then the next pointer of this slub
point itself. So we get the same slub twice.

When use this slub again, that break the freelist.

Following code will make the delayed node being freed twice. But I don't
found what is the process.

Process A (btrfs_evict_inode) Process B

call btrfs_remove_delayed_node call  btrfs_get_delayed_node

node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

if (node) {
atomic_inc(&node->refs);
return node;
}

..

btrfs_release_delayed_node(delayed_node);


By looking at the race,  seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4


thanks,
liubo

I don't think so.
this patch has resolved the problem of radix_tree_lookup. I don't think
this can resolve my problem that race occur after
ACCESS_ONCE(btrfs_inode->delayed_node).
Because, if ACCESS_ONCE(btrfs_inode->delayed_node) return the node, then
the function of btrfs_get_delayed_node will return, and don't continue.

Can you reproduce the problem on an upstream kernel with added delays?
The original report is from some RHEL-based distro (presumably oracle
unbreakable linux) so there is no indication currently that this is a
genuine problem in upstream kernels.

Not yet. I will reproduce later.
But I don't have any clue about this race now.
Thanks,
Sunny




Thanks,
Sunny


1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


    87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
inode
*inode)
    88 {
    89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
    90 struct btrfs_root *root = btrfs_inode->root;
    91 u64 ino = btrfs_ino(inode);
    92 struct btrfs_delayed_node *node;
    93
    94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
    95 if (node) {
    96 atomic_inc(&node->refs);
    97 return node;
    98 }


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
   #0 [88130404f940] machine_kexec at 8105ec10
   #1 [88130404f9b0] crash_kexec at 811145b8
   #2 [88130404fa80] oops_end at 8101a868
   #3 [88130404fab0] no_context at 8106ea91
   #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
   #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
   #6 [88130404fb60] __do_page_fault at 8106f328
   #7 [88130404fbd0] do_page_fault at 8106f637
   #8 [88130404fc10] page_fault at 816f6308
  [exception RIP: kmem_cache_alloc+121]
  RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
  RAX:   RBX:   RCX: 01c32b76
  RDX: 01c32b75  RSI:   RDI: 000224b0
  RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
  R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
  R13: 2026  R14: 887e3e230a00  R15: a01abf49
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [88130404fd10] btrfs_get_or_create_delayed_node at
a01abf49
[btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
[btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] 

Re: btrfs panic problem

2018-09-25 Thread Nikolay Borisov



On 25.09.2018 11:20, sunny.s.zhang wrote:
> 
> On 2018-09-20 02:36, Liu Bo wrote:
>> On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang
>>  wrote:
>>> Hi All,
>>>
>>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>>>
>>> I found that the freelist of the slub is wrong.
>>>
>>> crash> struct kmem_cache_cpu 887e7d7a24b0
>>>
>>> struct kmem_cache_cpu {
>>>    freelist = 0x2026,   <<< the value is id of one inode
>>>    tid = 29567861,
>>>    page = 0xea0132168d00,
>>>    partial = 0x0
>>> }
>>>
>>> And, I found there are two different btrfs inodes pointing
>>> delayed_node. It
>>> means that the same slub is used twice.
>>>
>>> I think this slub is freed twice, and then the next pointer of this slub
>>> point itself. So we get the same slub twice.
>>>
>>> When use this slub again, that break the freelist.
>>>
>>> Following code will make the delayed node being freed twice. But I don't
>>> found what is the process.
>>>
>>> Process A (btrfs_evict_inode) Process B
>>>
>>> call btrfs_remove_delayed_node call  btrfs_get_delayed_node
>>>
>>> node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>
>>> BTRFS_I(inode)->delayed_node = NULL;
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>> if (node) {
>>> atomic_inc(&node->refs);
>>> return node;
>>> }
>>>
>>> ..
>>>
>>> btrfs_release_delayed_node(delayed_node);
>>>
>> By looking at the race,  seems the following commit has addressed it.
>>
>> btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4
>>
>>
>> thanks,
>> liubo
> 
> I don't think so.
> this patch has resolved the problem of radix_tree_lookup. I don't think
> this can resolve my problem that race occur after
> ACCESS_ONCE(btrfs_inode->delayed_node).
> Because, if ACCESS_ONCE(btrfs_inode->delayed_node) return the node, then
> the function of btrfs_get_delayed_node will return, and don't continue.

Can you reproduce the problem on an upstream kernel with added delays?
The original report is from some RHEL-based distro (presumably oracle
unbreakable linux) so there is no indication currently that this is a
genuine problem in upstream kernels.

> 
> Thanks,
> Sunny
> 
>>
>>> 1313 void btrfs_remove_delayed_node(struct inode *inode)
>>> 1314 {
>>> 1315 struct btrfs_delayed_node *delayed_node;
>>> 1316
>>> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
>>> 1318 if (!delayed_node)
>>> 1319 return;
>>> 1320
>>> 1321 BTRFS_I(inode)->delayed_node = NULL;
>>> 1322 btrfs_release_delayed_node(delayed_node);
>>> 1323 }
>>>
>>>
>>>    87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
>>> inode
>>> *inode)
>>>    88 {
>>>    89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>>>    90 struct btrfs_root *root = btrfs_inode->root;
>>>    91 u64 ino = btrfs_ino(inode);
>>>    92 struct btrfs_delayed_node *node;
>>>    93
>>>    94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>    95 if (node) {
>>>    96 atomic_inc(&node->refs);
>>>    97 return node;
>>>    98 }
>>>
>>>
>>> Thanks,
>>>
>>> Sunny
>>>
>>>
>>> PS:
>>>
>>> 
>>>
>>> panic informations
>>>
>>> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>>>   #0 [88130404f940] machine_kexec at 8105ec10
>>>   #1 [88130404f9b0] crash_kexec at 811145b8
>>>   #2 [88130404fa80] oops_end at 8101a868
>>>   #3 [88130404fab0] no_context at 8106ea91
>>>   #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>>>   #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>>>   #6 [88130404fb60] __do_page_fault at 8106f328
>>>   #7 [88130404fbd0] do_page_fault at 8106f637
>>>   #8 [88130404fc10] page_fault at 816f6308
>>>  [exception RIP: kmem_cache_alloc+121]
>>>  RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>>>  RAX:   RBX:   RCX: 01c32b76
>>>  RDX: 01c32b75  RSI:   RDI: 000224b0
>>>  RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
>>>  R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>>>  R13: 2026  R14: 887e3e230a00  R15: a01abf49
>>>  ORIG_RAX:   CS: 0010  SS: 0018
>>>   #9 [88130404fd10] btrfs_get_or_create_delayed_node at
>>> a01abf49
>>> [btrfs]
>>> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
>>> [btrfs]

Re: btrfs panic problem

2018-09-25 Thread sunny.s.zhang



On 2018-09-20 00:12, Nikolay Borisov wrote:

On 19.09.2018 02:53, sunny.s.zhang wrote:

Hi Duncan,

Thank you for your advice. I understand what you mean.  But i have
reviewed the latest btrfs code, and i think the issue is exist still.

At 71 line, if the function of btrfs_get_delayed_node run over this
line, then switch to other process, which run over the 1282 and release
the delayed node at the end.

And then, switch back to the  btrfs_get_delayed_node. find that the node
is not null, and use it as normal. that mean we used a freed memory.

at some time, this memory will be freed again.

latest code as below.

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280 struct btrfs_delayed_node *delayed_node;
1281
1282 delayed_node = READ_ONCE(inode->delayed_node);
1283 if (!delayed_node)
1284 return;
1285
1286 inode->delayed_node = NULL;
1287 btrfs_release_delayed_node(delayed_node);
1288 }


   64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
   65 struct btrfs_inode *btrfs_inode)
   66 {
   67 struct btrfs_root *root = btrfs_inode->root;
   68 u64 ino = btrfs_ino(btrfs_inode);
   69 struct btrfs_delayed_node *node;
   70
   71 node = READ_ONCE(btrfs_inode->delayed_node);
   72 if (node) {
   73 refcount_inc(&node->refs);
   74 return node;
   75 }
   76
   77 spin_lock(&root->inode_lock);
   78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino);



Your analysis is correct, however it's missing one crucial point -
btrfs_remove_delayed_node is called only from btrfs_evict_inode. And
inodes are evicted when all other references have been dropped. Check
the code in evict_inodes() - inodes are added to the dispose list when
their i_count is 0 at which point there should be no references in this
inode. This invalidates your analysis...

Thanks.
Yes, I know this, and I know that another process cannot use this inode 
if the inode is in the I_FREEING state.
But Chris has fixed a bug which is similar to this and was found in 
production. It means that this can occur under some conditions.


btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4


On 2018-09-18 13:05, Duncan wrote:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:


My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself.  But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of
options.

First let's consider the general situation.  Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want.  However, btrfs, while now reaching early maturity for the
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature),
remains under quite heavy development, and keeping reasonably current is
recommended for that reason.

So you you chose an enterprise distro presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.  There's a
bit of a conflict there, and the /general/ recommendation would thus be
to consider whether one or the other of those choices are inappropriate
for your use-case, because it's really quite likely that if you really
want the stability of an enterprise distro and kernel, that btrfs isn't
as stable a filesystem as you're likely to want to match with it.
Alternatively, if you want something newer to match the still under heavy
development btrfs, you very likely want a distro that's not focused on
years-old stability just for the sake of it.  One or the other is likely
to be a poor match for your needs, and choosing something else that's a
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to
traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro.  That's fine, provided you know
what you're getting yourself into, and are prepared to deal with it.

In 

Re: btrfs panic problem

2018-09-25 Thread sunny.s.zhang



On 2018-09-20 02:36, Liu Bo wrote:

On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang  wrote:

Hi All,

My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
   freelist = 0x2026,   <<< the value is id of one inode
   tid = 29567861,
   page = 0xea0132168d00,
   partial = 0x0
}

And, I found there are two different btrfs inodes pointing delayed_node. It
means that the same slub is used twice.

I think this slub is freed twice, and then the next pointer of this slub
point itself. So we get the same slub twice.

When use this slub again, that break the freelist.

Following code will make the delayed node being freed twice. But I don't
found what is the process.

Process A (btrfs_evict_inode) Process B

call btrfs_remove_delayed_node call  btrfs_get_delayed_node

node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

if (node) {
atomic_inc(&node->refs);
return node;
}

..

btrfs_release_delayed_node(delayed_node);


By looking at the race,  seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

thanks,
liubo


I don't think so.
This patch resolved the problem in radix_tree_lookup. I don't think it 
can resolve my problem, where the race occurs after 
ACCESS_ONCE(btrfs_inode->delayed_node).
Because if ACCESS_ONCE(btrfs_inode->delayed_node) returns the node, then 
btrfs_get_delayed_node will return it and not continue.


Thanks,
Sunny




1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode
*inode)
   88 {
   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
   90 struct btrfs_root *root = btrfs_inode->root;
   91 u64 ino = btrfs_ino(inode);
   92 struct btrfs_delayed_node *node;
   93
   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
   95 if (node) {
   96 atomic_inc(&node->refs);
   97 return node;
   98 }


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
  #0 [88130404f940] machine_kexec at 8105ec10
  #1 [88130404f9b0] crash_kexec at 811145b8
  #2 [88130404fa80] oops_end at 8101a868
  #3 [88130404fab0] no_context at 8106ea91
  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
  #6 [88130404fb60] __do_page_fault at 8106f328
  #7 [88130404fbd0] do_page_fault at 8106f637
  #8 [88130404fc10] page_fault at 816f6308
 [exception RIP: kmem_cache_alloc+121]
 RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
 RAX:   RBX:   RCX: 01c32b76
 RDX: 01c32b75  RSI:   RDI: 000224b0
 RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
 R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
 R13: 2026  R14: 887e3e230a00  R15: a01abf49
 ORIG_RAX:   CS: 0010  SS: 0018
  #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49
[btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
[btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
 RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
 RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
 RDX: 1000  RSI: 00c83da14000  RDI: 0011
 RBP:    R8:    R9: 
 R10: 

Re: btrfs problems

2018-09-22 Thread Duncan
Adrian Bastholm posted on Thu, 20 Sep 2018 23:35:57 +0200 as excerpted:

> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid. three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost Another variant I was
> considering is running a raid1 mirror on two of the drives and maybe a
> subvolume on the third, for less important stuff

Agreed with CMurphy's reply, but he didn't mention...

As I wrote elsewhere recently, don't remember if it was in a reply to you 
before you tried zfs and came back, or to someone else, so I'll repeat 
here, briefer this time...

Keep in mind that on btrfs, it's possible (and indeed the default with 
multiple devices) to run data and metadata at different raid levels.

IMO, as long as you're following an appropriate backup policy that backs 
up anything valuable enough to be worth the time/trouble/resources of 
doing so, so if you /do/ lose the array you still have a backup of 
anything you considered valuable enough to worry about (and that caveat 
is always the case, no matter where or how it's stored, value of data is 
in practice defined not by arbitrary claims but by the number of backups 
it's considered worth having of it)...

With that backups caveat, I'm now confident /enough/ about raid56 mode to 
be comfortable cautiously recommending it for data, tho I'd still /not/ 
recommend it for metadata, which I'd recommend should remain the multi-
device default raid1 level.

That way, you're only risking a limited amount of raid5 data to the not 
yet as mature and well tested raid56 mode, the metadata remains protected 
by the more mature raid1 mode, and if something does go wrong, it's much 
more likely to be only a few files lost instead of the entire filesystem, 
as is at risk if your metadata is raid56 as well, the metadata including 
checksums will be intact so scrub should tell you what files are bad, and 
if those few files are valuable they'll be on the backup and easy enough 
to restore, compared to restoring the entire filesystem.  But for most 
use-cases, metadata should be relatively small compared to data, so 
duplicating metadata as raid1 while doing raid5 for data should go much 
easier on the capacity needs than raid1 for both would.

Tho I'd still recommend raid1 data as well for higher maturity and tested 
ability to use the good copy to rewrite the bad one if one copy goes bad 
(in theory, raid56 mode can use parity to rewrite as well, but that's not 
yet as well tested and there's still the narrow degraded-mode crash write 
hole to worry about), if it's not cost-prohibitive for the amount of data 
you need to store.  But for people on a really tight budget or who are 
storing double-digit TB of data or more, I can understand why they prefer 
raid5, and I do think raid5 is stable enough for data now, as long as the 
metadata remains raid1, AND they're actually executing on a good backup 
policy.
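
To make the layout described above concrete, a hedged sketch (device 
names and mount point are assumptions, not from this thread):

  # data as raid5, metadata mirrored as raid1
  mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

  # or convert the profiles of an existing multi-device filesystem
  btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt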

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs problems

2018-09-20 Thread Remi Gauvin
On 2018-09-20 05:35 PM, Adrian Bastholm wrote:
> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid. three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff

In case you were not aware, it's perfectly acceptable with BTRFS to use
RAID 1 over 3 devices.  Even more amazing, regardless of how many
devices you start with (2, 3, 4, whatever), you can add a single drive to
the array to increase capacity (at 50%, of course; i.e., adding a 4TB
drive will give you 2TB of usable space, assuming the other drives add up
to at least 4TB to match it).
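
A minimal sketch of that workflow (device names and mount point are
assumptions):

  # three-device filesystem with raid1 data and metadata
  mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc /dev/sdd
  mount /dev/sdb /mnt

  # later, grow it by adding a single drive and rebalancing
  btrfs device add /dev/sde /mnt
  btrfs balance start /mnt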



Re: btrfs problems

2018-09-20 Thread Chris Murphy
On Thu, Sep 20, 2018 at 3:36 PM Adrian Bastholm  wrote:
>
> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid.

Yep. I'm referring to the drives, their firmware, cables, logic board,
its firmware, the power supply, power, etc. Btrfs is by nature
intolerant of corruption. Other file systems are more tolerant because
they don't know about it (although recent versions of XFS and ext4 are
now defaulting to checksummed metadata and journals).


>three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff

RAID does not substantially reduce the chances of data loss. It's not
anything like a backup. It's an uptime enhancer. If you have backups,
and your primary storage dies, of course you can restore from backup
no problem, but it takes time and while the restore is happening,
you're not online - uptime is killed. If that's a negative, might want
to run RAID so you can keep working during the degraded period, and
instead of a restore you're doing a rebuild. But of course there is a
chance of failure during the degraded period. So you have to have a
backup anyway. At least with Btrfs/ZFS, there is another reason to run
with some replication like raid1 or raid5 and that's so that if
there's corruption or a bad sector, Btrfs doesn't just detect it, it
can fix it up with the good copy.

For what it's worth, make sure the drives have lower SCT ERC time than
the SCSI command timer. This is the same for Btrfs as it is for md and
LVM RAID. The command timer default is 30 seconds, and most drives
have SCT ERC disabled with very high recovery times well over 30
seconds. So either set SCT ERC to something like 70 deciseconds. Or
increase the command timer to something like 120 or 180 (either one is
absurdly high but what you want is for the drive to eventually give up
and report a discrete error message which Btrfs can do something
about, rather than do a SATA link reset in which case Btrfs can't do
anything about it).
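
A hedged example of that tuning (the device name is a placeholder; the
values follow the suggestions above):

  # query, then set, the drive's error recovery timeout (70 deciseconds = 7s)
  smartctl -l scterc /dev/sda
  smartctl -l scterc,70,70 /dev/sda

  # or, if the drive doesn't support SCT ERC, raise the kernel's command timer
  echo 180 > /sys/block/sda/device/timeout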




-- 
Chris Murphy


Re: btrfs problems

2018-09-20 Thread Adrian Bastholm
Thanks a lot for the detailed explanation.
About "stable hardware/no lying hardware": I'm not running any raid
hardware, I was planning on just software raid, three drives glued
together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
this be a safer bet, or would you recommend running the sausage method
instead, with "-d single" for safety? I'm guessing that if one of the
drives dies the data is completely lost.
Another variant I was considering is running a raid1 mirror on two of
the drives and maybe a subvolume on the third, for less important
stuff

BR Adrian
On Thu, Sep 20, 2018 at 9:39 PM Chris Murphy  wrote:
>
> On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm  wrote:
> > On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:
> >
> >>
> >> Then I strongly recommend to use the latest upstream kernel and progs
> >> for btrfs. (thus using Debian Testing)
> >>
> >> And if anything went wrong, please report asap to the mail list.
> >>
> >> Especially for fs corruption, that's the ghost I'm always chasing for.
> >> So if any corruption happens again (although I hope it won't happen), I
> >> may have a chance to catch it.
> >
> > You got it
> >> >
> >> >> Anyway, enjoy your stable fs even it's not btrfs
> >
> >> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> >> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
> >> > after the dust settled I realize I like the flexibility of BTRFS.
> >> >
> > I'm back to btrfs.
> >
> >> From the code aspect, the biggest difference is the chunk layout.
> >> Due to the ext* block group usage, each block group header (except some
> >> sparse bg) is always used, thus btrfs can't use them.
> >>
> >> This leads to highly fragmented chunk layout.
> >
> > The only thing I really understood is "highly fragmented" == not good
> > . I might need to google these "chunk" thingies
>
> Chunks are synonymous with block groups. They're like a super extent, or
> extent of extents.
>
> The block group is how Btrfs abstracts the logical address used most
> everywhere in Btrfs land, and device + physical location of extents.
> It's how a file is referenced only by one logical address, and doesn't
> need to know either where the extent is located, or how many copies
> there are. The block group allocation profile is what determines if
> there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk
> and where the copies are located. It's also fundamental to how device
> add, remove, replace, file system resize, and balance all interrelate.
>
>
> >> If your primary concern is to make the fs as stable as possible, then
> >> keep snapshots to a minimal amount, avoid any functionality you won't
> >> use, like qgroup, routinely balance, RAID5/6.
> >
> > So, is RAID5 stable enough ? reading the wiki there's a big fat
> > warning about some parity issues, I read an article about silent
> > corruption (written a while back), and chris says he can't recommend
> > raid56 to mere mortals.
>
> Depends on how you define stable. In recent kernels it's stable on
> stable hardware, i.e. no lying hardware (actually flushes when it
> claims it has), no power failures, and no failed devices. Of course
> it's designed to help protect against a clear loss of a device, but
> there's tons of stuff here that's just not finished including ejecting
> bad devices from the array like md and lvm raids will do. Btrfs will
> just keep trying, through all the failures. There are some patches to
> moderate this but I don't think they're merged yet.
>
> You'd also want to be really familiar with how to handle degraded
> operation, if you're going to depend on it, and how to replace a bad
> device. Last I refreshed my memory on it, it's advised to use "btrfs
> device add" followed by "btrfs device remove" for raid56; whereas
> "btrfs replace" is preferred for all other profiles. I'm not sure if
> the "btrfs replace" issues with parity raid were fixed.
>
> Metadata as raid56 shows a lot more problem reports than metadata
> raid1, so there's something goofy going on in those cases. I'm not
> sure how well understood they are. But other people don't have
> problems with it.
>
> It's worth looking through the archives about some things. Btrfs
> raid56 isn't exactly perfectly COW, there is read-modify-write code
> that means there can be overwrites. I vaguely recall that it's COW in
> the logical layer, but the physical writes can end up being RMW or not
> for sure COW.
>
>
>
> --
> Chris Murphy



-- 
Vänliga hälsningar / Kind regards,
Adrian Bastholm

``I would change the world, but they won't give me the sourcecode``


Re: btrfs problems

2018-09-20 Thread Chris Murphy
On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm  wrote:
> On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:
>
>>
>> Then I strongly recommend to use the latest upstream kernel and progs
>> for btrfs. (thus using Debian Testing)
>>
>> And if anything went wrong, please report asap to the mail list.
>>
>> Especially for fs corruption, that's the ghost I'm always chasing for.
>> So if any corruption happens again (although I hope it won't happen), I
>> may have a chance to catch it.
>
> You got it
>> >
>> >> Anyway, enjoy your stable fs even it's not btrfs
>
>> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
>> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
>> > after the dust settled I realize I like the flexibility of BTRFS.
>> >
> I'm back to btrfs.
>
>> From the code aspect, the biggest difference is the chunk layout.
>> Due to the ext* block group usage, each block group header (except some
>> sparse bg) is always used, thus btrfs can't use them.
>>
>> This leads to highly fragmented chunk layout.
>
> The only thing I really understood is "highly fragmented" == not good
> . I might need to google these "chunk" thingies

Chunks are synonymous with block groups. They're like a super extent, or
extent of extents.

The block group is how Btrfs abstracts the logical address used most
everywhere in Btrfs land, and device + physical location of extents.
It's how a file is referenced only by one logical address, and doesn't
need to know either where the extent is located, or how many copies
there are. The block group allocation profile is what determines if
there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk
and where the copies are located. It's also fundamental to how device
add, remove, replace, file system resize, and balance all interrelate.
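
A quick way to see those per-profile chunk allocations on a mounted
filesystem (the mount point is a placeholder):

  # shows how much space is allocated to Data / Metadata / System block
  # groups and with which profile (single, DUP, RAID1, RAID5, ...)
  btrfs filesystem df /mnt
  btrfs filesystem usage /mnt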


>> If your primary concern is to make the fs as stable as possible, then
>> keep snapshots to a minimal amount, avoid any functionality you won't
>> use, like qgroup, routinely balance, RAID5/6.
>
> So, is RAID5 stable enough ? reading the wiki there's a big fat
> warning about some parity issues, I read an article about silent
> corruption (written a while back), and chris says he can't recommend
> raid56 to mere mortals.

Depends on how you define stable. In recent kernels it's stable on
stable hardware, i.e. no lying hardware (actually flushes when it
claims it has), no power failures, and no failed devices. Of course
it's designed to help protect against a clear loss of a device, but
there's tons of stuff here that's just not finished including ejecting
bad devices from the array like md and lvm raids will do. Btrfs will
just keep trying, through all the failures. There are some patches to
moderate this but I don't think they're merged yet.

You'd also want to be really familiar with how to handle degraded
operation, if you're going to depend on it, and how to replace a bad
device. Last I refreshed my memory on it, it's advised to use "btrfs
device add" followed by "btrfs device remove" for raid56; whereas
"btrfs replace" is preferred for all other profiles. I'm not sure if
the "btrfs replace" issues with parity raid were fixed.

Metadata as raid56 shows a lot more problem reports than metadata
raid1, so there's something goofy going on in those cases. I'm not
sure how well understood they are. But other people don't have
problems with it.

It's worth looking through the archives about some things. Btrfs
raid56 isn't exactly perfectly COW, there is read-modify-write code
that means there can be overwrites. I vaguely recall that it's COW in
the logical layer, but the physical writes can end up being RMW or not
for sure COW.



-- 
Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-20 Thread Chris Murphy
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann  wrote:
> On 13.9.2018 14:35, Nikolay Borisov wrote:
>>
>> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>>>
>>> OK, I will install kdump later and perform a dump after the hang.
>>>
>>> One more noob question beforehand: does this dump contain sensitive
>>> information, for example the luks encryption key for the disk etc? A
>>> Google search only brings up one relevant search result which can only
>>> be viewed with a redhat subscription...
>>
>>
>>
>> So a kdump will dump the kernel memory so it's possible that the LUKS
>> encryption keys could be extracted from that image. Bummer, it's
>> understandable why you wouldn't want to upload it :). In this case you'd
>> have to install also the 'crash' utility to open the crashdump and
>> extract the calltrace of the btrfs process. The rough process should be :
>>
>>
>> crash 'path to vm linux' 'path to vmcore file', then once inside the
>> crash utility :
>>
>> set <pid>, you can acquire the pid by issuing 'ps'
>> which will give you a ps-like output of all running processes at the
>> time of crash. After the context has been set you can run 'bt' which
>> will give you a backtrace of the send process.
>>
>>
>>
>>>
>>> Best regards,
>>> Jürgen
>>>
>>> On 13 September 2018 14:02:11, Nikolay Borisov wrote:
>>>
 On 13.09.2018 14:50, Jürgen Herrmann wrote:
>
> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
> after the hang because I started the loop before the hang. The dmesg
> output should show the hanging tasks from second 346 on or so. Still
> not
> useful?
>

 So from 346 it's evident that transaction commit is waiting for
 commit_root_sem to be acquired. So something else is holding it and not
 giving the transaction chance to finish committing. Now the only place
 where send acquires this lock is in find_extent_clone around the  call
 to extent_from_logical. The latter basically does an extent tree search
 and doesn't loop so can't possibly deadlock. Furthermore I don't see any
 userspace processes being hung in kernel space.

 Additionally looking at the userspace processes they indicate that
 find_extent_clone has finished and are blocked in send_write_or_clone
 which does the write. But I guess this actually happens before the hang.


 So at this point without looking at the stacktrace of the btrfs send
 process after the hung has occurred I don't think much can be done
>>>
>>>
> I know this is probably not the correct list to ask this question but maybe
> someone of the devs can point me to the right list?
>
> I cannot get kdump to work. The crashkernel is loaded and everything is
> setup for it afaict. I asked a question on this over at stackexchange but no
> answer yet.
> https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing
>
> So i did a little digging and added some debug printk() statements to see
> whats going on and it seems that panic() is never called. maybe the second
> stack trace is the reason?
> Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN
>
> Could someone please tell me where I can report this problem and get some
> help on this topic?


Try kexec mailing list. They handle kdump.

http://lists.infradead.org/mailman/listinfo/kexec



-- 
Chris Murphy


Re: btrfs problems

2018-09-20 Thread Adrian Bastholm
On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:

>
> Then I strongly recommend to use the latest upstream kernel and progs
> for btrfs. (thus using Debian Testing)
>
> And if anything went wrong, please report asap to the mail list.
>
> Especially for fs corruption, that's the ghost I'm always chasing for.
> So if any corruption happens again (although I hope it won't happen), I
> may have a chance to catch it.

You got it
> >
> >> Anyway, enjoy your stable fs even it's not btrfs

> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
> > after the dust settled I realize I like the flexibility of BTRFS.
> >
I'm back to btrfs.

> From the code aspect, the biggest difference is the chunk layout.
> Due to the ext* block group usage, each block group header (except some
> sparse bg) is always used, thus btrfs can't use them.
>
> This leads to highly fragmented chunk layout.

The only thing I really understood is "highly fragmented" == not good.
I might need to google these "chunk" thingies.

> We doesn't have error report about such layout yet, but if you want
> everything to be as stable as possible, I still recommend to use a newly
> created fs.

I guess I'll stick with ext4 on the rootfs

> > Another thing is I'd like to see a "first steps after getting started
> > " section in the wiki. Something like take your first snapshot, back
> > up, how to think when running it - can i just set some cron jobs and
> > forget about it, or does it need constant attention, and stuff like
> > that.
>
> There are projects do such things automatically, like snapper.
>
> If your primary concern is to make the fs as stable as possible, then
> keep snapshots to a minimal amount, avoid any functionality you won't
> use, like qgroup, routinely balance, RAID5/6.

So, is RAID5 stable enough ? reading the wiki there's a big fat
warning about some parity issues, I read an article about silent
corruption (written a while back), and chris says he can't recommend
raid56 to mere mortals.

> And keep the necessary btrfs specific operations to minimal, like
> subvolume/snapshot (and don't keep too many snapshots, say over 20),
> shrink, send/receive.
>
> Thanks,
> Qu
>
> >
> > BR Adrian
> >
> >
>


-- 
Vänliga hälsningar / Kind regards,
Adrian Bastholm

``I would change the world, but they won't give me the sourcecode``


Re: btrfs filesystem show takes a long time

2018-09-20 Thread Stefan K
Hi,

> You may try to run the show command under strace to see where it blocks.
any recommendations for strace options?


On Friday, September 14, 2018 1:25:30 PM CEST David Sterba wrote:
> Hi,
> 
> thanks for the report, I've forwarded it to the issue tracker
> https://github.com/kdave/btrfs-progs/issues/148
> 
> The show command uses the information provided by blkid, that presumably
> caches that. The default behaviour of 'fi show' is to skip mount checks,
> so the delays are likely caused by blkid, but that's not the only
> possible reason.
> 
> You may try to run the show command under strace to see where it blocks.
> 
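
Not an authoritative recipe, but one hedged way to run it with per-call
timing (the output file name is arbitrary):

  # -f follows children, -tt/-T add wall-clock timestamps and per-call
  # duration, -o writes the trace to a file for later inspection
  strace -f -tt -T -o fi-show.trace btrfs filesystem show
  less fi-show.trace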



Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-19 Thread Jürgen Herrmann

On 13.9.2018 14:35, Nikolay Borisov wrote:

On 13.09.2018 15:30, Jürgen Herrmann wrote:

OK, I will install kdump later and perform a dump after the hang.

One more noob question beforehand: does this dump contain sensitive
information, for example the luks encryption key for the disk etc? A
Google search only brings up one relevant search result which can only
be viewed with a redhat subscription...



So a kdump will dump the kernel memory so it's possible that the LUKS
encryption keys could be extracted from that image. Bummer, it's
understandable why you wouldn't want to upload it :). In this case you'd
have to install also the 'crash' utility to open the crashdump and
extract the calltrace of the btrfs process. The rough process should be:

crash 'path to vm linux' 'path to vmcore file', then once inside the
crash utility:

set <pid>, you can acquire the pid by issuing 'ps'
which will give you a ps-like output of all running processes at the
time of crash. After the context has been set you can run 'bt' which
will give you a backtrace of the send process.
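
(For concreteness, a hedged sketch of such a crash session; the vmlinux
path, vmcore path, and pid are placeholders:)

  # open the dump with the matching debug vmlinux
  crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore

  # inside the crash prompt:
  crash> ps | grep btrfs    # find the pid of the hung 'btrfs send'
  crash> set 12345          # switch context to that pid (placeholder)
  crash> bt                 # print its kernel backtrace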





Best regards,
Jürgen

On 13 September 2018 14:02:11, Nikolay Borisov wrote:



On 13.09.2018 14:50, Jürgen Herrmann wrote:
I was echoing "w" to /proc/sysrq-trigger every 0.5s, which did work also
after the hang because I started the loop before the hang. The dmesg
output should show the hanging tasks from second 346 on or so. Still not
useful?


So from 346 it's evident that transaction commit is waiting for
commit_root_sem to be acquired. So something else is holding it and not
giving the transaction a chance to finish committing. Now the only place
where send acquires this lock is in find_extent_clone around the call
to extent_from_logical. The latter basically does an extent tree search
and doesn't loop so can't possibly deadlock. Furthermore I don't see any
userspace processes being hung in kernel space.

Additionally, looking at the userspace processes, they indicate that
find_extent_clone has finished and they are blocked in send_write_or_clone
which does the write. But I guess this actually happens before the hang.

So at this point, without looking at the stacktrace of the btrfs send
process after the hang has occurred, I don't think much can be done


I know this is probably not the correct list to ask this question but 
maybe someone of the devs can point me to the right list?


I cannot get kdump to work. The crashkernel is loaded and everything is 
setup for it afaict. I asked a question on this over at stackexchange 
but no answer yet.

https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing

So I did a little digging and added some debug printk() statements to 
see what's going on, and it seems that panic() is never called. Maybe the 
second stack trace is the reason?

Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN

Could someone please tell me where I can report this problem and get 
some help on this topic?


Best regards,
Jürgen

--
Jürgen Herrmann
https://t-5.eu
Albertstraße 2
94327 Bogen


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-19 Thread Jürgen Herrmann

On 13.9.2018 18:22, Chris Murphy wrote:

(resend to all)

On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov  
wrote:



On 13.09.2018 18:30, Chris Murphy wrote:

This is the 2nd or 3rd thread containing hanging btrfs send, with
kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
and the other I can't find at the moment. In that case though the 
hang

is reproducible in 4.14.x and weirdly it only happens when a snapshot
contains (perhaps many) reflinks. Scrub and check lowmem find nothing
wrong.

I have snapshots with a few reflinks (cp --reflink and also
deduplication), and I see maybe 15-30 second hangs where nothing is
apparently happening (in top or iotop), but I'm also not seeing any
blocked tasks or high CPU usage. Perhaps in my case it's just
recovering quickly.

Are there any kernel config options in "# Debug Lockups and Hangs"
that might hint at what's going on? Some of these are enabled in
Fedora debug kernels, which are built practically daily, e.g. right
now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
translates to git 54eda9df17f3.


If it's a lock-related problem then you need Lock Debugging => Lock
debugging: prove locking correctness


OK looks like that's under a different section as CONFIG_PROVE_LOCKING
which is enabled on Fedora debug kernels.


# Debug Lockups and Hangs
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# Lock Debugging (spinlocks, mutexes, etc...)
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_LOCK_TORTURE_TEST=m


Hello again!

I have CONFIG_PROVE_LOCKING enabled in my kernel, but no change to
the observed behaviour.

Best regards, Jürgen


--
Jürgen Herrmann
https://t-5.eu
Albertstraße 2
94327 Bogen


Re: btrfs panic problem

2018-09-19 Thread Liu Bo
On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang  wrote:
> Hi All,
>
> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.
>
> I found that the freelist of the slub is wrong.
>
> crash> struct kmem_cache_cpu 887e7d7a24b0
>
> struct kmem_cache_cpu {
>   freelist = 0x2026,   <<< the value is id of one inode
>   tid = 29567861,
>   page = 0xea0132168d00,
>   partial = 0x0
> }
>
> And, I found there are two different btrfs inodes pointing delayed_node. It
> means that the same slub is used twice.
>
> I think this slub is freed twice, and then the next pointer of this slub
> point itself. So we get the same slub twice.
>
> When use this slub again, that break the freelist.
>
> Following code will make the delayed node being freed twice. But I don't
> found what is the process.
>
> Process A (btrfs_evict_inode) Process B
>
> call btrfs_remove_delayed_node call  btrfs_get_delayed_node
>
> node = ACCESS_ONCE(btrfs_inode->delayed_node);
>
> BTRFS_I(inode)->delayed_node = NULL;
> btrfs_release_delayed_node(delayed_node);
>
> if (node) {
> atomic_inc(&node->refs);
> return node;
> }
>
> ..
>
> btrfs_release_delayed_node(delayed_node);
>

By looking at the race,  seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

thanks,
liubo


>
> 1313 void btrfs_remove_delayed_node(struct inode *inode)
> 1314 {
> 1315 struct btrfs_delayed_node *delayed_node;
> 1316
> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
> 1318 if (!delayed_node)
> 1319 return;
> 1320
> 1321 BTRFS_I(inode)->delayed_node = NULL;
> 1322 btrfs_release_delayed_node(delayed_node);
> 1323 }
>
>
>   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode
> *inode)
>   88 {
>   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>   90 struct btrfs_root *root = btrfs_inode->root;
>   91 u64 ino = btrfs_ino(inode);
>   92 struct btrfs_delayed_node *node;
>   93
>   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>   95 if (node) {
>   96 atomic_inc(&node->refs);
>   97 return node;
>   98 }
>
>
> Thanks,
>
> Sunny
>
>
> PS:
>
> 
>
> panic informations
>
> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>  #0 [88130404f940] machine_kexec at 8105ec10
>  #1 [88130404f9b0] crash_kexec at 811145b8
>  #2 [88130404fa80] oops_end at 8101a868
>  #3 [88130404fab0] no_context at 8106ea91
>  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>  #6 [88130404fb60] __do_page_fault at 8106f328
>  #7 [88130404fbd0] do_page_fault at 8106f637
>  #8 [88130404fc10] page_fault at 816f6308
> [exception RIP: kmem_cache_alloc+121]
> RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
> RAX:   RBX:   RCX: 01c32b76
> RDX: 01c32b75  RSI:   RDI: 000224b0
> RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
> R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
> R13: 2026  R14: 887e3e230a00  R15: a01abf49
> ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49
> [btrfs]
> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
> [btrfs]
> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
> #14 [88130404fe50] touch_atime at 812286d3
> #15 [88130404fe90] iterate_dir at 81221929
> #16 [88130404fee0] sys_getdents64 at 81221a19
> #17 [88130404ff50] system_call_fastpath at 816f2594
> RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
> RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
> RDX: 1000  RSI: 00c83da14000  RDI: 0011
> RBP:    R8:    R9: 
> R10:   R11: 0246  R12: 00c7
> R13: 02174e74  R14: 0555  R15: 0038
> ORIG_RAX: 00d9  CS: 0033  SS: 002b
>
>
> We also find the list double add informations, including n_list and p_list:
>
> [8642921.110568] [ cut here ]
> [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
> __list_add+0xbe/0xd0()
> [8642921.263780] list_add 

Re: btrfs panic problem

2018-09-19 Thread Nikolay Borisov



On 19.09.2018 02:53, sunny.s.zhang wrote:
> Hi Duncan,
> 
> Thank you for your advice. I understand what you mean. But I have
> reviewed the latest btrfs code, and I think the issue still exists.
> 
> At line 71, if the task running btrfs_get_delayed_node gets past this
> line and is then switched out, another task can run past line 1282 and
> release the delayed node at the end.
> 
> And then, when we switch back to btrfs_get_delayed_node, it finds that
> the node is not NULL and uses it as normal. That means we use freed memory.
> 
> At some point, this memory will be freed again.
> 
> The latest code is as below.
> 
> 1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
> 1279 {
> 1280 struct btrfs_delayed_node *delayed_node;
> 1281
> 1282 delayed_node = READ_ONCE(inode->delayed_node);
> 1283 if (!delayed_node)
> 1284 return;
> 1285
> 1286 inode->delayed_node = NULL;
> 1287 btrfs_release_delayed_node(delayed_node);
> 1288 }
> 
> 
>   64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
>   65 struct btrfs_inode *btrfs_inode)
>   66 {
>   67 struct btrfs_root *root = btrfs_inode->root;
>   68 u64 ino = btrfs_ino(btrfs_inode);
>   69 struct btrfs_delayed_node *node;
>   70
>   71 node = READ_ONCE(btrfs_inode->delayed_node);
>   72 if (node) {
>   73 refcount_inc(&node->refs);
>   74 return node;
>   75 }
>   76
>   77 spin_lock(&root->inode_lock);
>   78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
> 
> 

Your analysis is correct, however it's missing one crucial point -
btrfs_remove_delayed_node is called only from btrfs_evict_inode. And
inodes are evicted when all other references have been dropped. Check
the code in evict_inodes() - inodes are added to the dispose list when
their i_count is 0, at which point there should be no references to this
inode. This invalidates your analysis...

> On 2018-09-18 13:05, Duncan wrote:
>> sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:
>>
>>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>>>
>>> I found that the freelist of the slub is wrong.
>> [Not a dev, just a btrfs list regular and user, myself.  But here's a
>> general btrfs list recommendations reply...]
>>
>> You appear to mean kernel 4.1.12 -- confirmed by the version reported in
>> the posted dump:  4.1.12-112.14.13.el6uek.x86_64
>>
>> OK, so from the perspective of this forward-development-focused list,
>> kernel 4.1 is pretty ancient history, but you do have a number of
>> options.
>>
>> First let's consider the general situation.  Most people choose an
>> enterprise distro for supported stability, and that's certainly a valid
>> thing to want.  However, btrfs, while now reaching early maturity for the
>> basics (single device in single or dup mode, and multi-device in single/
>> raid0/1/10 modes, note that raid56 mode is newer and less mature),
>> remains under quite heavy development, and keeping reasonably current is
>> recommended for that reason.
>>
>> So you chose an enterprise distro presumably to lock in supported
>> stability for several years, but you chose a filesystem, btrfs, that's
>> still under heavy development, with reasonably current kernels and
>> userspace recommended as tending to have the known bugs fixed.  There's a
>> bit of a conflict there, and the /general/ recommendation would thus be
>> to consider whether one or the other of those choices are inappropriate
>> for your use-case, because it's really quite likely that if you really
>> want the stability of an enterprise distro and kernel, that btrfs isn't
>> as stable a filesystem as you're likely to want to match with it.
>> Alternatively, if you want something newer to match the still under heavy
>> development btrfs, you very likely want a distro that's not focused on
>> years-old stability just for the sake of it.  One or the other is likely
>> to be a poor match for your needs, and choosing something else that's a
>> better match is likely to be a much better experience for you.
>>
>> But perhaps you do have reason to want to run the newer and not quite to
>> traditional enterprise-distro level stability btrfs, on an otherwise
>> older and very stable enterprise distro.  That's fine, provided you know
>> what you're getting yourself into, and are prepared to deal with it.
>>
>> In that case, for best support from the list, we'd recommend running one
>> of the latest two kernels in either the current or mainline LTS tracks.
>>
>> For the current track, with 4.18 being the latest kernel, that'd be 4.18 or
>> 4.17, as available on kernel.org (tho 4.17 is already EOL, no further
>> releases, at 4.17.19).
>>
>> For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
>> kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18
>> and it's just not out of normal 

Re: btrfs panic problem

2018-09-18 Thread Qu Wenruo


On 2018/9/19 08:35, sunny.s.zhang wrote:
> 
> On 2018-09-19 08:05, Qu Wenruo wrote:
>>
>> On 2018/9/18 08:28, sunny.s.zhang wrote:
>>> Hi All,
>>>
>>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>> Any reproducer?
>>
>> Anyway we need a reproducer as a testcase.
> 
> I have had a try, but could not reproduce it yet.

Since it's just one hit in a production environment, I'm afraid we need to
inject some sleep or delay into this code and try bombing it with fsstress.
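
Something along these lines might be a starting point (purely an
illustrative debugging hack against the old ACCESS_ONCE-based code, not a
real patch): widen the suspected window in btrfs_get_delayed_node() and
then run fsstress with many worker processes while repeatedly forcing
inode eviction, e.g. by dropping caches:

    /*
     * Illustrative only: delay between the unlocked fast-path check and
     * the refcount bump so an evicting task can slip in between.
     * Assumes #include <linux/delay.h> for msleep().
     */
    node = ACCESS_ONCE(btrfs_inode->delayed_node);
    if (node) {
            msleep(10);             /* widen the suspected race window */
            atomic_inc(&node->refs);
            return node;
    }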

Beyond that, I don't have a good idea for reproducing it.

Thanks,
Qu

> 
> Any advice to reproduce it?
> 
>>
>> The code looks
>>
>>> I found that the freelist of the slub is wrong.
>>>
>>> crash> struct kmem_cache_cpu 887e7d7a24b0
>>>
>>> struct kmem_cache_cpu {
>>>    freelist = 0x2026,   <<< the value is id of one inode
>>>    tid = 29567861,
>>>    page = 0xea0132168d00,
>>>    partial = 0x0
>>> }
>>>
>>> And, I found there are two different btrfs inodes pointing to the same
>>> delayed_node. It means that the same slub object is used twice.
>>>
>>> I think this slub object is freed twice, and then the next pointer of
>>> this object points to itself. So we get the same object twice.
>>>
>>> When this object is used again, that breaks the freelist.
>>>
>>> The following code will make the delayed node be freed twice. But I
>>> haven't found what the process is.
>>>
>>> Process A (btrfs_evict_inode) Process B
>>>
>>> call btrfs_remove_delayed_node call  btrfs_get_delayed_node
>>>
>>> node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>
>>> BTRFS_I(inode)->delayed_node = NULL;
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>> if (node) {
>>> atomic_inc(&node->refs);
>>> return node;
>>> }
>>>
>>> ..
>>>
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>>
>>> 1313 void btrfs_remove_delayed_node(struct inode *inode)
>>> 1314 {
>>> 1315 struct btrfs_delayed_node *delayed_node;
>>> 1316
>>> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
>>> 1318 if (!delayed_node)
>>> 1319 return;
>>> 1320
>>> 1321 BTRFS_I(inode)->delayed_node = NULL;
>>> 1322 btrfs_release_delayed_node(delayed_node);
>>> 1323 }
>>>
>>>
>>>    87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
>>> inode *inode)
>>>    88 {
>>>    89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>>>    90 struct btrfs_root *root = btrfs_inode->root;
>>>    91 u64 ino = btrfs_ino(inode);
>>>    92 struct btrfs_delayed_node *node;
>>>    93
>>>    94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>    95 if (node) {
>>>    96 atomic_inc(&node->refs);
>>>    97 return node;
>>>    98 }
>>>
>> The analysis looks valid.
>> Can be fixed by adding a spinlock.
>>
>> Just wondering why we didn't hit it.
> 
> It just appeared once in our production environment.
> 
> Thanks,
> Sunny
>>
>> Thanks,
>> Qu
>>
>>> Thanks,
>>>
>>> Sunny
>>>
>>>
>>> PS:
>>>
>>> 
>>>
>>> panic informations
>>>
>>> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>>>   #0 [88130404f940] machine_kexec at 8105ec10
>>>   #1 [88130404f9b0] crash_kexec at 811145b8
>>>   #2 [88130404fa80] oops_end at 8101a868
>>>   #3 [88130404fab0] no_context at 8106ea91
>>>   #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>>>   #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>>>   #6 [88130404fb60] __do_page_fault at 8106f328
>>>   #7 [88130404fbd0] do_page_fault at 8106f637
>>>   #8 [88130404fc10] page_fault at 816f6308
>>>  [exception RIP: kmem_cache_alloc+121]
>>>  RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>>>  RAX:   RBX:   RCX: 01c32b76
>>>  RDX: 01c32b75  RSI:   RDI: 000224b0
>>>  RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
>>>  R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>>>  R13: 2026  R14: 887e3e230a00  R15: a01abf49
>>>  ORIG_RAX:   CS: 0010  SS: 0018
>>>   #9 [88130404fd10] btrfs_get_or_create_delayed_node at
>>> a01abf49 [btrfs]
>>> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
>>> [btrfs]
>>> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
>>> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
>>> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
>>> #14 [88130404fe50] touch_atime at 812286d3
>>> #15 [88130404fe90] iterate_dir at 81221929
>>> #16 [88130404fee0] sys_getdents64 at 81221a19
>>> #17 [88130404ff50] system_call_fastpath at 816f2594
>>>  RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
>>>  RAX: 

Re: btrfs panic problem

2018-09-18 Thread sunny.s.zhang



On 2018-09-19 08:05, Qu Wenruo wrote:


On 2018/9/18 08:28, sunny.s.zhang wrote:

Hi All,

My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

Any reproducer?

Anyway we need a reproducer as a testcase.


I have had a try, but could not reproduce it yet.

Any advice to reproduce it?



The code looks


I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
   freelist = 0x2026,   <<< the value is id of one inode
   tid = 29567861,
   page = 0xea0132168d00,
   partial = 0x0
}

And, I found there are two different btrfs inodes pointing to the same
delayed_node. It means that the same slub object is used twice.

I think this slub object is freed twice, and then the next pointer of this
object points to itself. So we get the same object twice.

When this object is used again, that breaks the freelist.

The following code will make the delayed node be freed twice. But I haven't
found what the process is.

Process A (btrfs_evict_inode) Process B

call btrfs_remove_delayed_node call  btrfs_get_delayed_node

node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

if (node) {
atomic_inc(&node->refs);
return node;
}

..

btrfs_release_delayed_node(delayed_node);


1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
inode *inode)
   88 {
   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
   90 struct btrfs_root *root = btrfs_inode->root;
   91 u64 ino = btrfs_ino(inode);
   92 struct btrfs_delayed_node *node;
   93
   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
   95 if (node) {
   96 atomic_inc(&node->refs);
   97 return node;
   98 }


The analysis looks valid.
Can be fixed by adding a spinlock.

Just wondering why we didn't hit it.


It just appeared once in our production environment.

Thanks,
Sunny


Thanks,
Qu


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
  #0 [88130404f940] machine_kexec at 8105ec10
  #1 [88130404f9b0] crash_kexec at 811145b8
  #2 [88130404fa80] oops_end at 8101a868
  #3 [88130404fab0] no_context at 8106ea91
  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
  #6 [88130404fb60] __do_page_fault at 8106f328
  #7 [88130404fbd0] do_page_fault at 8106f637
  #8 [88130404fc10] page_fault at 816f6308
     [exception RIP: kmem_cache_alloc+121]
     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
     RAX:   RBX:   RCX: 01c32b76
     RDX: 01c32b75  RSI:   RDI: 000224b0
     RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
     R13: 2026  R14: 887e3e230a00  R15: a01abf49
     ORIG_RAX:   CS: 0010  SS: 0018
  #9 [88130404fd10] btrfs_get_or_create_delayed_node at
a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
[btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
     RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
     RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
     RDX: 1000  RSI: 00c83da14000  RDI: 0011
     RBP:    R8:    R9: 
     R10:   R11: 0246  R12: 00c7
     R13: 02174e74  R14: 0555  R15: 0038
     ORIG_RAX: 00d9  CS: 0033  SS: 002b


We also find the list double add informations, including n_list and p_list:

[8642921.110568] [ cut here ]
[8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
__list_add+0xbe/0xd0()
[8642921.263780] list_add corruption. prev->next should be next
(887e40fa5368), but was ff:ff884c85a36288. 

Re: btrfs panic problem

2018-09-18 Thread Qu Wenruo


On 2018/9/18 08:28, sunny.s.zhang wrote:
> Hi All,
> 
> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.

Any reproducer?

Anyway we need a reproducer as a testcase.

The code looks

> 
> I found that the freelist of the slub is wrong.
> 
> crash> struct kmem_cache_cpu 887e7d7a24b0
> 
> struct kmem_cache_cpu {
>   freelist = 0x2026,   <<< the value is id of one inode
>   tid = 29567861,
>   page = 0xea0132168d00,
>   partial = 0x0
> }
> 
> And, I found there are two different btrfs inodes pointing to the same
> delayed_node. It means that the same slub object is used twice.
> 
> I think this slub object is freed twice, and then the next pointer of this
> object points to itself. So we get the same object twice.
> 
> When this object is used again, that breaks the freelist.
> 
> The following code will make the delayed node be freed twice. But I haven't
> found what the process is.
> 
> Process A (btrfs_evict_inode) Process B
> 
> call btrfs_remove_delayed_node call  btrfs_get_delayed_node
> 
> node = ACCESS_ONCE(btrfs_inode->delayed_node);
> 
> BTRFS_I(inode)->delayed_node = NULL;
> btrfs_release_delayed_node(delayed_node);
> 
> if (node) {
> atomic_inc(&node->refs);
> return node;
> }
> 
> ..
> 
> btrfs_release_delayed_node(delayed_node);
> 
> 
> 1313 void btrfs_remove_delayed_node(struct inode *inode)
> 1314 {
> 1315 struct btrfs_delayed_node *delayed_node;
> 1316
> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
> 1318 if (!delayed_node)
> 1319 return;
> 1320
> 1321 BTRFS_I(inode)->delayed_node = NULL;
> 1322 btrfs_release_delayed_node(delayed_node);
> 1323 }
> 
> 
>   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
> inode *inode)
>   88 {
>   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>   90 struct btrfs_root *root = btrfs_inode->root;
>   91 u64 ino = btrfs_ino(inode);
>   92 struct btrfs_delayed_node *node;
>   93
>   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>   95 if (node) {
>   96 atomic_inc(&node->refs);
>   97 return node;
>   98 }
> 

The analysis looks valid.
Can be fixed by adding a spinlock.

Just wondering why we didn't hit it.

Thanks,
Qu

> 
> Thanks,
> 
> Sunny
> 
> 
> PS:
> 
> 
> 
> panic informations
> 
> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>  #0 [88130404f940] machine_kexec at 8105ec10
>  #1 [88130404f9b0] crash_kexec at 811145b8
>  #2 [88130404fa80] oops_end at 8101a868
>  #3 [88130404fab0] no_context at 8106ea91
>  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>  #6 [88130404fb60] __do_page_fault at 8106f328
>  #7 [88130404fbd0] do_page_fault at 8106f637
>  #8 [88130404fc10] page_fault at 816f6308
>     [exception RIP: kmem_cache_alloc+121]
>     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>     RAX:   RBX:   RCX: 01c32b76
>     RDX: 01c32b75  RSI:   RDI: 000224b0
>     RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
>     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>     R13: 2026  R14: 887e3e230a00  R15: a01abf49
>     ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [88130404fd10] btrfs_get_or_create_delayed_node at
> a01abf49 [btrfs]
> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
> [btrfs]
> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
> #14 [88130404fe50] touch_atime at 812286d3
> #15 [88130404fe90] iterate_dir at 81221929
> #16 [88130404fee0] sys_getdents64 at 81221a19
> #17 [88130404ff50] system_call_fastpath at 816f2594
>     RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
>     RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
>     RDX: 1000  RSI: 00c83da14000  RDI: 0011
>     RBP:    R8:    R9: 
>     R10:   R11: 0246  R12: 00c7
>     R13: 02174e74  R14: 0555  R15: 0038
>     ORIG_RAX: 00d9  CS: 0033  SS: 002b
> 
> 
> We also find the list double add informations, including n_list and p_list:
> 
> [8642921.110568] [ cut here ]
> [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
> __list_add+0xbe/0xd0()
> [8642921.263780] list_add corruption. prev->next should be next
> (887e40fa5368), but 

Re: btrfs panic problem

2018-09-18 Thread sunny.s.zhang

Hi Duncan,

Thank you for your advice. I understand what you mean. But I have
reviewed the latest btrfs code, and I think the issue still exists.


At line 71, if the task running btrfs_get_delayed_node gets past this
line and is then switched out, another task can run past line 1282 and
release the delayed node at the end.


And then, when we switch back to btrfs_get_delayed_node, it finds that
the node is not NULL and uses it as normal. That means we use freed memory.


At some point, this memory will be freed again.

The latest code is as below.

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280 struct btrfs_delayed_node *delayed_node;
1281
1282 delayed_node = READ_ONCE(inode->delayed_node);
1283 if (!delayed_node)
1284 return;
1285
1286 inode->delayed_node = NULL;
1287 btrfs_release_delayed_node(delayed_node);
1288 }


  64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
  65 struct btrfs_inode *btrfs_inode)
  66 {
  67 struct btrfs_root *root = btrfs_inode->root;
  68 u64 ino = btrfs_ino(btrfs_inode);
  69 struct btrfs_delayed_node *node;
  70
  71 node = READ_ONCE(btrfs_inode->delayed_node);
  72 if (node) {
  73 refcount_inc(&node->refs);
  74 return node;
  75 }
  76
  77 spin_lock(&root->inode_lock);
  78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino);


On 2018-09-18 13:05, Duncan wrote:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:


My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself.  But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of options.

First let's consider the general situation.  Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want.  However, btrfs, while now reaching early maturity for the
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature),
remains under quite heavy development, and keeping reasonably current is
recommended for that reason.

So you chose an enterprise distro presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.  There's a
bit of a conflict there, and the /general/ recommendation would thus be
to consider whether one or the other of those choices are inappropriate
for your use-case, because it's really quite likely that if you really
want the stability of an enterprise distro and kernel, that btrfs isn't
as stable a filesystem as you're likely to want to match with it.
Alternatively, if you want something newer to match the still under heavy
development btrfs, you very likely want a distro that's not focused on
years-old stability just for the sake of it.  One or the other is likely
to be a poor match for your needs, and choosing something else that's a
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to
traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro.  That's fine, provided you know
what you're getting yourself into, and are prepared to deal with it.

In that case, for best support from the list, we'd recommend running one
of the latest two kernels in either the current or mainline LTS tracks.

For the current track, with 4.18 being the latest kernel, that'd be 4.18 or
4.17, as available on kernel.org (tho 4.17 is already EOL, no further
releases, at 4.17.19).

For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18
and it's just not out of normal stable range yet so not yet marked LTS?),
so it'll be coming up soon and 4.9 will then be dropping to third LTS
series and thus out of our best recommended range.  4.4 was the previous
LTS and while still in LTS support, is outside the two newest LTS series
that this list recommends.

And of course 4.1 is older than 4.4, so as I said, in btrfs development
terms, it's quite ancient indeed... quite out of practical support range
here, tho of course we'll still try, but in many cases the first question
when any problem's reported is going to be whether it's reproducible on
something closer to current.

But... you ARE on an enterprise kernel, likely on an enterprise distro,
and 

Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Hugo Mills
On Tue, Sep 18, 2018 at 06:28:37PM +, Gervais, Francois wrote:
> > No. It is already possible (by setting received UUID); it should not be
> made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?

   No, there's an ioctl to change the received UUID of a
subvolume. It's used by receive, at the very end of the receive
operation.
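
   For reference, a minimal sketch of what that call looks like from
userspace (this is not the btrfs-progs code; it assumes the
BTRFS_IOC_SET_RECEIVED_SUBVOL ioctl and struct
btrfs_ioctl_received_subvol_args as declared in <linux/btrfs.h>):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    /* Set the received UUID on the subvolume at 'path' (needs privileges). */
    static int set_received_uuid(const char *path,
                                 const unsigned char uuid[16],
                                 unsigned long long stransid)
    {
            struct btrfs_ioctl_received_subvol_args args;
            int fd = open(path, O_RDONLY);
            int ret;

            if (fd < 0)
                    return -1;
            memset(&args, 0, sizeof(args));
            memcpy(args.uuid, uuid, sizeof(args.uuid)); /* 16-byte UUID */
            args.stransid = stransid;  /* transid on the sending side */
            ret = ioctl(fd, BTRFS_IOC_SET_RECEIVED_SUBVOL, &args);
            close(fd);
            return ret;
    }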

   Messing around in this area is basically a recipe for ending up
with a half-completed send/receive full of broken data because the
receiving subvolume isn't quite as identical as you thought. It
enforces the rules for a reason.

   Now, it's possible to modify the send stream and the logic around
it a bit to support a number of additional modes of operation
(bidirectional send, for example), but that's queued up waiting for
(a) a definitive list of send stream format changes, and (b) David's
bandwidth to put them together in one patch set.

   If you want to see more on the underlying UUID model, and how it
could be (ab)used and modified, there's a write-up here, in a thread
on pretty much exactly the same proposal that you've just made:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | Great films about cricket: Monster's No-Ball
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Andrei Borzenkov
On 18.09.2018 21:28, Gervais, Francois wrote:
>> No. It is already possible (by setting received UUID); it should not be
> made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?
> 
No, I mean setting received UUID on subvolume. Unfortunately, it is
possible. Fortunately, it is not trivially done.


Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Gervais, Francois
> No. It is already possible (by setting received UUID); it should not be
made too open to easy abuse.


Do you mean edit the UUID in the byte stream before btrfs receive?


Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Andrei Borzenkov
On 18.09.2018 20:56, Gervais, Francois wrote:
> 
> Hi,
> 
> I'm trying to apply a btrfs send diff (done through -p) to another subvolume 
> with the same content as the proper parent but with a different uuid.
> 
> I looked through btrfs receive and I get the feeling that this is not 
> possible right now.
> 
> I'm thinking of adding a -p option to btrfs receive which could override the 
> parent information from the stream.
> 
> Would that make sense?
> 
No. It is already possible (by setting received UUID); it should not be
made too open to easy abuse.


Re: btrfs panic problem

2018-09-18 Thread sunny.s.zhang

Add Junxiao


On 2018-09-18 13:05, Duncan wrote:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:


My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself.  But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of options.

First let's consider the general situation.  Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want.  However, btrfs, while now reaching early maturity for the
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature),
remains under quite heavy development, and keeping reasonably current is
recommended for that reason.

So you chose an enterprise distro presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.  There's a
bit of a conflict there, and the /general/ recommendation would thus be
to consider whether one or the other of those choices are inappropriate
for your use-case, because it's really quite likely that if you really
want the stability of an enterprise distro and kernel, that btrfs isn't
as stable a filesystem as you're likely to want to match with it.
Alternatively, if you want something newer to match the still under heavy
development btrfs, you very likely want a distro that's not focused on
years-old stability just for the sake of it.  One or the other is likely
to be a poor match for your needs, and choosing something else that's a
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to
traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro.  That's fine, provided you know
what you're getting yourself into, and are prepared to deal with it.

In that case, for best support from the list, we'd recommend running one
of the latest two kernels in either the current or mainline LTS tracks.

For the current track, with 4.18 being the latest kernel, that'd be 4.18 or
4.17, as available on kernel.org (tho 4.17 is already EOL, no further
releases, at 4.17.19).

For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18
and it's just not out of normal stable range yet so not yet marked LTS?),
so it'll be coming up soon and 4.9 will then be dropping to third LTS
series and thus out of our best recommended range.  4.4 was the previous
LTS and while still in LTS support, is outside the two newest LTS series
that this list recommends.

And of course 4.1 is older than 4.4, so as I said, in btrfs development
terms, it's quite ancient indeed... quite out of practical support range
here, tho of course we'll still try, but in many cases the first question
when any problem's reported is going to be whether it's reproducible on
something closer to current.

But... you ARE on an enterprise kernel, likely on an enterprise distro,
and very possibly actually paying /them/ for support.  So you're not
without options if you prefer to stay with your supported enterprise
kernel.  If you're paying them for support, you might as well use it, and
of course of the very many fixes since 4.1, they know what they've
backported and what they haven't, so they're far better placed to provide
that support in any case.

Or, given what you posted, you appear to be reasonably able to do at
least limited kernel-dev-level analysis yourself.  Given that, you're
already reasonably well placed to simply decide to stick with what you
have and take the support you can get, diving into things yourself if
necessary.


So those are your kernel options.  What about userspace btrfs-progs?

Generally speaking, while the filesystem's running, it's the kernel code
doing most of the work.  If you have old userspace, it simply means you
can't take advantage of some of the newer features as the old userspace
doesn't know how to call for them.

But the situation changes as soon as you have problems and can't mount,
because it's userspace code that runs to try to fix that sort of problem,
or failing that, it's userspace code that btrfs restore runs to try to
grab what files can be grabbed off of the unmountable filesystem.

So for routine operation, it's no big deal if userspace is a bit old, at
least as long as it's new enough to have all the newer command formats,
etc, that you need, and for 

Re: btrfs panic problem

2018-09-17 Thread Duncan
sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:

> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.
> 
> I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself.  But here's a 
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in 
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list, 
kernel 4.1 is pretty ancient history, but you do have a number of options.

First let's consider the general situation.  Most people choose an 
enterprise distro for supported stability, and that's certainly a valid 
thing to want.  However, btrfs, while now reaching early maturity for the 
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature), 
remains under quite heavy development, and keeping reasonably current is 
recommended for that reason.

So you chose an enterprise distro presumably to lock in supported 
stability for several years, but you chose a filesystem, btrfs, that's 
still under heavy development, with reasonably current kernels and 
userspace recommended as tending to have the known bugs fixed.  There's a 
bit of a conflict there, and the /general/ recommendation would thus be 
to consider whether one or the other of those choices are inappropriate 
for your use-case, because it's really quite likely that if you really 
want the stability of an enterprise distro and kernel, that btrfs isn't 
as stable a filesystem as you're likely to want to match with it.  
Alternatively, if you want something newer to match the still under heavy 
development btrfs, you very likely want a distro that's not focused on 
years-old stability just for the sake of it.  One or the other is likely 
to be a poor match for your needs, and choosing something else that's a 
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to 
traditional enterprise-distro level stability btrfs, on an otherwise 
older and very stable enterprise distro.  That's fine, provided you know 
what you're getting yourself into, and are prepared to deal with it.

In that case, for best support from the list, we'd recommend running one 
of the latest two kernels in either the current or mainline LTS tracks. 

For the current track, with 4.18 being the latest kernel, that'd be 4.18 or 
4.17, as available on kernel.org (tho 4.17 is already EOL, no further 
releases, at 4.17.19).

For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series 
kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18 
and it's just not out of normal stable range yet so not yet marked LTS?), 
so it'll be coming up soon and 4.9 will then be dropping to third LTS 
series and thus out of our best recommended range.  4.4 was the previous 
LTS and while still in LTS support, is outside the two newest LTS series 
that this list recommends.

And of course 4.1 is older than 4.4, so as I said, in btrfs development 
terms, it's quite ancient indeed... quite out of practical support range 
here, tho of course we'll still try, but in many cases the first question 
when any problem's reported is going to be whether it's reproducible on 
something closer to current.

But... you ARE on an enterprise kernel, likely on an enterprise distro, 
and very possibly actually paying /them/ for support.  So you're not 
without options if you prefer to stay with your supported enterprise 
kernel.  If you're paying them for support, you might as well use it, and 
of course of the very many fixes since 4.1, they know what they've 
backported and what they haven't, so they're far better placed to provide 
that support in any case.

Or, given what you posted, you appear to be reasonably able to do at 
least limited kernel-dev-level analysis yourself.  Given that, you're 
already reasonably well placed to simply decide to stick with what you 
have and take the support you can get, diving into things yourself if 
necessary.


So those are your kernel options.  What about userspace btrfs-progs?

Generally speaking, while the filesystem's running, it's the kernel code 
doing most of the work.  If you have old userspace, it simply means you 
can't take advantage of some of the newer features as the old userspace 
doesn't know how to call for them.

But the situation changes as soon as you have problems and can't mount, 
because it's userspace code that runs to try to fix that sort of problem, 
or failing that, it's userspace code that btrfs restore runs to try to 
grab what files can be grabbed off of the unmountable filesystem.

So for routine operation, it's no big deal if userspace is a bit old, at 
least as long as it's new enough to have all the newer command formats, 
etc, that you 

Re: btrfs panic problem

2018-09-17 Thread sunny.s.zhang

Sorry, let me correct some errors:

Process A (btrfs_evict_inode)   Process B

call btrfs_remove_delayed_node   call 
btrfs_get_delayed_node

node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

if (node) {
atomic_inc(&node->refs);
return node;
}

..

btrfs_release_delayed_node(delayed_node);
On 2018-09-18 08:28, sunny.s.zhang wrote:

Hi All,

My OS(4.1.12) panic in kmem_cache_alloc, which is called by 
btrfs_get_or_create_delayed_node.


I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
  freelist = 0x2026,   <<< the value is id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

And, I found there are two different btrfs inodes pointing to the same
delayed_node. It means that the same slub object is used twice.


I think this slub object is freed twice, and then the next pointer of
this object points to itself. So we get the same object twice.


When this object is used again, that breaks the freelist.

The following code will make the delayed node be freed twice. But I
haven't found what the process is.


Process A (btrfs_evict_inode) Process B

call btrfs_remove_delayed_node call  btrfs_get_delayed_node

node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

if (node) {
atomic_inc(&node->refs);
return node;
}

..

btrfs_release_delayed_node(delayed_node);


1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


  87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct 
inode *inode)

  88 {
  89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
  90 struct btrfs_root *root = btrfs_inode->root;
  91 u64 ino = btrfs_ino(inode);
  92 struct btrfs_delayed_node *node;
  93
  94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
  95 if (node) {
  96 atomic_inc(&node->refs);
  97 return node;
  98 }


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
 #0 [88130404f940] machine_kexec at 8105ec10
 #1 [88130404f9b0] crash_kexec at 811145b8
 #2 [88130404fa80] oops_end at 8101a868
 #3 [88130404fab0] no_context at 8106ea91
 #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
 #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
 #6 [88130404fb60] __do_page_fault at 8106f328
 #7 [88130404fbd0] do_page_fault at 8106f637
 #8 [88130404fc10] page_fault at 816f6308
    [exception RIP: kmem_cache_alloc+121]
    RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
    RAX:   RBX:   RCX: 01c32b76
    RDX: 01c32b75  RSI:   RDI: 000224b0
    RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
    R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
    R13: 2026  R14: 887e3e230a00  R15: a01abf49
    ORIG_RAX:   CS: 0010  SS: 0018
 #9 [88130404fd10] btrfs_get_or_create_delayed_node at 
a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 
[btrfs]

#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
    RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
    RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
    RDX: 1000  RSI: 00c83da14000  RDI: 0011
    RBP:    R8:    R9: 
    R10:   R11: 0246  R12: 00c7
    R13: 02174e74  R14: 0555  R15: 0038
    ORIG_RAX: 00d9  CS: 0033  SS: 002b


We also find the list double add informations, including n_list and 
p_list:


[8642921.110568] [ cut here ]
[8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 
__list_add+0xbe/0xd0()
[8642921.263780] list_add corruption. prev->next should be next 
(887e40fa5368), but was ff:ff884c85a36288. 

Re: btrfs problems

2018-09-17 Thread Stefan K
> If your primary concern is to make the fs as stable as possible, then
> keep snapshots to a minimal amount and avoid any functionality you won't
> use, like qgroups, routine balance, or RAID5/6.
> 
> And keep the necessary btrfs-specific operations to a minimum, like
> subvolume/snapshot (and don't keep too many snapshots, say over 20),
> shrink, and send/receive.

hehe, that sounds like "hey, use btrfs, it's cool, but please - don't use any 
btrfs-specific feature" ;)

best
Stefan


Re: btrfs problems

2018-09-17 Thread Qu Wenruo


On 2018/9/17 19:55, Adrian Bastholm wrote:
>> Well, I'd say Debian is really not your first choice for btrfs.
>> The kernel is really old for btrfs.
>>
>> My personal recommend is to use rolling release distribution like
>> vanilla Archlinux, whose kernel is already 4.18.7 now.
> 
> I just upgraded to Debian Testing which has the 4.18 kernel

Then I strongly recommend to use the latest upstream kernel and progs
for btrfs. (thus using Debian Testing)

And if anything went wrong, please report asap to the mail list.

Especially for fs corruption, that's the ghost I'm always chasing.
So if any corruption happens again (although I hope it won't), I
may have a chance to catch it.

> 
>> Anyway, enjoy your stable fs even it's not btrfs anymore.
> 
> My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> remove vdevs from it, so I'm planning a comeback to BTRFS. I guess
> after the dust settled I realized I like the flexibility of BTRFS.
> 
> 
>  This time I'm considering BTRFS as rootfs as well, can I do an
> in-place conversion? There's this guide
> (https://www.howtoforge.com/how-to-convert-an-ext3-ext4-root-file-system-to-btrfs-on-ubuntu-12.10)
> I was planning on following.

Btrfs-convert is recommended mostly for a short-term trial (it keeps the
ability to roll back to ext* without anything modified).

From the code aspect, the biggest difference is the chunk layout.
Due to the ext* block group usage, each block group header (except for some
sparse bgs) is always used, thus btrfs can't use those ranges.

This leads to a highly fragmented chunk layout.
We don't have error reports about such a layout yet, but if you want
everything to be as stable as possible, I still recommend using a newly
created fs.

> 
> Another thing is I'd like to see a "first steps after getting started
> " section in the wiki. Something like take your first snapshot, back
> up, how to think when running it - can i just set some cron jobs and
> forget about it, or does it need constant attention, and stuff like
> that.

There are projects that do such things automatically, like snapper.

If your primary concern is to make the fs as stable as possible, then
keep snapshots to a minimal amount and avoid any functionality you won't
use, like qgroups, routine balance, or RAID5/6.

And keep the necessary btrfs-specific operations to a minimum, like
subvolume/snapshot (and don't keep too many snapshots, say over 20),
shrink, and send/receive.

Thanks,
Qu

> 
> BR Adrian
> 
> 





Re: btrfs problems

2018-09-17 Thread Adrian Bastholm
> Well, I'd say Debian is really not your first choice for btrfs.
> The kernel is really old for btrfs.
>
> My personal recommend is to use rolling release distribution like
> vanilla Archlinux, whose kernel is already 4.18.7 now.

I just upgraded to Debian Testing which has the 4.18 kernel

> Anyway, enjoy your stable fs even it's not btrfs anymore.

My new stable fs is too rigid. Can't grow it, can't shrink it, can't
remove vdevs from it, so I'm planning a comeback to BTRFS. I guess
after the dust settled I realized I like the flexibility of BTRFS.


 This time I'm considering BTRFS as rootfs as well, can I do an
in-place conversion? There's this guide
(https://www.howtoforge.com/how-to-convert-an-ext3-ext4-root-file-system-to-btrfs-on-ubuntu-12.10)
I was planning on following.

Another thing is I'd like to see a "first steps after getting started
" section in the wiki. Something like take your first snapshot, back
up, how to think when running it - can i just set some cron jobs and
forget about it, or does it need constant attention, and stuff like
that.

BR Adrian


-- 
Vänliga hälsningar / Kind regards,
Adrian Bastholm

``I would change the world, but they won't give me the sourcecode``


Re: btrfs problems

2018-09-16 Thread Chris Murphy
On Sun, Sep 16, 2018 at 2:11 PM, Adrian Bastholm  wrote:
> Thanks for answering Qu.
>
>> At this point, your fs is already corrupted.
>> I'm not sure about the reason; it can be a failed CoW combined with
>> powerloss, or a corrupted free space cache, or some old kernel bugs.
>>
>> Anyway, the metadata itself is already corrupted, and I believe it
>> happened even before you noticed.
>  I suspected it had to be like that
>>
>> > BTRFS check --repair is not recommended, it
>> > crashes , doesn't fix all problems, and I later found out that my
>> > lost+found dir had about 39G of lost files and dirs.
>>
>> lost+found is completely created by btrfs check --repair.
>>
>> > I spent about two days trying to fix everything, removing a disk,
>> > adding it again, checking , you name it. I ended up removing one disk,
>> > reformatting it, and moving the data there.
>>
>> Well, I would recommend to submit such problem to the mail list *BEFORE*
>> doing any write operation to the fs (including btrfs check --repair).
>> As it would help us to analyse the failure pattern to further enhance btrfs.
>
> IMHO that's a, how should I put it, a design flaw, the wrong way of
> looking at how people think, with all respect to all the very smart
> people that put in countless hours of hard work. Users expect and fs
> check and repair to repair, not to break stuff.
> Reading that --repair is "destructive" is contradictory even to me.

It's contradictory to everyone including the developers. No developer
set out to make --repair dangerous from the outset. It just turns out
that it was a harder problem to solve and the thought was that it
would keep getting better.

Newer versions are "should be safe" now even if they can't fix
everything. The far bigger issue I think the developers are aware of
is that depending on repair at all for any Btrfs of appreciable size,
is simply not scalable. Taking a day or a week to run a repair on a
large file system, is unworkable. And that's why it's better to avoid
inconsistencies in the first place which is what Btrfs is supposed to
do, and if that's not happening it's a bug somewhere in Btrfs and also
sometimes in the hardware.


> This problem emerged in a direcory where motion (the camera software)
> was saving pictures. Either killing the process or a powerloss could
> have left these jpg files (or fs metadata) in a bad state. Maybe
> that's something to go on. I was thinking that there's not much anyone
> can do without root access to my box anyway, and I'm not sure I was
> prepared to give that to anyone.

I can't recommend raid56 for people new to Btrfs. It really takes
qualified hardware to make sure there's no betrayal, as everything
gets a lot more complicated with raid56. The general state of faulty
device handling on Btrfs makes raid56 very much a hands-on approach;
you can't turn your back on it. And then when jumping into raid5, I
advise raid1 for metadata. It reduces problems. And that's true for
raid6 also, except that raid1 metadata has less redundancy than raid6,
so... it's not helpful if you end up losing 2 devices.

If you need production grade parity raid you should use openzfs,
although I can't speak to how it behaves with respect to faulty
devices on Linux.




>> For any unexpected btrfs behavior, from strange ls output to an aborted
>> transaction, please consult the mailing list first.
>> (Of course, with the kernel version and btrfs-progs version, which are
>> missing in your console log though.)
>
> Linux jenna 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21)
> x86_64 GNU/Linux
> btrfs-progs is already the newest version (4.7.3-1).

Well the newest versions are kernel 4.18.8, and btrfs-progs 4.17.1, so
in Btrfs terms those are kinda old.

That is not inherently bad, but there are literally thousands of
additions and deletions since kernel 4.9 so there's almost no way
anyone on this list, except a developer familiar with backport status,
can tell you if the problem you're seeing is a bug that's been fixed
in that particular version. There aren't that many developers that
familiar with that status who also have time to read user reports.
Since this is an upstream list, most developers will want to know if
you're able to reproduce the problem with a mainline kernel, because
if you can it's very probable it's a bug that needs to be fixed
upstream first before it can be backported. That's just the nature of
kernel development generally. And you'll find the same thing on ext4
and XFS lists...

The main reason why people use Debian and its older kernel bases is
they're willing to accept certain bugginess in favor of stability.
Transient bugs are really bad in that world. Consistent bugs they just
find work arounds for (avoidance) until there's a known highly tested
backport, because they want "The Behavior" to be predictable, both
good and bad. That is not a model well suited for a file system that's
in Btrfs really active development state. It's better now than it was
even a 

Re: btrfs problems

2018-09-16 Thread Chris Murphy
On Sun, Sep 16, 2018 at 7:58 AM, Adrian Bastholm  wrote:
> Hello all
> Actually I'm not trying to get any help any more, I gave up BTRFS on
> the desktop, but I'd like to share my efforts of trying to fix my
> problems, in hope I can help some poor noob like me.

There's almost no useful information provided for someone to even try
to reproduce your results, isolate cause and figure out the bugs.

No kernel version. No btrfs-progs version. No description of the
hardware and how it's laid out, and what mkfs and mount options are
being used. No one really has the time to speculate.


>BTRFS check --repair is not recommended

Right. So why did you run it anyway?

man btrfs check:

Warning
   Do not use --repair unless you are advised to do so by a
developer or an experienced user


It is always a legitimate complaint, despite this warning, if btrfs
check --repair makes things worse, because --repair shouldn't ever
make things worse. But Btrfs repairs are complicated, and that's why
the warning is there. I suppose the devs could have made the flag
--riskyrepair but I doubt this would really slow users down that much.
A big part of --repair fixes weren't known to make things worse at the
time, and edge cases where it made things worse kept popping up, so
only in hindsight does it make sense --repair maybe could have been
called something different to catch the user's attention.

But anyway, I see this same sort of thing on the linux-raid list all
the time. People run into trouble, and they press full forward making
all kinds of changes, each change increases the chance of data loss.
And then they come on the list with WTF messages. And it's always a
lesson in patience for the list regulars and developers... if only
you'd come to us with questions sooner.

> Please have a look at the console logs.

These aren't logs. It's a record of shell commands. Logs would include
kernel messages, ideally all of them. Why is device 3 missing? We have
no idea. Most of Btrfs code is in the kernel, problems are reported by
the kernel. So we need kernel messages, user space messages aren't
enough.

Anyway, good luck with openzfs, cool project.


-- 
Chris Murphy


Re: btrfs problems

2018-09-16 Thread Qu Wenruo


On 2018/9/16 21:58, Adrian Bastholm wrote:
> Hello all
> Actually I'm not trying to get any help any more, I gave up BTRFS on
> the desktop, but I'd like to share my efforts of trying to fix my
> problems, in hope I can help some poor noob like me.
> 
> I decided to use BTRFS after reading the ArsTechnica article about the
> next-gen filesystems, and BTRFS seemed like the natural choice, open
> source, built into linux, etc. I even bought a HP microserver to have
> everything on because none of the commercial NAS-es supported BTRFS.
> What a mistake, I wasted weeks in total managing something that could
> have taken a day to set up, and I'd have MUCH more functionality now
> (if I wasn't hit by some ransomware, that is).
> 
> I had three 1TB drives, chose to use raid, and all was good for a
> while, until I started fiddling with Motion, the image capturing
> software. When you kill that process (my take on it) a file can be
> written but it ends up with question marks instead of attributes, and
> it's impossible to remove.

At this point, your fs is already corrupted.
I'm not sure about the reason; it can be a failed CoW combined with
powerloss, or a corrupted free space cache, or some old kernel bugs.

Anyway, the metadata itself is already corrupted, and I believe it
happened even before you noticed.

> BTRFS check --repair is not recommended, it
> crashes , doesn't fix all problems, and I later found out that my
> lost+found dir had about 39G of lost files and dirs.

lost+found is completely created by btrfs check --repair.

> I spent about two days trying to fix everything, removing a disk,
> adding it again, checking , you name it. I ended up removing one disk,
> reformatting it, and moving the data there.

Well, I would recommend to submit such problem to the mail list *BEFORE*
doing any write operation to the fs (including btrfs check --repair).
As it would help us to analyse the failure pattern to further enhance btrfs.

> Now I removed BTRFS
> entirely and replaced it with a OpenZFS mirror array, to which I'll
> add the third disk later when I transferred everything over.

Understandable, it's really annoying when a fs just gets itself corrupted, and
without much btrfs-specific knowledge it would just be hell to try
any method to fix it (a lot of them would just make the case worse).

> 
> Please have a look at the console logs. I've been running linux on the
> desktop for the past 15 years, so I'm not a noob, but for running
> BTRFS you better be involved in the development of it.

I'd say, yes.
For any unexpected btrfs behavior, don't use btrfs check --repair unless
you're a developer or a developer asked you to do it.

For any unexpected btrfs behavior, from strange ls output to an aborted
transaction, please consult the mailing list first.
(Of course, with the kernel version and btrfs-progs version, which are
missing in your console log though.)

In fact, in recent (IIRC starting from v4.15) kernel releases, btrfs is
already doing much better error detection, thus it would detect such a
problem early on and protect the fs from being further modified.

(This further shows the importance of using the latest mainline
kernel rather than some old kernel provided by a stable distribution.)

Thanks,
Qu

> In my humble
> opinion, it's not for us "users" just yet. Not even for power users.
> 
> For those of you considering building a NAS without special purposes,
> don't. Buy a synology, pop in a couple of drives, and enjoy the ride.
> 
> 
> 
>  root  /home/storage/motion/2017-05-24  1  ls -al
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> total 4
> drwxrwxrwx 1 motion   motion   114 Sep 14 12:48 .
> drwxrwxr-x 1 motion   adyhasch  60 Sep 14 09:42 ..
> -? ? ??  ?? 36-20170524201346-02.jpg
> -? ? ??  ?? 36-20170524201346-02.jpg
> -rwxr-xr-x 1 adyhasch adyhasch  62 Sep 14 12:43 remove.py
> root  /home/storage/motion/2017-05-24  1  touch test.raw
>  root  /home/storage/motion/2017-05-24  cat /dev/random > test.raw
> ^C
> root  /home/storage/motion/2017-05-24  ls -al
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> total 8
> drwxrwxrwx 1 motion   motion   130 Sep 14 13:12 .
> drwxrwxr-x 1 motion   adyhasch  60 Sep 14 09:42 ..
> -? ? ??  ?? 36-20170524201346-02.jpg
> -? ? ??  ?? 36-20170524201346-02.jpg
> -rwxr-xr-x 1 adyhasch adyhasch  62 Sep 14 12:43 remove.py
> -rwxrwxrwx 1 root root 338 Sep 14 13:12 test.raw
>  root  /home/storage/motion/2017-05-24  1  cp test.raw
> 36-20170524201346-02.jpg
> 'test.raw' -> '36-20170524201346-02.jpg'
> 
>  root  /home/storage/motion/2017-05-24  ls -al
> total 20
> drwxrwxrwx 1 motion   motion   178 Sep 14 

Re: btrfs filesystem show takes a long time

2018-09-14 Thread David Sterba
Hi,

thanks for the report, I've forwarded it to the issue tracker
https://github.com/kdave/btrfs-progs/issues/148

The show command uses the information provided by blkid, which presumably
caches it. The default behaviour of 'fi show' is to skip mount checks,
so the delays are likely caused by blkid, but that's not the only
possible reason.

You may try to run the show command under strace to see where it blocks.
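
For example, a rough sketch (the log file name is arbitrary):

  sudo strace -f -tt -o fi-show.strace btrfs filesystem show

The -tt timestamps make it easy to see which syscall the delay is spent
in, e.g. a long read or ioctl on one particular block device.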


Re: btrfs convert problem

2018-09-14 Thread Qu Wenruo



On 2018/9/14 下午1:52, Nikolay Borisov wrote:
> 
> 
> On 14.09.2018 02:17, Qu Wenruo wrote:
>>
>>
>> On 2018/9/14 上午12:37, Nikolay Borisov wrote:
>>>
>>>
>>> On 13.09.2018 19:15, Serhat Sevki Dincer wrote:
> -1 seems to be EPERM; is your device write-protected, read-only or
> something like that?

 I am able to use ext4 partition read/write, created a file and wrote
 in it, un/re mounted it, all is ok.
 The drive only has a microusb port and a little led light; no rw
 switch or anything like that..

>>>
>>> What might help is running btrfs convert under strace, so something like:
>>>
>>> sudo strace -f -o strace.log btrfs-convert /dev/sdb1
>>>
>>> and then send the log.
>>>
>> strace would greatly help in this case.
>>
>> My guess is that something went wrong in migrate_super_block(), which
>> doesn't handle the ret > 0 case well.
>> That would mean a 4K pwrite doesn't finish in one call, which looks
>> pretty strange.
> 
> So Qu,
> 
> I'm not seeing any EPERM errors, though I see the following, so you might
> be right:
> 
> openat(AT_FDCWD, "/dev/sdb1", O_RDWR) = 5
> 
> followed by a lot of preads, the last one of which is:
> 
> pread64(5, 0x557a0abd10b0, 4096, 2732765184) = -1 EIO (Input/output
> error)

With context, it's pretty easy to locate the problem.

It's a bad block on your device.

Please use the following command to verify it:

dd if=/dev/sdb1 of=/dev/null bs=1 count=4096 skip=2732765184

It would also be worth checking the kernel dmesg to see if there is a clue there.
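
A rough sketch of what to look for there (device name taken from the
strace output above):

  dmesg | grep -i -e 'sdb' -e 'I/O error'

Any read error reported for that device would point to a media problem
rather than a btrfs-convert bug.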

The culprit code is read_disk_extent(), which resets ret to -1 when an
error happens.
I'll send patch(es) to fix it and enhance the error messages of btrfs-convert.


There is a workaround to let btrfs-convert continue: you could use the
--no-datasum option to skip datasum generation, so btrfs-convert won't
try to read such data.
But that's just the ostrich algorithm.
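
For example, on the same device as above (only a sketch; double-check
the option name against your btrfs-progs version):

  btrfs-convert --no-datasum /dev/sdb1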

I'd recommend reading all files in the original fs to locate which
file(s) are affected, and backing up the data ASAP.
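
A minimal sketch of one way to do that, assuming the original ext4 fs is
mounted read-only at /mnt/old (a placeholder path):

  find /mnt/old -type f -exec cat {} + > /dev/null

Any unreadable file shows up as "cat: ...: Input/output error" on
stderr, and the kernel log should report the failing blocks.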

Thanks,
Qu

> 
> This is then followed by a lot of pwrites; the last 10 or so syscalls
> are:
> 
> pwrite64(5, "\220\255\262\244\0\0\0\0\0\0"..., 16384, 91127808) = 16384
> 
> pwrite64(5, "|IO\233\0\0\0\0\0\0"..., 16384, 91144192) = 16384
> 
> pwrite64(5, "x\2501n\0\0\0\0\0\0"..., 16384, 91160576) = 16384
> 
> pwrite64(5, "\252\254l)\0\0\0\0\0\0"..., 16384, 91176960) = 16384
> 
> pwrite64(5, "P\256\331\373\0\0\0\0\0\0"..., 16384, 91193344) = 16384
> 
> pwrite64(5, "\3\230\230+\0\0\0\0\0\0"..., 4096, 83951616) = 4096
> 
> write(2, "ERROR: ", 7) = 7
> 
> write(2, "failed to "..., 37) = 37
> 
> write(2, "\n", 1) = 1
> 
> close(4) = 0
> 
> write(2, "WARNING: ", 9) = 9
> 
> write(2, "an error o"..., 104) = 104
> 
> write(2, "\n", 1) = 1
> 
> exit_group(1) = ?
> 
> but looking at migrate_super_block, it's not doing that many writes -
> just writing the sb at BTRFS_SUPER_INFO_OFFSET and then zeroing out
> 0..BTRFS_SUPER_INFO_OFFSET. Also, counting the number of pwrites:
> grep -c pwrite ~/Downloads/btrfs-issue/convert-strace.log
> 198
> 
> So the failure is not originating from migrate_super_block.
> 
>>
>> Thanks,
>> Qu
>>

