Re: btrfs subvolume mount with different options

2018-01-15 Thread Konstantin V. Gavrilenko
Thanks, chattr +C is what I am currently using.
Also, you have already answered my next question, namely why it is not possible to set the +C
attribute on an existing file :)


Yours sincerely,
Konstantin V. Gavrilenko


- Original Message -
From: "Roman Mamedov" <r...@romanrm.net>
To: "Konstantin V. Gavrilenko" <k.gavrile...@arhont.com>
Cc: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Friday, 12 January, 2018 9:37:49 PM
Subject: Re: btrfs subvolume mount with different options

On Fri, 12 Jan 2018 17:49:38 + (GMT)
"Konstantin V. Gavrilenko" <k.gavrile...@arhont.com> wrote:

> Hi list,
> 
> just wondering whether it is possible to mount two subvolumes with different 
> mount options, i.e.
> 
> |
> |- /a  defaults,compress-force=lzo

You can use different compression algorithms across the filesystem
(including none), via "btrfs properties" on directories or subvolumes. They
are inherited down the tree.

$ mkdir test
$ sudo btrfs prop set test compression zstd
$ echo abc > test/def
$ sudo btrfs prop get test/def compression
compression=zstd

But it appears this doesn't provide a way to apply compress-force.

> |- /b  defaults,nodatacow

Nodatacow can be applied to any dir/subvolume recursively, or to any file (as long
as it has already been created but not written to yet) via chattr +C.
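For example, a minimal sketch of the directory variant (the names here are made up):

$ mkdir vm-images
$ chattr +C vm-images              # newly created files inside inherit NOCOW
$ touch vm-images/disk.img         # still empty, so the attribute applies cleanly
$ lsattr -d vm-images vm-images/disk.img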

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs subvolume mount with different options

2018-01-12 Thread Konstantin V. Gavrilenko
Hi list,

just wondering whether it is possible to mount two subvolumes with different 
mount options, i.e.

|
|- /a  defaults,compress-force=lzo
|
|- /b  defaults,nodatacow


since, when both subvolumes are mounted, and when I change the option for one 
it is changed for all of them.


thanks in advance.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: super_total_bytes 32004083023872 mismatch with fs_devices total_rw_bytes 64008166047744

2017-10-24 Thread Konstantin V. Gavrilenko
The mention of the device scan code and the fact that total_bytes is doubled
made me try commenting out (hashing) the RAID entry in the fstab.
So I booted and ran "inspect-internal dump-super", which confirmed that the superblock is in
order.
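The dump was produced with something along these lines (the exact flags may have differed):

# btrfs inspect-internal dump-super -fa /dev/sda > hashed-inspect-internal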
# grep -i total_bytes hashed-inspect-internal

total_bytes 32004083023872
dev_item.total_bytes32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
total_bytes 32004083023872
dev_item.total_bytes32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
total_bytes 32004083023872
dev_item.total_bytes32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872
backup_total_bytes: 32004083023872

Then I uncommented (unhashed) the device in the fstab and mounted it manually, and it successfully
mounted.

# time mount /mnt/arh-backup1/

real2m49.021s
user0m0.000s
sys 0m1.244s



With the device uncommented in the fstab, I rebooted, and upon reboot I ran mount:

time mount /mnt/arh-backup1/
mount: wrong fs type, bad option, bad superblock on /dev/sda,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

real1m20.499s
user0m0.000s
sys 0m0.045s

That failed. I waited a couple more minutes, ran the mount again,
and it mounted successfully.


So it seems that, because mounting the device takes nearly 3 minutes, there is
some sort of race condition where two device scans are running at the same time,
or something similar.
One thing I can say for sure: it wasn't happening on 4.10, and I have only
observed such behaviour on 4.12 and 4.13.

p.s. the disk does not mount automatically upon boot, but can be mounted 
manually later


# uptime 
 19:54:45 up 4 min,  1 user,  load average: 0.30, 0.74, 0.39

# time mount /mnt/arh-backup1/

real2m52.247s
user0m0.000s
sys 0m1.246s


Here is the dmesg extract. For some reason, at second 204 the system returned
"open ctree failed"; at second 329 I started the mount manually.

[  204.389231] BTRFS error (device sda): open_ctree failed
[  329.234613] BTRFS info (device sda): force zlib compression
[  329.234618] BTRFS info (device sda): using free space tree
[  329.234620] BTRFS info (device sda): has skinny extents


hope that helps and thanks for your help


Yours sincerely,
Konstantin V. Gavrilenko



- Original Message -
From: "Qu Wenruo" <quwenruo.bt...@gmx.com>
To: "Konstantin V. Gavrilenko" <k.gavrile...@arhont.com>
Cc: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Tuesday, 24 October, 2017 3:44:21 PM
Subject: Re: super_total_bytes 32004083023872 mismatch with fs_devices 
total_rw_bytes 64008166047744



On 2017年10月24日 19:44, Konstantin V. Gavrilenko wrote:
> answers inline marked with KVG:
> 
> Yours sincerely,
> Konstantin V. Gavrilenko
> 
> 
> 
> 
> - Original Message -
> From: "Qu Wenruo" <quwenruo.bt...@gmx.com>
> To: "Konstantin V. Gavrilenko" <k.gavrile...@arhont.com>, "Linux fs Btrfs" 
> <linux-btrfs@vger.kernel.org>
> Sent: Tuesday, 24 October, 2017 11:37:56 AM
> Subject: Re: super_total_bytes 32004083023872 mismatch with fs_devices 
> total_rw_bytes 64008166047744
> 
> 
> 
> On 2017年10月24日 17:20, Konstantin V. Gavrilenko wrote:
>> Hi list,
>>
>> having installed the recent kernel version I am no longer able to mount the 
>> btrfs partition with compression on the first attempt. Previously on 
>> 4.10.0-37-generic everything was working fine, once I switched to 
>> 4.13.9-041309-generic I started getting the following error while trying to 
>> mount it with the same  options "compress-force=zlib,space_cache=v2"
>>
>> [  204.596381] BTRFS error (device sda): open_ctree failed
>> [  204.631895] BTRFS info (device sda): force zlib compression
>> [  204.631901] BTRFS info (device sda): using free space tree
>> [  204.631903] BTRFS info (device sda): has skinny extents
>> [  204.890145] BTRFS error (device sda): super_total_bytes 32004083023872 
>> mismatch with fs_devices total_rw_bytes 64008166047744
>> [  204.891276] BTRFS error (device sda): failed to read chunk tree: -22
>> [  204.944333] BTRFS error (device sda): open_ctree failed
> 
> Such problem c

super_total_bytes 32004083023872 mismatch with fs_devices total_rw_bytes 64008166047744

2017-10-24 Thread Konstantin V. Gavrilenko
Hi list,

Having installed a recent kernel version, I am no longer able to mount the
btrfs partition with compression on the first attempt. Previously, on
4.10.0-37-generic, everything was working fine; once I switched to
4.13.9-041309-generic I started getting the following error while trying to
mount it with the same options "compress-force=zlib,space_cache=v2":

[  204.596381] BTRFS error (device sda): open_ctree failed
[  204.631895] BTRFS info (device sda): force zlib compression
[  204.631901] BTRFS info (device sda): using free space tree
[  204.631903] BTRFS info (device sda): has skinny extents
[  204.890145] BTRFS error (device sda): super_total_bytes 32004083023872 
mismatch with fs_devices total_rw_bytes 64008166047744
[  204.891276] BTRFS error (device sda): failed to read chunk tree: -22
[  204.944333] BTRFS error (device sda): open_ctree failed

For some reason, the super_total_bytes is exactly half of total_rw_bytes.


However, if after the unsuccessful first mount attempt I mount it with a minimal
set of options, "space_cache=v2", the partition mounts. If I then umount it and
mount it again normally with the full set of options "compress-force=zlib,space_cache=v2",
it mounts without an error.
I also observed the same error on 4.12.14-041214-generic
Any ideas why this might be happening?
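For what it is worth, the sequence that currently works here is roughly the following
(device and mountpoint are examples):

# mount -o space_cache=v2 /dev/sda /mnt/backup                      # minimal options: mounts fine
# umount /mnt/backup
# mount -o compress-force=zlib,space_cache=v2 /dev/sda /mnt/backup  # full options now mount without error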



System information

distribution: Ubuntu 16.04
btrfs-progs v4.8.1 later upgraded to v4.13.3

# btrfs fi usage /mnt/backup
Overall:
Device size:  29.11TiB
Device allocated: 18.04TiB
Device unallocated:   11.07TiB
Device missing:  0.00B
Used: 17.99TiB
Free (estimated): 11.12TiB  (min: 5.58TiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,single: Size:17.93TiB, Used:17.88TiB
   /dev/sda   17.93TiB

Metadata,DUP: Size:53.50GiB, Used:51.78GiB
   /dev/sda  107.00GiB

System,DUP: Size:8.00MiB, Used:2.30MiB
   /dev/sda   16.00MiB

Unallocated:
   /dev/sda   11.07TiB




Yours sincerely,
Konstantin V. Gavrilenko


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs + compression = slow performance and high cpu usage

2017-08-31 Thread Konstantin V. Gavrilenko
Hello again, list. I thought I would clear things up and describe what has been
happening with my troubled RAID setup.

So, having received help from the list, I initially ran a full
defragmentation of all the data and recompressed everything with zlib.
That didn't help. Then I ran a full rebalance of the data, and that didn't
help either.
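(The two passes were along these lines; the mountpoint is an example and the exact
invocation may have differed slightly:)

# btrfs filesystem defragment -rv -czlib /mnt/arh-backup1    # recompress everything with zlib
# btrfs balance start /mnt/arh-backup1                       # full rebalance of the data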

So I had to take a disk out of the RAID, copy all the data onto it, recreate
the RAID volume with a 32 KB chunk size and a 96 KB stripe, and copy the data back.
Then I added the disk back and resynced the RAID.


So currently the RAID device is 

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name:
RAID Level  : Primary-5, Secondary-0, RAID Level Qualifier-3
Size: 21.830 TB
Sector Size : 512
Is VD emulated  : Yes
Parity Size : 7.276 TB
State   : Optimal
Strip Size  : 32 KB
Number Of Drives: 4
Span Depth  : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type : None
Bad Blocks Exist: No
Is VD Cached: No


It is about 40% full with compressed data
# btrfs fi usage /mnt/arh-backup1/
Overall:
Device size:  21.83TiB
Device allocated:  8.98TiB
Device unallocated:   12.85TiB
Device missing:  0.00B
Used:  8.98TiB
Free (estimated): 12.85TiB  (min: 6.43TiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)


I've decided to run a set of tests where a 5 GB file was written using different
block sizes and different flags.
One file with urandom data was generated and another one was filled with zeroes.
The data was written with and without compression, and it seems
that without compression it is possible to gain 30-40% in speed, while the CPU was
still 50% idle during the highest loads.
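For reference, each cell in the tables below comes from a run roughly like this
(file names are illustrative; bs and the sync flag were varied per row and table):

# dd if=/dev/urandom of=/tmp/test5g-rand bs=1M count=5120    # incompressible source, generated once
# dd if=/dev/zero    of=/tmp/test5g-zero bs=1M count=5120    # highly compressible source
# dd if=/tmp/test5g-rand of=/mnt/arh-backup1/testfile bs=128k conv=fsync   # e.g. the RAND / bs128k / conv=fsync cell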
dd write speeds (mb/s)

flags: conv=fsync
compress-force=zlib  compress-force=none
 RAND ZERORAND ZERO
bs1024k  387  407 584  577
bs512k   389  414 532  547
bs256k   412  409 558  585
bs128k   412  403 572  583
bs64k409  419 563  574
bs32k407  404 569  572


flags: oflag=sync
compress-force=zlib  compress-force=none
 RAND  ZERORAND  ZERO
bs1024k  86.1  97.0203   210
bs512k   50.6  64.485.0  170
bs256k   25.0  29.867.6  67.5
bs128k   13.2  16.448.4  49.8
bs64k7.4   8.3 24.5  27.9
bs32k3.8   4.1 14.0  13.7




flags: no flags
compress-force=zlib  compress-force=none
 RAND  ZERORAND  ZERO
bs1024k  480   419 681   595
bs512k   422   412 633   585
bs256k   413   384 707   712
bs128k   414   387 695   704
bs64k482   467 622   587
bs32k416   412 610   598


I have also run a test where I filled the array to about 97% capacity and the 
write speed went down by about 50% compared with the empty RAID.


thanks for the help. 

- Original Message -
From: "Peter Grandi" 
To: "Linux fs Btrfs" 
Sent: Tuesday, 1 August, 2017 10:09:03 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,

> KOS: No I don't have persistent cache. Only the 512 Mb cache
> on board of a controller, that is BBU.

If it is a persistent cache, that can be battery-backed (as I
wrote, but it seems that you don't have too much time to read
replies) then the size of the write, 128KiB or not, should not
matter much; the write will be reported complete when it hits
the persistent cache (whichever technology it used), and then
the HA fimware will spill write cached data to the disks using
the optimal operation width.

Unless the 3ware firmware is really terrible (and depending on
model and vintage it can be amazingly terrible) or the battery
is no longer recharging and then the host adapter switches to
write-through.

That you see very different rates between uncompressed and
compressed writes, where the main difference is the limitation
on the segment size, seems to indicate that compressed writes
involve a lot of RMW, that is sub-stripe updates. As I mentioned
already, it would be interesting to retry 'dd' with different
'bs' values without compression and with 'sync' (or 'direct'
which only makes sense without compression).

> If I had additional SSD caching on the controller I would have
> mentioned it.

So far you had not mentioned 

Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Konstantin V. Gavrilenko
Roman, initially I had a single process occupying 100% CPU; when I triggered sysrq it was
showing up in "btrfs_find_space_for_alloc".
But that was when I used the autodefrag, compress, compress-force and commit=10
mount flags, and space_cache was v1 by default.
When I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% CPU
disappeared, but the shite performance remained.


As to the chunk size, there is no information in the article about the type of
data that was used, while in our case we are pretty certain about the
compressed block size (32-128 KB). I am currently inclined towards 32k, as it
might be ideal in a situation where we have a 5-disk RAID5 array.

In theory
1. The minimum compressed write (32k) would fill the chunk on a single disk, 
thus the IO cost of the operation would be 2 reads (original chunk + original 
parity)  and 2 writes (new chunk + new parity)

2. The maximum compressed write (128k) would require the update of 1 chunk on 
each of the 4 data disks + 1 parity  write 



Stefan, what mount flags do you use?

kos



- Original Message -
From: "Roman Mamedov" <r...@romanrm.net>
To: "Konstantin V. Gavrilenko" <k.gavrile...@arhont.com>
Cc: "Stefan Priebe - Profihost AG" <s.pri...@profihost.ag>, "Marat Khalili" 
<m...@rqc.ru>, linux-btrfs@vger.kernel.org, "Peter Grandi" 
<p...@btrfs.list.sabi.co.uk>
Sent: Wednesday, 16 August, 2017 2:00:03 PM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
"Konstantin V. Gavrilenko" <k.gavrile...@arhont.com> wrote:

> I believe the chunk size of 512kb is even worse for performance than the
> default setting of 256kb on my HW RAID.

It might be, but that does not explain the original problem reported at all.
If mdraid performance would be the bottleneck, you would see high iowait,
possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
thread pegging into 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with 64
> or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html
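(For an md array the chunk size is fixed at creation time, e.g. something like this;
device names are examples:)

# mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 /dev/sd[cdef]1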

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Konstantin V. Gavrilenko


I believe the chunk size of 512kb is even worse for performance than the
default setting of 256kb on my HW RAID.

Peter Grandi explained it earlier on in one of his posts.

QTE
++
That runs counter to this simple story: suppose a program is
doing 64KiB IO:

* For *reads*, there are 4 data drives and the strip size is
  16KiB: the 64KiB will be read in parallel on 4 drives. If the
  strip size is 256KiB then the 64KiB will be read sequentially
  from just one disk, and 4 successive reads will be read
  sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much
  more extreme: the 64KiB will be written with 16KiB strips on a
  5-wide RAID5 set in parallel to 5 drives, with 4 stripes being
  updated with RMW. But with 256KiB strips it will partially
  update 5 drives, because the stripe is 1024+256KiB, and it
  needs to do RMW, and four successive 64KiB drives will need to
  do that too, even if only one drive is updated. Usually for
  RAID5 there is an optimization that means that only the
  specific target drive and the parity drives(s) need RMW, but
  it is still very expensive.

This is the "storage for beginners" version, what happens in
practice however depends a lot on specific workload profile
(typical read/write size and latencies and rates), caching and
queueing algorithms in both Linux and the HA firmware.
++
UNQTE


I've also found another explanation of the same problem, the right chunk
size, and how it works here:
http://holyhandgrenade.org/blog/2011/08/disk-performance-part-2-raid-layouts-and-stripe-sizing/#more-1212



So in my understanding, when working with compressed data, the compressed extents
will vary between 128kb (urandom) and 32kb (zeroes), and that is what gets passed to the
FS to take care of.

And in our setup with large chunk sizes, if we need to write 32kb-128kb of
compressed data, the RAID5 would need to perform 3 read operations and 2 write
operations.

This is because updating a parity chunk requires either
- the original chunk, the new chunk, and the old parity block, or
- all chunks (except for the parity chunk) in the stripe.

disk        disk1   disk2   disk3   disk4
chunk size  512kb   512kb   512kb   512kb P

So in the worst case scenario, in order to write 32kb, the RAID5 would need to read
(480 + 512 + P512) and then write (32 + P512).

That's my current understanding of the situation.
I was planning to write an update to my story later on, once I hopefully solve
the problem. But as an interim update: I have performed a full defrag
with full compression (2 days), then a balance of all the data (10 days), and it
didn't help the performance.

So now I am moving the data off the array and will be rebuilding it with a 64 or
32 chunk size and checking the performance.

VG,
kos



- Original Message -
From: "Stefan Priebe - Profihost AG" <s.pri...@profihost.ag>
To: "Konstantin V. Gavrilenko" <k.gavrile...@arhont.com>
Cc: "Marat Khalili" <m...@rqc.ru>, linux-btrfs@vger.kernel.org
Sent: Wednesday, 16 August, 2017 11:26:38 AM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On 16.08.2017 at 11:02, Konstantin V. Gavrilenko wrote:
> Could be similar issue as what I had recently, with the RAID5 and 256kb chunk 
> size.
> please provide more information about your RAID setup.

Hope this helps:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath]
[raid0] [raid10]
md0 : active raid5 sdd1[1] sdf1[4] sdc1[0] sde1[2]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 6/30 pages [24KB], 65536KB chunk

md2 : active raid5 sdm1[2] sdl1[1] sdk1[0] sdn1[4]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 7/30 pages [28KB], 65536KB chunk

md1 : active raid5 sdi1[2] sdg1[0] sdj1[4] sdh1[1]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 7/30 pages [28KB], 65536KB chunk

md3 : active raid5 sdp1[1] sdo1[0] sdq1[2] sdr1[4]
  11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []
  bitmap: 6/30 pages [24KB], 65536KB chunk

# btrfs fi usage /vmbackup/
Overall:
Device size:  43.65TiB
Device allocated: 31.98TiB
Device unallocated:   11.67TiB
Device missing:  0.00B
Used: 30.80TiB
Free (estimated): 12.84TiB  (min: 12.84TiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID0: Size:31.83TiB, Used:30.66TiB
   /dev/md07.96TiB
   /dev/md17.96TiB
   /dev/md27.96TiB
   /dev/md37.96TiB

Metadata,RAID0: Size:153.00GiB, Used:141.34GiB
   /dev/md0   38.25GiB
   /dev/md1   38.25GiB
   /dev/md2   38.25GiB
   /dev/md3

Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Konstantin V. Gavrilenko
Could be a similar issue to what I had recently, with RAID5 and a 256kb chunk
size.

Please provide more information about your RAID setup.

p.s.
You can also check the thread "Btrfs + compression = slow performance and high
cpu usage".

- Original Message -
From: "Stefan Priebe - Profihost AG" 
To: "Marat Khalili" , linux-btrfs@vger.kernel.org
Sent: Wednesday, 16 August, 2017 10:37:43 AM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On 16.08.2017 at 08:53, Marat Khalili wrote:
>> I've one system where a single kworker process is using 100% CPU
>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>> there anything i can do to get the old speed again or find the culprit?
> 
> 1. Do you use quotas (qgroups)?

No qgroups and no quota.

> 2. Do you have a lot of snapshots? Have you deleted some recently?

1413 Snapshots. I'm deleting 50 of them every night. But btrfs-cleaner
process isn't running / consuming CPU currently.

> More info about your system would help too.
Kernel is OpenSuSE Leap 42.3.

btrfs is mounted with
compress-force=zlib

btrfs is running as a raid0 on top of 4 md raid 5 devices.

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs + compression = slow performance and high cpu usage

2017-08-01 Thread Konstantin V. Gavrilenko
- Original Message -
From: "Peter Grandi" 
To: "Linux fs Btrfs" 
Sent: Tuesday, 1 August, 2017 3:14:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.



As I wrote, "their size is just limited by the compression code"
which results in "128KiB writes". On a "fresh empty Btrfs volume"
the compressed extents limited to 128KiB also happen to be pretty
physically contiguous, but on a more fragmented free space list
they can be more scattered.

KOS: OK, thanks for pointing it out. I have compared the filefrag -v output on another
btrfs that is not fragmented
and can see the difference from what is happening on the sluggish one.

5824:   186368..  186399: 2430093383..2430093414: 32: 2430093414: encoded
5825:   186400..  186431: 2430093384..2430093415: 32: 2430093415: encoded
5826:   186432..  186463: 2430093385..2430093416: 32: 2430093416: encoded
5827:   186464..  186495: 2430093386..2430093417: 32: 2430093417: encoded
5828:   186496..  186527: 2430093387..2430093418: 32: 2430093418: encoded
5829:   186528..  186559: 2430093388..2430093419: 32: 2430093419: encoded
5830:   186560..  186591: 2430093389..2430093420: 32: 2430093420: encoded



As I already wrote the main issue here seems to be that we are
talking about a "RAID5 with 128KiB writes and a 768KiB stripe
size". On MD RAID5 the slowdown because of RMW seems only to be
around 30-40%, but it looks like that several back-to-back 128KiB
writes get merged by the Linux IO subsystem (not sure whether
that's thoroughly legal), and perhaps they get merged by the 3ware
firmware only if it has a persistent cache, and maybe your 3ware
does not have one, but you have kept your counsel as to that.


KOS: No, I don't have a persistent cache. Only the 512 MB cache on board the
controller, which is
battery-backed (BBU). If I had additional SSD caching on the controller I would have mentioned
it.

I was also under the impression that in a situation where mostly extra-large files
will be stored on the array, a bigger strip size would indeed increase the
speed, thus I went with the 256 KB strip size. Would I be correct in
assuming that a RAID strip size of 128 KB would be a better choice if one
plans to use BTRFS with compression?

thanks,
kos




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs + compression = slow performance and high cpu usage

2017-08-01 Thread Konstantin V. Gavrilenko
Peter, I don't think filefrag is showing the correct fragmentation status
of a file when compression is used.
At least not the one that is installed by default in Ubuntu 16.04 - e2fsprogs |
1.42.13-1ubuntu1.

So, for example, the reported fragmentation of a compressed file is 320 times higher than
that of an uncompressed one.

root@homenas:/mnt/storage/NEW# filefrag test5g-zeroes
test5g-zeroes: 40903 extents found

root@homenas:/mnt/storage/NEW# filefrag test5g-data 
test5g-data: 129 extents found


I am currently defragmenting that mountpoint, ensuring that everything is
compressed with zlib.
# btrfs fi defragment -rv -czlib /mnt/arh-backup

My guess is that it will take another 24-36 hours to complete, and then I will
redo the test to see if that has helped.
Will keep the list posted.

p.s. Any other suggestions that might help with the fragmentation and data
allocation? Should I try to rebalance the data on the drive?
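(If a rebalance is worth trying, it would be something along these lines; the usage
filter is only an illustration to limit how much data gets rewritten:)

# btrfs balance start -dusage=75 /mnt/arh-backup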

kos



- Original Message -
From: "Peter Grandi" 
To: "Linux fs Btrfs" 
Sent: Monday, 31 July, 2017 1:41:07 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

[ ... ]

> grep 'model name' /proc/cpuinfo | sort -u 
> model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz

Good, contemporary CPU with all accelerations.

> The sda device is a hardware RAID5 consisting of 4x8TB drives.
[ ... ]
> Strip Size  : 256 KB

So the full RMW data stripe length is 768KiB.

> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.

That's weird; of course 'lzo' is a lot cheaper than 'zlib', but
in my test the much higher CPU time of the latter was spread
across many CPUs, while in your case it wasn't, even if the
E5645 has 6 CPUs and can do 12 threads. That seemed to point to
some high cost of finding free blocks, that is a very fragmented
free list, or something else.

> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s

The results with 'oflag=direct' are not relevant, because Btrfs
behaves "differently" with that.

> mountflags: 
> (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s
> mountflags: 
> (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s

That's pretty good for a RAID5 with 128KiB writes and a 768KiB
stripe size, on a 3ware, and looks like that the hw host adapter
does not have a persistent cache (battery backed usually). My
guess that watching transfer rates and latencies with 'iostat
-dk -zyx 1' did not happen.

> mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s

I had mentioned in my previous reply the output of 'filefrag'.
That to me seems relevant here, because of RAID5 RMW and maximum
extent size with Brfs compression and strip/stripe size.

Perhaps redoing the tests with a 128KiB 'bs' *without*
compression would be interesting, perhaps even with 'oflag=sync'
instead of 'conv=fsync'.

It is hard for me to see a speed issue here with Btrfs: for
comparison I have done a simple test with a both a 3+1 MD RAID5
set with a 256KiB chunk size and a single block device on
"contemporary" 1T/2TB drives, capable of sequential transfer
rates of 150-190MB/s:

  soft#  grep -A2 sdb3 /proc/mdstat 
  md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] []

with compression:

  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 
/mnt/test5   
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
  0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 
2932maxresident)k
  13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
  0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 
2940maxresident)k
  13042144inputs+20482400outputs 

Re: Btrfs + compression = slow performance and high cpu usage

2017-07-30 Thread Konstantin V. Gavrilenko
Thanks for the comments. Initially the system performed well; I don't have the
benchmark details written down, but the compressed vs non-compressed speeds were
more or less similar. However, after several weeks of usage the system started
experiencing the described slowdowns, so I started investigating the problem.
This indeed is a backup drive, but it predominantly contains large files.

# ls -lahR | awk '/^-/ {print $5}' | sort | uniq -c  | sort -n | tail -n 15
  5 322
  5 396
  5 400
  6 1000G
  6 11
  6 200G
  8 24G
  8 48G
 13 500G
 20 8.0G
 25 165G
 32 20G
 57 100G
103 50G
201 10G


# grep 'model name' /proc/cpuinfo | sort -u 
model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz

# lsscsi | grep 'sd[ae]'
[4:2:0:0]diskLSI  MR9260-8i2.13  /dev/sda 


The sda device is a hardware RAID5 consisting of 4x8TB drives.

Virtual Drive: 0 (Target Id: 0)
Name:
RAID Level  : Primary-5, Secondary-0, RAID Level Qualifier-3
Size: 21.830 TB
Sector Size : 512
Is VD emulated  : Yes
Parity Size : 7.276 TB
State   : Optimal
Strip Size  : 256 KB
Number Of Drives: 4
Span Depth  : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type : None
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 4


I have changed the mount flags as suggested, and I no longer see the previously
reported behaviour of one of the kworkers consuming 100% of the CPU time, but the
write speed difference between compression ON vs OFF is still pretty large.
I have run several tests with zlib, lzo and no compression, and the results
are rather strange.

mountflags: (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress 
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 93.3418 s, 57.5 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s



mountflags: (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress 
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 116.246 s, 46.2 MB/s


dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 14.704 s, 365 MB/s


dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s



mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress 
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 32.2551 s, 166 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 19.9464 s, 269 MB/s

dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s


The CPU usage is pretty low as well now. For example, here is what it looks like
while compress-force=zlib is in effect:

Linux 4.10.0-28-generic (ais-backup1)   30/07/17_x86_64_(12 CPU)

14:31:27CPU %user %nice   %system   %iowait%steal %idle
14:31:28all  0.00  0.00  1.50  0.00  0.00 98.50
14:31:29all  0.00  0.00  4.78  3.52  0.00 91.69
14:31:30all  0.08  0.00  4.92  3.75  0.00 91.25
14:31:31all  0.00  0.00  4.76  3.76  0.00 91.49
14:31:32all  0.00  0.00  4.76  3.76  0.00 91.48
14:31:33all  0.08  0.00  4.67  3.76  0.00 91.49
14:31:34all  0.00  0.00  4.76  3.68  0.00 91.56
14:31:35all  0.08  0.00  4.76  3.76  0.00 91.40
14:31:36all  0.00  0.00  4.60  3.77  0.00 91.63
14:31:37all  0.00  0.00  4.68  3.68  0.00 91.64
14:31:38all  0.08  0.00  4.52  3.76  0.00 91.64
14:31:39all  0.08  0.00  4.68  3.76  0.00 91.48
14:31:40all  0.08  0.00  4.52  3.76  0.00 91.64
14:31:41all  0.00  0.00  4.61  3.77  0.00 91.62
14:31:42all  0.08  0.00  5.07  3.74  0.00 91.10
14:31:43all  0.00  0.00  4.68  3.68  0.00 91.64
14:31:44  

Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Konstantin V. Gavrilenko
Hello list, 

I am stuck with a problem of slow btrfs performance when using compression.

When the compress-force=lzo mount flag is enabled, the performance drops to
30-40 MB/s and one of the btrfs processes uses 100% of a CPU.
mount options: btrfs 
relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10

The command I am using to test the write throughput is:

# pv -tpreb /dev/sdb | dd of=./testfile bs=1M oflag=direct

# top -d 1 
top - 15:49:13 up  1:52,  2 users,  load average: 5.28, 2.32, 1.39
Tasks: 320 total,   2 running, 318 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  2.0 sy,  0.0 ni, 77.0 id, 21.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.0 sy,  0.0 ni, 90.0 id,  9.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  1.0 sy,  0.0 ni, 72.0 id, 27.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  1.0 sy,  0.0 ni, 57.0 id, 42.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni, 96.0 id,  4.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni, 94.0 id,  6.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  1.0 sy,  0.0 ni, 95.1 id,  3.9 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  1.0 us,  2.0 sy,  0.0 ni, 24.0 id, 73.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni, 81.8 id, 18.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  1.0 us,  0.0 sy,  0.0 ni, 98.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  2.0 sy,  0.0 ni, 83.3 id, 14.7 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32934136 total, 10137496 free,   602244 used, 22194396 buff/cache
KiB Swap:0 total,0 free,0 used. 30525664 avail Mem 

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND 

  
37017 root  20   0   0  0  0 R 100.0  0.0   0:32.42 
kworker/u49:8   
  
36732 root  20   0   0  0  0 D   4.0  0.0   0:02.40 
btrfs-transacti 
  
40105 root  20   08388   3040   2000 D   4.0  0.0   0:02.88 dd   


The kworker process that causes the high CPU usage is most likely searching
for free space.

# echo l > /proc/sysrq-trigger

# dmesg -T
[Fri Jul 28 15:57:51 2017] CPU: 1 PID: 36430 Comm: kworker/u49:2 Not tainted 
4.10.0-28-generic #32~16.04.2-Ubuntu
[Fri Jul 28 15:57:51 2017] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b 
  11/16/2012
[Fri Jul 28 15:57:51 2017] Workqueue: btrfs-delalloc btrfs_delalloc_helper 
[btrfs]
[Fri Jul 28 15:57:51 2017] task: 9ddce6206a40 task.stack: aa9121f6c000
[Fri Jul 28 15:57:51 2017] RIP: 0010:rb_next+0x1e/0x40
[Fri Jul 28 15:57:51 2017] RSP: 0018:aa9121f6fb40 EFLAGS: 0282
[Fri Jul 28 15:57:51 2017] RAX: 9dddc34df1b0 RBX: 0001 RCX: 
1000
[Fri Jul 28 15:57:51 2017] RDX: 9dddc34df708 RSI: 9ddccaf470a4 RDI: 
9dddc34df2d0
[Fri Jul 28 15:57:51 2017] RBP: aa9121f6fb40 R08: 0001 R09: 
3000
[Fri Jul 28 15:57:51 2017] R10:  R11: 0002 R12: 
9ddccaf47080
[Fri Jul 28 15:57:51 2017] R13: 1000 R14: aa9121f6fc50 R15: 
9dddc34df2d0
[Fri Jul 28 15:57:51 2017] FS:  () 
GS:9ddcefa4() knlGS:
[Fri Jul 28 15:57:51 2017] CS:  0010 DS:  ES:  CR0: 80050033
[Fri Jul 28 15:57:51 2017] Call Trace:
[Fri Jul 28 15:57:51 2017]  btrfs_find_space_for_alloc+0xde/0x270 [btrfs]
[Fri Jul 28 15:57:51 2017]  find_free_extent.isra.68+0x3c6/0x1040 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_reserve_extent+0xab/0x210 [btrfs]
[Fri Jul 28 15:57:51 2017]  submit_compressed_extents+0x154/0x580 [btrfs]
[Fri Jul 28 15:57:51 2017]  ? submit_compressed_extents+0x580/0x580 [btrfs]
[Fri Jul 28 15:57:51 2017]  async_cow_submit+0x82/0x90 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_scrubparity_helper+0x1fe/0x300 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[Fri Jul 28 15:57:51 2017]  process_one_work+0x16b/0x4a0
[Fri Jul 28 15:57:51 2017]  worker_thread+0x4b/0x500
[Fri Jul 28 15:57:51 2017]  kthread+0x109/0x140




When compression is turned off, I am able to get the maximum 500-600 MB/s
write speed on this disk (RAID array) with minimal CPU usage.

mount options: relatime,discard,autodefrag,space_cache=v2,commit=10

# iostat -m 1 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.080.007.74   10.770.00   81.40

Device:tpsMB_read/sMB_wrtn/sMB_readMB_wrtn
sda2376.00 0.00   

Re: btrfs-progs confusing message

2016-04-21 Thread Konstantin Svist
On 04/21/2016 04:02 AM, Austin S. Hemmelgarn wrote:
> On 2016-04-20 16:23, Konstantin Svist wrote:
>> Pretty much all commands print out the usage message when no device is
>> specified:
>>
>> [root@host ~]# btrfs scrub start
>> btrfs scrub start: too few arguments
>> usage: btrfs scrub start [-BdqrRf] [-c ioprio_class -n ioprio_classdata]
>> |
>> ...
>>
>> However, balance doesn't
>>
>> [root@host ~]# btrfs balance start
>> ERROR: can't access 'start': No such file or directory
>
> And this is an example of why backwards comparability can be a pain.
> The original balance command was 'btrfs filesystem balance', and had
> no start, stop, or similar sub-commands.  This got changed to the
> current incarnation when the support for filters was added.  For
> backwards compatibility reasons, we decided to still accept balance
> with no arguments other than the path as being the same as running
> 'btrfs balance start' on that path, and then made the old name an
> alias to the new one, with the restriction that you can't pass in
> filters through that interface.  What is happening here is that
> balance is trying to interpret start as a path, not a command, hence
> the message about not being able to access 'start'.
>

So since this is still detected as an error, why not print usage info at
this point?


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs-progs confusing message

2016-04-20 Thread Konstantin Svist
Pretty much all commands print out the usage message when no device is
specified:

[root@host ~]# btrfs scrub start
btrfs scrub start: too few arguments
usage: btrfs scrub start [-BdqrRf] [-c ioprio_class -n ioprio_classdata]
|
...

However, balance doesn't

[root@host ~]# btrfs balance start
ERROR: can't access 'start': No such file or directory




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bedup --defrag freezing

2015-08-12 Thread Konstantin Svist
On 08/06/2015 04:10 AM, Austin S Hemmelgarn wrote:
 On 2015-08-05 17:45, Konstantin Svist wrote:
 Hi,

 I've been running btrfs on Fedora for a while now, with bedup --defrag
 running in a night-time cronjob.
 Last few runs seem to have gotten stuck, without possibility of even
 killing the process (kill -9 doesn't work) -- all I could do is hard
 power cycle.

 Did something change recently? Is bedup simply too out of date? What
 should I use to de-duplicate across snapshots instead? Etc.?

 AFAIK, bedup hasn't been actively developed for quite a while (I'm
 actually kind of surprised it runs with the newest btrfs-progs).
 Personally, I'd suggest using duperemove
 (https://github.com/markfasheh/duperemove)

Thanks, good to know.
Tried duperemove -- it looks like it builds a database of its own
checksums every time it runs... why won't it use BTRFS internal
checksums for fast rejection? Would run a LOT faster...
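(It can at least persist its own checksum database between runs via --hashfile, so
only new or changed files get rescanned; a sketch, paths are examples:)

# duperemove -dr --hashfile=/var/cache/dupehash.db /mnt/storage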


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


bedup --defrag freezing

2015-08-05 Thread Konstantin Svist
Hi,

I've been running btrfs on Fedora for a while now, with bedup --defrag
running in a night-time cronjob.
Last few runs seem to have gotten stuck, without possibility of even
killing the process (kill -9 doesn't work) -- all I could do is hard
power cycle.

Did something change recently? Is bedup simply too out of date? What
should I use to de-duplicate across snapshots instead? Etc.?


Thanks,
Konstantin



# uname -a
Linux mireille.svist.net 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10
21:09:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.1

# btrfs fi show
Label: none  uuid: 5ac56e7d-3d04-4ffa-8160-5a47f46c2939
Total devices 1 FS bytes used 243.43GiB
devid1 size 465.76GiB used 318.05GiB path /dev/sda2

btrfs-progs v4.1

# btrfs fi df /
Data, single: total=309.01GiB, used=238.24GiB
System, single: total=32.00MiB, used=64.00KiB
Metadata, single: total=9.01GiB, used=5.19GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

dmesg attached

[0.00] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.0.8-200.fc21.x86_64 (mockbu...@bkernel02.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #1 SMP Fri Jul 10 21:09:54 UTC 2015
[0.00] Command line: BOOT_IMAGE=/main/boot/vmlinuz-4.0.8-200.fc21.x86_64 root=/dev/sda2 ro rootflags=subvol=main vconsole.font=latarcyrheb-sun16 quiet
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xba14] usable
[0.00] BIOS-e820: [mem 0xba15-0xba156fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xba157000-0xba94] usable
[0.00] BIOS-e820: [mem 0xba95-0xbabedfff] reserved
[0.00] BIOS-e820: [mem 0xbabee000-0xcac0afff] usable
[0.00] BIOS-e820: [mem 0xcac0b000-0xcb10afff] reserved
[0.00] BIOS-e820: [mem 0xcb10b000-0xcb63dfff] usable
[0.00] BIOS-e820: [mem 0xcb63e000-0xcb7aafff] ACPI NVS
[0.00] BIOS-e820: [mem 0xcb7ab000-0xcbffefff] reserved
[0.00] BIOS-e820: [mem 0xcbfff000-0xcbff] usable
[0.00] BIOS-e820: [mem 0xcd00-0xcf1f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00022fdf] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.8 present.
[0.00] DMI: Notebook P15SM-A/SM1-A/P15SM-A/SM1-A, BIOS 4.6.5 03/27/2014
[0.00] e820: update [mem 0x-0x0fff] usable == reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x22fe00 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C write-protect
[0.00]   D-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask 7E write-back
[0.00]   1 base 02 mask 7FE000 write-back
[0.00]   2 base 022000 mask 7FF000 write-back
[0.00]   3 base 00E000 mask 7FE000 uncachable
[0.00]   4 base 00D000 mask 7FF000 uncachable
[0.00]   5 base 00CE00 mask 7FFE00 uncachable
[0.00]   6 base 00CD00 mask 7FFF00 uncachable
[0.00]   7 base 022FE0 mask 7FFFE0 uncachable
[0.00]   8 disabled
[0.00]   9 disabled
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] e820: update [mem 0xcd00-0x] usable == reserved
[0.00] e820: last_pfn = 0xcc000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fd830-0x000fd83f] mapped at [880fd830]
[0.00] Base memory trampoline at [88097000] 97000 size 24576
[0.00] Using

corrupt 1, but no other indicators

2015-03-28 Thread Konstantin Svist
I'm seeing the following message on every bootup in dmesg and
/var/log/messages:
  BTRFS: bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0

I've tried running scrub and it doesn't indicate that any errors occurred.
Is this normal? Is something actually corrupted? Can I fix it?
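(For reference, those per-device error counters can be inspected, and with a reasonably
recent btrfs-progs also reset, roughly like this; a sketch, not necessarily what I ran:)

# btrfs device stats /
# btrfs device stats -z /    # print and then zero the counters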


Details:

[root@mireille ~]# uname -a
Linux mireille.svist.net 3.19.1-201.fc21.x86_64 #1 SMP Wed Mar 18
04:29:24 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@mireille ~]# btrfs fi show
Label: none  uuid: 5ac56e7d-3d04-4ffa-8160-5a47f46c2939
Total devices 1 FS bytes used 237.28GiB
devid1 size 465.76GiB used 465.76GiB path /dev/sda2

Btrfs v3.18.1
[root@mireille ~]# btrfs --version
Btrfs v3.18.1
[root@mireille ~]# btrfs fi show
Label: none  uuid: 5ac56e7d-3d04-4ffa-8160-5a47f46c2939
Total devices 1 FS bytes used 237.28GiB
devid1 size 465.76GiB used 465.76GiB path /dev/sda2

Btrfs v3.18.1
[root@mireille ~]# btrfs fi df /
Data, single: total=457.75GiB, used=232.64GiB
System, single: total=4.00MiB, used=80.00KiB
Metadata, single: total=8.01GiB, used=4.64GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


dmesg: http://pastebin.com/9B0h4SuA
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Konstantin

Phillip Susi schrieb am 08.12.2014 um 15:59:
 On 12/7/2014 7:32 PM, Konstantin wrote:
  I'm guessing you are using metadata format 0.9 or 1.0, which put
  the metadata at the end of the drive and the filesystem still
  starts in sector zero.  1.2 is now the default and would not have
  this problem as its metadata is at the start of the disk ( well,
  4k from the start ) and the fs starts further down.
  I know this and I'm using 0.9 on purpose. I need to boot from
  these disks so I can't use 1.2 format as the BIOS wouldn't
  recognize the partitions. Having an additional non-RAID disk for
  booting introduces a single point of failure which contrary to the
  idea of RAID0.

 The bios does not know or care about partitions.  All you need is a
That's only true for older BIOSs. With current EFI boards they not only
care but some also mess around with GPT partition tables.
 partition table in the MBR and you can install grub there and have it
 boot the system from a mdadm 1.1 or 1.2 format array housed in a
 partition on the rest of the disk.  The only time you really *have* to
I was thinking of this solution as well, but as I'm not aware of any
partitioning tool that cares about mdadm metadata, I rejected it. It
requires a non-standard layout leaving reserved empty spaces for mdadm
metadata. It's possible, but it isn't documented as far as I know, and rather
than losing hours trying it I chose the obvious one.
 use 0.9 or 1.0 ( and you really should be using 1.0 instead since it
 handles larger arrays and can't be confused vis. whole disk vs.
 partition components ) is if you are running a raid1 on the raw disk,
 with no partition table and then partition inside the array instead,
 and really, you just shouldn't be doing that.
That's exactly what I want to do - running RAID1 on the whole disk as
most hardware based RAID systems do. Before that I was running RAID on
disk partitions for some years but this was quite a pain in comparison.
Hot(un)plugging a drive brings you a lot of issues with failing mdadm
commands as they don't like concurrent execution when the same physical
device is affected. And rebuild of RAID partitions is done sequentially
with no deterministic order. We could talk for hours about that, but if you are
interested it is maybe better done in private, as it is not BTRFS related.
  Anyway, to avoid a futile discussion, mdraid and its format is not
  the problem, it is just an example of the problem. Using dm-raid
  would do the same trouble, LVM apparently, too. I could think of a
  bunch of other cases including the use of hardware based RAID
  controllers. OK, it's not the majority's problem, but that's not
  the argument to keep a bug/flaw capable of crashing your system.

 dmraid solves the problem by removing the partitions from the
 underlying physical device ( /dev/sda ), and only exposing them on the
 array ( /dev/mapper/whatever ).  LVM only has the problem when you
 take a snapshot.  User space tools face the same issue and they
 resolve it by ignoring or deprioritizing the snapshot.
I don't agree. dmraid and mdraid both remove the partitions. This is not
a solution; BTRFS will still crash the PC using /dev/mapper/whatever or
whatever device appears in the system providing the BTRFS volume.
  As it is a nice feature that the kernel apparently scans for drives
  and automatically identifies BTRFS ones, it seems to me that this
  feature is useless. When in a live system a BTRFS RAID disk fails,
  it is not sufficient to hot-replace it, the kernel will not
  automatically rebalance. Commands are still needed for the task as
  are with mdraid. So the only point I can see at the moment where
  this auto-detect feature makes sense is when mounting the device
  for the first time. If I remember the documentation correctly, you
  mount one of the RAID devices and the others are automagically
  attached as well. But outside of the mount process, what is this
  auto-detect used for?

  So here a couple of rather simple solutions which, as far as I can
  see, could solve the problem:

  1. Limit the auto-detect to the mount process and don't do it when
  devices are appearing.

  2. When a BTRFS device is detected and its metadata is identical to
  one already mounted, just ignore it.

 That doesn't really solve the problem since you can still pick the
 wrong one to mount in the first place.
Oh, it does solve the problem; you are speaking of another problem,
which is always there when having several disks in a system. Mounting
the wrong device can happen in the case I'm describing if you use UUID,
label or some other metadata-related information to mount it. You won't
try to do that when you insert a disk which you know has the same metadata. It
will not happen (unless user tools outsmart you ;-)) when using the
device name(s). I think a user mounting things manually can be expected
to know or learn which device node is which drive. On the other
hand in my case one of the drives is already mounted so getting

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-08 Thread Konstantin

Robert White schrieb am 08.12.2014 um 18:20:
 On 12/07/2014 04:32 PM, Konstantin wrote:
 I know this and I'm using 0.9 on purpose. I need to boot from these
 disks so I can't use 1.2 format as the BIOS wouldn't recognize the
 partitions. Having an additional non-RAID disk for booting introduces a
 single point of failure which contrary to the idea of RAID0.

 GRUB2 has raid 1.1 and 1.2 metadata support via the mdraid1x module.
 LVM is also supported. I don't know if a stack of both is supported.

 There is, BTW, no such thing as a (commodity) computer without a
 single point of failure in it somewhere. I've watched government
 contracts chase this demon for decades. Be it disk, controller,
 network card, bus chip, cpu or stick-of-ram you've got a single point
 of failure somewhere. Actually you likely have several such points of
 potential failure.

 For instance, are you _sure_ your BIOS is going to check the second
 drive if it gets read failure after starting in on your first drive?
 Chances are it won't because that four-hundred bytes-or-so boot loader
 on that first disk has no way to branch back into the bios.

 You can waste a lot of your life chasing that ghost and you'll still
 discover you've missed it and have to whip out your backup boot media.

 It may well be worth having a second copy of /boot around, but make
 sure you stay out of bandersnatch territory when designing your
 system. The more you over-think the plumbing, the easier it is to
 stop up the pipes.
You are right, there is almost always a single point of failure
somewhere, even if it is the power plant providing your electricity ;-).
I should have written "introduces an additional single point of failure"
to be 100% correct, but I thought this was obvious. As I have replaced
dozens of damaged hard disks but only a few CPUs, RAM modules etc., it is more
important for me to reduce the most frequent and easiest-to-solve points of
failure. For more important systems there are high-availability
solutions which alleviate many of the problems you mention, but that's
not the point here when speaking about the major bug in BTRFS which can
make your system crash.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-07 Thread Konstantin

Anand Jain wrote on 02.12.2014 at 12:54:



 On 02/12/2014 19:14, Goffredo Baroncelli wrote:
 I further investigate this issue.

 MegaBrutal, reported the following issue: doing a lvm snapshot of the
 device of a
 mounted btrfs fs, the new snapshot device name replaces the name of
 the original
 device in the output of /proc/mounts. This confused tools like
 grub-probe which
 report a wrong root device.

 very good test case indeed thanks.

 Actual IO would still go to the original device, until FS is remounted.
This seems to be correct, at least at the beginning, but I wouldn't be so
sure - why else would the system crash in my case after a while when the
second drive is present?! If the kernel were not using it in some way,
then apart from the wrong /proc/mounts nothing else should happen.


 It has to be pointed out that instead the link under
 /sys/fs/btrfs/fsid/devices is
 correct.

 In this context the above sysfs path will be out of sync with the
 reality, its just stale sysfs entry.


 What happens is that *even if the filesystem is mounted*, doing a
 btrfs dev scan of a snapshot (of the real volume), the device name
 of the
 filesystem is replaced with the snapshot one.

 We have some fundamentally wrong stuff here. My original patch tried
 to fix it, but we later discovered that some external entities like
 systemd and the boot process were using that bug as a feature, and we
 had to revert the patch.

 Fundamentally, the SCSI inquiry serial number is the only number which is
 unique to the device (including virtual devices, though there could be
 some legacy virtual devices which don't follow that strictly; anyway,
 those I deem to be device-side issues). Btrfs depends on the combination
 of fsid, uuid and devid (and generation number) to identify the unique
 device volume, which is weak and easy to get wrong.
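
For anyone who wants to see these identifiers on their own setup, a
reasonably recent btrfs-progs can read them back directly. A minimal
sketch (the device path is only a placeholder):

btrfs filesystem show
# dump the superblock of one member and pick out the identity fields
btrfs inspect-internal dump-super /dev/sdb1 | grep -E 'fsid|devid|generation'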


 Anand, with b96de000b, tried to fix it; however, further regressions
 appeared and Chris reverted this commit (see below).

 BR
 G.Baroncelli

 commit b96de000bc8bc9688b3a2abea4332bd57648a49f
 Author: Anand Jain anand.j...@oracle.com
 Date:   Thu Jul 3 18:22:05 2014 +0800

  Btrfs: device_list_add() should not update list when mounted
 [...]


 commit 0f23ae74f589304bf33233f85737f4fd368549eb
 Author: Chris Mason c...@fb.com
 Date:   Thu Sep 18 07:49:05 2014 -0700

  Revert Btrfs: device_list_add() should not update list when
 mounted

  This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f.

  This commit is triggering failures to mount by subvolume id in some
  configurations.  The main problem is how many different ways this
  scanning function is used, both for scanning while mounted and
  unmounted.  A proper cleanup is too big for late rcs.

 [...]

 On 12/02/2014 09:28 AM, MegaBrutal wrote:
 2014-12-02 8:50 GMT+01:00 Goffredo Baroncelli kreij...@inwind.it:
 On 12/02/2014 01:15 AM, MegaBrutal wrote:
 2014-12-02 0:24 GMT+01:00 Robert White rwh...@pobox.com:
 On 12/01/2014 02:10 PM, MegaBrutal wrote:

 Since having duplicate UUIDs on devices is not a problem for me -
 I can tell them apart by LVM names - the discussion is of little
 relevance to my use case. Of course it's interesting and I like to
 read along, but it is not about the actual problem at hand.


 Which is why you use the device= mount option, which would take
 LVM names
 and which was repeatedly discussed as solving this very problem.

 Once you decide to duplicate the UUIDs with LVM snapshots you
 take up the
 burden of disambiguating your storage.

 Which is part of why re-reading was suggested as this was covered
 in some
 depth and _is_ _exactly_ about the problem at hand.

 Nope.

 root@reproduce-1391429:~# cat /proc/cmdline
 BOOT_IMAGE=/vmlinuz-3.18.0-031800rc5-generic
 root=/dev/mapper/vg-rootlv ro
 rootflags=device=/dev/mapper/vg-rootlv,subvol=@

 Observe, device= mount option is added.

 The device= option is needed only in a btrfs multi-volume scenario.
 If you have only one disk, it is not needed.


 I know. I only did this as a demonstration for Robert. He insisted it
 would certainly solve the problem. Well, it doesn't.



 root@reproduce-1391429:~# ./reproduce-1391429.sh
 #!/bin/sh -v
 lvs
LV VG   Attr  LSize   Pool Origin Data%  Move Log
 Copy%  Convert
rootlv vg   -wi-ao---   1.00g
swap0  vg   -wi-ao--- 256.00m

 grub-probe --target=device /
 /dev/mapper/vg-rootlv

 grep ' / ' /proc/mounts
 rootfs / rootfs rw 0 0
 /dev/dm-1 / btrfs rw,relatime,space_cache 0 0

 lvcreate --snapshot --size=128M --name z vg/rootlv
Logical volume z created

 lvs
LV VG   Attr  LSize   Pool Origin Data%  Move Log
 Copy%  Convert
rootlv vg   owi-aos--   1.00g
swap0  vg   -wi-ao--- 256.00m
z  vg   swi-a-s-- 128.00m  rootlv   0.11

 ls -l /dev/vg/
 total 0
 lrwxrwxrwx 1 root root 7 Dec  2 00:12 rootlv - ../dm-1
 lrwxrwxrwx 1 root root 7 Dec  2 00:12 swap0 - ../dm-0
 lrwxrwxrwx 1 root root 7 Dec  2 00:12 z - ../dm-2

 grub-probe --target=device /
 /dev/mapper/vg-z

 grep  /  

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-07 Thread Konstantin
Phillip Susi wrote on 02.12.2014 at 20:19:
 On 12/1/2014 4:45 PM, Konstantin wrote:
  The bug appears also when using mdadm RAID1 - when one of the
  drives is detached from the array then the OS discovers it and
  after a while (not directly, it takes several minutes) it appears
  under /proc/mounts: instead of /dev/md0p1 I see there /dev/sdb1.
  And usually after some hour or so (depending on system workload)
  the PC completely freezes. So discussion about the uniqueness of
  UUIDs or not, a crashing kernel is telling me that there is a
  serious bug.

 I'm guessing you are using metadata format 0.9 or 1.0, which put the
 metadata at the end of the drive and the filesystem still starts in
 sector zero.  1.2 is now the default and would not have this problem
 as its metadata is at the start of the disk ( well, 4k from the start
 ) and the fs starts further down.
I know this and I'm using 0.9 on purpose. I need to boot from these
disks, so I can't use the 1.2 format as the BIOS wouldn't recognize the
partitions. Having an additional non-RAID disk for booting introduces a
single point of failure, which is contrary to the idea of RAID.
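
(As an aside, the metadata version in use is easy to confirm; a quick
sketch with placeholder device names, not taken from this setup:)

mdadm --detail /dev/md0 | grep -i version      # metadata version of the array
mdadm --examine /dev/sdb1 | grep -i version    # as recorded on a member device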

Anyway, to avoid a futile discussion: mdraid and its format are not the
problem, they are just an example of the problem. Using dm-raid would
cause the same trouble, and LVM apparently does, too. I can think of a
bunch of other cases, including the use of hardware-based RAID
controllers. OK, it's not the majority's problem, but that's not an
argument for keeping a bug/flaw capable of crashing your system.

As nice as it is that the kernel apparently scans for drives and
automatically identifies BTRFS ones, it seems to me that this feature is
of little use. When a BTRFS RAID disk fails in a live system, it is not
sufficient to hot-replace it; the kernel will not automatically
rebalance. Commands are still needed for the task, as they are with
mdraid. So the only point I can see at the moment where this auto-detect
feature makes sense is when mounting the device for the first time. If I
remember the documentation correctly, you mount one of the RAID devices
and the others are automagically attached as well. But outside of the
mount process, what is this auto-detect used for?
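
(For reference, that mount-time assembly looks roughly like the following
sketch; the device names and mount point are placeholders, not from this
setup. Either the members are registered by a scan, or they are listed
explicitly.)

btrfs device scan                    # register all detected btrfs member devices
mount /dev/sdc1 /mnt/data            # naming any one member is then enough
# or, without relying on the scan at all:
mount -o device=/dev/sdc1,device=/dev/sdd1 /dev/sdc1 /mnt/data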

So here are a couple of rather simple solutions which, as far as I can
see, could solve the problem:

1. Limit the auto-detect to the mount process and don't do it when
devices are appearing.

2. When a BTRFS device is detected and its metadata is identical to one
already mounted, just ignore it.




raid5 filesystem only mountable ro and not currently fixable after a drive produced read errors

2014-12-02 Thread Konstantin Matuschek
Hello,

I have a raid5 btrfs that refuses to mount rw (ro works) and I think I'm out of 
options to get it fixed.

First, this is roughly what got my filesystem corrupted:


1. I created the raid5 fs in March 2014 using the latest code available (Btrfs 
3.12) on four 4TB devices (each encrypted using dm-crypt). I also created 3 
subvolumes. The command used was:
mkfs.btrfs -O skinny-metadata -d raid5 -m raid5 /dev/mapper/wdred4tb[2345]


2. Around October I noticed that one of the drives (wdred4tb3) produced read errors. 
Running a long smartctl self-test would fail as well, and the reported 
Raw_Read_Error_Rate increased steadily.


3. Since I had a spare drive around, but replacing a device wasn't implemented 
back then for raid5, I decided to use the add-then-delete approach outlined 
here: http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Raid5-Status . I 
did *not* remove the failing drive for that.
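
(In command terms, the add-then-delete approach boils down to roughly the
sketch below, assuming the filesystem is mounted at /mnt/box as in the
output further down; wdred4tb1 is the spare that later shows up as devid 5.)

btrfs device add /dev/mapper/wdred4tb1 /mnt/box      # add the spare first
btrfs device delete /dev/mapper/wdred4tb3 /mnt/box   # the delete relocates its chunks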


4. The rebalance triggered by the btrfs device delete /dev/mapper/wdred4tb3 
command crashed a few times (and read errors kept increasing), but each time I 
started it, a few hundred GiB were moved over to the newly added device. But 
when 414GiB were left on the failing drive, it didn't get further. It now still 
looks like this:
# btrfs fi show /mnt/box
Label: none  uuid: 9f3a48b7-1b88-44f0-a387-f3712fc2c0b6
Total devices 5 FS bytes used 4.43TiB
devid1 size 3.64TiB used 1.50TiB path /dev/mapper/wdred4tb2
devid2 size 3.64TiB used 414.00GiB path /dev/mapper/wdred4tb3
devid3 size 3.64TiB used 1.50TiB path /dev/mapper/wdred4tb4
devid4 size 3.64TiB used 1.50TiB path /dev/mapper/wdred4tb5
devid5 size 3.64TiB used 1.10TiB path /dev/mapper/wdred4tb1
Btrfs v3.17.2-50-gcc0723c


5. I tried several things (probably a new kernel around 3.17, which was probably 
affected by the snapshot bug, but I don't use snapshots, only subvolumes) and 
ended up doing a btrfsck --repair (v3.17-rc3) on the filesystem. I still have 
the complete output of that, let me know if you need it. Here are some lines 
that seem interesting to me:
# btrfsck --repair /dev/mapper/wdred4tb2
enabling repair mode
Checking filesystem on /dev/mapper/wdred4tb2
UUID: 9f3a48b7-1b88-44f0-a387-f3712fc2c0b6
checking extents
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
read block failed check_tree_block
[...]
owner ref check failed [500170752 16384]
repair deleting extent record: key 500170752 169 0
adding new tree backref on start 500170752 len 16384 parent 7 root 7
[...]
repaired damaged extent references
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
read block failed check_tree_block
[...]
Check tree block failed, want=668598272, have=668794880
Csum didn't match
[...]
checking csums
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
read block failed check_tree_block
Error going to next leaf -5
checking root refs
found 1469190132145 bytes used err is 0
total csum bytes: 4750630700
total tree bytes: 6141100032
total fs tree bytes: 345964544
total extent tree bytes: 194052096
btree space waste bytes: 867842012
file data blocks allocated: 4865657503744
 referenced 4895640494080
Btrfs v3.17-rc3
extent buffer leak: start 842235904 len 16384
extent buffer leak: start 842235904 len 16384
[...]


6. As far as I can remember, that was the point when mounting rw stopped 
working. Mounting ro seems to work quite fine though (no idea if data was 
lost/corrupted).
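
(Not something tried here, but a commonly suggested step before further
repair attempts is to copy everything out while the filesystem still
mounts read-only; a rough sketch with placeholder destination paths:)

mount -o ro /dev/mapper/wdred4tb2 /mnt/box     # add ,degraded if a member is missing
rsync -aHAX /mnt/box/ /mnt/spare-disk/box-copy/
# btrfs restore can also pull files out without mounting, e.g.:
# btrfs restore -v /dev/mapper/wdred4tb2 /mnt/spare-disk/box-restore/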



I removed the failing drive today and updated to the latest integration 
branch of cmason's git repository (including Miao Xie's patches for raid56 
replacement) and David's integration-20141125 branch for btrfs-progs. With 
those, I tried a mount with -o ro,degraded,recovery (works, but didn't 
recover). I also tried a btrfsck again, but it just prints some errors and then 
exits.
Mounting rw with -o degraded gives the following output in dmesg:

[ 7358.907119] BTRFS: open /dev/dm-4 failed
[ 7358.907860] BTRFS info (device dm-6): allowing degraded mounts
[ 7358.907866] BTRFS info (device dm-6): enabling auto recovery
[ 7358.907870] BTRFS info (device dm-6): disk space caching is enabled
[ 7358.907872] BTRFS: has skinny extents
[ 7360.549993] BTRFS: bdev /dev/dm-4 errs: wr 0, rd 22288, flush 0, corrupt 0, 
gen 0
[ 7377.923939] BTRFS info (device dm-6): The free space cache file 
(7065489637376) is invalid. skip it

[ 7383.443486] BTRFS (device dm-6): parent transid verify failed on 

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-01 Thread Konstantin

MegaBrutal wrote on 01.12.2014 at 13:56:
 Hi all,

 I've reported the bug I've previously posted about in "BTRFS messes up
 snapshot LV with origin" in the Kernel Bug Tracker.
 https://bugzilla.kernel.org/show_bug.cgi?id=89121
Hi MegaBrutal. If I understand your report correctly, I can give you
another example where this bug is appearing. It is so bad that it leads
to freezing the system and I'm quite sure it's the same thing. I was
thinking about filing a bug but didn't have the time for that yet. Maybe
you could add this case to your bug report as well.

The bug also appears when using mdadm RAID1 - when one of the drives is
detached from the array, the OS discovers it and after a while (not
immediately, it takes several minutes) it appears under /proc/mounts:
instead of /dev/md0p1 I see /dev/sdb1 there. And usually after an hour
or so (depending on system workload) the PC completely freezes. So
whether or not UUIDs are unique, a crashing kernel is telling me that
there is a serious bug.

While in my case detaching was intentional, there are several realistic
scenarios in which a RAID1 disk can get detached, and currently this
leads to crashing the server when using BTRFS. That's not what is
intended when using RAID ;-).

In my case I wanted to do something which had worked perfectly all the
years before with all other file systems - checking the file system of
the root disk while the server is running. The procedure is simple (a
command sketch follows the list):

1. detach one of the disks
2. do fsck on the disk device
3. mdadm --zero-superblock on the device so it gets completely rewritten
4. mdadm --add it to the array
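
A minimal command-level sketch of the steps above, assuming an array
/dev/md0 with member /dev/sdb1 (both names are only placeholders for
illustration):

mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # 1. detach one mirror
btrfsck /dev/sdb1                                    # 2. check the detached copy
mdadm --zero-superblock /dev/sdb1                    # 3. wipe the md superblock
mdadm /dev/md0 --add /dev/sdb1                       # 4. re-add; md resyncs it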

There were some surprises with BTRFS - if step 2 is not done directly
after step 1, btrfsck refuses to check the disk as it is reported as
mounted in /proc/mounts. And during step 2, or even after finishing it,
the system was freezing. If I got to step 4 fast enough everything was
OK, but again, that's not what I expect from a good operating system.
Any objections?

Konstantin



Re: Two persistent problems

2014-11-17 Thread Konstantin
Josef Bacik wrote on 14.11.2014 at 23:00:
 On 11/14/2014 04:51 PM, Hugo Mills wrote:
 Chris, Josef, anyone else who's interested,

 On IRC, I've been seeing reports of two persistent unsolved
 problems. Neither is showing up very often, but both have turned up
 often enough to indicate that there's something specific going on
 worthy of investigation.

 One of them is definitely a btrfs problem. The other may be btrfs,
 or something in the block layer, or just broken hardware; it's hard to
 tell from where I sit.

 Problem 1: ENOSPC on balance

 This has been going on since about March this year. I can recall
 with reasonable certainty 8-10 cases, possibly a number more. When
 running a balance, the operation fails with ENOSPC when there's plenty
 of space remaining unallocated. This happens on full balance, filtered
 balance, and device delete. Other than the ENOSPC on balance, the FS
 seems to work OK. It seems to be more prevalent on filesystems
 converted from ext*. The first few or more reports of this didn't make
 it to bugzilla, but a few of them since then have gone in.

 Problem 2: Unexplained zeroes

 Failure to mount. Transid failure, expected xyz, have 0. Chris
 looked at an early one of these (for Ke, on IRC) back in September
 (the 27th -- sadly, the public IRC logs aren't there for it, but I can
 supply a copy of the private log). He rapidly came to the conclusion
 that it was something bad going on with TRIM, replacing some blocks
 with zeroes. Since then, I've seen a bunch of these coming past on
 IRC. It seems to be a 3.17 thing. I can successfully predict the
 presence of an SSD and -odiscard from the have 0. I've successfully
 persuaded several people to put this into bugzilla and capture
 btrfs-images.  btrfs recover doesn't generally seem to be helpful in
 recovering data.


 I think Josef had problem 1 in his sights, but I don't know if
 additional images or reports are helpful at this point. For problem 2,
 there's obviously something bad going on, but there's not much else to
 go on -- and the inability to recover data isn't good.

 For each of these, what more information should I be trying to
 collect from any future reporters?
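
(On the what-to-collect question, the quick checks plus a metadata image
capture look roughly like this; device names are placeholders, and
btrfs-image should ideally be run with the filesystem unmounted:)

grep ' / ' /proc/mounts                      # is -o discard actually in use?
cat /sys/block/sda/queue/rotational          # 0 = SSD, 1 = rotating disk
btrfs-image -c9 -t4 /dev/sda2 /tmp/fs-meta.img   # metadata-only image for bugzilla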



 So for #2 I've been looking at that the last two weeks.  I'm always
 paranoid we're screwing up one of our data integrity sort of things,
 either not waiting on IO to complete properly or something like that.
 I've built a dm target to be as evil as possible and have been running
 it trying to make bad things happen.  I got slightly side tracked
 since my stress test exposed a bug in the tree log stuff and csums
 which I just fixed.  Now that I've fixed that I'm going back to trying
 to make the "expected blah, have 0" type errors happen.

 As for the ENOSPC I keep meaning to look into it and I keep getting
 distracted with other more horrible things.  Ideally I'd like to
 reproduce it myself, so more info on that front would be good, like do
 all reports use RAID/compression/some other odd set of features? 
 Thanks for taking care of this stuff, Hugo. #2 is the worst one and I'd
 like to be absolutely sure it's not our bug; once I'm happy it isn't,
 I'll look at the balance thing.

 Josef

For #2, I had a strangely damaged BTRFS that I reported a week or so ago
which may have a similar background. dmesg gives:

parent transid verify failed on 586239082496 wanted 13329746340512024838
found 588
BTRFS: open_ctree failed

The thing is that btrfsck crashes when trying to check this. As nobody
seemed to be interested, I reformatted the disk today.



btrfsck crash

2014-11-10 Thread Konstantin
Hello!

I got a strangely corrupted btrfs where btrfsck seems to crash. The first
try with v3.14 spat out a large amount of messages
(http://pastebin.com/J1jCzhzx), then a run with v3.17 gives an
"Assertion failed" error with other messages
(http://pastebin.com/TE6dSjgR). Is anyone interested in looking into
this, or should I reformat the disk?

Konstantin


Re: btrfs partition remounted read-only

2014-07-13 Thread Konstantin Svist
On 07/13/2014 10:13 AM, Chris Murphy wrote:
 On Jul 4, 2014, at 11:00 AM, Konstantin Svist fry@gmail.com wrote:

 I have an overnight cron job with

 /sbin/fstrim -v /
 /bin/bedup dedup --defrag
 Probably not related, but these look backwards, why not reverse them?


 Chris Murphy


Thanks, will do that. Anything else useful I could add to the cron job,
btw? I was thinking maybe a scrub operation to check for errors...
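
Hypothetically, the nightly job could end up looking something like this,
with the order reversed as suggested and a scrub added (binary paths
follow the original job and may differ elsewhere; scrub could also run
less often, e.g. weekly):

/bin/bedup dedup --defrag
/sbin/fstrim -v /
/sbin/btrfs scrub start -Bd /    # -B waits for completion, -d prints per-device stats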




btrfs partition remounted read-only

2014-07-04 Thread Konstantin Svist
I have an overnight cron job with

/sbin/fstrim -v /
/bin/bedup dedup --defrag

Every once in a while, it causes the FS to be remounted read-only.
The problem has been pretty intermittent so far (aside from a few kernel
revisions a while ago).

Please advise.


Corresponding bugs:
https://bugzilla.kernel.org/show_bug.cgi?id=71311
https://bugzilla.redhat.com/show_bug.cgi?id=1071408

Addtl info:

# uname -a
Linux mireille.svist.net 3.14.9-200.fc20.x86_64 #1 SMP Thu Jun 26
21:40:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
Btrfs v3.14.2
# btrfs fi show
Label: none  uuid: 5ac56e7d-3d04-4ffa-8160-5a47f46c2939
Total devices 1 FS bytes used 151.77GiB
devid1 size 465.76GiB used 282.02GiB path /dev/sda2

Btrfs v3.14.2
# btrfs fi df /
Data, single: total=277.01GiB, used=148.86GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=5.01GiB, used=2.91GiB



Re: severe hardlink bug

2012-08-08 Thread Konstantin Dmitriev
Jan Schmidt list.btrfs at jan-o-sch.net writes:
 Please give the patch set "btrfs: extended inode refs" by Mark Fasheh a try
 (http://lwn.net/Articles/498226/). It eliminates the hard-links-per-directory
 limit (introducing a rather random, artificial limit of 64k instead).

Hi, Jan!
I'm happy to see that something is being done about fixing that issue.
Unfortunately I cannot afford to run an unstable patched kernel on the
production server. I'll probably give btrfs another try in 2013. ^__^
K.






Re: severe hardlink bug

2012-07-29 Thread Konstantin Dmitriev
Dipl.-Ing. Michael Niederle mniederle at gmx.at writes:

 I reinstalled over 700 packages - plt-scheme being the only one failing due to
 the btrfs link restriction.
 

I have hit the same issue - I tried to run BackupPC with a pool on a btrfs
filesystem. After some time the "too many links (31)" error appeared.
Now I'm forced to migrate to some other filesystem...



Re: severe hardlink bug

2012-07-29 Thread Konstantin Dmitriev
C Anthony Risinger anthony at xtfx.me writes:

 btrfs only fails when you have hundreds of hardlinks to the same file
 in the *same* directory ... certainly not a standard use case.
 
 use snapshots to your advantage:
  - snap source
  - rsync --inplace source to target (with some other opts that have
 been discussed on list)
  - snap target
  - {rinse-and-repeat-in-24-hrs}
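
A rough sketch of the cycle quoted above, with placeholder paths and
assuming both source and target are btrfs subvolumes:

btrfs subvolume snapshot /data /data/.snap-today                  # snap source
rsync -a --inplace --delete /data/.snap-today/ /backup/current/   # rsync to target
btrfs subvolume snapshot /backup/current /backup/snap-today       # snap target
# ...then repeat daily from cron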

I understand that the case applies only to the *same* directory.
You can claim that it's not a standard use case, but first Michael hit it,
now me. There's at least one more case -
https://lists.samba.org/archive/rsync/2011-December/027117.html
The number of such cases will keep increasing, and the sooner it is fixed,
the less pain it will bring to users. I know fixing it is a big
structural change, but it will only get worse with time.
If it's not going to be fixed - I don't care. Right now I'm forced
to migrate to an old mdadm raid-1 or ZFS. The sad thing is that
I really LOVED btrfs. Only that. ^__^

K.



[PATCH] btrfs: fix warning in iput for bad-inode

2011-08-17 Thread Konstantin Khlebnikov
iput() shouldn't be called for inodes in the I_NEW state;
let's call __destroy_inode() and btrfs_destroy_inode() instead.

[1.871723] WARNING: at fs/inode.c:1309 iput+0x1d9/0x200()
[1.873722] Modules linked in:
[1.873722] Pid: 1, comm: swapper Tainted: GW   3.1.0-rc2-zurg #58
[1.875722] Call Trace:
[1.875722]  [8113cb99] ? iput+0x1d9/0x200
[1.876722]  [81044c3a] warn_slowpath_common+0x7a/0xb0
[1.877722]  [81044c85] warn_slowpath_null+0x15/0x20
[1.879722]  [8113cb99] iput+0x1d9/0x200
[1.879722]  [81295cf4] btrfs_iget+0x1c4/0x450
[1.881721]  [812b7e6b] ? btrfs_tree_read_unlock_blocking+0x3b/0x60
[1.882721]  [8111769a] ? kmem_cache_free+0x2a/0x160
[1.883721]  [812966f3] btrfs_lookup_dentry+0x413/0x490
[1.885721]  [8103b1e1] ? get_parent_ip+0x11/0x50
[1.886720]  [81296781] btrfs_lookup+0x11/0x30
[1.887720]  [8112de50] d_alloc_and_lookup+0x40/0x80
[1.888720]  [8113ac10] ? d_lookup+0x30/0x50
[1.889720]  [811301a8] do_lookup+0x288/0x370
[1.890720]  [8103b1e1] ? get_parent_ip+0x11/0x50
[1.891720]  [81132210] do_last+0xe0/0x910
[1.892720]  [81132b4d] path_openat+0xcd/0x3a0
[1.893719]  [813bab4b] ? wait_for_xmitr+0x3b/0xa0
[1.895719]  [8131c50a] ? put_dec_full+0x5a/0xb0
[1.896719]  [813babdb] ? serial8250_console_putchar+0x2b/0x40
[1.897719]  [81132e7d] do_filp_open+0x3d/0xa0
[1.898719]  [8103b1e1] ? get_parent_ip+0x11/0x50
[1.899718]  [8103b1e1] ? get_parent_ip+0x11/0x50
[1.900718]  [816e80fd] ? sub_preempt_count+0x9d/0xd0
[1.902718]  [8112a09d] open_exec+0x2d/0xf0
[1.903718]  [8112aaaf] do_execve_common.isra.32+0x12f/0x340
[1.906717]  [8112acd6] do_execve+0x16/0x20
[1.907717]  [8100af02] sys_execve+0x42/0x70
[1.908717]  [816ed968] kernel_execve+0x68/0xd0
[1.909717]  [816d828e] ? run_init_process+0x1e/0x20
[1.911717]  [816d831e] init_post+0x8e/0xc0
[1.912716]  [81cb8c79] kernel_init+0x13d/0x13d
[1.913716]  [816ed8f4] kernel_thread_helper+0x4/0x10
[1.914716]  [81cb8b3c] ? start_kernel+0x33f/0x33f
[1.915716]  [816ed8f0] ? gs_change+0xb/0xb

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org
---
 fs/btrfs/inode.c |   10 +++---
 1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 15fceef..3e949bd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3952,7 +3952,6 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *new)
 {
 	struct inode *inode;
-	int bad_inode = 0;
 
 	inode = btrfs_iget_locked(s, location->objectid, root);
 	if (!inode)
@@ -3968,15 +3967,12 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			if (new)
 				*new = 1;
 		} else {
-			bad_inode = 1;
+			__destroy_inode(inode);
+			btrfs_destroy_inode(inode);
+			inode = ERR_PTR(-ESTALE);
 		}
 	}
 
-	if (bad_inode) {
-		iput(inode);
-		inode = ERR_PTR(-ESTALE);
-	}
-
 	return inode;
 }
 
