Hello again, list. I thought I would clear things up and describe what is
happening with my troubled RAID setup.

Having received help from the list, I first ran a full defragmentation of
all the data and recompressed everything with zlib. That didn't help. Then
I ran a full rebalance of the data, and that didn't help either.

So I had to take a disk out of the RAID, copy all the data onto it, recreate
the RAID drive with a 32 KB strip size and a 96 KB stripe, and copy the data
back. Then I added the disk back and resynced the RAID.
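
For reference, the 96 KB figure follows from the geometry: on a 4-drive RAID5 each stripe holds three data strips plus one parity strip. A minimal sketch of that arithmetic in Python (the function name is only for illustration, not anything from the controller tools):

```python
def raid5_data_stripe_kib(strip_kib: int, n_drives: int) -> int:
    """Data capacity of one full stripe on a single-parity (RAID5) set."""
    if n_drives < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    # One strip in every stripe holds parity; the remaining strips hold data.
    return strip_kib * (n_drives - 1)


# Strip sizes mentioned in this thread, on a 4-drive (3+1) set.
for strip in (32, 64, 256):
    stripe = raid5_data_stripe_kib(strip, 4)
    print(f"{strip} KiB strip x 3 data drives = {stripe} KiB data stripe")
```

The 256 KiB case corresponds to the earlier layout with the 768 KiB stripe that was discussed in the quoted reply.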


The RAID device currently looks like this:

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 21.830 TB
Sector Size         : 512
Is VD emulated      : Yes
Parity Size         : 7.276 TB
State               : Optimal
Strip Size          : 32 KB
Number Of Drives    : 4
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No


It is about 40% full with compressed data:
# btrfs fi usage /mnt/arh-backup1/
Overall:
    Device size:                  21.83TiB
    Device allocated:              8.98TiB
    Device unallocated:           12.85TiB
    Device missing:                  0.00B
    Used:                          8.98TiB
    Free (estimated):             12.85TiB      (min: 6.43TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)


I decided to run a set of tests in which a 5 GB file was created with dd
using different block sizes and different flags. One file was generated from
urandom data and another was filled with zeroes. The data was written both
with and without compression. It seems that without compression it is
possible to gain 30-40% in speed, while the CPU was running at 50% idle
during the highest loads.

dd write speeds (MB/s):

flags: conv=fsync
         compress-force=zlib    compress-force=none
         RAND   ZERO            RAND   ZERO
bs1024k  387    407             584    577
bs512k   389    414             532    547
bs256k   412    409             558    585
bs128k   412    403             572    583
bs64k    409    419             563    574
bs32k    407    404             569    572
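
To put a number on the difference, the gain from turning compression off can be computed per row. A small sketch using the RAND columns of the conv=fsync table (the dictionary and function names are only for illustration):

```python
# (compress-force=zlib, compress-force=none) write speeds for RAND data,
# copied from the conv=fsync table.
FSYNC_RAND = {
    "bs1024k": (387, 584),
    "bs512k": (389, 532),
    "bs256k": (412, 558),
    "bs128k": (412, 572),
    "bs64k": (409, 563),
    "bs32k": (407, 569),
}


def gain_pct(with_zlib: float, without: float) -> float:
    """Percentage speed gained by writing without compression."""
    return (without - with_zlib) / with_zlib * 100.0


for bs, (zlib_rate, none_rate) in FSYNC_RAND.items():
    print(f"{bs}: {gain_pct(zlib_rate, none_rate):.0f}% faster without compression")
```

The same function can be pointed at the ZERO columns or at the other tables to compare them.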


flags: oflag=sync
         compress-force=zlib    compress-force=none
         RAND   ZERO            RAND   ZERO
bs1024k  86.1   97.0            203    210
bs512k   50.6   64.4            85.0   170
bs256k   25.0   29.8            67.6   67.5
bs128k   13.2   16.4            48.4   49.8
bs64k    7.4    8.3             24.5   27.9
bs32k    3.8    4.1             14.0   13.7
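
One way to read the oflag=sync numbers: with compression, throughput roughly halves each time bs halves, which is what a roughly fixed cost per synchronous write would look like. A rough sketch that converts the compress-force=zlib RAND column into milliseconds per write. It assumes that the reported MB/s means 10^6 bytes per second and that bs32k means 32 KiB; both are assumptions about how the numbers were reported, and the names are only for illustration:

```python
# Write speeds from the oflag=sync table, compress-force=zlib, RAND column.
# Keys are the dd block size in KiB.
SYNC_ZLIB_RAND_MBPS = {1024: 86.1, 512: 50.6, 256: 25.0,
                       128: 13.2, 64: 7.4, 32: 3.8}


def ms_per_write(bs_kib: int, mb_per_s: float) -> float:
    """Average time taken by one synchronous write of bs_kib KiB."""
    bytes_per_write = bs_kib * 1024
    bytes_per_s = mb_per_s * 1_000_000  # assumption: decimal megabytes
    return bytes_per_write / bytes_per_s * 1000.0


for bs, rate in SYNC_ZLIB_RAND_MBPS.items():
    print(f"bs={bs}k: {ms_per_write(bs, rate):.1f} ms per write")
```

With these inputs the result stays within roughly 9 to 12 ms per write across all block sizes.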




flags: none
         compress-force=zlib    compress-force=none
         RAND   ZERO            RAND   ZERO
bs1024k  480    419             681    595
bs512k   422    412             633    585
bs256k   413    384             707    712
bs128k   414    387             695    704
bs64k    482    467             622    587
bs32k    416    412             610    598


I also ran a test in which I filled the array to about 97% capacity, and the
write speed went down by about 50% compared with the empty RAID.


Thanks for the help.

----- Original Message -----
From: "Peter Grandi" <p...@btrfs.list.sabi.co.uk>
To: "Linux fs Btrfs" <linux-btrfs@vger.kernel.org>
Sent: Tuesday, 1 August, 2017 10:09:03 PM
Subject: Re: Btrfs + compression = slow performance and high cpu usage

>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,

> KOS: No I don't have persistent cache. Only the 512 Mb cache
> on board of a controller, that is BBU.

If it is a persistent cache, that can be battery-backed (as I
wrote, but it seems that you don't have too much time to read
replies) then the size of the write, 128KiB or not, should not
matter much; the write will be reported complete when it hits
the persistent cache (whichever technology it uses), and then
the HA firmware will spill write-cached data to the disks using
the optimal operation width.

Unless the 3ware firmware is really terrible (and depending on
model and vintage it can be amazingly terrible) or the battery
is no longer recharging and then the host adapter switches to
write-through.

That you see very different rates between uncompressed and
compressed writes, where the main difference is the limitation
on the segment size, seems to indicate that compressed writes
involve a lot of RMW, that is sub-stripe updates. As I mentioned
already, it would be interesting to retry 'dd' with different
'bs' values without compression and with 'sync' (or 'direct'
which only makes sense without compression).

> If I had additional SSD caching on the controller I would have
> mentioned it.

So far you had not mentioned the presence of BBU cache either,
which is equivalent, even if in one of your previous messages
(which I try to read carefully) there were these lines:

>>>> Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
>>>> Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

So perhaps someone else would have checked long ago the status
of the BBU and whether the "No Write Cache if Bad BBU" case has
happened. If the BBU is still working and the policy is still
"WriteBack" then things are stranger still.

> I was also under the impression that in a situation where mostly
> extra large files will be stored on the array, the bigger
> strip size would indeed increase the speed, thus I went
> with the 256 KB strip size.

That runs counter to this simple story: suppose a program is
doing 64KiB IO:

* For *reads*, there are 4 data drives and the strip size is
  16KiB: the 64KiB will be read in parallel on 4 drives. If the
  strip size is 256KiB then the 64KiB will be read sequentially
  from just one disk, and 4 successive reads will be read
  sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much
  more extreme: the 64KiB will be written with 16KiB strips on a
  5-wide RAID5 set in parallel to 5 drives, with 4 stripes being
  updated with RMW. But with 256KiB strips it will partially
  update 5 drives, because the stripe is 1024+256KiB, and it
  needs to do RMW, and four successive 64KiB writes will need to
  do that too, even if only one drive is updated. Usually for
  RAID5 there is an optimization that means that only the
  specific target drive and the parity drive(s) need RMW, but
  it is still very expensive.
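
The 64KiB example can be sketched in a few lines of Python. This is a simplified model: it assumes the IO starts on a stripe boundary, it ignores parity rotation and any controller caching, and the function names are made up for illustration.

```python
def strips_touched(offset_kib: int, length_kib: int, strip_kib: int) -> int:
    """Number of distinct data strips covered by one IO."""
    first = offset_kib // strip_kib
    last = (offset_kib + length_kib - 1) // strip_kib
    return last - first + 1


def is_full_stripe_write(offset_kib: int, length_kib: int,
                         strip_kib: int, n_data: int) -> bool:
    """True if the write covers only whole data stripes.

    Only then can parity be computed from the new data alone; anything
    smaller needs old data or old parity to be read back first.
    """
    stripe_kib = strip_kib * n_data
    return (length_kib > 0
            and offset_kib % stripe_kib == 0
            and length_kib % stripe_kib == 0)


# A 64KiB IO at offset 0 on a set with 4 data drives.
for strip in (16, 256):
    n = strips_touched(0, 64, strip)
    full = is_full_stripe_write(0, 64, strip, n_data=4)
    print(f"strip={strip}KiB: {n} strip(s) touched, full-stripe write: {full}")
```

With 16KiB strips the 64KiB IO covers four strips and exactly one full data stripe; with 256KiB strips it lands on a single strip and is far short of the 1024KiB data stripe.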

This is the "storage for beginners" version; what happens in
practice, however, depends a lot on the specific workload profile
(typical read/write size and latencies and rates), caching and
queueing algorithms in both Linux and the HA firmware.

> Would I be correct in assuming that the RAID strip size of 128
> Kb will be a better choice if one plans to use the BTRFS with
> compression?

That would need to be tested, because of "depends a lot on
specific workload profile, caching and queueing algorithms", but
my expectation is that the lower the better. Given that you have
4 drives giving a 3+1 RAID set, perhaps a 32KiB or 64KiB strip
size, giving a data stripe size of 96KiB or 192KiB, would be
better.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html