On 2018-05-29 10:02, ein wrote:
On 05/29/2018 02:12 PM, Austin S. Hemmelgarn wrote:
On 2018-05-28 13:10, ein wrote:
On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
On 2018-05-23 06:09, ein wrote:
On 05/23/2018 11:09 AM, Duncan wrote:
ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:

IMHO the best course of action would be to disable checksumming for your
VM files.

Do you mean the '-o nodatasum' mount flag? Is it possible to disable
checksumming for a single file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.

You can use nocow (chattr +C), but of course that has other restrictions
(like setting it on files while they're still zero-length, which for
existing data is easiest done by setting it on the containing dir and
copying the files in without reflinks) as well as the usual nocow effects.
Also, nocow becomes cow1 after a snapshot: the snapshot locks the existing
copy in place, so a changed block /must/ be written elsewhere, hence cow1,
i.e. cow the first time a block is written after the snapshot, while
retaining nocow for repeated writes between snapshots.

But if you're disabling checksumming anyway, nocow's likely the way
to go.
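
For example, a minimal sketch of the containing-dir approach (paths here
are just illustrative, based on the libvirt default):

# mkdir /var/lib/libvirt/images-nocow
# chattr +C /var/lib/libvirt/images-nocow      # new files here inherit NOCOW
# cp --reflink=never /var/lib/libvirt/images/db.raw \
     /var/lib/libvirt/images-nocow/
# lsattr /var/lib/libvirt/images-nocow/db.raw  # should show the 'C' flag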

Disabling only checksumming may be a way to go - we live without it every
day on other filesystems. But nocow on the VM files defeats the whole
purpose of using BTRFS for me, even with the huge performance penalty,
because of backups - I mean a few snapshots (20-30), plus send & receive.

Setting NOCOW on a file doesn't prevent it from being snapshotted, it
just prevents COW operations from happening under most normal
circumstances.  In essence, it prevents COW from happening except for
writing right after the snapshot.  More specifically, the first write to
a given block in a file set for NOCOW after taking a snapshot will
trigger a _single_ COW operation for _only_ that block (unless you have
autodefrag enabled too), after which that block will revert to not doing
COW operations on write.  This way, you still get consistent and working
snapshots, but you also don't take the performance hits from COW except
right after taking a snapshot.

Yeah, just after I posted it, I found a post by Duncan from 2015
explaining it. Thank you anyway.

Is there anything we can do better for a random-write VM workload to
speed BTRFS up, and why?

My settings:

<disk type='file' device='disk'>
        <driver name='qemu' type='raw' cache='none' io='native'/>
        <source file='/var/lib/libvirt/images/db.raw'/>
        <target dev='vda' bus='virtio'/>
        [...]
</disk>

/dev/mapper/raid10-images on /var/lib/libvirt type btrfs
(rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
        468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
        bitmap: 0/4 pages [0KB], 65536KB chunk

CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
kernel 4.15.0-0.bpo.2-amd64

As far as I understand, compress and autodefrag impact performance
(latency) negatively, especially autodefrag. I also think that nodatacow
should speed things up, and that it's a must when using qemu on BTRFS. Is
it better to use virtio or virtio-scsi with TRIM support?

FWIW, I've been doing just fine without nodatacow, but I also use raw
images contained in sparse files, and keep autodefrag off for the
dedicated filesystem I put the images on.

So do I, RAW images created by qemu-img, but I am not sure if
preallocation works as expected. The size of the disks in the filesystem
looks fine though.
Unless I'm mistaken, qemu-img will fully pre-allocate the images.

You can easily check though with `ls -ls`, which will show the amount of space taken up by the file on-disk (before compression or deduplication) on the left. If that first column on the left doesn't match up with the apparent file size, then the file is sparse and not fully pre-allocated.
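
For example (a hypothetical fully-allocated 40 GiB image; the first
column is in 1 KiB blocks by default):

# ls -ls /var/lib/libvirt/images/db.raw
41943040 -rw-r--r-- 1 root root 42949672960 May 29 10:02 db.raw

Here the 41943040 KiB of allocated space matches the 42949672960-byte
apparent size, so the file is fully allocated; a much smaller first
column would mean the file is sparse.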

From a practical perspective, if you really want maximal performance, it's worth pre-allocating space, as that both avoids the non-determinism of allocating blocks on first-write, and avoids some degree of fragmentation.

If you would rather save the space and not pre-allocate, you can use
`truncate` with the `--size` argument to quickly create an appropriately
sized sparse virtual disk image file.
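
For example (sizes and paths are illustrative):

# truncate --size 40G /var/lib/libvirt/images/db.raw
# qemu-img create -f raw -o preallocation=falloc db.raw 40G

The first creates a sparse file with no blocks allocated; the second is
the fully-allocated variant (qemu-img's raw format also accepts
preallocation=off and preallocation=full).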

May I ask in what workloads? From my testing while having VM's on BTRFS
storage:
- file/web servers work perfectly on BTRFS.
- Windows (2012/2016) file servers with AD are perfect too, besides the
time required for Windows Update, but that service is... let's say not
fine enough.
- database (firebird): the impact is huge. The guest filesystem is ext4,
and the database performs slower under these conditions (4 SSDs in
RAID10) than it did on RAID1 with two 10krpm SAS drives. I am still
thinking about how to benchmark it properly. A lot of iowait in the
host's kernel.
In my case, I've got a couple of different types of VM's, each with its
own type of workload:
- A total of 8 static VM's that are always running, each running a
different distribution/version of Linux.  These see very little activity
most of the time (I keep them around as reference systems so I have
something I can look at directly when doing development or providing
support), use ext4 for the internal filesystems, and are not particularly
big to begin with.
- A bunch of transient VM's used for testing kernel patches for BTRFS.
These literally start up, run xfstests, copy the results out to a file
share on the host, and shut down.  The overall behavior for these
shouldn't be too drastically different from most database workloads (the
internals of BTRFS are very similar to many database systems).
- Less frequently, transient VM's for testing other software (mostly
Netdata recently).  These have varied workloads depending on what exactly
I'm testing, but often don't touch the disk much.

So, overall, I don't have any systems quite comparable to what you're running, but still at least have a reasonable spread of workloads.

Compression shouldn't have much in the way of negative impact unless
you're also using transparent compression (or disk or file encryption)
inside the VM; in fact, it may speed things up significantly depending on
what filesystem is being used by the guest OS (the ext4 inode table in
particular seems to compress very well).  If you are using `nodatacow`
though, you can just turn compression off, as it's not going to be used
anyway.  If you want to keep using compression, then I'd suggest using
`compress-force` instead of `compress`, which makes BTRFS a bit more
aggressive about trying to compress things, but makes the performance
much more deterministic.  You may also want to look into using `zstd`
instead of `lzo` for the compression; it gets better ratios most of the
time, and usually performs better than `lzo` does.
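
For example, the relevant /etc/fstab entry might look like this (a
sketch; zstd support needs kernel 4.14 or newer, and per-level syntax
like zstd:3 came later still):

/dev/mapper/raid10-images  /var/lib/libvirt  btrfs  noatime,compress-force=zstd,space_cache  0  0

Or, to try it without editing fstab:

# mount -o remount,compress-force=zstd /var/lib/libvirt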

Yeah, I know the exact values from the post we both know for sure:
https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git/commit/?h=next&id=5c1aab1dd5445ed8bdcdbb575abc1b0d7ee5b2e7

Autodefrag should probably be off.  If you have nodatacow set (or just
have all the files marked with the NOCOW attribute), then there's not
really any point to having autodefrag on.  If, like me, you aren't
turning off COW for data, it's still a good idea to have it off and just
do batch defragmentation at a regularly scheduled time.
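
For example, a minimal weekly batch defrag (the 32M extent-size threshold
is just a starting point to tune; note that on current kernels
defragmenting snapshotted files unshares their extents and can increase
space usage):

# cat /etc/cron.weekly/btrfs-defrag
#!/bin/sh
# Recursively defragment the VM image directory, only touching
# extents smaller than 32M.
btrfs filesystem defragment -r -t 32M /var/lib/libvirt/images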

Well, at least I need to try nodatacow to check the impact.
Provided that the files aren't fragmented, you should see an increase in write performance, but probably not much improvement for reads.

For the VM settings, everything looks fine to me (though if you have
somewhat slow storage and aren't giving the VM's lots of memory to work
with, doing write-through caching might be helpful).  I would probably be
using virtio-scsi for the TRIM support, as with raw images you will get
holes in the file where the TRIM command was issued, which can actually
improve performance and does improve storage utilization (though doing
batch trims instead of using the `discard` mount option is better for
performance if you have Linux guests).
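
For example, a minimal sketch of the virtio-scsi variant of your disk
definition (discard='unmap' passes guest TRIM through and punches holes
in the raw image):

<controller type='scsi' model='virtio-scsi'/>
<disk type='file' device='disk'>
        <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
        <source file='/var/lib/libvirt/images/db.raw'/>
        <target dev='sda' bus='scsi'/>
</disk>

Then a periodic `fstrim -av` inside the Linux guests (cron or the
fstrim.timer systemd unit) gives you the batch trims instead of the
guest-side `discard` mount option.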

I don't consider this... :
4: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822367V
21: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822370A
38: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6141002D240AGN
55: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6063000F240AGN
as slow in RAID10 mode, because:

root@node0:~# time dd if=/dev/md1 of=/dev/null bs=4096 count=10M
10485760+0 records in
10485760+0 records out
42949672960 bytes (43 GB, 40 GiB) copied, 31.6336 s, 1.4 GB/s

real    0m31.636s
user    0m1.949s
sys     0m12.222s

root@node0:~# iostat -x 5 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.63    0.00    4.85    6.61    0.00   87.91

Device:         rrqm/s   wrqm/s     r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc             306.80     0.00  672.00    0.00  333827.20     0.00   993.53     1.05    1.57    1.57    0.00   1.21  81.20
sdb             329.80     0.00  663.40    0.00  332640.00     0.00  1002.83     0.94    1.41    1.41    0.00   1.05  69.44
sdd             298.80     0.00  664.80    0.00  329110.40     0.00   990.10     1.00    1.50    1.50    0.00   1.22  80.96
sda             291.60     0.00  657.40    0.00  330297.60     0.00  1004.86     0.92    1.40    1.40    0.00   1.05  69.20
md1               0.00     0.00 3884.80    0.00 2693254.40     0.00  1386.56     0.00    0.00    0.00    0.00   0.00   0.00

It gives me much more than 100k IOPS to the BTRFS filesystem, if I
remember correctly, when running a fio benchmark with a random workload
(75% reads, 25% writes), 2 threads.
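
For reference, roughly the kind of fio invocation that models that
workload (parameters are illustrative):

# fio --name=randrw --filename=/var/lib/libvirt/images/fio.test \
      --size=4G --rw=randrw --rwmixread=75 --bs=4k --ioengine=libaio \
      --iodepth=32 --direct=1 --numjobs=2 --group_reporting
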
Yeah, I wouldn't consider that 'slow' either.  In my case, I'm running my VM's with the back-end storage being a BTRFS raid1 volume on top of two LVM thinp targets, which are in turn on top of a pair of consumer 7200RPM SATA3 HDD's (because I ran out of space on the two half-TB SSD's that I had everything in the system on, and happened to have some essentially new 1TB HDD's around still from before I converted to SSD's everywhere).  That I would definitely call slow, and it's probably worth noting that my definition of 'works just fine' is at least partly based on the fact that the storage is so slow.
You're using an MD RAID10 array.  This is generally the fastest option in
terms of performance, but it also means you can't take advantage of
BTRFS' self-repairing ability very well, and you may be wasting space and
some performance (because you probably have the 'dup' profile set for
metadata).  If it's an option, I'd suggest converting this to a BTRFS
raid1 volume on top of two MD RAID0 volumes, which should give you the
same or slightly better performance, avoid wasting space storing
metadata, and also let you take advantage of the self-repair
functionality in BTRFS.
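
For example, a sketch of that layout (device names taken from your md1
output; double-check everything before running such destructive
commands):

# mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/sda2 /dev/sdb2
# mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1
# mkfs.btrfs -d raid1 -m raid1 /dev/md2 /dev/md3

BTRFS then keeps one copy of data and metadata on each RAID0 pair, so a
block that fails its checksum on one pair can be repaired from the other.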

That's a good point.

You should probably switch the `ssd` mount option to `nossd` (and then
run a full recursive defrag on the volume, as this option affects the
allocation policy, so the change only takes effect for new allocations).
The SSD allocator can actually pretty significantly hurt performance in
many cases, and has at best very limited benefits for device lifetimes
(you'll maybe get another few months out of a device that would last ten
years without issue).  Make a point to test this though: because you're
on a RAID array, it may actually be improving performance slightly.
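
For example (the defrag is needed because existing extents keep their old
placement):

# mount -o remount,nossd /var/lib/libvirt
# btrfs filesystem defragment -r /var/lib/libvirt

And add `nossd` to the mount options in /etc/fstab so it survives a
reboot.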

Another good point.  I am going to test the impact of the ssd parameter
too; I think that recreating the filesystem and copying the data back may
be a good idea.
One quick point, do make sure you explicitly set 'nossd', as BTRFS tries to set the 'ssd' parameter automatically based on whether or not the underlying storage is rotational (and I don't remember if MD copies the rotational flag from the lower-level storage or not).
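
You can check what the kernel decided for the array with:

# cat /sys/block/md1/queue/rotational

A 0 there means the device is reported as non-rotational, in which case
BTRFS will default to the `ssd` option unless you explicitly pass
`nossd`.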

We should not need to care about wearout: 20 users have been working on
the database every day (work week) for about a year, and:

Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BP240G4

233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0

and

177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       282

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 PRO 256GB

Which means 50 more years for the Samsung 850 Pro and 100 years for the
Intel 730, which is interesting.  (BTW, the start date is exactly the
same.)
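
For reference, those attributes can be pulled with smartctl (device name
is an example):

# smartctl -A /dev/sda | grep -Ei 'wear'

The normalized value (third column) for both attributes starts at 100 and
counts down toward the threshold as the flash wears.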
For what it's worth, based on my own experience, the degradation isn't exactly linear; it's more of an exponential falloff (as more blocks go bad, there's less spare space for the FTL to work with, so it can't do as good a job of wear-leveling, which in turn causes blocks to fail faster).  Realistically though, you do still probably have a few decades worth of life in them at minimum.

Thank you for sharing Austin.
Glad I could help!
