Hello everyone,

Has anybody had the chance to test this setup and reproduce the problem? I assumed it is a setup that is used often these days, so a solution would benefit a lot of users. If I can be of any assistance, please contact me.

--
Kind regards,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07
(Mon-Fri 9:00 to 18:00)

24/7 for outages:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter service announcements/security updates)

On 04/10/2017 10:08 AM, Sandro Bonazzola wrote:
Adding Paolo and Miroslav.

On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote <rich...@rimote.nl> wrote:

    Hello,

    I would really appreciate some help/guidance with this problem.
    First of all, sorry for the long message. I would file a bug, but
    I do not know whether the fault is mine, dm-cache's, qemu's or
    (probably) a combination of these. I can imagine some of you have
    this setup up and running without problems (or maybe you think it
    works, just like I did, but it does not):

    PROBLEM
    LVM cache writeback stops working as expected after a while with a
    qemu-kvm VM. A 100% working setup would be the holy grail in my
    opinion, and I must say the performance of KVM/qemu is great in
    the beginning.

    DESCRIPTION

    When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and
    creating a cached LV out of them, the VM initially performs great
    (at least 40,000 IOPS on 4k random read/write)! But after a while
    (and a lot of random I/O, ca. 10 - 20 GB) it effectively turns
    into a writethrough cache, although there is plenty of space left
    on the cachedlv.


    When working as expected, all writes on the KVM host go to the
    SSDs (sdc and sdd):

    iostat -x -m 2

    Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
    sda         0.00   324.50     0.00    22.00     0.00    14.94  1390.57     1.90    86.39     0.00    86.39     5.32    11.70
    sdb         0.00   324.50     0.00    22.00     0.00    14.94  1390.57     2.03    92.45     0.00    92.45     5.48    12.05
    sdc         0.00  3932.00     0.00  2191.50     0.00   270.07   252.39    37.83    17.55     0.00    17.55     0.36    78.05
    sdd         0.00  3932.00     0.00  2197.50     0.00   271.01   252.57    38.96    18.14     0.00    18.14     0.36    78.95


    When not working as expected, all writes on the KVM host go
    through the SSDs on to the HDDs (effectively disabling writeback,
    so it behaves like writethrough):

    Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
    sda         0.00     7.00   234.50   173.50     0.92     1.95    14.38    29.27    71.27   111.89    16.37     2.45   100.00
    sdb         0.00     3.50   212.00   177.50     0.83     1.95    14.60    35.58    91.24   143.00    29.42     2.57   100.10
    sdc         2.50     0.00   566.00   199.00     2.69     0.78     9.28     0.08     0.11     0.13     0.04     0.10     7.70
    sdd         1.50     0.00    76.00   199.00     0.65     0.78    10.66     0.02     0.07     0.16     0.04     0.07     1.85


    Stuff I've checked/tried:

    - The data in the cached LV has not even exceeded half of the
    space at that point, so this should not happen. It even happens
    when only 20% of cachedata is used.
    - It seems to be triggered most of the time when the Cpy%Sync
    column of `lvs -a` is at about 30%. But this is not always the case!
    - Changing the cache policy from smq to cleaner, waiting (check
    whether the flush is done with lvs -a) and then switching back to
    smq seems to help /sometimes/! But not always...

    lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv

    lvs -a

    lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv

    - *When mounting the LV inside the host this does not seem to
    happen!!* So it looks like a qemu-kvm / dm-cache combination
    issue. The only difference is that inside the host I run mkfs
    directly on the LV instead of using LVM inside the VM (so it could
    also be an "LVM inside the VM on top of LVM on the KVM host"
    problem; probably a small chance, because the first 10 - 20 GB
    work great!)

    - Tried disabling SELinux, upgrading to the newest kernels (elrepo
    ml and lt), and played around with the dirty-page settings
    /proc/sys/vm/dirty_writeback_centisecs,
    /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio,
    the migration threshold via dmsetup, and other probably
    unimportant settings like vm.dirty_bytes.

    - When in the "slow state" the system's kworkers are excessively
    using I/O (10 - 20 MB per kworker process). This seems to be the
    writeback process (Cpy%Sync), because the cache wants to flush to
    the HDDs. But the strange thing is that after a good sync (0%
    left), the disk may become slow again after a few MB of data. A
    reboot sometimes helps.

    - Have tried iothreads, virtio-scsi, the vcpu driver setting on
    the virtio-scsi controller, cache settings, disk schedulers etc.
    Nothing helped.

    - The new Samsung 950 PRO SSDs have HPA enabled (30%!!); I have an
    AMD FX(tm)-8350 and 16G RAM.
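
The cleaner/smq policy cycle from the list above could be wrapped in a small helper, roughly like this. It is only a sketch: the function name `flush_cache` and the polling interval are made up for illustration, it assumes your lvm2 version offers the `cache_dirty_blocks` report field in `lvs`, and it has not been run against the setup described in this thread.

```shell
# Sketch only: flush a cached LV by cycling the cache policy.
# Usage: flush_cache VG/LV [poll_seconds]   (hypothetical helper name)
flush_cache() {
    lv="$1"
    interval="${2:-10}"

    # The cleaner policy writes all dirty blocks back to the origin LV.
    lvchange --cachepolicy cleaner "$lv" || return 1

    while :; do
        # cache_dirty_blocks is an lvs report field (assumed available);
        # strip the padding lvs adds around the value.
        dirty=$(lvs --noheadings -o cache_dirty_blocks "$lv" | tr -d '[:space:]')
        echo "dirty blocks: ${dirty}"
        [ "${dirty:-1}" = "0" ] && break
        sleep "$interval"
    done

    # Switch back to the normal policy once everything is clean.
    lvchange --cachepolicy smq "$lv"
}
```

On the host it would be called as `flush_cache cl/cachedlv 5`. Whether the cache then stays fast is exactly the open question here, so this only automates the workaround, it does not fix anything.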

    It feels like the LVM cache has a threshold (about 20G of dirty
    data) beyond which it stops allowing the qemu-kvm process to use
    writeback caching (the root user inside the host does not seem to
    have this limitation). It starts flushing, but only up to a
    certain point. After a few MB of data it is right back in the slow
    state again. The only solution is waiting for a long time
    (independent of Cpy%Sync) or sometimes changing the cache policy
    to force a flush. For me this prevents production use of this
    system. But it's so promising, so I hope somebody can help.
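
To see whether a dirty-data threshold like this is really being hit, the raw dm-cache counters can be watched while the test runs. Below is a minimal sketch that pulls a few numbers out of one `dmsetup status` line. The helper name `cache_stats` is invented for this example, and the field positions are taken from the kernel's dm-cache status format as I understand it, so please check them against your kernel version before trusting the output.

```shell
# Sketch only: summarise one `dmsetup status` line of a dm-cache target.
# Reads the status line on stdin (hypothetical helper name).
cache_stats() {
    awk '$3 == "cache" {
        # Field 7 is "<used cache blocks>/<total cache blocks>".
        split($7, blk, "/")
        # Fields 12, 13, 14: demotions, promotions, dirty blocks.
        printf "used=%d total=%d dirty=%d promotions=%d demotions=%d\n",
               blk[1], blk[2], $14, $13, $12
    }'
}
```

Something like `watch -n 2 'dmsetup status cl-cachedlv'` piped through this helper (device name taken from the lsblk output in this mail) would show whether the dirty count stalls at a fixed level when the VM becomes slow.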

    Desired state: running the fio test (described in the REPRODUCE
    section) repeatedly should stay fast until the cachedlv is more or
    less full. If syncing back to disk causes this degradation, it
    should actually flush fully within a reasonable time and then allow
    fast writes again up to a given threshold. Right now it seems like
    a one-time-use cache that only uses a fraction of the SSD and is
    useless/very unstable afterwards.

    REPRODUCE
    1. Install newest CentOS 7 on software RAID 1 HDDs with LVM. Keep
    a lot of space for the LVM cache (no /home)! So make the VG as
    large as possible during anaconda partitioning.

    2. Once installed and booted into the system, install qemu-kvm

    yum install -y centos-release-qemu-ev
    yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
    # disable ksm (probably not important / needed)
    systemctl disable ksm
    systemctl disable ksmtuned

    3. create LVM cache

    #set some variables and create a raid1 array with the two SSDs

    VGBASE= && ssddevice1=/dev/sdX1 && ssddevice2=/dev/sdX1 &&
    hddraiddevice=/dev/mdXXX && ssdraiddevice=/dev/mdXXX &&
    mdadm --create --verbose ${ssdraiddevice} --level=mirror \
        --bitmap=none --raid-devices=2 ${ssddevice1} ${ssddevice2}

    # create PV and extend VG

     pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}

    # create slow LV on HDDs (use max space left if you want)

     pvdisplay ${hddraiddevice}
     lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}

    # create the meta and data: for testing purposes I keep about 20G
    # of the SSD for an uncached LV, to rule out the SSD itself.

    lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}

    #The rest can be used as cachedata/metadata.

     pvdisplay ${ssdraiddevice}
    # about 1/1000 of the space you have left on the SSD for the meta
    # (minimum of 4)
     lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
    # the rest can be used as cachedata
     lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}

    # convert/combine pools so cachedlv is actually cached

     lvconvert --type cache-pool --cachemode writeback --poolmetadata \
         ${VGBASE}/cachemeta ${VGBASE}/cachedata

     lvconvert --type cache --cachepool ${VGBASE}/cachedata \
         ${VGBASE}/cachedlv
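
The sizing rule of thumb used in the steps above (about 1/1000 of the free SSD space for the metadata LV, with a minimum of 4) is plain integer arithmetic. A small sketch follows. Assumptions on my side: the "minimum of 4" is read as 4 extents, since the `-l` option counts extents, and `cache_split` is a hypothetical helper name, not an LVM tool.

```shell
# Sketch only: split the free SSD extents into cachemeta and cachedata.
# Usage: cache_split FREE_EXTENTS   (hypothetical helper name)
cache_split() {
    free="$1"
    # Roughly 1/1000 of the free extents go to metadata ...
    meta=$((free / 1000))
    # ... but never fewer than 4 (assumed to mean extents).
    if [ "$meta" -lt 4 ]; then
        meta=4
    fi
    # Everything else is used for cache data.
    data=$((free - meta))
    echo "meta=${meta} data=${data}"
}
```

The free extent count would come from the "Free PE" line of `pvdisplay ${ssdraiddevice}`, and the two numbers are what goes into `lvcreate -l` for cachemeta and cachedata.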


    # my system now looks like (VG is called cl, default of installer)
    [root@localhost ~]# lvs -a
      LV                VG Attr       LSize   Pool        Origin
      [cachedata]       cl Cwi---C---  97.66g
      [cachedata_cdata] cl Cwi-ao----  97.66g
      [cachedata_cmeta] cl ewi-ao---- 100.00m
      cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
      [cachedlv_corig]  cl owi-aoC---   1.75t
      [lvol0_pmspare]   cl ewi------- 100.00m
      root              cl -wi-ao----  46.56g
      swap              cl -wi-ao----  14.96g
      testssd           cl -wi-a-----  45.47g

    [root@localhost ~]# lsblk
    NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sdd                        8:48   0   163G  0 disk
    └─sdd1                     8:49   0   163G  0 part
      └─md128                  9:128  0 162.9G  0 raid1
        ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
        │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
        ├─cl-testssd         253:2    0  45.5G  0 lvm
        └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
          └─cl-cachedlv      253:6    0   1.8T  0 lvm
    sdb                        8:16   0   1.8T  0 disk
    ├─sdb2                     8:18   0   1.8T  0 part
    │ └─md127                  9:127  0   1.8T  0 raid1
    │   ├─cl-swap            253:1    0    15G  0 lvm [SWAP]
    │   ├─cl-root            253:0    0  46.6G  0 lvm   /
    │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
    │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
    └─sdb1                     8:17   0   954M  0 part
      └─md126                  9:126  0   954M  0 raid1 /boot
    sdc                        8:32   0   163G  0 disk
    └─sdc1                     8:33   0   163G  0 part
      └─md128                  9:128  0 162.9G  0 raid1
        ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
        │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
        ├─cl-testssd         253:2    0  45.5G  0 lvm
        └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
          └─cl-cachedlv      253:6    0   1.8T  0 lvm
    sda                        8:0    0   1.8T  0 disk
    ├─sda2                     8:2    0   1.8T  0 part
    │ └─md127                  9:127  0   1.8T  0 raid1
    │   ├─cl-swap            253:1    0    15G  0 lvm [SWAP]
    │   ├─cl-root            253:0    0  46.6G  0 lvm   /
    │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
    │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
    └─sda1                     8:1    0   954M  0 part
      └─md126                  9:126  0   954M  0 raid1 /boot

    # now create vm
    wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso \
        -P /home/
    DISK=/dev/mapper/XXXX-cachedlv

    # watch out, my network setup uses a custom bridge/network in the
    # following command. Please replace it with what you normally use.
    virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7 \
        --disk path=${DISK},cache=none,bus=virtio \
        --network bridge=pubbr,model=virtio \
        --cdrom /home/CentOS-6.9-x86_64-minimal.iso \
        --graphics vnc,port=5998,listen=0.0.0.0 --cpu host

    # now connect with client PC to qemu
    virt-viewer --connect=qemu+ssh://r...@192.168.0.xxx/system \
        --name CentOS1

    And install everything on the single vda disk with LVM (I use the
    defaults in anaconda, but remove the large /home to prevent the
    SSD from being overused).

    After installation and reboot, log in to the VM and run

    yum install epel-release -y && yum install screen fio htop -y

    and then run the disk test:

    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
        --name=test --filename=test --bs=4k --iodepth=64 --size=4G \
        --readwrite=randrw --rwmixread=75

    Then keep repeating, but change the --filename argument each time
    so it does not use the same blocks over and over again.
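
Repeating the test with a fresh file each time is easy to script. The loop below is a sketch of that idea, reusing the fio options from this mail. The function name `run_fio_series` and the `testN` file naming are my own choices for illustration, and each run writes another 4G file, so keep an eye on free space in the guest.

```shell
# Sketch only: run the same fio job several times, each on a new file.
# Usage: run_fio_series [number_of_runs]   (hypothetical helper name)
run_fio_series() {
    runs="${1:-5}"
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "=== run ${i}: filename=test${i} ==="
        # Same options as the single test, only --filename changes.
        # grep -i keeps just the IOPS summary lines of the fio output.
        fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
            --name=test --filename="test${i}" --bs=4k --iodepth=64 \
            --size=4G --readwrite=randrw --rwmixread=75 | grep -i 'iops=' || true
        i=$((i + 1))
    done
}
```

Running `run_fio_series 6` inside the VM while `iostat -x -m 2` runs on the host should show the run at which the writes start hitting the HDDs.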

    In the beginning the performance is great!! Wow, very impressive:
    150 MB/s 4k random r/w (close to bare metal, about 20% - 30% loss).
    But after a few runs (usually about 4 or 5; always changing the
    filename, but not overfilling the FS), it drops to about 10 MB/s.

    normal/in the beginning

     read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
     write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec

    but then

     read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
     write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec

    or even worse, down to the point where it is actually the HDDs
    that are written to (about 500 IOPS).

    P.S. When a test is/was slow, that means its file is on the HDDs.
    So even after fixing the problem (sometimes just by waiting), that
    specific file will stay slow when redoing the test until it is
    promoted to the LVM cache (which takes a lot of reads, I think).
    And once on the SSD it sometimes stays fast, although a new test
    file will be slow. So I really recommend changing the test file
    every time when checking whether a change in speed has occurred.



    _______________________________________________
    CentOS-virt mailing list
    CentOS-virt@centos.org
    https://lists.centos.org/mailman/listinfo/centos-virt




--

SANDRO BONAZZOLA

ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D

Red Hat EMEA <https://www.redhat.com/>



