On 14/07/17 18:43, Ruben Rodriguez wrote:
> How to reproduce...

I'll provide more concise details on how to test this behavior:

Ceph config:

[client]
rbd readahead max bytes = 0 # we don't want forced readahead to fool us
rbd cache = true
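
Once the vm is running, you can confirm the librbd client actually picked these
settings up by asking its admin socket (the same socket used for the perf dump
below), for example:

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok config show | grep -E 'rbd_cache|rbd_readahead'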

Start a qemu vm with an rbd image attached via virtio-scsi:
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='...'/>
      </auth>
      <source protocol='rbd' name='libvirt-pool/test'>
        <host name='cephmon1' port='6789'/>
        <host name='cephmon2' port='6789'/>
        <host name='cephmon3' port='6789'/>
      </source>
      <blockio logical_block_size='512' physical_block_size='512'/>
      <target dev='sdb' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
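
The numbers below assume the image uses the default 4MB objects; you can
confirm that from the host, where an order of 22 means 2^22 = 4194304-byte
objects:

$ rbd info libvirt-pool/test | grep -E 'order|objects'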

Block device parameters as seen inside the vm:
NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G
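
The table above is the disk topology as the guest sees it; it can be reproduced
inside the vm with lsblk -t, or read straight from sysfs:

$ lsblk -t /dev/sdb
$ cat /sys/block/sdb/queue/optimal_io_size   # 4194304, i.e. the 4MB rbd object size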

Collect performance statistics from librbd using this command:

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump

Note the values for:
- rd: number of read operations done by qemu
- rd_bytes: length of read requests done by qemu
- cache_ops_hit: read operations hitting the cache
- cache_ops_miss: read ops missing the cache
- data_read: data read from the cache
- op_r: read operations sent to the OSDs (effectively, objects fetched)
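
If you have jq on the host, a quick way to pull out just the librbd and
objecter sections (the librbd section name embeds the client id and image
name, hence the prefix match):

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump \
    | jq 'with_entries(select(.key == "objecter" or (.key | startswith("librbd"))))'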

Perform one small read at a 4MB object boundary (41943040 = 10 x 4194304),
away from the beginning of the image, because udev may have read that part
already:

dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes

Run the perf dump command and note the counters. Then do the read again,
advanced by 5000 bytes so it does not overlap the previous one, and run the
perf dump command again:

dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes

If you compare the counters at each step, you should see one cache miss and
one object read (op_r) for each dd, even though both reads fall inside the
same 4MB object. The same object is fetched from the OSDs twice.
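
Both offsets land in the same backing object (index 10), which is easy to
check with shell arithmetic, assuming the default 4194304-byte object size:

$ echo $((41943040 / 4194304)) $((41948040 / 4194304))
10 10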

IMPACT:

Let's take a look at how the op_r value grows during some common
operations:

- Booting a vm: in my case this needs ~70MB to be read, which includes the
kernel, the initrd and all the files read by systemd and its daemons until a
command prompt appears. The counters after boot:
        "rd": 2524,
        "rd_bytes": 69685248,
        "cache_ops_hit": 228,
        "cache_ops_miss": 2268,
        "cache_bytes_hit": 90353664,
        "cache_bytes_miss": 63902720,
        "data_read": 69186560,
        "op": 2295,
        "op_r": 2279,
That is 2279 objects fetched from the OSDs to read ~69MB.
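
On average, that is only about 30KB of guest data per OSD read operation,
against a 4MB object size:

$ echo $((69685248 / 2279))    # rd_bytes / op_r
30577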

- Grepping through the Linux kernel source (833MB) takes almost 3 minutes.
  The values increase to:
        "rd": 65127,
        "rd_bytes": 1081487360,
        "cache_ops_hit": 228,
        "cache_ops_miss": 64885,
        "cache_bytes_hit": 90353664,
        "cache_bytes_miss": 1075672064,
        "data_read": 1080988672,
        "op_r": 64896,
That is over 60,000 objects fetched to read <1GB, and *zero* additional cache
hits (cache_ops_hit did not increase). Optimized, this should take about 10
seconds and fetch ~700 objects.
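
For comparison, the stock readahead settings that the test config disables
(via rbd readahead max bytes = 0) are, if I remember the defaults correctly:

[client]
rbd readahead trigger requests = 10           # default
rbd readahead max bytes = 524288              # default, 512KB
rbd readahead disable after bytes = 52428800  # default, 50MB

Re-running the grep with these restored would show how much readahead alone
masks the problem, though it would not explain the repeated fetches of the
same object seen above.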

Is my Qemu setup completely broken, or is this expected? Please help!

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org