Improving responsiveness of KVM guests on Ceph storage

2012-12-30 Thread Jens Kristian Søgaard

Hi guys,

I'm testing Ceph as storage for KVM virtual machine images and have run 
into a problem that I am hoping it is possible to find the cause of.


I'm running a single KVM Linux guest on top of Ceph storage. In that 
guest I run rsync to download files from the internet. When rsync is 
running, the guest will seemingly stall and run very slowly.


For example if I log in via SSH to the guest and use the command prompt, 
nothing will happen for a long period (30+ seconds), then it processes a 
few typed characters, and then it blocks for another long period of 
time, then process a bit more, etc.


I was hoping to be able to tweak the system so that it runs more like 
when using conventional storage - i.e. perhaps the rsync won't be super 
fast, but the machine will be equally responsive all the time.


I'm hoping that you can provide some hints on how to best benchmark or 
test the system to find the cause of this?


The ceph OSDs periodically log these two messages, which I do not fully 
understand:


2012-12-30 17:07:12.894920 7fc8f3242700  1 heartbeat_map is_healthy 
'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
2012-12-30 17:07:13.599126 7fc8cbfff700  1 heartbeat_map reset_timeout 
'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30


Is this to be expected when the system is in use, or does it indicate 
that something is wrong?


Ceph also logs messages such as this:

2012-12-30 17:07:36.932272 osd.0 10.0.0.1:6800/9157 286340 : [WRN] slow 
request 30.751940 seconds old, received at 2012-12-30 17:07:06.180236: 
osd_op(client.4705.0:16074961 rb.0.11b7.4a933baa.000c188f [write 
532480~4096] 0.f2a63fe) v4 currently waiting for sub ops



My setup:

3 servers running Fedora 17 with Ceph 0.55.1 from RPM.
Each server runs one osd and one mon. One of the servers also runs an mds.
Backing file system is btrfs stored on an md-raid. Journal is stored on 
the same SATA disks as the rest of the data.

Each server has 3 bonded gigabit/sec NICs.

One server running Fedora 16 with qemu-kvm.
Has gigabit/sec NIC connected to the same network as the Ceph servers, 
and a gigabit/sec NIC connected to the Internet.

The disk is attached with:

-drive format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio


iostat on the KVM guest gives:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    0,00  100,00    0,00    0,00

Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz    await   svctm  %util
vda        0,00    1,40   0,10   0,30    0,80   13,60    36,00     1,66  2679,25 2499,75  99,99



Top on the KVM host shows 90% CPU idle and 0.0% I/O waiting.

iostat on an OSD gives:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,13    0,00    1,50   15,79    0,00   82,58

Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda      240,70  441,20  33,00   42,70  1122,40  1961,80    81,48    14,45  164,42  319,14   44,85   6,63  50,22
sdb      299,10  393,10  33,90   38,40  1363,60  1720,60    85,32    13,55  171,32  316,21   43,41   6,55  47,39
sdc      268,50  441,60  28,80   45,40  1191,60  1977,00    85,41    19,08  159,39  345,98   41,02   6,56  48,69
sdd      255,50  445,50  30,20   45,00  1150,40  1975,80    83,14    18,18  155,97  338,90   33,20   6,95  52,23
md0        0,00    0,00   1,20  132,70     4,80  4086,40    61,11     0,00    0,00    0,00    0,00   0,00   0,00



The figures are similar on all three OSDs.

I am thinking that one possible cause could be that the journal is 
stored on the same disks as the rest of the data, but I don't know how 
to verify whether this is actually the case.


Thanks for any help or advice you can offer!

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-30 Thread Andrey Korolyov
On Sun, Dec 30, 2012 at 9:05 PM, Jens Kristian Søgaard
 wrote:
> Hi guys,
>
> I'm testing Ceph as storage for KVM virtual machine images and found an
> inconvenience that I am hoping it is possible to find the cause of.
>
> I'm running a single KVM Linux guest on top of Ceph storage. In that guest I
> run rsync to download files from the internet. When rsync is running, the
> guest will seemingly stall and run very slowly.
>
> For example if I log in via SSH to the guest and use the command prompt,
> nothing will happen for a long period (30+ seconds), then it processes a few
> typed characters, and then it blocks for another long period of time, then
> process a bit more, etc.
>
> I was hoping to be able to tweak the system so that it runs more like when
> using conventional storage - i.e. perhaps the rsync won't be super fast, but
> the machine will be equally responsive all the time.
>
> I'm hoping that you can provide some hints on how to best benchmark or test
> the system to find the cause of this?
>
> The ceph OSDs periodically log these two messages, which I do not fully
> understand:
>
> 2012-12-30 17:07:12.894920 7fc8f3242700  1 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
> 2012-12-30 17:07:13.599126 7fc8cbfff700  1 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
>
> Is this to be expected when the system is in use, or does it indicate that
> something is wrong?
>
> Ceph also logs messages such as this:
>
> 2012-12-30 17:07:36.932272 osd.0 10.0.0.1:6800/9157 286340 : [WRN] slow
> request 30.751940 seconds old, received at 2012-12-30 17:07:06.180236:
> osd_op(client.4705.0:16074961 rb.0.11b7.4a933baa.000c188f [write
> 532480~4096] 0.f2a63fe) v4 currently waiting for sub ops
>
>
> My setup:
>
> 3 servers running Fedora 17 with Ceph 0.55.1 from RPM.
> Each server runs one osd and one mon. One of the servers also runs an mds.
> Backing file system is btrfs stored on an md-raid. Journal is stored on the
> same SATA disks as the rest of the data.
> Each server has 3 bonded gigabit/sec NICs.
>
> One server running Fedora 16 with qemu-kvm.
> Has gigabit/sec NIC connected to the same network as the Ceph servers, and a
> gigabit/sec NIC connected to the Internet.
> Disk is mounted with:
>
> -drive format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
>
> iostat on the KVM guest gives:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,00    0,00    0,00  100,00    0,00    0,00
>
> Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz    await   svctm  %util
> vda        0,00    1,40   0,10   0,30    0,80   13,60    36,00     1,66  2679,25 2499,75  99,99
>
>
> Top on the KVM host shows 90% CPU idle and 0.0% I/O waiting.
>
> iostat on an OSD gives:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,13    0,00    1,50   15,79    0,00   82,58
>
> Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda      240,70  441,20  33,00   42,70  1122,40  1961,80    81,48    14,45  164,42  319,14   44,85   6,63  50,22
> sdb      299,10  393,10  33,90   38,40  1363,60  1720,60    85,32    13,55  171,32  316,21   43,41   6,55  47,39
> sdc      268,50  441,60  28,80   45,40  1191,60  1977,00    85,41    19,08  159,39  345,98   41,02   6,56  48,69
> sdd      255,50  445,50  30,20   45,00  1150,40  1975,80    83,14    18,18  155,97  338,90   33,20   6,95  52,23
> md0        0,00    0,00   1,20  132,70     4,80  4086,40    61,11     0,00    0,00    0,00    0,00   0,00   0,00
>
>
> The figures are similar on all three OSDs.
>
> I am thinking that one possible cause could be that the journal is stored on
> the same disks as the rest of the data, but I don't know how to verify
> whether this is actually the case.
>
> Thanks for any help or advice, you can offer!

Hi Jens,

You may try playing with SCHED_RT. I have found it hard to use myself,
but you can achieve your goal by adding small RT slices via the ``cpu''
cgroup to the vcpu/emulator threads; it dramatically increases overall
VM responsiveness. I have since abandoned it because the RT scheduler
is a very strange thing - it may cause an endless lockup on disk
operations under heavy load, or leave an ever-stuck ``kworker'' on some
cores if you kill a VM that has separate RT slices for its vcpu
threads. Of course, some Ceph tuning such as a writeback cache and a
large journal may help you too; I am speaking primarily of VM
performance by itself.

>
> --
> Jens Kristian Søgaard, Mermaid Consulting ApS,
> j...@mermaidconsulting.dk,
> http://www.mermaidconsulting.com/

Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-30 Thread Jens Kristian Søgaard

Hi Andrey,

Thanks for your reply!


> You may try playing with SCHED_RT. I have found it hard to use myself,
> but you can achieve your goal by adding small RT slices via the
> ``cpu'' cgroup to the vcpu/emulator threads; it dramatically increases
> overall VM responsiveness.


I'm not quite sure I understand your suggestion.

Do you mean that you set the process priority to real-time on each 
qemu-kvm process, and then use cgroups cpu.rt_runtime_us / 
cpu.rt_period_us to restrict the amount of CPU time those processes can 
receive?


I'm not sure how that would apply here, as I have only one qemu-kvm 
process, and it is unresponsive not because of a lack of allocated CPU 
time slices, but rather because some I/Os take a long time to complete, 
and other I/Os apparently have to wait for them.



> Of course, some Ceph tuning such as a writeback cache and a large
> journal may help you too; I am speaking primarily of VM performance by
> itself.


I have been considering the journal as something where I could improve 
performance by tweaking the setup. I have set aside 10 GB of space for 
the journal, but I'm not sure if this is too little - or if the size 
really doesn't matter that much when it is on the same mdraid as the 
data itself.


Is there a tool that can tell me how much of my journal space is 
actually being used?


I.e. I'm looking for something that could tell me if increasing the 
size of the journal or placing it on a separate (SSD) disk could solve 
my problem.


How do I change the size of the writeback cache when using qemu-kvm like 
I do?


Does setting rbd cache size in ceph.conf have any effect on qemu-kvm, 
where the drive is defined as:


  format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-31 Thread Andrey Korolyov
On Mon, Dec 31, 2012 at 3:12 AM, Jens Kristian Søgaard
 wrote:
> Hi Andrey,
>
> Thanks for your reply!
>
>
>> You may try playing with SCHED_RT. I have found it hard to use
>> myself, but you can achieve your goal by adding small RT slices via
>> the ``cpu'' cgroup to the vcpu/emulator threads; it dramatically
>> increases overall VM responsiveness.
>
>
> I'm not quite sure I understand your suggestion.
>
> Do you mean that you set the process priority to real-time on each qemu-kvm
> process, and then use cgroups cpu.rt_runtime_us / cpu.rt_period_us to
> restrict the amount of CPU time those processes can receive?
>
> I'm not sure how that would apply here, as I have only one qemu-kvm process,
> and it is unresponsive not because of a lack of allocated CPU time slices,
> but rather because some I/Os take a long time to complete, and other I/Os
> apparently have to wait for them.
>
Yep, I meant the same. Of course it won't help with only one VM; RT
may help in more concurrent cases :)
>
>> threads. Of course, some Ceph tuning like writeback cache and large
>> journal may help you too, I`m speaking primarily of VM` performance by
>
>
> I have been considering the journal as something where I could improve
> performance by tweaking the setup. I have set aside 10 GB of space for the
> journal, but I'm not sure if this is too little - or if the size really
> doesn't matter that much when it is on the same mdraid as the data itself.
>
> Is there a tool that can tell me how much of my journal space that is
> actually actively being used?
>
> I.e. I'm looking for something that could tell me, if increasing the size of
> the journal or placing it on a separate (SSD) disk could solve my problem.

As I understand it, you have an md device holding both journal and
filestore? What type of raid do you have there? You will of course
need a separate device for the journal (for experimental purposes, a
fast disk may be enough), and if you have any type of redundant
storage under the filestore partition, you may also change it to
simple RAID0, or even separate disks, and create one osd over every
disk (watch the journal device's throughput, which must equal the sum
of the speeds of all the filestore devices - so a commodity-type SSD
covers about two 100MB/s disks, for example). I have a ``pure'' disk
setup in my dev environment, built on quite old desktop-class
machines, and one rsync process may hang a VM there for a short time,
despite using a dedicated SATA disk for the journal.
>
> How do I change the size of the writeback cache when using qemu-kvm like I
> do?
>
> Does setting rbd cache size in ceph.conf have any effect on qemu-kvm, where
> the drive is defined as:
>
>   format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
What cache_size/max_dirty values do you have in ceph.conf, and which
qemu version do you use? The default values are good enough to prevent
pushing I/O spikes down to the physical storage, but for long
I/O-intensive tasks, increasing the cache may help the OS align writes
more smoothly. Also, you don't need to set rbd_cache explicitly in the
disk config with qemu 1.2 and newer releases; for older ones,
http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html
should be applied.

>
> --
> Jens Kristian Søgaard, Mermaid Consulting ApS,
> j...@mermaidconsulting.dk,
> http://www.mermaidconsulting.com/


Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-31 Thread Jens Kristian Søgaard

Hi Andrey,


> As I understand it, you have an md device holding both journal and
> filestore? What type of raid do you have there?


Yes, same md device holding both journal and filestore. It is a raid5.


> You will of course need a separate device for the journal (for
> experimental purposes, a fast disk may be enough).


Is there a way to tell if the journal is the bottleneck without 
actually adding such an extra device?



> ...under the filestore partition, you may also change it to simple
> RAID0, or even separate disks, and create one osd over every disk.


I have only 3 OSDs with 4 disks each. I was afraid that it would be too 
brittle as a RAID0, and that if I created separate OSDs for each disk, 
it would stall the file system due to recovery if a server crashes.



> What cache_size/max_dirty values do you have in ceph.conf?


I haven't set them explicitly, so I imagine the cache_size is 32 MB and 
the max_dirty is 24 MB.
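
For reference, my understanding is that setting them explicitly would 
look something like this in ceph.conf on the client side - the option 
names are as I understand them for this version, and the values below 
are just the defaults spelled out, not a recommendation:

```ini
[client]
    rbd cache = true
    rbd cache size = 33554432       ; 32 MB, which I believe is the default
    rbd cache max dirty = 25165824  ; 24 MB, which I believe is the default
```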


> and which qemu version do you use?


Using the default 0.15 version in Fedora 16.


> for long I/O-intensive tasks, increasing the cache may help the OS
> align writes more smoothly. Also, you don't need to set rbd_cache
> explicitly in the disk config with qemu 1.2 and newer releases; for
> older ones,
> http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html
> should be applied.


I read somewhere that I needed to enable it specifically for older 
qemu-kvm versions, which I did like this:


  format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio

However, now I read in the docs for qemu-rbd that it needs to be set 
like this:


  format=raw,file=rbd:data/squeeze:rbd_cache=true,cache=writeback

I'm not sure if 1 and true are interpreted the same way?

I'll try using "true" and see if I get any noticeable changes in behaviour.

The link you sent me seems to indicate that I need to compile my own 
version of qemu-kvm to be able to test this?



--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-31 Thread Andrey Korolyov
On Mon, Dec 31, 2012 at 2:58 PM, Jens Kristian Søgaard
 wrote:
> Hi Andrey,
>
>
>> As I understood right, you have md device holding both journal and
>> filestore? What type of raid you have here?
>
>
> Yes, same md device holding both journal and filestore. It is a raid5.

Ahem, of course you need to reassemble it to something faster :)
>
>
>> Of course you`ll need a
>> separate device (for experimental purposes, fast disk may be enough)
>> for the journal
>
>
> Is there a way to tell if the journal is the bottleneck without actually
> adding such an extra device?
>
In theory, yes - but your setup is already dying under a high amount
of write seeks, so it may not be necessary. Also, I don't see a good
way to measure the bottleneck when one disk device is used for both
filestore and journal. With separate devices, you can measure the
maximum values using fio and compare them to the actual values
calculated from /proc/diskstats, but the ``all-in-one'' case is
obviously hard to measure, even if you were able to log writes to the
journal file and the filestore files separately without significant
overhead.
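
To illustrate the /proc/diskstats side: something like the sketch below
can give a rough per-disk write rate (the device name is an assumption,
and I take field 10 as total sectors written, per the kernel's
Documentation/iostats.txt):

```shell
#!/bin/sh
# Rough write throughput for one block device, sampled from /proc/diskstats.
# Field 3 is the device name; field 10 is total sectors written (512 B each).
DEV=${1:-sda}
INTERVAL=2

w1=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)
sleep "$INTERVAL"
w2=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)

# Fall back to 0 if the device was not found, to keep the arithmetic valid.
w1=${w1:-0}; w2=${w2:-0}
echo "$DEV: $(( (w2 - w1) * 512 / INTERVAL )) bytes/s written"
```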

>
>> filestore partition, you may also change it to simple RAID0, or even
>> separate disks, and create one osd over every disk(you should see to
>
>
> I have only 3 OSDs with 4 disks each. I was afraid that it would be too
> brittle as a RAID0, and if I created seperate OSDs for each disk, it would
> stall the file system due to recovery if a server crashes.

No, it isn't too bad in most cases. The recovery process does not
affect operations on the rbd storage except for a small performance
degradation, so you may split your raid setup into lightweight R0. It
depends: on a plain SATA controller, software R0 under one OSD will do
a better job than >2 separate OSDs with one disk each, while on a
cache-backed controller separate OSDs are preferable, as long as the
controller is able to align writes within the overall write bandwidth.

>
>
>> What size of cache_size/max_dirty you have inside ceph.conf
>
>
> I haven't set them explicitly, so I imagine the cache_size is 32 MB and the
> max_dirty is 24 MB.
>
>
>> and which
>>
>> qemu version you use?
>
>
> Using the default 0.15 version in Fedora 16.
>
>
>> tasks increasing cache may help OS to align writes more smoothly. Also
>> you don`t need to set rbd_cache explicitly in the disk config using
>> qemu 1.2 and younger releases, for older ones
>> http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html
>> should be applied.
>
>
> I read somewhere that I needed to enable it specifically for older qemu-kvm
> versions, which I did like this:
>
>   format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
> However now I read in the docs for qemu-rbd that it needs to be set like
> this:
>
>   format=raw,file=rbd:data/squeeze:rbd_cache=true,cache=writeback
>
> I'm not sure if 1 and true are interpreted the same way?
>
> I'll try using "true" and see if I get any noticeable changes in behaviour.
>
> The link you sent me seems to indicate that I need to compile my own version
> of qemu-kvm to be able to test this?
>

No, there have been no significant changes from 0.15 to the current
version, and your options will work just fine. So the general
recommendation would be to remove redundancy from your disk backend,
and then move the journal out to a separate disk or ssd.
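
If you move the journal, my recollection is that it is just a matter of
pointing the osd at the new device in ceph.conf and recreating the
journal - the option names and paths below are from memory and only an
example, so check the docs for your version:

```ini
[osd.0]
    osd journal = /dev/sdX1    ; hypothetical SSD partition for the journal
    osd journal size = 0       ; 0 for a whole block device; MB for a file
```

After stopping the osd, something like ``ceph-osd -i 0 --mkjournal''
should then recreate the journal on the new device.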

>
>
> --
> Jens Kristian Søgaard, Mermaid Consulting ApS,
> j...@mermaidconsulting.dk,
> http://www.mermaidconsulting.com/