We are also seeing a similar problem, which we believe is #3737. Our VMs
(running MongoDB) were completely frozen for 2-3 minutes (sometimes
longer) while adding a new OSD. We have since reduced the recovery max
active and backfill settings and made sure RBD caching is enabled, and
things now seem better. We still see some increase in iowait, but the VMs
continue to function.
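
In case it helps, this is roughly the shape of what we ended up with. The
exact numbers below are just to illustrate, not a recommendation, so tune
them for your own cluster:

    [osd]
        osd recovery max active = 1
        osd max backfills = 1

    [client]
        rbd cache = true

If I remember correctly the same can also be injected at runtime without
restarting the OSDs, with something like:

    ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'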

But that, I guess, depends on what the VM is actually doing at that
moment. We ran some fio tests before deploying the actual services, and
what we saw was that while read-only or write-only tests survived an OSD
addition with some degraded performance, mixed read-write tests (rw and
randrw in fio terms) stalled completely. In other words, the VM stayed
usable during read-only or write-only tests even when performance
occasionally dropped to 0 IOPS, but during the rw/randrw tests it froze
entirely in addition to dropping to 0 IOPS.
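
For reference, the mixed test was something along these lines (parameters
quoted from memory, so take them as an approximation rather than the
exact job we ran):

    fio --name=mixedtest --filename=/mnt/test/fio.dat --size=4G \
        --ioengine=libaio --direct=1 --bs=4k --rw=randrw --rwmixread=70 \
        --iodepth=16 --runtime=300 --time_based --group_reporting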

BTW Stefan, I'm by no means experienced with Ceph and I don't know about
your OSDs, but 8128 PGs for an 8TB cluster seems like a lot. Or is that
OK when the disks are SSDs?
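
The rule of thumb I've seen in the docs is roughly
(number of OSDs * 100) / replica count, rounded up to the next power of
two. With made-up numbers, just to show the calculation:

    12 OSDs * 100 / 3 replicas = 400  ->  round up to 512 PGs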


On Fri, Apr 12, 2013 at 5:23 PM, Dave Spano <dsp...@optogenics.com> wrote:

> Very interesting. I ran into the same thing yesterday when I added SATA
> disks to the cluster. I was about to return them for SAS drives instead
> because of how long it took, and how slow some of my RBDs got.
>
> Are most people using SATA 7200 RPM drives? My concern was with Oracle
> DBs. Postgres doesn't seem to have as much of a problem running on an RBD,
> but I noticed a marked difference with Oracle.
>
> Dave Spano
>
>
>
> ------------------------------
> *From: *"Stefan Priebe - Profihost AG" <s.pri...@profihost.ag>
> *To: *"Wido den Hollander" <w...@42on.com>
> *Cc: *ceph-users@lists.ceph.com
> *Sent: *Wednesday, April 10, 2013 3:51:23 PM
> *Subject: *Re: [ceph-users] ceph recovering results in offline VMs
>
>
> On 10.04.2013 at 21:36, Wido den Hollander <w...@42on.com> wrote:
>
> > On 04/10/2013 09:16 PM, Stefan Priebe wrote:
> >> Hello list,
> >>
> >> I'm using ceph 0.56.4 and I have to replace some drives. But while ceph
> >> is backfilling / recovering, all VMs have high latencies and sometimes
> >> they're even offline. I only replace one drive at a time.
> >>
> >> I put in the new drives and I'm reweighting them from 0.0 to 1.0 in
> >> 0.1 steps.
> >>
> >> I already lowered osd recovery max active to 2 and osd max backfills to
> >> 3, but when I put them back at 1.0 the VMs are nearly all down.
> >>
> >> Right now some of the drives are SSDs, so they're a lot faster than the
> >> HDDs I'm going to replace.
> >>
> >> Nothing in the logs, but it reports recovering at 3700MB/s, which is
> >> clearly not possible on SATA HDDs.
> >>
> >> Log example:
> >> 2013-04-10 20:55:33.711289 mon.0 [INF] pgmap v9293315: 8128 pgs: 233
> >> active, 7876 active+clean, 19 active+recovery_wait; 557 GB data, 1168 GB
> >> used, 7003 GB / 8171 GB avail; 2108KB/s wr, 329op/s; 31/309692 degraded
> >> (0.010%);  recovering 840 o/s, 3278MB/s
> >
> > There is an issue about this in the tracker; I saw it this week but I'm
> > not able to find it anymore.
>
> 3737?
>
> > I'm seeing this as well; when the cluster is recovering, RBD images tend
> > to get very sluggish.
> >
> > Most of the time I'm blaming the CPUs in the OSDs for it, but I've also
> > seen it on faster systems.
>
> I have 3.6GHz Xeons with just 4 OSDs per host.
>
> Stefan


-- 
erdem agaoglu
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
