We have 12 OSDs per host, so we've gone conservative and set recovery max active to 1 and max backfills to 4. We also set the nodown flag prior to adding a new OSD, since we've seen that flapping makes recovery even more problematic.
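Roughly, this is what we do around adding an OSD. It's only a sketch of our procedure; the injectargs spelling differs a bit between releases (on older ones it's "ceph osd tell \* injectargs ..."), so adjust for your version:

    # keep the monitors from marking flapping osds down during heavy peering/backfill
    ceph osd set nodown

    # throttle recovery; these are the values we settled on for 12 osds per host
    ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 4'

    # ... add the new osd and let backfill finish ...

    # clear the flag again once the cluster is healthy
    ceph osd unset nodown

We also keep the same values in the [osd] section of ceph.conf (osd recovery max active = 1, osd max backfills = 4) so a daemon restart doesn't reset them.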
On Apr 12, 2013 8:04 PM, "Dave Spano" <dsp...@optogenics.com> wrote:

> What are your settings for recovery max active and backfill? Just curious.
>
> Dave Spano
>
> ------------------------------
> From: "Erdem Agaoglu" <erdem.agao...@gmail.com>
> To: "Dave Spano" <dsp...@optogenics.com>
> Cc: "Stefan Priebe - Profihost AG" <s.pri...@profihost.ag>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Friday, April 12, 2013 12:48:05 PM
> Subject: Re: [ceph-users] ceph recovering results in offline VMs
>
> We are also seeing a similar problem, which we believe is #3737. Our VMs
> (running MongoDB) were being completely frozen for 2-3 minutes (sometimes
> longer) while adding a new OSD. We have reduced the recovery max active and
> backfill settings and made sure RBD caching is enabled, and now things seem
> better. We still see some increase in iowait, but the VMs continue to function.
>
> But that, I guess, depends on what the VM is actually doing at that moment.
> We ran some fio tests before running actual services, and what we saw was
> that while individual read or write tests were able to survive OSD addition
> with some degraded performance, concurrent read-write tests (rw and randrw
> in fio terms) stalled completely. That is, the VM kept functioning in the
> individual read or write tests even when performance occasionally dropped
> to 0 iops, but in the rw/randrw tests it froze in addition to dropping to 0 iops.
>
> BTW Stefan, I'm in no way experienced with ceph and I don't know about
> your OSDs, but 8128 pgs for an 8TB cluster seems like a lot. Or is that OK
> when the disks are SSDs?
>
> On Fri, Apr 12, 2013 at 5:23 PM, Dave Spano <dsp...@optogenics.com> wrote:
>
>> Very interesting. I ran into the same thing yesterday when I added SATA
>> disks to the cluster. I was about to return them for SAS drives instead
>> because of how long it took, and how slow some of my RBDs got.
>>
>> Are most people using SATA 7200 RPM drives? My concern was with Oracle
>> DBs. Postgres doesn't seem to have as much of a problem running on an RBD,
>> but I noticed a marked difference with Oracle.
>>
>> Dave Spano
>>
>> ------------------------------
>> From: "Stefan Priebe - Profihost AG" <s.pri...@profihost.ag>
>> To: "Wido den Hollander" <w...@42on.com>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Wednesday, April 10, 2013 3:51:23 PM
>> Subject: Re: [ceph-users] ceph recovering results in offline VMs
>>
>> On 10.04.2013 at 21:36, Wido den Hollander <w...@42on.com> wrote:
>>
>> > On 04/10/2013 09:16 PM, Stefan Priebe wrote:
>> >> Hello list,
>> >>
>> >> I'm using ceph 0.56.4 and I have to replace some drives. But while ceph
>> >> is backfilling / recovering, all VMs have high latencies and sometimes
>> >> they're even offline. I only replace one drive at a time.
>> >>
>> >> I put in the new drives and I'm reweighting them from 0.0 to 1.0 in
>> >> 0.1 steps.
>> >>
>> >> I already lowered osd recovery max active = 2 and osd max backfills = 3,
>> >> but when I put them back at 1.0 the VMs are nearly all down.
>> >>
>> >> Right now some of the drives are SSDs, so they're a lot faster than the
>> >> HDDs I'm going to replace.
>> >>
>> >> There's nothing in the logs, but it is reporting recovery at 3700MB/s;
>> >> clearly that's not possible on SATA HDDs.
>> >>
>> >> Log example:
>> >> 2013-04-10 20:55:33.711289 mon.0 [INF] pgmap v9293315: 8128 pgs: 233
>> >> active, 7876 active+clean, 19 active+recovery_wait; 557 GB data, 1168 GB
>> >> used, 7003 GB / 8171 GB avail; 2108KB/s wr, 329op/s; 31/309692 degraded
>> >> (0.010%); recovering 840 o/s, 3278MB/s
>> >
>> > There is an issue about this in the tracker; I saw it this week but I'm
>> > not able to find it anymore.
>>
>> 3737?
>>
>> > I'm seeing this as well: when the cluster is recovering, RBD images tend
>> > to get very sluggish.
>> >
>> > Most of the time I blame the CPUs in the OSD nodes for it, but I've
>> > also seen it on faster systems.
>>
>> I have 3.6GHz Xeons with just 4 OSDs per host.
>>
>> Stefan
>
> --
> erdem agaoglu
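For reference, the concurrent read-write case I mentioned below maps roughly onto a fio job like the following. This is only a sketch from memory; the block size, queue depth, runtime, file size and path are not the exact parameters we used:

    # global defaults shared by the job below
    [global]
    ioengine=libaio
    direct=1
    time_based
    runtime=120
    size=4g
    filename=/root/fio-testfile

    # the concurrent read-write case: 50/50 random mix, moderate queue depth
    [mixed-rw]
    rw=randrw
    rwmixread=50
    bs=4k
    iodepth=16

Save it as something like mixed.fio and run "fio mixed.fio" inside the VM while an OSD is backfilling; the pure read or write variants (rw=randread / rw=randwrite) were the ones that kept limping along in our tests, while this mixed job stalled.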
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com