Hi Daniel,

> 
> After upgrading the cluster to Jewel, I changed our crushmap to use the
> newer straw2 algorithm, which resulted in a little data movement, but no
> problems at that stage.


I've not done that; instead, I switched the profile to optimal right away.
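For reference, this is roughly the command I used; it's the standard crush
tunables call, nothing exotic:

  # switch the cluster-wide crush tunables to the optimal (jewel) profile;
  # this is what kicked off the big data reshuffle on my cluster
  ceph osd crush tunables optimal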


> 
> Once the cluster had settled down again, I set tunables to optimal
> (hammer profile -> jewel profile), which has triggered between 50% and
> 70% misplaced PGs on our clusters. This is when the trouble started each
> time, and when we had cascading failures of VMs.
> 
> However, after performing hard shutdowns on the VMs and restarting them,
> they seemed to be OK.
> 
> At this stage, I have a strong suspicion that it is the introduction of
> "require_feature_tunables5 = 1" in the tunables. This seems to require
> all RADOS connections to be re-established.
> 


In my experience, shutting down the VM and restarting it didn't help. I waited 
30+ minutes for the VM to start, but it was still unable to do so.
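For what it's worth, I could confirm that the new requirement was active with
something along these lines (I'm going from memory, so treat the exact output
fields with caution):

  # print the crush tunables currently in effect; with the jewel/optimal
  # profile this should show the require_feature_tunables5 flag as set
  ceph osd crush show-tunables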

I've also noticed that it took a while for the VMs to start failing. Initially the 
I/O wait on the VMs went up only a little, and it slowly increased over the course 
of about an hour; by the end there was 100% iowait on all VMs. If the tunables 
change forcing all RADOS connections to be re-established were the cause, wouldn't 
I see iowait jumping to 100% pretty quickly? Also, I wasn't able to start any of my 
VMs until I rebooted one of my OSD / mon servers after the PGs had finished 
rebuilding.
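In case it is useful to anyone trying to reproduce this, the checks I was running
while the iowait climbed were roughly the following (osd.0 is just an example, and
the daemon commands need to be run on the host where that OSD lives):

  # cluster-wide view of any reported slow/blocked requests
  ceph health detail

  # per-OSD view of in-flight and recently completed operations
  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 dump_historic_ops

None of these showed anything obviously stuck, which matches what I said further
down about not finding slow/blocked requests in the logs.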

> 
> On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> Hi Oliver,
>> 
>> I am also seeing this as strange behaviour indeed! I was going through the
>> logs and I was not able to find any errors or issues. There were also no
>> slow/blocked requests that I could see during the recovery process.
>> 
>> Does anyone have an idea what could be the issue here? I don't want to shut
>> down all VMs every time there is a new release with updated tunable values.
>> 
>> 
>> Andrei
>> 
>> 
>> 
>> ----- Original Message -----
>>> From: "Oliver Dzombic" <i...@ip-interactive.de>
>>> To: "andrei" <and...@arhont.com>, "ceph-users" <ceph-users@lists.ceph.com>
>>> Sent: Sunday, 19 June, 2016 10:14:35
>>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>>> and
>>> client IO optimisations
>> 
>>> Hi,
>>>
>>> so far the key values for that are:
>>>
>>> osd_client_op_priority = 63 ( the default anyway, but I set it so I remember it )
>>> osd_recovery_op_priority = 1
>>>
>>>
>>> In addition i set:
>>>
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>>
> 
> 
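On a side note, for anyone wanting to try Oliver's throttling values without
restarting the OSDs, I believe they can be injected at runtime roughly like this
(just a sketch, and they should also go into ceph.conf so they survive restarts):

  # push the recovery/backfill throttling settings to all running OSDs
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'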
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
