Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Andrei Mikhailovsky
Hi Daniel,

Many thanks, I will keep this in mind while performing the updates in the 
future.

Note to documentation manager - perhaps it makes sense to add this solution as a 
note/tip to the Upgrade section of the release notes?


Andrei

- Original Message -
> From: "Daniel Swarbrick" 
> To: "ceph-users" 
> Cc: "ceph-devel" 
> Sent: Wednesday, 22 June, 2016 17:09:48
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and 
> client IO optimisations

> On 22/06/16 17:54, Andrei Mikhailovsky wrote:
>> Hi Daniel,
>> 
>> Many thanks for your useful tests and your results.
>> 
>> How much IO wait do you have on your client vms? Has it significantly 
>> increased or not?
>> 
> 
> Hi Andrei,
> 
> Bearing in mind that this cluster is tiny (four nodes, each with four
> OSDs), our metrics may not be that meaningful. However, on a VM that is
> running ElasticSearch, collecting logs from Graylog, we're seeing no
> more than about 5% iowait for a 5s period, and most of the time it's
> below 1%. This VM is really not writing a lot of data though.
> 
> The cluster as a whole is peaking at only about 1200 write op/s,
> according to ceph -w.
> 
> Executing a "sync" in a VM does of course have a noticeable delay due to
> the recovery happening in the background, but nothing is waiting for IO
> long enough to trigger the kernel's 120s timer / warning.
> 
> The recovery has been running for about four hours now, and is down to
> 20% misplaced objects. So far we have not had any clients block
> indefinitely, so I think the migration of VMs to Jewel-capable
> hypervisors did the trick.
> 
> Best,
> Daniel
> 


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Daniel Swarbrick
On 22/06/16 17:54, Andrei Mikhailovsky wrote:
> Hi Daniel,
> 
> Many thanks for your useful tests and your results.
> 
> How much IO wait do you have on your client vms? Has it significantly 
> increased or not?
> 

Hi Andrei,

Bearing in mind that this cluster is tiny (four nodes, each with four
OSDs), our metrics may not be that meaningful. However, on a VM that is
running ElasticSearch, collecting logs from Graylog, we're seeing no
more than about 5% iowait for a 5s period, and most of the time it's
below 1%. This VM is really not writing a lot of data though.

The cluster as a whole is peaking at only about 1200 write op/s,
according to ceph -w.

Executing a "sync" in a VM does of course have a noticeable delay due to
the recovery happening in the background, but nothing is waiting for IO
long enough to trigger the kernel's 120s timer / warning.
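
(For anyone who wants to watch this from inside a guest, a quick sketch, assuming
the sysstat package is installed; the 120s figure is simply the kernel's default
hung-task timeout.)

# per-device utilisation / iowait, refreshed every 5 seconds
iostat -x 5
# the threshold behind the "blocked for more than 120 seconds" warnings
cat /proc/sys/kernel/hung_task_timeout_secs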

The recovery has been running for about four hours now, and is down to
20% misplaced objects. So far we have not had any clients block
indefinitely, so I think the migration of VMs to Jewel-capable
hypervisors did the trick.

Best,
Daniel



Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Andrei Mikhailovsky
Hi Daniel,

Many thanks for your useful tests and your results.

How much IO wait do you have on your client vms? Has it significantly increased 
or not?

Many thanks

Andrei

- Original Message -
> From: "Daniel Swarbrick" 
> To: "ceph-users" 
> Cc: "ceph-devel" 
> Sent: Wednesday, 22 June, 2016 13:43:37
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and 
> client IO optimisations

> On 20/06/16 19:51, Gregory Farnum wrote:
>> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>>>
>>> At this stage, I have a strong suspicion that it is the introduction of
>>> "require_feature_tunables5 = 1" in the tunables. This seems to require
>>> all RADOS connections to be re-established.
>> 
>> Do you have any evidence of that besides the one restart?
>> 
>> I guess it's possible that we aren't kicking requests if the crush map
>> but not the rest of the osdmap changes, but I'd be surprised.
>> -Greg
> 
> I think the key fact to take note of is that we had long-running Qemu
> processes that had been started a few months ago, using Infernalis
> librbd shared libs.
> 
> If Infernalis had no concept of require_feature_tunables5, then it seems
> logical that these clients would block if the cluster were upgraded to
> Jewel and this tunable became mandatory.
> 
> I have just upgraded our fourth and final cluster to Jewel. Prior to
> applying optimal tunables, we upgraded our hypervisor nodes' librbd
> also, and migrated all VMs at least once, to start a fresh Qemu process
> for each (using the updated librbd).
> 
> We're seeing ~65% data movement due to chooseleaf_stable 0 => 1, but
> other than that, so far so good. No clients are blocking indefinitely.
> 


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Daniel Swarbrick
On 20/06/16 19:51, Gregory Farnum wrote:
> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>>
>> At this stage, I have a strong suspicion that it is the introduction of
>> "require_feature_tunables5 = 1" in the tunables. This seems to require
>> all RADOS connections to be re-established.
> 
> Do you have any evidence of that besides the one restart?
> 
> I guess it's possible that we aren't kicking requests if the crush map
> but not the rest of the osdmap changes, but I'd be surprised.
> -Greg

I think the key fact to take note of is that we had long-running Qemu
processes that had been started a few months ago, using Infernalis
librbd shared libs.

If Infernalis had no concept of require_feature_tunables5, then it seems
logical that these clients would block if the cluster were upgraded to
Jewel and this tunable became mandatory.

I have just upgraded our fourth and final cluster to Jewel. Prior to
applying optimal tunables, we upgraded our hypervisor nodes' librbd
also, and migrated all VMs at least once, to start a fresh Qemu process
for each (using the updated librbd).

We're seeing ~65% data movement due to chooseleaf_stable 0 => 1, but
other than that, so far so good. No clients are blocking indefinitely.
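
For anyone planning the same route, the general shape of it is sketched below.
These are not the literal commands we ran; package names assume Debian/Ubuntu
hypervisors, and the libvirt domain / host names are placeholders:

# on each hypervisor, pull in the Jewel client libraries first
apt-get update && apt-get install --only-upgrade librbd1 librados2

# live-migrate every VM at least once, so each Qemu process is restarted
# against the freshly installed librbd
virsh migrate --live <domain> qemu+ssh://<other-hypervisor>/system

# only once all clients run Jewel-capable librbd, switch the crush profile
ceph osd crush tunables optimal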



Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Andrei Mikhailovsky
Hi Josef, 

are you saying that there is no ceph config option that can be used to keep 
IO flowing to the vms while the ceph cluster is doing a heavy data move? I am really 
struggling to believe this could be the case. I've read so much about 
ceph being the solution to modern storage needs and that all of its 
components were designed to be redundant, to provide always-on availability 
of the storage in case of upgrades and hardware failures. Has something been 
overlooked? 

Also, judging by the low number of people with similar issues, I suspect that 
there are a lot of ceph users who are still using a non-optimal profile, either 
because they don't want to risk the downtime or simply because they don't know about 
the latest crush tunables. 

For any future updates, should I be scheduling a maintenance day or two and 
shutting down all vms prior to upgrading the cluster? That seems like the backwards 
approach of the 90s and early 2000s ((( 

Cheers 

Andrei 

> From: "Josef Johansson" 
> To: "Gregory Farnum" , "Daniel Swarbrick"
> 
> Cc: "ceph-users" , "ceph-devel"
> 
> Sent: Monday, 20 June, 2016 20:22:02
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and
> client IO optimisations

> Hi,

> People ran into this when there were some changes in tunables that caused
> 70-100% movement; the solution was to find out which values changed and
> increment them in the smallest steps possible.

> I've found that with major rearrangement in ceph the VMs do not necessarily
> survive ( last time on an ssd cluster ), so my assumption is that linux and its
> timeouts don't cope well. Which is true with any other storage backend out there ;)

> Regards,
> Josef
> On Mon, 20 Jun 2016, 19:51 Gregory Farnum, < gfar...@redhat.com > wrote:

>> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>> < daniel.swarbr...@profitbricks.com > wrote:
>> > We have just updated our third cluster from Infernalis to Jewel, and are
>> > experiencing similar issues.

>> > We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
>> > have seen a lot of D-state processes and even jbd/2 timeouts and kernel
>> > stack traces inside the guests. At first I thought the VMs were being
>> > starved of IO, but this is still happening after throttling back the
>> > recovery with:

>> > osd_max_backfills = 1
>> > osd_recovery_max_active = 1
>> > osd_recovery_op_priority = 1

>> > After upgrading the cluster to Jewel, I changed our crushmap to use the
>> > newer straw2 algorithm, which resulted in a little data movement, but no
>> > problems at that stage.

>> > Once the cluster had settled down again, I set tunables to optimal
>> > (hammer profile -> jewel profile), which has triggered between 50% and
>> > 70% misplaced PGs on our clusters. This is when the trouble started each
>> > time, and when we had cascading failures of VMs.

>> > However, after performing hard shutdowns on the VMs and restarting them,
>> > they seemed to be OK.

>> > At this stage, I have a strong suspicion that it is the introduction of
>> > "require_feature_tunables5 = 1" in the tunables. This seems to require
>> > all RADOS connections to be re-established.

>> Do you have any evidence of that besides the one restart?

>> I guess it's possible that we aren't kicking requests if the crush map
>> but not the rest of the osdmap changes, but I'd be surprised.
>> -Greg



>> > On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> >> Hi Oliver,

>>>> I am also seeing this as a strange behaviour indeed! I was going through the
>>>> logs and I was not able to find any errors or issues. There were also no
>>>> slow/blocked requests that I could see during the recovery process.

>>>> Does anyone have an idea what could be the issue here? I don't want to shut down
>>>> all vms every time there is a new release with updated tunable values.


>> >> Andrei



>> >> - Original Message -
>> >>> From: "Oliver Dzombic" < i...@ip-interactive.de >
>> >>> To: "andrei" < and...@arhont.com >, "ceph-users" < 
>> >>> ceph-users@lists.ceph.com >
>> >>> Sent: Sunday, 19 June, 2016 10:14:35
>>>>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>>>>> and
>> >>> client IO optimisations

>> >>> Hi,

>> >>> so 

Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Andrei Mikhailovsky
Hi Daniel,


> 
> After upgrading the cluster to Jewel, I changed our crushmap to use the
> newer straw2 algorithm, which resulted in a little data movement, but no
> problems at that stage.


I've not done that; instead I switched the profile to optimal right away.


> 
> Once the cluster had settled down again, I set tunables to optimal
> (hammer profile -> jewel profile), which has triggered between 50% and
> 70% misplaced PGs on our clusters. This is when the trouble started each
> time, and when we had cascading failures of VMs.
> 
> However, after performing hard shutdowns on the VMs and restarting them,
> they seemed to be OK.
> 
> At this stage, I have a strong suspicion that it is the introduction of
> "require_feature_tunables5 = 1" in the tunables. This seems to require
> all RADOS connections to be re-established.
> 


In my experience, shutting down the vm and restarting didn't help. I waited 
about 30+ minutes for the vm to start, but it was still unable to start.

I've also noticed that it took a while for vms to start failing: initially the 
IO wait on the vms went up just a bit and it slowly increased over the 
course of about an hour. At the end, there was 100% iowait on all vms. If this 
were the case, wouldn't I see iowait jumping to 100% pretty quickly? Also, I 
wasn't able to start any of my vms until I rebooted one of my osd / mon 
servers after the PGs had finished rebuilding.

> 
> On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> Hi Oliver,
>> 
>> I am also seeing this as a strange behaviour indeed! I was going through the
>> logs and I was not able to find any errors or issues. There were also no
>> slow/blocked requests that I could see during the recovery process.
>> 
>> Does anyone have an idea what could be the issue here? I don't want to shut down
>> all vms every time there is a new release with updated tunable values.
>> 
>> 
>> Andrei
>> 
>> 
>> 
>> ----- Original Message -
>>> From: "Oliver Dzombic" 
>>> To: "andrei" , "ceph-users" 
>>> Sent: Sunday, 19 June, 2016 10:14:35
>>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>>> and
>>> client IO optimisations
>> 
>>> Hi,
>>>
>>> so far the key values for that are:
>>>
>>> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
>>> osd_recovery_op_priority = 1
>>>
>>>
>>> In addition i set:
>>>
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>>
> 
> 


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Josef Johansson
Hi,

People ran into this when there were some changes in tunables that caused
70-100% movement; the solution was to find out which values changed and
increment them in the smallest steps possible.
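
If you want to go that route, one way (a sketch, untested here, with placeholder
file names) is to edit one tunable at a time in the decompiled crush map rather
than jumping straight to the optimal profile:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add or edit a single tunable line, e.g. "tunable chooseleaf_stable 1",
# then recompile, inject, and let the resulting backfill finish before the next step
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new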

I've found that with major rearrangement in ceph the VMs do not
necessarily survive ( last time on an ssd cluster ), so my assumption is that
linux and its timeouts don't cope well. Which is true with any other storage
backend out there ;)

Regards,
Josef

On Mon, 20 Jun 2016, 19:51 Gregory Farnum,  wrote:

> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>  wrote:
> > We have just updated our third cluster from Infernalis to Jewel, and are
> > experiencing similar issues.
> >
> > We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
> > have seen a lot of D-state processes and even jbd/2 timeouts and kernel
> > stack traces inside the guests. At first I thought the VMs were being
> > starved of IO, but this is still happening after throttling back the
> > recovery with:
> >
> > osd_max_backfills = 1
> > osd_recovery_max_active = 1
> > osd_recovery_op_priority = 1
> >
> > After upgrading the cluster to Jewel, I changed our crushmap to use the
> > newer straw2 algorithm, which resulted in a little data movement, but no
> > problems at that stage.
> >
> > Once the cluster had settled down again, I set tunables to optimal
> > (hammer profile -> jewel profile), which has triggered between 50% and
> > 70% misplaced PGs on our clusters. This is when the trouble started each
> > time, and when we had cascading failures of VMs.
> >
> > However, after performing hard shutdowns on the VMs and restarting them,
> > they seemed to be OK.
> >
> > At this stage, I have a strong suspicion that it is the introduction of
> > "require_feature_tunables5 = 1" in the tunables. This seems to require
> > all RADOS connections to be re-established.
>
> Do you have any evidence of that besides the one restart?
>
> I guess it's possible that we aren't kicking requests if the crush map
> but not the rest of the osdmap changes, but I'd be surprised.
> -Greg
>
> >
> >
> > On 20/06/16 13:54, Andrei Mikhailovsky wrote:
> >> Hi Oliver,
> >>
> >> I am also seeing this as a strange behaviour indeed! I was going
> through the logs and I was not able to find any errors or issues. There were
> also no slow/blocked requests that I could see during the recovery process.
> >>
> >> Does anyone have an idea what could be the issue here? I don't want to
> shut down all vms every time there is a new release with updated tunable
> values.
> >>
> >>
> >> Andrei
> >>
> >>
> >>
> >> - Original Message -
> >>> From: "Oliver Dzombic" 
> >>> To: "andrei" , "ceph-users" <
> ceph-users@lists.ceph.com>
> >>> Sent: Sunday, 19 June, 2016 10:14:35
> >>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel
> tunables and client IO optimisations
> >>
> >>> Hi,
> >>>
> >>> so far the key values for that are:
> >>>
> >>> osd_client_op_priority = 63 ( anyway default, but i set it to remember
> it )
> >>> osd_recovery_op_priority = 1
> >>>
> >>>
> >>> In addition i set:
> >>>
> >>> osd_max_backfills = 1
> >>> osd_recovery_max_active = 1
> >>>
> >
> >


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Gregory Farnum
On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
 wrote:
> We have just updated our third cluster from Infernalis to Jewel, and are
> experiencing similar issues.
>
> We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
> have seen a lot of D-state processes and even jbd/2 timeouts and kernel
> stack traces inside the guests. At first I thought the VMs were being
> starved of IO, but this is still happening after throttling back the
> recovery with:
>
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 1
>
> After upgrading the cluster to Jewel, I changed our crushmap to use the
> newer straw2 algorithm, which resulted in a little data movement, but no
> problems at that stage.
>
> Once the cluster had settled down again, I set tunables to optimal
> (hammer profile -> jewel profile), which has triggered between 50% and
> 70% misplaced PGs on our clusters. This is when the trouble started each
> time, and when we had cascading failures of VMs.
>
> However, after performing hard shutdowns on the VMs and restarting them,
> they seemed to be OK.
>
> At this stage, I have a strong suspicion that it is the introduction of
> "require_feature_tunables5 = 1" in the tunables. This seems to require
> all RADOS connections to be re-established.

Do you have any evidence of that besides the one restart?

I guess it's possible that we aren't kicking requests if the crush map
but not the rest of the osdmap changes, but I'd be surprised.
-Greg

>
>
> On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> Hi Oliver,
>>
>> I am also seeing this as a strange behaviour indeed! I was going through 
>> the logs and I was not able to find any errors or issues. There were also no 
>> slow/blocked requests that I could see during the recovery process.
>>
>> Does anyone have an idea what could be the issue here? I don't want to shut 
>> down all vms every time there is a new release with updated tunable values.
>>
>>
>> Andrei
>>
>>
>>
>> ----- Original Message -----
>>> From: "Oliver Dzombic" 
>>> To: "andrei" , "ceph-users" 
>>> Sent: Sunday, 19 June, 2016 10:14:35
>>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>>> and client IO optimisations
>>
>>> Hi,
>>>
>>> so far the key values for that are:
>>>
>>> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
>>> osd_recovery_op_priority = 1
>>>
>>>
>>> In addition i set:
>>>
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>>
>
>


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Daniel Swarbrick
We have just updated our third cluster from Infernalis to Jewel, and are
experiencing similar issues.

We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
have seen a lot of D-state processes and even jbd/2 timeouts and kernel
stack traces inside the guests. At first I thought the VMs were being
starved of IO, but this is still happening after throttling back the
recovery with:

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
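
(These can also be applied to a running cluster without restarting the OSDs,
e.g. via injectargs; a sketch with the same values:)

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'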

After upgrading the cluster to Jewel, I changed our crushmap to use the
newer straw2 algorithm, which resulted in a little data movement, but no
problems at that stage.

Once the cluster had settled down again, I set tunables to optimal
(hammer profile -> jewel profile), which has triggered between 50% and
70% misplaced PGs on our clusters. This is when the trouble started each
time, and when we had cascading failures of VMs.

However, after performing hard shutdowns on the VMs and restarting them,
they seemed to be OK.

At this stage, I have a strong suspicion that it is the introduction of
"require_feature_tunables5 = 1" in the tunables. This seems to require
all RADOS connections to be re-established.
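
(A quick way to check what the crush map currently requires, and whether
require_feature_tunables5 is set after switching profiles, is to dump the
tunables; a sketch:)

ceph osd crush show-tunables

On a Jewel cluster with optimal tunables, the output should include a
require_feature_tunables5 field set to 1.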


On 20/06/16 13:54, Andrei Mikhailovsky wrote:
> Hi Oliver,
> 
> I am also seeing this as a strange behaviour indeed! I was going through the 
> logs and I was not able to find any errors or issues. There were also no 
> slow/blocked requests that I could see during the recovery process.
> 
> Does anyone have an idea what could be the issue here? I don't want to shut 
> down all vms every time there is a new release with updated tunable values.
> 
> 
> Andrei
> 
> 
> 
> - Original Message -
>> From: "Oliver Dzombic" 
>> To: "andrei" , "ceph-users" 
>> Sent: Sunday, 19 June, 2016 10:14:35
>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>> and client IO optimisations
> 
>> Hi,
>>
>> so far the key values for that are:
>>
>> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
>> osd_recovery_op_priority = 1
>>
>>
>> In addition i set:
>>
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>>




Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Andrei Mikhailovsky
Hi Oliver,

I am also seeing this as a strange behaviour indeed! I was going through the 
logs and I was not able to find any errors or issues. There were also no 
slow/blocked requests that I could see during the recovery process.

Does anyone have an idea what could be the issue here? I don't want to shut down 
all vms every time there is a new release with updated tunable values.


Andrei



- Original Message -
> From: "Oliver Dzombic" 
> To: "andrei" , "ceph-users" 
> Sent: Sunday, 19 June, 2016 10:14:35
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and 
> client IO optimisations

> Hi,
> 
> so far the key values for that are:
> 
> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
> osd_recovery_op_priority = 1
> 
> 
> In addition i set:
> 
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> 
> 
> ---
> 
> 
> But according to your settings it's all ok.
> 
> According to what you described, the problem was not the backfilling but
> something else inside the cluster. Maybe something was blocked somewhere
> and only a reset could help. The logs might have given an answer
> about that.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 18.06.2016 um 18:04 schrieb Andrei Mikhailovsky:
>> Hello ceph users,
>> 
>> I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and
>> then 10.2.2). The cluster was running okay after the upgrade. I've
>> decided to use the optimal tunables for Jewel as the ceph status was
>> complaining about the straw version and my cluster settings were not
>> optimal for jewel. I've not touched tunables since the Firefly release I
>> think. After reading the release notes and the tunables section I have
>> decided to set the crush tunables value to optimal. Taking into account
>> that a few weeks ago I have done a reweight-by-utilization which has
>> moved around about 8% of my cluster objects. This process has not caused
>> any downtime and IO to the virtual machines was available. I have also
>> altered several settings to prioritise client IO in case of repair and
>> backfilling (see config show output below).
>> 
>> Right, so, after i've set tunables to optimal value my cluster indicated
>> that it needs to move around 61% of data in the cluster. The process
>> started and I was seeing speeds of between 800MB/s - 1.5GB/s for
>> recovery. My cluster is pretty small (3 osd servers with 30 osds in
>> total). The load on the osd servers was pretty low. I was seeing a
>> typical load of 4 spiking to around 10. The IO wait values on the osd
>> servers were also pretty reasonable - around 5-15%. There were around
>> 10-15 backfilling processes.
>> 
>> About 10 minutes after the optimal tunables were set i've noticed that
>> IO wait on the vms started to increase. Initially it was 15%, after
>> another 10 mins or so it increased to around 50% and about 30-40 minutes
>> later the iowait became 95-100% on all vms. Shortly after that the vms
>> showed a bunch of hung tasks in dmesg output and shortly stopped
>> responding altogether. This kind of behaviour didn't happen after
>> doing reweight-by-utilization, which i've done a few weeks prior. The
>> vms IO wait during the reweighting was around 15-20% and there were no
>> hanged tasks and all vms were running pretty well.
>> 
>> I wasn't sure how to resolve the problem. On one hand I know that
>> recovery and backfilling cause extra load on the cluster, but it should
>> never break client IO. After all, this seems to negate one of the key
>> points behind ceph - a resilient storage cluster. Looking at the ceph -w
>> output the client IO has decreased to 0-20 IOPs, whereas a typical load
>> that I see at that time of the day is around 700-1000 IOPs.
>> 
>> The strange thing is that after the cluster has finished with data move
>> (it took around 11 hours) the client IO was still not available! I was
>> not able to start any new vms despite having OK health status and all
>> PGs in active + clean state. This was pretty strange. All osd servers
>> had almost 0 load, all PGs are active + clean, all osds are up and
>> all mons are up, yet no client IO.

Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-19 Thread Oliver Dzombic
Hi,

so far the key values for that are:

osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
osd_recovery_op_priority = 1


In addition i set:

osd_max_backfills = 1
osd_recovery_max_active = 1
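
(If you want to persist those, in ceph.conf they would sit under the [osd]
section; a minimal sketch:)

[osd]
osd_client_op_priority = 63
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_max_active = 1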


---


But according to your settings it's all ok.

According to what you described, the problem was not the backfilling but
something else inside the cluster. Maybe something was blocked somewhere
and only a reset could help. The logs might have given an answer
about that.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 18.06.2016 um 18:04 schrieb Andrei Mikhailovsky:
> Hello ceph users,
> 
> I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and
> then 10.2.2). The cluster was running okay after the upgrade. I've
> decided to use the optimal tunables for Jewel as the ceph status was
> complaining about the straw version and my cluster settings were not
> optimal for jewel. I've not touched tunables since the Firefly release I
> think. After reading the release notes and the tunables section I have
> decided to set the crush tunables value to optimal. Taking into account
> that a few weeks ago I have done a reweight-by-utilization which has
> moved around about 8% of my cluster objects. This process has not caused
> any downtime and IO to the virtual machines was available. I have also
> altered several settings to prioritise client IO in case of repair and
> backfilling (see config show output below).
> 
> Right, so, after i've set tunables to optimal value my cluster indicated
> that it needs to move around 61% of data in the cluster. The process
> started and I was seeing speeds of between 800MB/s - 1.5GB/s for
> recovery. My cluster is pretty small (3 osd servers with 30 osds in
> total). The load on the osd servers was pretty low. I was seeing a
> typical load of 4 spiking to around 10. The IO wait values on the osd
> servers were also pretty reasonable - around 5-15%. There were around
> 10-15 backfilling processes.
> 
> About 10 minutes after the optimal tunables were set i've noticed that
> IO wait on the vms started to increase. Initially it was 15%, after
> another 10 mins or so it increased to around 50% and about 30-40 minutes
> later the iowait became 95-100% on all vms. Shortly after that the vms
> showed a bunch of hung tasks in dmesg output and shortly stopped
> responding altogether. This kind of behaviour didn't happen after
> doing reweight-by-utilization, which i've done a few weeks prior. The
> vms IO wait during the reweighting was around 15-20% and there were no
> hanged tasks and all vms were running pretty well.
> 
> I wasn't sure how to resolve the problem. On one hand I know that
> recovery and backfilling cause extra load on the cluster, but it should
> never break client IO. After all, this seems to negate one of the key
> points behind ceph - a resilient storage cluster. Looking at the ceph -w
> output the client IO has decreased to 0-20 IOPs, whereas a typical load
> that I see at that time of the day is around 700-1000 IOPs.
> 
> The strange thing is that after the cluster has finished with data move
> (it took around 11 hours) the client IO was still not available! I was
> not able to start any new vms despite having OK health status and all
> PGs in active + clean state. This was pretty strange. All osd servers
> having almost 0 load, all PGs are active + clean, all osds are up and
> all mons are up, yet no client IO. The cluster became operational once
> again after a reboot of one of the osd servers, which seem to have
> brought the cluster to life.
> 
> My question to the community is what ceph options should be implemented
> to make sure the client IO is _always_ available and has the highest
> priority during any recovery/migration/backfilling operations?
> 
> My current settings, which i've gathered over the years from the advice
> of mailing list and irc members are:
> 
> osd_recovery_max_chunk = 8388608
> osd_recovery_op_priority = 1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_threads = 1
> osd_disk_thread_ioprio_priority = 7
> osd_disk_thread_ioprio_class = idle
> osd_scrub_chunk_min = 1
> osd_scrub_chunk_max = 5
> osd_deep_scrub_stride = 1048576
> mon_osd_min_down_reporters = 6
> mon_osd_report_timeout = 1800
> mon_osd_min_down_reports = 7
> osd_heartbeat_grace = 60
> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
> osd_mkfs_options_xfs = -f -i size=2048
> filestore_max_sync_interval = 15
> filestore_op_threads = 8
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> osd_disk_threads = 8
> osd_op_threads = 8
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_
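
For a situation like the one described above, where the cluster reports HEALTH_OK
but clients still see no IO, a few commands can help narrow things down (a sketch;
the osd and mon ids are placeholders, and the admin socket has to be queried on
the node hosting the respective daemon):

ceph health detail
# on an OSD node: look for client requests stuck in flight
ceph daemon osd.<id> dump_ops_in_flight
# on a monitor node: list connected client sessions and their feature bits,
# to spot clients that pre-date the required tunables
ceph daemon mon.<id> sessions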