Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Nick Fisk
Classic buffer bloat. The deletion process is probably going as fast as it can 
until the filestore queues fill up; only then will it start to back off. The 
problem is that with a queue of 500 ops, any disk is going to be busy for 
thousands of milliseconds trying to empty it - at, say, 10ms per random write on 
a SATA disk, a 500-op queue takes around 5 seconds to drain.

 

Shorter queues may run the risk that you can’t maximise throughput of the OSD, 
but I wonder if 500 is too high? Interested in what you see when you drop the 
queue limit.
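
For the test itself, something along these lines should be all it takes - just a 
sketch, assuming the Jewel option name filestore_queue_max_ops and that it is 
picked up at runtime (injectargs will tell you if a restart is needed), with 500 
being your current value to revert to:

    # Drop the filestore queue depth on all OSDs at runtime
    ceph tell osd.* injectargs '--filestore_queue_max_ops 250'

    # Put it back after the test
    ceph tell osd.* injectargs '--filestore_queue_max_ops 500'

    # To persist it across restarts, set the same thing under [osd] in ceph.conf:
    #   filestore queue max ops = 250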

 

From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au] 
Sent: 23 September 2016 11:08
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: Snap delete performance impact

 

I am also seeing if reducing the filestore queue ops limit from 500 to 250 helps.  
On my graphs I can see the filestore ops queue go from 1 or 2 to 500 for the 
period of the load.  I am looking to see if throttling it down helps spread out 
the load.  The normal ops load is nowhere near the current limit.


Sent from my SAMSUNG Galaxy S7 on the Telstra Mobile Network


-------- Original message --------

From: Nick Fisk <n...@fisk.me.uk>

Date: 23/09/2016 7:26 PM (GMT+10:00) 

To: Adrian Saul <adrian.s...@tpgtelecom.com.au>, ceph-users@lists.ceph.com

Subject: RE: Snap delete performance impact 

 

Looking back through my graphs from when this happened to me, I can see that the 
queue depth on the disks was as high as 30 during the period when the snapshot 
was removed. That would explain the high latencies - the disks are literally 
having fits trying to jump all over the place.

I need to test with the higher osd_snap_trim_sleep to see if that helps. What 
I'm interested in finding out is why so much disk activity is required for 
deleting an object. It feels to me that the process is async, in that Ceph will 
quite happily flood the Filestore with delete requests without any feedback to 
the higher layers.
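
For the sleep test, bumping it on the fly should be enough for an experiment - a 
sketch only, the value is illustrative and 0 is the shipped default to revert to:

    # Throttle snap trimming on all OSDs for the duration of the test
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 10'

    # Revert afterwards
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0'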


> -Original Message-
> From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au]
> Sent: 23 September 2016 10:04
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
> 
> 
> I did some observation today - with the reduced filestore_op_threads it seems 
> to ride out the storm better, not ideal but better.
> 
> The main issue is that for the 10 minutes from the moment the rbd snap rm 
> command is issued, the SATA systems in my configuration
> load up massively on disk IO and I think this is what is rolling on to all 
> other issues (OSDs unresponsive, queue backlogs). The disks all
> go 100% busy - the average SATA write latency goes from 14ms to 250ms.  I was 
> observing disks doing 400, 700 and higher service
> times.  After those few minutes it tapers down and goes back to normal.
> 
> These are all ST6000VN0001 disks - anyone aware of anything that might 
> explain this sort of behaviour?  It seems odd that even if the
> disks were hit with high write traffic (average of 50 write IOPS going up to 
> 270-300 during this activity) that the service times would
> blow out that much.
> 
> Cheers,
>  Adrian
> 
> 
> 
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Thursday, 22 September 2016 7:15 PM
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > I tried 2 this afternoon and saw the same results.  Essentially the
> > disks appear to go to 100% busy doing very small but high numbers of IO and 
> > incur massive
> > service times (300-400ms).   During that period I get blocked request errors
> > continually.
> >
> > I suspect part of that might be that the SATA servers had
> > filestore_op_threads set too high and were hammering the disks with too
> > much concurrent work.  They had inherited a setting targeted at
> > SSDs, so I have wound that back to defaults on those machines to see if it
> > makes a difference.
> >
> > But I suspect going by the disk activity there is a lot of very small
> > FS metadata updates going on and that is what is killing it.
> >
> > Cheers,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > Sent: Thursday, 22 September 2016 7:06 PM
> > > To: Adrian Saul; ceph-users@lists.ceph.com
> > > Subject: RE: Snap delete performance impact
> > >
> > > Hi Adrian,
> > >
> >

Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Adrian Saul
I am also seeing if reducing the filestore queue ops limit from 500 to 250 helps.  
On my graphs I can see the filestore ops queue go from 1 or 2 to 500 for the 
period of the load.  I am looking to see if throttling it down helps spread out 
the load.  The normal ops load is nowhere near the current limit.
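
For reference, the queue depth can also be read straight off the OSD admin 
socket - a sketch only, and it assumes the counters are exposed as 
op_queue_ops/op_queue_bytes under the filestore section, which is worth 
confirming on your version:

    # Pull the current filestore queue depth for one OSD (osd.0 is an example)
    ceph daemon osd.0 perf dump | python -m json.tool | grep -E '"op_queue_(ops|bytes)"'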



Sent from my SAMSUNG Galaxy S7 on the Telstra Mobile Network


-------- Original message --------
From: Nick Fisk 
Date: 23/09/2016 7:26 PM (GMT+10:00)
To: Adrian Saul , ceph-users@lists.ceph.com
Subject: RE: Snap delete performance impact

Looking back through my graphs from when this happened to me, I can see that the 
queue depth on the disks was as high as 30 during the period when the snapshot 
was removed. That would explain the high latencies - the disks are literally 
having fits trying to jump all over the place.

I need to test with the higher osd_snap_trim_sleep to see if that helps. What 
I'm interested in finding out is why so much disk activity is required for 
deleting an object. It feels to me that the process is async, in that Ceph will 
quite happily flood the Filestore with delete requests without any feedback to 
the higher layers.


> -Original Message-
> From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au]
> Sent: 23 September 2016 10:04
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
>
>
> I did some observation today - with the reduced filestore_op_threads it seems 
> to ride out the storm better, not ideal but better.
>
> The main issue is that for the 10 minutes from the moment the rbd snap rm 
> command is issued, the SATA systems in my configuration
> load up massively on disk IO and I think this is what is rolling on to all 
> other issues (OSDs unresponsive, queue backlogs). The disks all
> go 100% busy - the average SATA write latency goes from 14ms to 250ms.  I was 
> observing disks doing 400, 700 and higher service
> times.  After those few minutes it tapers down and goes back to normal.
>
> These are all ST6000VN0001 disks - anyone aware of anything that might 
> explain this sort of behaviour?  It seems odd that even if the
> disks were hit with high write traffic (average of 50 write IOPS going up to 
> 270-300 during this activity) that the service times would
> blow out that much.
>
> Cheers,
>  Adrian
>
>
>
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Thursday, 22 September 2016 7:15 PM
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > I tried 2 this afternoon and saw the same results.  Essentially the
> > disks appear to go to 100% busy doing very small but high numbers of IO and 
> > incur massive
> > service times (300-400ms).   During that period I get blocked request errors
> > continually.
> >
> > I suspect part of that might be that the SATA servers had
> > filestore_op_threads set too high and were hammering the disks with too
> > much concurrent work.  They had inherited a setting targeted at
> > SSDs, so I have wound that back to defaults on those machines to see if it
> > makes a difference.
> >
> > But I suspect going by the disk activity there is a lot of very small
> > FS metadata updates going on and that is what is killing it.
> >
> > Cheers,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > Sent: Thursday, 22 September 2016 7:06 PM
> > > To: Adrian Saul; ceph-users@lists.ceph.com
> > > Subject: RE: Snap delete performance impact
> > >
> > > Hi Adrian,
> > >
> > > I have also hit this recently and have since increased the
> > > osd_snap_trim_sleep to try and stop this from happening again.
> > > However, I haven't had an opportunity to actually try and break it
> > > again yet, but your mail seems to suggest it might not be the silver
> > > bullet I
> > was looking for.
> > >
> > > I'm wondering if the problem is not with the removal of the
> > > snapshot, but actually down to the amount of object deletes that
> > > happen, as I see similar results when doing fstrim's or deleting
> > > RBD's. Either way I agree that a settable throttle to allow it to
> > > process more slowly would be a
> > good addition.
> > > Have you tried that value set to higher than 1, maybe 10?
> > >
> > > Nick
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-bou

Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Nick Fisk
Looking back through my graphs from when this happened to me, I can see that the 
queue depth on the disks was as high as 30 during the period when the snapshot 
was removed. That would explain the high latencies - the disks are literally 
having fits trying to jump all over the place.

I need to test with the higher osd_snap_trim_sleep to see if that helps. What 
I'm interested in finding out is why so much disk activity is required for 
deleting an object. It feels to me that the process is async, in that Ceph will 
quite happily flood the Filestore with delete requests without any feedback to 
the higher layers.


> -Original Message-
> From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au]
> Sent: 23 September 2016 10:04
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
> 
> 
> I did some observation today - with the reduced filestore_op_threads it seems 
> to ride out the storm better, not ideal but better.
> 
> The main issue is that for the 10 minutes from the moment the rbd snap rm 
> command is issued, the SATA systems in my configuration
> load up massively on disk IO and I think this is what is rolling on to all 
> other issues (OSDs unresponsive, queue backlogs). The disks all
> go 100% busy - the average SATA write latency goes from 14ms to 250ms.  I was 
> observing disks doing 400, 700 and higher service
> times.  After those few minutes it tapers down and goes back to normal.
> 
> These are all ST6000VN0001 disks - anyone aware of anything that might 
> explain this sort of behaviour?  It seems odd that even if the
> disks were hit with high write traffic (average of 50 write IOPS going up to 
> 270-300 during this activity) that the service times would
> blow out that much.
> 
> Cheers,
>  Adrian
> 
> 
> 
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Thursday, 22 September 2016 7:15 PM
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > I tried 2 this afternoon and saw the same results.  Essentially the
> > disks appear to go to 100% busy doing very small but high numbers of IO and 
> > incur massive
> > service times (300-400ms).   During that period I get blocked request errors
> > continually.
> >
> > I suspect part of that might be that the SATA servers had
> > filestore_op_threads set too high and were hammering the disks with too
> > much concurrent work.  They had inherited a setting targeted at
> > SSDs, so I have wound that back to defaults on those machines to see if it
> > makes a difference.
> >
> > But I suspect going by the disk activity there is a lot of very small
> > FS metadata updates going on and that is what is killing it.
> >
> > Cheers,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > Sent: Thursday, 22 September 2016 7:06 PM
> > > To: Adrian Saul; ceph-users@lists.ceph.com
> > > Subject: RE: Snap delete performance impact
> > >
> > > Hi Adrian,
> > >
> > > I have also hit this recently and have since increased the
> > > osd_snap_trim_sleep to try and stop this from happening again.
> > > However, I haven't had an opportunity to actually try and break it
> > > again yet, but your mail seems to suggest it might not be the silver
> > > bullet I
> > was looking for.
> > >
> > > I'm wondering if the problem is not with the removal of the
> > > snapshot, but actually down to the amount of object deletes that
> > > happen, as I see similar results when doing fstrim's or deleting
> > > RBD's. Either way I agree that a settable throttle to allow it to
> > > process more slowly would be a
> > good addition.
> > > Have you tried that value set to higher than 1, maybe 10?
> > >
> > > Nick
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Adrian Saul
> > > > Sent: 22 September 2016 05:19
> > > > To: 'ceph-users@lists.ceph.com' 
> > > > Subject: Re: [ceph-users] Snap delete performance impact
> > > >
> > > >
> > > > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > > > seems to have tempered some of the issues but it's still bad enough
> > > > that NFS storage off RBD volum

Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Adrian Saul

I did some observing today - with the reduced filestore_op_threads it seems 
to ride out the storm better; not ideal, but better.

The main issue is that for the 10 minutes from the moment the rbd snap rm 
command is issued, the SATA systems in my configuration load up massively on 
disk IO and I think this is what is rolling on to all other issues (OSDs 
unresponsive, queue backlogs). The disks all go 100% busy - the average SATA 
write latency goes from 14ms to 250ms.  I was observing disks doing 400, 700 
and higher service times.  After those few minutes it tapers down and goes back 
to normal.

These are all ST6000VN0001 disks - is anyone aware of anything that might explain 
this sort of behaviour?  It seems odd that, even with the disks being hit with higher 
write traffic (an average of 50 write IOPS going up to 270-300 during this 
activity), the service times would blow out that much.
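
For anyone wanting to watch this themselves, plain iostat during the rbd snap rm 
window should be enough to see it - nothing Ceph-specific, just a sketch:

    # Extended per-device stats every second; watch await (service time, ms),
    # avgqu-sz (queue depth) and %util while the snapshot is being removed
    iostat -x 1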

Cheers,
 Adrian






> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Thursday, 22 September 2016 7:15 PM
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Snap delete performance impact
>
>
> I tried 2 this afternoon and saw the same results.  Essentially the disks 
> appear
> to go to 100% busy doing very small but high numbers of IO and incur massive
> service times (300-400ms).   During that period I get blocked request errors
> continually.
>
> I suspect part of that might be that the SATA servers had filestore_op_threads
> set too high and were hammering the disks with too much concurrent work.
> They had inherited a setting targeted at SSDs, so I have wound that back to
> defaults on those machines to see if it makes a difference.
>
> But I suspect going by the disk activity there is a lot of very small FS 
> metadata
> updates going on and that is what is killing it.
>
> Cheers,
>  Adrian
>
>
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Thursday, 22 September 2016 7:06 PM
> > To: Adrian Saul; ceph-users@lists.ceph.com
> > Subject: RE: Snap delete performance impact
> >
> > Hi Adrian,
> >
> > I have also hit this recently and have since increased the
> > osd_snap_trim_sleep to try and stop this from happening again.
> > However, I haven't had an opportunity to actually try and break it
> > again yet, but your mail seems to suggest it might not be the silver bullet 
> > I
> was looking for.
> >
> > I'm wondering if the problem is not with the removal of the snapshot,
> > but actually down to the amount of object deletes that happen, as I
> > see similar results when doing fstrim's or deleting RBD's. Either way
> > I agree that a settable throttle to allow it to process more slowly would 
> > be a
> good addition.
> > Have you tried that value set to higher than 1, maybe 10?
> >
> > Nick
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Adrian Saul
> > > Sent: 22 September 2016 05:19
> > > To: 'ceph-users@lists.ceph.com' 
> > > Subject: Re: [ceph-users] Snap delete performance impact
> > >
> > >
> > > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > > seems to have tempered some of the issues but it's still bad enough
> > > that NFS storage off RBD volumes becomes unavailable for over 3 minutes.
> > >
> > > It seems that the activity through which the snapshot deletes are actioned
> > > triggers massive disk load for around 30 minutes.  The logs show
> > > OSDs marking each other out, OSDs complaining they are wrongly
> > > marked out and blocked request errors for around 10 minutes at the
> > > start of this activity.
> > >
> > > Is there any way to throttle snapshot deletes to make them much more
> > > of a background activity?  It really should not make the entire
> > > platform unusable for 10 minutes.
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Adrian Saul
> > > > Sent: Wednesday, 6 July 2016 3:41 PM
> > > > To: 'ceph-users@lists.ceph.com'
> > > > Subject: [ceph-users] Snap delete performance impact
> > > >
> > > >
> > > > I recently started a process of using rbd snapshots to setup a
> > > > backup regime for a few file systems contained in RBD images.
> >

Re: [ceph-users] Snap delete performance impact

2016-09-22 Thread Adrian Saul

I tried a value of 2 this afternoon and saw the same results.  Essentially the disks 
appear to go to 100% busy doing very small but very numerous IOs and incur 
massive service times (300-400ms).  During that period I get blocked request 
errors continually.

I suspect part of that might be that the SATA servers had filestore_op_threads set 
too high and were hammering the disks with too much concurrent work.  They had 
inherited a setting targeted at SSDs, so I have wound that back to defaults on 
those machines to see if it makes a difference.

But I suspect going by the disk activity there is a lot of very small FS 
metadata updates going on and that is what is killing it.
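
For reference, winding it back amounts to something like this - a sketch only, 
assuming the stock default of 2 threads and using osd.10 as an example id for 
one of the SATA OSDs:

    # Inject the (assumed) default of 2 back on each SATA OSD
    ceph tell osd.10 injectargs '--filestore_op_threads 2'

    # And drop the SSD-oriented override from the [osd] section of ceph.conf on
    # those hosts so it stays that way after a restart:
    #   filestore op threads = 2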

Cheers,
 Adrian


> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Thursday, 22 September 2016 7:06 PM
> To: Adrian Saul; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
>
> Hi Adrian,
>
> I have also hit this recently and have since increased the
> osd_snap_trim_sleep to try and stop this from happening again. However, I
> haven't had an opportunity to actually try and break it again yet, but your
> mail seems to suggest it might not be the silver bullet I was looking for.
>
> I'm wondering if the problem is not with the removal of the snapshot, but
> actually down to the amount of object deletes that happen, as I see similar
> results when doing fstrim's or deleting RBD's. Either way I agree that a
> settable throttle to allow it to process more slowly would be a good addition.
> Have you tried that value set to higher than 1, maybe 10?
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: 22 September 2016 05:19
> > To: 'ceph-users@lists.ceph.com' 
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > seems to have tempered some of the issues but it's still bad enough
> > that NFS storage off RBD volumes becomes unavailable for over 3 minutes.
> >
> > It seems that the activity through which the snapshot deletes are actioned
> > triggers massive disk load for around 30 minutes.  The logs show
> > OSDs marking each other out, OSDs complaining they are wrongly marked
> > out and blocked request errors for around 10 minutes at the start of this
> > activity.
> >
> > Is there any way to throttle snapshot deletes to make them much more
> > of a background activity?  It really should not make the entire
> > platform unusable for 10 minutes.
> >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Adrian Saul
> > > Sent: Wednesday, 6 July 2016 3:41 PM
> > > To: 'ceph-users@lists.ceph.com'
> > > Subject: [ceph-users] Snap delete performance impact
> > >
> > >
> > > I recently started a process of using rbd snapshots to setup a
> > > backup regime for a few file systems contained in RBD images.  While
> > > this generally works well at the time of the snapshots there is a
> > > massive increase in latency (10ms to multiple seconds of rbd device
> > > latency) across the entire cluster.  This has flow on effects for
> > > some cluster timeouts as well as general performance hits to applications.
> > >
> > > In my research I have found some references to osd_snap_trim_sleep being the
> > > way to throttle this activity but no real guidance on values for it.  I also see
> > > some other osd_snap_trim tunables (priority and cost).
> > >
> > > Are there any recommendations around setting these for a Jewel cluster?
> > >
> > > cheers,
> > >  Adrian

Re: [ceph-users] Snap delete performance impact

2016-09-22 Thread Nick Fisk
Hi Adrian,

I have also hit this recently and have since increased the osd_snap_trim_sleep 
to try and stop this from happening again. However, I
haven't had an opportunity to actually try and break it again yet, but your 
mail seems to suggest it might not be the silver bullet
I was looking for.

I'm wondering if the problem is not with the removal of the snapshot, but 
actually down to the number of object deletes that happen,
as I see similar results when doing fstrims or deleting RBDs. Either way I 
agree that a settable throttle to allow it to process
more slowly would be a good addition. Have you tried that value set to higher 
than 1, maybe 10?
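
Checking what an OSD is actually running with is quick if you want to compare 
notes - a sketch, osd.0 is just an example:

    # The active value on one OSD
    ceph daemon osd.0 config get osd_snap_trim_sleep

    # And all the snap trim related tunables in one go
    ceph daemon osd.0 config show | grep snap_trim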

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Adrian Saul
> Sent: 22 September 2016 05:19
> To: 'ceph-users@lists.ceph.com' 
> Subject: Re: [ceph-users] Snap delete performance impact
> 
> 
> Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it seems to 
> have tempered some of the issues but it's still bad enough
> that NFS storage off RBD volumes becomes unavailable for over 3 minutes.
> 
> It seems that the activity through which the snapshot deletes are actioned triggers 
> massive disk load for around 30 minutes.  The logs show
> OSDs marking each other out, OSDs complaining they are wrongly marked out and 
> blocked request errors for around 10 minutes at the start of this activity.
> 
> Is there any way to throttle snapshot deletes to make them much more of a 
> background activity?  It really should not make the entire
> platform unusable for 10 minutes.
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Wednesday, 6 July 2016 3:41 PM
> > To: 'ceph-users@lists.ceph.com'
> > Subject: [ceph-users] Snap delete performance impact
> >
> >
> > I recently started a process of using rbd snapshots to setup a backup
> > regime for a few file systems contained in RBD images.  While this
> > generally works well at the time of the snapshots there is a massive
> > increase in latency (10ms to multiple seconds of rbd device latency)
> > across the entire cluster.  This has flow on effects for some cluster
> > timeouts as well as general performance hits to applications.
> >
> > In my research I have found some references to osd_snap_trim_sleep being the
> > way to throttle this activity but no real guidance on values for it.  I also see
> > some other osd_snap_trim tunables (priority and cost).
> >
> > Are there any recommendations around setting these for a Jewel cluster?
> >
> > cheers,
> >  Adrian


Re: [ceph-users] Snap delete performance impact

2016-09-21 Thread Adrian Saul

Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it seems to have 
tempered some of the issues, but it's still bad enough that NFS storage off RBD 
volumes becomes unavailable for over 3 minutes.

It seems that the activity through which the snapshot deletes are actioned triggers 
massive disk load for around 30 minutes.  The logs show OSDs marking each other 
out, OSDs complaining they are wrongly marked out, and blocked request errors 
for around 10 minutes at the start of this activity.

Is there any way to throttle snapshot deletes to make them much more of a 
background activity?  It really should not make the entire platform unusable 
for 10 minutes.
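
A quick way to watch the impact while it happens, for anyone else hitting this - 
a sketch only, osd.12 is just an example id:

    # Cluster-wide view of blocked/slow requests during the trim
    ceph health detail | grep -Ei 'blocked|slow'

    # Per-OSD commit/apply latencies - the struggling OSDs stand out
    ceph osd perf

    # What a struggling OSD is actually stuck on
    ceph daemon osd.12 dump_ops_in_flight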



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Wednesday, 6 July 2016 3:41 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] Snap delete performance impact
>
>
> I recently started a process of using rbd snapshots to setup a backup regime
> for a few file systems contained in RBD images.  While this generally works
> well at the time of the snapshots there is a massive increase in latency (10ms
> to multiple seconds of rbd device latency) across the entire cluster.  This 
> has
> flow on effects for some cluster timeouts as well as general performance hits
> to applications.
>
> In my research I have found some references to osd_snap_trim_sleep being the
> way to throttle this activity but no real guidance on values for it.  I also see
> some other osd_snap_trim tunables (priority and cost).
>
> Are there any recommendations around setting these for a Jewel cluster?
>
> cheers,
>  Adrian


[ceph-users] Snap delete performance impact

2016-07-05 Thread Adrian Saul

I recently started a process of using rbd snapshots to set up a backup regime 
for a few file systems contained in RBD images.  While this generally works 
well, at the time of the snapshots there is a massive increase in latency (10ms 
to multiple seconds of rbd device latency) across the entire cluster.  This has 
flow-on effects for some cluster timeouts as well as general performance hits 
to applications.

In my research I have found some references to osd_snap_trim_sleep being the way 
to throttle this activity, but no real guidance on values for it.  I also see 
some other osd_snap_trim tunables (priority and cost).

Are there any recommendations around setting these for a Jewel cluster?

cheers,
 Adrian
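
For concreteness, these are the knobs I am referring to - the values below are 
purely illustrative, not a recommendation, and they can go under [osd] in 
ceph.conf or be injected at runtime:

    # Pause (in seconds) between individual snap trim operations on each OSD
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 1'

    # Work-queue priority for snap trimming; my reading is that a lower value
    # should give client IO more of a look-in, but I have not verified that
    ceph tell osd.* injectargs '--osd_snap_trim_priority 1'

    # osd_snap_trim_cost is the other one I can see; no guidance found on it either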
