Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-10-02 Thread Mark Kirkwood
We are also becoming interested in understanding and taming the impact 
of deep scrubbing.


We may start running something similar to the cron tasks mentioned. 
Looking at these fine examples of bash + awk I wondered if I could do 
the job using the python rados api. I have attached my initial (untested 
in anger) effort (note it is *longer* than bash + awk).


Cheers

Mark

On 10/06/14 13:42, Mike Dawson wrote:

Craig,

I've struggled with the same issue for quite a while. If your i/o is
similar to mine, I believe you are on the right track. For the past
month or so, I have been running this cronjob:

* * * * *   for strPg in `ceph pg dump | egrep
'^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do
ceph pg deep-scrub $strPg; done

That roughly handles my 20672 PGs that are set to be deep-scrubbed every
7 days. Your script may be a bit better, but this quick and dirty method
has helped my cluster maintain more consistency.

The real key for me is to avoid the "clumpiness" I have observed without
that hack where concurrent deep-scrubs sit at zero for a long period of
time (despite having PGs that were months overdue for a deep-scrub),
then concurrent deep-scrubs suddenly spike up and stay in the teens for
hours, killing client writes/second.

The scrubbing behavior table[0] indicates that a periodic tick initiates
scrubs on a per-PG basis. Perhaps the timing of ticks isn't
sufficiently randomized when you restart lots of OSDs concurrently (for
instance via pdsh).

On my cluster I suffer a significant drag on client writes/second when I
exceed perhaps four or five concurrent PGs in deep-scrub. When
concurrent deep-scrubs get into the teens, I get a massive drop in
client writes/second.

Greg, is there locking involved when a PG enters deep-scrub? If so, is
the entire PG locked for the duration or is each individual object
inside the PG locked as it is processed? Some of my PGs will be in
deep-scrub for minutes at a time.

0: http://ceph.com/docs/master/dev/osd_internals/scrub/

Thanks,
Mike Dawson


On 6/9/2014 6:22 PM, Craig Lewis wrote:

I've correlated a large deep scrubbing operation to cluster stability
problems.

My primary cluster does a small amount of deep scrubs all the time,
spread out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and
tries to do all of the deep scrubs over the weekend.  The secondary
starts losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs
the oldest outstanding PG.  In pseudo-bash:
# Sort by the deep-scrub timestamp, taking the single oldest PG
while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21,
$1}' | sort | head -1 | read date time pg
  do
   ceph pg deep-scrub ${pg}
   while ceph status | grep scrubbing+deep
do
 sleep 5
   done
   sleep 30
done


Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary
finishes replicating from the primary.  Once it's caught up, the write
load should drop enough that opportunistic deep scrubs should have a
chance to run.  It should only take another week or two to catch up.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


#!/usr/bin/python
#
# rados-deep-scrub-pgs:
#
# get a list of pgs which have not been deep scrubbed in over a
# number of days. Initiate a deep scrub for the oldest few.
#

import sys
import argparse
import json
from datetime import datetime 
from datetime import timedelta 
from rados import Rados

def cmp_deep_scrub_stamp(pg1, pg2):
    # compare deep scrub dates, oldest first
    if pg1["last_deep_scrub_stamp"] < pg2["last_deep_scrub_stamp"]:
        return -1
    elif pg1["last_deep_scrub_stamp"] > pg2["last_deep_scrub_stamp"]:
        return 1
    else:
        return 0


parser = argparse.ArgumentParser()
parser.add_argument("-c", "--conf", help="ceph config file to use", default="/etc/ceph/ceph.conf")
parser.add_argument("-n", "--numpgs", help="number of pgs to deep scrub", type=int, default=2)
parser.add_argument("-d", "--days", help="deep scrub pgs older than this many days", type=int, default=14)
args = parser.parse_args()

conn = Rados(conffile=args.conf)

conn.connect()

numpgs = 0
cmd = {"prefix":"pg dump", "format":"json"}
ret, buf, errs = conn.mon_command(json.dumps(cmd), '', timeout=300)
if ret != 0:
    print("cmd {:30} failed with {:50}".format(cmd, errs))
    sys.exit(1)

# get the detail pgs from the json structure
# and sort 'em by deep scrub date, oldest first
pgs = json.loads(buf)["pg_stats"]
cutoff = datetime.now() - timedelta(days=args.days)
for pg in sorted(pgs, cmp=cmp_deep_scrub_stamp):
    # NOTE: the attachment was truncated from here in the archive; the rest is
    # a reconstruction of the intended logic. It assumes the timestamp format
    # used in the 'pg dump' json output (e.g. "2014-06-09 15:22:30.123456")
    # and that 'pg deep-scrub' is accepted as a mon command on this release.
    stamp = pg["last_deep_scrub_stamp"].split('.')[0]
    pg_scrubdate = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    if pg_scrubdate > cutoff:
        # the remaining pgs have all been deep scrubbed recently enough
        break
    cmd = {"prefix":"pg deep-scrub", "pgid":pg["pgid"]}
    ret, out, errs = conn.mon_command(json.dumps(cmd), '', timeout=300)
    if ret != 0:
        print("cmd {:30} failed with {:50}".format(cmd, errs))
        sys.exit(1)
    print("initiated deep scrub of pg {} (last deep scrub {})".format(pg["pgid"], stamp))
    numpgs += 1
    if numpgs >= args.numpgs:
        break

conn.shutdown()
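
A crontab entry to drive it could look like this (the install path and the
schedule are illustrative, not from the original post):

# every 10 minutes, deep scrub up to 2 PGs that are more than 14 days overdue
*/10 * * * *  /usr/local/bin/rados-deep-scrub-pgs -c /etc/ceph/ceph.conf -n 2 -d 14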

Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-11 Thread Dan Van Der Ster
On 10 Jun 2014, at 11:59, Dan Van Der Ster  wrote:

> One idea I had was to check the behaviour under different disk io schedulers, 
> trying to exploit thread io priorities with cfq. So I have a question for the 
> developers about using ionice or ioprio_set to lower the IO priorities of the 
> threads responsible for scrubbing: 
>   - Are there dedicated threads always used for scrubbing only, and never for 
> client IOs? If so, can an admin identify the thread IDs so he can ionice 
> those? 
>   - If OTOH a disk/op thread is switching between scrubbing and client IO 
> responsibilities, could Ceph use ioprio_set to change the io priorities on 
> the fly??

I just submitted a feature request for this:  
http://tracker.ceph.com/issues/8580

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-10 Thread Craig Lewis
After doing this, I've found that I'm having problems with a few
specific PGs.  If I set nodeep-scrub, then manually deep-scrub one
specific PG, the responsible OSDs get kicked out.  I'm starting a new
discussion, subject: "I have PGs that I can't deep-scrub"

I'll re-test this correlation after I fix the broken PGs.
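
For reference, the nodeep-scrub flag Craig mentions is toggled cluster-wide
with the standard commands:

ceph osd set nodeep-scrub      # stop scheduling new deep scrubs
ceph osd unset nodeep-scrub    # resume normal deep scrubbing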

On Mon, Jun 9, 2014 at 10:20 PM, Gregory Farnum  wrote:
> On Mon, Jun 9, 2014 at 6:42 PM, Mike Dawson  wrote:
>> Craig,
>>
>> I've struggled with the same issue for quite a while. If your i/o is similar
>> to mine, I believe you are on the right track. For the past month or so, I
>> have been running this cronjob:
>>
>> * * * * *   for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' |
>> sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg;
>> done
>>
>> That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7
>> days. Your script may be a bit better, but this quick and dirty method has
>> helped my cluster maintain more consistency.
>>
>> The real key for me is to avoid the "clumpiness" I have observed without
>> that hack where concurrent deep-scrubs sit at zero for a long period of time
>> (despite having PGs that were months overdue for a deep-scrub), then
>> concurrent deep-scrubs suddenly spike up and stay in the teens for hours,
>> killing client writes/second.
>>
>> The scrubbing behavior table[0] indicates that a periodic tick initiates
>> scrubs on a per-PG basis. Perhaps the timing of ticks isn't sufficiently
>> randomized when you restart lots of OSDs concurrently (for instance via
>> pdsh).
>>
>> On my cluster I suffer a significant drag on client writes/second when I
>> exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent
>> deep-scrubs get into the teens, I get a massive drop in client
>> writes/second.
>>
>> Greg, is there locking involved when a PG enters deep-scrub? If so, is the
>> entire PG locked for the duration or is each individual object inside the PG
>> locked as it is processed? Some of my PGs will be in deep-scrub for minutes
>> at a time.
>
> It locks very small regions of the key space, but the expensive part
> is that deep scrub actually has to read all the data off disk, and
> that's often a lot more disk seeks than simply examining the metadata.
> -Greg
>
>>
>> 0: http://ceph.com/docs/master/dev/osd_internals/scrub/
>>
>> Thanks,
>> Mike Dawson
>>
>>
>>
>> On 6/9/2014 6:22 PM, Craig Lewis wrote:
>>>
>>> I've correlated a large deep scrubbing operation to cluster stability
>>> problems.
>>>
>>> My primary cluster does a small amount of deep scrubs all the time,
>>> spread out over the whole week.  It has no stability problems.
>>>
>>> My secondary cluster doesn't spread them out.  It saves them up, and
>>> tries to do all of the deep scrubs over the weekend.  The secondary
>>> starts losing OSDs about an hour after these deep scrubs start.
>>>
>>> To avoid this, I'm thinking of writing a script that continuously scrubs
>>> the oldest outstanding PG.  In pseudo-bash:
>>> # Sort by the deep-scrub timestamp, taking the single oldest PG
>>> while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21,
>>> $1}' | sort | head -1 | read date time pg
>>>   do
>>>ceph pg deep-scrub ${pg}
>>>while ceph status | grep scrubbing+deep
>>> do
>>>  sleep 5
>>>done
>>>sleep 30
>>> done
>>>
>>>
>>> Does anybody think this will solve my problem?
>>>
>>> I'm also considering disabling deep-scrubbing until the secondary
>>> finishes replicating from the primary.  Once it's caught up, the write
>>> load should drop enough that opportunistic deep scrubs should have a
>>> chance to run.  It should only take another week or two to catch up.
>>>
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-10 Thread Dan Van Der Ster
Hi,
I’m just starting to get interested in this topic, since today we’ve found that 
a weekly peak in latency correlates with a batch of roughly 30 PGs deep scrubbing at once.

One idea I had was to check the behaviour under different disk io schedulers, 
trying to exploit thread io priorities with cfq. So I have a question for the 
developers about using ionice or ioprio_set to lower the IO priorities of the 
threads responsible for scrubbing:
  - Are there dedicated threads always used for scrubbing only, and never for 
client IOs? If so, can an admin identify the thread IDs so he can ionice those?
  - If OTOH a disk/op thread is switching between scrubbing and client IO 
responsibilities, could Ceph use ioprio_set to change the io priorities on the 
fly??

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --
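
As a rough sketch of the experiment Dan describes (and assuming the
scrub-heavy thread IDs of an OSD could be identified at all, which is exactly
the open question he raises), the knobs involved are the standard Linux ones;
the device name below is illustrative:

# cfq is the only stock scheduler that honours io priorities
echo cfq > /sys/block/sdb/queue/scheduler

# thread IDs of a running OSD are listed under /proc/<osd-pid>/task/;
# a candidate thread could then be dropped to the idle io class
ionice -c 3 -p <tid>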


On 10 Jun 2014, at 00:22, Craig Lewis <cle...@centraldesktop.com> wrote:

I've correlated a large deep scrubbing operation to cluster stability problems.

My primary cluster does a small amount of deep scrubs all the time, spread out 
over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and tries to 
do all of the deep scrubs over the weekend.  The secondary starts losing OSDs 
about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs the 
oldest outstanding PG.  In pseudo-bash:
# Sort by the deep-scrub timestamp, taking the single oldest PG
while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' | 
sort | head -1 | read date time pg
 do
  ceph pg deep-scrub ${pg}
  while ceph status | grep scrubbing+deep
   do
sleep 5
  done
  sleep 30
done


Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary finishes 
replicating from the primary.  Once it's caught up, the write load should drop 
enough that opportunistic deep scrubs should have a chance to run.  It should 
only take another week or two to catch up.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Gregory Farnum
On Mon, Jun 9, 2014 at 6:42 PM, Mike Dawson  wrote:
> Craig,
>
> I've struggled with the same issue for quite a while. If your i/o is similar
> to mine, I believe you are on the right track. For the past month or so, I
> have been running this cronjob:
>
> * * * * *   for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' |
> sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg;
> done
>
> That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7
> days. Your script may be a bit better, but this quick and dirty method has
> helped my cluster maintain more consistency.
>
> The real key for me is to avoid the "clumpiness" I have observed without
> that hack where concurrent deep-scrubs sit at zero for a long period of time
> (despite having PGs that were months overdue for a deep-scrub), then
> concurrent deep-scrubs suddenly spike up and stay in the teens for hours,
> killing client writes/second.
>
> The scrubbing behavior table[0] indicates that a periodic tick initiates
> scrubs on a per-PG basis. Perhaps the timing of ticks isn't sufficiently
> randomized when you restart lots of OSDs concurrently (for instance via
> pdsh).
>
> On my cluster I suffer a significant drag on client writes/second when I
> exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent
> deep-scrubs get into the teens, I get a massive drop in client
> writes/second.
>
> Greg, is there locking involved when a PG enters deep-scrub? If so, is the
> entire PG locked for the duration or is each individual object inside the PG
> locked as it is processed? Some of my PGs will be in deep-scrub for minutes
> at a time.

It locks very small regions of the key space, but the expensive part
is that deep scrub actually has to read all the data off disk, and
that's often a lot more disk seeks than simply examining the metadata.
-Greg

>
> 0: http://ceph.com/docs/master/dev/osd_internals/scrub/
>
> Thanks,
> Mike Dawson
>
>
>
> On 6/9/2014 6:22 PM, Craig Lewis wrote:
>>
>> I've correlated a large deep scrubbing operation to cluster stability
>> problems.
>>
>> My primary cluster does a small amount of deep scrubs all the time,
>> spread out over the whole week.  It has no stability problems.
>>
>> My secondary cluster doesn't spread them out.  It saves them up, and
>> tries to do all of the deep scrubs over the weekend.  The secondary
>> starts losing OSDs about an hour after these deep scrubs start.
>>
>> To avoid this, I'm thinking of writing a script that continuously scrubs
>> the oldest outstanding PG.  In pseudo-bash:
>> # Sort by the deep-scrub timestamp, taking the single oldest PG
>> while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21,
>> $1}' | sort | head -1 | read date time pg
>>   do
>>ceph pg deep-scrub ${pg}
>>while ceph status | grep scrubbing+deep
>> do
>>  sleep 5
>>done
>>sleep 30
>> done
>>
>>
>> Does anybody think this will solve my problem?
>>
>> I'm also considering disabling deep-scrubbing until the secondary
>> finishes replicating from the primary.  Once it's caught up, the write
>> load should drop enough that opportunistic deep scrubs should have a
>> chance to run.  It should only take another week or two to catch up.
>>
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Mike Dawson

Craig,

I've struggled with the same issue for quite a while. If your i/o is 
similar to mine, I believe you are on the right track. For the past 
month or so, I have been running this cronjob:


* * * * *   for strPg in `ceph pg dump | egrep 
'^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do 
ceph pg deep-scrub $strPg; done


That roughly handles my 20672 PGs that are set to be deep-scrubbed every 
7 days. Your script may be a bit better, but this quick and dirty method 
has helped my cluster maintain more consistency.
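
That rate works out as expected: 20672 PGs spread over a 7-day window is
about two PGs per minute, which is where the head -2 comes from:

echo '20672 / (7 * 24 * 60)' | bc -l    # ~2.05 PGs per minute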


The real key for me is to avoid the "clumpiness" I have observed without 
that hack where concurrent deep-scrubs sit at zero for a long period of 
time (despite having PGs that were months overdue for a deep-scrub), 
then concurrent deep-scrubs suddenly spike up and stay in the teens for 
hours, killing client writes/second.


The scrubbing behavior table[0] indicates that a periodic tick initiates 
scrubs on a per-PG basis. Perhaps the timing of ticks isn't 
sufficiently randomized when you restart lots of OSDs concurrently (for 
instance via pdsh).


On my cluster I suffer a significant drag on client writes/second when I 
exceed perhaps four or five concurrent PGs in deep-scrub. When 
concurrent deep-scrubs get into the teens, I get a massive drop in 
client writes/second.


Greg, is there locking involved when a PG enters deep-scrub? If so, is 
the entire PG locked for the duration or is each individual object 
inside the PG locked as it is processed? Some of my PGs will be in 
deep-scrub for minutes at a time.


0: http://ceph.com/docs/master/dev/osd_internals/scrub/

Thanks,
Mike Dawson


On 6/9/2014 6:22 PM, Craig Lewis wrote:

I've correlated a large deep scrubbing operation to cluster stability
problems.

My primary cluster does a small amount of deep scrubs all the time,
spread out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and
tries to do all of the deep scrubs over the weekend.  The secondary
starts losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs
the oldest outstanding PG.  In pseudo-bash:
# Sort by the deep-scrub timestamp, taking the single oldest PG
while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21,
$1}' | sort | head -1 | read date time pg
  do
   ceph pg deep-scrub ${pg}
   while ceph status | grep scrubbing+deep
do
 sleep 5
   done
   sleep 30
done


Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary
finishes replicating from the primary.  Once it's caught up, the write
load should drop enough that opportunistic deep scrubs should have a
chance to run.  It should only take another week or two to catch up.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Gregory Farnum
On Mon, Jun 9, 2014 at 3:22 PM, Craig Lewis  wrote:
> I've correlated a large deep scrubbing operation to cluster stability
> problems.
>
> My primary cluster does a small amount of deep scrubs all the time, spread
> out over the whole week.  It has no stability problems.
>
> My secondary cluster doesn't spread them out.  It saves them up, and tries
to do all of the deep scrubs over the weekend.  The secondary starts losing
> OSDs about an hour after these deep scrubs start.
>
> To avoid this, I'm thinking of writing a script that continuously scrubs the
> oldest outstanding PG.  In pseudo-bash:
> # Sort by the deep-scrub timestamp, taking the single oldest PG
> while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}'
> | sort | head -1 | read date time pg
>  do
>   ceph pg deep-scrub ${pg}
>   while ceph status | grep scrubbing+deep
>do
> sleep 5
>   done
>   sleep 30
> done
>
>
> Does anybody think this will solve my problem?
>
> I'm also considering disabling deep-scrubbing until the secondary finishes
> replicating from the primary.  Once it's caught up, the write load should
> drop enough that opportunistic deep scrubs should have a chance to run.  It
> should only take another week or two to catch up.

If the problem is just that your secondary cluster is under a heavy
write load, and so the scrubbing won't run automatically until the PGs
hit their time limit, maybe it's appropriate to change the limits so
they can run earlier. You can bump up "osd scrub load threshold".
Or maybe that would be a terrible thing to do, not sure. But it sounds
like the cluster is just skipping the voluntary scrubs, and then they
all come due at once (probably from some earlier event).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
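
For reference, raising that threshold could look like the following (a
sketch; the value 5 is arbitrary, the default is 0.5, and injectargs only
lasts until the OSDs restart, so a persistent change would also go into
ceph.conf):

ceph tell 'osd.*' injectargs '--osd-scrub-load-threshold 5'

# or persistently, in ceph.conf under [osd]:
#   osd scrub load threshold = 5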
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to avoid deep-scrubbing performance hit?

2014-06-09 Thread Craig Lewis
I've correlated a large deep scrubbing operation to cluster stability
problems.

My primary cluster does a small amount of deep scrubs all the time, spread
out over the whole week.  It has no stability problems.

My secondary cluster doesn't spread them out.  It saves them up, and tries
to do all of the deep scrubs over the weekend.  The secondary starts
losing OSDs about an hour after these deep scrubs start.

To avoid this, I'm thinking of writing a script that continuously scrubs
the oldest outstanding PG.  In pseudo-bash:
# Sort by the deep-scrub timestamp, taking the single oldest PG
while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}'
| sort | head -1 | read date time pg
 do
  ceph pg deep-scrub ${pg}
  while ceph status | grep scrubbing+deep
   do
sleep 5
  done
  sleep 30
done


Does anybody think this will solve my problem?

I'm also considering disabling deep-scrubbing until the secondary finishes
replicating from the primary.  Once it's caught up, the write load should
drop enough that opportunistic deep scrubs should have a chance to run.  It
should only take another week or two to catch up.
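
For reference, a directly runnable version of that sketch might look like the
following. One caveat with the pseudo-bash above: piping into read runs the
read in a subshell in bash, so the variables would not survive the pipeline;
the loop below restructures that part. The $20/$21 column positions assume
the plain 'ceph pg dump' output of this era.

#!/bin/bash
# Continuously deep-scrub whichever PG has the oldest deep-scrub timestamp.
while true; do
    pg=$(ceph pg dump 2>/dev/null \
        | awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ {print $20, $21, $1}' \
        | sort | head -1 | awk '{print $3}')
    [ -z "$pg" ] && break
    ceph pg deep-scrub "${pg}"
    # wait for the deep scrub to finish before queueing the next one
    while ceph status | grep -q 'scrubbing+deep'; do
        sleep 5
    done
    sleep 30
done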
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com