Re: Write Replication on Degraded PGs

2013-02-15 Thread Ben Rowland
Further to my question about reads on a degraded PG, my tests show
that indeed reads from rgw fail when not all OSDs in a PG are up, even
when the data is physically available on an up/in OSD.

My pool has size and min_size of 2, and I have 2 hosts with 2 OSDs
each.  The crush map is set to place one copy on each of the 2 hosts.
After successfully writing a file to rgw via host 1, I then stop all
Ceph services on host 2.  Attempts to read the file I just wrote time
out after 30 seconds.  Starting Ceph again on host 2 allows reads to
proceed from host 1 once again.
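
For reference, the per-pool size/min_size values described above can be
confirmed with something along these lines (the grep is only there to
trim the output):

$ ceph osd dump | grep '^pool'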

I see the following in ceph.log after the read times out:

2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow
request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630:
osd_op(client.4345.0:21511
4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read
0~524288] 9.5aaf1592 RETRY) v4 currently reached pg

After stopping Ceph on host 2, ceph -s reports:

   health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs
stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded
(0.647%)
   monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
   osdmap e155: 4 osds: 2 up, 2 in
pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16
incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail;
44/6804 degraded (0.647%)
   mdsmap e1: 0/0/1 up

OSD tree just in case:

# id    weight  type name       up/down reweight
-1      2       root default
-3      2               rack unknownrack
-2      1                       host squeezeceph1
0       1                               osd.0   up      1
2       1                               osd.2   up      1
-4      1                       host squeezeceph2
1       1                               osd.1   down    0
3       0                               osd.3   down    0

Running 'ceph osd map' on both the container and object names says host 1
is acting for those PGs (not sure if I'm looking at the right pools,
though):

$ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c

osdmap e155 pool '.rgw.buckets' (9) object
'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up
[0] acting [0]

$ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc

osdmap e155 pool '.rgw' (3) object
'91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up
[2] acting [2]

Any thoughts?  It doesn't seem right that taking out a single failure
domain should cause this degradation.

Many thanks,

Ben

On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland ben.rowl...@gmail.com wrote:
 On 13 Feb 2013 18:16, Gregory Farnum g...@inktank.com wrote:
 
  On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

 So it sounds from the rest of your post like you'd want to, for each
 pool that RGW uses (it's not just .rgw), run 'ceph osd pool set .rgw
 min_size 2' (and the same for .rgw.buckets, etc.).

 Thanks, that did the trick. When the number of up OSDs is less than
 min_size, writes block for 30s then return http 500. Ceph honours my
 crush rule in this case - adding more OSDs to only one of two failure
 domains continues to block writes - all well and good!
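
 (For anyone following along, the commands were along these lines; the
 exact pool list depends on your rgw setup:)

 $ ceph osd pool set .rgw min_size 2
 $ ceph osd pool set .rgw.buckets min_size 2
 $ ceph osd pool set .rgw.control min_size 2
 $ ceph osd pool set .rgw.gc min_size 2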

  If this is the expected behaviour of Ceph, then it seems to prefer
  write-availability over read-availability (in this case my data is
  only stored on 1 OSD, thus a SPOF).  Is there any way to change this
  trade-off, e.g. as you can in Cassandra with its write quorums?

 I'm not quite sure this is describing it correctly — Ceph guarantees
 that anything that's been written to disk will be readable later on,
 and placement groups won't go active if they can't retrieve all data.
 The sort of flexible policies allowed by Cassandra aren't possible
 within Ceph — it is a strictly consistent system.

 Are objects always readable even if a PG is missing some OSDs and
 cannot recover? Example: 2 hosts each with 1 osd, pool min_size is 2,
 with a crush rule saying to write to both hosts. I write a file
 successfully, then one host goes down, and eventually is marked 'out'.
 Is the file readable on the 'up' host (say, if I'm running rgw there)?
 What if the up host does not have the primary copy?

 Furthermore, if Ceph is strictly consistent, how would it resolve
 possible stale reads? Say, if in the 2 hosts example, the network
 connection died, but min_size was set to 1. Would it be possible for
 writes to proceed, say making edits to an existing object? Could
 readers at the other host see stale data?

 Thanks again in advance,

 Ben


OSD's DOWN ---after upgrade to 0.56.3

2013-02-15 Thread femi anjorin
Hi All,

Please, I got this result after I did an upgrade to 0.56.3. I'm not sure
if it's a problem with the upgrade or something else.

# ceph osd tree

# id    weight  type name       up/down reweight
-1  96  root default
-3  96  rack unknownrack
-2  4   host server109
0   1   osd.0   DNE
1   1   osd.1   DNE
2   1   osd.2   DNE
3   1   osd.3   up  1
-4  4   host server111
10  1   osd.10  DNE
11  1   osd.11  DNE
8   1   osd.8   DNE
9   1   osd.9   up  1
-5  4   host server112
12  1   osd.12  DNE
13  1   osd.13  DNE
14  1   osd.14  DNE
15  1   osd.15  up  1
-6  4   host server113
16  1   osd.16  DNE
17  1   osd.17  DNE
18  1   osd.18  DNE
19  1   osd.19  up  1
-7  4   host server114
20  1   osd.20  DNE
21  1   osd.21  DNE
22  1   osd.22  DNE
23  1   osd.23  up  1
-8  4   host server115
24  1   osd.24  DNE
25  1   osd.25  DNE
26  1   osd.26  DNE
27  1   osd.27  up  1
-9  4   host server116
28  1   osd.28  DNE
29  1   osd.29  DNE
30  1   osd.30  DNE
31  1   osd.31  up  1
-10 4   host server209
32  1   osd.32  DNE
33  1   osd.33  DNE
34  1   osd.34  DNE
35  1   osd.35  up  1
-11 4   host server210
36  1   osd.36  DNE
37  1   osd.37  DNE
38  1   osd.38  DNE
39  1   osd.39  up  1
-12 4   host server110
4   1   osd.4   DNE
5   1   osd.5   DNE
6   1   osd.6   DNE
7   1   osd.7   up  1
-13 4   host server211
40  1   osd.40  DNE
41  1   osd.41  DNE
42  1   osd.42  DNE
43  1   osd.43  up  1
-14 4   host server212
44  1   osd.44  DNE
45  1   osd.45  DNE
46  1   osd.46  DNE
47  1   osd.47  up  1
-15 4   host server213
48  1   osd.48  DNE
49  1   osd.49  DNE
50  1   osd.50  DNE
51  1   osd.51  up  1
-16 4   host server214
52  1   osd.52  DNE
53  1   osd.53  DNE
54  1   osd.54  DNE
55  1   osd.55  up  1
-17 4   host server215
56  1   osd.56  DNE
57  1   osd.57  DNE
58  1   osd.58  DNE
59  1   osd.59  up  1
-18 4   host server216
60  1   osd.60  DNE
61  1   osd.61  DNE
62  1   osd.62  DNE
63  1   osd.63  up  1
-19 4   host server309
64  1   osd.64  DNE
65  1   osd.65  DNE
66  1   osd.66  DNE
67  1   osd.67  up  1
-20 4   host server310
68  1   osd.68  DNE
69  1   osd.69  DNE
70  1   osd.70  DNE
71  1   osd.71  up  1
-21 4   

Re: OSD's DOWN ---after upgrade to 0.56.3

2013-02-15 Thread Jens Kristian Søgaard

Hi Femi,


Pls I got this result after i did an upgrade to 0.56.3. I'm not sure
-2  4   host server109
0   1   osd.0   DNE
1   1   osd.1   DNE
2   1   osd.2   DNE


Your OSDs are not marked down - they're listed as DNE (does not exist).

Did you previously run a ceph osd rm command on those osds?

Note: This topic is probably best suited for the ceph-users list, so I 
have cross-posted to that list.


--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: OSD's DOWN ---after upgrade to 0.56.3

2013-02-15 Thread Patrick McGarry
femi,

CC'ing ceph-user as this discussion probably belongs there.

Could you send a copy of your crushmap?  DNE (Does Not Exist) is
typically what we see when someone explicitly removes an osd with
something like 'ceph osd rm 90'.
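
Something like the following will produce a decompiled crushmap you can
attach (the paths here are just examples):

$ ceph osd getcrushmap -o /tmp/crushmap
$ crushtool -d /tmp/crushmap -o /tmp/crushmap.txt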

Also, out of curiosity, how did you upgrade your cluster? One box at a
time? Take the whole thing down and upgrade everything? Something else?
Just interested to see how you got to where you are.  Feel free to stop
by IRC and ask for scuttlemonkey if you want a more direct discussion.
Thanks.


Best Regards,

Patrick

On Fri, Feb 15, 2013 at 8:50 AM, femi anjorin femi.anjo...@gmail.com wrote:
 Hi All,

 Pls I got this result after i did an upgrade to 0.56.3. I'm not sure
 if its a problem with upgrade or some other things.

 # ceph osd tree

 # id    weight  type name       up/down reweight
 -1  96  root default
 -3  96  rack unknownrack
 -2  4   host server109
 0   1   osd.0   DNE
 1   1   osd.1   DNE
 2   1   osd.2   DNE
 3   1   osd.3   up  1
 -4  4   host server111
 10  1   osd.10  DNE
 11  1   osd.11  DNE
 8   1   osd.8   DNE
 9   1   osd.9   up  1
 -5  4   host server112
 12  1   osd.12  DNE
 13  1   osd.13  DNE
 14  1   osd.14  DNE
 15  1   osd.15  up  1
 -6  4   host server113
 16  1   osd.16  DNE
 17  1   osd.17  DNE
 18  1   osd.18  DNE
 19  1   osd.19  up  1
 -7  4   host server114
 20  1   osd.20  DNE
 21  1   osd.21  DNE
 22  1   osd.22  DNE
 23  1   osd.23  up  1
 -8  4   host server115
 24  1   osd.24  DNE
 25  1   osd.25  DNE
 26  1   osd.26  DNE
 27  1   osd.27  up  1
 -9  4   host server116
 28  1   osd.28  DNE
 29  1   osd.29  DNE
 30  1   osd.30  DNE
 31  1   osd.31  up  1
 -10 4   host server209
 32  1   osd.32  DNE
 33  1   osd.33  DNE
 34  1   osd.34  DNE
 35  1   osd.35  up  1
 -11 4   host server210
 36  1   osd.36  DNE
 37  1   osd.37  DNE
 38  1   osd.38  DNE
 39  1   osd.39  up  1
 -12 4   host server110
 4   1   osd.4   DNE
 5   1   osd.5   DNE
 6   1   osd.6   DNE
 7   1   osd.7   up  1
 -13 4   host server211
 40  1   osd.40  DNE
 41  1   osd.41  DNE
 42  1   osd.42  DNE
 43  1   osd.43  up  1
 -14 4   host server212
 44  1   osd.44  DNE
 45  1   osd.45  DNE
 46  1   osd.46  DNE
 47  1   osd.47  up  1
 -15 4   host server213
 48  1   osd.48  DNE
 49  1   osd.49  DNE
 50  1   osd.50  DNE
 51  1   osd.51  up  1
 -16 4   host server214
 52  1   osd.52  DNE
 53  1   osd.53  DNE
 54  1   osd.54  DNE
 55  1   osd.55  up  1
 -17 4   host server215
 56  1   osd.56  DNE
 57  1   osd.57  DNE
 58  1   osd.58  DNE
 59  1   osd.59  up  1
 -18 4   host server216
 60  1   osd.60  DNE
 61 

Re: [PATCH] config: Add small note about default number of PGs

2013-02-15 Thread Sam Lang
On Sat, Feb 9, 2013 at 1:55 PM, Wido den Hollander w...@42on.com wrote:
 From: Wido den Hollander w...@widodh.nl

 It's still not clear to end users that this should go into the
 [mon] or [global] section of ceph.conf.

 Until this gets resolved, document it here as well for the people
 who look up their settings in the source code.

 Signed-off-by: Wido den Hollander w...@42on.com
 ---
  src/common/config_opts.h |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

 diff --git a/src/common/config_opts.h b/src/common/config_opts.h
 index ce3bca2..0fc07a3 100644
 --- a/src/common/config_opts.h
 +++ b/src/common/config_opts.h
 @@ -317,8 +317,8 @@ OPTION(osd_max_rep, OPT_INT, 10)
  OPTION(osd_pool_default_crush_rule, OPT_INT, 0)
  OPTION(osd_pool_default_size, OPT_INT, 2)
  OPTION(osd_pool_default_min_size, OPT_INT, 0)  // 0 means no specific default; ceph will use size-size/2
 -OPTION(osd_pool_default_pg_num, OPT_INT, 8)
 -OPTION(osd_pool_default_pgp_num, OPT_INT, 8)
 +OPTION(osd_pool_default_pg_num, OPT_INT, 8) // number of PGs for new pools. Configure in global or mon section of ceph.conf
 +OPTION(osd_pool_default_pgp_num, OPT_INT, 8) // number of PGs for placement purposes. Should be equal to pg_num
  OPTION(osd_map_dedup, OPT_BOOL, true)
  OPTION(osd_map_cache_size, OPT_INT, 500)
  OPTION(osd_map_message_max, OPT_INT, 100)  // max maps per MOSDMap message
 --
 1.7.9.5

 --

Applied to master.  Thanks!
-sam
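
For end users hitting this before the docs catch up, a minimal ceph.conf
sketch of where the setting is expected to live (the PG count is only
illustrative; pick a value suited to your cluster):

[global]
    osd pool default pg num = 128
    osd pool default pgp num = 128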



Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-02-15 Thread Sam Lang
On Mon, Feb 11, 2013 at 7:39 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:


 Yes, there were osd daemons running on the same node that the monitor was
 running on.  If that is the case then I will run a test case with the
 monitor running on a different node where no osd is running and see what
 happens. Thank you.

Hi Isaac,

Any luck?  Does the problem reproduce with the mon running on a separate host?
-sam


 Isaac

 
 From: Gregory Farnum g...@inktank.com
 To: Isaac Otsiabah zmoo...@yahoo.com
 Cc: ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
 Sent: Monday, February 11, 2013 12:29 PM
 Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host 
 to my cluster

 Isaac,
 I'm sorry I haven't been able to wrangle any time to look into this
 more yet, but Sage pointed out in a related thread that there might be
 some buggy handling of things like this if the OSD and the monitor are
 located on the same host. Am I correct in assuming that with your
 small cluster, all your OSDs are co-located with a monitor daemon?
 -Greg

 On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:


 Gregory, I recreated the osd down problem again this morning on two nodes
 (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1,
 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a
 minute and a half after osd.3, 4, 5 were added. I have included the
 routing table of each node at the time osd.1 went down. ceph.conf and
 ceph-osd.1.log files are attached. The crush map was default. Also, it
 could be a timing issue because it does not always fail when using the
 default crush map; it takes several trials before you see it. Thank you.


 [root@g13ct ~]# netstat -r
 Kernel IP routing table
 Destination Gateway Genmask Flags   MSS Window  irtt 
 Iface
 default 133.164.98.250 0.0.0.0 UG0 0  0 eth2
 133.164.98.0*   255.255.255.0   U 0 0  0 eth2
 link-local  *   255.255.0.0 U 0 0  0 eth3
 link-local  *   255.255.0.0 U 0 0  0 eth0
 link-local  *   255.255.0.0 U 0 0  0 eth2
 192.0.0.0   *   255.0.0.0   U 0 0  0 eth3
 192.0.0.0   *   255.0.0.0   U 0 0  0 eth0
 192.168.0.0 *   255.255.255.0   U 0 0  0 eth3
 192.168.1.0 *   255.255.255.0   U 0 0  0 eth0
 [root@g13ct ~]# ceph osd tree

 # id    weight  type name       up/down reweight
 -1  6   root default
 -3  6   rack unknownrack
 -2  3   host g13ct
 0   1   osd.0   up  1
 1   1   osd.1   down1
 2   1   osd.2   up  1
 -4  3   host g14ct
 3   1   osd.3   up  1
 4   1   osd.4   up  1
 5   1   osd.5   up  1



 [root@g14ct ~]# ceph osd tree

 # id    weight  type name       up/down reweight
 -1  6   root default
 -3  6   rack unknownrack
 -2  3   host g13ct
 0   1   osd.0   up  1
 1   1   osd.1   down1
 2   1   osd.2   up  1
 -4  3   host g14ct
 3   1   osd.3   up  1
 4   1   osd.4   up  1
 5   1   osd.5   up  1

 [root@g14ct ~]# netstat -r
 Kernel IP routing table
 Destination Gateway Genmask Flags   MSS Window  irtt 
 Iface
 default 133.164.98.250 0.0.0.0 UG0 0  0 eth0
 133.164.98.0*   255.255.255.0   U 0 0  0 eth0
 link-local  *   255.255.0.0 U 0 0  0 eth3
 link-local  *   255.255.0.0 U 0 0  0 eth5
 link-local  *   255.255.0.0 U 0 0  0 eth0
 192.0.0.0   *   255.0.0.0   U 0 0  0 eth3
 192.0.0.0   *   255.0.0.0   U 0 0  0 eth5
 192.168.0.0 *   255.255.255.0   U 0 0  0 eth3
 192.168.1.0 *   255.255.255.0   U 0 0  0 eth5
 [root@g14ct ~]# ceph osd tree

 # id    weight  type name       up/down reweight
 -1  6   root default
 -3  6   rack unknownrack
 -2  3   host g13ct
 0   1   osd.0   up  1
 1   1   osd.1   down1
 2   1   

Re: radosgw: Update a key's meta data

2013-02-15 Thread Yehuda Sadeh
On Thu, Feb 14, 2013 at 6:37 AM, Sylvain Munaut
s.mun...@whatever-company.com wrote:
 Hi,

 I was wondering how I could update a key's metadata like the Content-Type.

 The solution on S3 seems to be to copy the key onto itself, replacing the
 metadata. If I do that in Ceph, will it work? And more importantly, will
 it be done intelligently (i.e. without copying the actual file data
 around)?

Same API in S3, copying an object into itself. For Swift there's a
POST request with updated metadata. Data (other than the first chunk)
isn't really copied.
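
With the Swift-compatible API, for example, the update is a plain POST to
the object, roughly like this (endpoint, token and names are placeholders):

$ curl -i -X POST \
    -H "X-Auth-Token: $TOKEN" \
    -H "Content-Type: application/json" \
    http://radosgw.example.com/swift/v1/mybucket/myobject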


 I tried reading the code, but although part of the code seems to hint
 at support for this (in rgw_rest_s3.cc), some other part seems not to
 check at all whether src == dst (like rgw_op.cc).


Right. It is actually missing and it's a bug. I've opened ceph issue #4150.

Thanks,
Yehuda


Re: The Ceph Census

2013-02-15 Thread Ross David Turk

Hey folks.  We've gotten nearly 50 responses so far, and the data is proving to 
be quite interesting!  I will share it on the blog early next week.

The survey will be open until next Monday so that everyone has an opportunity 
to participate.  If you haven't gotten around to adding your cluster, you still 
can - it's a pretty short list of questions, shouldn't take more than a minute 
or two.

http://ceph.com/census

Thanks,
Ross


On Feb 13, 2013, at 7:06 PM, Ross David Turk r...@inktank.com wrote:

 
 Hi!  It's been a while since my last poll about Ceph deployments and use 
 cases.  Since there are so many more of us now, I think it's a good time to 
 do it again.  This time, I've set up a survey.
 
 I am particularly interested in how many deployments of Ceph there are, how 
 much underlying storage they manage, whether they're in production, and how 
 people plan to use them.
 
 It's a very short survey (10 questions, mostly multiple-choice), and 
 shouldn't take more than a minute or two.  The results will be public, and I 
 think it will help us all figure out how to focus our efforts.
 
 http://ceph.com/census
 
 This information will be compiled and published on the Ceph blog for all to 
 review and enjoy.  Thanks!
 
 Cheers,
 Ross


--
Ross Turk
Community, Inktank

@rossturk @inktank @ceph



Re: Mon losing touch with OSDs

2013-02-15 Thread Chris Dunlop
G'day Sage,

On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
 On Fri, 15 Feb 2013, Chris Dunlop wrote:
 In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
 mons to lose touch with the osds?
 
 Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
 hopes that this happens again?  It will give us more information to go on.

Debug turned on.
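
For the record, I enabled it via ceph.conf so it survives mon restarts,
along these lines:

[mon]
    debug ms = 1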

 Perhaps the mon lost osd.1 because it was too slow, but that hadn't
 happened in any of the many previous slow request instances, and the
 timing doesn't look quite right: the mon complains it hasn't heard from
 osd.0 since 20:11:19, but the osd.0 log shows no problems at all, then
 the mon complains about not having heard from osd.1 since 20:11:21,
 whereas the first indication of trouble on osd.1 was the request from
 20:26:20 not being processed in a timely fashion.
 
 My guess is the above was a side-effect of osd.0 being marked out.   On 
 0.56.2 there is some strange peering workqueue laggyness that could 
 potentially contribute as well.  I recommend moving to 0.56.3.

Upgraded to 0.56.3.

 Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
 restarting the osds ('service ceph restart osd' on each osd host).
 
 The immediate issue was resolved by restarting ceph completely on one of
 the mon/osd hosts (service ceph restart). Possibly a restart of just the
 mon would have been sufficient.
 
 Did you notice that the osds you restarted didn't immediately mark 
 themselves in?  Again, it could be explained by the peering wq issue, 
 especially if there are pools in your cluster that are not getting any IO.

Sorry, no. I was kicking myself later for losing the 'ceph -s' output 
when I killed that terminal session but in the heat of the moment...

I can't see anything in the logs from the time about osds marking
themselves in (there was no debugging enabled), but I'm on my iPad at the
moment so I could easily have missed it. Should that info be in the logs
somewhere?

There's certainly unused pools: we're only using the rbd pool and so the
default data and metadata pools are unused.

Thanks for your attention!

Cheers,

Chris


Re: Crash and strange things on MDS

2013-02-15 Thread Kevin Decherf
On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote:
 On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote:
  On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
  On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
   Furthermore, I observe another strange thing more or less related to the
   storms.
  
   During a rsync command to write ~20G of data on Ceph and during (and
   after) the storm, one OSD sends a lot of data to the active MDS
   (400Mbps peak each 6 seconds). After a quick check, I found that when I
   stop osd.23, osd.14 stops its peaks.
 
  This is consistent with Sam's suggestion that MDS is thrashing its
  cache, and is grabbing a directory object off of the OSDs. How large
  are the directories you're using? If they're a significant fraction of
  your cache size, it might be worth enabling the (sadly less stable)
  directory fragmentation options, which will split them up into smaller
  fragments that can be independently read and written to disk.
 
  I set mds cache size to 40 but now I observe ~900Mbps peaks from
  osd.14 to the active mds, osd.18 and osd.2.
 
  osd.14 shares some pg with osd.18 and osd.2:
  http://pastebin.com/raw.php?i=uBAcTcu4
 
 The high bandwidth from OSD to MDS really isn't a concern — that's the
 MDS asking for data and getting it back quickly! We're concerned about
 client responsiveness; has that gotten better?

It seems better now, I didn't see any storm so far.

But we observe high latency on some of our clients (with no load). Is
there any documentation on how to read the perfcounters_dump output? I
would like to know whether the MDS still has a problem with its cache or
whether the latency comes from elsewhere.
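
For anyone else poking at this, I'm pulling the counters through the admin
socket, roughly like this (the socket path follows the default layout;
adjust the mds name):

$ ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok perfcounters_dump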

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-02-15 Thread Isaac Otsiabah


Hello Sam and Gregory, I got machines today and tested it with the monitor
process running on a separate system with no osd daemons, and I did not see
the problem. On Monday I will do a few tests to confirm.

Isaac



- Original Message -
From: Sam Lang sam.l...@inktank.com
To: Isaac Otsiabah zmoo...@yahoo.com
Cc: Gregory Farnum g...@inktank.com; ceph-devel@vger.kernel.org 
ceph-devel@vger.kernel.org
Sent: Friday, February 15, 2013 9:20 AM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to 
my cluster

On Mon, Feb 11, 2013 at 7:39 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:


 Yes, there were osd daemons running on the same node that the monitor was
 running on.  If that is the case then I will run a test case with the
 monitor running on a different node where no osd is running and see what
 happens. Thank you.

Hi Isaac,

Any luck?  Does the problem reproduce with the mon running on a separate host?
-sam


 Isaac

 
 From: Gregory Farnum g...@inktank.com
 To: Isaac Otsiabah zmoo...@yahoo.com
 Cc: ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
 Sent: Monday, February 11, 2013 12:29 PM
 Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host 
 to my cluster

 Isaac,
 I'm sorry I haven't been able to wrangle any time to look into this
 more yet, but Sage pointed out in a related thread that there might be
 some buggy handling of things like this if the OSD and the monitor are
 located on the same host. Am I correct in assuming that with your
 small cluster, all your OSDs are co-located with a monitor daemon?
 -Greg

 On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:


 Gregory, I recreated the osd down problem again this morning on two nodes
 (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1,
 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a
 minute and a half after osd.3, 4, 5 were added. I have included the
 routing table of each node at the time osd.1 went down. ceph.conf and
 ceph-osd.1.log files are attached. The crush map was default. Also, it
 could be a timing issue because it does not always fail when using the
 default crush map; it takes several trials before you see it. Thank you.


 [root@g13ct ~]# netstat -r
 Kernel IP routing table
 Destination     Gateway         Genmask         Flags   MSS Window  irtt 
 Iface
 default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
 link-local      *               255.255.0.0     U         0 0          0 eth3
 link-local      *               255.255.0.0     U         0 0          0 eth0
 link-local      *               255.255.0.0     U         0 0          0 eth2
 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
 [root@g13ct ~]# ceph osd tree

 # id    weight  type name       up/down reweight
 -1      6       root default
 -3      6               rack unknownrack
 -2      3                       host g13ct
 0       1                               osd.0   up      1
 1       1                               osd.1   down    1
 2       1                               osd.2   up      1
 -4      3                       host g14ct
 3       1                               osd.3   up      1
 4       1                               osd.4   up      1
 5       1                               osd.5   up      1



 [root@g14ct ~]# ceph osd tree

 # id    weight  type name       up/down reweight
 -1      6       root default
 -3      6               rack unknownrack
 -2      3                       host g13ct
 0       1                               osd.0   up      1
 1       1                               osd.1   down    1
 2       1                               osd.2   up      1
 -4      3                       host g14ct
 3       1                               osd.3   up      1
 4       1                               osd.4   up      1
 5       1                               osd.5   up      1

 [root@g14ct ~]# netstat -r
 Kernel IP routing table
 Destination     Gateway         Genmask         Flags   MSS Window  irtt 
 Iface
 default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
 link-local      *               255.255.0.0     U         0 0          0 eth3
 link-local      *               255.255.0.0     U         0 0          0 eth5
 link-local      *               255.255.0.0     U         0 0          0 eth0
 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
 192.0.0.0       *               

Re: [0.48.3] OSD memory leak when scrubbing

2013-02-15 Thread Andrey Korolyov
Can anyone who hit this bug please confirm that your system contains libc 2.15+?

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han han.sebast...@gmail.com wrote:
 oh nice, the pattern also matches path :D, didn't know that
 thanks Greg
 --
 Regards,
 Sébastien Han.


 On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum g...@inktank.com wrote:
 Set your /proc/sys/kernel/core_pattern file. :) 
 http://linux.die.net/man/5/core
 -Greg
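
 For example, pointing dumps at a filesystem with more room (the path is
 just an example):

 # echo '/data/cores/core.%e.%p.%t' > /proc/sys/kernel/core_pattern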

 On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com 
 wrote:
 ok I finally managed to get something on my test cluster;
 unfortunately, the dump goes to /

 any idea how to change the destination path?

 My production / won't be big enough...

 --
 Regards,
 Sébastien Han.


 On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick dan.m...@inktank.com wrote:
 ...and/or do you have the corepath set interestingly, or one of the
 core-trapping mechanisms turned on?


 On 02/04/2013 11:29 AM, Sage Weil wrote:

 On Mon, 4 Feb 2013, Sébastien Han wrote:

 Hum just tried several times on my test cluster and I can't get any
 core dump. Does Ceph commit suicide or something? Is it expected
 behavior?


 SIGSEGV should trigger the usual path that dumps a stack trace and then
 dumps core.  Was your ulimit -c set before the daemon was started?

 sage



 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han han.sebast...@gmail.com
 wrote:

 Hi Loïc,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow
 :-).

 Cheer
 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han han.sebast...@gmail.com
 wrote:

 Hi Loïc,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow
 :-).

 Cheers

 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote:


 Hi,

 As discussed during FOSDEM, the script you wrote to kill the OSD when it
 grows too much could be amended to core dump instead of just being killed
 and restarted. The binary + core could probably be used to figure out
 where the leak is.

 You should make sure the OSD current working directory is in a file
 system with enough free disk space to accommodate the dump, and set

 ulimit -c unlimited

 before running it (your system default is probably ulimit -c 0, which
 inhibits core dumps). When you detect that the OSD grows too much, kill
 it with

 kill -SEGV $pid

 and upload the core found in the working directory, together with the
 binary, to a public place. If the osd binary is compiled with -g but
 without changing the -O settings, you should have a larger binary file
 but no negative impact on performance. Forensic analysis will be made a
 lot easier with the debugging symbols.
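
 A minimal sketch of such a watchdog (the threshold, interval and pidof
 lookup are assumptions to adapt):

 #!/bin/sh
 # Watchdog sketch: SIGSEGV a ceph-osd once its RSS crosses a limit, so it
 # dumps core instead of just being restarted.
 # Assumes the OSD itself was started with 'ulimit -c unlimited' in effect.
 LIMIT_KB=2000000              # ~2 GB, adjust to taste
 PID=$(pidof ceph-osd)         # assumes a single OSD process on this host
 while sleep 60; do
     RSS=$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)
     [ "$RSS" -gt "$LIMIT_KB" ] && { kill -SEGV "$PID"; break; }
 done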

 My 2cts

 On 01/31/2013 08:57 PM, Sage Weil wrote:

 On Thu, 31 Jan 2013, Sylvain Munaut wrote:

 Hi,

 I disabled scrubbing using

 ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
 ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'


 and the leak seems to be gone.

 See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
 memory
 for the 12 osd processes over the last 3.5 days.
 Memory was rising every 24h. I did the change yesterday around 13h00
 and OSDs stopped growing. OSD memory even seems to go down slowly by
 small blocks.

 Of course I assume disabling scrubbing is not a long term solution
 and
 I should re-enable it ... (how do I do that btw ? what were the
 default values for those parameters)


 It depends on the exact commit you're on.  You can see the defaults if
 you do

   ceph-osd --show-config | grep osd_scrub
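
 To re-enable it later, the same injectargs mechanism can push the stock
 values back in, e.g. (numbers illustrative; use whatever --show-config
 reports for your build):

   ceph osd tell \* injectargs '--osd-scrub-min-interval 300'
   ceph osd tell \* injectargs '--osd-scrub-max-interval 86400'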

 Thanks for testing this... I have a few other ideas to try to
 reproduce.

 sage


 --
 Loïc Dachary, Artisan Logiciel Libre



