Re: Write Replication on Degraded PGs
Further to my question about reads on a degraded PG, my tests show that reads from rgw do indeed fail when not all OSDs in a PG are up, even when the data is physically available on an up/in OSD.

I have a size and min_size of 2 on my pool, and 2 hosts with 2 OSDs on each. The crush map is set to write to 1 OSD on each of the 2 hosts. After writing a file successfully to rgw via host 1, I then stop all Ceph services on host 2. Attempts to read the file I just wrote time out after 30 seconds. Starting Ceph again on host 2 allows reads to proceed from host 1 once again.

I see the following in ceph.log after the read times out:

2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630: osd_op(client.4345.0:21511 4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read 0~524288] 9.5aaf1592 RETRY) v4 currently reached pg

After stopping Ceph on host 2, ceph -s reports:

   health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded (0.647%)
   monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
   osdmap e155: 4 osds: 2 up, 2 in
   pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16 incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail; 44/6804 degraded (0.647%)
   mdsmap e1: 0/0/1 up

OSD tree, just in case:

# id    weight  type name               up/down reweight
-1      2       root default
-3      2         rack unknownrack
-2      1           host squeezeceph1
0       1             osd.0             up      1
2       1             osd.2             up      1
-4      1           host squeezeceph2
1       1             osd.1             down    0
3       0             osd.3             down    0

Running 'osd map' on both the container and object names says host 1 is acting for that PG (not sure if I'm looking at the right pools, though):

$ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c
osdmap e155 pool '.rgw.buckets' (9) object 'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up [0] acting [0]

$ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc
osdmap e155 pool '.rgw' (3) object '91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up [2] acting [2]

Any thoughts? It doesn't seem right that taking out a single failure domain should cause this degradation.

Many thanks,

Ben

On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland ben.rowl...@gmail.com wrote:

On 13 Feb 2013 18:16, Gregory Farnum g...@inktank.com wrote:

On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

So it sounds from the rest of your post like you'd want to, for each pool that RGW uses (it's not just .rgw), run "ceph osd set .rgw min_size 2" (and for .rgw.buckets, etc etc).

Thanks, that did the trick. When the number of up OSDs is less than min_size, writes block for 30s then return HTTP 500. Ceph honours my crush rule in this case - adding more OSDs to only one of two failure domains continues to block writes - all well and good!

If this is the expected behaviour of Ceph, then it seems to prefer write-availability over read-availability (in this case my data is only stored on 1 OSD, thus a SPOF). Is there any way to change this trade-off, e.g. as you can in Cassandra with its write quorums?

I'm not quite sure this is describing it correctly — Ceph guarantees that anything that's been written to disk will be readable later on, and placement groups won't go active if they can't retrieve all data. The sort of flexible policies allowed by Cassandra aren't possible within Ceph — it is a strictly consistent system.

Are objects always readable even if a PG is missing some OSDs, and where it cannot recover?
Example: 2 hosts each with 1 osd, pool min_size is 2, with a crush rule saying to write to both hosts. I write a file successfully, then one host goes down, and eventually is marked 'out'. Is the file readable on the 'up' host (say if I'm running rgw there)? What if the up host does not have the primary copy?

Furthermore, if Ceph is strictly consistent, how would it resolve possible stale reads? Say, in the 2-host example, the network connection died, but min_size was set to 1. Would it be possible for writes to proceed, say making edits to an existing object? Could readers at the other host see stale data?

Thanks again in advance,

Ben
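For reference, a minimal sketch of the pool-level commands discussed in this thread. The pool names assume a default RGW setup, and the exact form ("ceph osd pool set ... min_size ...") may vary slightly by release, so treat this as an outline rather than a recipe:

    $ ceph osd lspools
    $ ceph osd dump | grep '^pool'                 # shows size and min_size for each pool
    $ ceph osd pool set .rgw min_size 2
    $ ceph osd pool set .rgw.buckets min_size 2

The same pattern applies to the other pools RGW creates (.rgw.control, .users, and so on, depending on the version).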
OSDs down after upgrade to 0.56.3
Hi All,

Please: I got this result after I did an upgrade to 0.56.3. I'm not sure if it's a problem with the upgrade or something else.

# ceph osd tree
# id    weight  type name           up/down reweight
-1      96      root default
-3      96        rack unknownrack
-2      4           host server109
0       1             osd.0   DNE
1       1             osd.1   DNE
2       1             osd.2   DNE
3       1             osd.3   up      1
-4      4           host server111
10      1             osd.10  DNE
11      1             osd.11  DNE
8       1             osd.8   DNE
9       1             osd.9   up      1
-5      4           host server112
12      1             osd.12  DNE
13      1             osd.13  DNE
14      1             osd.14  DNE
15      1             osd.15  up      1
-6      4           host server113
16      1             osd.16  DNE
17      1             osd.17  DNE
18      1             osd.18  DNE
19      1             osd.19  up      1
-7      4           host server114
20      1             osd.20  DNE
21      1             osd.21  DNE
22      1             osd.22  DNE
23      1             osd.23  up      1
-8      4           host server115
24      1             osd.24  DNE
25      1             osd.25  DNE
26      1             osd.26  DNE
27      1             osd.27  up      1
-9      4           host server116
28      1             osd.28  DNE
29      1             osd.29  DNE
30      1             osd.30  DNE
31      1             osd.31  up      1
-10     4           host server209
32      1             osd.32  DNE
33      1             osd.33  DNE
34      1             osd.34  DNE
35      1             osd.35  up      1
-11     4           host server210
36      1             osd.36  DNE
37      1             osd.37  DNE
38      1             osd.38  DNE
39      1             osd.39  up      1
-12     4           host server110
4       1             osd.4   DNE
5       1             osd.5   DNE
6       1             osd.6   DNE
7       1             osd.7   up      1
-13     4           host server211
40      1             osd.40  DNE
41      1             osd.41  DNE
42      1             osd.42  DNE
43      1             osd.43  up      1
-14     4           host server212
44      1             osd.44  DNE
45      1             osd.45  DNE
46      1             osd.46  DNE
47      1             osd.47  up      1
-15     4           host server213
48      1             osd.48  DNE
49      1             osd.49  DNE
50      1             osd.50  DNE
51      1             osd.51  up      1
-16     4           host server214
52      1             osd.52  DNE
53      1             osd.53  DNE
54      1             osd.54  DNE
55      1             osd.55  up      1
-17     4           host server215
56      1             osd.56  DNE
57      1             osd.57  DNE
58      1             osd.58  DNE
59      1             osd.59  up      1
-18     4           host server216
60      1             osd.60  DNE
61      1             osd.61  DNE
62      1             osd.62  DNE
63      1             osd.63  up      1
-19     4           host server309
64      1             osd.64  DNE
65      1             osd.65  DNE
66      1             osd.66  DNE
67      1             osd.67  up      1
-20     4           host server310
68      1             osd.68  DNE
69      1             osd.69  DNE
70      1             osd.70  DNE
71      1             osd.71  up      1
-21     4
Re: OSDs down after upgrade to 0.56.3
Hi Femi,

> Pls I got this result after i did an upgrade to 0.56.3. I'm not sure
> -2 4 host server109
> 0 1 osd.0 DNE
> 1 1 osd.1 DNE
> 2 1 osd.2 DNE

Your OSDs are not marked down - they're listed as "do not exist". Did you previously run a "ceph osd rm" command on those osds?

Note: This topic is probably best suited for the ceph-users list, so I have cross-posted to that list.

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/
Re: OSDs down after upgrade to 0.56.3
femi,

CC'ing ceph-users, as this discussion probably belongs there.

Could you send a copy of your crushmap? DNE is typically what we see when someone explicitly removes an osd with something like 'ceph osd rm 90' (DNE = Does Not Exist).

Also, out of curiosity, how did you upgrade your cluster? One box at a time? Take the whole thing down and upgrade everything? Something else? Just interested to see how you got to where you are.

Feel free to stop by irc and ask for scuttlemonkey if you want a more direct discussion. Thanks.

Best Regards,

Patrick

On Fri, Feb 15, 2013 at 8:50 AM, femi anjorin femi.anjo...@gmail.com wrote:
> Hi All, Pls I got this result after i did an upgrade to 0.56.3. I'm not sure if its a problem with upgrade or some other things.
>
> # ceph osd tree
> [...]
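For anyone following along, the crushmap Patrick asks for can be pulled from the cluster and decompiled with the standard tools; a quick sketch (the output paths are just examples):

    $ ceph osd getcrushmap -o /tmp/crushmap.bin
    $ crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
    $ ceph osd dump | grep '^osd\.'    # osds removed with 'ceph osd rm' should no longer be listed here

Comparing the decompiled crushmap against the osd dump output usually shows whether the DNE entries are leftovers in the crushmap from removed osds.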
Re: [PATCH] config: Add small note about default number of PGs
On Sat, Feb 9, 2013 at 1:55 PM, Wido den Hollander w...@42on.com wrote:

From: Wido den Hollander w...@widodh.nl

It's still not clear to end users that this should go into the mon or global section of ceph.conf. Until this gets resolved, document it here as well for the people who look up their settings in the source code.

Signed-off-by: Wido den Hollander w...@42on.com
---
 src/common/config_opts.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index ce3bca2..0fc07a3 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -317,8 +317,8 @@ OPTION(osd_max_rep, OPT_INT, 10)
 OPTION(osd_pool_default_crush_rule, OPT_INT, 0)
 OPTION(osd_pool_default_size, OPT_INT, 2)
 OPTION(osd_pool_default_min_size, OPT_INT, 0) // 0 means no specific default; ceph will use size-size/2
-OPTION(osd_pool_default_pg_num, OPT_INT, 8)
-OPTION(osd_pool_default_pgp_num, OPT_INT, 8)
+OPTION(osd_pool_default_pg_num, OPT_INT, 8) // number of PGs for new pools. Configure in global or mon section of ceph.conf
+OPTION(osd_pool_default_pgp_num, OPT_INT, 8) // number of PGs for placement purposes. Should be equal to pg_num
 OPTION(osd_map_dedup, OPT_BOOL, true)
 OPTION(osd_map_cache_size, OPT_INT, 500)
 OPTION(osd_map_message_max, OPT_INT, 100) // max maps per MOSDMap message
--
1.7.9.5

Applied to master. Thanks!

-sam
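To make the note concrete: the setting is read by the monitors when a pool is created, so it belongs in ceph.conf under [global] (or [mon]) and only affects pools created afterwards. A sketch, where 128 is purely an illustrative value and should be chosen to suit the cluster:

    [global]
            osd pool default pg num = 128
            osd pool default pgp num = 128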
Re: osd down (for about 2 minutes) error after adding a new host to my cluster
On Mon, Feb 11, 2013 at 7:39 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:
> Yes, there were osd daemons running on the same node that the monitor was running on. If that is the case then i will run a test case with the monitor running on a different node where no osd is running and see what happens. Thank you.

Hi Isaac,

Any luck? Does the problem reproduce with the mon running on a separate host?

-sam

> Isaac
>
> From: Gregory Farnum g...@inktank.com
> To: Isaac Otsiabah zmoo...@yahoo.com
> Cc: ceph-devel@vger.kernel.org
> Sent: Monday, February 11, 2013 12:29 PM
> Subject: Re: osd down (for about 2 minutes) error after adding a new host to my cluster
>
> Isaac,
> I'm sorry I haven't been able to wrangle any time to look into this more yet, but Sage pointed out in a related thread that there might be some buggy handling of things like this if the OSD and the monitor are located on the same host. Am I correct in assuming that with your small cluster, all your OSDs are co-located with a monitor daemon?
> -Greg
>
> On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:
>> Gregory, i recreated the osd down problem again this morning on two nodes (g13ct, g14ct). First, i created a 1-node cluster on g13ct (with osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and a half after osd.3, 4, 5 were added. I have included the routing table of each node at the time osd.1 went down. The ceph.conf and ceph-osd.1.log files are attached. The crush map was default. Also, it could be a timing issue because it does not always fail when using the default crush map; it takes several trials before you see it. Thank you.
>>
>> [root@g13ct ~]# netstat -r
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags MSS Window irtt Iface
>> default         133.164.98.250  0.0.0.0         UG      0 0         0 eth2
>> 133.164.98.0    *               255.255.255.0   U       0 0         0 eth2
>> link-local      *               255.255.0.0     U       0 0         0 eth3
>> link-local      *               255.255.0.0     U       0 0         0 eth0
>> link-local      *               255.255.0.0     U       0 0         0 eth2
>> 192.0.0.0       *               255.0.0.0       U       0 0         0 eth3
>> 192.0.0.0       *               255.0.0.0       U       0 0         0 eth0
>> 192.168.0.0     *               255.255.255.0   U       0 0         0 eth3
>> 192.168.1.0     *               255.255.255.0   U       0 0         0 eth0
>>
>> [root@g13ct ~]# ceph osd tree
>> # id  weight  type name         up/down reweight
>> -1    6       root default
>> -3    6         rack unknownrack
>> -2    3           host g13ct
>> 0     1             osd.0   up      1
>> 1     1             osd.1   down    1
>> 2     1             osd.2   up      1
>> -4    3           host g14ct
>> 3     1             osd.3   up      1
>> 4     1             osd.4   up      1
>> 5     1             osd.5   up      1
>>
>> [root@g14ct ~]# ceph osd tree
>> # id  weight  type name         up/down reweight
>> -1    6       root default
>> -3    6         rack unknownrack
>> -2    3           host g13ct
>> 0     1             osd.0   up      1
>> 1     1             osd.1   down    1
>> 2     1             osd.2   up      1
>> -4    3           host g14ct
>> 3     1             osd.3   up      1
>> 4     1             osd.4   up      1
>> 5     1             osd.5   up      1
>>
>> [root@g14ct ~]# netstat -r
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags MSS Window irtt Iface
>> default         133.164.98.250  0.0.0.0         UG      0 0         0 eth0
>> 133.164.98.0    *               255.255.255.0   U       0 0         0 eth0
>> link-local      *               255.255.0.0     U       0 0         0 eth3
>> link-local      *               255.255.0.0     U       0 0         0 eth5
>> link-local      *               255.255.0.0     U       0 0         0 eth0
>> 192.0.0.0       *               255.0.0.0       U       0 0         0 eth3
>> 192.0.0.0       *               255.0.0.0       U       0 0         0 eth5
>> 192.168.0.0     *               255.255.255.0   U       0 0         0 eth3
>> 192.168.1.0     *               255.255.255.0   U       0 0         0 eth5
>>
>> [root@g14ct ~]# ceph osd tree
>> # id  weight  type name         up/down reweight
>> -1    6       root default
>> -3    6         rack unknownrack
>> -2    3           host g13ct
>> 0     1             osd.0   up      1
>> 1     1             osd.1   down    1
>> 2     1
Re: radosgw: Update a key's meta data
On Thu, Feb 14, 2013 at 6:37 AM, Sylvain Munaut s.mun...@whatever-company.com wrote:

> Hi,
>
> I was wondering how I could update a key's metadata, like the Content-Type. The solution on S3 seems to be to copy the key onto itself, replacing the metadata. If I do that in Ceph, will it work? And more importantly, will it be done intelligently (i.e. without copying the actual file data around)?

Same API in S3: copying an object onto itself. For Swift there's a POST request with updated metadata. Data (other than the first chunk) isn't really copied.

> I tried reading the code, but although part of the code seems to hint at support for this (in rgw_rest_s3.cc), some other part seems not to look at all at whether src == dst (like rgw_op.cc).

Right. It is actually missing, and it's a bug. I've opened ceph issue #4150.

Thanks,
Yehuda
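As an illustration of the Swift-style update Yehuda mentions, here is a rough sketch of the POST request against radosgw's Swift API. The endpoint, token, container, object name, and the metadata values are all placeholders, and note that a Swift object POST replaces the object's user metadata as a whole:

    $ curl -i -X POST \
        -H "X-Auth-Token: <token>" \
        -H "Content-Type: image/png" \
        -H "X-Object-Meta-Color: blue" \
        http://radosgw.example.com/swift/v1/mycontainer/myobject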
Re: The Ceph Census
Hey folks. We've gotten nearly 50 responses so far, and the data is proving to be quite interesting! I will share it on the blog early next week.

The survey will be open until next Monday so that everyone has an opportunity to participate. If you haven't gotten around to adding your cluster, you still can - it's a pretty short list of questions, shouldn't take more than a minute or two.

http://ceph.com/census

Thanks,
Ross

On Feb 13, 2013, at 7:06 PM, Ross David Turk r...@inktank.com wrote:

> Hi! It's been a while since my last poll about Ceph deployments and use cases. Since there are so many more of us now, I think it's a good time to do it again. This time, I've set up a survey.
>
> I am particularly interested in how many deployments of Ceph there are, how much underlying storage they manage, whether they're in production, and how people plan to use them. It's a very short survey (10 questions, mostly multiple-choice), and shouldn't take more than a minute or two. The results will be public, and I think it will help us all figure out how to focus our efforts.
>
> http://ceph.com/census
>
> This information will be compiled and published on the Ceph blog for all to review and enjoy. Thanks!
>
> Cheers,
> Ross
>
> --
> Ross Turk
> Community, Inktank
> @rossturk @inktank @ceph
Re: Mon losing touch with OSDs
G'day Sage,

On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds?
>
> Can you enable 'debug ms = 1' on the mons and leave them that way, in the hopes that this happens again? It will give us more information to go on.

Debug turned on.

>> Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in any of the many previous slow-request instances, and the timing doesn't look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but the osd.0 log shows no problems at all; then the mon complains about not having heard from osd.1 since 20:11:21, whereas the first indication of trouble on osd.1 was the request from 20:26:20 not being processed in a timely fashion.
>
> My guess is the above was a side-effect of osd.0 being marked out. On 0.56.2 there is some strange peering workqueue lagginess that could potentially contribute as well. I recommend moving to 0.56.3.

Upgraded to 0.56.3.

>> Trying to manually set the osds in (e.g. "ceph osd in 0") didn't help, nor did restarting the osds ('service ceph restart osd' on each osd host). The immediate issue was resolved by restarting ceph completely on one of the mon/osd hosts (service ceph restart). Possibly a restart of just the mon would have been sufficient.
>
> Did you notice that the osds you restarted didn't immediately mark themselves in? Again, it could be explained by the peering wq issue, especially if there are pools in your cluster that are not getting any IO.

Sorry, no. I was kicking myself later for losing the 'ceph -s' output when I killed that terminal session, but in the heat of the moment... I can't see anything about osds marking themselves in in the logs from the time (with no debugging), but I'm on my iPad at the moment so I could easily have missed it. Should that info be in the logs somewhere?

There are certainly unused pools: we're only using the rbd pool, so the default data and metadata pools are unused.

Thanks for your attention!

Cheers,

Chris
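A sketch of how the requested setting can be left in place persistently, as Sage asks: add it to the mon section of ceph.conf on each monitor host and restart the mons (it can also be injected at runtime, with syntax that varies by version, but the config-file route is what "leave them that way" implies):

    [mon]
            debug ms = 1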
Re: Crash and strange things on MDS
On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote:
> On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote:
>> On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
>>> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
>>>> Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph, and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak every 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks.
>>>
>>> This is consistent with Sam's suggestion that the MDS is thrashing its cache and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk.
>>
>> I set mds cache size to 40 but now I observe ~900Mbps peaks from osd.14 to the active mds, osd.18 and osd.2. osd.14 shares some pgs with osd.18 and osd.2: http://pastebin.com/raw.php?i=uBAcTcu4
>
> The high bandwidth from OSD to MDS really isn't a concern — that's the MDS asking for data and getting it back quickly! We're concerned about client responsiveness; has that gotten better?

It seems better now; I didn't see any storm so far. But we observe high latency on some of our clients (with no load).

Does any documentation exist on how to read the perfcounters_dump output? I would like to know if the MDS still has a problem with its cache or if the latency comes from elsewhere.

--
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
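On reading the counters: they can be dumped as JSON through the MDS admin socket; a quick sketch, where the socket path and daemon id are examples that depend on your configuration:

    $ ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok perfcounters_dump

The output is grouped by subsystem (e.g. mds, mds_cache, mds_log, objecter), which should at least show whether the cache-related counters are the ones moving while the latency is observed.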
Re: osd down (for about 2 minutes) error after adding a new host to my cluster
Hello Sam and Gregory,

I got machines today and tested it with the monitor process running on a separate system with no osd daemons, and I did not see the problem. On Monday I will do a few tests to confirm.

Isaac

----- Original Message -----
From: Sam Lang sam.l...@inktank.com
To: Isaac Otsiabah zmoo...@yahoo.com
Cc: Gregory Farnum g...@inktank.com; ceph-devel@vger.kernel.org
Sent: Friday, February 15, 2013 9:20 AM
Subject: Re: osd down (for about 2 minutes) error after adding a new host to my cluster

On Mon, Feb 11, 2013 at 7:39 PM, Isaac Otsiabah zmoo...@yahoo.com wrote:
> Yes, there were osd daemons running on the same node that the monitor was running on. If that is the case then i will run a test case with the monitor running on a different node where no osd is running and see what happens. Thank you.

Hi Isaac,

Any luck? Does the problem reproduce with the mon running on a separate host?

-sam

> [...]
Re: [0.48.3] OSD memory leak when scrubbing
Can anyone who hit this bug please confirm that your system contains libc 2.15+?

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han han.sebast...@gmail.com wrote:

oh nice, the pattern also matches paths :D, didn't know that. thanks Greg
--
Regards,
Sébastien Han.

On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum g...@inktank.com wrote:

Set your /proc/sys/kernel/core_pattern file. :)
http://linux.die.net/man/5/core
-Greg

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com wrote:

ok, I finally managed to get something on my test cluster; unfortunately, the dump goes to /. Any idea how to change the destination path? My production / won't be big enough...
--
Regards,
Sébastien Han.

On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick dan.m...@inktank.com wrote:

...and/or do you have the corepath set interestingly, or one of the core-trapping mechanisms turned on?

On 02/04/2013 11:29 AM, Sage Weil wrote:

On Mon, 4 Feb 2013, Sébastien Han wrote:

Hum, just tried several times on my test cluster and I can't get any core dump. Does Ceph commit suicide or something? Is it expected behavior?

SIGSEGV should trigger the usual path that dumps a stack trace and then dumps core. Was your ulimit -c set before the daemon was started?

sage
--
Regards,
Sébastien Han.

On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han han.sebast...@gmail.com wrote:

Hi Loïc,
Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
Cheers
--
Regards,
Sébastien Han.

On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han han.sebast...@gmail.com wrote:

Hi Loïc,
Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
Cheers
--
Regards,
Sébastien Han.

On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote:

Hi,

As discussed during FOSDEM, the script you wrote to kill the OSD when it grows too much could be amended to core dump instead of just being killed and restarted. The binary + core could probably be used to figure out where the leak is. You should make sure the OSD current working directory is in a file system with enough free disk space to accommodate the dump, and set "ulimit -c unlimited" before running it (your system default is probably "ulimit -c 0", which inhibits core dumps). When you detect that the OSD has grown too much, kill it with "kill -SEGV $pid" and upload the core found in the working directory, together with the binary, to a public place. If the osd binary is compiled with -g but without changing the -O settings, you should have a larger binary file but no negative impact on performance. Forensic analysis will be made a lot easier with the debugging symbols.

My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote:

On Thu, 31 Jan 2013, Sylvain Munaut wrote:

Hi,

I disabled scrubbing using

ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I did the change yesterday around 13h00 and OSDs stopped growing. OSD memory even seems to go down slowly by small blocks.

Of course I assume disabling scrubbing is not a long-term solution and I should re-enable it... (how do I do that, btw? what were the default values for those parameters?)

It depends on the exact commit you're on. You can see the defaults if you do

ceph-osd --show-config | grep osd_scrub

Thanks for testing this... I have a few other ideas to try to reproduce.

sage

--
Loïc Dachary, Artisan Logiciel Libre
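Pulling the core-dump recipe from this thread together in one place, as a sketch only: the core path, the <pid> and the interval values are placeholders, and the scrub interval defaults depend on the build, as Sage notes.

    # allow core dumps and send them somewhere with enough space, before starting the osd
    ulimit -c unlimited
    echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

    # when the osd has grown too large, force a dump for later analysis
    kill -SEGV <osd pid>

    # check the scrub interval defaults for your build, then re-enable scrubbing by resetting them
    ceph-osd --show-config | grep osd_scrub
    ceph osd tell \* injectargs '--osd-scrub-min-interval <seconds>'
    ceph osd tell \* injectargs '--osd-scrub-max-interval <seconds>'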