Re: [ceph-users] PG's incomplete after OSD failure
Would love to hear if you discover a way to zap incomplete PGs! Perhaps this is a common enough problem to open a tracker issue?

Chad.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pg's stuck for 4-5 days after reaching backfill_toofull
Find out which OSD it is:

ceph health detail

Squeeze blocks off the affected OSD:

ceph osd reweight OSDNUM 0.8

Repeat with any OSD which becomes toofull. Your cluster is only about 50% used, so I think this will be enough. Then, when it finishes, allow data back onto the OSD:

ceph osd reweight OSDNUM 1

Hopefully ceph will someday be taught to move PGs in a better order!

Chad.
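The procedure above could be scripted roughly as follows. This is a sketch only: it assumes a live cluster, and the `awk` pattern assumes `ceph health detail` reports toofull OSDs on lines mentioning `backfill_toofull` — check your output before trusting it.

```shell
# Reweight every OSD that ceph health detail flags as toofull.
# The 0.8 weight is the value suggested in this thread.
for osd in $(ceph health detail | awk '/backfill_toofull/ {print $2}' | sed 's/osd\.//'); do
    ceph osd reweight "$osd" 0.8   # push PGs off the full OSD
done
# ...wait for backfilling to finish, then restore the weight:
# ceph osd reweight OSDNUM 1
```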
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Thanks Craig, I'll jiggle the OSDs around to see if that helps. Otherwise, I'm almost certain removing the pool will work. :/

Have a good one,
Chad.

> I had the same experience with force_create_pg too.
>
> I ran it, and the PGs sat there in creating state. I left the cluster
> overnight, and sometime in the middle of the night, they created. The
> actual transition from creating to active+clean happened during the
> recovery after a single OSD was kicked out. I don't recall if that single
> OSD was responsible for the creating PGs. I really can't say what
> un-jammed my creating PGs.
[ceph-users] long term support version?
Hi all,

Did I notice correctly that Firefly is going to be supported "long term" whereas Giant is not going to be supported as long?

http://ceph.com/releases/v0-80-firefly-released/
  "This release will form the basis for our long-term supported release Firefly, v0.80.x."

http://ceph.com/uncategorized/v0-87-giant-released/
  "This release will form the basis for the stable release Giant, v0.87.x."

Thanks!
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi Craig,

> If all of your PGs now have an empty down_osds_we_would_probe, I'd run
> through this discussion again.

Yep, looks to be true. So I ran:

# ceph pg force_create_pg 2.5

and it has been creating for about 3 hours now. :/

# ceph health detail | grep creating
pg 2.5 is stuck inactive since forever, current state creating, last acting []
pg 2.5 is stuck unclean since forever, current state creating, last acting []

Then I restarted all OSDs. The "creating" label disappears and I'm back with the same number of incomplete PGs. :(

Is 'force_create_pg' the right command? 'mark_unfound_lost' complains that 'pg has no unfound objects'.

I shall start the 'force_create_pg' again and wait longer, unless there is a different command to use?

Thanks!
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi Craig and list,

> > > If you create a real osd.20, you might want to leave it OUT until you
> > > get things healthy again.

I created a real osd.20 (and it turns out I needed an osd.21 also).

'ceph pg x.xx query' no longer lists down OSDs for probing:

"down_osds_we_would_probe": [],

But I cannot find the magic command line which will remove these incomplete PGs. Does anyone know how to remove incomplete PGs?

Thanks!
Chad.
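Rather than querying PGs one at a time, the field can be pulled out of saved `ceph pg x.xx query` output mechanically. A minimal sketch, using a hard-coded sample shaped like the fragment quoted in this thread (in real use, pipe the output of `ceph pg <pgid> query` instead of `printf`):

```shell
# Sample of the peering fields from `ceph pg <pgid> query` as quoted here.
pg_query_sample='    "probing_osds": [ "1", "7", "15", "16"],
    "down_osds_we_would_probe": [ 20],'

# Extract the contents of the down_osds_we_would_probe list.
printf '%s\n' "$pg_query_sample" | sed -n 's/.*"down_osds_we_would_probe": \[\(.*\)\],/\1/p'
```

An empty result here is what you want before giving up on recovering data from the missing OSD.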
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi Craig,

> You'll have trouble until osd.20 exists again.
>
> Ceph really does not want to lose data. Even if you tell it the osd is
> gone, ceph won't believe you. Once ceph can probe any osd that claims to
> be 20, it might let you proceed with your recovery. Then you'll probably
> need to use ceph pg mark_unfound_lost.
>
> If you don't have a free bay to create a real osd.20, it's possible to fake
> it with some small loop-back filesystems. Bring it up and mark it OUT. It
> will probably cause some remapping. I would keep it around until you get
> things healthy.
>
> If you create a real osd.20, you might want to leave it OUT until you get
> things healthy again.

Thanks for the recovery tip! I would guess that safely removing an OSD (mark it OUT, wait for migration to stop, then 'ceph osd crush rm') and adding it back in as osd.20 would work?

New switch: --yes-i-really-REALLY-mean-it ;)

Chad.
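The "fake osd.20 on a loop-back filesystem" idea suggested above might look something like the following. This is an outline only: the paths, image size, and bootstrap steps are my assumptions (not from the thread), it needs root and a live firefly-era cluster, and you should sanity-check each step before running it.

```shell
# Back the fake OSD with a small loop-mounted filesystem.
dd if=/dev/zero of=/var/tmp/osd20.img bs=1M count=1024
mkfs.xfs /var/tmp/osd20.img
mkdir -p /var/lib/ceph/osd/ceph-20
mount -o loop /var/tmp/osd20.img /var/lib/ceph/osd/ceph-20

ceph osd create                 # should hand back id 20 if that id is free
ceph-osd -i 20 --mkfs --mkkey   # initialize the data directory and key
ceph auth add osd.20 osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/osd/ceph-20/keyring
ceph osd out 20                 # keep it OUT, per the advice above
```

Once the cluster has probed the fake osd.20 and the stuck PGs have peered, the fake OSD can be removed again.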
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi Sam,

> > Amusingly, that's what I'm working on this week.
> >
> > http://tracker.ceph.com/issues/7862

Well, thanks for any bugfixes in advance! :)

> Also, are you certain that osd 20 is not up?
> -Sam

Yep.

# ceph osd metadata 20
Error ENOENT: osd.20 does not exist

So part of ceph thinks osd.20 doesn't exist, but another part (the down_osds_we_would_probe) thinks the osd exists and is down?

In other news, my min_size was set to 1, so the same fix might not apply to me. Instead I set the pool size from 2 to 1, then back again. Looks like the end result is merely going to be that the down+incomplete PGs get converted to incomplete. :/ I'll let you (and future googlers) know.

Thanks!
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi Sam,

> 'ceph pg query'.

Thanks. Looks like ceph is looking for an osd.20 which no longer exists:

"probing_osds": [ "1", "7", "15", "16"],
"down_osds_we_would_probe": [ 20],

So perhaps during my attempts to rehabilitate the cluster after the upgrade I removed this OSD before it was fully drained?

What way forward? Should I

ceph osd lost {id} [--yes-i-really-mean-it]

and move on?

Thanks for your help!
Chad.
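Spelled out, the "declare it lost and move on" path being asked about might look like this. Destructive and cluster-mutating; the pg id (2.5, from elsewhere in this thread) and the `revert` choice are illustrative assumptions, and as later messages show, mark_unfound_lost only applies when there actually are unfound objects.

```shell
ceph pg 2.5 query | grep -A2 down_osds_we_would_probe   # confirm the PG still wants osd.20
ceph osd lost 20 --yes-i-really-mean-it                 # declare osd.20 permanently gone
ceph pg 2.5 mark_unfound_lost revert                    # then deal with any unfound objects
```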
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi Sam,

> Incomplete usually means the pgs do not have any complete copies. Did
> you previously have more osds?

No. But could OSDs quitting after hitting assert(0 == "we got a bad state machine event"), or interacting with kernel 3.14 clients, have caused the incomplete copies?

How can I probe the fate of one of the incomplete PGs? e.g.

pg 4.152 is incomplete, acting [1,11]

Also, how can I investigate why one osd has a blocked request? The hardware appears normal and the OSD is performing other requests like scrubs without problems. From its log:

2014-11-05 00:57:26.870867 7f7686331700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61440.449534 secs
2014-11-05 00:57:26.870873 7f7686331700 0 log [WRN] : slow request 61440.449534 seconds old, received at 2014-11-04 07:53:26.421301: osd_op(client.11334078.1:592 rb.0.206609.238e1f29.000752e8 [read 512~512] 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg
2014-11-05 00:57:31.816534 7f7665e4a700 0 -- 192.168.164.187:6800/7831 >> 192.168.164.191:6806/30336 pipe(0x44a98780 sd=89 :6800 s=0 pgs=0 cs=0 l=0 c=0x42f482c0).accept connect_seq 14 vs existing 13 state standby
2014-11-05 00:59:10.749429 7f7666e5a700 0 -- 192.168.164.187:6800/7831 >> 192.168.164.191:6800/20375 pipe(0x44a99900 sd=169 :6800 s=2 pgs=443 cs=29 l=0 c=0x42528b00).fault with nothing to send, going to standby
2014-11-05 01:02:09.746857 7f7664d39700 0 -- 192.168.164.187:6800/7831 >> 192.168.164.192:6802/9779 pipe(0x44a98280 sd=63 :6800 s=0 pgs=0 cs=0 l=0 c=0x42f48c60).accept connect_seq 26 vs existing 25 state standby

Greg, I attempted to copy/paste the 'ceph scrub' output for you. Did I get the relevant bits?

Thanks,
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
On Monday, November 03, 2014 17:34:06 you wrote:
> If you have osds that are close to full, you may be hitting 9626. I
> pushed a branch based on v0.80.7 with the fix, wip-v0.80.7-9626.
> -Sam

Thanks Sam, I may have been hitting that as well. I certainly hit too_full conditions often. I am able to squeeze PGs off of the too_full OSD by reweighting, and then eventually all PGs get to where they want to be. Kind of silly that I have to do this manually though. Could Ceph order the PG movements better? (Is this what your bug fix does, in effect?)

So, at the moment there are no PGs moving around the cluster, but not all are active+clean. Also, there is one OSD which has blocked requests. The OSD seems idle, and restarting the OSD just results in a younger blocked request.

~# ceph -s
    cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6
     health HEALTH_WARN 35 pgs down; 208 pgs incomplete; 210 pgs stuck inactive; 210 pgs stuck unclean; 1 requests are blocked > 32 sec
     monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}, election epoch 2996, quorum 0,1,2 mon01,mon02,mon03
     osdmap e115306: 24 osds: 24 up, 24 in
      pgmap v6630195: 8704 pgs, 7 pools, 6344 GB data, 1587 kobjects
            12747 GB used, 7848 GB / 20596 GB avail
                   2 inactive
                8494 active+clean
                 173 incomplete
                  35 down+incomplete

# ceph health detail
...
1 ops are blocked > 8388.61 sec
1 ops are blocked > 8388.61 sec on osd.15
1 osds have slow requests

From the log of the osd with the blocked request (osd.15):

2014-11-04 08:57:26.851583 7f7686331700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 3840.430247 secs
2014-11-04 08:57:26.851593 7f7686331700 0 log [WRN] : slow request 3840.430247 seconds old, received at 2014-11-04 07:53:26.421301: osd_op(client.11334078.1:592 rb.0.206609.238e1f29.000752e8 [read 512~512] 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg

Other requests (like PG scrubs) are happening without taking a long time on this OSD.

Also, this was one of the OSDs which I completely drained, removed from ceph, reformatted, and created again using ceph-deploy. So it is completely created by firefly 0.80.7 code.

As Greg requested, output of ceph scrub:

2014-11-04 09:25:58.761602 7f6c0e20b700 0 mon.mon01@0(leader) e3 handle_command mon_command({"prefix": "scrub"} v 0) v1
2014-11-04 09:26:21.320043 7f6c0ea0c700 1 mon.mon01@0(leader).paxos(paxos updating c 11563072..11563575) accept timeout, calling fresh election
2014-11-04 09:26:31.264873 7f6c0ea0c700 0 mon.mon01@0(probing).data_health(2996) update_stats avail 38% total 6948572 used 3891232 avail 2681328
2014-11-04 09:26:33.529403 7f6c0e20b700 0 log [INF] : mon.mon01 calling new monitor election
2014-11-04 09:26:33.538286 7f6c0e20b700 1 mon.mon01@0(electing).elector(2996) init, last seen epoch 2996
2014-11-04 09:26:38.809212 7f6c0ea0c700 0 log [INF] : mon.mon01@0 won leader election with quorum 0,2
2014-11-04 09:26:40.215095 7f6c0e20b700 0 log [INF] : monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}
2014-11-04 09:26:40.215754 7f6c0e20b700 0 log [INF] : pgmap v6630201: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:40.215913 7f6c0e20b700 0 log [INF] : mdsmap e1: 0/0/1 up
2014-11-04 09:26:40.216621 7f6c0e20b700 0 log [INF] : osdmap e115306: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.227010 7f6c0e20b700 0 log [INF] : pgmap v6630202: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:41.367373 7f6c0e20b700 1 mon.mon01@0(leader).osd e115307 e115307: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.437706 7f6c0e20b700 0 log [INF] : osdmap e115307: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.471558 7f6c0e20b700 0 log [INF] : pgmap v6630203: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:41.497318 7f6c0e20b700 1 mon.mon01@0(leader).osd e115308 e115308: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.533965 7f6c0e20b700 0 log [INF] : osdmap e115308: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.553161 7f6c0e20b700 0 log [INF] : pgmap v6630204: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:42.701720 7f6c0e20b700 1 mon.mon01@0(leader).osd e115309 e115309: 24 osds: 24 up, 24 in
2014-11-04 09:26:42.953977 7f6c0e20b700 0 log [INF] : osdmap e115309: 24 osds: 24 up, 24 in
2014-11-04 09:26:45.776411 7f6c0e20b700 0 log [INF] : pgmap v6630205: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 G
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
> No, it is a change, I just want to make sure I understand the
> scenario. So you're reducing CRUSH weights on full OSDs, and then
> *other* OSDs are crashing on these bad state machine events?

That is right. The other OSDs shut down sometime later (not immediately). I really haven't tested to see if the OSDs will stay up if there are no manipulations. Need to wait for the PGs to settle for awhile, which I haven't done yet.

> >> I don't think it should matter, although I confess I'm not sure how
> >> much monitor load the scrubbing adds. (It's a monitor check; doesn't
> >> hit the OSDs at all.)
> >
> > $ ceph scrub
> > No output.
>
> Oh, yeah, I think that output goes to the central log at a later time.
> (Will show up in ceph -w if you're watching, or can be accessed from
> the monitor nodes; in their data directory I think?)

OK. Will doing ceph scrub again result in the same output? If so, I'll run it again and look for output in ceph -w when the migrations have stopped.

Thanks!
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
On Monday, November 03, 2014 13:50:05 you wrote:
> On Mon, Nov 3, 2014 at 11:41 AM, Chad Seys wrote:
> > On Monday, November 03, 2014 13:22:47 you wrote:
> >> Okay, assuming this is semi-predictable, can you start up one of the
> >> OSDs that is going to fail with "debug osd = 20", "debug filestore =
> >> 20", and "debug ms = 1" in the config file and then put the OSD log
> >> somewhere accessible after it's crashed?
> >
> > Alas, I have not yet noticed a pattern. Only thing I think is true is
> > that they go down when I first make CRUSH changes. Then after
> > restarting, they run without going down again.
> > All the OSDs are running at the moment.
>
> Oh, interesting. What CRUSH changes exactly are you making that are
> spawning errors?

Maybe I miswrote: I've been marking OUT OSDs with blocked requests. Then if an OSD becomes too_full I use 'ceph osd reweight' to squeeze blocks off of the too_full OSD. (Maybe that is not technically a CRUSH map change?)

> I don't think it should matter, although I confess I'm not sure how
> much monitor load the scrubbing adds. (It's a monitor check; doesn't
> hit the OSDs at all.)

$ ceph scrub
No output.

Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
On Monday, November 03, 2014 13:22:47 you wrote:
> Okay, assuming this is semi-predictable, can you start up one of the
> OSDs that is going to fail with "debug osd = 20", "debug filestore =
> 20", and "debug ms = 1" in the config file and then put the OSD log
> somewhere accessible after it's crashed?

Alas, I have not yet noticed a pattern. The only thing I think is true is that they go down when I first make CRUSH changes. Then after restarting, they run without going down again. All the OSDs are running at the moment.

What I've been doing is marking OUT the OSDs on which a request is blocked, letting the PGs recover (draining the OSD of PGs completely), then removing and re-adding the OSD. So far OSDs treated this way no longer have blocked requests. Also, it seems as though that slowly decreases the number of incomplete and down+incomplete PGs.

> Can you also verify that all of your monitors are running firefly, and
> then issue the command "ceph scrub" and report the output?

Sure, should I wait until the current rebalancing is finished?

Thanks,
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
> There's a "ceph osd metadata" command, but i don't recall if it's in
> Firefly or only giant. :)

It's in firefly. Thanks, very handy. All the OSDs are running 0.80.7 at the moment.

What next?

Thanks again,
Chad.
Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem
P.S. The OSDs interacted with some 3.14 krbd clients before I realized that kernel version was too old for the firefly CRUSH map.

Chad.
[ceph-users] emperor -> firefly 0.80.7 upgrade problem
Hi All,

I upgraded from emperor to firefly. The initial upgrade went smoothly and all placement groups were active+clean. Next I executed 'ceph osd crush tunables optimal' to upgrade the CRUSH mapping.

Now I keep having OSDs go down or have requests blocked for long periods of time. I start the down OSDs back up and recovery eventually stops, but with 100s of "incomplete" and "down+incomplete" pgs remaining.

The ceph web page says "If you see this state [incomplete], report a bug, and try to start any failed OSDs that may contain the needed information." Well, all the OSDs are up, though some have blocked requests.

Also, the logs of the OSDs which go down have this message:

2014-11-02 21:46:33.615829 7ffcf0421700 0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0x2faa0280 sd=261 :6810 s=2 pgs=919 cs=25 l=0 c=0x2ed022c0).fault with nothing to send, going to standby
2014-11-02 21:49:11.440142 7ffce4cf3700 0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0xe512a00 sd=249 :6810 s=0 pgs=0 cs=0 l=0 c=0x2a308b00).accept connect_seq 26 vs existing 25 state standby
2014-11-02 21:51:20.085676 7ffcf6e3e700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state::my_context)' thread 7ffcf6e3e700 time 2014-11-02 21:51:20.052242
osd/PG.cc: 5424: FAILED assert(0 == "we got a bad state machine event")

ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state, (boost::statechart::history_mode)0>::my_context)+0x12f) [0x87c6ef]
2: /usr/bin/ceph-osd() [0x8aeae9]
3: (boost::statechart::detail::reaction_result boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, boost::statechart::transition, &boost::statechart::detail::no_context::no_function> >, boost::statechart::simple_state >(boost::statechart::simple_state&, boost::statechart::event_base const&, void const*)+0xbf) [0x8dd3ff]
4: (boost::statechart::detail::reaction_result boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, boost::statechart::custom_reaction, boost::statechart::transition, &boost::statechart::detail::no_context::no_function> >, boost::statechart::simple_state >(boost::statechart::simple_state&, boost::statechart::event_base const&, void const*)+0x57) [0x8dd4e7]
5: (boost::statechart::detail::reaction_result boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, boost::statechart::custom_reaction, boost::statechart::custom_reaction, boost::statechart::custom_reaction, boost::statechart::transition, &boost::statechart::detail::no_context::no_function> >, boost::statechart::simple_state >(boost::statechart::simple_state&, boost::statechart::event_base const&, void const*)+0x57) [0x8dd637]
6: (boost::statechart::detail::reaction_result boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, boost::statechart::custom_reaction, boost::statechart::custom_reaction, boost::statechart::custom_reaction, boost::statechart::custom_reaction, boost::statechart::transition, &boost::statechart::detail::no_context::no_function>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state >(boost::statechart::simple_state&, boost::statechart::event_base const&, void const*)+0x57) [0x8dd6e7]
7: (boost::statechart::state_machine, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x8bcc1b]
8: (boost::statechart::state_machine, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x19) [0x8bcca9]
9: (PG::RecoveryState::handle_event(std::tr1::shared_ptr, PG::RecoveryCtx*)+0x31) [0x8bcd41]
10: (PG::handle_peering_event(std::tr1::shared_ptr, PG::RecoveryCtx*)+0x368) [0x872a08]
11: (OSD::process_peering_events(std::list > const&, ThreadPool::TPHandle&)+0x40c) [0x77619c]
12: (OSD::PeeringWQ::_process(std::list > const&, ThreadPool::TPHandle&)+0x14) [0x7d31e4]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb8173a]
14: (ThreadPool::WorkThread::entry()+0x10) [0xb82980]
15: (()+0x6b50) [0x7ffd10f98b50]
16: (clone()+0x6d) [0x7ffd0fbbc7bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---

Any ideas?

Thanks,
Chad.
Re: [ceph-users] CRUSH depends on host + OSD?
Hi Craig,

> It's part of the way the CRUSH hashing works. Any change to the CRUSH map
> causes the algorithm to change slightly.

Dan@cern could not replicate my observations, so I plan to follow his procedure (fake-create an OSD, wait for rebalance, remove the fake OSD) in the near future to see if I can replicate his! :)

> BTW, it's safer to remove OSDs and hosts by first marking the OSDs UP and
> OUT (ceph osd out OSDID). That will trigger the remapping, while keeping
> the OSDs in the pool so you have all of your replicas.

I am under the impression that the procedure I posted does leave the OSDs in the pool while an additional replication takes place: after "ceph osd crush remove osd.osdnum" I see that the used % on the removed OSD slowly decreases as the relocation of blocks takes place. If my ceph-fu were strong enough I would try to find some block replicated num_replicas+1 times so that my belief would be well-founded. :)

Also, "ceph osd crush remove osd.osdnum" still shows the OSD in "ceph osd tree", but it is not attached to any server. I think it might even be marked UP and DOWN, but I cannot confirm. So I believe so far the approaches are equivalent.

BUT, I think that to keep an OSD out after using "ceph osd out OSDID" one needs to turn off "auto in" or something. I don't want to turn that off b/c in the past I had some slow drives which would occasionally be marked "out". If they stayed "out" that could increase load on other drives, making them unresponsive, getting them marked "out" as well, leading to a domino effect where too many drives get marked "out" and the cluster goes down. Now I have better hardware, but since the scenario exists, I'd rather avoid it! :)

> If you mark the OSDs OUT, wait for the remapping to finish, and remove the
> OSDs and host from the CRUSH map, there will still be some data migration.

Yep, this is what I see. But I find it weird.

> Ceph is also really good at handling multiple changes in a row. For
> example, I had to reformat all of my OSDs because I chose my mkfs.xfs
> parameters poorly. I removed the OSDs, without draining them first, which
> caused a lot of remapping. I then quickly formatted the OSDs, and put them
> back in. The CRUSH map went back to what it started with, and the only
> remapping required was to re-populate the newly formatted OSDs.

In this case you'd be living with num_replicas-1 for a while. Sounds exciting! :)

Thanks,
Chad.
Re: [ceph-users] CRUSH depends on host + OSD?
Hi Dan,

I'd like to decommission a node to reproduce the problem and post enough information for you (at least) to understand what is going on. Unfortunately I'm a ceph newbie, so I'm not sure what info would be of interest before/during the drain. Probably the crushmap would be of interest. Pre-decommission (the interesting parts?):

root default {
	id -1		# do not change unnecessarily
	# weight 21.890
	alg straw
	hash 0	# rjenkins1
	item osd01 weight 2.700
	item osd03 weight 3.620
	item osd05 weight 1.350
	item osd06 weight 2.260
	item osd07 weight 2.710
	item osd08 weight 2.030
	item osd09 weight 1.800
	item osd02 weight 1.350
	item osd10 weight 4.070
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

Should I gather anything else?

Chad.
Re: [ceph-users] CRUSH depends on host + OSD?
Hi Dan,

I'm using Emperor (0.72). Though I would think CRUSH maps have not changed that much between versions?

> That sounds bizarre to me, and I can't reproduce it. I added an osd (which
> was previously not in the crush map) to a fake host=test:
>
>    ceph osd crush create-or-move osd.52 1.0 rack=RJ45 host=test

I have a flatter failure domain with only servers/drives. Looks like you would have at least rack/server/drive. Would that make the difference?

> As far as I've experienced, an entry in the crush map with a _crush_ weight
> of zero is equivalent to that entry not being in the map. (In fact, I use
> this to drain OSDs ... I just ceph osd crush reweight osd.X 0, then
> sometime later I crush rm the osd, without incurring any secondary data
> movement.)

Is the crush weight the second column of ceph osd tree? I'll have to pay attention to that next time I drain a node.

Thanks for investigating!
Chad.
Re: [ceph-users] CRUSH depends on host + OSD?
Hi Mariusz,

> Usually removing OSD without removing host happens when you
> remove/replace dead drives.
>
> Hosts are in map so
>
> * CRUSH wont put 2 copies on same node
> * you can balance around network interface speed

That does not answer the original question IMO: "Why does the CRUSH map depend on hosts that no longer have OSDs on them?" But I think it does answer the question "Why does the CRUSH map depend on OSDs AND hosts?"

> The question should be "why you remove all OSDs if you are going to
> remove host anyway" :)

This is your question, not mine! :) I am decommissioning the entire node. What is the recommended (fastest yet safe) way of doing this? I am currently following this procedure:

for all osdnum on server:
    ceph osd crush remove osd.osdnum
# wait for health to not be degraded, migration stops

for all osdnum on server:
    stop osdnum on server
    ceph auth del osd.osdnum
    ceph osd rm osdnum
# no new migration

# remove server with no OSDs from CRUSH
ceph osd crush remove server
# lots of migration!

Thanks!
C.
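The pseudocode procedure above might be written out as a real loop roughly like this. A sketch only: the hostname, the OSD ids, and the sysvinit-style stop command are my assumptions (emperor/firefly era), and each wait step has to be done by hand against `ceph health`.

```shell
HOST=osd10          # hypothetical node being decommissioned
OSDS="20 21 22"     # hypothetical OSD ids living on that node

# Step 1: drop the OSDs out of CRUSH; this triggers the first migration.
for n in $OSDS; do
    ceph osd crush remove "osd.$n"
done
# ...wait here until `ceph health` is no longer degraded...

# Step 2: stop the daemons and remove the OSDs; no new migration expected.
for n in $OSDS; do
    service ceph stop "osd.$n"
    ceph auth del "osd.$n"
    ceph osd rm "$n"
done

# Step 3: remove the now-empty host bucket; a second migration follows,
# which is exactly the surprise this thread is about.
ceph osd crush remove "$HOST"
```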
[ceph-users] CRUSH depends on host + OSD?
Hi all,

When I remove all OSDs on a given host, then wait for all objects (PGs?) to be active+clean, then remove the host (ceph osd crush remove hostname), that causes the objects to shuffle around the cluster again.

Why does the CRUSH map depend on hosts that no longer have OSDs on them?

A wonderment question,
C.
[ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole
Hi All,

Does anyone have a script or sequence of commands to prepare all drives on a single computer for use by ceph, and then start up all OSDs on the computer at one time? I feel this would be faster and cause less network traffic than adding one drive at a time, which is what the current script does.

Thanks!
Chad.
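One possible shape for such a script, using ceph-deploy as elsewhere in this archive. The hostname, device names, and OSD ids are all hypothetical; the idea is just to keep the new OSDs from taking data until they all exist, so the cluster rebalances once instead of once per drive.

```shell
# Keep newly created OSDs from being marked IN as they come up.
ceph osd set noin

# Prepare and start an OSD on every data drive of the new node.
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    ceph-deploy osd create newhost:"$dev"
done

# Once all OSDs are up, mark them IN together for a single rebalance.
ceph osd unset noin
for n in 24 25 26 27; do    # hypothetical ids of the new OSDs
    ceph osd in "$n"
done
```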
[ceph-users] decreasing pg_num?
Hi All,

Is it possible to decrease pg_num? I was able to decrease pgp_num, but when I try to decrease pg_num I get an error:

# ceph osd pool set tibs pg_num 1024
specified pg_num 1024 <= current 2048

Thanks!
C.
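As far as I know pg_num cannot be reduced in place (hence the error above); the usual workaround is to create a new pool with the smaller pg_num and copy the data across. A sketch, with the pool name from the thread; note `rados cppool` does not preserve snapshots, and the delete step is irreversible.

```shell
ceph osd pool create tibs-new 1024                          # new pool with smaller pg_num
rados cppool tibs tibs-new                                  # copy objects across
ceph osd pool delete tibs tibs --yes-i-really-really-mean-it
ceph osd pool rename tibs-new tibs
```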
[ceph-users] limitations of erasure coded pools
Thanks for the link Blairo!

I can think of a use case already (a combo replicated pool / erasure pool for a virtual tape library)!

Chad.
[ceph-users] limitations of erasure coded pools
Hi All,

Could someone point me to a document (possibly a FAQ :) ) describing the limitations of erasure coded pools? Hopefully it would contain the when and how to use them as well. E.g. I read about people using replicated pools as a front end to erasure coded pools, but I don't know why they're deciding to do this, or how they are setting this up.

Thanks!
Chad.
Re: [ceph-users] qemu/librbd versus qemu/kernel module rbd
Hi John,

Thanks for the reply! Yes, I agree Ceph is exciting! Keep up the good work!

> Using librbd, as you've pointed out, doesn't run afoul of potential Linux
> kernel deadlocks; however, you normally wouldn't encounter this type of
> situation in a production cluster anyway as you'd likely never use the same
> host for client and server components.

We're planning to do this (host VMs on Ceph OSDs). What should we be wary of other than the loopback deadlock problem?

> See: http://ceph.com/docs/master/rbd/rbd-openstack/ and notice that cloud
> platforms generally feed Ceph block devices via QEMU and libvirt to the
> cloud computing platform.

At the moment we're using ganeti, which can use either librbd or the rbd kernel module, hence my questions. :) Eventually I'll post performance comparisons for those two options.

> In other words, you create a "golden
> image" that you can snapshot and then use copy-on-write cloning to bring up
> VMs using an RBD-based image snapshot quickly.
> OS image sizes are often sizable. So downloading them each time would be
> time-consuming and slow. If you can do that once and snapshot the image;
> then, clone the snapshot, that's dramatically faster.

Good idea! We haven't really explored Ceph's snapshotting / cloning etc.

Thanks,
Chad.
[ceph-users] qemu/librbd versus qemu/kernel module rbd
Hi All,

What are the pros and cons of running a virtual machine (with qemu-kvm) whose image is accessed via librbd or by mounting /dev/rbdX?

I've heard that the librbd method has the advantage of not being vulnerable to deadlocks due to memory allocation problems?

Would one also benefit from using backported librbd on older kernels? E.g. ceph 0.80 running on a 3.2.51 kernel should have bug fixes that the rbd module would not?

Would one expect performance differences between librbd and the rbd module?

Thanks!
Chad.
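For concreteness, the two attachment styles being compared might look like this on the qemu command line. A sketch only: the pool/image names are hypothetical, the exact `-drive` syntax varies by qemu version, and qemu must be built with rbd support for the first form.

```shell
# 1) Userspace librbd: qemu talks to the cluster directly, no kernel rbd involved.
qemu-system-x86_64 -m 2048 \
    -drive format=rbd,file=rbd:rbdpool/vm01,cache=writeback

# 2) Kernel rbd: map the image first, then hand qemu an ordinary block device.
rbd map rbdpool/vm01       # creates /dev/rbd0 (and /dev/rbd/rbdpool/vm01)
qemu-system-x86_64 -m 2048 \
    -drive format=raw,file=/dev/rbd0
```

The loopback-deadlock concern in the question applies to form 2 when the mapping host also runs OSDs, since kernel memory pressure during writeback can then depend on the OSDs making progress.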
Re: [ceph-users] /etc/ceph/rbdmap
> This is for mapping kernel rbd devices on system startup, and belongs with
> ceph-common (which hasn't yet been, but soon will be, split out from ceph)

Great! Yeah, I was hoping to map /dev/rbd without installing all the ceph daemons!

> along with the 'rbd' cli utility. It isn't directly related to librbd1.

Oh, I guess librbd1 is for fuse?

C.
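For anyone else searching the archives: the /etc/ceph/rbdmap file lists one image per line for the init script to map at boot. Roughly (the pool/image and keyring path here are examples):

```shell
# /etc/ceph/rbdmap
# format: poolname/imagename  id=client-id,keyring=path
rbd/vm-disk-01  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
```

The init script then runs the equivalent of `rbd map rbd/vm-disk-01` for each line at startup.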
[ceph-users] /etc/ceph/rbdmap
Hi all,

Shouldn't /etc/ceph/rbdmap also be in librbd1 rather than ceph?

Thanks, Chad.
[ceph-users] /etc/init.d/rbdmap
Hi all,

Shouldn't /etc/init.d/rbdmap be in the librbd package rather than in "ceph"?

Thanks, Chad.
Re: [ceph-users] osd_recovery_max_single_start
Hi David,

Thanks for the reply. I'm a little confused by OSD versus PGs in the description of the two options osd_recovery_max_single_start and osd_recovery_max_active. The ceph webpage describes osd_recovery_max_active as "The number of active recovery requests per OSD at one time." It does not mention PGs.

Assuming you meant OSD instead of PG, is this a rephrase of your message: "osd_recovery_max_active (default 15)" recovery operations will run in total, and they will be started in groups of "osd_recovery_max_single_start (default 5)". So if I set osd_recovery_max_active = 1, then osd_recovery_max_single_start will effectively = 1?

Thanks! Chad.

On Thursday, April 24, 2014 11:43:47 you wrote:
> The value of osd_recovery_max_single_start (default 5) is used in
> conjunction with osd_recovery_max_active (default 15). This means that a
> given PG will start up to 5 recovery operations at a time, out of a total
> of 15 operations active at a time. This allows recovery to spread
> operations across more or fewer PGs at any given time.
>
> David Zafman
> Senior Developer
> http://www.inktank.com
>
> On Apr 24, 2014, at 8:09 AM, Chad Seys wrote:
> > Hi All,
> >
> > What does osd_recovery_max_single_start do? I could not find a
> > description of it.
> >
> > Thanks!
> > Chad.
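Here is a toy model of how I read David's description (this is my interpretation, not Ceph's actual code): new recovery ops start in batches of up to osd_recovery_max_single_start, but the total in flight never exceeds osd_recovery_max_active.

```shell
#!/bin/sh
# Toy model of the interaction between the two settings (my reading of
# the description above, not Ceph source): batch size is capped by
# single_start AND by the remaining room under max_active.
ops_to_start() {
  max_active=$1; single_start=$2; in_flight=$3
  room=$(( max_active - in_flight ))
  if [ "$room" -lt 0 ]; then room=0; fi
  if [ "$room" -gt "$single_start" ]; then room=$single_start; fi
  echo "$room"
}

ops_to_start 15 5 0    # defaults, idle: a full batch of 5 starts
ops_to_start 15 5 13   # only 2 slots left under max_active: 2 start
ops_to_start 1 5 0     # max_active=1 makes single_start effectively 1
```

Which is why, if this reading is right, setting osd_recovery_max_active = 1 also caps the batch size at 1.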
[ceph-users] osd_recovery_max_single_start
Hi All,

What does osd_recovery_max_single_start do? I could not find a description of it.

Thanks! Chad.
Re: [ceph-users] newb question: how to apply and check config
Thanks for the tip Brian! Chad.
[ceph-users] newb question: how to apply and check config
Hello all,

I want to set the following value for ceph:

    osd recovery max active = 1

Where do I place this setting? And how do I ensure that it is active?

Do I place it only in /etc/ceph/ceph.conf on the monitor, in a section like so:

    [osd]
    osd recovery max active = 1

Or do I have to place it on each of the OSD nodes as well? Do I need to restart the OSDs, mons, both? How do I verify that the setting is being used?

Thanks! Chad.
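For the archives, here is the pattern that answers the questions above (osd.0 is just an example id). Settings in ceph.conf are read at daemon start, so the [osd] section needs to be on each OSD host; alternatively they can be injected into running daemons:

```shell
# persistent: put this in /etc/ceph/ceph.conf on each OSD host
#   [osd]
#       osd recovery max active = 1

# runtime, no restart needed: inject into all running OSDs
ceph tell osd.\* injectargs '--osd-recovery-max-active 1'

# verify on an OSD host via the admin socket
ceph daemon osd.0 config get osd_recovery_max_active
```

Note that injected values do not survive a daemon restart, so for a permanent change the ceph.conf entry is still needed.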
Re: [ceph-users] create multiple OSDs without changing CRUSH until one last step
Hi Greg,

> How many monitors do you have?

1 . :)

> It's also possible that re-used numbers won't get caught in this,
> depending on the process you went through to clean them up, but I
> don't remember the details of the code here.

Yeah, too bad. I'm following the standard removal procedure in the URL below, except that instead of marking it out I just "crush remove" it as suggested by CERN (to avoid rebalancing twice):
https://ceph.com/docs/master/rados/operations/add-or-rm-osds/

I considered the "noin" command, but that would be global, and I wouldn't want some transient outing of an OSD to domino as more and more OSDs become active to recover. Too bad there is not a "noin osdnum" command.

One idea that might work is to record the new OSD's properties right after creating it, then "ceph osd crush remove osd.osdnum". Later, when all the drives are added, "ceph osd crush add " them back. Any smoother way of doing this?

Is there a crush move command that does the equivalent of crush rm? ("ceph osd tree" makes it look like it got moved out of the tree. :) )

Any good way to get an OSD's vitals? "ceph osd crush dump" looks like it contains some info, but the weights are some kind of rescaled integers...

TTYL, Chad.
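The record-then-re-add idea above looks roughly like this with the CLI (the osd id, weight, and host bucket are example values):

```shell
# note the OSD's crush weight and location before removing it
ceph osd tree

# take the new OSD out of CRUSH so no data moves to it yet
ceph osd crush remove osd.12

# ... create the remaining OSDs the same way, then add them all back
# at once with the recorded weight and location:
ceph osd crush add osd.12 1.0 host=osdhost1
```

This way CRUSH only changes once at the end, so the cluster rebalances a single time instead of once per new drive.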
Re: [ceph-users] fuse or kernel to mount rbd?
Hi Sage et al,

Thanks for the info! How stable are the cutting-edge kernels like 3.13? Is 3.8 (e.g. from Ubuntu Raring) a better choice?

Thanks again!
[ceph-users] fuse or kernel to mount rbd?
Hi,

I'm running Debian Wheezy, which has kernel version 3.2.54-2. Should I be using rbd-fuse 0.72.2 or the kernel client to mount rbd devices? I.e., this is an old kernel relative to Emperor, but maybe bugs are backported to the kernel?

Thanks! Chad.
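For context, rbd-fuse takes the userspace route (librbd via FUSE) and exposes every image in a pool as a file under a mountpoint, roughly like this (the pool name and mountpoint are examples):

```shell
# mount all images in pool 'rbd' as files under /mnt/rbd
mkdir -p /mnt/rbd
rbd-fuse -p rbd /mnt/rbd

# each image appears as a file, which can then be loop-mounted
ls /mnt/rbd
```

Since it links against the 0.72.2 librbd, it gets Emperor-era fixes regardless of the running kernel, which is the trade-off being asked about.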
Re: [ceph-users] out then rm / just rm an OSD?
On Thursday, April 03, 2014 07:57:58 Dan Van Der Ster wrote:
> Hi,
> By my observation, I don't think that marking it out before crush rm would
> be any safer.
>
> Normally what I do (when decommissioning an OSD or whole server) is stop
> the OSD process, then crush rm / osd rm / auth del the OSD shortly
> afterwards,

Huh! I am using replication = 2, so I'd be worried about the other drive dying before replication can occur. For my on-the-edge cluster, it seems safer to mark the OSD out, then remove it from CRUSH, then turn off the OSD daemon.

Looks like when an OSD is marked out, reweight is set to 0. Is this the same as weight being set to 0? I assume in either case the data is still available to be replicated elsewhere.

If one removes an OSD from CRUSH but does not turn off the OSD, is the data available for replication? (I would guess "no".)

> The main thing to note is that crush rm of an out or DNE OSD will trigger
> backfilling, even though intuitively that shouldn't require any data
> movement. This was confirmed by the developers as a sort of side effect of
> the current CRUSH implementation.

I guess changing the CRUSH map does not preserve current data locations (like a non-stable sorting algorithm).

Thanks! Chad.

> Cheers, Dan
>
> On Apr 3, 2014 4:00 AM, Chad William Seys wrote:
> Hi All,
> Slide 19 of the Ceph at CERN presentation
> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
> says that when removing an OSD from Ceph it is faster to
> just "ceph osd crush rm " rather than marking the
> osd as "out", waiting for data migration, and then "rm" the
> OSD.
> The reason they give is that "out then rm" leads to two modifications
> to CRUSH and two data migrations, which takes more time.
> I have observed this to be true!
>
> However, is it safer to do the "out then rm"? Doesn't just doing an "rm"
> make replicas unavailable?
>
> (BTW, they used replica = 4, so maybe they were less concerned!)
>
> Thanks!
> Chad.
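For anyone following along, the "safer" out-then-rm sequence discussed above is roughly (osd.12 is an example id):

```shell
ceph osd out 12          # reweight -> 0; data drains while replicas stay readable
# wait for HEALTH_OK / all PGs active+clean ...
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
# only now stop the osd daemon on the host
```

The cost, as CERN noted, is two CRUSH changes and therefore two rounds of data movement; the benefit is that the OSD's data remains readable until the migration off it has finished.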
Re: [ceph-users] degraded objects after adding OSD?
> The backfilling process can be stopped/paused at some point due to config
> settings or other reasons, so ceph reflects the current state of PGs that
> are in fact degraded because a replica is missing on the fresh OSD. Those
> PGs actually being backfilled display the 'degraded+backfilling' state.

Also makes sense! 'degraded+backfilling' will give me confidence rather than despair. :)

Chad.
Re: [ceph-users] degraded objects after adding OSD?
Hi Sergey,

Thanks much for the explanation! That is reassuring and very sensible.

A wishlist suggestion would be to call this situation something other than "degraded". Maybe merely "backfilling"? Well, I am unfamiliar with all the possible states that already exist and might be appropriate.

Thanks again, Chad.

On Friday, March 28, 2014 04:49:02 you wrote:
> On 28.03.14, 0:38, Chad Seys wrote:
> > Hi all,
> >
> > Beginning with a cluster with only "active+clean" PGs, adding an OSD
> > causes objects to be "degraded".
> >
> > Does this mean that ceph deletes replicas before copying them to the
> > new OSD?
>
> No. Ceph adds the new OSD to the acting set of PGs going to be
> rebalanced, and the number of replicas increases by 1. Replica n+1 is
> obviously missing on the new OSD, so the PG enters the 'degraded' state.
> Once the backfilling process has completed, one of the OSDs that
> previously served the particular PG is removed from the acting set and
> the PG returns to the active+clean state.
[ceph-users] degraded objects after adding OSD?
Hi all,

Beginning with a cluster with only "active+clean" PGs, adding an OSD causes objects to be "degraded".

Does this mean that ceph deletes replicas before copying them to the new OSD? Or does "degraded" also mean that there are no replicas on the target OSD, even though there are already the desired number of replicas in the cluster?

Thanks! Chad.