Re: [ceph-users] PG's incomplete after OSD failure

2014-11-12 Thread Chad Seys
Would love to hear if you discover a way to zap incomplete PGs!

Perhaps this is a common enough problem to be worth opening a tracker issue?

Chad.


Re: [ceph-users] pg's stuck for 4-5 days after reaching backfill_toofull

2014-11-11 Thread Chad Seys
Find out which OSD it is:

ceph health detail

Squeeze blocks off the affected OSD:

ceph osd reweight OSDNUM 0.8

Repeat with any OSD which becomes toofull.

Your cluster is only about 50% used, so I think this will be enough.

Then when it finishes, allow data back on OSD:

ceph osd reweight OSDNUM 1
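
For example (an untested sketch; osd.7 is just an illustrative ID):

# find the toofull / near full OSD(s)
ceph health detail | grep -i full

# squeeze PGs off the affected OSD, e.g. osd.7
ceph osd reweight 7 0.8

# wait for backfill to finish, repeating for any other OSD that becomes toofull,
# then let data back onto it
ceph osd reweight 7 1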

Hopefully ceph will someday be taught to move PGs in a better order!
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-11 Thread Chad Seys
Thanks Craig,

I'll jiggle the OSDs around to see if that helps.

Otherwise, I'm almost certain removing the pool will work. :/

Have a good one,
Chad.

> I had the same experience with force_create_pg too.
> 
> I ran it, and the PGs sat there in creating state.  I left the cluster
> overnight, and sometime in the middle of the night, they created.  The
> actual transition from creating to active+clean happened during the
> recovery after a single OSD was kicked out.  I don't recall if that single
> OSD was responsible for the creating PGs.  I really can't say what
> un-jammed my creating.


[ceph-users] long term support version?

2014-11-11 Thread Chad Seys
Hi all,

Did I notice correctly that firefly is going to be supported "long term" 
whereas Giant is not going to be supported as long?

http://ceph.com/releases/v0-80-firefly-released/
This release will form the basis for our long-term supported release Firefly, 
v0.80.x.

http://ceph.com/uncategorized/v0-87-giant-released/
This release will form the basis for the stable release Giant, v0.87.x.

Thanks!
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-10 Thread Chad Seys
Hi Craig,

> If all of your PGs now have an empty down_osds_we_would_probe, I'd run
> through this discussion again.

Yep, looks to be true.

So I ran:

# ceph pg force_create_pg 2.5

and it has been creating for about 3 hours now. :/


# ceph health detail | grep creating
pg 2.5 is stuck inactive since forever, current state creating, last acting []
pg 2.5 is stuck unclean since forever, current state creating, last acting []

Then I restart all OSDs.  The "creating" label disappears and I'm back with the 
same number of incomplete PGs.  :(

Is 'force_create_pg' the right command?  'mark_unfound_lost' complains 
that 'pg has no unfound objects'.

I shall start the 'force_create_pg' again and wait longer, unless there is a 
different command I should use?

Thanks!
Chad.



Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-10 Thread Chad Seys
Hi Craig and list,

> > > If you create a real osd.20, you might want to leave it OUT until you
> > > get things healthy again.

I created a real osd.20 (and it turns out I needed an osd.21 also).  

ceph pg x.xx query no longer lists down osds for probing:
"down_osds_we_would_probe": [],

But I cannot find the magic command line which will remove these incomplete 
PGs.

Anyone know how to remove incomplete PGs ?

Thanks!
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-06 Thread Chad Seys
Hi Craig,

> You'll have trouble until osd.20 exists again.
> 
> Ceph really does not want to lose data.  Even if you tell it the osd is
> gone, ceph won't believe you.  Once ceph can probe any osd that claims to
> be 20, it might let you proceed with your recovery.  Then you'll probably
> need to use ceph pg <pgid> mark_unfound_lost.
> 
> If you don't have a free bay to create a real osd.20, it's possible to fake
> it with some small loop-back filesystems.  Bring it up and mark it OUT.  It
> will probably cause some remapping.  I would keep it around until you get
> things healthy.
> 
> If you create a real osd.20, you might want to leave it OUT until you get
> things healthy again.

Thanks for the recovery tip!

I would guess that safely removing an OSD (mark it OUT, wait for migration to 
stop, then crush remove / osd rm it) and then adding it back in as osd.20 would 
work?
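
Roughly this sequence is what I have in mind (an untested sketch based on the 
add/remove-OSD docs; ids, weight, host name, and paths are illustrative):

# drain and remove an existing OSD whose id is above 20, e.g. osd.24
ceph osd out 24
# ...wait for migration to stop, then on its host:
/etc/init.d/ceph stop osd.24
ceph osd crush remove osd.24
ceph auth del osd.24
ceph osd rm 24

# re-create it; "ceph osd create" hands back the lowest free id, which should be 20 here
ceph osd create
mkdir -p /var/lib/ceph/osd/ceph-20
# ...mkfs and mount the freed disk on /var/lib/ceph/osd/ceph-20...
ceph-osd -i 20 --mkfs --mkkey
ceph auth add osd.20 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-20/keyring
ceph osd crush add osd.20 1.0 host=osd07
/etc/init.d/ceph start osd.20
# per Craig's suggestion, perhaps leave it OUT until things are healthy again:
ceph osd out 20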

New switch:
--yes-i-really-REALLY-mean-it

;)
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-06 Thread Chad Seys
Hi Sam,

> > Amusingly, that's what I'm working on this week.
> > 
> > http://tracker.ceph.com/issues/7862

Well, thanks for any bugfixes in advance!  :)

> Also, are you certain that osd 20 is not up?
> -Sam

Yep.

# ceph osd metadata 20
Error ENOENT: osd.20 does not exist

So part of ceph thinks osd.20 doesn't exist, but another part (the 
down_osds_we_would_probe) thinks the osd exists and is down?

In other news, my min_size was set to 1, so the same fix might not apply to 
me.  Instead I set the pool size from 2 to 1, then back again.  Looks like the 
end result is merely going to be that the down+incomplete PGs get converted to 
incomplete.  :/  I'll let you (and future googlers) know.
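
(That amounted to something like "ceph osd pool set <poolname> size 1" followed 
by "ceph osd pool set <poolname> size 2".)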

Thanks!
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-05 Thread Chad Seys
Hi Sam,
> 'ceph pg <pgid> query'.

Thanks.

Looks like ceph is looking for an osd.20 which no longer exists:

  "probing_osds": [
        "1",
        "7",
        "15",
        "16"],
  "down_osds_we_would_probe": [
        20],

So perhaps during my attempts to rehabilitate the cluster after the upgrade I 
removed this OSD before it was fully drained?

What way forward?
Should I
ceph osd lost {id} [--yes-i-really-mean-it]
and move on? 

Thanks for your help!
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-05 Thread Chad Seys
Hi Sam,

> Incomplete usually means the pgs do not have any complete copies.  Did
> you previously have more osds?

No.  But could OSDs quitting after hitting assert(0 == "we got a bad state 
machine event"), or interaction with kernel 3.14 clients, have caused the 
incomplete copies?

How can I probe the fate of one of the incomplete PGs? e.g.
pg 4.152 is incomplete, acting [1,11]
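
So far the only probes I know of are along these lines (using pg 4.152 as the 
example):

ceph pg 4.152 query          # peering info, including down_osds_we_would_probe
ceph pg map 4.152            # which OSDs the PG maps to
ceph pg dump_stuck inactive  # list all the stuck / inactive PGs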

Also, how can I investigate why one osd has a blocked request?  The hardware 
appears normal and the OSD is performing other requests like scrubs without 
problems.  From its log:

2014-11-05 00:57:26.870867 7f7686331700  0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61440.449534 secs
2014-11-05 00:57:26.870873 7f7686331700  0 log [WRN] : slow request 61440.449534 seconds old, received at 2014-11-04 07:53:26.421301: osd_op(client.11334078.1:592 rb.0.206609.238e1f29.000752e8 [read 512~512] 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg
2014-11-05 00:57:31.816534 7f7665e4a700  0 -- 192.168.164.187:6800/7831 >> 192.168.164.191:6806/30336 pipe(0x44a98780 sd=89 :6800 s=0 pgs=0 cs=0 l=0 c=0x42f482c0).accept connect_seq 14 vs existing 13 state standby
2014-11-05 00:59:10.749429 7f7666e5a700  0 -- 192.168.164.187:6800/7831 >> 192.168.164.191:6800/20375 pipe(0x44a99900 sd=169 :6800 s=2 pgs=443 cs=29 l=0 c=0x42528b00).fault with nothing to send, going to standby
2014-11-05 01:02:09.746857 7f7664d39700  0 -- 192.168.164.187:6800/7831 >> 192.168.164.192:6802/9779 pipe(0x44a98280 sd=63 :6800 s=0 pgs=0 cs=0 l=0 c=0x42f48c60).accept connect_seq 26 vs existing 25 state standby
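
One thing I haven't tried yet is the OSD's admin socket; a sketch of what I have 
in mind (assuming the default socket path):

ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_historic_ops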

Greg, I attempted to copy/paste the 'ceph scrub' output for you.  Did I get the 
relevant bits?

Thanks,
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-04 Thread Chad Seys
On Monday, November 03, 2014 17:34:06 you wrote:
> If you have osds that are close to full, you may be hitting 9626.  I
> pushed a branch based on v0.80.7 with the fix, wip-v0.80.7-9626.
> -Sam

Thanks Sam.  I may have been hitting that as well.  I certainly hit too_full 
conditions often.  I am able to squeeze PGs off of the too_full OSD by 
reweighting, and then eventually all PGs get to where they want to be.  It's kind 
of silly that I have to do this manually though.  Could Ceph order the PG 
movements better?  (Is this what your bug fix does, in effect?)


So, at the moment there are no PGs moving around the cluster, but not all are 
in active+clean.  Also, there is one OSD which has blocked requests.  The OSD 
seems idle, and restarting the OSD just results in a younger blocked request.

~# ceph -s
    cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6
     health HEALTH_WARN 35 pgs down; 208 pgs incomplete; 210 pgs stuck inactive; 210 pgs stuck unclean; 1 requests are blocked > 32 sec
     monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}, election epoch 2996, quorum 0,1,2 mon01,mon02,mon03
     osdmap e115306: 24 osds: 24 up, 24 in
      pgmap v6630195: 8704 pgs, 7 pools, 6344 GB data, 1587 kobjects
            12747 GB used, 7848 GB / 20596 GB avail
                   2 inactive
                8494 active+clean
                 173 incomplete
                  35 down+incomplete

# ceph health detail
...
1 ops are blocked > 8388.61 sec
1 ops are blocked > 8388.61 sec on osd.15
1 osds have slow requests

from the log of the osd with the blocked request (osd.15):
2014-11-04 08:57:26.851583 7f7686331700  0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 3840.430247 secs
2014-11-04 08:57:26.851593 7f7686331700  0 log [WRN] : slow request 3840.430247 seconds old, received at 2014-11-04 07:53:26.421301: osd_op(client.11334078.1:592 rb.0.206609.238e1f29.000752e8 [read 512~512] 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg


Other requests (like PG scrubs) are happening without taking a long time on 
this OSD.
Also, this was one of the OSDs which I completely drained, removed from ceph, 
reformatted, and created again using ceph-deploy.  So it is completely created 
by firefly 0.80.7 code.


As Greg requested, output of ceph scrub:

2014-11-04 09:25:58.761602 7f6c0e20b700  0 mon.mon01@0(leader) e3 handle_command mon_command({"prefix": "scrub"} v 0) v1
2014-11-04 09:26:21.320043 7f6c0ea0c700  1 mon.mon01@0(leader).paxos(paxos updating c 11563072..11563575) accept timeout, calling fresh election
2014-11-04 09:26:31.264873 7f6c0ea0c700  0 mon.mon01@0(probing).data_health(2996) update_stats avail 38% total 6948572 used 3891232 avail 2681328
2014-11-04 09:26:33.529403 7f6c0e20b700  0 log [INF] : mon.mon01 calling new monitor election
2014-11-04 09:26:33.538286 7f6c0e20b700  1 mon.mon01@0(electing).elector(2996) init, last seen epoch 2996
2014-11-04 09:26:38.809212 7f6c0ea0c700  0 log [INF] : mon.mon01@0 won leader election with quorum 0,2
2014-11-04 09:26:40.215095 7f6c0e20b700  0 log [INF] : monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}
2014-11-04 09:26:40.215754 7f6c0e20b700  0 log [INF] : pgmap v6630201: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:40.215913 7f6c0e20b700  0 log [INF] : mdsmap e1: 0/0/1 up
2014-11-04 09:26:40.216621 7f6c0e20b700  0 log [INF] : osdmap e115306: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.227010 7f6c0e20b700  0 log [INF] : pgmap v6630202: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:41.367373 7f6c0e20b700  1 mon.mon01@0(leader).osd e115307 e115307: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.437706 7f6c0e20b700  0 log [INF] : osdmap e115307: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.471558 7f6c0e20b700  0 log [INF] : pgmap v6630203: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:41.497318 7f6c0e20b700  1 mon.mon01@0(leader).osd e115308 e115308: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.533965 7f6c0e20b700  0 log [INF] : osdmap e115308: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.553161 7f6c0e20b700  0 log [INF] : pgmap v6630204: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:42.701720 7f6c0e20b700  1 mon.mon01@0(leader).osd e115309 e115309: 24 osds: 24 up, 24 in
2014-11-04 09:26:42.953977 7f6c0e20b700  0 log [INF] : osdmap e115309: 24 osds: 24 up, 24 in
2014-11-04 09:26:45.776411 7f6c0e20b700  0 log [INF] : pgmap v6630205: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 G

Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-03 Thread Chad Seys
> 
> No, it is a change, I just want to make sure I understand the
> scenario. So you're reducing CRUSH weights on full OSDs, and then
> *other* OSDs are crashing on these bad state machine events?

That is right.  The other OSDs shut down sometime later.  (Not immediately.)

I really haven't tested whether the OSDs will stay up if there are no 
manipulations.  I need to let the PGs settle for a while first, which I haven't 
done yet.

> 
> >> I don't think it should matter, although I confess I'm not sure how
> >> much monitor load the scrubbing adds. (It's a monitor check; doesn't
> >> hit the OSDs at all.)
> > 
> > $ ceph scrub
> > No output.
> 
> Oh, yeah, I think that output goes to the central log at a later time.
> (Will show up in ceph -w if you're watching, or can be accessed from
> the monitor nodes; in their data directory I think?)

OK.  Will doing ceph scrub again result in the same output? If so, I'll run it 
again and look for output in ceph -w when the migrations have stopped.

Thanks!
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-03 Thread Chad Seys
On Monday, November 03, 2014 13:50:05 you wrote:
> On Mon, Nov 3, 2014 at 11:41 AM, Chad Seys  wrote:
> > On Monday, November 03, 2014 13:22:47 you wrote:
> >> Okay, assuming this is semi-predictable, can you start up one of the
> >> OSDs that is going to fail with "debug osd = 20", "debug filestore =
> >> 20", and "debug ms = 1" in the config file and then put the OSD log
> >> somewhere accessible after it's crashed?
> > 
> > Alas, I have not yet noticed a pattern.  Only thing I think is true is
> > that they go down when I first make CRUSH changes.  Then after
> > restarting, they run without going down again.
> > All the OSDs are running at the moment.
> 
> Oh, interesting. What CRUSH changes exactly are you making that are
> spawning errors?

Maybe I miswrote:  I've been marking OUT OSDs with blocked requests.  Then if 
an OSD becomes too_full I use 'ceph osd reweight' to squeeze blocks off of the 
too_full OSD.  (Maybe that is not technically a CRUSH map change?)


> I don't think it should matter, although I confess I'm not sure how
> much monitor load the scrubbing adds. (It's a monitor check; doesn't
> hit the OSDs at all.)

$ ceph scrub
No output.

Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-03 Thread Chad Seys
On Monday, November 03, 2014 13:22:47 you wrote:
> Okay, assuming this is semi-predictable, can you start up one of the
> OSDs that is going to fail with "debug osd = 20", "debug filestore =
> 20", and "debug ms = 1" in the config file and then put the OSD log
> somewhere accessible after it's crashed?

Alas, I have not yet noticed a pattern.  The only thing I think is true is that 
they go down when I first make CRUSH changes.  Then after restarting, they run 
without going down again.
All the OSDs are running at the moment.

What I've been doing is marking OUT the OSDs on which a request is blocked, 
letting the PGs recover (draining the OSD of PGs completely), then removing and 
re-adding the OSD.

So far OSDs treated this way no longer have blocked requests.

Also, it seems as though this slowly decreases the number of incomplete and 
down+incomplete PGs.

> 
> Can you also verify that all of your monitors are running firefly, and
> then issue the command "ceph scrub" and report the output?

Sure, should I wait until the current rebalancing is finished?

Thanks,
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-03 Thread Chad Seys

> There's a "ceph osd metadata" command, but i don't recall if it's in
> Firefly or only giant. :)

It's in firefly.  Thanks, very handy.

All the OSDs are running 0.80.7 at the moment.

What next?

Thanks again,
Chad.


Re: [ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-03 Thread Chad Seys
P.S.  The OSDs interacted with some 3.14 krbd clients before I realized that 
kernel version was too old for the firefly CRUSH tunables.

Chad.


[ceph-users] emperor -> firefly 0.80.7 upgrade problem

2014-11-03 Thread Chad Seys
Hi All,
   I upgraded from emperor to firefly.  The initial upgrade went smoothly and all 
placement groups were active+clean.
  Next I executed
'ceph osd crush tunables optimal'
  to upgrade the CRUSH mapping.
  Now I keep having OSDs go down or have requests blocked for long periods of 
time.
  I start the down OSDs back up and recovery eventually stops, but with 100s 
of "incomplete" and "down+incomplete" PGs remaining.
  The ceph web page says "If you see this state [incomplete], report a bug, 
and try to start any failed OSDs that may contain the needed information."  
Well, all the OSDs are up, though some have blocked requests.

Also, the logs of the OSDs which go down have this message:
2014-11-02 21:46:33.615829 7ffcf0421700  0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0x2faa0280 sd=261 :6810 s=2 pgs=919 cs=25 l=0 c=0x2ed022c0).fault with nothing to send, going to standby
2014-11-02 21:49:11.440142 7ffce4cf3700  0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0xe512a00 sd=249 :6810 s=0 pgs=0 cs=0 l=0 c=0x2a308b00).accept connect_seq 26 vs existing 25 state standby
2014-11-02 21:51:20.085676 7ffcf6e3e700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state::my_context)' thread 7ffcf6e3e700 time 2014-11-02 21:51:20.052242
osd/PG.cc: 5424: FAILED assert(0 == "we got a bad state machine event")

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: 
(PG::RecoveryState::Crashed::Crashed(boost::statechart::state, 
(boost::statechart::history_mode)0>::my_context)+0x12f) [0x87c6ef]
 2: /usr/bin/ceph-osd() [0x8aeae9]
 3: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, 
boost::statechart::transition, 
&boost::statechart::detail::no_context::no_functi
on> >, boost::statechart::simple_state 
>(boost::statechart::simple_state&, boost::statechart::event_base 
const&, void const*)+0xbf) [0x8dd3ff]
 4: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, 
boost::statechart::custom_reaction, 
boost::statechart::transition, 
&boost::statechart::detail::
no_context::no_function> >, 
boost::statechart::simple_state 
>(boost::statechart::simple_state&, boost::statechart::event_base const&, 
void c
onst*)+0x57) [0x8dd4e7]
 5: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl, 
boost::statechart::custom_reaction, 
boost::statechart::custom_reaction, boos
t::statechart::custom_reaction, 
boost::statechart::transition, 
&boost::statechart::detail::no_context::n
o_function> >, boost::statechart::simple_state 
>(boost::statechart::simple_state&, boost::statechart::event_base const&, 
void const*)+0x57) [0x8dd637]
 6: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state::local_react_impl_non_empty::local_react_impl,
 
boost::statechart::custom_reaction, 
boost::statechart::custom_reaction, 
boost::statechart::custom_reaction, 
boost::statechart::custom_reaction, 
boost::statechart::transition, 
&boost::statechart::detail::no_context::no_function>,
 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
boost::statechart::simple_state 
>(boost::statechart::simple_state&, boost::statechart::event_base const&, 
void const*)+0x57) [0x8dd6e7]
 7: (boost::statechart::state_machine, 
boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
 
const&)+0x5b) [0x8bcc1b]
 8: (boost::statechart::state_machine, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
 
const&)+0x19) [0x8bcca9]
 9: (PG::RecoveryState::handle_event(std::tr1::shared_ptr, 
PG::RecoveryCtx*)+0x31) [0x8bcd41]
 10: (PG::handle_peering_event(std::tr1::shared_ptr, 
PG::RecoveryCtx*)+0x368) [0x872a08]
 11: (OSD::process_peering_events(std::list > const&, 
ThreadPool::TPHandle&)+0x40c) [0x77619c]
 12: (OSD::PeeringWQ::_process(std::list > const&, 
ThreadPool::TPHandle&)+0x14) [0x7d31e4]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb8173a]
 14: (ThreadPool::WorkThread::entry()+0x10) [0xb82980]
 15: (()+0x6b50) [0x7ffd10f98b50]
 16: (clone()+0x6d) [0x7ffd0fbbc7bd]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

--- begin dump of recent events ---

Any ideas?
Thanks,
Chad.


Re: [ceph-users] CRUSH depends on host + OSD?

2014-10-21 Thread Chad Seys
Hi Craig,

> It's part of the way the CRUSH hashing works.  Any change to the CRUSH map
> causes the algorithm to change slightly.

Dan@cern could not replicate my observations, so I plan to follow his 
procedure (fake create an OSD, wait for rebalance, remove fake OSD) in the 
near future to see if I can replicate his! :)


> BTW, it's safer to remove OSDs and hosts by first marking the OSDs UP and
> OUT (ceph osd out OSDID).  That will trigger the remapping, while keeping
> the OSDs in the pool so you have all of your replicas.

I am under the impression that the procedure I posted does leave the OSDs in 
the pool while an additional replication takes place: After "ceph osd crush 
remove osd.osdnum" I see that the used % on the removed OSD slowly decreases 
as the relocation of blocks takes place.  

If my ceph-fu were strong enough I would try to find some block replicated 
num_replicas+1 times so that my belief would be well-founded. :)

Also, after "ceph osd crush remove osd.osdnum" the OSD still shows up in "ceph 
osd tree", but it is not attached to any server.  I think it might even be 
marked UP and IN, but I cannot confirm.

So I believe so far the approaches are equivalent.

BUT, I think that to keep an OSD out after using "ceph osd out OSDID" one 
needs to turn off "auto in" or something.

I don't want to turn that off b/c in the past I had some slow drives which 
would occasionally be marked "out".  If they stayed "out" that could increase 
load on other drives, making them unresponsive, getting them marked "out" as 
well, leading to a domino effect where too many drives get marked "out" and 
the cluster goes down.

Now I have better hardware, but since the scenario exists, I'd rather avoid 
it! :)


> If you mark the OSDs OUT, wait for the remapping to finish, and remove the
> OSDs and host from the CRUSH map, there will still be some data migration.

Yep, this is what I see.  But I find it weird.

> 
> 
> Ceph is also really good at handling multiple changes in a row.  For
> example, I had to reformat all of my OSDs because I chose my mkfs.xfs
> parameters poorly.  I removed the OSDs, without draining them first, which
> caused a lot of remapping.  I then quickly formatted the OSDs, and put them
> back in.  The CRUSH map went back to what it started with, and the only
> remapping required was to re-populate the newly formatted OSDs.

In this case you'd be living with num_replicas-1 for a while.  Sounds 
exciting!  :)

Thanks,
Chad.


Re: [ceph-users] CRUSH depends on host + OSD?

2014-10-16 Thread Chad Seys
Hi Dan,
  I'd like to decommission a node to reproduce the problem and post enough 
information for you (at least) to understand what is going on.
  Unfortunately I'm a ceph newbie, so I'm not sure what info would be of 
interest before/during the drain.
  Probably the crushmap would be of interest.  Pre-decommission (the 
interesting parts?):

root default {
  id -1   # do not change unnecessarily
  # weight 21.890
  alg straw
  hash 0  # rjenkins1
  item osd01 weight 2.700
  item osd03 weight 3.620
  item osd05 weight 1.350
  item osd06 weight 2.260
  item osd07 weight 2.710
  item osd08 weight 2.030
  item osd09 weight 1.800
  item osd02 weight 1.350
  item osd10 weight 4.070
}

# rules
rule data {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
}

Should I gather anything else?
Chad.


Re: [ceph-users] CRUSH depends on host + OSD?

2014-10-15 Thread Chad Seys
Hi Dan,
  I'm using Emperor (0.72).  Though I would think CRUSH maps have not changed 
that much between versions?

> That sounds bizarre to me, and I can't reproduce it. I added an osd (which
> was previously not in the crush map) to a fake host=test:
> 
>ceph osd crush create-or-move osd.52 1.0 rack=RJ45 host=test

I have a flatter failure domain with only servers/drives.  Looks like you would 
have at least rack/server/drive.  Would that make the difference?

> As far as I've experienced, an entry in the crush map with a _crush_ weight
> of zero is equivalent to that entry not being in the map. (In fact, I use
> this to drain OSDs ... I just ceph osd crush reweight osd.X 0, then
> sometime later I crush rm the osd, without incurring any secondary data
> movement).

Is the crush weight the second column of "ceph osd tree"?
I'll have to pay attention to that next time I drain a node.

Thanks for investigating!
Chad.


Re: [ceph-users] CRUSH depends on host + OSD?

2014-10-15 Thread Chad Seys
Hi Mariusz,

> Usually removing OSD without removing host happens when you
> remove/replace dead drives.
> 
> Hosts are in map so
> 
> * CRUSH wont put 2 copies on same node
> * you can balance around network interface speed

That does not answer the original question IMO: "Why does the CRUSH map depend 
on hosts that no longer have OSDs on them?"

But I think it does answer the question "Why does the CRUSH map depend on OSDs 
AND hosts?"

> The question should be "why you remove all OSDs if you are going to
> remove host anyway" :)

This is your question, not mine!  :)
I am decommissioning the entire node.  What is the recommended (fastest yet 
safe) way of doing this?  I am currently following this procedure:

# for each OSD on the server (OSD ids 3 4 5 are illustrative):
for osdnum in 3 4 5; do
    ceph osd crush remove osd.$osdnum
done

# wait for health to not be degraded, migration stops

for osdnum in 3 4 5; do
    /etc/init.d/ceph stop osd.$osdnum    # run on the server itself
    ceph auth del osd.$osdnum
    ceph osd rm $osdnum
done

# no new migration

# remove the server (now holding no OSDs) from CRUSH
ceph osd crush remove server
# lots of migration!

Thanks!
C.


[ceph-users] CRUSH depends on host + OSD?

2014-10-15 Thread Chad Seys
Hi all,
  When I remove all OSDs on a given host, then wait for all objects (PGs?) to 
be active+clean, then remove the host (ceph osd crush remove hostname), 
that causes the objects to shuffle around the cluster again.
  Why does the CRUSH map depend on hosts that no longer have OSDs on them?

A wonderment question,
C.


[ceph-users] script for commissioning a node with multiple osds, added to cluster as a whole

2014-08-29 Thread Chad Seys
Hi All,
  Does anyone have a script or sequence of commands to prepare all drives on a 
single computer for use by ceph, and then start up all OSDs on the computer at 
one time?
  I feel this would be faster and generate less network traffic than adding one 
drive at a time, which is what the current script does.
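
Something along these lines is what I have in mind (a rough, untested sketch; 
device names are illustrative):

# keep the new OSDs from being marked in / backfilled one at a time
ceph osd set noin
ceph osd set nobackfill

# prepare and activate every data disk on the node
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    ceph-disk prepare $dev
    ceph-disk activate ${dev}1
done

# once they are all up, let them join and backfill together
ceph osd unset nobackfill
ceph osd unset noin
# (OSDs that booted while noin was set may still need "ceph osd in <id>")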

Thanks!
Chad.


[ceph-users] decreasing pg_num?

2014-07-22 Thread Chad Seys
Hi All,
  Is it possible to decrease pg_num?  I was able to decrease pgp_num, but when 
I try to decrease pg_num I get an error:

# ceph osd pool set tibs pg_num 1024
specified pg_num 1024 <= current 2048

Thanks!
C.


[ceph-users] limitations of erasure coded pools

2014-06-26 Thread Chad Seys
Thanks for the link Blairo!

I can think of a use case already!  (combo replicated pool / erasure pool for 
a virtual tape library)

! Chad.


[ceph-users] limitations of erasure coded pools

2014-06-24 Thread Chad Seys
Hi All,
  Could someone point me to a document (possibly a FAQ :) ) describing the 
limitations of erasure coded pools?  Hopefully it would also cover when and how 
to use them.
   E.g. I read about people using replicated pools as a front end to erasure 
coded pools, but I don't know why they're deciding to do this, or how they are 
setting this up.
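
From what I've read so far, the setup seems to be roughly the following 
(untested; the profile, pool names, and PG counts are illustrative):

ceph osd erasure-code-profile set myprofile k=4 m=2
ceph osd pool create ecpool 1024 1024 erasure myprofile
ceph osd pool create cachepool 1024
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool

i.e. clients write to the replicated cachepool and objects get flushed down to 
the erasure coded ecpool.  But I'd like to read about the why and the caveats.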

Thanks!
Chad.


Re: [ceph-users] qemu/librbd versus qemu/kernel module rbd

2014-06-20 Thread Chad Seys
Hi John,
   Thanks for the reply!  Yes, I agree Ceph is exciting!  Keep up the good 
work!

> Using librbd, as you've pointed out, doesn't run afoul of potential Linux
> kernel deadlocks; however, you normally wouldn't encounter this type of
> situation in a production cluster anyway as you'd likely never use the same
> host for client and server components.

We're planning to do this (host VMs on the same machines as the Ceph OSDs).  
What should we be wary of other than the loopback deadlock problem?

> See: http://ceph.com/docs/master/rbd/rbd-openstack/ and notice that cloud
> platforms generally feed Ceph block devices via QEMU and libvirt to the
> cloud computing platform.

At the moment we're using ganeti, which can use either librbd or the kernel rbd 
module, hence my questions.  :)

Eventually I'll post performance comparisons for those two options.

> In other words, you create a "golden
> image" that you can snapshot and then use copy-on-write cloning to bring up
> VMs using an RBD-based image snapshot quickly.

> OS image sizes are often sizable. So downloading them each time would be
> time-consuming and slow. If you can do that once and snapshot the image;
> then, clone the snapshot, that's dramatically faster.

Good idea!  We haven't really explored Ceph's snapshotting / cloning etc.
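
From a quick look at the docs, the golden-image workflow seems to go roughly 
like this (untested; pool and image names are illustrative, and cloning 
requires format 2 images):

rbd import --image-format 2 wheezy.img rbd/golden-wheezy
rbd snap create rbd/golden-wheezy@base
rbd snap protect rbd/golden-wheezy@base
rbd clone rbd/golden-wheezy@base rbd/vm-disk-0001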

Thanks,
Chad.


[ceph-users] qemu/librbd versus qemu/kernel module rbd

2014-06-20 Thread Chad Seys
Hi All,
  What are the pros and cons of running a virtual machine (with qemu-kvm) 
whose image is accessed via librbd or by mounting /dev/rbdX?
  I've heard that the librbd method has the advantage of not being vulnerable 
to deadlocks due to memory allocation problems?
  Would one also benefit from using librbd on older kernels?  E.g. librbd from 
ceph 0.80 running on a 3.2.51 kernel should have bug fixes that the kernel rbd 
module would not?
  Would one expect performance differences between librbd and the kernel rbd 
module?
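
For concreteness, the two attachment styles I'm comparing would look something 
like this (a sketch; pool/image names and the mapped device are illustrative, 
and qemu must be built with rbd support):

# librbd: qemu talks to the cluster directly
qemu-system-x86_64 ... -drive format=raw,if=virtio,file=rbd:rbd/vm0-disk

# kernel rbd: map the image first, then hand qemu an ordinary block device
rbd map rbd/vm0-disk            # creates e.g. /dev/rbd0
qemu-system-x86_64 ... -drive format=raw,if=virtio,file=/dev/rbd0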

Thanks!
Chad.


Re: [ceph-users] /etc/ceph/rbdmap

2014-06-19 Thread Chad Seys

> This is for mapping kernel rbd devices on system startup, and belong with
> ceph-common (which hasn't yet been but soon will be split out from ceph)

Great!  Yeah, I was hoping to map /dev/rbd without installing all the ceph 
daemons!

> along with the 'rbd' cli utility.  It isn't directly related to librbd1.

Oh, I guess librbd1 is for fuse?
C.


[ceph-users] /etc/ceph/rbdmap

2014-06-19 Thread Chad Seys
Hi all,
  Also, shouldn't /etc/ceph/rbdmap be in librbd1 rather than ceph?

Thanks,
Chad.


[ceph-users] /etc/init.d/rbdmap

2014-06-19 Thread Chad Seys
Hi all,
  Shouldn't /etc/init.d/rbdmap be in the librbd package rather than in "ceph"?

Thanks,
Chad.


Re: [ceph-users] osd_recovery_max_single_start

2014-04-24 Thread Chad Seys
Hi David,
  Thanks for the reply.
  I'm a little confused by OSD versus PGs in the description of the two 
options osd_recovery_max_single_start and osd_recovery_max_active .

The ceph webpage describes osd_recovery_max_active as "The number of active 
recovery requests per OSD at one time."  It does not mention PGs?

Assuming you meant OSD instead of PG, is this a rephrase of your message:

"osd_recovery_max_active (default 15)" recovery operations will run total and 
will be started in groups of "osd_recovery_max_single_start (default 5)"

So if I set osd_recovery_max_active = 1, then osd_recovery_max_single_start 
will effectively be 1?

Thanks!
Chad.

On Thursday, April 24, 2014 11:43:47 you wrote:
> The value of osd_recovery_max_single_start (default 5) is used in
> conjunction with osd_recovery_max_active (default 15).   This means that a
> given PG will start up to 5 recovery operations at time of a total of 15
> operations active at a time.  This allows recovery to spread operations
> across more or less PGs at any given time.
> 
> David Zafman
> Senior Developer
> http://www.inktank.com
> 
> On Apr 24, 2014, at 8:09 AM, Chad Seys  wrote:
> > Hi All,
> > 
> >   What does osd_recovery_max_single_start do?  I could not find a
> >   description
> > 
> > of it.
> > 
> > Thanks!
> > Chad.


[ceph-users] osd_recovery_max_single_start

2014-04-24 Thread Chad Seys
Hi All,
   What does osd_recovery_max_single_start do?  I could not find a description 
of it.

Thanks!
Chad.


Re: [ceph-users] newb question: how to apply and check config

2014-04-23 Thread Chad Seys
Thanks for the tip Brian!
Chad.


[ceph-users] newb question: how to apply and check config

2014-04-23 Thread Chad Seys
Hello all,
  I want to set the following value for ceph:

osd recovery max active = 1

  Where do I place this setting?  And how do I ensure that it is active?

Do I place it only in /etc/ceph/ceph.conf on the monitor in a section like so:

[osd]
osd recovery max active = 1

Or do I have to place it on each of the OSD nodes as well?

Do I need to restart the OSDs, mons, both?

How do I verify that the setting is being used?
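
From the docs, my guess is something like the following; is that about right?  
(An untested sketch; osd.0 and the socket path are just examples.)

# in /etc/ceph/ceph.conf on each OSD node:
[osd]
    osd recovery max active = 1

# check the running value over the admin socket on an OSD node:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep recovery_max_active

# or change it at runtime without a restart:
ceph tell osd.* injectargs '--osd-recovery-max-active 1'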

Thanks!
Chad.


Re: [ceph-users] create multiple OSDs without changing CRUSH until one last step

2014-04-11 Thread Chad Seys
Hi Greg,
> How many monitors do you have?

1 .  :)

> It's also possible that re-used numbers won't get caught in this,
> depending on the process you went through to clean them up, but I
> don't remember the details of the code here.

Yeah, too bad.  I'm following the standard removal procedure in the URL below, 
except that instead of marking it out I just "crush remove" it as suggested by 
CERN (to avoid rebalancing twice):
https://ceph.com/docs/master/rados/operations/add-or-rm-osds/

I considered the "noin" command, but that would be global, and I wouldn't want 
some transient outing of an OSD to domino as more and more OSDs become active 
to recover.

Too bad there is not a "noin osdnum" command.

One idea that might work is to record the new OSD's properties right after 
creating it, then "ceph osd crush remove osd.osdnum".  Later, when all the 
drives are added, "ceph osd crush add" them back.

Any smoother way of doing this?  Is there a crush move command that does the 
equivalent of crush rm?  ("ceph osd tree" makes it look like it got moved out 
of the tree. :) )  Any good way to get an OSD's vitals?  "ceph osd crush dump" 
looks like it contains some info, but the weights are some kind of rescaled 
integers...
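
The record-and-re-add step I have in mind would be roughly (an untested sketch; 
the id, weight, and host are illustrative):

# before pulling it out of CRUSH, note its weight and host from "ceph osd tree"
ceph osd tree | grep -w osd.42        # second column is the CRUSH weight
ceph osd crush remove osd.42

# later, once all the new drives exist, put them back
ceph osd crush add osd.42 2.73 host=osd07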

TTYL,
Chad.


Re: [ceph-users] fuse or kernel to mount rbd?

2014-04-07 Thread Chad Seys
Hi Sage et al,
  Thanks for the info!  How stable are the cutting-edge kernels like 3.13?
Is 3.8 (e.g. from Ubuntu Raring) a better choice?

Thanks again!


[ceph-users] fuse or kernel to mount rbd?

2014-04-04 Thread Chad Seys
Hi,
  I'm running Debian Wheezy, which has kernel version 3.2.54-2.

  Should I be using rbd-fuse 0.72.2 or the kernel client to mount rbd devices?
  I.e., this is an old kernel relative to Emperor, but maybe bug fixes are 
backported to the kernel?

Thanks!
Chad.


Re: [ceph-users] out then rm / just rm an OSD?

2014-04-03 Thread Chad Seys
On Thursday, April 03, 2014 07:57:58 Dan Van Der Ster wrote:
> Hi,
> By my observation, I don't think that marking it out before crush rm would
> be any safer.
> 
> Normally what I do (when decommissioning an OSD or whole server) is stop
> the OSD process, then crush rm / osd rm / auth del the OSD shortly
> afterwards,

Huh!  I am using replication = 2, so I'd be worried about the other drive 
dying before a replication can occur.

For my on-the-edge cluster, it seems safer to mark the OSD out, then remove it 
from CRUSH, then turn off the OSD daemon.

Looks like when an OSD is marked out, reweight is set to 0.  Is this the same 
as weight being set to 0?  I assume in either case the data is still available 
to be replicated elsewhere.

If one removes an OSD from CRUSH but does not turn off the OSD, is the data 
still available for replication?  (I would guess "no".)

> The main thing to note is that crush rm of an out or DNE OSD will trigger
> backfilling, even though intuitively that shouldn't require any data
> movement. This was confirmed by the developers as a sort of side effect of
> the current CRUSH implementation.

I guess changing the CRUSH map does not preserve current data locations (like a 
non-stable sorting algorithm).

Thanks!
Chad.



> 
> Cheers, Dan
> 
> On Apr 3, 2014 4:00 AM, Chad William Seys  wrote:
> Hi All,
>   Slide 19 of Ceph at CERN presentation
> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
> says that when removing an OSD from Ceph it is faster to
> just "ceph osd crush rm <osd-id>" rather than marking the
> osd as "out", waiting for data migration, and then "rm" the
> OSD.
>   The reason they give is that "out then rm" leads to two modifications
> to CRUSH and two data migrations, which takes more time.
>   I have observed this to be true!
> 
>   However, is it safer to do the "out then rm"?  Doesn't just doing an "rm"
> make replicas unavailable?
> 
> (BTW, they used replica = 4, so maybe they were less concerned!)
> 
> Thanks!
> Chad.


Re: [ceph-users] degraded objects after adding OSD?

2014-03-28 Thread Chad Seys

> Backfilling process can be stopped/paused at some point due to config
> settings or other reasons, so ceph reflects current state of PGs that are
> in fact degraded because replica is missing on fresh OSD. Those PGs
> actually being backfilled display 'degraded+backfilling' state.

Also makes sense!  'degraded+backfilling' will give me confidence rather than 
despair. :)
Chad.


Re: [ceph-users] degraded objects after adding OSD?

2014-03-28 Thread Chad Seys
Hi Sergey,
  Thanks much for the explanation!  That is reassuring and is very sensible.
  A wishlist suggestion would be to call this situation something other than 
"degraded".  Maybe merely "backfilling"?  Well, I am unfamiliar with all the 
possible states that already exist and might be appropriate.

Thanks again,
Chad.

On Friday, March 28, 2014 04:49:02 you wrote:
> On 28.03.14, 0:38, Chad Seys wrote:
> > Hi all,
> > 
> >   Beginning with a cluster with only "active+clean" PGs, adding an OSD causes
> > objects to be "degraded".
> > 
> >   Does this mean that ceph deletes replicas before copying them to the new
> > OSD?
> 
> No. Ceph adds the new OSD to the acting set of PGs going to be
> rebalanced, and number of replicas increase by 1. Replica n+1 is
> obviously missing on the new OSD so PG enters 'degraded' state.
> Once backfilling process has completed, one of OSDs that previously
> served particular PG is removed from acting set and PG returns to
> active+clean state.


[ceph-users] degraded objects after adding OSD?

2014-03-27 Thread Chad Seys
Hi all,
  Beginning with a cluster with only "active+clean" PGs, adding an OSD causes 
objects to be "degraded".
  Does this mean that ceph deletes replicas before copying them to the new 
OSD?
  Or does degraded also mean that there are not yet replicas on the target OSD, 
even though there are already the desired number of replicas in the cluster?

Thanks!
Chad.