From: ceph-users-boun...@lists.ceph.com on behalf of Carl-Johan Schenström [carl-johan.schenst...@gu.se]
Sent: 09 April 2015 07:34
To: Francois Lafont; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cascading Failure of OSDs
Francois Lafont wrote:
> Just in case it could be useful, I have noticed the -s option (on my
> Ubuntu) that offers output which is probably easier to parse:
>
> # "column -t" is just to make it nicer for human eyes.
> ifconfig -s | column -t
Since ifconfig is deprecated, one should use ip instead.
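A minimal sketch of the ip equivalent, assuming the interface is named p2p1 as in the one-liner quoted further down the thread (the exact column layout of iproute2 output varies between versions, so treat the awk part as a starting point only):

ip -s link show p2p1
# print only the RX/TX counter rows (bytes, packets, errors, ...):
ip -s link show p2p1 | awk '/RX:|TX:/ {getline; print}'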
Hi,
On 01/04/2015 17:28, Quentin Hartman wrote:
> Right now we're just scraping the output of ifconfig:
>
> ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}'
>
> It's clunky, but it works. I'm sure there's a cleaner way, but this was
> expedient.
>
> QH
OK, thanks for the information.
Right now we're just scraping the output of ifconfig:
ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}'
It's clunky, but it works. I'm sure there's a cleaner way, but this was
expedient.
QH
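For what it's worth, another option that avoids parsing altogether, assuming a Linux host and the same p2p1 interface, is to read the kernel's per-interface counters from sysfs; each file holds a single integer (the exact set of counter files depends on the driver):

cat /sys/class/net/p2p1/statistics/rx_packets
cat /sys/class/net/p2p1/statistics/tx_packets
cat /sys/class/net/p2p1/statistics/rx_errors
cat /sys/class/net/p2p1/statistics/rx_dropped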
On Tue, Mar 31, 2015 at 5:05 PM, Francois Lafont wrote:
> Hi,
>
> Quentin Hartman wrote:
Hi,
Quentin Hartman wrote:
> Since I have been in ceph-land today, it reminded me that I needed to close
> the loop on this. I was finally able to isolate this problem down to a
> faulty NIC on the ceph cluster network. It "worked", but it was
> accumulating a huge number of Rx errors. My best gu
Since I have been in ceph-land today, it reminded me that I needed to close
the loop on this. I was finally able to isolate this problem down to a
faulty NIC on the ceph cluster network. It "worked", but it was
accumulating a huge number of Rx errors. My best guess is some receive
buffer cache fail
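For anyone chasing a similar failure: driver-level statistics can help confirm which port is quietly eating frames. A rough sketch, assuming the cluster-network interface is p2p1 (counter names differ from driver to driver):

ethtool -S p2p1 | grep -iE 'err|drop|miss'

Watching whether those counters keep climbing while the cluster is under load is a quick way to confirm a suspect NIC.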
So I'm not sure what has changed, but in the last 30 minutes the errors
which were all over the place, have finally settled down to this:
http://pastebin.com/VuCKwLDp
The only thing I can think of is that I also set the noscrub flag in
addition to the nodeep-scrub when I first got here, and that f
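For reference, those flags are set and cleared cluster-wide like this, and the currently set flags show up in the output of ceph osd dump:

ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd dump | grep flags
# once things have settled down again:
ceph osd unset noscrub
ceph osd unset nodeep-scrub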
Now that I have a better understanding of what's happening, I threw
together a little one-liner to create a report of the errors that the OSDs
are seeing. Lots of missing / corrupted pg shards:
https://gist.github.com/qhartman/174cc567525060cb462e
I've experimented with exporting / importing the
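A rough sketch of that kind of error report, assuming the default log location /var/log/ceph on each OSD host (paths and log verbosity may differ on other setups); it simply counts the distinct scrub/shard errors each OSD has logged:

grep ' \[ERR\] ' /var/log/ceph/ceph-osd.*.log | sed 's/.*\[ERR\] : //' | sort | uniq -c | sort -rn

And since exporting / importing PG contents comes up here: the usual tool for that on firefly is ceph_objectstore_tool (renamed ceph-objectstore-tool in later releases). A minimal sketch, assuming OSD 24 and pg 3.5d3, default paths, and that the OSD daemon is stopped first; double-check everything before running anything like this:

# stop the OSD first (exact command depends on the init system), then:
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-24 \
    --journal-path /var/lib/ceph/osd/ceph-24/journal \
    --pgid 3.5d3 --op export --file /tmp/pg.3.5d3.export
# ...and on the OSD that should receive the copy (NN is a placeholder):
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-NN \
    --journal-path /var/lib/ceph/osd/ceph-NN/journal \
    --op import --file /tmp/pg.3.5d3.export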
Here's more information I have been able to glean:
pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last
acting [24]
pg 3.690 is stuck inactive for 11991.281739, current state incomplete, last
acting [24]
pg 4.ca is stuck inactive for 15905.499058, current state incomplete, las
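For digging into individual PGs like those, these are the standard read-only queries; the query output shows which OSDs the PG is probing and why it is incomplete:

ceph health detail | grep -e incomplete -e inactive
ceph pg dump_stuck inactive
ceph pg 3.5d3 query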
Thanks for the response. Is this the post you are referring to?
http://ceph.com/community/incomplete-pgs-oh-my/
For what it's worth, this cluster was running happily for the better part
of a year until the event from this weekend that I described in my first
post, so I doubt it's a configuration issue.
This might be related to the backtrace assert, but that's the problem
you need to focus on. In particular, both of these errors are caused
by the scrub code, which Sage suggested temporarily disabling — if
you're still getting these messages, you clearly haven't done so
successfully.
That said, it
Alright, tried a few suggestions for repairing this state, but I don't seem
to have any PG replicas that have good copies of the missing / zero length
shards. What do I do now? Telling the PGs to repair doesn't seem to help
anything. I can deal with data loss if I can figure out which images might
Finally found an error that seems to provide some direction:
-1> 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e
e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
not match object info size (4120576) ajusted for ondisk to (4120576)
I'm diving into google
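On the question of figuring out which images are affected: object names in errors like the one above embed the image's block_name_prefix (rbd_data.3f7a2ae8944a here), and rbd info prints that prefix for each image. A rough sketch, assuming the pool is the one named rbd (substitute the real name of pool 3):

for img in $(rbd ls rbd); do
    prefix=$(rbd info "rbd/$img" | awk '/block_name_prefix/ {print $2}')
    echo "$img $prefix"
done | grep 3f7a2ae8944a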
Thanks for the suggestion, but that doesn't seem to have made a difference.
I've shut the entire cluster down and brought it back up, and my config
management system seems to have upgraded ceph to 0.80.8 during the reboot.
Everything seems to have come back up, but I am still seeing the crash
loop
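One quick sanity check after an unplanned upgrade like that is to confirm every daemon is actually running the new version, since packaged binaries and running processes can disagree until everything has been restarted. A small sketch, run on each OSD host with the OSD id adjusted:

ceph -v
ceph daemon osd.0 version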
It looks like you may be able to work around the issue for the moment with
ceph osd set nodeep-scrub
as it looks like it is scrub that is getting stuck?
sage
On Fri, 6 Mar 2015, Quentin Hartman wrote:
> Ceph health detail - http://pastebin.com/5URX9SsQ
> pg dump summary (with active+clean pgs
Ceph health detail - http://pastebin.com/5URX9SsQ
pg dump summary (with active+clean pgs removed) -
http://pastebin.com/Y5ATvWDZ
an osd crash log (in github gist because it was too big for pastebin) -
https://gist.github.com/qhartman/cb0e290df373d284cfb5
And now I've got four OSDs that are looping
So I'm in the middle of trying to triage a problem with my ceph cluster
running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
been running happily for about a year. This last weekend, something caused
the box running the MDS to seize hard, and when we came in on Monday,
several OSDs