[ceph-users] Why is this pg incomplete?

2016-01-01 Thread Bryan Wright
Hi folks,

"ceph pg dump_stuck inactive" shows:

0.e8    incomplete  [406,504]   406 [406,504]   406

Each of the osds above is alive and well, and idle.

The output of "ceph pg 0.e8 query" is shown below.  All of the osds it refers 
to are alive and well, with the exception of osd 102 which died and has been 
removed from the cluster.

Can anyone look at this and tell me why this pg is incomplete?

Bryan

"ceph pg query" output is here, because it's so large:

http://ayesha.phys.virginia.edu/~bryan/errant-pg.txt



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why is this pg incomplete?

2016-01-04 Thread Gregory Farnum
On Fri, Jan 1, 2016 at 12:15 PM, Bryan Wright wrote:
> Hi folks,
>
> "ceph pg dump_stuck inactive" shows:
>
> 0.e8    incomplete  [406,504]   406 [406,504]   406
>
> Each of the osds above is alive and well, and idle.
>
> The output of "ceph pg 0.e8 query" is shown below.  All of the osds it refers
> to are alive and well, with the exception of osd 102 which died and has been
> removed from the cluster.
>
> Can anyone look at this and tell me why this pg is incomplete?
>
> Bryan
>
> "ceph pg query" output is here, because it's so large:
>
> http://ayesha.phys.virginia.edu/~bryan/errant-pg.txt

I can't parse all of that output, but the most important and
easiest-to-understand bit is:
"blocked_by": [
102
],

And indeed in the past_intervals section there are a bunch where it's
just 102. You really want min_size >=2 for exactly this reason. :/ But
if you get 102 up stuff should recover; if you can't you can mark it
as "lost" and RADOS ought to resume processing, with potential
data/metadata loss...
-Greg
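
The check Greg describes (looking for "blocked_by" in the query output) can be sketched roughly as below. This is a minimal sketch, not part of the original reply: the function names and the trimmed sample document are hypothetical, and the real "ceph pg 0.e8 query" JSON is much larger and its layout may differ between Ceph versions. For that reason the walk makes no assumption about where "blocked_by" sits in the document.

```python
import json

def find_key(node, key="blocked_by"):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found

def blocking_osds(query_json_text):
    """Return the sorted OSD ids named in any "blocked_by" list."""
    doc = json.loads(query_json_text)
    ids = set()
    for value in find_key(doc, "blocked_by"):
        if isinstance(value, list):
            ids.update(v for v in value if isinstance(v, int))
    return sorted(ids)

# Hypothetical, heavily trimmed stand-in for real "ceph pg 0.e8 query" output.
sample = '{"state": "incomplete", "info": {"stats": {"blocked_by": [102]}}}'
print(blocking_osds(sample))
```

In practice the text would come from saving the query output to a file and reading it back, rather than from an inline string.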


Re: [ceph-users] Why is this pg incomplete?

2016-01-04 Thread Bryan Wright
Gregory Farnum writes:

> I can't parse all of that output, but the most important and
> easiest-to-understand bit is:
> "blocked_by": [
> 102
> ],
> 
> And indeed in the past_intervals section there are a bunch where it's
> just 102. You really want min_size >=2 for exactly this reason. :/ But
> if you get 102 up stuff should recover; if you can't you can mark it
> as "lost" and RADOS ought to resume processing, with potential
> data/metadata loss...
> -Greg
> 


Ack!  I thought min_size was 2, but I see:

ceph osd pool get data min_size
min_size: 1

Well that's a fine kettle of fish.
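
For anyone wanting to script the same check, here is a rough sketch built around the command shown above. The helper names are hypothetical. It only assumes the "min_size: N" output format seen in this message and the standard "ceph osd pool set <pool> min_size <n>" command. The set command is built but deliberately not run, since whether 2 is a safe value depends on the pool's replica count, which is not stated in this thread.

```python
import subprocess

def parse_min_size(output):
    """Parse text such as 'min_size: 1' into an int."""
    name, _, value = output.strip().partition(":")
    if name.strip() != "min_size":
        raise ValueError(f"unexpected output: {output!r}")
    return int(value)

def get_min_size(pool):
    """Run 'ceph osd pool get <pool> min_size' (needs a working ceph CLI)."""
    out = subprocess.run(
        ["ceph", "osd", "pool", "get", pool, "min_size"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_min_size(out)

def set_min_size_command(pool, value):
    """Build, but do not run, the command that would change min_size."""
    return ["ceph", "osd", "pool", "set", pool, "min_size", str(value)]

# Exercise only the parts that do not need a cluster.
print(parse_min_size("min_size: 1"))
print(" ".join(set_min_size_command("data", 2)))
```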

The osd in question (102) has actually already been marked as lost, via
"ceph osd lost 102 --yes-i-really-mean-it", and it shows up in "ceph osd
tree" as "DNE".  If I can manage to read the disk, how should I try to add
it back in?





Re: [ceph-users] Why is this pg incomplete?

2016-01-04 Thread Michael Kidd
Bryan,
  If you can read the disk that was osd.102, you may wish to attempt this
process to recover your data:
https://ceph.com/community/incomplete-pgs-oh-my/

Good luck!

Michael J. Kidd
Sr. Software Maintenance Engineer
Red Hat Ceph Storage

On Mon, Jan 4, 2016 at 8:32 AM, Bryan Wright  wrote:

> Gregory Farnum  writes:
>
> > I can't parse all of that output, but the most important and
> > easiest-to-understand bit is:
> > "blocked_by": [
> > 102
> > ],
> >
> > And indeed in the past_intervals section there are a bunch where it's
> > just 102. You really want min_size >=2 for exactly this reason. :/ But
> > if you get 102 up stuff should recover; if you can't you can mark it
> > as "lost" and RADOS ought to resume processing, with potential
> > data/metadata loss...
> > -Greg
> >
>
>
> Ack!  I thought min_size was 2, but I see:
>
> ceph osd pool get data min_size
> min_size: 1
>
> Well that's a fine kettle of fish.
>
> The osd in question (102) has actually already been marked as lost, via
> "ceph osd lost 102 --yes-i-really-mean-it", and it shows up in "ceph osd
> tree" as "DNE".  If I can manage to read the disk, how should I try to add
> it back in?


Re: [ceph-users] Why is this pg incomplete?

2016-01-04 Thread Bryan Wright
Michael Kidd writes:

>   If you can read the disk that was osd.102, you may wish to attempt this
> process to recover your data: https://ceph.com/community/incomplete-pgs-oh-my/
> Good luck!

Hi Michael,

Thanks for the pointer.  After looking at it, I'm wondering if the necessity
to copy the pgs to a new osd could be avoided if I can get the original disk
running again temporarily.  Is there a way to re-add an osd after it's been
removed?

Bryan