Re: [ceph-users] 'incomplete' PGs: what does it mean?

2014-09-07 Thread John Morris
Thanks Greg, that helped get the last stuck PGs back online, and 
everything looks normal again.


Here's the promised post-mortem.  It might contain only a little of 
value to developers, but certainly a bunch of face-palming for readers 
(and a big red face for me).


This mess started during a reorganization: migrating OSDs to better 
disk controllers and from xfs to btrfs, moving journals from unreliable 
SSDs to partitions on the OSD disks (better SSDs were added later), 
switching to size=4 CRUSH maps that distribute two replicas to each of 
two nodes (see the thread from last month), and unrelated things such 
as a 10Gb network upgrade.
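
For the curious, the two-replicas-per-node placement was done with a 
CRUSH rule roughly like the sketch below; the rule name and ruleset 
number are illustrative, not copied from my actual map:

  rule two_per_host {
          ruleset 3
          type replicated
          min_size 4
          max_size 4
          step take default
          step choose firstn 2 type host
          step chooseleaf firstn 2 type osd
          step emit
  }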


Most OSDs were migrated in a reasonable way, but for reasons I've 
forgotten I took a short-cut with osd.1 and copied its data outside of 
Ceph from the old xfs partition to the new btrfs partition, figuring 
that if it didn't work, with size=3 there would still be two other 
replicas, right?  Indeed, the newly copied OSD did start up and appear 
to be working.
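
In other words, something like the copy below, with illustrative 
paths; the -X flag matters because the OSD store keeps metadata in 
xattrs, and this is exactly the kind of shortcut to avoid:

  $ rsync -aHAX --numeric-ids /mnt/old-xfs-osd.1/ /var/lib/ceph/osd/ceph-1/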


In the meantime, I clobbered two more (full) OSDs by carelessly 
migrating the journal device from the old SSD to the OSD *data* 
partition rather than the journal partition.  Oops!  Those filesystems 
were wiped the moment the OSDs started, and with two OSDs down and 
size=3, many PGs had only one replica.
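
Roughly what that looked like in ceph.conf, with illustrative device 
names rather than the real ones:

  [osd.2]
      # what I effectively pointed the journal at -- the *data* partition
      osd journal = /dev/sdb1
      # what it should have been -- the dedicated journal partition
      #osd journal = /dev/sdb2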


While the clobbered OSDs were recovering, I restarted osd.1, which had 
apparently been functioning normally since its initial start.  However, 
because the OSD's 'cluster' directory didn't exist as a btrfs 
subvolume, Ceph's snapshots contained the whole filesystem instead of 
just the data directory.  After the restart, ceph-osd tried to roll the 
snapshot back into the 'cluster' subvolume, but since 'cluster' existed 
as a plain directory instead, ceph-osd errored out with the problem 
described in the earlier thread, "error ioctl(BTRFS_IOC_SNAP_CREATE) 
failed: (17) File exists".


At this point, with three OSDs offline, a number of PGs were stuck in 
'incomplete' state.


I made a careful copy of osd.1 (data + journal) to experiment with.  
Once the snapshots were sorted out, starting osd.1 failed with 
complaints about a corrupt journal, which I can't explain except by the 
recurring pattern of stupidity.  Zapping the journal and rolling back 
the snapshot got the OSD back online, but left the eight remaining 
'incomplete' PGs described in the quoted message below.
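
The journal zap amounted to roughly the following, sketched with 
illustrative paths; the flush step only makes sense while the journal 
is still readable, which mine wasn't:

  $ service ceph stop osd.1
  $ ceph-osd -i 1 --flush-journal              # skip if the journal is corrupt
  $ rm -f /var/lib/ceph/osd/ceph-1/journal     # or wipe the journal partition
  $ ceph-osd -i 1 --mkjournal
  $ service ceph start osd.1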


It turns out that Ceph marked these PGs 'incomplete' because other 
OSDs already held partial replicas (0MB-100MB of the 6GB-14GB each) 
with newer versions, dating from the brief period osd.1 was initially 
online, before it was restarted and before its journal was zapped.  
Even though the complete PG data was still on osd.1, Ceph considered 
it 'stray', presumably because it was marked as current only in 
earlier epochs.


Greg's tip about the dev tools led to the resolution of these final 
incomplete PGs.  The ceph_objectstore_tool, built from the latest git 
sources, was used to remove the newer, incomplete PG replicas from all 
OSDs except osd.1.  With only the single remaining copy of those PGs on 
osd.1, Ceph then happily backfilled them to the rest of the cluster.  
Since the VMs had been shut off during the whole mess, data consistency 
was apparently maintained, and everything is back to normal.  Whew!
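
For the record, the removal looked roughly like this for each affected 
OSD and PG, run with the OSD stopped (paths and IDs are illustrative):

  $ service ceph stop osd.7
  $ ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-7 \
        --journal-path /var/lib/ceph/osd/ceph-7/journal \
        --pgid 2.5c --op remove
  $ service ceph start osd.7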


I have personally learned a lot of lessons, both about Ceph 
specifically and, more generally, about the cowboy approach that led to 
this fiasco.  These don't need to be told here (but readers feeling the 
urge to lecture me should feel free).  If I had any lame 'blame the 
equipment' excuses to offer, I'd suggest some idiot-proofing: OSDs 
should refuse to start when the btrfs 'cluster' subvolume doesn't 
exist, and ceph-osd should refuse to write the journal and data onto 
the same partition.
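
Even a pre-flight sanity check as simple as the sketch below, run 
before starting each OSD, would have caught both mistakes; the paths 
and the 'cluster' name follow my layout and are illustrative:

  osd_dir=/var/lib/ceph/osd/ceph-1
  jdev=$(readlink -f "$osd_dir/journal")
  ddev=$(df -P "$osd_dir" | awk 'NR==2 {print $1}')
  if [ "$jdev" = "$ddev" ]; then
      echo "refusing to start: journal points at the data partition" >&2
      exit 1
  fi
  if [ "$(stat --format=%i "$osd_dir/cluster")" != 256 ]; then
      echo "refusing to start: 'cluster' is not a btrfs subvolume" >&2
      exit 1
  fi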


John


On 08/29/2014 12:56 PM, Gregory Farnum wrote:

Hmm, so you've got PGs which are out-of-date on disk (by virtue of being
an older snapshot?) but still have records of them being newer in the
OSD journal?
That's a new failure mode for me, and I don't think we have any tools
designed for solving it. If you can *back up the disk* before doing this, I
think I'd try flushing the OSD journal, then zapping it and creating a
new one.
You'll probably still have an inconsistent store, so you'll then need
to, for each incomplete pg (and for every OSD at the same point on it),
use the OSD store tool to extract the "pginfo" entry and rewrite it to
the version the pg actually has on disk.

These are uncharted waters and you're going to need to explore some dev
tools; good luck!
(Unless Sam or somebody has a better solution.)
-Greg

On Thursday, August 28, 2014, John Morris <j...@zultron.com> wrote:

Greg, thanks for the tips in both this and the BTRFS_IOC_SNAP_CREATE
thread.  They were enough to get PGs 'incomplete' due to 'not enough
OSDs hosting' resolved by rolling back to a btrfs snapshot.  I promise
to write a full post-mortem (embarrassing as it will be) after the
cluster is fully healthy.

Re: [ceph-users] 'incomplete' PGs: what does it mean?

2014-08-29 Thread Gregory Farnum
Hmm, so you've got PGs which are out-of-date on disk (by virtue of being an
older snapshot?) but still have records of them being newer in the OSD
journal?
That's a new failure mode for me, and I don't think we have any tools
designed for solving it. If you can *back up the disk* before doing this, I
think I'd try flushing the OSD journal, then zapping it and creating a new
one.
You'll probably still have an inconsistent store, so you'll then need to,
for each incomplete pg (and for every OSD at the same point on it), use the
OSD store tool to extract the "pginfo" entry and rewrite it to the version
the pg actually has on disk.

These are uncharted waters and you're going to need to explore some dev
tools; good luck!
(Unless Sam or somebody has a better solution.)
-Greg

On Thursday, August 28, 2014, John Morris  wrote:

> Greg, thanks for the tips in both this and the BTRFS_IOC_SNAP_CREATE
> thread.  They were enough to get PGs 'incomplete' due to 'not enough
> OSDs hosting' resolved by rolling back to a btrfs snapshot.  I promise
> to write a full post-mortem (embarrassing as it will be) after the
> cluster is fully healthy.
>
> As is my kind of luck, the cluster *also* suffers from eight of the
> *other* 'missing log' sort of 'incomplete' PGs:
>
> 2014-08-28 23:42:03.350612 7f1cc9d82700 20 osd.1 pg_epoch: 13085
>   pg[2.5c( v 10143'300715 (9173'297714,10143'300715] local-les=11890
>   n=3404 ec=1 les/c 11890/10167 12809/12809/12809) [7,3,0,4]
>   r=-1 lpr=12809 pi=8453-12808/138 lcod 0'0
>   inactive NOTIFY]
>   handle_activate_map: Not dirtying info:
>   last_persisted is 13010 while current is 13085
>
> The data clearly exists in the renegade OSD's data directory.  There
> are no reported 'unfound' objects, so 'mark_unfound_lost revert' doesn't
> apply.  No apparently useful data from 'ceph pg ... query', but an
> example is pasted below.
>
> Since the beginning of the cluster rebuild, all ceph clients have been
> turned off, so I believe there's no fear of lost data by reverting to
> these PGs, and besides there're always backup tapes.
>
> How can Ceph be told to accept the versions it sees on osd.1 as the
> most current version, and forget the missing log history?
>
>John
>
>
> $ ceph pg 2.5c query
> { "state": "incomplete",
>   "epoch": 13241,
>   "up": [
> 7,
> 1,
> 0,
> 4],
>   "acting": [
> 7,
> 1,
> 0,
> 4],
>   "info": { "pgid": "2.5c",
>   "last_update": "10143'300715",
>   "last_complete": "10143'300715",
>   "log_tail": "9173'297715",
>   "last_backfill": "0\/\/0\/\/-1",
>   "purged_snaps": "[]",
>   "history": { "epoch_created": 1,
>   "last_epoch_started": 11890,
>   "last_epoch_clean": 10167,
>   "last_epoch_split": 0,
>   "same_up_since": 13229,
>   "same_interval_since": 13229,
>   "same_primary_since": 13118,
>   "last_scrub": "10029'298459",
>   "last_scrub_stamp": "2014-08-18 17:36:01.079649",
>   "last_deep_scrub": "8323'284793",
>   "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
>   "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649"},
>   "stats": { "version": "10143'300715",
>   "reported_seq": "1764",
>   "reported_epoch": "13241",
>   "state": "incomplete",
>   "last_fresh": "2014-08-29 01:35:44.196909",
>   "last_change": "2014-08-29 01:22:50.298880",
>   "last_active": "0.00",
>   "last_clean": "0.00",
>   "last_became_active": "0.00",
>   "last_unstale": "2014-08-29 01:35:44.196909",
>   "mapping_epoch": 13223,
>   "log_start": "9173'297715",
>   "ondisk_log_start": "9173'297715",
>   "created": 1,
>   "last_epoch_clean": 10167,
>   "parent": "0.0",
>   "parent_split_bits": 0,
>   "last_scrub": "10029'298459",
>   "last_scrub_stamp": "2014-08-18 17:36:01.079649",
>   "last_deep_scrub": "8323'284793",
>   "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
>   "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649",
>   "log_size": 3000,
>   "ondisk_log_size": 3000,
>   "stats_invalid": "0",
>   "stat_sum": { "num_bytes": 0,
>   "num_objects": 0,
>   "num_object_clones": 0,
>   "num_object_copies": 0,
>   "num_objects_missing_on_primary": 0,
>   "num_objects_degraded": 0,
>   "num_objects_unfound": 0,
>   "num_read": 0,
>   "num_read_kb": 0,
>   "num_write": 0,
>   "num_write_kb": 0,
>   "num_scrub_errors": 0,
>   "num_shallow_scrub_errors": 0,
>   "num_deep_scrub_errors": 0,
>   "num_objects_recovered": 0,
>   "num_bytes_recovered": 0,
>   "num_keys_recovered": 0},
>   "stat

Re: [ceph-users] 'incomplete' PGs: what does it mean?

2014-08-28 Thread John Morris
Greg, thanks for the tips in both this and the BTRFS_IOC_SNAP_CREATE
thread.  They were enough to get PGs 'incomplete' due to 'not enough
OSDs hosting' resolved by rolling back to a btrfs snapshot.  I promise
to write a full post-mortem (embarrassing as it will be) after the
cluster is fully healthy.

As is my kind of luck, the cluster *also* suffers from eight of the
*other* 'missing log' sort of 'incomplete' PGs:

2014-08-28 23:42:03.350612 7f1cc9d82700 20 osd.1 pg_epoch: 13085
  pg[2.5c( v 10143'300715 (9173'297714,10143'300715] local-les=11890
  n=3404 ec=1 les/c 11890/10167 12809/12809/12809) [7,3,0,4]
  r=-1 lpr=12809 pi=8453-12808/138 lcod 0'0
  inactive NOTIFY]
  handle_activate_map: Not dirtying info:
  last_persisted is 13010 while current is 13085
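
(That trace came from turning up the OSD's log verbosity; in hedged 
form, something like

  $ ceph tell osd.1 injectargs '--debug-osd 20/20'

or debug osd = 20 in ceph.conf before starting the daemon.)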

The data clearly exists in the renegade OSD's data directory.  There
are no reported 'unfound' objects, so 'mark_unfound_lost revert' doesn't
apply.  No apparently useful data from 'ceph pg ... query', but an
example is pasted below.

Since the beginning of the cluster rebuild, all ceph clients have been
turned off, so I believe there's no fear of lost data by reverting to
these PGs, and besides there're always backup tapes.

How can Ceph be told to accept the versions it sees on osd.1 as the
most current version, and forget the missing log history?

    John


$ ceph pg 2.5c query
{ "state": "incomplete",
  "epoch": 13241,
  "up": [
7,
1,
0,
4],
  "acting": [
7,
1,
0,
4],
  "info": { "pgid": "2.5c",
  "last_update": "10143'300715",
  "last_complete": "10143'300715",
  "log_tail": "9173'297715",
  "last_backfill": "0\/\/0\/\/-1",
  "purged_snaps": "[]",
  "history": { "epoch_created": 1,
  "last_epoch_started": 11890,
  "last_epoch_clean": 10167,
  "last_epoch_split": 0,
  "same_up_since": 13229,
  "same_interval_since": 13229,
  "same_primary_since": 13118,
  "last_scrub": "10029'298459",
  "last_scrub_stamp": "2014-08-18 17:36:01.079649",
  "last_deep_scrub": "8323'284793",
  "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
  "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649"},
  "stats": { "version": "10143'300715",
  "reported_seq": "1764",
  "reported_epoch": "13241",
  "state": "incomplete",
  "last_fresh": "2014-08-29 01:35:44.196909",
  "last_change": "2014-08-29 01:22:50.298880",
  "last_active": "0.00",
  "last_clean": "0.00",
  "last_became_active": "0.00",
  "last_unstale": "2014-08-29 01:35:44.196909",
  "mapping_epoch": 13223,
  "log_start": "9173'297715",
  "ondisk_log_start": "9173'297715",
  "created": 1,
  "last_epoch_clean": 10167,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "10029'298459",
  "last_scrub_stamp": "2014-08-18 17:36:01.079649",
  "last_deep_scrub": "8323'284793",
  "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
  "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649",
  "log_size": 3000,
  "ondisk_log_size": 3000,
  "stats_invalid": "0",
  "stat_sum": { "num_bytes": 0,
  "num_objects": 0,
  "num_object_clones": 0,
  "num_object_copies": 0,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 0,
  "num_objects_unfound": 0,
  "num_read": 0,
  "num_read_kb": 0,
  "num_write": 0,
  "num_write_kb": 0,
  "num_scrub_errors": 0,
  "num_shallow_scrub_errors": 0,
  "num_deep_scrub_errors": 0,
  "num_objects_recovered": 0,
  "num_bytes_recovered": 0,
  "num_keys_recovered": 0},
  "stat_cat_sum": {},
  "up": [
7,
1,
0,
4],
  "acting": [
7,
1,
0,
4]},
  "empty": 0,
  "dne": 0,
  "incomplete": 1,
  "last_epoch_started": 10278},
  "recovery_state": [
{ "name": "Started\/Primary\/Peering",
  "enter_time": "2014-08-29 01:22:50.132826",
  "past_intervals": [
[...]
{ "first": 12809,
  "last": 13101,
  "maybe_went_rw": 1,
  "up": [
7,
3,
0,
4],
  "acting": [
7,
3,
0,
4]},
[...]
],
  "probing_osds": [
0,
1,
2,
3,
4,
5,
7],
  "down_osds_w

Re: [ceph-users] 'incomplete' PGs: what does it mean?

2014-08-27 Thread Gregory Farnum
On Tue, Aug 26, 2014 at 10:46 PM, John Morris  wrote:
> In the docs [1], 'incomplete' is defined thusly:
>
>   Ceph detects that a placement group is missing a necessary period of
>   history from its log. If you see this state, report a bug, and try
>   to start any failed OSDs that may contain the needed information.
>
> However, during an extensive review of list postings related to
> incomplete PGs, an alternate and oft-repeated definition is something
> like 'the number of existing replicas is less than the min_size of the
> pool'.  In no list posting was there any acknowledgement of the
> definition from the docs.
>
> While trying to understand what 'incomplete' PGs are, I simply set
> min_size = 1 on this cluster with incomplete PGs, and they continue to
> be 'incomplete'.  Does this mean that definition #2 is incorrect?
>
> In case #1 is correct, how can the cluster be told to forget the lapse
> in history?  In our case, there was nothing writing to the cluster
> during the OSD reorganization that could have caused this lapse.

Yeah, these two meanings can (unfortunately) both lead to the
INCOMPLETE state being reported. I think that's going to be fixed in
our next major release (so that INCOMPLETE means "not enough OSDs
hosting" and "missing log" will translate into something else), but
for now the "not enough OSDs" is by far the more common. In your case
you probably are missing history, but you don't want to recover from
it using any of the cluster tools because they're likely to lose more
data than necessary. (Hopefully, you can just roll back to a slightly
older btrfs snapshot, but we'll see).
-Greg


[ceph-users] 'incomplete' PGs: what does it mean?

2014-08-26 Thread John Morris
In the docs [1], 'incomplete' is defined thusly:

  Ceph detects that a placement group is missing a necessary period of
  history from its log. If you see this state, report a bug, and try
  to start any failed OSDs that may contain the needed information.

However, during an extensive review of list postings related to
incomplete PGs, an alternate and oft-repeated definition is something
like 'the number of existing replicas is less than the min_size of the
pool'.  In no list posting was there any acknowledgement of the
definition from the docs.

While trying to understand what 'incomplete' PGs are, I simply set
min_size = 1 on this cluster with incomplete PGs, and they continue to
be 'incomplete'.  Does this mean that definition #2 is incorrect?
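
For reference, that was just the standard pool setting, e.g. (pool 
name illustrative):

  $ ceph osd pool set rbd min_size 1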

In case #1 is correct, how can the cluster be told to forget the lapse
in history?  In our case, there was nothing writing to the cluster
during the OSD reorganization that could have caused this lapse.

[1] http://ceph.com/docs/master/rados/operations/pg-states/

John