Re: enable old OSD snapshot to re-join a cluster

2013-12-19 Thread Gregory Farnum
On Wed, Dec 18, 2013 at 11:32 PM, Alexandre Oliva ol...@gnu.org wrote:
 On Dec 18, 2013, Gregory Farnum g...@inktank.com wrote:

 On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva ol...@gnu.org wrote:
 Here's an updated version of the patch, which makes it much faster than
 the earlier version, particularly when the gap between the latest osdmap
 known by the osd and the earliest osdmap known by the cluster is large.

 Is this actually still necessary in the latest dumpling and emperor
 branches?

 I can't tell for sure, I don't recall when I last rolled back to an old
 snapshot without this kind of patch.

 I thought sufficiently-old OSDs would go through backfill with the new
 PG members in order to get up to date without copying all the data.

 That much is true, for sure.  The problem was getting to that point.

 If the latest osdmap known by the osd snapshot turns out to be older
 than the earliest map known by the monitors, the osd would give up
 because it couldn't make the ends meet: no incremental osdmaps were
 available in the cluster, and the osd refused to jump over gaps in the
 osdmap sequence.  That's why I fudged the unavailable intermediate
 osdmaps as clones of the latest one known by the osd: then it would
 apply the incremental changes as nops until it got to an actual newer
 map, at which point it would notice a number of changes, apply them
 all, and get on its happy way towards recovery over each of the newer
 osdmaps ;-)

 I can give it a try without the patch if you tell me there's any chance
 the osd might now be able to jump over gaps in the osdmap sequence.  That
 said, the posted patch, ugly as it is, is meant as a stopgap rather than
 as a proper solution; dealing with osdmap gaps rather than dying would
 surely be a more desirable implementation.

I don't remember exactly when it got changed, but I think so. Right, Sam?
-Greg


Re: enable old OSD snapshot to re-join a cluster

2013-12-19 Thread Alexandre Oliva
On Dec 19, 2013, Gregory Farnum g...@inktank.com wrote:

 On Wed, Dec 18, 2013 at 11:32 PM, Alexandre Oliva ol...@gnu.org wrote:
 On Dec 18, 2013, Gregory Farnum g...@inktank.com wrote:
 
 On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva ol...@gnu.org wrote:
 Here's an updated version of the patch, which makes it much faster than
 the earlier version, particularly when the gap between the latest osdmap
 known by the osd and the earliest osdmap known by the cluster is large.
 
 Is this actually still necessary in the latest dumpling and emperor
 branches?
 
 I can't tell for sure, I don't recall when I last rolled back to an old
 snapshot without this kind of patch.
 
 I thought sufficiently-old OSDs would go through backfill with the new
 PG members in order to get up to date without copying all the data.
 
 That much is true, for sure.  The problem was getting to that point.
 
 If the latest osdmap known by the osd snapshot turns out to be older
 than the earliest map known by the monitors, the osd would give up
 because it couldn't make the ends meet: no incremental osdmaps were
 available in the cluster, and the osd refused to jump over gaps in the
 osdmap sequence.

 I can give it a try without the patch if you tell me there's any chance
 the osd might now be able to jump over gaps in the osdmap sequence.

 I don't remember exactly when it got changed, but I think so.

Excellent, I've just confirmed that recovery from an old snapshot works,
even without the proposed patch.  Yay!

Thanks!

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: enable old OSD snapshot to re-join a cluster

2013-12-18 Thread Gregory Farnum
On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva ol...@gnu.org wrote:
 On Feb 20, 2013, Gregory Farnum g...@inktank.com wrote:

 On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva ol...@gnu.org wrote:
 It recently occurred to me that I had messed up an OSD's storage, and I
 decided that the easiest way to bring it back was to roll it back to an
 earlier snapshot I'd taken (along the lines of clustersnap) and let it
 recover from there.

 The problem with that idea was that the cluster had advanced too much
 since the snapshot was taken: the latest OSDMap known by that snapshot
 was far behind the range still carried by the monitors.

 Determined to let that osd recover from all the data it already had,
 rather than restarting from scratch, I hacked up a “solution” that
 appears to work: with the patch below, the OSD will use the contents of
 an earlier OSDMap (presumably the latest one it has) for a newer OSDMap
 it can't get any more.

 A single run of the osd with this patch was enough for it to pick up
 the newer state and join the cluster; from then on, the patched osd was
 no longer necessary, and presumably should not be used except for this
 sort of emergency.

 Of course this can only possibly work reliably if other nodes are up
 with same or newer versions of each of the PGs (but then, rolling back
 the OSD to an older snapshot wouldn't be safe otherwise).  I don't know
 of any other scenarios in which this patch will not recover things
 correctly, but unless someone far more familiar with ceph internals than
 I am vouches for it, I'd recommend using this only if you're really
 desperate to avoid a recovery from scratch.  Save snapshots of the other
 osds (as you probably already do, or you wouldn't have older snapshots
 to roll back to :-) and of the mon *before* you get the patched ceph-osd
 to run, and stop the mds or otherwise avoid changes that you're not
 willing to lose should the patch not work for you and you have to go
 back to the saved state and let the osd recover from scratch.  If it
 works, lucky us; if it breaks, well, I told you :-)

 Yeah, this ought to basically work but it's very dangerous —
 potentially breaking invariants about cluster state changes, etc. I
 wouldn't use it if the cluster wasn't otherwise healthy; other nodes
 breaking in the middle of this operation could cause serious problems,
 etc. I'd much prefer that one just recovers over the wire using normal
 recovery paths... ;)

 Here's an updated version of the patch, which makes it much faster than
 the earlier version, particularly when the gap between the latest osdmap
 known by the osd and the earliest osdmap known by the cluster is large.
 There are some "#if 0"-ed out portions of the code for experiments that
 turned out to be unnecessary, but that I didn't quite want to throw
 away.  I've used this patch for quite a while, and I wanted to post a
 working version rather than some cleaned-up version in which I might
 accidentally introduce errors.

Is this actually still necessary in the latest dumpling and emperor
branches? I thought sufficiently-old OSDs would go through backfill
with the new PG members in order to get up to date without copying all
the data.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: enable old OSD snapshot to re-join a cluster

2013-12-18 Thread Alexandre Oliva
On Dec 18, 2013, Gregory Farnum g...@inktank.com wrote:

 On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva ol...@gnu.org wrote:
 Here's an updated version of the patch, which makes it much faster than
 the earlier version, particularly when the gap between the latest osdmap
 known by the osd and the earliest osdmap known by the cluster is large.

 Is this actually still necessary in the latest dumpling and emperor
 branches?

I can't tell for sure, I don't recall when I last rolled back to an old
snapshot without this kind of patch.

 I thought sufficiently-old OSDs would go through backfill with the new
 PG members in order to get up to date without copying all the data.

That much is true, for sure.  The problem was getting to that point.

If the latest osdmap known by the osd snapshot turns out to be older
than the earliest map known by the monitors, the osd would give up
because it couldn't make the ends meet: no incremental osdmaps were
available in the cluster, and the osd refused to jump over gaps in the
osdmap sequence.  That's why I fudged the unavailable intermediate
osdmaps as clones of the latest one known by the osd: then it would
apply the incremental changes as nops until it got to an actual newer
map, at which point it would notice a number of changes, apply them
all, and get on its happy way towards recovery over each of the newer
osdmaps ;-)
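
To illustrate the trick, here is a minimal stand-alone sketch: the toy
types stand in for Ceph's epoch_t, OSDMap and OSDMapRef, and only the
walk-back-and-clone idea follows the actual patch posted later in this
thread.

#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Toy stand-ins for Ceph's epoch_t / OSDMap / OSDMapRef.
typedef uint32_t epoch_t;
struct OSDMap { epoch_t epoch; std::string contents; };
typedef std::shared_ptr<OSDMap> OSDMapRef;

// The full maps this osd actually has on disk (its snapshot state).
std::map<epoch_t, OSDMapRef> map_cache;

// If the requested epoch is unknown locally and already trimmed from
// the cluster, walk backwards to the newest older map we do have and
// serve a clone of it stamped with the requested epoch; applying the
// (unavailable) incrementals on top of it then degenerates into a
// no-op until a map the cluster still carries takes over.
OSDMapRef try_get_map_with_fallback(epoch_t epoch) {
  auto it = map_cache.find(epoch);
  if (it != map_cache.end())
    return it->second;
  for (epoch_t older = epoch; older-- > 0; ) {
    auto o = map_cache.find(older);
    if (o != map_cache.end()) {
      OSDMapRef clone(new OSDMap(*o->second));
      clone->epoch = epoch;  // the fudge: pretend this is the newer map
      return clone;
    }
  }
  return OSDMapRef();  // no older map either: genuine failure
}

int main() {
  map_cache[10] = OSDMapRef(new OSDMap{10, "last map in the snapshot"});
  OSDMapRef m = try_get_map_with_fallback(25);  // epochs 11..25 missing
  if (m)
    std::cout << "serving epoch " << m->epoch
              << " from: " << m->contents << std::endl;
  return 0;
}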

I can give it a try without the patch if you tell me there's any chance
the osd might now be able to jump over gaps in the osdmap sequence.  That
said, the posted patch, ugly as it is, is meant as a stopgap rather than
as a proper solution; dealing with osdmap gaps rather than dying would
surely be a more desirable implementation.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: enable old OSD snapshot to re-join a cluster

2013-12-17 Thread Alexandre Oliva
On Feb 20, 2013, Gregory Farnum g...@inktank.com wrote:

 On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva ol...@gnu.org wrote:
 It recently occurred to me that I had messed up an OSD's storage, and I
 decided that the easiest way to bring it back was to roll it back to an
 earlier snapshot I'd taken (along the lines of clustersnap) and let it
 recover from there.
 
 The problem with that idea was that the cluster had advanced too much
 since the snapshot was taken: the latest OSDMap known by that snapshot
 was far behind the range still carried by the monitors.
 
 Determined to let that osd recover from all the data it already had,
 rather than restarting from scratch, I hacked up a “solution” that
 appears to work: with the patch below, the OSD will use the contents of
 an earlier OSDMap (presumably the latest one it has) for a newer OSDMap
 it can't get any more.
 
 A single run of the osd with this patch was enough for it to pick up
 the newer state and join the cluster; from then on, the patched osd was
 no longer necessary, and presumably should not be used except for this
 sort of emergency.
 
 Of course this can only possibly work reliably if other nodes are up
 with same or newer versions of each of the PGs (but then, rolling back
 the OSD to an older snapshot wouldn't be safe otherwise).  I don't know
 of any other scenarios in which this patch will not recover things
 correctly, but unless someone far more familiar with ceph internals than
 I am vouches for it, I'd recommend using this only if you're really
 desperate to avoid a recovery from scratch.  Save snapshots of the other
 osds (as you probably already do, or you wouldn't have older snapshots
 to roll back to :-) and of the mon *before* you get the patched ceph-osd
 to run, and stop the mds or otherwise avoid changes that you're not
 willing to lose should the patch not work for you and you have to go
 back to the saved state and let the osd recover from scratch.  If it
 works, lucky us; if it breaks, well, I told you :-)

 Yeah, this ought to basically work but it's very dangerous —
 potentially breaking invariants about cluster state changes, etc. I
 wouldn't use it if the cluster wasn't otherwise healthy; other nodes
 breaking in the middle of this operation could cause serious problems,
 etc. I'd much prefer that one just recovers over the wire using normal
 recovery paths... ;)

Here's an updated version of the patch, which makes it much faster than
the earlier version, particularly when the gap between the latest osdmap
known by the osd and the earliest osdmap known by the cluster is large.
There are some "#if 0"-ed out portions of the code for experiments that
turned out to be unnecessary, but that I didn't quite want to throw
away.  I've used this patch for quite a while, and I wanted to post a
working version rather than some cleaned-up version in which I might
accidentally introduce errors.

Ugly workaround to enable osds to recover from old snapshots

From: Alexandre Oliva ol...@gnu.org

Use the contents of the latest OSDMap that we have as if they were the
contents of more recent OSDMaps that we don't have and that have
already been removed in the cluster.  I hope this works fine as long as
there haven't been major changes to the cluster.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 src/common/shared_cache.hpp |    5 +
 src/common/simple_cache.hpp |    5 +
 src/osd/OSD.cc              |   34 +++---
 3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/src/common/shared_cache.hpp b/src/common/shared_cache.hpp
index 178d100..ac3a347 100644
--- a/src/common/shared_cache.hpp
+++ b/src/common/shared_cache.hpp
@@ -105,6 +105,11 @@ public:
 }
   }
 
+  void ensure_size(size_t min_size) {
+    if (max_size < min_size)
+      set_size(min_size);
+  }
+
   // Returns K key s.t. key <= k for all currently cached k,v
   K cached_key_lower_bound() {
 Mutex::Locker l(lock);
diff --git a/src/common/simple_cache.hpp b/src/common/simple_cache.hpp
index 60919fd..c067062 100644
--- a/src/common/simple_cache.hpp
+++ b/src/common/simple_cache.hpp
@@ -68,6 +68,11 @@ public:
 trim_cache();
   }
 
+  void ensure_size(size_t min_size) {
+    if (max_size < min_size)
+      set_size(min_size);
+  }
+
   bool lookup(K key, V *out) {
     Mutex::Locker l(lock);
     typename list<pair<K, V> >::iterator loc = contents.count(key) ?
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 1a60de6..8da4d96 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5690,9 +5690,37 @@ OSDMapRef OSDService::try_get_map(epoch_t epoch)
   if (epoch > 0) {
     dout(20) << "get_map " << epoch << " - loading and decoding " << map << dendl;
     bufferlist bl;
-    if (!_get_map_bl(epoch, bl)) {
-      delete map;
-      return OSDMapRef();
+    if(!_get_map_bl(epoch, bl)) {
+      epoch_t older = epoch;
+      while(--older) {
+	OSDMapRef retval = map_cache.lookup(older);
+	if (retval) {
+	  *map = *retval;
+#if 0
+	  
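
The rest of the patch is cut off in the archive, but the ensure_size
helper it adds to both caches is complete.  Here is a minimal
self-contained sketch of that grow-only resize pattern; the ToyCache
class, its locking, and the elided trim_cache are illustrative
assumptions, and only the set_size/ensure_size shape follows the
posted diff.

#include <cstddef>
#include <mutex>

// Toy stand-in for the caches touched by the diff; only the
// set_size/ensure_size shape mirrors the patch.
class ToyCache {
  std::mutex lock;
  size_t max_size;

  // Evict entries until the cache fits max_size (elided here).
  void trim_cache() {}

public:
  explicit ToyCache(size_t size) : max_size(size) {}

  void set_size(size_t new_size) {
    std::lock_guard<std::mutex> l(lock);
    max_size = new_size;
    trim_cache();
  }

  // Grow-only resize, as added by the patch: a request smaller than
  // the current capacity is a no-op, so a caller can demand "room for
  // at least N entries" without shrinking a larger configured cache.
  // (Like the diff, the max_size check happens outside the lock.)
  void ensure_size(size_t min_size) {
    if (max_size < min_size)
      set_size(min_size);
  }
};

int main() {
  ToyCache cache(500);      // configured capacity
  cache.ensure_size(5000);  // grows: 5000 > 500
  cache.ensure_size(100);   // no-op: already larger
  return 0;
}

Presumably the cut-off OSD.cc hunk calls map_cache.ensure_size() so the
fudged maps are not evicted while the osd walks the gap, but that part
of the patch is missing from the archive, so treat this as a guess.
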

Re: enable old OSD snapshot to re-join a cluster

2013-02-20 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva ol...@gnu.org wrote:
 It recently occurred to me that I had messed up an OSD's storage, and I
 decided that the easiest way to bring it back was to roll it back to an
 earlier snapshot I'd taken (along the lines of clustersnap) and let it
 recover from there.

 The problem with that idea was that the cluster had advanced too much
 since the snapshot was taken: the latest OSDMap known by that snapshot
 was far behind the range still carried by the monitors.

 Determined to let that osd recover from all the data it already had,
 rather than restarting from scratch, I hacked up a “solution” that
 appears to work: with the patch below, the OSD will use the contents of
 an earlier OSDMap (presumably the latest one it has) for a newer OSDMap
 it can't get any more.

 A single run of the osd with this patch was enough for it to pick up
 the newer state and join the cluster; from then on, the patched osd was
 no longer necessary, and presumably should not be used except for this
 sort of emergency.

 Of course this can only possibly work reliably if other nodes are up
 with same or newer versions of each of the PGs (but then, rolling back
 the OSD to an older snapshot wouldn't be safe otherwise).  I don't know
 of any other scenarios in which this patch will not recover things
 correctly, but unless someone far more familiar with ceph internals than
 I am vouches for it, I'd recommend using this only if you're really
 desperate to avoid a recovery from scratch.  Save snapshots of the other
 osds (as you probably already do, or you wouldn't have older snapshots
 to roll back to :-) and of the mon *before* you get the patched ceph-osd
 to run, and stop the mds or otherwise avoid changes that you're not
 willing to lose should the patch not work for you and you have to go
 back to the saved state and let the osd recover from scratch.  If it
 works, lucky us; if it breaks, well, I told you :-)

Yeah, this ought to basically work but it's very dangerous —
potentially breaking invariants about cluster state changes, etc. I
wouldn't use it if the cluster wasn't otherwise healthy; other nodes
breaking in the middle of this operation could cause serious problems,
etc. I'd much prefer that one just recovers over the wire using normal
recovery paths... ;)
-Greg