Re: enable old OSD snapshot to re-join a cluster
On Dec 18, 2013, Gregory Farnum wrote:

> On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva wrote:
>> Here's an updated version of the patch, that makes it much faster than
>> the earlier version, particularly when the gap between the latest osdmap
>> known by the osd and the earliest osdmap known by the cluster is large.

> Is this actually still necessary in the latest dumpling and emperor
> branches?

I can't tell for sure; I don't recall when I last rolled back to an old
snapshot without this kind of patch.

> I thought sufficiently-old OSDs would go through backfill with the new
> PG members in order to get up-to-date without copying all the data.

That much is true, for sure.  The problem was getting to that point.  If
the latest osdmap known by the osd snapshot turns out to be older than
the earliest map known by the monitors, the osd would give up because it
couldn't bridge the gap: no incremental osdmaps were available in the
cluster, and the osd refused to jump over gaps in the osdmap sequence.

That's why I fudged the unavailable intermediate osdmaps as clones of
the latest one known by the osd: then it would apply the incremental
changes as nops until it got to an actual newer map, at which point it
would notice a number of changes, apply them all, and get on its happy
way towards recovery over each of the newer osdmaps ;-)

I can give it a try without the patch if you tell me there's any chance
the osd might now be able to jump over gaps in the osdmap sequence.

That said, the posted patch, ugly as it is, is meant as a stopgap rather
than as a proper solution; dealing with osdmap gaps rather than dying
would surely be a more desirable implementation.

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free!
--
http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist   Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
mds: fix Resetter locking
For some weird reason I couldn't figure out, after I simultaneously
brought down all components of my ceph cluster and then brought them
back up, the mds wouldn't come back, complaining about a zero-sized
entry in its journal some 8+MB behind the end of the journal.  I had
never hit this problem before, and it's not entirely unusual for me to
restart all cluster components at once after some configuration change.
Anyway...

Long story short, after some poking at the mds journal to see if I
could figure out how to get it back up, I gave up and decided to use
the --reset-journal hammer.  Except that it just sat there, never
completing or even getting noticed by the cluster.  After a bit of
additional investigation, the following patch was born, and now my
Emperor cluster is back up.  Phew! :-)

mds: fix Resetter locking

From: Alexandre Oliva

ceph-mds --reset-journal didn't work; it would deadlock waiting for the
osdmap.  Comparing the init code in the Dumper (that worked) with that
in the Resetter (that didn't), I noticed the lock had to be released
before waiting for the osdmap.  Now the resetter works.

However, both the resetter and the dumper fail an assertion after
they've performed their task; I didn't look into it:

../../src/msg/SimpleMessenger.cc: In function 'void SimpleMessenger::reaper()' thread 7fdc188d27c0 time 2013-12-19 04:48:16.930895
../../src/msg/SimpleMessenger.cc: 230: FAILED assert(!cleared)
 ceph version 0.72.1-6-g6bca44e (6bca44ec129d11f1c4f38357db8ae435616f2c7c)
 1: (SimpleMessenger::reaper()+0x706) [0x880da6]
 2: (SimpleMessenger::wait()+0x36f) [0x88180f]
 3: (Resetter::reset()+0x714) [0x56e664]
 4: (main()+0x1359) [0x562769]
 5: (__libc_start_main()+0xf5) [0x3632e21b45]
 6: /l/tmp/build/ceph/build/src/ceph-mds() [0x564e49]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
2013-12-19 04:48:16.934093 7fdc188d27c0 -1 ../../src/msg/SimpleMessenger.cc: In function 'void SimpleMessenger::reaper()' thread 7fdc188d27c0 time 2013-12-19 04:48:16.930895
../../src/msg/SimpleMessenger.cc: 230: FAILED assert(!cleared)

Signed-off-by: Alexandre Oliva
---
 src/mds/Resetter.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/mds/Resetter.cc b/src/mds/Resetter.cc
index e968cdc..ed409a4 100644
--- a/src/mds/Resetter.cc
+++ b/src/mds/Resetter.cc
@@ -79,9 +79,9 @@ void Resetter::init(int rank)
   objecter->init_unlocked();
   lock.Lock();
   objecter->init_locked();
+  lock.Unlock();
   objecter->wait_for_osd_map();
   timer.init();
-  lock.Unlock();
 }
 
 void Resetter::shutdown()

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free!
--
http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist   Red Hat Brazil Compiler Engineer
Re: Ceph Messaging on Accelio (libxio) RDMA
(Sorry for the delay getting back on this.)

On Wed, Dec 11, 2013 at 5:13 PM, Matt W. Benjamin wrote:
> Hi Greg,
>
> I haven't fixed the decision to reify replies in the Messenger at this
> point, but it is what the current prototype code tries to do.
>
> The request/response model is more general than my language implied,
> and also is not the only one available.  However, it is the richest
> model in Accelio, and I'm currently exploring how best to exploit it.
>
> The most general model available sends one-way messages in both
> directions, and obviously looks the most like the current Messenger
> model.  Under the covers, both Accelio models are built on the same
> primitives.  The one-way model is not incompatible with zero-copy RDMA
> operations, though I believe it's at least trivially true that the
> one-way model uses only send/recv and read operations.  Behind the
> scenes, the underlying Accelio framework requires a more or less
> constant exchange of state between the endpoints to maintain a balance
> of RDMA resources in each direction, and to complete RDMA read and
> write transactions (which use registered memory at the
> sender/receiver, respectively).

This sounds more like the acks which the SimpleMessenger already does
(unless I'm misunderstanding what you mean), in that the recipient has
to tell the sender "I have received this message and you don't need to
buffer it any more".  Surely we want to let them move on before we've
(for instance) written the submitted data to disk?  Or is it also about
telling the sender when they can re-consume the memory in the
recipient's box?

> This isn't of course something the Messenger consumer needs to be
> aware of, except precisely so as to permit the framework to know when
> the upper layer operations on registered memory have completed.
> As for the higher level semantics, the first thing to note is that
> all the Accelio primitives provide for delivery receipts, and one of
> my goals is to unify Message acks completely with receipts.  A second
> key point is that the primary property of the current reply hook is
> not its ability to reply, but its completion semantics, and these can
> be articulated on any of the Accelio models.  It's possible that's all
> that's desired.
>
> I'm still exploring whether the request/response model provides
> additional value to the caller that one-way would not.  The third
> available model would use xio response messages to deliver any message
> available at the responder, so perhaps permitting greater application
> utilization of the underlying resources in some circumstances.  I
> think a lot of this will be clearer as I connect the XioMessenger to
> Ceph callers.  As we've worked on the prototype we've found a number
> of places where we could tweak the Accelio APIs to good effect, and I
> think we'll find more places as work continues.

I'm still a little fuzzy about how we'd even use a request-reply model
when there are two (or zero) replies to a given request.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: fix for issue with RGW on larger than 528k downloads
hello!

At first glance it seems to work :-)  We will test it in the next few
hours, but for now it looks great.

Thank you very, very, very much

best regards!

On Wed, Dec 18, 2013 at 5:11 PM, Yehuda Sadeh wrote:
> On Wed, Dec 18, 2013 at 1:53 AM, Pawel Stefanski wrote:
>> hello!
>>
>> We have been struggling with an issue with downloads of bigger files
>> for some time; on ceph-users there were some complaints about this,
>> but without any conclusions, or with the cause misunderstood (for
>> example, a new http server such as nginx was proposed as a solution,
>> when the cause is somewhere in radosgw).
>>
>> I see that the cluster at dreamhost suffered the same issue recently;
>> could the community get the same patch/workaround for the stalling
>> download issue?
>>
> Sure.  It's not on a proper release yet, but it's out there if you
> want to give it a try (wip-rgw-getobj-cb).  We'll get it on a proper
> release soon.
>
> Yehuda
Re: [PATCH] mds: drop unused find_ino_dir
Sage applied this in commit f5d32a33d25a5f9ddccadb4c3ebbd5ccd211204f;
thanks!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Dec 17, 2013 at 3:00 AM, Alexandre Oliva wrote:
> I was looking at inconsistencies in xattrs in my OSDs, and found out
> that only old dirs had the user.ceph.path attribute set.  Trying to
> figure out what this was about, I noticed the code that set this
> attribute was still present, but it was not called anywhere.  I
> decided to clean up the attribute from my cluster, and to wipe out the
> code that used to set it.  Here's the resulting patch.
>
> --
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free!
> --
> http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist   Red Hat Brazil Compiler Engineer
Re: [PATCH] mds: handle setxattr ceph.parent
On Wed, Dec 18, 2013 at 9:09 AM, Sage Weil wrote:
> On Wed, 18 Dec 2013, Alexandre Oliva wrote:
>> On Dec 18, 2013, "Yan, Zheng" wrote:
>>
>> > On Tue, Dec 17, 2013 at 7:25 PM, Alexandre Oliva wrote:
>> >> # setfattr -n ceph.parent /cephfs/mount/path/name
>
> Can we add an additional prefix indicating that this is a
> debug/developer kludge that is not intended to be supported in the
> long-term?  ceph.dev.force_rewrite_backtrace or something?

While the "ceph.dev" namespace might be a good idea for other things,
I'm not sure we should merge anything that we can't support long-term.
This probably wouldn't be too hard to get working properly, but if it's
only going to impact people who have been using CephFS for so long that
they can build their own clusters, I'm not sure it's worth the effort.

In particular, I don't think this interacts properly with projection,
so we can't make any guarantees that it's actually done what the user
wants if there are ongoing changes, and it's sort of weird not to pay
attention to whether the update actually succeeds...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

> sage
>
>> > This seems like a good solution for fixing cephfs that was created
>> > before dumpling.
>>
>> There's more to it than just that, actually.  Renaming an entire
>> subtree won't update the parent attribute of files in there, so they
>> will appear to be incorrect (*).  This patch introduces a mechanism
>> that could be used to force them to be updated.
>>
>> (*) I'm well aware that they contain enough information to find the
>> updated information, so the redundant info in this attribute can be
>> harmlessly out-of-date, but if someone plans to use the data for
>> other purposes (like I sometimes do), it's useful to have them fully
>> up to date.  I also move large trees around, which makes this issue
>> visible.
>>
>> --
>> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
>> You must be the change you wish to see in the world.
>> -- Gandhi
>> Be Free!
>> --
>> http://FSFLA.org/   FSF Latin America board member
>> Free Software Evangelist   Red Hat Brazil Compiler Engineer
Re: enable old OSD snapshot to re-join a cluster
On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva wrote:
> On Feb 20, 2013, Gregory Farnum wrote:
>
>> On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva wrote:
>>> It recently occurred to me that I messed up an OSD's storage, and
>>> decided that the easiest way to bring it back was to roll it back to
>>> an earlier snapshot I'd taken (along the lines of clustersnap) and
>>> let it recover from there.
>>>
>>> The problem with that idea was that the cluster had advanced too
>>> much since the snapshot was taken: the latest OSDMap known by that
>>> snapshot was far behind the range still carried by the monitors.
>>>
>>> Determined to let that osd recover from all the data it already had,
>>> rather than restarting from scratch, I hacked up a “solution” that
>>> appears to work: with the patch below, the OSD will use the contents
>>> of an earlier OSDMap (presumably the latest one it has) for a newer
>>> OSDMap it can't get any more.
>>>
>>> A single run of osd with this patch was enough for it to pick up the
>>> newer state and join the cluster; from then on, the patched osd was
>>> no longer necessary, and presumably should not be used except for
>>> this sort of emergency.
>>>
>>> Of course this can only possibly work reliably if other nodes are up
>>> with same or newer versions of each of the PGs (but then, rolling
>>> back the OSD to an older snapshot wouldn't be safe otherwise).
>>> I don't know of any other scenarios in which this patch will not
>>> recover things correctly, but unless someone far more familiar with
>>> ceph internals than I am vows for it, I'd recommend using this only
>>> if you're really desperate to avoid a recovery from scratch, and you
>>> save snapshots of the other osds (as you probably already do, or you
>>> wouldn't have older snapshots to roll back to :-) and the mon
>>> *before* you get the patched ceph-osd to run, and that you stop the
>>> mds or otherwise avoid changes that you're not willing to lose
>>> should the patch not work for you and you have to go back to the
>>> saved state and let the osd recover from scratch.  If it works,
>>> lucky us; if it breaks, well, I told you :-)
>
>> Yeah, this ought to basically work but it's very dangerous —
>> potentially breaking invariants about cluster state changes, etc.  I
>> wouldn't use it if the cluster wasn't otherwise healthy; other nodes
>> breaking in the middle of this operation could cause serious
>> problems, etc.  I'd much prefer that one just recovers over the wire
>> using normal recovery paths... ;)
>
> Here's an updated version of the patch, that makes it much faster than
> the earlier version, particularly when the gap between the latest
> osdmap known by the osd and the earliest osdmap known by the cluster
> is large.  There are some #if 0-ed out portions of the code for
> experiments that turned out to be unnecessary, but that I didn't quite
> want to throw away.  I've used this patch for quite a while, and I
> wanted to post a working version, rather than some cleaned-up version
> in which I might accidentally introduce errors.

Is this actually still necessary in the latest dumpling and emperor
branches?  I thought sufficiently-old OSDs would go through backfill
with the new PG members in order to get up-to-date without copying all
the data.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [PATCH] reinstate ceph cluster_snap support
On Tue, Dec 17, 2013 at 4:14 AM, Alexandre Oliva wrote:
> On Aug 27, 2013, Sage Weil wrote:
>
>> Hi,
>> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>>> On Aug 23, 2013, Sage Weil wrote:
>>>
>>>> FWIW Alexandre, this feature was never really complete.  For it to
>>>> work, we also need to snapshot the monitors, and roll them back as
>>>> well.
>>>
>>> That depends on what's expected from the feature, actually.
>>>
>>> One use is to roll back a single osd, and for that, the feature
>>> works just fine.  Of course, for that one doesn't need the multi-osd
>>> snapshots to be mutually consistent, but it's still convenient to be
>>> able to take a global snapshot with a single command.
>
>> In principle, we can add this back in.  I think it needs a few
>> changes, though.
>
>> First, FileStore::snapshot() needs to pause and drain the workqueue
>> before taking the snapshot, similar to what is done with the sync
>> sequence.  Otherwise it isn't a transactionally consistent snapshot
>> and may tear some update.  Because it is draining the work queue, it
>> *might* also need to drop some locks, but I'm hopeful that that isn't
>> necessary.
>
> Hmm...  I don't quite get this.  The Filestore implementation of
> snapshot already performs a sync_and_flush before calling the
> backend's create_checkpoint.  Shouldn't that be enough?  FWIW, the
> code I brought in from argonaut didn't do any such thing; it did drop
> locks, but that doesn't seem to be necessary any more:

From a quick skim I think you're right about that.

The more serious concern in the OSDs (which motivated removing the
cluster snap) is what Sage mentioned: we used to be able to take a
snapshot for which all PGs were at the same epoch, and we can't do that
now.  It's possible that's okay, but it makes the semantics even
weirder than they used to be (you've never been getting a real
point-in-time snapshot, although as long as you didn't use external
communication channels you could at least be sure it contained a causal
cut).
And of course that's nothing compared to snapshotting the monitors, as
you've noticed — but making it actually be a cluster snapshot (instead
of something you could basically do by taking a btrfs snapshot
yourself) is something I would want to see before we bring the feature
back into mainline.

On Tue, Dec 17, 2013 at 6:22 AM, Alexandre Oliva wrote:
> On Dec 17, 2013, Alexandre Oliva wrote:
>
>> On Dec 17, 2013, Alexandre Oliva wrote:
>
>>>> Finally, eventually we should make this do a checkpoint on the mons
>>>> too.  We can add the osd snapping back in first, but before this
>>>> can/should really be used the mons need to be snapshotted as well.
>>>> Probably that's just adding in a snapshot() method to
>>>> MonitorStore.h and doing either a leveldb snap or making a full
>>>> copy of store.db...  I forget what leveldb is capable of here.
>
>>> I haven't looked into this yet.
>
>> None of these are particularly appealing; (1) wastes disk space and
>> cpu cycles; (2) relies on leveldb internal implementation details
>> such as the fact that files are never modified after they're first
>> closed, and (3) requires a btrfs subvol for the store.db.  My
>> favorite choice would be 3, but can we just fail mon snaps when this
>> requirement is not met?
>
> Another aspect that needs to be considered is whether to take a
> snapshot of the leader only, or of all monitors in the quorum.  The
> fact that the snapshot operation may take a while to complete
> (particularly (1)), and monitors may not make progress while taking
> the snapshot (which might cause the client and other monitors to
> assume other monitors have failed), make the whole thing quite a bit
> more complex than what I'd have hoped for.
>
> Another point that may affect the decision is the amount of
> information in store.db that may have to be retained.
> E.g., if it's just a small amount of information, creating a separate
> database makes far more sense than taking a complete copy of the
> entire database, and it might even make sense for the leader to
> include the full snapshot data in the snapshot-taking message shared
> with other monitors, so that they all take exactly the same snapshot,
> even if they're not in the quorum and receive the update at a later
> time.  Of course this wouldn't work if the amount of snapshotted
> monitor data was more than reasonable for a monitor message.
>
> Anyway, this is probably more than what I'd be able to undertake
> myself, at least in part because, although I can see one place to add
> the snapshot-taking code to the leader (assuming it's ok to take the
> snapshot just before or right after all monitors agree on it), I have
> no idea of where to plug the snapshot-taking behavior into peon and
> recovering monitors.  Absent a two-phase protocol, it seems to me that
> all monitors ought to take snapshots tentatively when they issue or
> acknowledge the snapshot-taking proposal, so as to make sure that if
> it succeeds we'll have a quorum of sn
Re: [PATCH] mds: handle setxattr ceph.parent
On Wed, 18 Dec 2013, Alexandre Oliva wrote:
> On Dec 18, 2013, "Yan, Zheng" wrote:
>
>> On Tue, Dec 17, 2013 at 7:25 PM, Alexandre Oliva wrote:
>>> # setfattr -n ceph.parent /cephfs/mount/path/name

Can we add an additional prefix indicating that this is a
debug/developer kludge that is not intended to be supported in the
long-term?  ceph.dev.force_rewrite_backtrace or something?

sage

>> This seems like a good solution for fixing cephfs that was created
>> before dumpling.
>
> There's more to it than just that, actually.  Renaming an entire
> subtree won't update the parent attribute of files in there, so they
> will appear to be incorrect (*).  This patch introduces a mechanism
> that could be used to force them to be updated.
>
> (*) I'm well aware that they contain enough information to find the
> updated information, so the redundant info in this attribute can be
> harmlessly out-of-date, but if someone plans to use the data for other
> purposes (like I sometimes do), it's useful to have them fully up to
> date.  I also move large trees around, which makes this issue visible.
>
> --
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free!
> --
> http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist   Red Hat Brazil Compiler Engineer
Re: [PATCH] mds: handle setxattr ceph.parent
On Dec 18, 2013, "Yan, Zheng" wrote:

> On Tue, Dec 17, 2013 at 7:25 PM, Alexandre Oliva wrote:
>> # setfattr -n ceph.parent /cephfs/mount/path/name

> This seems like a good solution for fixing cephfs that was created
> before dumpling.

There's more to it than just that, actually.  Renaming an entire
subtree won't update the parent attribute of files in there, so they
will appear to be incorrect (*).  This patch introduces a mechanism
that could be used to force them to be updated.

(*) I'm well aware that they contain enough information to find the
updated information, so the redundant info in this attribute can be
harmlessly out-of-date, but if someone plans to use the data for other
purposes (like I sometimes do), it's useful to have them fully up to
date.  I also move large trees around, which makes this issue visible.

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free!
--
http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist   Red Hat Brazil Compiler Engineer
Re: fix for issue with RGW on larger than 528k downloads
On Wed, Dec 18, 2013 at 1:53 AM, Pawel Stefanski wrote:
> hello!
>
> We have been struggling with an issue with downloads of bigger files
> for some time; on ceph-users there were some complaints about this,
> but without any conclusions, or with the cause misunderstood (for
> example, a new http server such as nginx was proposed as a solution,
> when the cause is somewhere in radosgw).
>
> I see that the cluster at dreamhost suffered the same issue recently;
> could the community get the same patch/workaround for the stalling
> download issue?
>
Sure.  It's not on a proper release yet, but it's out there if you want
to give it a try (wip-rgw-getobj-cb).  We'll get it on a proper release
soon.

Yehuda
fix for issue with RGW on larger than 528k downloads
hello!

We have been struggling with an issue with downloads of bigger files
for some time; on ceph-users there were some complaints about this, but
without any conclusions, or with the cause misunderstood (for example,
a new http server such as nginx was proposed as a solution, when the
cause is somewhere in radosgw).

I see that the cluster at dreamhost suffered the same issue recently;
could the community get the same patch/workaround for the stalling
download issue?

http://www.dreamhoststatus.com/2013/12/17/dreamobjects-trouble-with-large-objects/

Thanks in advance!

best regards!
--
Pawel