Re: Persistence of completed_requests in sessionmap (do we need it?)
Just forwarding the replies to the list, as it looks like they got blocked due to accidental HTML.
-Greg

On Mar 4, 2015, at 6:20 AM, John Spray john.sp...@redhat.com wrote:

On 04/03/2015 12:14, Yan, Zheng wrote:
> On Mar 4, 2015, at 05:39, John Spray john.sp...@redhat.com wrote:
>> During replay, we rebuild completed_requests from EMetaBlob::replay, and we've made it this far without reliably persisting it in sessionmap, so I wonder if we ever needed to save this at all? Thoughts?
>
> I think we need to save completed_requests for corner cases. Consider the following scenario:
>
> 1. Client A sends a setattr request to the MDS.
> 2. The MDS handles the request and sends a reply to the client, but the network between the MDS and client A becomes disconnected.
> 3. The MDS handles lots of setattr requests from other clients; the log entry for the first setattr request gets trimmed.
> 4. The MDS crashes; a standby MDS on another host takes over.
> 5. Client A re-sends the setattr request to the new MDS.

Ah, this makes sense. I suspect we never saw that scenario in the automated tests because they almost all just use a single client.

John
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: About in_seq, out_seq in Messenger
On Feb 12, 2015, at 9:17 PM, Haomai Wang haomaiw...@gmail.com wrote:
> On Fri, Feb 13, 2015 at 1:26 AM, Greg Farnum gfar...@redhat.com wrote:
>> Sorry for the delayed response.
>> On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:
>>> Hmm, I got it. There exists another problem I'm not sure is caught by the upper layer: two monitor nodes (A, B) connected with the lossless_peer_reuse policy:
>>> 1. lots of messages have been transmitted
>>> 2. mark down A
>>
>> I don’t think monitors ever mark each other down?
>>
>>> 3. restart A and call send_message (the message will be in out_q)
>>
>> Oh, maybe you just mean rebooting it, not an interface thing, okay...
>>
>>> 4. a network error is injected and A fails to build a *session* with B
>>> 5. because of the policy and in_queue() == true, we will reconnect in writer()
>>> 6. connect_seq++ and try to reconnect
>>
>> I think you’re wrong about this step. The messenger won’t increment connect_seq directly in writer() because it will be in STATE_CONNECTING, so it just calls connect() directly. connect() doesn’t increment the connect_seq unless it successfully finishes a session negotiation.
>
> Hmm, sorry. I checked the log again. Actually A doesn't have any message in its queue, so it enters the standby state and increments connect_seq; it will not be in *STATE_CONNECTING*.

Urgh, that case does seem broken, yes. I take it this is something you’ve actually run across? It looks like that connect_seq++ was added by https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25. Which makes me think we might just be able to increment the connect_seq appropriately in the connect() function if we need to do so (on replacement, I assume). Would you like to look at that, and at how this change might impact the peer with regard to the referenced assert failure?
-A very slow-to-reply and apologetic Greg

2015-02-13 06:19:22.240788 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).writer: state = connecting policy.server=0
2015-02-13 06:19:22.240801 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect 0
2015-02-13 06:19:22.240821 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :0 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connecting to 127.0.0.1:16800/22032
2015-02-13 06:19:22.398009 7fdd147c7700 20 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect read peer addr 127.0.0.1:16800/22032 on socket 91
2015-02-13 06:19:22.398026 7fdd147c7700 20 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect peer addr for me is 127.0.0.1:36265/0
2015-02-13 06:19:22.398066 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect sent my addr 127.0.0.1:16813/22045
2015-02-13 06:19:22.398089 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect sending gseq=8 cseq=0 proto=24
2015-02-13 06:19:22.398115 7fdd147c7700 20 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect wrote (self +) cseq, waiting for reply
2015-02-13 06:19:22.398137 7fdd147c7700 2 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).connect read reply (0) Success
2015-02-13 06:19:22.398155 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060). sleep for 0.1
2015-02-13 06:19:22.498243 7fdd147c7700 2 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).fault (0) Success
2015-02-13 06:19:22.498275 7fdd147c7700 0 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0 c=0x3ed2060).fault with nothing to send, going to standby
2015-02-13 06:19:22.498290 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0 c=0x3ed2060).writer: state = standby policy.server=0
2015-02-13 06:19:22.498301 7fdd147c7700 20 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0 c=0x3ed2060).writer sleeping
2015-02-13 06:19:22.526116 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0 c=0x3ed2060).writer: state = standby policy.server=0
2015-02-13 06:19:22.526132 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=1 l=0 c=0x3ed2060).connect 1
2015-02-13 06:19:22.526158 7fdd147c7700 10 -- 127.0.0.1:16813/22045 >> 127.0.0.1:16800/22032 pipe(0x3f82020 sd=47 :36265 s=1 pgs=0 cs=1 l=0 c
Re: About in_seq, out_seq in Messenger
Sorry for the delayed response.

On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:
> Hmm, I got it. There exists another problem I'm not sure is caught by the upper layer: two monitor nodes (A, B) connected with the lossless_peer_reuse policy:
> 1. lots of messages have been transmitted
> 2. mark down A

I don’t think monitors ever mark each other down?

> 3. restart A and call send_message (the message will be in out_q)

Oh, maybe you just mean rebooting it, not an interface thing, okay...

> 4. a network error is injected and A fails to build a *session* with B
> 5. because of the policy and in_queue() == true, we will reconnect in writer()
> 6. connect_seq++ and try to reconnect

I think you’re wrong about this step. The messenger won’t increment connect_seq directly in writer() because it will be in STATE_CONNECTING, so it just calls connect() directly. connect() doesn’t increment the connect_seq unless it successfully finishes a session negotiation. Unless I’m missing something? :)
-Greg

> 7. because of connect_seq != 0, B can't detect the remote reset, and in_seq (a large value) will be exchanged, causing A to crash (Pipe.cc:1154)
>
> So I guess we can't increase connect_seq when reconnecting? We need to let the peer side detect remote reset via connect_seq == 0.
On Tue, Feb 10, 2015 at 12:00 AM, Gregory Farnum gfar...@redhat.com wrote:

----- Original Message -----
From: Haomai Wang haomaiw...@gmail.com
To: Gregory Farnum gfar...@redhat.com
Cc: Sage Weil sw...@redhat.com, ceph-devel@vger.kernel.org
Sent: Friday, February 6, 2015 8:16:42 AM
Subject: Re: About in_seq, out_seq in Messenger

On Fri, Feb 6, 2015 at 10:47 PM, Gregory Farnum gfar...@redhat.com wrote:

----- Original Message -----
From: Haomai Wang haomaiw...@gmail.com
To: Sage Weil sw...@redhat.com, Gregory Farnum g...@inktank.com
Cc: ceph-devel@vger.kernel.org
Sent: Friday, February 6, 2015 12:26:18 AM
Subject: About in_seq, out_seq in Messenger

> Hi all,
>
> Recently we enabled an async messenger test job in the test lab
> (http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
> We hit many failed asserts, mostly:
>
>   assert(0 == "old msgs despite reconnect_seq feature");
>
> and the asserting connections are all cluster-messenger connections, which means they are OSD-internal. The policy associated with these connections is Messenger::Policy::lossless_peer. When I dove into this problem, I found something confusing. Suppose these steps:
>
> 1. The lossless_peer policy is used by the connections on both sides.
> 2. Mark down one side (either way); the peer connection will try to reconnect.
> 3. Then we restart the failed side. A new connection is built, but the initiator thinks it's an old connection, so it sends in_seq (10).
> 4. The newly started connection has no messages in its queue; it receives the peer connection's in_seq (10) and calls discard_requeued_up_to(10). But because there are no messages in the queue, it doesn't modify anything.
> 5. Now either side issues a message, and it triggers assert(0 == "old msgs despite reconnect_seq feature").
>
> I can replay these steps in a unittest, and it's actually what the async messenger (which follows the simple messenger's design) hits in the test lab. Besides, if we enable resetcheck here, was_session_reset() will be called and it will randomize out_seq, so it will certainly hit assert(0 == "skipped incoming seq").
>
> Anything wrong above?

Sage covered most of this. I'll just add that the last time I checked it, I came to the conclusion that the code to use a random out_seq on initial connect was non-functional. So there definitely may be issues there. In fact, we've fixed a couple (several?) bugs in this area since Firefly was initially released, so if you go over the point-release SimpleMessenger patches you might gain some insight. :)
-Greg

> If we want to make random out_seq functional, I think we need to exchange out_seq when handshaking too. Otherwise, we need to give it up.

Possibly. Or maybe we just need to weaken our asserts to infer it on initial messages?

> Another question: do you think resetcheck=true is always good for OSD-internal connections?

Huh? resetcheck is false for lossless peer connections.

> Letting the Messenger rely on the upper layer may not be a good idea, so maybe we can enhance the in_seq exchange process (ensure each side's in_seq + sent.size() == out_seq). With the current handshake implementation it's not easy to insert more actions into the in_seq exchange process, because the session has been built regardless of the result of the in_seq process. If we enable resetcheck=true, it looks like we can solve most of the incorrect seq out-of-sync problems?

Oh, I see what you mean. Yeah, the problem here is a bit of a mismatch in the interfaces. OSDs are lossless peers with each other; they should not miss any messages, and they don't ever go away. Except of course sometimes they do go away, if one of them dies. This is supposed to be handled by marking it down, but it turns out the race conditions
Re: [PATCH 04/39] mds: make sure table request id unique
On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
> On 03/20/2013 02:15 PM, Sage Weil wrote:
>> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (what if the restart happens after 4 billion requests?). More importantly, this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart. I don't have the MDSTable semantics in my head, so I'll need to look into this later unless somebody else volunteers to do so?
>>>
>>> It's not just 4 billion requests: an MDS restart has several stages, and the mdsmap epoch increases at each stage. I don't think there are any more colliding states in the table. The table client/server use two-phase commit; it's similar to a client request that involves multiple MDSes, and the reqid is analogous to the client request id. The difference is that a client request ID is unique because a new client always gets a unique session id.
>>
>> Each time a tid is consumed (at least for an update) it is journaled in the EMetaBlob::table_tids list, right? So we could actually take a max from journal replay and pick up where we left off? That seems like the cleanest. I'm not too worried about 2^32 tids, I guess, but it would be nicer to avoid that possibility.
>
> Can we re-use the client request ID as the table client request ID?
> Regards
> Yan, Zheng

Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work, as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.
As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to roll back every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage?

Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
-Greg
Re: [PATCH 01/39] mds: preserve subtree bounds until slave commit
Reviewed-by: Greg Farnum g...@inktank.com
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

When replaying an operation that renames a directory inode to a non-auth subtree, if the inode has subtree bounds, we should prevent them from being trimmed until slave commit. This patch also fixes a bug in ESlaveUpdate::replay(): EMetaBlob::replay() should be called before MDCache::finish_uncommitted_slave_update().

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 21 +++++++++++----------
 src/mds/Mutation.h |  5 ++---
 src/mds/journal.cc | 13 +++++++++----
 3 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index fddcfc6..684e70b 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3016,10 +3016,10 @@ void MDCache::add_uncommitted_slave_update(metareqid_t reqid, int master, MDSlav
 {
   assert(uncommitted_slave_updates[master].count(reqid) == 0);
   uncommitted_slave_updates[master][reqid] = su;
-  if (su->rename_olddir)
-    uncommitted_slave_rename_olddir[su->rename_olddir]++;
+  for(set<CDir*>::iterator p = su->olddirs.begin(); p != su->olddirs.end(); ++p)
+    uncommitted_slave_rename_olddir[*p]++;
   for(set<CInode*>::iterator p = su->unlinked.begin(); p != su->unlinked.end(); ++p)
-    uncommitted_slave_unlink[*p]++;
+    uncommitted_slave_unlink[*p]++;
 }

 void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
@@ -3031,11 +3031,12 @@ void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
   if (uncommitted_slave_updates[master].empty())
     uncommitted_slave_updates.erase(master);
   // discard the non-auth subtree we renamed out of
-  if (su->rename_olddir) {
-    uncommitted_slave_rename_olddir[su->rename_olddir]--;
-    if (uncommitted_slave_rename_olddir[su->rename_olddir] == 0) {
-      uncommitted_slave_rename_olddir.erase(su->rename_olddir);
-      CDir *root = get_subtree_root(su->rename_olddir);
+  for(set<CDir*>::iterator p = su->olddirs.begin(); p != su->olddirs.end(); ++p) {
+    CDir *dir = *p;
+    uncommitted_slave_rename_olddir[dir]--;
+    if (uncommitted_slave_rename_olddir[dir] == 0) {
+      uncommitted_slave_rename_olddir.erase(dir);
+      CDir *root = get_subtree_root(dir);
       if (root->get_dir_auth() == CDIR_AUTH_UNDEF)
         try_trim_non_auth_subtree(root);
     }
@@ -6052,8 +6053,8 @@ bool MDCache::trim_non_auth_subtree(CDir *dir)
 {
   dout(10) << "trim_non_auth_subtree(" << dir << ") " << *dir << dendl;

-  // preserve the dir for rollback
-  if (uncommitted_slave_rename_olddir.count(dir))
+  if (uncommitted_slave_rename_olddir.count(dir) ||  // preserve the dir for rollback
+      my_ambiguous_imports.count(dir->dirfrag()))
     return true;

   bool keep_dir = false;

diff --git a/src/mds/Mutation.h b/src/mds/Mutation.h
index 55b84eb..5013f04 100644
--- a/src/mds/Mutation.h
+++ b/src/mds/Mutation.h
@@ -315,13 +315,12 @@ struct MDSlaveUpdate {
   bufferlist rollback;
   elist<MDSlaveUpdate*>::item item;
   Context *waiter;
-  CDir* rename_olddir;
+  set<CDir*> olddirs;
   set<CInode*> unlinked;
   MDSlaveUpdate(int oo, bufferlist &rbl, elist<MDSlaveUpdate*> &list) :
     origop(oo), item(this),
-    waiter(0),
-    rename_olddir(0) {
+    waiter(0) {
     rollback.claim(rbl);
     list.push_back(item);
   }

diff --git a/src/mds/journal.cc b/src/mds/journal.cc
index 5b3bd71..3375e40 100644
--- a/src/mds/journal.cc
+++ b/src/mds/journal.cc
@@ -1131,10 +1131,15 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, MDSlaveUpdate *slaveup)
     if (olddir) {
       if (olddir->authority() != CDIR_AUTH_UNDEF &&
           renamed_diri->authority() == CDIR_AUTH_UNDEF) {
+        assert(slaveup); // auth to non-auth, must be slave prepare
         list<frag_t> leaves;
         renamed_diri->dirfragtree.get_leaves(leaves);
-        for (list<frag_t>::iterator p = leaves.begin(); p != leaves.end(); ++p)
-          renamed_diri->get_or_open_dirfrag(mds->mdcache, *p);
+        for (list<frag_t>::iterator p = leaves.begin(); p != leaves.end(); ++p) {
+          CDir *dir = renamed_diri->get_or_open_dirfrag(mds->mdcache, *p);
+          // preserve subtree bound until slave commit
+          if (dir->authority() == CDIR_AUTH_UNDEF)
+            slaveup->olddirs.insert(dir);
+        }
       }

       mds->mdcache->adjust_subtree_after_rename(renamed_diri, olddir, false);
@@ -1143,7 +1148,7 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, MDSlaveUpdate *slaveup)
       CDir *root = mds->mdcache->get_subtree_root(olddir);
       if (root->get_dir_auth() == CDIR_AUTH_UNDEF) {
         if (slaveup) // preserve the old dir until slave commit
-          slaveup->rename_olddir = olddir;
+          slaveup->olddirs.insert(olddir);
         else
           mds->mdcache->try_trim_non_auth_subtree(root);
       }
@@ -2122,10 +2127,10 @@ void ESlaveUpdate::replay(MDS *mds)
   case
Re: [PATCH 03/39] mds: fix MDCache::adjust_bounded_subtree_auth()
Reviewed-by: Greg Farnum g...@inktank.com
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

There are cases that need both creating a new bound and swallowing an intervening subtree. For example: an MDS exports subtree A with bound B and imports subtree B with bound C at the same time. The MDS crashes; exporting subtree A fails, but importing subtree B succeeds. During recovery, the MDS may create new bound C and swallow subtree B.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 684e70b..19dc60b 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -980,15 +980,21 @@ void MDCache::adjust_bounded_subtree_auth(CDir *dir, set<CDir*>& bounds, pair<in
     } else {
       dout(10) << "  want bound " << *bound << dendl;
+      CDir *t = get_subtree_root(bound->get_parent_dir());
+      if (subtrees[t].count(bound) == 0) {
+        assert(t != dir);
+        dout(10) << "  new bound " << *bound << dendl;
+        adjust_subtree_auth(bound, t->authority());
+      }
       // make sure it's nested beneath ambiguous subtree(s)
       while (1) {
-        CDir *t = get_subtree_root(bound->get_parent_dir());
-        if (t == dir) break;
         while (subtrees[dir].count(t) == 0)
           t = get_subtree_root(t->get_parent_dir());
         dout(10) << "  swallowing intervening subtree at " << *t << dendl;
         adjust_subtree_auth(t, auth);
         try_subtree_merge_at(t);
+        t = get_subtree_root(bound->get_parent_dir());
+        if (t == dir) break;
       }
     }
   }
--
1.7.11.7
Re: [PATCH 05/39] mds: send table request when peer is in proper state.
This and patch 6 are probably going to get dealt with as part of our conversation on patch 4 and restart of the TableServers.
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

Table client/server should send request/reply when the peer is active. Anchor query is an exception, because an MDS in the rejoin stage may need to fetch files before sending rejoin ack; the anchor server can also be in the rejoin stage.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/AnchorClient.cc   | 5 ++++-
 src/mds/MDSTableClient.cc | 9 ++++++---
 src/mds/MDSTableServer.cc | 3 ++-
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/mds/AnchorClient.cc b/src/mds/AnchorClient.cc
index 455e97f..d7da9d1 100644
--- a/src/mds/AnchorClient.cc
+++ b/src/mds/AnchorClient.cc
@@ -80,9 +80,12 @@ void AnchorClient::lookup(inodeno_t ino, vector<Anchor>& trace, Context *onfinis

 void AnchorClient::_lookup(inodeno_t ino)
 {
+  int ts = mds->mdsmap->get_tableserver();
+  if (mds->mdsmap->get_state(ts) < MDSMap::STATE_REJOIN)
+    return;
   MMDSTableRequest *req = new MMDSTableRequest(table, TABLESERVER_OP_QUERY, 0, 0);
   ::encode(ino, req->bl);
-  mds->send_message_mds(req, mds->mdsmap->get_tableserver());
+  mds->send_message_mds(req, ts);
 }

diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index beba0a3..df0131f 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -149,9 +149,10 @@ void MDSTableClient::_prepare(bufferlist& mutation, version_t *ptid, bufferlist

 void MDSTableClient::send_to_tableserver(MMDSTableRequest *req)
 {
   int ts = mds->mdsmap->get_tableserver();
-  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY)
+  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY) {
     mds->send_message_mds(req, ts);
-  else {
+  } else {
+    req->put();
     dout(10) << " deferring request to not-yet-active tableserver mds." << ts << dendl;
   }
 }
@@ -193,7 +194,9 @@ void MDSTableClient::got_journaled_ack(version_t tid)
 void MDSTableClient::finish_recovery()
 {
   dout(7) << "finish_recovery" << dendl;
-  resend_commits();
+  int ts = mds->mdsmap->get_tableserver();
+  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY)
+    resend_commits();
 }

 void MDSTableClient::resend_commits()

diff --git a/src/mds/MDSTableServer.cc b/src/mds/MDSTableServer.cc
index 4f86ff1..07c7d26 100644
--- a/src/mds/MDSTableServer.cc
+++ b/src/mds/MDSTableServer.cc
@@ -159,7 +159,8 @@ void MDSTableServer::handle_mds_recovery(int who)
   for (map<version_t,mds_table_pending_t>::iterator p = pending_for_mds.begin();
        p != pending_for_mds.end();
        ++p) {
-    if (who >= 0 && p->second.mds != who)
+    if ((who >= 0 && p->second.mds != who) ||
+        mds->mdsmap->get_state(p->second.mds) < MDSMap::STATE_CLIENTREPLAY)
       continue;
     MMDSTableRequest *reply = new MMDSTableRequest(table, TABLESERVER_OP_AGREE, p->second.reqid, p->second.tid);
     mds->send_message_mds(reply, p->second.mds);
--
1.7.11.7
Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.
The idea of this patch makes sense, but I'm not sure we guarantee that each daemon sees every map update — if they don't, then if an MDS misses the map moving another MDS into CLIENTREPLAY, it won't process that MDS as having recovered on the next map. Sage or Joao, what guarantees does subscription provide?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

MDS in clientreply state already start servering requests. It also
make MDS::handle_mds_recovery() and MDS::recovery_done() match.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 282fa64..b91dcbd 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
     set<int> oldactive, active;
     oldmap->get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
+    oldmap->get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
     mdsmap->get_mds_set(active, MDSMap::STATE_ACTIVE);
+    mdsmap->get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
     for (set<int>::iterator p = active.begin(); p != active.end(); ++p)
       if (*p != whoami &&           // not me
           oldactive.count(*p) == 0) // newly so?
--
1.7.11.7
Re: [PATCH 07/39] mds: mark connection down when MDS fails
Reviewed-by: Greg Farnum g...@inktank.com
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

So if the MDS restarts and uses the same address, it does not get old messages.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 859782a..282fa64 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1046,8 +1046,10 @@ void MDS::handle_mds_map(MMDSMap *m)
     oldmap->get_failed_mds_set(oldfailed);
     mdsmap->get_failed_mds_set(failed);
     for (set<int>::iterator p = failed.begin(); p != failed.end(); ++p)
-      if (oldfailed.count(*p) == 0)
+      if (oldfailed.count(*p) == 0) {
+        messenger->mark_down(oldmap->get_inst(*p).addr);
         mdcache->handle_mds_failure(*p);
+      }

     // or down then up?
     //  did their addr/inst change?
@@ -1055,8 +1057,10 @@
     mdsmap->get_up_mds_set(up);
     for (set<int>::iterator p = up.begin(); p != up.end(); ++p)
       if (oldmap->have_inst(*p) &&
-          oldmap->get_inst(*p) != mdsmap->get_inst(*p))
+          oldmap->get_inst(*p) != mdsmap->get_inst(*p)) {
+        messenger->mark_down(oldmap->get_inst(*p).addr);
         mdcache->handle_mds_failure(*p);
+      }
   }

   if (is_clientreplay() || is_active() || is_stopping()) {
     // did anyone stop?
--
1.7.11.7
Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.
Oh, also: s/clientreply/clientreplay in the commit message
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

MDS in clientreply state already start servering requests. It also
make MDS::handle_mds_recovery() and MDS::recovery_done() match.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 282fa64..b91dcbd 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
     set<int> oldactive, active;
     oldmap->get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
+    oldmap->get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
     mdsmap->get_mds_set(active, MDSMap::STATE_ACTIVE);
+    mdsmap->get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
     for (set<int>::iterator p = active.begin(); p != active.end(); ++p)
       if (*p != whoami &&           // not me
           oldactive.count(*p) == 0) // newly so?
--
1.7.11.7
Re: [PATCH 09/39] mds: defer eval gather locks when removing replica
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

Locks' states should not change between composing the cache rejoin ack messages and sending the message. If Locker::eval_gather() is called in MDCache::{inode,dentry}_remove_replica(), it may wake requests and change locks' states.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 51 +++++++++++++++++++++++++++----------------------
 src/mds/MDCache.h  |  8 +++++---
 2 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 19dc60b..0f6b842 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3729,6 +3729,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
   // possible response(s)
   MMDSCacheRejoin *ack = 0;      // if survivor
   set<vinodeno_t> acked_inodes;  // if survivor
+  set<SimpleLock *> gather_locks;  // if survivor
   bool survivor = false;  // am i a survivor?

   if (mds->is_clientreplay() || mds->is_active() || mds->is_stopping()) {
@@ -3851,7 +3852,7 @@
       assert(dnl->is_primary());

       if (survivor && dn->is_replica(from))
-        dentry_remove_replica(dn, from);  // this induces a lock gather completion
+        dentry_remove_replica(dn, from, gather_locks);  // this induces a lock gather completion

This comment is no longer accurate :)

       int dnonce = dn->add_replica(from);
       dout(10) << " have " << *dn << dendl;
       if (ack)
@@ -3864,7 +3865,7 @@
       assert(in);

       if (survivor && in->is_replica(from))
-        inode_remove_replica(in, from);
+        inode_remove_replica(in, from, gather_locks);
       int inonce = in->add_replica(from);
       dout(10) << " have " << *in << dendl;
@@ -3887,7 +3888,7 @@
     CInode *in = get_inode(*p);
     assert(in);   // hmm fixme wrt stray?
     if (survivor && in->is_replica(from))
-      inode_remove_replica(in, from);    // this induces a lock gather completion
+      inode_remove_replica(in, from, gather_locks);    // this induces a lock gather completion

Same here. Other than those, looks good.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

     int inonce = in->add_replica(from);
     dout(10) << " have base " << *in << dendl;
@@ -3909,8 +3910,11 @@
       ack->add_inode_base(in);
     }

-    rejoin_scour_survivor_replicas(from, ack, acked_inodes);
+    rejoin_scour_survivor_replicas(from, ack, gather_locks, acked_inodes);
     mds->send_message(ack, weak->get_connection());
+
+    for (set<SimpleLock*>::iterator p = gather_locks.begin(); p != gather_locks.end(); ++p)
+      mds->locker->eval_gather(*p);
   } else {
     // done?
     assert(rejoin_gather.count(from));
@@ -4055,7 +4059,9 @@ bool MDCache::parallel_fetch_traverse_dir(inodeno_t ino, filepath& path,
  * all validated replicas are acked with a strong nonce, etc.  if that isn't in the
  * ack, the replica dne, and we can remove it from our replica maps.
  */
-void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set<vinodeno_t>& acked_inodes)
+void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack,
+                                             set<SimpleLock *>& gather_locks,
+                                             set<vinodeno_t>& acked_inodes)
 {
   dout(10) << "rejoin_scour_survivor_replicas from mds." << from << dendl;
@@ -4070,7 +4076,7 @@
     if (in->is_auth() &&
         in->is_replica(from) &&
         acked_inodes.count(p->second->vino()) == 0) {
-      inode_remove_replica(in, from);
+      inode_remove_replica(in, from, gather_locks);
       dout(10) << " rem " << *in << dendl;
     }
@@ -4099,7 +4105,7 @@
       if (dn->is_replica(from) &&
           (ack->strong_dentries.count(dir->dirfrag()) == 0 ||
            ack->strong_dentries[dir->dirfrag()].count(string_snap_t(dn->name, dn->last)) == 0)) {
-        dentry_remove_replica(dn, from);
+        dentry_remove_replica(dn, from, gather_locks);
        dout(10) << " rem " << *dn << dendl;
      }
    }
@@ -6189,6 +6195,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
     return;
   }

+  set<SimpleLock *> gather_locks;
   // loop over realms
   for (map<dirfrag_t,MCacheExpire::realm>::iterator p = m->realms.begin();
        p != m->realms.end();
@@ -6255,7 +6262,7 @@
        // remove from our cached_by
        dout(7) << " inode expire on " << *in << " from mds." << from
                << " cached_by was " << in->get_replicas() << dendl;
-       inode_remove_replica(in, from);
+       inode_remove_replica(in, from, gather_locks);
       } else {
        // this is an old nonce, ignore expire.
@@ -6332,7 +6339,7 @@
        if (nonce == dn->get_replica_nonce(from)) {
          dout(7) << " dentry_expire on " << *dn << " from mds." << from << dendl;
-
Re: CephFS: stable release?
On Wednesday, March 20, 2013 at 1:22 PM, Pascal wrote: On Sun, 24 Feb 2013 14:41:27 -0800, Gregory Farnum g...@inktank.com wrote: On Saturday, February 23, 2013 at 2:14 AM, Gandalf Corvotempesta wrote: Hi all, do you have an ETA about a stable release (or something usable in production) for CephFS? Short answer: no. However, we do have a team of people working on the FS again as of a month or so ago. We're doing a lot of stabilization (bug fixes), code cleanups, and utility work in the coming months; we can estimate the utility and cleanup work but not the bugs that we'll find, and those are our main concern right now. Depending on how the next couple months of QA and bug fixing go we should be able to publicize real estimates soonish. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Hello Gregory, is your response still up-to-date? The FAQ (http://ceph.com/docs/master/faq/) says: "Ceph's object store (RADOS) is production ready. We'll put out some blog posts and emails when we have anything more to report. :)"

RADOS is ready, but CephFS is a whole separate layer above it.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

Replicated objects need to be added into the cache immediately.

Signed-off-by: Yan, Zheng zheng.z@intel.com

Why do we need to add them right away? Shouldn't we have a journaled replica if we need it?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

---
 src/mds/MDCache.cc | 12
 src/mds/MDCache.h  |  2 +-
 src/mds/MDS.cc     |  6 +++---
 src/mds/Server.cc  | 55 +++---
 4 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 0f6b842..b668842 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context *fin, bufferlist& bl, int r)

 /* */
+int MDCache::get_num_client_requests()
+{
+  int count = 0;
+  for (hash_map<metareqid_t, MDRequest*>::iterator p = active_requests.begin();
+       p != active_requests.end();
+       ++p) {
+    if (p->second->reqid.name.is_client() && !p->second->is_slave())
+      count++;
+  }
+  return count;
+}
+
 /* This function takes over the reference to the passed Message */
 MDRequest *MDCache::request_start(MClientRequest *req)
 {

diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index a9f05c6..4634121 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -240,7 +240,7 @@ protected:
   hash_map<metareqid_t, MDRequest*> active_requests;

 public:
-  int get_num_active_requests() { return active_requests.size(); }
+  int get_num_client_requests();

   MDRequest* request_start(MClientRequest *req);
   MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index b91dcbd..e99eecc 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
       mdcache->is_open() &&
       replay_queue.empty() &&
       want_state == MDSMap::STATE_CLIENTREPLAY) {
-    dout(10) << " still have " << mdcache->get_num_active_requests()
-             << " active replay requests" << dendl;
-    if (mdcache->get_num_active_requests() == 0)
+    int num_requests = mdcache->get_num_client_requests();
+    dout(10) << " still have " << num_requests << " active replay requests" << dendl;
+    if (num_requests == 0)
       clientreplay_done();
   }

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4c4c86b..8e89e4c 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
              (m->get_type() == CEPH_MSG_CLIENT_REQUEST &&
               (static_cast<MClientRequest*>(m))->is_replay()))) {
       // replaying!
-    } else if (mds->is_clientreplay() && m->get_type() == MSG_MDS_SLAVE_REQUEST &&
-               ((static_cast<MMDSSlaveRequest*>(m))->is_reply() ||
-                !mds->mdsmap->is_active(m->get_source().num()))) {
-      // slave reply or the master is also in the clientreplay stage
+    } else if (m->get_type() == MSG_MDS_SLAVE_REQUEST) {
+      // handle_slave_request() will wait if necessary
     } else {
       dout(3) << "not active yet, waiting" << dendl;
       mds->wait_for_active(new C_MDS_RetryMessage(mds, m));
@@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
   if (m->is_reply())
     return handle_slave_request_reply(m);

+  CDentry *straydn = NULL;
+  if (m->stray.length() > 0) {
+    straydn = mdcache->add_replica_stray(m->stray, from);
+    assert(straydn);
+    m->stray.clear();
+  }
+
   // am i a new slave?
   MDRequest *mdr = NULL;
   if (mdcache->have_request(m->get_reqid())) {
@@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
       m->put();
       return;
     }
-    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), m->get_source().num());
+    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), from);
   }
   assert(mdr->slave_request == 0);     // only one at a time, please!
+
+  if (straydn) {
+    mdr->pin(straydn);
+    mdr->straydn = straydn;
+  }
+
+  if (!mds->is_clientreplay() && !mds->is_active() && !mds->is_stopping()) {
+    dout(3) << "not clientreplay|active yet, waiting" << dendl;
+    mds->wait_for_replay(new C_MDS_RetryMessage(mds, m));
+    return;
+  } else if (mds->is_clientreplay() && !mds->mdsmap->is_clientreplay(from) &&
+             mdr->locks.empty()) {
+    dout(3) << "not active yet, waiting" << dendl;
+    mds->wait_for_active(new C_MDS_RetryMessage(mds, m));
+    return;
+  }
+
   mdr->slave_request = m;

   dispatch_slave_request(mdr);
@@ -1339,6 +1361,12 @@ void Server::handle_slave_request_reply(MMDSSlaveRequest *m)
 {
   int from = m->get_source().num();

+  if (!mds->is_clientreplay() && !mds->is_active() && !mds->is_stopping()) {
+    dout(3) << "not clientreplay|active yet, waiting" << dendl;
+    mds->wait_for_replay(new C_MDS_RetryMessage(mds, m));
+    return;
+  }
+
   if (m->get_op() == MMDSSlaveRequest::OP_COMMITTED) {
     metareqid_t r = m->get_reqid();
     mds->mdcache->committed_master_slave(r, from);
@@ -5138,10 +5166,8 @@ void Server::handle_slave_rmdir_prep(MDRequest *mdr)
   dout(10) << " dn " << *dn << dendl;
   mdr->pin(dn);

-  assert(mdr->slave_request->stray.length()
Re: [PATCH] ceph: fix buffer pointer advance in ceph_sync_write
Sage beat me to it and merged this in last night. Thanks much!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Monday, March 18, 2013 at 6:46 PM, Henry C Chang wrote: We should advance the user data pointer by _len_ instead of _written_. _len_ is the data length written in each iteration while _written_ is the accumulated data length we have written out.

Signed-off-by: Henry C Chang henry.cy.ch...@gmail.com
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index e51558f..4bcbcb6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -608,7 +608,7 @@ out:
 	pos += len;
 	written += len;
 	left -= len;
-	data += written;
+	data += len;
 	if (left)
 		goto more;
--
1.7.9.5
Re: deb/rpm package purge
On Tuesday, March 19, 2013 at 12:48 PM, Sage Weil wrote: Should the package purge remove /var/lib/ceph/* (potential mon data, osd data) and/or /var/log/ceph/* (logs)? Right now it does, but mysql, for example, leaves /var/lib/mysql where it is (not sure about logs).

I'm with Mark's ticket on this (http://tracker.ceph.com/issues/4505). Config data in /etc/ceph and logs in /var/log/ceph are fine to remove, but storage data isn't. That's essentially user-generated and not something that can be recovered or rebuilt following the purge. Keyrings might not be unreasonable to delete, but I don't think they're necessary and certainly aren't worth putting in the work to separate from the other user data in /var/lib/ceph.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [PATCH 04/39] mds: make sure table request id unique
Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests…). More importantly, this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart. I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so…
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

When an MDS becomes active, the table server re-sends 'agree' messages for old prepared requests. If the recovered MDS starts a new table request at the same time, the new request's ID can happen to be the same as an old prepared request's ID, because the current table client assigns request IDs from zero after the MDS restarts.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc            | 3 +++
 src/mds/MDSTableClient.cc | 5 +
 src/mds/MDSTableClient.h  | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index bb1c833..859782a 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
       dout(2) << "boot_start " << step << ": opening snap table" << dendl;
       snapserver->load(gather.new_sub());
     }
+
+    anchorclient->init();
+    snapclient->init();

     dout(2) << "boot_start " << step << ": opening mds log" << dendl;
     mdlog->open(gather.new_sub());

diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index ea021f5..beba0a3 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -34,6 +34,11 @@
 #undef dout_prefix
 #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "

+void MDSTableClient::init()
+{
+  // make reqid unique between MDS restarts
+  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
+}

 void MDSTableClient::handle_request(class MMDSTableRequest *m)
 {

diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
index e15837f..78035db 100644
--- a/src/mds/MDSTableClient.h
+++ b/src/mds/MDSTableClient.h
@@ -63,6 +63,8 @@ public:
   MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
   virtual ~MDSTableClient() {}

+  void init();
+
   void handle_request(MMDSTableRequest *m);
   void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
--
1.7.11.7
Re: Direct IO on CephFS for blocks larger than 8MB
On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote: The following patch should fix the problem. -Henry

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index e51558f..4bcbcb6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -608,7 +608,7 @@ out:
 	pos += len;
 	written += len;
 	left -= len;
-	data += written;
+	data += len;
 	if (left)
 		goto more;

This looks good to me. If you'd like to submit it as a proper patch with a sign-off I'll pull it into our tree. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: Crash and strange things on MDS
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote: On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com wrote: You will find the archive here: snip The data is not anonymized. Interesting folders/files here are /user_309bbd38-3cff-468d-a465-dc17c260de0c/* Sorry for the delay, but I have retrieved this archive locally at least so if you want to remove it from your webserver you can do so. :) Also, I notice when I untar it that the file name includes filtered — what filters did you run it through? Hi Gregory, Do you have any news about it? I wrote a couple tools to do log analysis and created a number of bugs to make the MDS more amenable to analysis as a result of this. Having spot-checked some of your longer-running requests, they're all getattrs or setattrs contending on files in what look to be shared cache and php libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces. On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :) -Greg
Re: Crash and strange things on MDS
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote: Thank you a lot for these explanations, looking forward to these fixes! Do you have some public bug reports regarding this to link us to? Good luck, thank you for your great job and have a nice weekend. Marc-Antoine Perennou

Well, for now the fixes are for stuff like "make analysis take less time" and "export timing information more easily". The most immediately applicable one is probably http://tracker.ceph.com/issues/4354, which I hope to start on next week and should be done by the end of the sprint.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: CephFS Space Accounting and Quotas
[Putting list back on cc]

On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote: On 03/15/2013 04:23 PM, Greg Farnum wrote: As I come back and look at these again, I'm not sure what the context for these logs is. Which test did they come from, and which behavior (slow or not slow, etc) did you see? :) -Greg

They come from a test where I had debug mds = 20 and debug ms = 1 on the MDS while writing files from 198 clients. It turns out that for some reason I need debug mds = 20 during writing to reproduce the slow stat behavior later. strace.find.dirs.txt.bz2 contains the log of running strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \; From that output, I believe that the stat of at least these files is slow: zero0.rc11 zero0.rc30 zero0.rc46 zero0.rc8 zero0.tc103 zero0.tc105 zero0.tc106 I believe that log shows slow stats on more files, but those are the first few. mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the find command started, until just after the fifth or sixth slow stat from the list above. I haven't yet tried to find other ways of reproducing this, but so far it appears that something happens during the writing of the files that ends up causing the condition that results in slow stat commands. I have the full MDS log from the writing of the files as well, but it's big. Is that what you were after? Thanks for taking a look! -- Jim

I just was coming back to these to see what new information was available, but I realized we'd discussed several tests and I wasn't sure what these ones came from. That information is enough, yes. If in fact you believe you've only seen this with high-level MDS debugging, I believe the cause is as I mentioned last time: the MDS is flapping a bit and so some files get marked as needsrecover, but they aren't getting recovered asynchronously, and the first thing that pokes them into doing a recover is the stat.
That's definitely not the behavior we want, so I'll be poking around the code a bit and generating bugs; but given that explanation it's a bit less scary than random slow stats are, so it's not such a high priority. :) Do let me know if you come across it without the MDS and clients having had connection issues!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: Direct IO on CephFS for blocks larger than 8MB
On Thursday, March 14, 2013 at 8:20 AM, Sage Weil wrote: On Thu, 14 Mar 2013, Huang, Xiwei wrote: Hi, all, I noticed that CephFS fails to support Direct IO for blocks larger than 8MB, say: sudo dd if=/dev/zero of=mnt/cephfs/foo bs=16M count=1 oflag=direct dd: writing `mnt/cephfs/foo: Bad address 1+0 records in 0+0 records out 0 bytes (0 B) copied, 0.213948 s, 0.0 kB/s My version of Ceph is 0.56.1. I also found the bug has already been reported as Bug #2657. Is this fixed in the new 0.58 version?

I'm pretty sure this is a problem on the kernel client side of things, not the server side (which by default handles writes up to ~100MB or so). I suspect it isn't terribly difficult to fix, but hasn't been prioritized... sage

My guess too. Are direct IO writes of that size a common thing or of great import to you? Either way, a comment on the tracker saying you've run into it will promote it up when we're doing bug scrubs and backlog reviews. :)
-Greg
Re: CephFS locality API RFC
On Thursday, March 14, 2013 at 11:14 AM, Noah Watkins wrote: The current CephFS API is used to extract locality information as follows: First we get a list of OSD IDs: ceph_get_file_extent_osds(offset) -> [OSD ID]* Using the OSD IDs we can then query for the CRUSH bucket hierarchy: ceph_get_osd_crush_location(osd_id) -> path The path includes hostname information, but we'd still like to get the IP. The current API for doing this is: ceph_get_file_stripe_address(offset) -> [sockaddr]* that returns an IP for each OSD that holds replicas. The order of the output list should be the same as the OSD list, but it'd be nice to have a consistent API that deals with OSD id, making the correspondence explicit.

Agreed. We should probably deprecate the get_file_stripe_address() and make them turn IDs into addresses on their own.

For instance: ceph_get_file_stripe_address(osd_id) -> sockaddr

How about ceph_get_osd_address(osd_id) -> sockaddr ;)

Another option is to have `ceph_get_osd_crush_location` return both the path and a sockaddr.

No way — that's conflating two different things rather more than we should be. For one thing the sockaddr can change during a daemon restart but the crush location won't.
-Greg
Re: CephFS locality API RFC
On Thursday, March 14, 2013 at 11:33 AM, Noah Watkins wrote: On Mar 14, 2013, at 11:29 AM, Greg Farnum g...@inktank.com wrote: On Thursday, March 14, 2013 at 11:14 AM, Noah Watkins wrote: The current CephFS API is used to extract locality information as follows: First we get a list of OSD IDs: ceph_get_file_extent_osds(offset) -> [OSD ID]* Using the OSD IDs we can then query for the CRUSH bucket hierarchy: ceph_get_osd_crush_location(osd_id) -> path The path includes hostname information, but we'd still like to get the IP. The current API for doing this is: ceph_get_file_stripe_address(offset) -> [sockaddr]* that returns an IP for each OSD that holds replicas. The order of the output list should be the same as the OSD list, but it'd be nice to have a consistent API that deals with OSD id, making the correspondence explicit. Agreed. We should probably deprecate the get_file_stripe_address() and make them turn IDs into addresses on their own.

Is there an API deprecation protocol, or just -ENOTSUPP?

Well, for the moment I was thinking of sticking DEPRECATED next to it and not using it anywhere else — but that is probably an acceptable choice instead. I doubt anybody's using it outside of the old Hadoop bindings. Which I am looking forward to being able to purge out of all memory…. ;)
-Greg
Re: [PATCH V2] ceph: use i_release_count to indicate dir's completeness
Looks good, thanks. :) We'll also be testing the first patch in this series.
-Greg

On Wednesday, March 13, 2013 at 4:44 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

Current ceph code tracks directory's completeness in two places. ceph_readdir() checks i_release_count to decide if it can set the I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE flag. This indirection introduces locking complexity.

This patch adds a new variable i_complete_count to ceph_inode_info. Set i_release_count's value to it when marking a directory complete. By comparing the two variables, we know if a directory is complete.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
Changes since V1: define i_complete_count as atomic_t

 fs/ceph/caps.c       |  4 ++--
 fs/ceph/dir.c        | 25 +
 fs/ceph/inode.c      | 13 +++--
 fs/ceph/mds_client.c | 10 +++---
 fs/ceph/super.h      | 42 --
 5 files changed, 45 insertions(+), 49 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 76634f4..124e8a1 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -490,7 +490,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 		ci->i_rdcache_gen++;

 	/*
-	 * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
+	 * if we are newly issued FILE_SHARED, mark dir not complete; we
	 * don't know what happened to this directory while we didn't
	 * have the cap.
	 */
@@ -499,7 +499,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 		ci->i_shared_gen++;
 		if (S_ISDIR(ci->vfs_inode.i_mode)) {
 			dout(" marking %p NOT complete\n", &ci->vfs_inode);
-			ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
+			__ceph_dir_clear_complete(ci);
 		}
 	}
 }

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 76821be..11966c4 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
  * falling back to a normal sync readdir if any dentries in the dir
  * are dropped.
  *
- * I_COMPLETE tells indicates we have all dentries in the dir.  It is
+ * Complete dir indicates that we have all dentries in the dir.  It is
  * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
  * the MDS if/when the directory is modified).
  */
@@ -198,8 +198,8 @@ more:
 	filp->f_pos++;

 	/* make sure a dentry wasn't dropped while we didn't have parent lock */
-	if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
-		dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
+	if (!ceph_dir_is_complete(dir)) {
+		dout(" lost dir complete on %p; falling back to mds\n", dir);
 		err = -EAGAIN;
 		goto out;
 	}
@@ -258,7 +258,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
 	if (filp->f_pos == 0) {
 		/* note dir version at start of readdir so we can tell
		 * if any dentries get dropped */
-		fi->dir_release_count = ci->i_release_count;
+		fi->dir_release_count = atomic_read(&ci->i_release_count);

 		dout("readdir off 0 -> '.'\n");
 		if (filldir(dirent, ".", 1, ceph_make_fpos(0, 0),
@@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
 	if ((filp->f_pos == 2 || fi->dentry) &&
 	    !ceph_test_mount_opt(fsc, NOASYNCREADDIR) &&
 	    ceph_snap(inode) != CEPH_SNAPDIR &&
-	    (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
+	    __ceph_dir_is_complete(ci) &&
 	    __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
 		spin_unlock(&ci->i_ceph_lock);
 		err = __dcache_readdir(filp, dirent, filldir);
@@ -350,7 +350,8 @@ more:
 		if (!req->r_did_prepopulate) {
 			dout("readdir !did_prepopulate");
-			fi->dir_release_count--;    /* preclude I_COMPLETE */
+			/* preclude from marking dir complete */
+			fi->dir_release_count--;
 		}

 		/* note next offset and last dentry name */
@@ -428,9 +429,9 @@ more:
	 * the complete dir contents in our cache.
	 */
	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_release_count == fi->dir_release_count) {
+	if (atomic_read(&ci->i_release_count) == fi->dir_release_count) {
		dout(" marking %p complete\n", inode);
-		ci->i_ceph_flags |= CEPH_I_COMPLETE;
+		__ceph_dir_set_complete(ci, fi->dir_release_count);
		ci->i_max_offset = filp->f_pos;
	}
	spin_unlock(&ci->i_ceph_lock);
@@ -605,7 +606,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
		    fsc->mount_options->snapdir_name,
		    dentry->d_name.len) &&
	    !is_root_ceph_dentry(dir, dentry) &&
-	    (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
+	    __ceph_dir_is_complete(ci) &&
	    (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
		spin_unlock(&ci->i_ceph_lock);
		dout(" dir %p complete, -ENOENT\n", dir);
@@ -909,7 +910,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
	 */

	/* d_move screws up d_subdirs order */
-	ceph_i_clear(new_dir, CEPH_I_COMPLETE);
+	ceph_dir_clear_complete(new_dir);

	d_move(old_dentry, new_dentry);
@@ -1079,7 +1080,7 @@ static void ceph_d_prune(struct dentry *dentry)
	if (IS_ROOT(dentry))
		return;

-	/* if we
Re: OSD memory leaks?
It sounds like maybe you didn't rename the new pool to use the old pool's name? Glance is looking for a specific pool to store its data in; I believe it's configurable but you'll need to do one or the other.
-Greg

On Wednesday, March 13, 2013 at 3:38 PM, Dave Spano wrote: Sebastien, I'm not totally sure yet, but everything is still working. Sage and Greg, I copied my glance image pool per the posting I mentioned previously, and everything works when I use the ceph tools. I can export rbds from the new pool and delete them as well. I noticed that the copied images pool does not work with glance. I get this error when I try to create images in the new pool. If I put the old pool back, I can create images no problem. Is there something I'm missing in glance that I need to work with a pool created in bobtail? I'm using Openstack Folsom. File /usr/lib/python2.7/dist-packages/glance/api/v1/images.py, line 437, in _upload image_meta['size']) File /usr/lib/python2.7/dist-packages/glance/store/rbd.py, line 244, in add image_size, order) File /usr/lib/python2.7/dist-packages/glance/store/rbd.py, line 207, in _create_image features=rbd.RBD_FEATURE_LAYERING) File /usr/lib/python2.7/dist-packages/rbd.py, line 194, in create raise make_ex(ret, 'error creating image') PermissionError: error creating image

Dave Spano

- Original Message - From: Sébastien Han han.sebast...@gmail.com To: Dave Spano dsp...@optogenics.com Cc: Greg Farnum g...@inktank.com, ceph-devel ceph-devel@vger.kernel.org, Sage Weil s...@inktank.com, Wido den Hollander w...@42on.com, Sylvain Munaut s.mun...@whatever-company.com, Samuel Just sam.j...@inktank.com, Vladislav Gorbunov vadi...@gmail.com Sent: Wednesday, March 13, 2013 3:59:03 PM Subject: Re: OSD memory leaks?

Dave, Just to be sure, did the log max recent=1 _completely_ stop the memory leak or did it slow it down? Thanks! -- Regards, Sébastien Han.

On Wed, Mar 13, 2013 at 2:12 PM, Dave Spano dsp...@optogenics.com wrote: Lol. I'm totally fine with that. My glance images pool isn't used too often. I'm going to give that a try today and see what happens. I'm still crossing my fingers, but since I added log max recent=1 to ceph.conf, I've been okay despite the improper pg_num, and a lot of scrubbing/deep scrubbing yesterday. Dave Spano

- Original Message - From: Greg Farnum g...@inktank.com To: Dave Spano dsp...@optogenics.com Cc: ceph-devel ceph-devel@vger.kernel.org, Sage Weil s...@inktank.com, Wido den Hollander w...@42on.com, Sylvain Munaut s.mun...@whatever-company.com, Samuel Just sam.j...@inktank.com, Vladislav Gorbunov vadi...@gmail.com, Sébastien Han han.sebast...@gmail.com Sent: Tuesday, March 12, 2013 5:37:37 PM Subject: Re: OSD memory leaks?

Yeah. There's not anything intelligent about that cppool mechanism. :) -Greg

On Tuesday, March 12, 2013 at 2:15 PM, Dave Spano wrote: I'd rather shut the cloud down and copy the pool to a new one than take any chances of corruption by using an experimental feature. My guess is that there cannot be any i/o to the pool while copying, otherwise you'll lose the changes that are happening during the copy, correct?

Dave Spano Optogenics Systems Administrator

- Original Message - From: Greg Farnum g...@inktank.com To: Sébastien Han han.sebast...@gmail.com Cc: Dave Spano dsp...@optogenics.com, ceph-devel ceph-devel@vger.kernel.org, Sage Weil s...@inktank.com, Wido den Hollander w...@42on.com, Sylvain Munaut s.mun...@whatever-company.com, Samuel Just sam.j...@inktank.com, Vladislav Gorbunov vadi...@gmail.com Sent: Tuesday, March 12, 2013 4:20:13 PM Subject: Re: OSD memory leaks? On Tuesday, March 12, 2013 at 1:10 PM, Sébastien Han wrote: Well to avoid unnecessary data movement
Re: OSD memory leaks?
On Tuesday, March 12, 2013 at 1:10 PM, Sébastien Han wrote: Well, to avoid unnecessary data movement, there is also an _experimental_ feature to change the number of PGs in a pool on the fly: ceph osd pool set poolname pg_num numpgs --allow-experimental-feature Don't do that. We've got a set of 3 patches which fix bugs we know about that aren't in bobtail yet, and I'm sure there's more we aren't aware of… -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com Cheers! -- Regards, Sébastien Han. On Tue, Mar 12, 2013 at 7:09 PM, Dave Spano dsp...@optogenics.com wrote: Disregard my previous question. I found my answer in the post below. Absolutely brilliant! I thought I was screwed! http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/8924 Dave Spano Optogenics Systems Administrator - Original Message - From: Dave Spano dsp...@optogenics.com To: Sébastien Han han.sebast...@gmail.com Cc: Sage Weil s...@inktank.com, Wido den Hollander w...@42on.com, Gregory Farnum g...@inktank.com, Sylvain Munaut s.mun...@whatever-company.com, ceph-devel ceph-devel@vger.kernel.org, Samuel Just sam.j...@inktank.com, Vladislav Gorbunov vadi...@gmail.com Sent: Tuesday, March 12, 2013 1:41:21 PM Subject: Re: OSD memory leaks? If one were stupid enough to have their pg_num and pgp_num set to 8 on two of their pools, how could you fix that?
Dave Spano - Original Message - From: Sébastien Han han.sebast...@gmail.com To: Vladislav Gorbunov vadi...@gmail.com Cc: Sage Weil s...@inktank.com, Wido den Hollander w...@42on.com, Gregory Farnum g...@inktank.com, Sylvain Munaut s.mun...@whatever-company.com, Dave Spano dsp...@optogenics.com, ceph-devel ceph-devel@vger.kernel.org, Samuel Just sam.j...@inktank.com Sent: Tuesday, March 12, 2013 9:43:44 AM Subject: Re: OSD memory leaks? Sorry, I mean pg_num and pgp_num on all pools, shown by ceph osd dump | grep 'rep size'. Well it's still 450 each... The default pg_num value 8 is NOT suitable for a big cluster. Thanks, I know, I'm not new with Ceph. What's your point here? I already said that pg_num was 450... -- Regards, Sébastien Han. On Tue, Mar 12, 2013 at 2:00 PM, Vladislav Gorbunov vadi...@gmail.com wrote: Sorry, I mean pg_num and pgp_num on all pools, shown by ceph osd dump | grep 'rep size'. The default pg_num value 8 is NOT suitable for a big cluster. 2013/3/13 Sébastien Han han.sebast...@gmail.com: Replica count has been set to 2. Why? -- Regards, Sébastien Han. On Tue, Mar 12, 2013 at 12:45 PM, Vladislav Gorbunov vadi...@gmail.com wrote: FYI I'm using 450 pgs for my pools. Please, can you show the number of object replicas? ceph osd dump | grep 'rep size' Vlad Gorbunov 2013/3/5 Sébastien Han han.sebast...@gmail.com: FYI I'm using 450 pgs for my pools. -- Regards, Sébastien Han.
On Fri, Mar 1, 2013 at 8:10 PM, Sage Weil s...@inktank.com wrote: On Fri, 1 Mar 2013, Wido den Hollander wrote: On 02/23/2013 01:44 AM, Sage Weil wrote: On Fri, 22 Feb 2013, Sébastien Han wrote: Hi all, I finally got a core dump. I did it with a kill -SEGV on the OSD process. https://www.dropbox.com/s/ahv6hm0ipnak5rf/core-ceph-osd-11-0-0-20100-1361539008 Hope we will get something out of it :-). AHA! We have a theory. The pg log isn't trimmed during scrub (because the old scrub code required that), but the new (deep) scrub can take a very long time, which means the pg log will eat RAM in the meantime.. especially under high iops. Does the number of PGs influence the memory leak? So my theory is that when you have a high number of PGs with a low number of objects per PG you don't
Re: [PATCH 2/2] ceph: use i_release_count to indicate dir's completeness
On Monday, March 11, 2013 at 5:42 AM, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com Current ceph code tracks directory's completeness in two places. ceph_readdir() checks i_release_count to decide if it can set the I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE flag. This indirection introduces locking complexity. This patch adds a new variable i_complete_count to ceph_inode_info. Set i_release_count's value to it when marking a directory complete. By comparing the two variables, we know if a directory is complete Signed-off-by: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com) --- fs/ceph/caps.c | 4 ++-- fs/ceph/dir.c | 25 + fs/ceph/inode.c | 13 +++-- fs/ceph/mds_client.c | 10 +++--- fs/ceph/super.h | 41 +++-- 5 files changed, 44 insertions(+), 49 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 76634f4..124e8a1 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -490,7 +490,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap, ci-i_rdcache_gen++; /* - * if we are newly issued FILE_SHARED, clear I_COMPLETE; we + * if we are newly issued FILE_SHARED, mark dir not complete; we * don't know what happened to this directory while we didn't * have the cap. */ @@ -499,7 +499,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap, ci-i_shared_gen++; if (S_ISDIR(ci-vfs_inode.i_mode)) { dout( marking %p NOT complete\n, ci-vfs_inode); - ci-i_ceph_flags = ~CEPH_I_COMPLETE; + __ceph_dir_clear_complete(ci); } } } diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 76821be..11966c4 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p) * falling back to a normal sync readdir if any dentries in the dir * are dropped. * - * I_COMPLETE tells indicates we have all dentries in the dir. It is + * Complete dir indicates that we have all dentries in the dir. 
It is * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by * the MDS if/when the directory is modified). */ @@ -198,8 +198,8 @@ more: filp-f_pos++; /* make sure a dentry wasn't dropped while we didn't have parent lock */ - if (!ceph_i_test(dir, CEPH_I_COMPLETE)) { - dout( lost I_COMPLETE on %p; falling back to mds\n, dir); + if (!ceph_dir_is_complete(dir)) { + dout( lost dir complete on %p; falling back to mds\n, dir); err = -EAGAIN; goto out; } @@ -258,7 +258,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir) if (filp-f_pos == 0) { /* note dir version at start of readdir so we can tell * if any dentries get dropped */ - fi-dir_release_count = ci-i_release_count; + fi-dir_release_count = atomic_read(ci-i_release_count); dout(readdir off 0 - '.'\n); if (filldir(dirent, ., 1, ceph_make_fpos(0, 0), @@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir) if ((filp-f_pos == 2 || fi-dentry) !ceph_test_mount_opt(fsc, NOASYNCREADDIR) ceph_snap(inode) != CEPH_SNAPDIR - (ci-i_ceph_flags CEPH_I_COMPLETE) + __ceph_dir_is_complete(ci) __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) { spin_unlock(ci-i_ceph_lock); err = __dcache_readdir(filp, dirent, filldir); @@ -350,7 +350,8 @@ more: if (!req-r_did_prepopulate) { dout(readdir !did_prepopulate); - fi-dir_release_count--; /* preclude I_COMPLETE */ + /* preclude from marking dir complete */ + fi-dir_release_count--; } /* note next offset and last dentry name */ @@ -428,9 +429,9 @@ more: * the complete dir contents in our cache. 
*/ spin_lock(ci-i_ceph_lock); - if (ci-i_release_count == fi-dir_release_count) { + if (atomic_read(ci-i_release_count) == fi-dir_release_count) { dout( marking %p complete\n, inode); - ci-i_ceph_flags |= CEPH_I_COMPLETE; + __ceph_dir_set_complete(ci, fi-dir_release_count); ci-i_max_offset = filp-f_pos; } spin_unlock(ci-i_ceph_lock); @@ -605,7 +606,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry, fsc-mount_options-snapdir_name, dentry-d_name.len) !is_root_ceph_dentry(dir, dentry) - (ci-i_ceph_flags CEPH_I_COMPLETE) + __ceph_dir_is_complete(ci) (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) { spin_unlock(ci-i_ceph_lock); dout( dir %p complete, -ENOENT\n, dir); @@ -909,7 +910,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry, */ /* d_move screws up d_subdirs order */ - ceph_i_clear(new_dir, CEPH_I_COMPLETE); + ceph_dir_clear_complete(new_dir); d_move(old_dentry, new_dentry); @@ -1079,7 +1080,7 @@ static void ceph_d_prune(struct dentry *dentry) if (IS_ROOT(dentry)) return; - /* if we are not hashed, we don't affect I_COMPLETE */ + /* if we are not hashed, we don't affect dir's completeness */ if (d_unhashed(dentry)) return; @@ -1087,7 +1088,7 @@ static
Re: CephFS Space Accounting and Quotas
On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: On 03/08/2013 07:05 PM, Greg Farnum wrote: On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: On 03/07/2013 08:15 AM, Jim Schutt wrote: On 03/06/2013 05:18 PM, Greg Farnum wrote: On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: [snip] Do you want the MDS log at 10 or 20? More is better. ;) OK, thanks. I've sent some mds logs via private email... -- Jim I'm going to need to probe into this a bit more, but on an initial examination I see that most of your stats are actually happening very quickly — it's just that occasionally they take quite a while. Interesting... Going through the MDS log for one of those, the inode in question is flagged with needsrecover from its first appearance in the log — that really shouldn't happen unless a client had write caps on it and the client disappeared. Any ideas? The slowness is being caused by the MDS going out and looking at every object which could be in the file — there are a lot since the file has a listed size of 8GB. For this run, the MDS logging slowed it down enough to cause the client caps to occasionally go stale. I don't think it's the cause of the issue, because I was having it before I turned MDS debugging up. My client caps never go stale at, e.g., debug mds 5. Oh, so this might be behaviorally different than you were seeing before? Drat. You had said before that each newfstatat was taking tens of seconds, whereas in the strace log you sent along most of the individual calls were taking a bit less than 20 milliseconds. Do you have an strace of them individually taking much more than that, or were you just noticing that they took a long time in aggregate? I suppose if you were going to run it again then just the message logging could also be helpful. That way we could at least check and see the message delays and if the MDS is doing other work in the course of answering a request. Otherwise, there were no signs of trouble while writing the files. 
Can you suggest which kernel client debugging I might enable that would help understand what is happening? Also, I have the full MDS log from writing the files, if that will help. It's big (~10 GiB). (There are several other mysteries here that can probably be traced to different varieties of non-optimal and buggy code as well — there is a client which has write caps on the inode in question despite it needing recovery, but the recovery isn't triggered until the stat event occurs, etc). OK, thanks for taking a look. Let me know if there is other logging I can enable that will be helpful. I'm going to want to spend more time with the log I've got, but I'll think about if there's a different set of data we can gather less disruptively. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Estimating OSD memory requirements (was Re: stuff for v0.56.4)
On Monday, March 11, 2013 at 8:10 AM, Bryan K. Wright wrote: s...@inktank.com said: On Thu, 7 Mar 2013, Bryan K. Wright wrote: s...@inktank.com said: - pg log trimming (probably a conservative subset) to avoid memory bloat Anything that reduces the size of OSD processes would be appreciated. You can probably do this with just log max recent = 1000 By default it's keeping 100k lines of logs in memory, which can eat a lot of RAM (but is great when debugging issues). Thanks for the tip about log max recent. I've made this change, but it doesn't seem to significantly reduce the size of the OSD processes. In general, are there some rules of thumb for estimating the memory requirements for OSDs? I see processes blow up to 8GB of resident memory sometimes. If I need to allow for that much memory per OSD process, I may have to just walk away from Ceph. Does the memory usage scale with the size of the disks? I've been trying to run 12 OSDs with 12 2TB disks on a single box. Would I be better off (memory-usage-wise) if I RAIDed the disks together and used a single OSD process? Memory use depends on several things, but the most important are how many PGs the daemon is hosting, and whether it's undergoing recovery of some kind. (Absolute disk size is not involved.) If you're getting up to 8GB per daemon, it sounds as if you may have a bit too many PGs. You could try RAIDing some of your drives together instead, yes -- memory/CPU utilization is one of the trade-offs there, balanced against larger discrete failure units and the loss of space or reliability (depending on the RAID chosen). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: CephFS Space Accounting and Quotas
On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote: On 03/11/2013 09:48 AM, Greg Farnum wrote: On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: For this run, the MDS logging slowed it down enough to cause the client caps to occasionally go stale. I don't think it's the cause of the issue, because I was having it before I turned MDS debugging up. My client caps never go stale at, e.g., debug mds 5. Oh, so this might be behaviorally different than you were seeing before? Drat. You had said before that each newfstatat was taking tens of seconds, whereas in the strace log you sent along most of the individual calls were taking a bit less than 20 milliseconds. Do you have an strace of them individually taking much more than that, or were you just noticing that they took a long time in aggregate? When I did the first strace, I didn't turn on timestamps, and I was watching it scroll by. I saw several stats in a row take ~30 secs, at which point I got bored, and took a look at the strace man page to figure out how to get timestamps ;) Also, another difference is for that test, I was looking at files I had written the day before, whereas for the strace log I sent, there was only several minutes between writing and the strace of find. I thought I had eliminated the page cache issue by using fdatasync when writing the files. Perhaps the real issue is affected by that delay? I'm not sure. I can't think of any mechanism by which waiting longer would increase the time lags, though, so I doubt it. I suppose if you were going to run it again then just the message logging could also be helpful. That way we could at least check and see the message delays and if the MDS is doing other work in the course of answering a request. I can do as many trials as needed to isolate the issue. What message debugging level is sufficient on the MDS; 1? Yep, that will capture all incoming and outgoing messages. 
:) If you want I can attempt to duplicate my memory of the first test I reported, writing the files today and doing the strace tomorrow (with timestamps, this time). Also, would it be helpful to write the files with minimal logging, in hopes of inducing minimal timing changes, then upping the logging for the stat phase? Well that would give us better odds of not introducing failures of any kind during the write phase, and then getting accurate information on what's happening during the stats, so it probably would. Basically I'd like as much logging as possible without changing the states the system goes through. ;) -Greg
Re: MDS running at 100% CPU, no clients
This isn't bringing up anything in my brain, but I don't know what that _sample() function is actually doing — did you get any farther into it? -Greg On Wednesday, March 6, 2013 at 6:23 PM, Noah Watkins wrote: Which looks to be in a tight loop in the memory model _sample… (gdb) bt #0 0x7f0270d84d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x7f027046dd88 in std::__basic_file<char>::xsgetn(char*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #2 0x7f027046f4c5 in std::basic_filebuf<char, std::char_traits<char> >::underflow() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #3 0x7f0270467ceb in std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #4 0x0072bdd4 in MemoryModel::_sample(MemoryModel::snap*) () #5 0x005658db in MDCache::check_memory_usage() () #6 0x004ba929 in MDS::tick() () #7 0x00794c65 in SafeTimer::timer_thread() () #8 0x007958ad in SafeTimerThread::entry() () #9 0x7f0270d7de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 On Mar 6, 2013, at 6:18 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: On Mar 6, 2013, at 5:57 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: The MDS process in my cluster is running at 100% CPU. In fact I thought the cluster came down, but rather an ls was taking a minute. There aren't any clients active. I've left the process running in case there is any probing you'd like to do on it: virt res cpu 4629m 88m 5260 S 92 1.1 113:32.79 ceph-mds Thanks, Noah This is a ceph-mds child thread under strace. The only thread that appears to be doing anything.
root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372 Process 3372 attached - interrupt to quit read(1649, "7f0203235000-7f0203236000 ---p 0...", 8191) = 4050 read(1649, "7f0205053000-7f0205054000 ---p 0...", 8191) = 4050 read(1649, "7f0206e71000-7f0206e72000 ---p 0...", 8191) = 4050 read(1649, "7f0214144000-7f0214244000 rw-p 0...", 8191) = 4020 read(1649, "7f0215f62000-7f0216062000 rw-p 0...", 8191) = 4020 read(1649, "7f0217d8-7f0217e8 rw-p 0...", 8191) = 4020 read(1649, "7f0219b9e000-7f0219c9e000 rw-p 0...", 8191) = 4020 ... That file looks to be: ceph-mds 3337 root 1649r REG 0,3 0 266903 /proc/3337/maps (3337 is the parent process).
Re: changes to rados command
On Thursday, March 7, 2013 at 11:25 AM, Andrew Hume wrote: in order to make the rados command more useful in scripts, i'd like to make a change, specifically change to rados -p pool getomapval obj key [fmt] where fmt is an optional formatting parameter. i've implemented 'str' which will print the value as an unadorned string. what is the process for doing this? Patch submission, you mean? Github pull requests, sending a pull request with a git URL to the list, or sending straight patches to the list are all good. I'll like you more if you give me a URL of some form instead of making me get the patches out of email and into my git repo correctly, though. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: changes to rados command
(Re-added the list for future reference) Well, you'll need to learn how to use git at a basic level in order to be able to work effectively on Ceph (or most other open-source projects). Some links that might be helpful: http://www.joelonsoftware.com/items/2010/03/17.html http://www.ibm.com/developerworks/library/l-git-subversion-1/ http://try.github.com/ I haven't been through these all thoroughly, but the first one should describe the mental model changes that motivate git, the second looks to be a deep tutorial, and the third will teach you the mechanics. :) Github pull requests are a Github nicety; their website will teach you how to use them. A simple git URL just requires that your git repository be accessible over the internet, and then you tell us what the URL is and what branch to pull from, and we can do so. (This of course requires that your changes actually be in a branch, so you'll need to have the commits arranged nicely and such.) -Greg On Thursday, March 7, 2013 at 12:42 PM, Andrew Hume wrote: i don't know how to do the first two (but i am a quickish learner). i know how to type git diff | mail already. if you can guide me a little on how to do the git things, i'll do those. On Mar 7, 2013, at 1:37 PM, Greg Farnum wrote: On Thursday, March 7, 2013 at 11:25 AM, Andrew Hume wrote: in order to make the rados command more useful in scripts, i'd like to make a change, specifically change to rados -p pool getomapval obj key [fmt] where fmt is an optional formatting parameter. i've implemented 'str' which will print the value as an unadorned string. what is the process for doing this? Patch submission, you mean? Github pull requests, sending a pull request with a git URL to the list, or sending straight patches to the list are all good. I'll like you more if you give me a URL of some form instead of making me get the patches out of email and into my git repo correctly, though.
:) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com --- Andrew Hume 623-551-2845 (VO and best) 973-236-2014 (NJ) and...@research.att.com
Re: [PATCH 1/2] ceph: increase i_release_count when clear I_COMPLETE flag
I'm pulling this in for now to make sure this clears out that ENOENT bug we hit — but shouldn't we be fixing ceph_i_clear() to always bump the i_release_count? It doesn't seem like it would ever be correct without it, and these are the only two callers. The second one looks good to us and we'll test it but of course that can't go upstream through our tree. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thursday, March 7, 2013 at 3:36 AM, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com) If some dentries were pruned or FILE_SHARED cap was revoked while readdir is in progress. make sure ceph_readdir() does not mark the directory as complete. Signed-off-by: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com) --- fs/ceph/caps.c | 1 + fs/ceph/dir.c | 13 +++-- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 76634f4..35cebf3 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -500,6 +500,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap, if (S_ISDIR(ci-vfs_inode.i_mode)) { dout( marking %p NOT complete\n, ci-vfs_inode); ci-i_ceph_flags = ~CEPH_I_COMPLETE; + ci-i_release_count++; } } } diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 76821be..068304c 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -909,7 +909,11 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry, */ /* d_move screws up d_subdirs order */ - ceph_i_clear(new_dir, CEPH_I_COMPLETE); + struct ceph_inode_info *ci = ceph_inode(new_dir); + spin_lock(ci-i_ceph_lock); + ci-i_ceph_flags = ~CEPH_I_COMPLETE; + ci-i_release_count++; + spin_unlock(ci-i_ceph_lock); d_move(old_dentry, new_dentry); @@ -1073,6 +1077,7 @@ static int ceph_snapdir_d_revalidate(struct dentry *dentry, */ static void ceph_d_prune(struct dentry *dentry) { + struct ceph_inode_info *ci; dout(ceph_d_prune %p\n, dentry); /* do we have a valid parent? 
*/ @@ -1087,7 +1092,11 @@ static void ceph_d_prune(struct dentry *dentry) * we hold d_lock, so d_parent is stable, and d_fsdata is never * cleared until d_release */ - ceph_i_clear(dentry-d_parent-d_inode, CEPH_I_COMPLETE); + ci = ceph_inode(dentry-d_parent-d_inode); + spin_lock(ci-i_ceph_lock); + ci-i_ceph_flags = ~CEPH_I_COMPLETE; + ci-i_release_count++; + spin_unlock(ci-i_ceph_lock); } /* -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
CephFS Space Accounting and Quotas (was: CephFS First product release discussion)
On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote: On 03/05/2013 12:33 PM, Sage Weil wrote: Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al'. Environments with 100k users also tend to be very dynamic with adding and removing users all the time, so creating separate filesystems for them would be very time consuming. Now, I'm not talking about enforcing soft or hard quotas, I'm just talking about knowing how much space uid X and Y consume on the filesystem. The part I'm most unclear on is what use cases people have where uid X and Y are spread around the file system (not in a single or small set of subdirectories) and per-user (not, say, per-project) quotas are still necessary. In most environments, users get their own home directory and everything lives there... Hmmm, is there a tool I should be using that will return the space used by a directory and all its descendants? If it's 'du', that tool is definitely not fast for me. I'm doing an 'strace du -s path', where path has one subdirectory which contains ~600 files. I've got ~200 clients mounting the file system, and each client wrote 3 files in that directory. I'm doing the 'du' from one of those nodes, and the strace is showing me du is doing a 'newfstat' for each file. For each file that was written on a different client from where du is running, that 'newfstat' takes tens of seconds to return. Which means my 'du' has been running for quite some time and hasn't finished yet. I'm hoping there's another tool I'm supposed to be using that I don't know about yet. Our use case includes tens of millions of files written from thousands of clients, and whatever tool we use to do space accounting needs to not walk an entire directory tree, checking each file. Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful!
:) Unfortunately we can't (currently) use those recursive statistics to do proper hard quotas on subdirectories as they're lazily propagated following client ops, not as part of the updates. (Lazily in the technical sense — it's actually quite fast in general). But they'd work fine for soft quotas if somebody wrote the code, or to block writes on a slight time lag. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
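Greg's point about the recursive statistics — they are updated lazily, propagated up the tree behind client operations, which makes them fine for soft quotas but not for strict hard quotas — can be sketched with a toy model. This is purely illustrative Python, not Ceph code; the class and field names (Dir, rbytes, pending) are invented for the sketch.

```python
# Illustrative model only -- not Ceph code. CephFS keeps a recursive
# byte count on each directory and propagates updates upward lazily;
# a soft-quota checker reads the possibly slightly stale totals
# instead of walking every file the way 'du' does.

class Dir:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.rbytes = 0     # recursive size, as last propagated
        self.pending = 0    # local delta not yet pushed upward

    def write(self, nbytes):
        # A client write updates this directory immediately...
        self.rbytes += nbytes
        self.pending += nbytes  # ...but ancestors only hear about it later

    def propagate(self):
        # Lazy propagation: push the accumulated delta up one level.
        if self.parent and self.pending:
            self.parent.rbytes += self.pending
            self.parent.pending += self.pending
            self.pending = 0

def over_soft_quota(d, limit):
    # Cheap check: one read of the (possibly stale) recursive total.
    return d.rbytes > limit

root = Dir("/")
home = Dir("home", root)
home.write(500)
print(over_soft_quota(root, 1000))  # stale: root hasn't heard yet
home.propagate()
print(root.rbytes)                  # now reflects the write
```

The point of the model: the check is a single read of one directory's counter, so it stays cheap at any tree size, at the cost of the counter lagging slightly behind the most recent writes — exactly the trade-off that makes this suitable for soft quotas or "block writes on a slight time lag" but not for exact hard quotas.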
Re: CephFS Space Accounting and Quotas
On Wednesday, March 6, 2013 at 11:58 AM, Jim Schutt wrote: On 03/06/2013 12:13 PM, Greg Farnum wrote: Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :) That is just exceptionally cool! Unfortunately we can't (currently) use those recursive statistics to do proper hard quotas on subdirectories as they're lazily propagated following client ops, not as part of the updates. (Lazily in the technical sense — it's actually quite fast in general). But they'd work fine for soft quotas if somebody wrote the code, or to block writes on a slight time lag. 'ls -lh dir' seems to be just the thing if you already know dir. And it's perfectly suitable for our use case of not scheduling new jobs for users consuming too much space. I was thinking I might need to find a subtree where all the subdirectories are owned by the same user, on the theory that all the files in such a subtree would be owned by that same user. E.g., we might want such a capability to manage space per user in shared project directories. So, I tried 'find dir -type d -exec ls -lhd {} \;' Unfortunately, that ended up doing a 'newfstatat' on each file under dir, evidently to learn if it was a directory. The result was that same slowdown for files written on other clients. Is there some other way I should be looking for directories if I don't already know what they are? Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it? Brief background: stat is required to provide file size information, and so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them. 
Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (i.e., mknod) without having it go to disk first, it requires anything which is visible to a third party (another client) be durable on disk for consistency reasons. These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing. You should be able to adjust dirty data thresholds to encourage faster writeouts, do fsyncs once a client is done with a file, etc., in order to minimize the likelihood of running into this. Also, I'd have to check, but I believe opening a file with LAZY_IO or whatever will weaken those requirements — it's probably not the solution you'd like here, but it's an option, and if this turns out to be a serious issue then config options to reduce consistency on certain operations are likely to make their way into the roadmap. :) -Greg
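The "do fsyncs once a client is done with a file" advice above is plain POSIX; a minimal sketch (ordinary local I/O here — on CephFS the same pattern would flush the writing client's buffered data so that a later stat from another node doesn't have to wait for a writeout):

```python
import os, tempfile

# Plain-POSIX sketch of the advice above. On a local filesystem this
# is just ordinary durability; on CephFS the same pattern flushes the
# client's buffered writes when it is done with the file.

def write_and_sync(path, data):
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # force dirty pages out before dropping the file

path = os.path.join(tempfile.mkdtemp(), "result.dat")
write_and_sync(path, b"x" * 4096)
print(os.stat(path).st_size)   # 4096: size is stable once data is synced
```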
Re: CephFS Space Accounting and Quotas
On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote: On 03/06/2013 01:21 PM, Greg Farnum wrote: Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it? Brief background: stat is required to provide file size information, and so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them. Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (ie, mknod) without having it go to disk first, it requires anything which is visible to a third party (another client) be durable on disk for consistency reasons. These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing. You should be able to adjust dirty data thresholds to encourage faster writeouts, do fsyncs once a client is done with a file, etc in order to minimize the likelihood of running into this. Also, I'd have to check but I believe opening a file with LAZY_IO or whatever will weaken those requirements — it's probably not the solution you'd like here but it's an option, and if this turns out to be a serious issue then config options to reduce consistency on certain operations are likely to make their way into the roadmap. :) That all makes sense. But, it turns out the files in question were written yesterday, and I did the stat operations today. So, shouldn't the dirty buffer issue not be in play here? Probably not. :/ Is there anything else that might be going on? 
In that case it sounds like either there's a slowdown on disk access that is propagating up the chain very bizarrely, there's a serious performance issue on the MDS (i.e., swapping for everything), or the clients are still holding onto capabilities for the files in question and you're running into some issues with the capability revocation mechanisms. Can you describe your setup a bit more? What versions are you running, kernel or userspace clients, etc.? What config options are you setting on the MDS? Assuming you're on something semi-recent, getting a perfcounter dump from the MDS might be illuminating as well. We'll probably want to get a high-debug log of the MDS during these slow stats as well. -Greg
Re: CephFS Space Accounting and Quotas
On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: When I'm doing these stat operations the file system is otherwise idle. What's the cluster look like? This is just one active MDS and a couple hundred clients? What is happening is that once one of these slow stat operations on a file completes, it never happens again for that file, from any client. At least, that's the case if I'm not writing to the file any more. I haven't checked if appending to the files restarts the behavior. I assume it'll come back, but if you could verify that'd be good. On the client side I'm running with 3.8.2 + the ceph patch queue that was merged into 3.9-rc1. On the server side I'm running recent next branch (commit 0f42eddef5), with the tcp receive socket buffer option patches cherry-picked. I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num rather than pg_bits to set initial number of PGs (same for pgp_num), and a patch that lets me run with just one pool that contains both data and metadata. I'm testing data distribution uniformity with 512K PGs. My MDS tunables are all at default settings. We'll probably want to get a high-debug log of the MDS during these slow stats as well. OK. Do you want me to try to reproduce with a more standard setup? No, this is fine. Also, I see Sage just pushed a patch to pgid decoding - I expect I need that as well, if I'm running the latest client code. Yeah, if you've got the commit it references you'll want it. Do you want the MDS log at 10 or 20? More is better. ;)
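For reference, turning up the MDS logging discussed above is done in ceph.conf. A fragment along these lines (option names as used in that era; verify against your version's documentation before relying on them):

```ini
[mds]
    ; high-verbosity MDS log for catching the slow stat path
    debug mds = 20
    ; messenger-level logging at 1 is usually enough to correlate traffic
    debug ms = 1
```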
CephFS First product release discussion
This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that! The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :) The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching). Thanks, -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: CephFS First product release discussion
On Tuesday, March 5, 2013 at 10:08 AM, Wido den Hollander wrote: On 03/05/2013 06:03 PM, Greg Farnum wrote: This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that! The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :) The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching). Great news! Although RBD and RADOS itself are already great, a lot of applications would still require a shared filesystem. Think about a (Cloud|Open)Stack environment with thousands of instances running but also need some form of shared filesystem. One thing I'm missing though is user-quotas, have they been discussed at all and what would the work to implement those involve? I know it would require a lot more tracking per file so it's not that easy and would certainly not make it into a first release, but are they on the roadmap at all? Not at present. 
I think there are some tickets related to this in the tracker as feature requests, but CephFS needs more groundwork about multi-tenancy in general before we can do reasonable planning around a robust user quota feature. (Near-real-time hacks are possible now based around the rstats infrastructure and I believe somebody has built them, though I've never seen them myself.) -Greg
Re: [PATCH] libceph: clean up skipped message logic
On Tuesday, March 5, 2013 at 7:33 AM, Alex Elder wrote: (This patch is available as the top commit in branch review/wip-4324 in the ceph-client git repository.) In ceph_con_in_msg_alloc() it is possible for a connection's alloc_msg method to indicate an incoming message should be skipped. By default, read_partial_message() initializes the skip variable to 0 before it gets provided to ceph_con_in_msg_alloc(). The osd client, mon client, and mds client each supply an alloc_msg method. The mds client always assigns skip to be 0. The other two leave the skip value as-is or assign it to zero, except:
- if no (osd or mon) request having the given tid is found, in which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not adequate to hold the message to be read, in which case it assigns a skip value of 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever non-zero. Clean up the logic a bit in ceph_con_in_msg_alloc() to make this state of affairs more obvious. Add a comment explaining how a null message pointer can mean either a message that should be skipped or a problem allocating a message.
This resolves: http://tracker.ceph.com/issues/4324 Reported-by: Greg Farnum g...@inktank.com Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 5bf1bb5..644cb6c 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2860,18 +2860,21 @@ static int ceph_con_in_msg_alloc(struct ceph_connection *con, int *skip)
 		ceph_msg_put(msg);
 		return -EAGAIN;
 	}
-	con->in_msg = msg;
-	if (con->in_msg) {
+	if (msg) {
+		BUG_ON(*skip);
+		con->in_msg = msg;
 		con->in_msg->con = con->ops->get(con);
 		BUG_ON(con->in_msg->con == NULL);
-	}
-	if (*skip) {
-		con->in_msg = NULL;
-		return 0;
-	}
-	if (!con->in_msg) {
-		con->error_msg =
-			"error allocating memory for incoming message";
+	} else {
+		/*
+		 * Null message pointer means either we should skip
+		 * this message or we couldn't allocate memory.  The
+		 * former is not an error.
+		 */
+		if (*skip)
+			return 0;
+		con->error_msg = "error allocating memory for incoming message";
+		return -ENOMEM;
 	}
 	memcpy(&con->in_msg->hdr, &con->in_hdr, sizeof(con->in_hdr));
-- 
1.7.9.5

Reviewed-by: Greg Farnum g...@inktank.com
Re: When ceph synchronizes journal to disk?
On Tuesday, March 5, 2013 at 5:54 AM, Wido den Hollander wrote: On 03/05/2013 05:33 AM, Xing Lin wrote: Hi Gregory, Thanks for your reply. On 03/04/2013 09:55 AM, Gregory Farnum wrote: The journal [min|max] sync interval values specify how frequently the OSD's FileStore sends a sync to the disk. However, data is still written into the normal filesystem as it comes in, and the normal filesystem continues to schedule normal dirty data writeouts. This is good — it means that when we do send a sync down you don't need to wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to disk before it's completed.

I do not think I understand this well. When the writeahead journal mode is in use, would you please explain what happens to a single 4M write request? I assume that an entry in the journal will be created for this write request, and after this entry is flushed to the journal disk, Ceph returns success. There should be no IO to the osd's disk. All IOs are supposed to go to the journal disk. At a later time, Ceph will start to apply these changes to the normal filesystem, reading from the first entry where its previous synchronization stopped. Finally, it will read this entry and apply this write change to the normal file system. Could you please point out what is wrong in my understanding? Thanks,

All the data goes to the disk in write-back mode, so it isn't safe until the flush is called. That's why it goes into the journal first, to be consistent at all times. If you buffered everything in the journal and flushed it all at once, you would overload the disk for that time. Let's say you have 300MB in the journal after 10 seconds and you want to flush that at once. That would mean that specific disk is unable to do any other operations than writing at 60MB/sec for 5 seconds. It's better to always write in write-back mode to the disk and flush at a certain point.
In the meantime the scheduler can do its job to balance between the reads and the writes. Wido

Yep, what Wido said. Specifically, we do force the data to the journal with an fsync or equivalent before responding to the client, but once it's stable on the journal we give it to the filesystem (without doing any sort of forced sync). This is necessary — all reads are served from the filesystem. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
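The flow described in this thread — the journal entry made durable before the client is acked, the same data handed to the filesystem in write-back mode, reads never touching the journal, and the journal replayed only on recovery — can be sketched as a toy model. Illustrative Python only, not Ceph's implementation; all names are invented.

```python
# Toy model (not Ceph code) of writeahead journaling as described above.

class ToyObjectStore:
    def __init__(self):
        self.journal = []     # durable: fsync'd before we ack the client
        self.fs = {}          # backing filesystem, absorbing writes lazily
        self.fs_durable = {}  # what the filesystem has actually synced

    def write(self, obj, data):
        self.journal.append((obj, data))  # journal entry made durable here
        self.fs[obj] = data               # write-back into the filesystem
        return "ack"                      # safe to ack: journal is durable

    def read(self, obj):
        return self.fs[obj]               # reads never touch the journal

    def sync(self):
        # Periodic syncfs: once this completes, the journal can be trimmed.
        self.fs_durable = dict(self.fs)
        self.journal.clear()

    def recover(self):
        # After a crash we lose un-synced fs state and replay the journal.
        self.fs = dict(self.fs_durable)
        for obj, data in self.journal:
            self.fs[obj] = data

store = ToyObjectStore()
store.write("rb.0.1", b"hello")
print(store.read("rb.0.1"))  # served from the filesystem, not the journal
store.fs = {}                # simulate a crash before any sync completed
store.recover()
print(store.read("rb.0.1"))  # recovered by replaying the journal
```

This also shows why batching everything in the journal and applying it in one burst (the scenario Wido warns against) is avoidable: the filesystem has been absorbing the same writes in write-back mode all along, so the periodic sync only has to flush what the page cache hasn't already written out.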
Re: When ceph synchronizes journal to disk? / read request
On Tuesday, March 5, 2013 at 12:37 AM, Dieter Kasper wrote: Hi Gregory, another interesting aspect for me is: how will a read request for this block/sub-block (pending between journal and OSD) be satisfied (assuming the client will not cache)? Will this read go to the journal or to the OSD?

All read requests are satisfied from the main OSD store filesystem. Satisfying reads from the journal would be extraordinarily complicated and not buy us anything that I can think of. (In fact the journal is only read during recovery.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [PATCH 6/7] ceph: don't early drop Fw cap
On Monday, March 4, 2013 at 5:57 PM, Yan, Zheng wrote: On 03/05/2013 02:26 AM, Gregory Farnum wrote: On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com

ceph_aio_write() has an optimization that marks the CEPH_CAP_FILE_WR cap dirty before data is copied to the page cache and the inode size is updated. The optimization avoids slow cap revocation caused by balance_dirty_pages(), but introduces an inode size update race. If ceph_check_caps() flushes the dirty cap before the inode size is updated, the MDS can miss the new inode size. So just remove the optimization.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/file.c | 42 ++++++++++++-------------------
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index a949805..28ef273 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
+	sb_start_write(inode->i_sb);
 retry_snap:
-	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
-		return -ENOSPC;
+	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
+		ret = -ENOSPC;
+		goto out;
+	}
 	__ceph_do_pending_vmtruncate(inode);
 	dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
 	     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
@@ -750,29 +753,10 @@ retry_snap:
 		ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
 			&iocb->ki_pos);
 	} else {
-		/*
-		 * buffered write; drop Fw early to avoid slow
-		 * revocation if we get stuck on balance_dirty_pages
-		 */
-		int dirty;
-
-		spin_lock(&ci->i_ceph_lock);
-		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
-		spin_unlock(&ci->i_ceph_lock);
-		ceph_put_cap_refs(ci, got);
-
-		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-		if ((ret >= 0 || ret == -EIOCBQUEUED) &&
-		    ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
-		     || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
-			err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
-			if (err < 0)
-				ret = err;
-		}
-
-		if (dirty)
-			__mark_inode_dirty(inode, dirty);
-		goto out;
+		mutex_lock(&inode->i_mutex);
+		ret = __generic_file_aio_write(iocb, iov, nr_segs,
+					       &iocb->ki_pos);
+		mutex_unlock(&inode->i_mutex);

Hmm, you're here passing in a different value than the removed generic_file_aio_write() call did — iocb->ki_pos instead of pos. Everything else is using the pos parameter, so I rather expect that should still be used here?

They always have the same value; see the BUG_ON in generic_file_aio_write().

Also, a quick skim of the interfaces makes me think that the two versions aren't interchangeable — generic_file_aio_write() also handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?

Ceph has its own code that handles the O_SYNC case. I want to make sb_start_write() cover ceph_sync_write(); that's the reason I use __generic_file_aio_write() here. Regards, Yan, Zheng

Ah, yep — sounds good! -Greg
Re: [PATCH 0/7] ceph: misc fixes
I've merged this series into the testing branch, with appropriate Reviewed-by tags from me (and Sage on #4). Thanks much for the code and for helping me go through it. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Monday, March 4, 2013 at 7:38 PM, Yan, Zheng wrote: On 03/05/2013 02:49 AM, Gregory Farnum wrote: On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com These patches are also in: git://github.com/ukernel/linux.git wip-ceph

1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's concerns on 4 and my questions on 6 I'll be happy to pull these in. :)

I rechecked the locking assumptions in patch 4; nothing goes wrong. Regards, Yan, Zheng

-Greg
Re: [PATCH 3/5] ceph: only set message data pointers if non-empty
On Tuesday, March 5, 2013 at 5:53 AM, Alex Elder wrote: The ceph file system doesn't typically send information in the data portion of a message. (It relies on some functionality exported by the osd client to read and write page data.) There are two spots it does send data, though. The value assigned to an extended attribute is held in one or more pages allocated by ceph_sync_setxattr(). Eventually those pages are assigned to a request message in create_request_message(). The second spot is when sending a reconnect message, where a ceph pagelist is used to build up an array of snaprealm_reconnect structures to send to the mds. Change it so we only assign the outgoing data information for these messages if there is outgoing data to send. This is related to: http://tracker.ceph.com/issues/4284 Signed-off-by: Alex Elder el...@inktank.com
---
 fs/ceph/mds_client.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 42400ce..ae83aa9 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1718,7 +1718,12 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	msg->front.iov_len = p - msg->front.iov_base;
 	msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 
-	ceph_msg_data_set_pages(msg, req->r_pages, req->r_num_pages, 0);
+	if (req->r_num_pages) {
+		/* outbound data set only by ceph_sync_setxattr() */
+		BUG_ON(!req->r_pages);
+		ceph_msg_data_set_pages(msg, req->r_pages,
+					req->r_num_pages, 0);
+	}
 	msg->hdr.data_len = cpu_to_le32(req->r_data_len);
 	msg->hdr.data_off = cpu_to_le16(0);
@@ -2599,10 +2604,13 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 		goto fail;
 	}
 
-	ceph_msg_data_set_pagelist(reply, pagelist);
 	if (recon_state.flock)
 		reply->hdr.version = cpu_to_le16(2);
-	reply->hdr.data_len = cpu_to_le32(pagelist->length);
+	if (pagelist->length) {
+		/* set up outbound data if we have any */
+		reply->hdr.data_len = cpu_to_le32(pagelist->length);
+		ceph_msg_data_set_pagelist(reply, pagelist);
+	}
 	ceph_con_send(session->s_con, reply);
 	mutex_unlock(session->s_mutex);

Reviewed-by: Greg Farnum g...@inktank.com Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [PATCH 5/5] libceph: activate message data assignment checks
On Tuesday, March 5, 2013 at 5:53 AM, Alex Elder wrote: The mds client no longer tries to assign zero-length message data, and the osd client no longer sets its data info more than once. This allows us to activate assertions in the messenger to verify these things never happen. This resolves both of these: http://tracker.ceph.com/issues/4263 http://tracker.ceph.com/issues/4284 Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 97506ac..5bf1bb5 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2677,10 +2677,10 @@ EXPORT_SYMBOL(ceph_con_keepalive);
 void ceph_msg_data_set_pages(struct ceph_msg *msg, struct page **pages,
 		unsigned int page_count, size_t alignment)
 {
-	/* BUG_ON(!pages); */
-	/* BUG_ON(!page_count); */
-	/* BUG_ON(msg->pages); */
-	/* BUG_ON(msg->page_count); */
+	BUG_ON(!pages);
+	BUG_ON(!page_count);
+	BUG_ON(msg->pages);
+	BUG_ON(msg->page_count);
 
 	msg->pages = pages;
 	msg->page_count = page_count;
@@ -2691,8 +2691,8 @@ EXPORT_SYMBOL(ceph_msg_data_set_pages);
 void ceph_msg_data_set_pagelist(struct ceph_msg *msg,
 		struct ceph_pagelist *pagelist)
 {
-	/* BUG_ON(!pagelist); */
-	/* BUG_ON(msg->pagelist); */
+	BUG_ON(!pagelist);
+	BUG_ON(msg->pagelist);
 
 	msg->pagelist = pagelist;
 }
@@ -2700,8 +2700,8 @@ EXPORT_SYMBOL(ceph_msg_data_set_pagelist);
 void ceph_msg_data_set_bio(struct ceph_msg *msg, struct bio *bio)
 {
-	/* BUG_ON(!bio); */
-	/* BUG_ON(msg->bio); */
+	BUG_ON(!bio);
+	BUG_ON(msg->bio);
 
 	msg->bio = bio;
 }
@@ -2709,8 +2709,8 @@ EXPORT_SYMBOL(ceph_msg_data_set_bio);
 void ceph_msg_data_set_trail(struct ceph_msg *msg, struct ceph_pagelist *trail)
 {
-	/* BUG_ON(!trail); */
-	/* BUG_ON(msg->trail); */
+	BUG_ON(!trail);
+	BUG_ON(msg->trail);
 
 	msg->trail = trail;
 }
-- 
1.7.9.5

Reviewed-by: Greg Farnum g...@inktank.com I'll leave #4 for Josh to review.
:) Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: rbd locking and handling broken clients
On Wednesday, June 13, 2012 at 1:37 PM, Florian Haas wrote: Greg, My understanding of Ceph code internals is far too limited to comment on your specific points, but allow me to ask a naive question. Couldn't you be stealing a lot of ideas from SCSI-3 Persistent Reservations? If you had server-side (OSD) persistence of information of the "this device is in use by X" type (where anything other than X would get an I/O error when attempting to access data), and you had a manual, authenticated override akin to SCSI PR preemption, plus key registration/exchange for that authentication, then you would at least have to have the combination of a misbehaving OSD plus a malicious client for data corruption. A non-malicious but just broken client probably won't do. Clearly I may be totally misguided, as Ceph is fundamentally decentralized and SCSI isn't, but if PR-ish behavior comes even close to what you're looking for, grabbing those ideas would look better to me than designing your own wheel.

Yeah, the problem here is exactly that Ceph (and RBD) are fundamentally decentralized. :) I'm not familiar with the SCSI PR mechanism either, but it looks to me like it deals in entirely local information — the equivalent with RBD would require performing a locking operation on every object in the RBD image before you accessed it. We could do that, but then opening an image would take time linear in its size… :(

On Wednesday, June 13, 2012 at 4:14 PM, Tommi Virtanen wrote: On Wed, Jun 13, 2012 at 10:40 AM, Gregory Farnum g...@inktank.com wrote: 2) Client fencing. See http://tracker.newdream.net/issues/2531. There is an existing blacklist functionality in the OSDs/OSDMap, where you can specify an entity_addr_t (consisting of an IP, a port, and a nonce — so essentially unique per-process) which is not allowed to communicate with the cluster any longer.
The problem with this is that

Does that work even after a TCP connection close and re-establish, where the client now has a new source port address? (Perhaps the port is 0 for clients?)

Precisely — client ports are 0 since they never accept incoming connections.

You know, I'd be really happy if this could be achieved by means of removing cephx keys.

Unfortunately, that wouldn't really solve the problem without dramatically decreasing the rotation interval for cluster access keys, which cephx shares. Alternative (entirely theoretical) security schemes might, but they're well behind what's feasible for us to work on any time soon...
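The blacklist mechanism described above — clients bind no listening port, so their entity_addr_t is effectively (IP, port 0, per-process nonce), and blacklisting that triple fences one client instance without affecting a restarted process on the same host — can be illustrated with a toy sketch. This is not Ceph's implementation; the addresses and nonces below are invented for the example.

```python
# Illustrative sketch (not Ceph code) of blacklist matching on an
# entity_addr_t-like triple. The nonce is what distinguishes a dead
# client instance from a restarted one on the same IP.

blacklist = set()

def blacklist_add(ip, port, nonce):
    blacklist.add((ip, port, nonce))

def allowed(ip, port, nonce):
    return (ip, port, nonce) not in blacklist

old_client = ("10.0.6.20", 0, 4242)  # hypothetical fenced instance
new_client = ("10.0.6.20", 0, 9001)  # same host, restarted process

blacklist_add(*old_client)
print(allowed(*old_client))  # the dead instance stays fenced
print(allowed(*new_client))  # new nonce, new identity, still allowed
```

This is also why the TCP reconnect question above doesn't break fencing: the client's source port never enters its identity (it is always 0), so a reconnect from a new source port still matches the blacklisted triple.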
Re: mds dump
On Friday, June 8, 2012 at 12:31 PM, Tommi Virtanen wrote: On Fri, Jun 8, 2012 at 12:22 PM, Martin Wilderoth martin.wilder...@linserv.se wrote: I have removed the data and metadata pool. Do I need to create them again, or will they be created automatically? Maybe I need the undocumented way of creating the mds map? I would like to get an empty cephfs to play with again.

Just create them again with rados mkpool, that gets you back to square one.

Actually, this doesn't — the MDS uses pool IDs for its access, not pool names, so you need to do a bit more (Sage illustrated the simplest route for handling that). -Greg
Re: mount: 10.0.6.10:/: can't read superblock
For future reference, that error was because the active MDS server was in replay. I can't tell why it didn't move on to active from what you posted, but I imagine it just got a little stuck since restarting made it work out. -Greg On Tuesday, June 5, 2012 at 1:05 PM, Martin Wilderoth wrote: Hello Again, I restarted the mds on all servers and then it worked again /Regards Martin Hello Hi Martin, On 06/05/2012 08:07 PM, Martin Wilderoth wrote: Hello Is there a way to recover this error. mount -t ceph 10.0.6.10:/ /mnt -vv -o name=admin,secret=XXX [ 506.640433] libceph: loaded (mon/osd proto 15/24, osdmap 5/6 5/6) [ 506.650594] ceph: loaded (mds proto 32) [ 506.652353] libceph: client0 fsid a9d5f9e1-4bb9-4fab-b79b-ba4457631b01 [ 506.670876] Intel AES-NI instructions are not detected. [ 506.678861] libceph: mon0 10.0.6.10:6789 session established mount: 10.0.6.10:/: can't read superblock Could you share some more information? For example the output from: ceph -s 2012-06-05 20:25:05.307914 pg v1189604: 1152 pgs: 1152 active+clean; 191 GB data, 393 GB used, 973 GB / 1379 GB avail 012-06-05 20:25:05.315871 mds e60: 1/1/1 up {0=c=up:replay}, 2 up:standby 2012-06-05 20:25:05.315965 osd e1106: 8 osds: 8 up, 8 in 2012-06-05 20:25:05.316165 log 2012-06-05 20:24:50.425527 mon.0 10.0.6.10:6789/0 75 : [INF] mds.? 10.0.6.11:6800/22974 up:boot 2012-06-05 20:25:05.316371 mon e1: 3 mons at {a=10.0.6.10:6789/0,b=10.0.6.11:6789/0,c=10.0.6.12:6789/0} Did you change anything to the cluster since it worked? And what version are you running? I have not done any changes installed at version 0.46 upgraded earlier and have been testing with ceph and ceph-fuse and backuppc. It was during the ceph-fuse it hanged. Current version ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372) One of my mds logs has 24G of data. Is it still running? I have restarted mds.a and mds.b they seems to be running. But not everything. mds.a was stoped not sure mds.b but it has a big logfile. 
I have some rbd devices that I would like to keep. RBD doesn't use the MDS nor the POSIX filesystem, so you will probably be fine, but we need the output of ceph -s first. Does this work? $ rbd ls this works I'm still using the rbd with no problem $ rados -p rbd ls seems to work reports something similar to rb.0.2.052e rb.0.0.02f2 rb.0.7.0345 rb.0.7.0896 rb.0.0.0102 rb.0.9.0172 rb.0.1.0350 rb.0.4.0180 rb.0.4.068b rb.0.5.054c rb.0.2.01e1 Wido /Regards Martin -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: domino-style OSD crash
This is probably the same/similar to http://tracker.newdream.net/issues/2462, no? There's a log there, though I've no idea how helpful it is. On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote: Can you send the osd logs? The merge_log crashes are probably fixable if I can see the logs. The leveldb crash is almost certainly a result of memory corruption. Thanks -Sam On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com (mailto:t...@inktank.com) wrote: On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr (mailto:yann.dup...@univ-nantes.fr) wrote: Results : Worked like a charm during two days, apart btrfs warn messages then OSD begin to crash 1 after all 'domino style'. Sorry to hear that. Reading through your message, there seem to be several problems; whether they are because of the same root cause, I can't tell. Quick triage to benefit the other devs: #1: kernel crash, no details available 1 of the physical machine was in kernel oops state - Nothing was remote #2: leveldb corruption? may be memory corruption that started elsewhere.. Sam, does this look like the leveldb issue you saw? [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started 0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal (Aborted) ** ... 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const) const+0x4d) [0x6ef69d] 14: (leveldb::TableBuilder::Add(leveldb::Slice const, leveldb::Slice const)+0x9f) [0x6fdd9f] #3: PG::merge_log assertion while recovering from the above; Sam, any ideas? 
0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc (http://PG.cc): In function 'void PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157 osd/PG.cc (http://PG.cc): 402: FAILED assert(log.head = olog.tail olog.head = log.tail) #4: unknown btrfs warnings, there should an actual message above this traceback; believed fixed in latest kernel Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278] [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs] Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328] [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs] Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379] [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs] Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415] [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs] Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460] [8105a9f0] ? add_wait_queue+0x60/0x60 Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493] [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs] Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543] [a026abb1] ? do_async_commit+0x11/0x20 [btrfs] Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572] -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org (mailto:majord...@vger.kernel.org) More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org (mailto:majord...@vger.kernel.org) More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MDS crash, wont startup again
On Thursday, May 24, 2012 at 5:29 AM, Felix Feinhals wrote: Hi, i was using the Debian Packages, but i tried now from source. I used the same version from GIT (cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash report. Then i applied your patch but again the same crash, i think the backtrace is also the same: (gdb) thread 1 [Switching to thread 1 (Thread 9564)]#0 0x7f33a3e58ebb in raise (sig=value optimized out) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41 41 in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c (gdb) backtrace #0 0x7f33a3e58ebb in raise (sig=value optimized out) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41 #1 0x0081423e in reraise_fatal (signum=11) at global/signal_handler.cc:58 (http://signal_handler.cc:58) #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104 (http://signal_handler.cc:104) #3 signal handler called #4 SnapRealm::have_past_parents_open (this=0x0, first=..., last=...) at mds/snap.cc:112 (http://snap.cc:112) #5 0x0055d58b in MDCache::check_realm_past_parents (this=0x27a7200, realm=0x0) at mds/MDCache.cc:4495 (http://MDCache.cc:4495) #6 0x00572eec in MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200) at mds/MDCache.cc:4533 (http://MDCache.cc:4533) #7 0x005931a0 in MDCache::rejoin_gather_finish (this=0x27a7200) at mds/MDCache.cc: (http://MDCache.cc:) #8 0x0059b9d5 in MDCache::rejoin_send_rejoins (this=0x27a7200) at mds/MDCache.cc:3388 (http://MDCache.cc:3388) #9 0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at mds/MDS.cc:1404 (http://MDS.cc:1404) #10 0x004c253a in MDS::handle_mds_map (this=0x27bc000, m=value optimized out) at mds/MDS.cc:968 (http://MDS.cc:968) #11 0x004c4513 in MDS::handle_core_message (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1651 (http://MDS.cc:1651) #12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1790 (http://MDS.cc:1790) #13 0x004c628b in MDS::ms_dispatch (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1602 (http://MDS.cc:1602) #14 
0x00732609 in Messenger::ms_deliver_dispatch (this=0x279f680) at msg/Messenger.h:178 #15 SimpleMessenger::dispatch_entry (this=0x279f680) at msg/SimpleMessenger.cc:363 #16 0x007207ad in SimpleMessenger::DispatchThread::entry() () #17 0x7f33a3e508ca in start_thread (arg=value optimized out) at pthread_create.c:300 #18 0x7f33a26d892d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #19 0x in ?? () Any more ideas? :) Or can i get you more debugging output? Sorry for the delay — I'm afraid that's a hazard of using the MDS before we're ready to support it. :( Anyway, I haven't had a lot of time to look into this, but that makes it look like there's an actual problem, where one of the inodes can't find the SnapRealm which it lives in. Things that will make this easier to diagnose (in the event that somebody gets the time) include generating high-level debug logs and placing them somewhere accessible (start up the MDS with debug mds = 20 added to the config file); if you want you could also try the below patch (which will cause the MDS to dump its full inode cache upon triggering this bug) and we can see if there's anything really obvious. (This is a fine thing to make bug reports on at tracker.newdream.net, btw — and that allows attachments of things like log files.) -Greg

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 143faca..6aa5923 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4527,6 +4527,11 @@ void MDCache::choose_lock_states_and_reconnect_caps()
     dout(15) << " chose lock states on " << *in << dendl;

     SnapRealm *realm = in->find_snaprealm();
+    if (!realm) {
+      dout(0) << "serious error, could not find snaprealm for " << *in
+              << ", triggering cache dump" << dendl;
+      dump_cache();
+    }
     check_realm_past_parents(realm);
Re: SIGSEGV in cephfs-java, but probably in Ceph
On Thursday, May 31, 2012 at 4:58 PM, Noah Watkins wrote: On May 31, 2012, at 3:39 PM, Greg Farnum wrote: Nevermind to my last comment. Hmm, I've seen this, but very rarely. Noah, do you have any leads on this? Do you think it's a bug in your Java code or in the C/++ libraries? I _think_ this is because the JVM uses its own threading library, and Ceph assumes pthreads and pthread compatible mutexes--is that assumption about Ceph correct? Hence the error that looks like Mutex::lock(bool) being reference for context during the segfault. To verify this all that is needed is some synchronization added to the Java. I'm not quite sure what you mean here. Ceph is definitely using pthread threading and mutexes, but I don't see how the use of a different threading library can break pthread mutexes (which are just using the kernel futex stuff, AFAIK). But I admit I'm not real good at handling those sorts of interactions, so maybe I'm missing something? There are only two segfaults that I've ever encountered, one in which the C wrappers are used with an unmounted client, and the error Nam is seeing (although they could be related). I will re-submit an updated patch for the former, which should rule that out as the culprit. Nam: where are you grabbing the Java patches from? I'll push some updates. The only other scenario that comes to mind is related to signaling: The RADOS Java wrappers suffered from an interaction between the JVM and RADOS client signal handlers, in which either the JVM or RADOS would replace the handlers for the other (not sure which order). Anyway, the solution was to link in the JVM libjsig.so signal chaining library. This might be the same thing we are seeing here, but I'm betting it is the first theory I mentioned. Hmm. I think that's an issue we've run into but I thought it got fixed for librados. Perhaps I'm mixing that up with libceph, or just pulling past scenarios out of thin air. It never manifested as Mutex count bugs, though! 
-Greg
Re: SIGSEGV in cephfs-java, but probably in Ceph
On Monday, June 4, 2012 at 1:47 PM, Noah Watkins wrote: On Mon, Jun 4, 2012 at 1:17 PM, Greg Farnum g...@inktank.com (mailto:g...@inktank.com) wrote: I'm not quite sure what you mean here. Ceph is definitely using pthread threading and mutexes, but I don't see how the use of a different threading library can break pthread mutexes (which are just using the kernel futex stuff, AFAIK). But I admit I'm not real good at handling those sorts of interactions, so maybe I'm missing something? The basic idea was that threads in Java did not map 1:1 with kernel threads (think co-routines), which would break a lot of stuff, especially futex. Looking at some documentation, old JVMs had something called Green Threads, but have now been abandoned in favor of native threads. So maybe this theory is now irrelevant, and evidence seems to suggest you're right and Java is using native threads. Gotcha, that makes sense. The RADOS Java wrappers suffered from an interaction between the JVM and RADOS client signal handlers, in which either the JVM or RADOS would replace the handlers for the other (not sure which order). Anyway, the solution was to link in the JVM libjsig.so signal chaining library. This might be the same thing we are seeing here, but I'm betting it is the first theory I mentioned. Hmm. I think that's an issue we've run into but I thought it got fixed for librados. Perhaps I'm mixing that up with libceph, or just pulling past scenarios out of thin air. It never manifested as Mutex count bugs, though! I haven't tested the Rados wrappers in a while. I've never had to link in the signal chaining library for libcephfs. I wonder if the Mutex::lock(bool) being printed out is a red herring... Well, it's a SIGSEGV. So my guess is that's the frame that happens to be going outside its allowed bounds, probably because it's the first frame actually accessing the memory off of a bad (probably NULL) pointer. 
For instance, if it not only failed to mount the client, but even to create the context object?
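A practical note on the signal-chaining fix Noah mentions: the HotSpot JVM ships a libjsig.so for exactly this case, and preloading it lets native libraries install signal handlers without the JVM and the library clobbering each other. A sketch of how it would be enabled for Nam's benchmark; this is an environment fragment only, and the library path is an example that varies by JDK and architecture:

```
# Enable HotSpot signal chaining before starting the JVM.
# The path to libjsig.so differs between JDKs and architectures.
LD_PRELOAD=$JAVA_HOME/jre/lib/amd64/libjsig.so java -cp . Benchmark
```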
Re: iozone test crashed on ceph
On Thursday, May 31, 2012 at 5:58 PM, udit agarwal wrote: Hi, I have set up ceph system with a client, mon and mds on one system which is connected to 2 osds. I ran iozone test with a 10G file and it ran fine. But when I ran iozone test with a 5G file, the process got killed and our ceph system hanged. Can anyone please help me with this. What do you mean, the process got killed? It hung and some task watcher killed it? Or it got OOMed? How did you determine that the ceph system hung? The cluster stopped responding to requests, or just the local mount point? -Greg
Re: SIGSEGV in cephfs-java, but probably in Ceph
On Thursday, May 31, 2012 at 7:43 AM, Noah Watkins wrote: On May 31, 2012, at 6:20 AM, Nam Dang wrote: Stack: [0x7ff6aa828000,0x7ff6aa929000], sp=0x7ff6aa9274f0, free space=1021k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0 j com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6 j Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37 j Benchmark$StatsDaemon.benchmarkOne()V+22 j Benchmark$StatsDaemon.run()V+26 v ~StubRoutines::call_stub Nevermind to my last comment. Hmm, I've seen this, but very rarely. Noah, do you have any leads on this? Do you think it's a bug in your Java code or in the C/++ libraries? Nam: it definitely shouldn't be segfaulting just because a monitor went down. :) -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiple named clusters on same nodes
On Thursday, May 24, 2012 at 1:58 AM, Amon Ott wrote: On Thursday 24 May 2012, Amon Ott wrote: Attached is a patch based on current git stable that makes mkcephfs work fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon data (default ceph instead of the supplied cluster name), so I put in a workaround. Please have a look and consider inclusion as well as fixing the mon data path. Thanks. And another patch for the init script to handle multiple clusters. Amon: Thanks for the patches! Unfortunately nobody who's competent to review these (ie, not me) has time to look into them right now, but they're on the queue when TV or Sage gets some time. :) -Greg
Re: Error 5 when trying to mount Ceph 0.47.1
On Thursday, May 24, 2012 at 10:58 PM, Nam Dang wrote: Hi, I've just started working with Ceph for a couple of weeks. At the moment, I'm trying to set up a small cluster with 1 monitor, 1 MDS and 6 OSDs. However, I cannot mount ceph to the system no matter which node I'm executing the mounting command on. My nodes run Ubuntu 11.10 with kernel 3.0.0-12 Seeing some other people also faced similar problems, I attached the result of running ceph -s as follows: 2012-05-25 23:52:17.802590 pg v434: 1152 pgs: 189 active+clean, 963 stale+active+clean; 8730 bytes data, 3667 MB used, 844 GB / 893 GB avail 2012-05-25 23:52:17.806759 mds e12: 1/1/1 up {0=1=up:replay} 2012-05-25 23:52:17.806827 osd e30: 6 osds: 1 up, 1 in 2012-05-25 23:52:17.806966 log 2012-05-25 23:44:14.584879 mon.0 192.168.172.178:6789/0 2 : [INF] mds.? 192.168.172.179:6800/6515 up:boot 2012-05-25 23:52:17.807086 mon e1: 1 mons at {0=192.168.172.178:6789/0} I tried to use the mount -t ceph node:port:/ [destination] but I keep getting mount error 5 = Input/output error I also checked if the firewall is blocking anything with nmap -sT -p 6789 [monNode] My ceph version is 0.47.1, installed with sudo apt-get on the system. I've spent a couple of days googling to no avail, and the documentation does not address this issue at all. Thank you very much for your help, Notice how the MDS status is up:replay? That means it restarted at some point and is currently replaying the journal, which is why your client can't connect. Ordinarily journal replay happens very quickly (a couple to several seconds, depending mostly on length), so if it's been in that state for a while something has gone wrong. And indeed only 1 out of 6 of your OSDs is up, and most of your PGs are stale because the OSDs responsible for them aren't running. This is preventing the MDS from retrieving objects. So we need to figure out why your OSDs are down. Did you fail to start them? Have they crashed and left behind backtraces or core dumps? 
-Greg
Re: 'rbd map' asynchronous behavior
That looks like a bug that isn't familiar to Josh or I. Can you create a report in the tracker and provide as much debug info as you can come up with? :) On Friday, May 25, 2012 at 3:15 AM, Andrey Korolyov wrote: Hi, Newer kernel rbd driver throws a quite strange messages on map|unmap, comparing to 3.2 branch: rbd map 'path' # device appears as /dev/rbd1 instead of rbd0, then rbd unmap /dev/rbd1 # causes following trace, w/ vanilla 3.4.0 from kernel.org (http://kernel.org): [ 99.700802] BUG: scheduling while atomic: rbd/3846/0x0002 [ 99.700857] Modules linked in: btrfs ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl auth_rpcgss lockd sunrpc kvm_intel kvm bridge stp llc ipv6 rbd libceph loop 8250_pnp pcspkr firewire_ohci coretemp firewire_core hwmon 8250 serial_core [ 99.700899] Pid: 3846, comm: rbd Not tainted 3.4.0 #3 [ 99.700902] Call Trace: [ 99.700910] [81464d68] ? __schedule+0x96/0x625 [ 99.700916] [8105f98a] ? __queue_work+0x254/0x27c [ 99.700921] [81465d39] ? _raw_spin_lock_irqsave+0x2a/0x32 [ 99.700926] [81069b6d] ? complete+0x31/0x40 [ 99.700931] [8105f10a] ? flush_workqueue_prep_cwqs+0x16e/0x180 [ 99.700947] [81463bd8] ? schedule_timeout+0x21/0x1af [ 99.700951] [8107165d] ? enqueue_entity+0x67/0x13d [ 99.700955] [81464ad9] ? wait_for_common+0xc5/0x143 [ 99.700959] [8106d5fc] ? try_to_wake_up+0x217/0x217 [ 99.700963] [81063952] ? kthread_stop+0x30/0x50 [ 99.700967] [81060979] ? destroy_workqueue+0x148/0x16b [ 99.700977] [a004ce07] ? ceph_osdc_stop+0x1f/0xaa [libceph] [ 99.700984] [a00463b4] ? ceph_destroy_client+0x10/0x44 [libceph] [ 99.700989] [a00652ae] ? rbd_client_release+0x38/0x4b [rbd] [ 99.700993] [a0065719] ? rbd_put_client.isra.10+0x28/0x3d [rbd] [ 99.700998] [a006609d] ? rbd_dev_release+0xc3/0x157 [rbd] [ 99.701003] [81287387] ? device_release+0x41/0x72 [ 99.701007] [81202b95] ? kobject_release+0x4e/0x6a [ 99.701025] [a0065156] ? 
rbd_remove+0x102/0x11e [rbd] [ 99.701035] [8114b058] ? sysfs_write_file+0xd3/0x10f [ 99.701044] [810f8796] ? vfs_write+0xaa/0x136 [ 99.701052] [810f8a07] ? sys_write+0x45/0x6e [ 99.701062] [8146a839] ? system_call_fastpath+0x16/0x1b [ 99.701170] BUG: scheduling while atomic: rbd/3846/0x0002 [ 99.701220] Modules linked in: btrfs ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl auth_rpcgss lockd sunrpc kvm_intel kvm bridge stp llc ipv6 rbd libceph loop 8250_pnp pcspkr firewire_ohci coretemp firewire_core hwmon 8250 serial_core [ 99.701251] Pid: 3846, comm: rbd Not tainted 3.4.0 #3 [ 99.701253] Call Trace: [ 99.701257] [81464d68] ? __schedule+0x96/0x625 [ 99.701261] [81465ef9] ? _raw_spin_unlock_irq+0x5/0x2e [ 99.701265] [81069f92] ? finish_task_switch+0x4c/0xc1 [ 99.701268] [8146525b] ? __schedule+0x589/0x625 [ 99.701272] [812084b2] ? ip4_string+0x5a/0xc8 [ 99.701276] [81208cbd] ? string.isra.3+0x39/0x9f [ 99.701281] [81208e33] ? ip4_addr_string.isra.5+0x5a/0x76 [ 99.701285] [81208b7a] ? number.isra.1+0x10e/0x218 [ 99.701290] [81463bd8] ? schedule_timeout+0x21/0x1af [ 99.701294] [81464ad9] ? wait_for_common+0xc5/0x143 [ 99.701298] [8106d5fc] ? try_to_wake_up+0x217/0x217 [ 99.701303] [8105f24c] ? flush_workqueue+0x130/0x2a5 [ 99.701309] [a00463b9] ? ceph_destroy_client+0x15/0x44 [libceph] [ 99.701314] [a00652ae] ? rbd_client_release+0x38/0x4b [rbd] [ 99.701319] [a0065719] ? rbd_put_client.isra.10+0x28/0x3d [rbd] [ 99.701324] [a006609d] ? rbd_dev_release+0xc3/0x157 [rbd] [ 99.701328] [81287387] ? device_release+0x41/0x72 [ 99.701334] [81202b95] ? kobject_release+0x4e/0x6a [ 99.701343] [a0065156] ? rbd_remove+0x102/0x11e [rbd] [ 99.701352] [8114b058] ? sysfs_write_file+0xd3/0x10f [ 99.701361] [810f8796] ? vfs_write+0xaa/0x136 [ 99.701369] [810f8a07] ? sys_write+0x45/0x6e [ 99.701377] [8146a839] ? 
system_call_fastpath+0x16/0x1b On Wed, May 16, 2012 at 12:24 PM, Andrey Korolyov and...@xdel.ru (mailto:and...@xdel.ru) wrote: This is most likely due to a recently-fixed problem. The fix is found in this commit, although there were other changes that led up to it: 32eec68d2f rbd: don't drop the rbd_id too early It is present starting in Linux kernel 3.3; it appears you are running 2.6? Nope, it`s just Debian kernel naming - they continue to name 3.x with 2.6 and I`m following them at own build naming. I have tried that on 3.2 first time, and just a couple of minutes ago on
Re: RBD format changes and layering
On Thursday, May 24, 2012 at 4:05 PM, Josh Durgin wrote: snip One thing that's not addressed in the earlier design is how to make images read-only. The simplest way would be to only support layering on top of snapshots, which are read-only by definition. Another way would be to allow images to be set read-only or read-write, and disallow setting images with children read-write. Are there many use cases that would justify this second, more complicated way? I'm pretty sure we want to require images to be based on snapshots. It's actually more flexible than read-write flags: service providers could provide several Ubuntu 12.04 installs with different packages available by simply snapshotting as they go through the install procedure. If they instead had to go to an endpoint and then mark the image read-only, they would need to duplicate all the shared data. Copy-up === Another feature we want to include with layering is the ability to copy all remaining data from the parent image to the child image, to break the dependency of the latter on the former. This does not change snapshots that were taken earlier though - they still rely on the parent image. Thus, the children of a parent image will need to include snapshots as well, and the reference to the parent image will be needed to interact with snapshots. Thus, we can't just remove the information pointing the parent. Instead, we can add a boolean has_parent field that is stored in the header and with each snapshot, since some snapshots may be taken when the parent was still used, and some after all the data has been copied to the child. I understand why you're maintaining a reference to the parent image for old snapshots, but it makes me a little uneasy. This limitation means that you either need to delete snapshots or you need to maintain access to the parent image, which makes me a sad panda. Have you looked into options for doing a full local copy of the needed parent data? 
I realize there are several tricky problems, but given some of the usage scenarios for layering (ie, migration) it would be an advantage. My last question is about recursive layering. I know it's been discussed some, and *hopefully* it won't impact the actual on-disk layout of RBD images; do you have enough of a design sketched out to be sure? (One example: given the security concerns you've raised, I think layered images are going to need to list themselves as a child of each of their ancestors, rather than letting that information be absorbed by the intermediate image. Can the plan for storing parent pointers handle that?) -Greg
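To make the snapshot-based model concrete, the provider scenario above implies a lifecycle roughly like the following. The command names here are purely illustrative sketches of the design under discussion, not an existing or committed CLI:

```
# Illustrative only -- not a real CLI at this point in the design.
rbd snap create --snap golden ubuntu-12.04    # read-only base, by definition
rbd clone ubuntu-12.04@golden vm-1            # copy-on-write child of the snap
rbd flatten vm-1                              # "copy-up": copy the remaining
                                              # parent data so vm-1 stands alone
```

Note that any snapshots of vm-1 taken before the flatten would still reference the parent, which is exactly the has_parent bookkeeping Josh describes.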
Re: how to free space from rados bench command?
On Thursday, May 24, 2012 at 1:51 AM, Stefan Priebe - Profihost AG wrote: On 24.05.2012 10:22, Wido den Hollander wrote: On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote: ~# rados -p data ls|wc -l 46631 That is weird, I thought the bench tool cleaned up its mess. Imho it should clean up after it's done, but there might be a reason why it's not. Did you abort the benchmark or did you let it do the whole run? No it doesn't BUG? It doesn't because you might want to leave around the data for read benchmarking (or so that your cluster is full of data). There should probably be an option to clean up bench data, though! I've created a bug: http://tracker.newdream.net/issues/2477 ~# rados -p data ls ~# ~# rados -p data bench 20 write -t 16 ... ~# rados -p data ls| wc -l 589 I do not use the data pool so it is separate ;-) i only use the rbd pool for block devices. So i will free the space with: for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done rados -p data ls|xargs -n 1 rados -p data rm I love shorter commands ;) me too i just tried it without -n and hoped that this works but rados didn't support more than 1 file per command and i didn't remember -n1 ;) Stefan
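As an aside, the reason -n 1 matters is that rados rm accepts only one object name per invocation, so xargs has to fork one process per name. The same pattern can be seen on ordinary files (a toy illustration only; it does not touch a real cluster):

```shell
# Toy illustration of the "xargs -n 1" pattern: one rm invocation per name,
# just as the rados cleanup pipeline runs one "rados rm" per object.
mkdir -p /tmp/bench_demo
touch /tmp/bench_demo/obj_a /tmp/bench_demo/obj_b /tmp/bench_demo/obj_c
ls /tmp/bench_demo | xargs -n 1 -I{} rm "/tmp/bench_demo/{}"
ls -A /tmp/bench_demo | wc -l   # prints 0
```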
Re: I have some problem to mount ceph file system
That's not an option any more, since malicious clients can fake it so easily. :( On Wednesday, May 23, 2012 at 10:35 PM, FrankWOO Su wrote: So in this version, can I do some settings to limit the mount command by IP? Any example? Thanks -Frank 2012/5/24 Sage Weil s...@inktank.com On Wed, 23 May 2012, Gregory Farnum wrote: On Wed, May 23, 2012 at 1:51 AM, Frank frankwoo@gmail.com wrote: Hello I have a question about ceph. When I mount ceph, I do the command as follows: # mount -t ceph -o name=admin,secret=XX 10.1.0.1:6789:/ /mnt/ceph -vv Now I create a user foo and generate a secret key with ceph-authtool like this: # ceph-authtool /etc/ceph/keyring.bin -n client.foo --gen-key Then I add the key into ceph: # ceph auth add client.foo osd 'allow *' mon 'allow *' mds 'allow' -i /etc/ceph/keyring.bin So I can mount ceph as foo: # mount -t ceph -o name=foo,secret=XOXOXO 10.1.0.1:6789:/ /mnt/ceph -vv My question is: if I don't want foo to have permission to mount 10.1.0.1:6789:/, how do I do it? If there is a directory foo, I want foo to be able to mount 10.1.0.1:6789:/foo/ but have no access to mount 10.1.0.1:6789:/ I'm afraid that's not an option with Ceph right now, that I'm aware of. It was built and designed for a trusted set of servers and clients, and while we're slowly carving out areas of security, this isn't one we've done yet. If it's an important feature for you, you should create a feature request in the tracker (tracker.newdream.net) for it, which we will prioritize and work on once we've moved to focus on the full filesystem. :) http://tracker.newdream.net/issues/1237 (tho the final config will probably not look like that; suggestions welcome.) 
sage
Re: NFS re-exporting CEPH cluster
On Wednesday, May 23, 2012 at 10:14 PM, Madhusudhana U wrote: Hi all, Has anyone tried re-exporting a CEPH cluster via NFS with success (I mean to say, mount the CEPH cluster on one of the machines and then export that via NFS to clients)? I need to do this because of my client kernel version and some EDA tools compatibility. Can someone suggest how I can successfully re-export CEPH over NFS? Have you tried something and it failed? Or are you looking for suggestions? If the former, please report the failure. :) If the latter: http://ceph.com/wiki/Re-exporting_NFS -Greg
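For reference, the wiki recipe boils down to mounting CephFS with the kernel client on one export host and pointing the kernel NFS server at that mount. A minimal sketch; the addresses, paths, client network, and the fsid number are all placeholders, not values from this thread:

```
# /etc/fstab on the re-export host: kernel-client mount of CephFS
10.0.6.10:6789:/  /mnt/ceph  ceph  name=admin,secretfile=/etc/ceph/secret  0 0

# /etc/exports: re-export the mount over NFS. An explicit fsid= is required
# because knfsd cannot derive a stable filesystem id for a network mount.
/mnt/ceph  192.168.0.0/24(rw,no_subtree_check,fsid=20)
```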
Re: OSDMap::apply_incremental not updating crush map
On Thursday, May 24, 2012 at 10:58 AM, Adam Crume wrote: I'm trying to simulate adding an OSD to a cluster. I set up an OSDMap::Incremental and apply it, but nothing ever gets mapped to the new OSD. Apparently, the crush map never gets updated. Do I have to do that manually? Yes. If you need help, check out the crush section of the OSDMonitor::prepare_command code. :) It seems like apply_incremental should do it automatically. apply_incremental has no idea where the new ID is located in terms of failure domains. My test case is below. It shows that the OSDMap is updated to have 11 OSDs, but the crush map still shows only 10. Thanks, Adam Crume

#include <assert.h>
#include "osd/OSDMap.h"
#include "common/code_environment.h"

int main() {
  OSDMap *osdmap = new OSDMap();
  CephContext *cct = new CephContext(CODE_ENVIRONMENT_UTILITY);
  uuid_d fsid;
  int num_osds = 10;
  osdmap->build_simple(cct, 1, fsid, num_osds, 7, 8);
  for (int i = 0; i < num_osds; i++) {
    osdmap->set_state(i, osdmap->get_state(i) | CEPH_OSD_UP | CEPH_OSD_EXISTS);
    osdmap->set_weight(i, CEPH_OSD_IN);
  }
  int osd_num = 10;
  OSDMap::Incremental inc(osdmap->get_epoch() + 1);
  inc.new_max_osd = osdmap->get_max_osd() + 1;
  inc.new_weight[osd_num] = CEPH_OSD_IN;
  inc.new_state[osd_num] = CEPH_OSD_UP | CEPH_OSD_EXISTS;
  inc.new_up_client[osd_num] = entity_addr_t();
  inc.new_up_internal[osd_num] = entity_addr_t();
  inc.new_hb_up[osd_num] = entity_addr_t();
  inc.new_up_thru[osd_num] = inc.epoch;
  uuid_d new_uuid;
  new_uuid.generate_random();
  inc.new_uuid[osd_num] = new_uuid;
  int e = osdmap->apply_incremental(inc);
  assert(e == 0);
  printf("State for 10: %d, State for 0: %d\n", osdmap->get_state(10), osdmap->get_state(0));
  printf("10 exists: %s\n", osdmap->exists(10) ? "yes" : "no");
  printf("10 is in: %s\n", osdmap->is_in(10) ? "yes" : "no");
  printf("10 is up: %s\n", osdmap->is_up(10) ? "yes" : "no");
  printf("OSDMap max OSD: %d\n", osdmap->get_max_osd());
  printf("CRUSH max devices: %d\n", osdmap->crush->get_max_devices());
}
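For completeness, the manual step Greg points at would amount to extending the crush map and shipping it inside the same incremental, before calling apply_incremental. The following is a pseudocode sketch only: the method names (insert_item, encode) and the loc keys are assumptions modeled on the OSDMonitor::prepare_command crush path, not verified against the tree:

```
// Pseudocode sketch -- API names and loc keys are assumptions, see above.
map<string,string> loc;                  // failure-domain placement for osd.10
loc["host"] = "newhost";
loc["pool"] = "default";
osdmap->crush->insert_item(cct, osd_num, 1.0, "osd.10", loc);
bufferlist bl;
osdmap->crush->encode(bl);
inc.crush = bl;                          // carry the new crush map in the inc
```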
Re: how to free space from rados bench command?
On Thursday, May 24, 2012 at 11:05 AM, Josh Durgin wrote: Why not have the read benchmark write data itself, and then benchmark reading? Then both read and write benchmarks can clean up after themselves. It's a bit odd to have the read benchmark depend on you running a write benchmark first. Josh We've talked about that and decided we didn't like it. I think it was about being able to repeat large read benchmarks without having to wait for all the data to get written out first, and also (although this was never implemented) being able to implement random read benchmarks and things in ways that allowed you to make the cache cold first. Which is not to say that changing it is a bad idea; I could be talked into that or somebody else could do it. :) -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MDS crash, wont startup again
On Tuesday, May 22, 2012 at 3:12 AM, Felix Feinhals wrote: I am not quite sure on how to get you the coredump infos. I installed all ceph-dbg packages and executed: gdb /usr/bin/ceph-mds core snip GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-linux-gnu. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /usr/bin/ceph-mds...Reading symbols from /usr/lib/debug/usr/bin/ceph-mds...done. (no debugging symbols found)...done. [New Thread 22980] [New Thread 22984] [New Thread 22986] [New Thread 22979] [New Thread 22970] [New Thread 22981] [New Thread 22971] [New Thread 22976] [New Thread 22973] [New Thread 22975] [New Thread 22974] [New Thread 22972] [New Thread 22978] [New Thread 22982] warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/libpthread.so.0 Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libcrypto++.so.8 Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libuuid.so.1 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/librt.so.1 Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libtcmalloc.so.0 Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libstdc++.so.6 Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done. 
Loaded symbols for /lib/libm.so.6 Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libgcc_s.so.1 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libunwind.so.7 Core was generated by `/usr/bin/ceph-mds -i c --pid-file /var/run/ceph/mds.c.pid -c /etc/ceph/ceph.con'. Program terminated with signal 11, Segmentation fault. #0 0x7f10c00d2ebb in raise () from /lib/libpthread.so.0 Argh. This is finicky and annoying; don't feel bad. :) There are two possibilities here: 1) If I remember correctly, PATH and the actual debug symbol install locations often don't match up. Check out where the debug packages actually installed to, and make sure that directory is in PATH when running gdb. 2) The default thread you're getting a backtrace on doesn't look to be the one we actually care about (notice how the backtrace is through completely different parts of the code); it's conceivable that there just aren't any debug symbols for those libraries. Try running thread apply all bt (I think that's the right command) and looking for one that matches the backtrace in the log file. Then switch to it (thread x where x is the thread number) and get the backtrace of that. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to mount a specific pool in cephs
On Tuesday, May 22, 2012 at 5:02 AM, Grant Ashman wrote: Tommi Virtanen tommi.virtanen at dreamhost.com writes: You don't mount pools directly; there's filesystem metadata (as managed by metadata servers) that is needed too. What you probably want is to specify that a subtree of your ceph dfs stores the file data in a separate pool, using cephfs /mnt/ceph/some/subtree set_layout --pool 6. Note that a numerical pool id is currently required. http://ceph.newdream.net/docs/master/man/8/cephfs/ You can mount any subtree of the ceph dfs directly, using 10.32.0.10:6789:/some/subtree in your mount command. Hi Tommi, We have tried setting the layout as described below with: 'cephfs /mnt/ceph-backup/ set_layout --pool 3' However, I only ever receive the following output: 'Error setting layout: Invalid argument' I can view the current layout of the mount point, but cannot change the pool layout. My understanding of the numeric pool is as follows: root@dsan-test:/mnt# ceph osd dump -o -|grep 'pool' pool 0 'data' rep size 2 crush_ruleset 0 pool 1 'metadata' rep size 2 crush_ruleset 1 pool 2 'rbd' rep size 2 crush_ruleset 2 pool 3 'backup' rep size 2 crush_ruleset 0 (omitted detail I thought unnecessary) Therefore the backup pool which we specifically want to mount is 3? Are you able to assist with the syntax for cephfs set_layout? That's the right pool ID, yes. I believe the problem is that the cephfs tool currently requires you to fill in all the fields, not just the one you wish to change. Try that (setting all the other values to match what you see when you view the layout). :) -Greg
Re: how to debug slow rbd block device
What does your test look like? With multiple large IOs in flight we can regularly fill up a 1GbE link on our test clusters. With smaller or fewer IOs in flight performance degrades accordingly. On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote: Hi list, my ceph block test cluster is now running fine. Setup: 4x ceph servers - 3x mon with /mon on local os SATA disk - 4x OSD with /journal on tmpfs and /srv on intel ssd all of them use 2x 1Gbit/s lacp trunk. 1x KVM host system (2x 1Gbit/s lacp trunk) With one KVM I do not get more than 40MB/s and my network link is just at 40% of 1Gbit/s. Is this expected? If not, where can I start searching / debugging? Thanks, Stefan
Re: how to debug slow rbd block device
On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote: Am 22.05.2012 21:35, schrieb Greg Farnum: What does your test look like? With multiple large IOs in flight we can regularly fill up a 1GbE link on our test clusters. With smaller or fewer IOs in flight performance degrades accordingly. iperf shows 950Mbit/s so this is OK (from KVM host to OSDs) sorry: dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M count=1000; 1000+0 records in 1000+0 records out 4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s 1000+0 records in 1000+0 records out 4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s Huh. That's less than I would expect. Especially since it ought to be going through the page cache. What version of RBD is KVM using here? Can you (from the KVM host) run rados -p data bench seq 60 -t 1 rados -p data bench seq 60 -t 16 and paste the final output from both? -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph osd crush add - unknown command crush
On Tuesday, May 22, 2012 at 1:15 PM, Sławomir Skowron wrote: /usr/bin/ceph -v ceph version 0.47.1 (commit:f5a9404445e2ed5ec2ee828aa53d73d4a002f7a5) root@obs-10-177-66-4:/# /usr/bin/ceph osd crush add 1 osd.1 1.0 pool=default rack=unknownrack host=obs-10-177-66-4 root@obs-10-177-66-4:/# unknown command crush Has something changed (there is no change in the docs), or is this a bug? Gah, something changed. Use ceph osd crush set…
Re: how to debug slow rbd block device
On Tuesday, May 22, 2012 at 1:30 PM, Stefan Priebe wrote: Am 22.05.2012 21:52, schrieb Greg Farnum: On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote: Huh. That's less than I would expect. Especially since it ought to be going through the page cache. What version of RBD is KVM using here? v0.47.1 Can you (from the KVM host) run rados -p data bench seq 60 -t 1 rados -p data bench seq 60 -t 16 and paste the final output from both? OK here it is first with write then with seq read. # rados -p data bench 60 write -t 1 # rados -p data bench 60 write -t 16 # rados -p data bench 60 seq -t 1 # rados -p data bench 60 seq -t 16 Output is here: http://pastebin.com/iFy8GS7i Heh, yep, sorry about the commands — haven't run them personally in a while. :) Anyway, it looks like you're just paying a synchronous write penalty, since with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 16 you're getting 100MB/s. (If you bump up past 16 or increase the size of each with -b you may find yourself getting even more.) So try enabling RBD writeback caching — see http://marc.info/?l=ceph-develm=133758599712768w=2 -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to debug slow rbd block device
On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote: Am 22.05.2012 22:49, schrieb Greg Farnum: Anyway, it looks like you're just paying a synchronous write penalty What does that exactly mean? Shouldn't one threaded write to four 260MB/s devices gives at least 100Mb/s? Well, with dd you've got a single thread issuing synchronous IO requests to the kernel. We could have it set up so that those synchronous requests get split up, but they aren't, and between the kernel and KVM it looks like when it needs to make a write out to disk it sends one request at a time to the Ceph backend. So you aren't writing to four 260MB/s devices; you are writing to one 260MB/s device without any pipelining — meaning you send off a 4MB write, then wait until it's done, then send off a second 4MB write, then wait until it's done, etc. Frankly I'm surprised you aren't getting a bit more throughput than you're seeing (I remember other people getting much more out of less beefy boxes), but it doesn't much matter because what you really want to do is enable the client-side writeback cache in RBD, which will dispatch multiple requests at once and not force writes to be committed before reporting back to the kernel. Then you should indeed be writing to four 260MB/s devices at once. :) since with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 16 you're getting100MB/s. (If you bump up past 16 or increase the size of each with -b you may find yourself getting even more.) yep noticed that. So try enabling RBD writeback caching — see http://marc.info/?l=ceph-develm=133758599712768w=2 will test tomorrow. Thanks. Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to mount a specific pool in cephs
On Tuesday, May 22, 2012 at 2:12 PM, Grant Ashman wrote: That's the right pool ID; yes. I believe the problem is that the cephfs tool currently requires you to fill in all the fields, not just the one you wish to change. Try that (setting all the other values to match what you see when you view the layout). :) - Greg Hi, I've tried setting all values the same as the current layout - changing only the pool number - and I still get: root@dsan-test:/mnt/ceph-test# cephfs /mnt/ceph-backup show_layout layout.data_pool: 0 layout.object_size: 4194304 layout.stripe_unit: 4194304 layout.stripe_count: 1 root@dsan-test:/mnt/ceph-test# root@dsan-test:/mnt/ceph-test# cephfs /mnt/ceph-backup set_layout -p 3 -s 4194304 -u 4194304 -c 1 Error setting layout: Invalid argument If I leave all values exactly the same, i.e. '-p 0', the command runs without any error output. However, changing the pool to anything but 0 results in 'Error setting layout: Invalid argument' Any ideas? Oh, I got this conversation confused with another one. You also need to specify the pool as a valid pool to store filesystem data in, if you haven't done that already: ceph mds add_data_pool poolname And you may not actually need to specify all options — I'd been assuming so since it broke, but I don't remember if that's actually the case (I think it's changed over the versions).
Re: how to mount a specific pool in cephs
On Tuesday, May 22, 2012 at 2:31 PM, Grant Ashman wrote: Greg Farnum greg at inktank.com (http://inktank.com) writes: Oh, I got this conversation confused with another one. You also need to specify the pool as a valid pool to store filesystem data in, if you haven't done that already: ceph mds add_data_pool poolname Thanks Greg, However, I still get the same error :( When I specify the add data pool I get the following: (with or without the additional values) root@dsan-test:~# ceph mds add_data_pool backup added data pool 0 to mdsmap Okay, that's not right — it should say pool 3. According to the docs I found you ran that correctly, but let's try running ceph mds add_data_pool 3 and see if that resolves correctly. *goes to look at code* Argh, yep, it's expecting a pool ID, not a pool name. Gah. root@dsan-test:~# cephfs /mnt/ceph-backup set_layout -p 3 Error setting layout: Invalid argument Seems to be pointing at pool 0 instead of pool 3 like it should? Sorry if this has all been covered before, I've not found any resolution and the ability to mount specific pools I.E data for production data and backup for 1N backup data is a huge priority to begin using Ceph. No problem — we haven't generated docs for this yet and obviously we need to at some point. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Possible memory leak in mon?
On Wednesday, May 2, 2012 at 11:24 PM, Vladimir Bashkirtsev wrote: Greg, Apologies for multiple emails: my mail server is backed by ceph now and it struggled this morning (separate issue). So my mail server reported back to my mailer that sending of email failed when obviously it was not the case. Interesting — I presume you're using the file system? That's not something we've heard of anybody doing with Ceph before. :) [root@gamma ~]# ceph -s 2012-05-03 15:46:55.640951 mds e2666: 1/1/1 up {0=1=up:active}, 1 up:standby 2012-05-03 15:46:55.647106 osd e10728: 6 osds: 6 up, 6 in 2012-05-03 15:46:55.654052 log 2012-05-03 15:46:26.557084 mon.2 172.16.64.202:6789/0 2878 : [INF] mon.2 calling new monitor election 2012-05-03 15:46:55.654425 mon e7: 3 mons at {0=172.16.64.200:6789/0,1=172.16.64.201:6789/0,2=172.16.64.202:6789/0} 2012-05-03 15:46:56.961624 pg v1251669: 600 pgs: 2 creating, 598 active+clean; 309 GB data, 963 GB used, 1098 GB / 2145 GB avail Logging is on but nothing obvious in there: the logs are quite small. A number of ceph health checks are logged (ceph is monitored by nagios, so this record appears every 5 minutes), and the monitors periodically call for election (at varying intervals of 1 to 15 minutes, as it looks). That's it. Hrm. Generally speaking the monitors shouldn't call for elections unless something changes (one of them crashes) or the leader monitor is slowing down. Can you increase the debug_mon to 20, the debug_ms to 1, and post one of the logs somewhere? The Live Debugging section of http://ceph.com/wiki/Debugging should give you what you need. :) Regards, Vladimir
Re: weighted distributed processing.
On Wednesday, May 2, 2012 at 3:42 PM, Clint Byrum wrote: Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012: Hello All, First off, I'm sending this email to three discussion groups: gear...@googlegroups.com - distributed processing library ceph-devel@vger.kernel.org - distributed file system archivemat...@googlegroups.com - my project's discussion list, a distributed processing system. I'd like to start a discussion about something I'll refer to as weighted distributed task based processing. Presently, we are using gearman's libraries to meet our distributed processing needs. The majority of our processing is file based, and our processing stations are accessing the files over an nfs share. We are looking at replacing the nfs server share with a distributed file system, like ceph. It occurs to me that our processing times could theoretically be reduced by assigning tasks to processing clients where the file resides, over places where it would need to be copied over the network. In order for this to happen, the gearman server would need to get file location information from the ceph system. If I understand the design of CEPH completely, it spreads I/O at the block level, not the file level. So there is little point in weighting since it seeks to spread the whole file across all the machines/block devices in the cluster. Even if you do ask ceph which servers file X is on, which I'm sure it could tell you, you will end up with high weights for most of the servers, and no real benefit. In this scenario, you're just better off having a really powerful network and CEPH will balance the I/O enough that you can scale out the I/O independently of the compute resources. This seems like a huge win, as I don't believe most workloads scale at a 1:1 I/O:CPU ratio.
10Gigabit switches are still not super cheap, but they are probably cheaper than software engineer hours. If your network is not up to the task of transferring all those blocks around, you probably need to focus instead on something that keeps whole files in a certain place. One such system would be MogileFS. This has a database with a list of keys that say where the data lives, and in fact the protocol the MogileFS tracker uses will tell you all the places a key lives. You could then place a hint in the payload and have 2 levels of workers. The pseudo becomes:
- workers register two queues, 'dispatch_foo' and 'do_foo_$hostname'
- client sends task w/ filename to 'dispatch_foo'
- dispatcher looks at filename, asks mogile where the file is, looks at recent queue lengths in gearman, and decides whether or not it is enough of a win to direct the job to the host where the file is, or to farm it out to somewhere that is less busy.
This will take a lot of poking at to get tuned right, but it should be tunable to a single number, the ratio of localized queue length versus non-localized queue length.
pseudo:
  gearman client creates a task (includes a weight, of type ceph file)
  gearman server identifies the file
    polls the ceph system for clients that have this file
    ceph system returns a list of clients that have the file locally
  gearman assigns the task
    if there is a client available for processing that has the file locally
      assign it there (that client has local access to the file, still on the ceph system)
    else
      assign to another client (that processing client will pull the file from the ceph system over the network)
I call it a weighted distributed processing system, because it reminds me of a weighted die: the outcome is influenced in a certain direction (in the task assignment). I wanted to start this as a discussion, rather than filing feature requests, because of the complex nature of the requests, and the nicer medium for feedback, clarification and refinement.
I'd be very interested to hear feedback on the idea, Joseph Perry https://groups.google.com/group/gearman/browse_thread/thread/12a1b3aa64f103d1 ^ is the Google Groups link for this (ceph-devel doesn't seem to have gotten the original email — at least I didn't!). Clint is mostly correct: Ceph does not store files in a single location. It's not block-based in the sense of 4K disk blocks though — instead it breaks up files into (by default) 4MB chunks. It's possible to change this default to a larger number though; our Hadoop bindings break files into 64MB chunks. And it is possible to retrieve this location data using the cephfs tool: ./cephfs not enough parameters! usage: cephfs path command [options]* Commands: show_layout -- view the layout information on a file or dir set_layout -- set the layout on an empty file, or the default layout on a
Re: weighted distributed processing.
(Trimmed CC:) apparently neither Gearman nor Archivematica lists allow posting from non-members, which leads to some wonderful spam from Google and is going to make holding a cross-list conversation…difficult. On Wednesday, May 2, 2012 at 4:26 PM, Greg Farnum wrote: [snip: full quote of the previous message]
Re: Possible memory leak in mon?
On Wednesday, May 2, 2012 at 3:28 PM, Vladimir Bashkirtsev wrote: Dear devs, I have three mons and two of them suddenly consumed around 4G of RAM while third one happily lived with 150M. This immediately prompts few questions: 1. What is expected memory use of mon? I believed that mon merely directs clients to relevant OSDs and should not consume a lot of resources - please correct me if I am wrong. 2. In both cases where mon consumed a lot of memory it was preceded by disk-full condition and both machines where incidents happened are 64 bit, rest of cluster 32 bit. mon fs and log files happened to be in the same partition - ceph osd produced a lot of messages, filled up disk, mon crashed (no core as disk was full), manually deleted logs, restarted mon without any issue, some time later found mon using 4G of RAM. Running 0.45. Should I deliberately recreate conditions and crash mon to get more debug info (if you need it of course, and if yes then what)? 3. Does figure 4G per process coming from 32 bit pointers in mon? Or mon potentially can consume more than 4G? Regards, Vladimir -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org (mailto:majord...@vger.kernel.org) More majordomo info at http://vger.kernel.org/majordomo-info.html First: one email is enough. Second: in normal use your monitors should not consume very much memory. It sounds like something's wrong. Can you please provide the output of ceph -s? Also, do you have any monitor logging on? My best guess is that for some reason the monitors aren't all communicating with each other and so they are buffering messages. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: global_init fails when only specifying monitor address
On Thursday, April 26, 2012 at 9:33 AM, Sage Weil wrote: On Thu, 26 Apr 2012, Wido den Hollander wrote: Hi, I tried to connect to a small Ceph setup on my desktop without cephx and that failed: root@stack01:~# ceph -m wido-desktop.widodh.nl:6789 -s global_init: unable to open config file. root@stack01:~# It however worked with: root@stack01:~# ceph -m wido-desktop.widodh.nl:6789 -c /dev/null -s 2012-04-26 14:55:33.828524 pg v148: 594 pgs: 594 active+clean; 0 bytes data, 7740 KB used, 70571 MB / 76800 MB avail 2012-04-26 14:55:33.829622 mds e1: 0/0/1 up 2012-04-26 14:55:33.836144 osd e14: 3 osds: 3 up, 3 in 2012-04-26 14:55:33.886429 log 2012-04-26 14:52:50.674430 osd.1 [2a00:f10:11c:ab:52e5:49ff:fec2:c976]:6807/28366 12 : [INF] 1.2b scrub ok 2012-04-26 14:55:33.892423 mon e1: 1 mons at {desktop=[2a00:f10:11c:ab:52e5:49ff:fec2:c976]:6789/0} root@stack01:~# A quick look at global_init.cc showed me why this happened: it simply looks for a configuration file to open, and when it can't it fails. But if a monitor address is set, a config file shouldn't be mandatory. It could be accomplished rather simply by setting the flag CINIT_FLAG_NO_DEFAULT_CONFIG_FILE if a mon_host has been set, but to do that conf->parse_argv(args); should move a few lines up. Comments? Thoughts? I wonder if the simplest thing to do is: - never error out on missing config in the default search path - always error out on missing config if it was explicitly specified via -c foo or CEPH_CONF in the environment. ? sage I think this is probably right. I think that we may even error out correctly if we don't have values specified that we need, but we'll need to check that. I'm working on similar stuff as I look at monitor cluster additions for Carl, so I'll look at this today.
-Greg
Re: Log files with 0.45
I checked with Sam on this and it turns out he created a new subsystem whose output you can control with the debug optracker (or --debug-optracker) option (in the same way as the other debug log systems). In 0.45 the output for that system was at inappropriately high levels (1) and it's fixed in our current master (5 now), but you probably want to set debug optracker = 0 in your config. That should restore things to the way they used to be! (And sorry for the long wait, Danny.) -Greg On Friday, April 20, 2012 at 1:05 PM, Nick Bartos wrote: Is there a recommended log config for production systems? I'm also trying to decrease the verbosity in 0.45, using the options specified here: http://ceph.newdream.net/wiki/Debugging. Setting them down to '1' doesn't end the insane log sprawl I'm seeing. On Tue, Apr 17, 2012 at 10:09 PM, Greg Farnum gregory.far...@dreamhost.com (mailto:gregory.far...@dreamhost.com) wrote: On Tuesday, April 17, 2012 at 9:53 PM, Danny Kukawka wrote: Hi, did something change with the default log levels for OSDs on v0.45? With 0.43 and IIRC also 0.44* the logfiles had a acceptable size, but now I get by default ~3Gbyte per OSD over 12 hours without any change in the config file. Danny I think some extra event notifications got stuck in the logs for OSD operations; they're nice for debugging but may well have a log level higher than they should. They should be easy to compress a lot, though! Can you comment on this, Sam? 
-Greg
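For anyone landing here from a search, the fix described above is a one-line config change; a minimal ceph.conf fragment (the section placement is my assumption, since debug settings can live in [global] or in [osd]):

```ini
[global]
    ; silence the op tracker events that flooded OSD logs in 0.45
    debug optracker = 0
```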
Re: pgs stuck inactive
On Thursday, April 19, 2012 at 2:07 AM, Damien Churchill wrote: On 18 April 2012 21:41, Greg Farnum gregory.far...@dreamhost.com wrote: That should get everything back up and running. The one sour note is that due to the bug in the past, your data (ie, filesystem) and vmimages pools have gotten conflated. It shouldn't cause any issues (they use very different naming schemes), but they're tied together in terms of replication and the raw pool statistics. (If that's important you can create a new pool and move the rbd images to it.) Thanks a lot Greg! All back up and running now. What negative side effects could having the pools mixed together have, given that I'm not doing any special placement of them? There shouldn't be any negative side effects from it at all. It just means that you've got a mixed namespace, and if you don't care about that none of our current tools do either. (Something like the still-entirely-theoretical ceph-fsck probably wouldn't appreciate it, but we don't have anything like that right now.)
Re: wip-librbd-caching
On Wednesday, April 18, 2012 at 5:50 AM, Martin Mailand wrote: Hi, I changed the values and the performance is still very good and the memory footprint is much smaller. OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty OR tx.. bigish) OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish) // note: the max amount of in flight dirty data is roughly (max - target) But I am not quite sure about the meaning of the values. client_oc_size Max size of the cache? client_oc_max_dirty max dirty value before the writeback starts? client_oc_target_dirty ??? Right now the cache writeout algorithms are based on the amount of dirty data, rather than something like how long the data has been dirty. client_oc_size is the max (and therefore typical) size of the cache. client_oc_max_dirty is the largest amount of dirty data in the cache — if this much is dirty and you try to dirty more, the dirtier (a write of some kind) will block until some of the other dirty data has been committed. client_oc_target_dirty is the amount of dirty data that will trigger the cache to start flushing data out.
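In ceph.conf terms, the three knobs above translate to something like this (values in bytes; the 50/25/8 MB figures are the ones Martin tested, not the shipped defaults):

```ini
[client]
    ; max (and therefore typical) size of the object cache
    client oc size = 52428800          ; 50 MB
    ; writers block once this much data is dirty
    client oc max dirty = 26214400     ; 25 MB
    ; flushing starts once dirty data exceeds this
    client oc target dirty = 8388608   ; 8 MB
```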
Re: pgs stuck inactive
On Tuesday, April 17, 2012 at 11:41 PM, Damien Churchill wrote: On 17 April 2012 17:49, Greg Farnum gregory.far...@dreamhost.com wrote: Do you know what version this was created with, and what upgrades you've been through? My best guess right now is that there's a problem with the encoding and decoding that I'm going to have to track down, and more context will make it a lot easier. :) Hmmm that's testing my memory, I'd say that cluster has been alive at least since 0.34. Occasionally I think there was a version skipped, not sure if that could cause any issues? Okay. So the good news is that we can see what's broken now and have a kludge to prevent it happening to others; the bad news is we still have no idea how it actually occurred. :( But I don't think it's worth investing the time given what we have available, so all we can do is repair your cluster. Are you building your binaries from source, and can you run a patched version of the monitors? If you can I'll give you a patch to enable a simple command that should make things work; otherwise we'll need to start editing things by hand. (Yucky) -Greg
Re: pgs stuck inactive
On Wednesday, April 18, 2012 at 12:04 PM, Damien Churchill wrote: On 18 April 2012 19:41, Greg Farnum gregory.far...@dreamhost.com wrote: Are you building your binaries from source, and can you run a patched version of the monitors? If you can I'll give you a patch to enable a simple command that should make things work; otherwise we'll need to start editing things by hand. (Yucky) -Greg I was using the Ubuntu packages but I can quite happily build my own packages if you give me the patch :-) I agree it's a waste of time if it's not obvious what's caused it, could be some obscure cause occurred due to upgrading between older versions. Okay, assuming you're still on 0.41.1, can you checkout the git branch for-damien and build it? Then shut down your monitors, replace their executables with the freshly-built ones, and run ceph osd pool set vmimages pg_num 320 --a-dev-told-me-to and ceph osd pool set vmimages pgp_num 320 That should get everything back up and running. The one sour note is that due to the bug in the past, your data (ie, filesystem) and vmimages pools have gotten conflated. It shouldn't cause any issues (they use very different naming schemes), but they're tied together in terms of replication and the raw pool statistics. (If that's important you can create a new pool and move the rbd images to it.) -Greg
Re: pgs stuck inactive
On Monday, April 16, 2012 at 10:32 PM, Damien Churchill wrote: On 17 April 2012 01:06, Greg Farnum gregory.far...@dreamhost.com wrote: Yep! We looked into this more today and have discovered some definite oddness. Have you by any chance tried to change the number of PGs in your pools? I haven't no (at least certainly not on purpose!). All I've really done is copy a bit of stuff onto the unix fs and create a few rbd volumes, as well as upgrade it when a new version comes out. Drat, that means there's actually a problem to track down somewhere. Do you know what version this was created with, and what upgrades you've been through? My best guess right now is that there's a problem with the encoding and decoding that I'm going to have to track down, and more context will make it a lot easier. :) -Greg
Re: Log files with 0.45
On Tuesday, April 17, 2012 at 9:53 PM, Danny Kukawka wrote: Hi, did something change with the default log levels for OSDs on v0.45? With 0.43 and IIRC also 0.44* the logfiles had an acceptable size, but now I get by default ~3Gbyte per OSD over 12 hours without any change in the config file. Danny I think some extra event notifications got stuck in the logs for OSD operations; they're nice for debugging but may well have a log level higher than they should. They should be easy to compress a lot, though! Can you comment on this, Sam? -Greg
Re: librados aio completion
On Sunday, April 15, 2012 at 9:44 PM, Sage Weil wrote: We just switched the completion callbacks so that they are called asynchronously from another thread. This makes the locking less weird for certain callers and lets you call back into librados in your callback safely. This breaks one of the functional tests, which sets a bool in the callback, does wait_for_complete() on the aio handle, and then asserts that the bool is set. There's now a race between the caller's thread and the completion thread. Do we just call this a broken test, or do we want some way of blocking on the aio handle until the completion has been called? I think we have to block the aio handle until the completion has been called. Expecting users to (constantly re-)implement that themselves is just silly, and anybody who needs to wait_for_complete() is going to expect the completion to have been called. I don't remember exactly how wait_for_complete() is triggered, but it can't be too difficult to move it into the completion thread. -Greg
Re: librados aio completion
On Monday, April 16, 2012 at 2:07 PM, Sage Weil wrote: On Mon, 16 Apr 2012, Greg Farnum wrote: Or set the bool to true, then do the callback, then signal? That's sort of what I was getting at with wait_for_complete_and_callback_returned(). We could make wait_for_complete() do that, although it should be a second bool because cond.Wait() can wake up nondeterministically (because of a signal or something). For example I could clear the callback pointer after it returns, and make the wait loop check for the bool and callback_ptr == NULL. It just means the wait_for_complete() does not actually wait for is_complete(), but is_complete() did callback. Okay. I'm just thinking that we need wait_for_complete() to be the stronger variant, since that's how it previously behaved. (Whereas I believe that previously is_complete() and is_safe() functioned correctly inside callbacks, correct? So we need to preserve that behavior as well.) If we want the weaker guarantee for some reason, we can add a wait_for_complete_response() or similar.
Re: pgs stuck inactive
On Friday, April 13, 2012 at 12:42 PM, Damien Churchill wrote: Hi, On 13 April 2012 20:30, Greg Farnum gregory.far...@dreamhost.com wrote: On Thursday, April 12, 2012 at 8:29 AM, Damien Churchill wrote: On 11 April 2012 00:40, Greg Farnum gregory.far...@dreamhost.com wrote: A quick glance through these shows that all the pg_temp requests aren't actually requesting any changes from the monitor. It's either a very serious mon bug which happened a while ago (unlikely, given the restarts and ongoing map changes, etc), or an OSD bug. I think we want logs from both osd.0 and osd.3 at the same time, from what I'm seeing. :) -Greg Just to make sure all bases are covered: http://damoxc.net/ceph/ceph-logs-20120412142537.tar.gz This contains all 5 osd logs and all 3 monitor logs, everything restarted with debug logging prior to capturing the logs. I (and Sam) spent some time looking at this very closely. It continues to tell me that the OSD and the monitor are disagreeing on whether osd 3 should be in the pg temp set for some things, but they seem to agree on everything else…. Can you zip up for me: 1) The files matching osdmap* of osd0's store from the current/meta/ directory, 2) The contents of your lead monitor's osdmap and osdmap_full directories? Here they are http://damoxc.net/ceph/osdmap.0.tar.gz http://damoxc.net/ceph/mon.node21.osdmap.tar.gz Hopefully I got the right files :) Yep! We looked into this more today and have discovered some definite oddness. Have you by any chance tried to change the number of PGs in your pools?
Re: pgs stuck inactive
On Thursday, April 12, 2012 at 8:29 AM, Damien Churchill wrote: On 11 April 2012 00:40, Greg Farnum gregory.far...@dreamhost.com wrote: A quick glance through these shows that all the pg_temp requests aren't actually requesting any changes from the monitor. It's either a very serious mon bug which happened a while ago (unlikely, given the restarts and ongoing map changes, etc), or an OSD bug. I think we want logs from both osd.0 and osd.3 at the same time, from what I'm seeing. :) -Greg Just to make sure all bases are covered: http://damoxc.net/ceph/ceph-logs-20120412142537.tar.gz This contains all 5 osd logs and all 3 monitor logs, everything restarted with debug logging prior to capturing the logs. I (and Sam) spent some time looking at this very closely. It continues to tell me that the OSD and the monitor are disagreeing on whether osd 3 should be in the pg temp set for some things, but they seem to agree on everything else…. Can you zip up for me: 1) The files matching osdmap* of osd0's store from the current/meta/ directory, 2) The contents of your lead monitor's osdmap and osdmap_full directories? We can check these for differences and then run them through some of our tools and stuff to try and identify the issue. Thanks! -Greg
Re: wip-librbd-caching
On Thursday, April 12, 2012 at 12:45 PM, Sage Weil wrote: On Thu, 12 Apr 2012, Martin Mailand wrote: The other point is that the cache is not KSM enabled, therefore identical pages will not be merged. Could that be changed, and what would be the downside? So maybe we could reduce the memory footprint of the cache, but keep its performance. I'm not familiar with the performance implications of KSM, but the objectcacher doesn't modify existing buffers in place, so I suspect it's a good candidate. And it looks like there's minimal effort in enabling it... But if you're supposed to advise the kernel that the memory is a good candidate, then probably we shouldn't be making that madvise call on every buffer (I imagine it's doing a sha1 on each page and then examining a tree) — especially since we (probably) flush all that data out relatively quickly. And RBD doesn't currently have any information about whether the data is OS or user data… (I guess in future, with layering, we could call madvise on pages which were read from an underlying gold image.) Also, TV is wondering if the data is even page-aligned or not? I can't recall off-hand. -Greg
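For context, opting a buffer into KSM is a single madvise() call on a page-aligned, anonymously mapped region, which is also why the page-alignment question matters. A minimal Linux-only sketch (hypothetical helper, not objectcacher code):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper: allocate a page-aligned anonymous buffer and mark
// it as a KSM merge candidate. MADV_MERGEABLE needs a kernel built with
// CONFIG_KSM; if the advice is refused the buffer still works unmerged.
void* alloc_mergeable(std::size_t len) {
  // mmap returns page-aligned memory, satisfying KSM's alignment needs.
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return nullptr;
  // Tell ksmd it may scan this region and merge identical pages.
  madvise(p, len, MADV_MERGEABLE);
  return p;
}
```

Note the cost Greg alludes to: ksmd checksums and tree-walks every advised page, so advising short-lived writeback buffers mostly burns CPU for data that is flushed before it could ever be merged.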
Re: [PATCH] Statically binding ports for ceph-osd
You're unlikely to hit it since you're setting all addresses, but we somehow managed to introduce an error even in that small patch -- you may want to pull in commit cd4a760e9b22047fa5a45d0211ec4130809d725e as well. -Greg On Tuesday, April 10, 2012 at 5:13 PM, Nick Bartos wrote: Good enough for me, I'll just patch it for the short term. Thanks! On Tue, Apr 10, 2012 at 4:51 PM, Sage Weil s...@newdream.net wrote: On Tue, 10 Apr 2012, Nick Bartos wrote: Awesome, thanks so much! Can I assume this will make it into the next ceph stable release? I'll probably just backport it now before we actually start using it, so I don't have to change the config later. v0.45 is out today/tomorrow, but it'll be in v0.46. sage On Tue, Apr 10, 2012 at 4:16 PM, Greg Farnum gregory.far...@dreamhost.com wrote: Yep, you're absolutely correct. Might as well let users specify the whole address rather than just the port, though; since your patch won't apply to current upstream due to some heartbeating changes I whipped up another one which adds the osd heartbeat addr option. It's pushed to master in commit 6fbac10dc68e67d1c700421f311cf5e26991d39c, but you'll want to backport (easy) or carry your change until you upgrade (and remember to change the config!). :) Thanks for the report! -Greg On Tuesday, April 10, 2012 at 12:56 PM, Nick Bartos wrote: After doing some more looking at the code, it appears that this option is not supported. I created a small patch (attached) which adds the functionality. Is there any way we could get this, or something like this, applied upstream? I think this is important functionality for firewalled environments, and seems like a simple fix since all the other services (including ones for ceph-mon and ceph-mds) already allow you to specify a static port. 
On Mon, Apr 9, 2012 at 5:27 PM, Nick Bartos n...@pistoncloud.com wrote: I'm trying to get ceph-osd's listening ports to be set statically for firewall reasons. I am able to get 2 of the 3 ports set statically, however the 3rd one is still getting set dynamically. I am using: [osd.48] host = 172.16.0.13 cluster addr = 172.16.0.13:6944 public addr = 172.16.0.13:6945 The daemon will successfully bind to 6944 and 6945, but also binds to 6800. What additional option do I need? I started looking at the code and thought hb addr = 172.16.0.13:6946 would do it, but specifying that option seems to have no effect (or at least does not achieve the desired result). Attachments: - ceph-0.41-osd_hb_port.patch
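Putting the thread together, a ceph.conf fragment pinning all three sockets might look like this. The osd heartbeat addr spelling is the option Greg says he added in 6fbac10d; the exact port values are just the ones from Nick's example:

```ini
[osd.48]
    host = 172.16.0.13
    ; cluster-internal (replication) socket
    cluster addr = 172.16.0.13:6944
    ; client-facing socket
    public addr = 172.16.0.13:6945
    ; heartbeat socket, available once the patch is in place
    osd heartbeat addr = 172.16.0.13:6946
```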
Re: Make libcephfs return error when unmounted?
On Wednesday, April 11, 2012 at 11:25 AM, Noah Watkins wrote: On Apr 11, 2012, at 11:22 AM, Greg Farnum wrote: On Wednesday, April 11, 2012 at 11:12 AM, Noah Watkins wrote: Hi all, -Noah I'm not sure where the -1004 came from, ceph_mount(..) seems to return some random error codes (-1000, -1001) already :) <grumble>legacy undocumented grr</grumble> Let's try to use standard error codes where available, and (if we have to create our own) document any new ones with user-accessible names and explanations. I don't know which one is best but I see a lot of applicable choices when scanning errno-base et al. Also, what Yehuda said. :) -Greg
Re: Make libcephfs return error when unmounted?
On Wednesday, April 11, 2012 at 2:59 PM, Noah Watkins wrote: On Apr 11, 2012, at 11:22 AM, Yehuda Sadeh Weinraub wrote: Also need to check that cmount is initialized. I'd add a helper: Client *ceph_get_client(struct ceph_mount_info *cmount) { if (cmount && cmount->is_mounted()) return cmount->get_client(); return NULL; } How useful is checking cmount != NULL here? This defensive check depends on users initializing their cmount pointers to NULL, but the API doesn't do anything to require this initialization assumption. - Noah I had a whole email going until I realized you were just right. So, yeah, that wouldn't do anything since a cmount they forgot to have the API initialize is just going to hold random data. Urgh. -Greg
Re: Make libcephfs return error when unmounted?
On Wednesday, April 11, 2012 at 3:34 PM, Yehuda Sadeh Weinraub wrote: On Wed, Apr 11, 2012 at 3:13 PM, Greg Farnum gregory.far...@dreamhost.com wrote: On Wednesday, April 11, 2012 at 2:59 PM, Noah Watkins wrote: On Apr 11, 2012, at 11:22 AM, Yehuda Sadeh Weinraub wrote: Also need to check that cmount is initialized. I'd add a helper: Client *ceph_get_client(struct ceph_mount_info *cmount) { if (cmount && cmount->is_mounted()) return cmount->get_client(); return NULL; } How useful is checking cmount != NULL here? This defensive check depends on users initializing their cmount pointers to NULL, but the API doesn't do anything to require this initialization assumption. - Noah I had a whole email going until I realized you were just right. So, yeah, that wouldn't do anything since a cmount they forgot to have the API initialize is just going to hold random data. Urgh. There's no destructor either, maybe it's a good time to add one? Yehuda Actually, there is. The problem is that to the client it's an opaque pointer under many (most?) circumstances, so that it can be used by C users. -Greg
Re: [PATCH] Statically binding ports for ceph-osd
Yep, you're absolutely correct. Might as well let users specify the whole address rather than just the port, though — since your patch won't apply to current upstream due to some heartbeating changes I whipped up another one which adds the osd heartbeat addr option. It's pushed to master in commit 6fbac10dc68e67d1c700421f311cf5e26991d39c, but you'll want to backport (easy) or carry your change until you upgrade (and remember to change the config!). :) Thanks for the report! -Greg On Tuesday, April 10, 2012 at 12:56 PM, Nick Bartos wrote: After doing some more looking at the code, it appears that this option is not supported. I created a small patch (attached) which adds the functionality. Is there any way we could get this, or something like this, applied upstream? I think this is important functionality for firewalled environments, and seems like a simple fix since all the other services (including ones for ceph-mon and ceph-mds) already allow you to specify a static port. On Mon, Apr 9, 2012 at 5:27 PM, Nick Bartos n...@pistoncloud.com wrote: I'm trying to get ceph-osd's listening ports to be set statically for firewall reasons. I am able to get 2 of the 3 ports set statically, however the 3rd one is still getting set dynamically. I am using: [osd.48] host = 172.16.0.13 cluster addr = 172.16.0.13:6944 public addr = 172.16.0.13:6945 The daemon will successfully bind to 6944 and 6945, but also binds to 6800. What additional option do I need? I started looking at the code and thought hb addr = 172.16.0.13:6946 would do it, but specifying that option seems to have no effect (or at least does not achieve the desired result). Attachments: - ceph-0.41-osd_hb_port.patch
Re: pgs stuck inactive
On Tuesday, April 10, 2012 at 4:00 PM, Samuel Just wrote: Can you send along the osd log as well for comparison? -Sam On Tue, Apr 10, 2012 at 3:03 PM, Damien Churchill dam...@gmail.com wrote: Here are the monitor logs, they're from the monitor starts, however I restarted the osd shortly afterwards and left it to run for 5 or so minutes. http://damoxc.net/ceph/mon.node21.log.gz http://damoxc.net/ceph/mon.node22.log.gz http://damoxc.net/ceph/mon.node23.log.gz On 10 April 2012 22:49, Samuel Just sam.j...@dreamhost.com wrote: Nothing apparent from the backtrace. I need monitor logs from when the osd is sending pg_temp requests. Can you restart the osd and post the osd and all three monitor logs from when you restarted the osd? You'll have to enable monitor logging on all three. -Sam A quick glance through these shows that all the pg_temp requests aren't actually requesting any changes from the monitor. It's either a very serious mon bug which happened a while ago (unlikely, given the restarts and ongoing map changes, etc), or an OSD bug. I think we want logs from both osd.0 and osd.3 at the same time, from what I'm seeing. :) -Greg
Re: [PATCH] Statically binding ports for ceph-osd
I think we've already branched off 0.45, so it'll have to wait until 0.46 unless we decide to pull it over. Sage could probably be talked into it if you ask nicely? -Greg On Tuesday, April 10, 2012 at 4:45 PM, Nick Bartos wrote: Awesome, thanks so much! Can I assume this will make it into the next ceph stable release? I'll probably just backport it now before we actually start using it, so I don't have to change the config later. On Tue, Apr 10, 2012 at 4:16 PM, Greg Farnum gregory.far...@dreamhost.com wrote: Yep, you're absolutely correct. Might as well let users specify the whole address rather than just the port, though — since your patch won't apply to current upstream due to some heartbeating changes I whipped up another one which adds the osd heartbeat addr option. It's pushed to master in commit 6fbac10dc68e67d1c700421f311cf5e26991d39c, but you'll want to backport (easy) or carry your change until you upgrade (and remember to change the config!). :) Thanks for the report! -Greg On Tuesday, April 10, 2012 at 12:56 PM, Nick Bartos wrote: After doing some more looking at the code, it appears that this option is not supported. I created a small patch (attached) which adds the functionality. Is there any way we could get this, or something like this, applied upstream? I think this is important functionality for firewalled environments, and seems like a simple fix since all the other services (including ones for ceph-mon and ceph-mds) already allow you to specify a static port. On Mon, Apr 9, 2012 at 5:27 PM, Nick Bartos n...@pistoncloud.com wrote: I'm trying to get ceph-osd's listening ports to be set statically for firewall reasons. I am able to get 2 of the 3 ports set statically, however the 3rd one is still getting set dynamically. 
I am using: [osd.48] host = 172.16.0.13 cluster addr = 172.16.0.13:6944 public addr = 172.16.0.13:6945 The daemon will successfully bind to 6944 and 6945, but also binds to 6800. What additional option do I need? I started looking at the code and thought hb addr = 172.16.0.13:6946 would do it, but specifying that option seems to have no effect (or at least does not achieve the desired result). Attachments: - ceph-0.41-osd_hb_port.patch