Re: Persistence of completed_requests in sessionmap (do we need it?)

2015-03-04 Thread Greg Farnum
Just forwarding the replies to the list as it looks like it got blocked due to 
accidental HTML.
-Greg


 On Mar 4, 2015, at 6:20 AM, John Spray john.sp...@redhat.com wrote:
 
 On 04/03/2015 12:14, Yan Zheng (严正) wrote:
 
 On Mar 4, 2015, at 05:39, John Spray john.sp...@redhat.com wrote:
 
 During replay, we rebuild completed_requests from EMetaBlob::replay, and 
 we've made it this far without reliably persisting it in sessionmap, so I 
 wonder if we ever needed to save this at all? Thoughts?
 
 I think we need to save completed_requests for corner cases. Consider the
 following scenario:
 
 Client A sends a setattr request to the MDS.
 The MDS handles the request and sends a reply to the client, but the network between the MDS
 and client A becomes disconnected.
 The MDS handles lots of setattr requests from other clients; the log entry for the
 first setattr request gets trimmed.
 The MDS crashes, and a standby MDS on another host takes over.
 Client A re-sends the setattr request to the new MDS.
 Ah, this makes sense.  I suspect we never saw that scenario in the automated 
 tests because they almost all just use a single client.
 
 John
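
A minimal sketch of the mechanism at stake (illustrative names, not the MDS code): if the recovered MDS does not remember which request ids it has already applied, the re-sent setattr in the scenario above gets applied a second time instead of just being acknowledged.

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <set>

  // Illustrative only: a server that remembers which request ids it has
  // already applied, so a client that never saw the reply can safely re-send.
  struct Server {
    // per-client set of completed request ids (the analogue of completed_requests)
    std::map<uint64_t, std::set<uint64_t>> completed;   // client id -> req ids
    int attr_version = 0;

    void handle_setattr(uint64_t client, uint64_t reqid) {
      if (completed[client].count(reqid)) {
        std::cout << "req " << reqid << " already applied, just replying again\n";
        return;                         // idempotent: ack without re-applying
      }
      ++attr_version;                   // apply the update exactly once
      completed[client].insert(reqid);
      std::cout << "applied req " << reqid << ", attr_version=" << attr_version << "\n";
    }
  };

  int main() {
    Server mds;
    mds.handle_setattr(1, 100);         // original request
    mds.handle_setattr(1, 100);         // client re-sends after losing the reply
    // If 'completed' were lost on failover (neither persisted nor rebuilt from
    // the journal), the second call would bump attr_version again.
  }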
 



Re: About in_seq, out_seq in Messenger

2015-02-23 Thread Greg Farnum
On Feb 12, 2015, at 9:17 PM, Haomai Wang haomaiw...@gmail.com wrote:
 
 On Fri, Feb 13, 2015 at 1:26 AM, Greg Farnum gfar...@redhat.com wrote:
 Sorry for the delayed response.
 
 On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:
 
 Hmm, I got it.
 
 There is another problem that I'm not sure is captured by the upper layer:
 
 two monitor nodes (A, B) connected with the lossless_peer_reuse policy:
 1. lots of messages have been transmitted
 2. markdown A
 
 I don’t think monitors ever mark each other down?
 
 3. restart A and call send_message(message will be in out_q)
 
 oh, maybe you just mean rebooting it, not an interface thing, okay...
 
 4. network error injected and A failed to build a *session* with B
 5. because of policy and in_queue() == true, we will reconnect in writer()
 6. connect_seq++ and try to reconnect
 
 I think you’re wrong about this step. The messenger won’t increment 
 connect_seq directly in writer() because it will be in STATE_CONNECTING, so 
 it just calls connect() directly.
 connect() doesn’t increment the connect_seq unless it successfully finishes 
 a session negotiation.
 
 Hmm, sorry. I checked the log again. Actually A doesn't have any message
 in its queue, so it will enter the standby state and increase connect_seq. It
 will not be in *STATE_CONNECTING*.
 

Urgh, that case does seem broken, yes. I take it this is something you’ve 
actually run across?

It looks like that connect_seq++ was added by 
https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25, 
which makes me think we might just be able to increment connect_seq 
appropriately in the connect() function when we need to do so (on replacement, I 
assume). Would you like to look at that, and at how this change might impact the 
peer with regard to the referenced assert failure?

-A very slow-to-reply and apologetic Greg
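
To make the broken case concrete, here is a toy model of the counters involved in the scenario above; it is purely illustrative (the field names and flow are assumptions, not the SimpleMessenger code):

  #include <cstdint>
  #include <iostream>

  // Toy model only: two endpoints tracking how many sessions they think they
  // have negotiated (connect_seq) and the last message seq received (in_seq).
  struct Side {
    uint64_t connect_seq = 0;
    uint64_t in_seq = 0;
  };

  int main() {
    Side a, b;
    b.connect_seq = 5;
    b.in_seq = 1000;            // B received lots of messages before A rebooted

    a = Side{};                 // A reboots: both of its counters restart at 0

    // The first reconnect attempt fails before negotiation completes, and (per
    // the log below) the standby/retry path bumps connect_seq anyway.
    a.connect_seq++;

    // When A reconnects, B uses connect_seq == 0 as its "the peer was reset"
    // signal, so with connect_seq == 1 it keeps its stale expectations.
    bool b_detects_reset = (a.connect_seq == 0);
    std::cout << "B detects remote reset: " << b_detects_reset << "\n";   // prints 0

    // B therefore hands its old in_seq back during the handshake, while the
    // freshly started A has nothing outstanding: the mismatch behind the
    // Pipe.cc:1154-style assert discussed here.
    std::cout << "B expects seq " << b.in_seq
              << ", A's counters after reboot: in_seq=" << a.in_seq << "\n";
  }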

 
 2015-02-13 06:19:22.240788 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).writer: state = connecting policy.server=0
 2015-02-13 06:19:22.240801 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect 0
 2015-02-13 06:19:22.240821 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :0 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connecting to 127.0.0.1:16800/22032
 2015-02-13 06:19:22.398009 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect read peer addr 127.0.0.1:16800/22032 on socket 91
 2015-02-13 06:19:22.398026 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect peer addr for me is 127.0.0.1:36265/0
 2015-02-13 06:19:22.398066 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect sent my addr 127.0.0.1:16813/22045
 2015-02-13 06:19:22.398089 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect sending gseq=8 cseq=0 proto=24
 2015-02-13 06:19:22.398115 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect wrote (self +) cseq, waiting for reply
 2015-02-13 06:19:22.398137 7fdd147c7700  2 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect read reply (0) Success
 2015-02-13 06:19:22.398155 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060). sleep for 0.1
 2015-02-13 06:19:22.498243 7fdd147c7700  2 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).fault (0) Success
 2015-02-13 06:19:22.498275 7fdd147c7700  0 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).fault with nothing to send, going to standby
 2015-02-13 06:19:22.498290 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0
 c=0x3ed2060).writer: state = standby policy.server=0
 2015-02-13 06:19:22.498301 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0
 c=0x3ed2060).writer sleeping
 2015-02-13 06:19:22.526116 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0
 c=0x3ed2060).writer: state = standby policy.server=0
 2015-02-13 06:19:22.526132 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=1 l=0
 c=0x3ed2060).connect 1
 2015-02-13 06:19:22.526158 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=47 :36265 s=1 pgs=0 cs=1 l=0
 c

Re: About in_seq, out_seq in Messenger

2015-02-12 Thread Greg Farnum
Sorry for the delayed response.

 On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:
 
 Hmm, I got it.
 
 There is another problem that I'm not sure is captured by the upper layer:
 
 two monitor nodes (A, B) connected with the lossless_peer_reuse policy:
 1. lots of messages have been transmitted
 2. markdown A

I don’t think monitors ever mark each other down?

 3. restart A and call send_message(message will be in out_q)

oh, maybe you just mean rebooting it, not an interface thing, okay...

 4. network error injected and A failed to build a *session* with B
 5. because of policy and in_queue() == true, we will reconnect in writer()
 6. connect_seq++ and try to reconnect

I think you’re wrong about this step. The messenger won’t increment connect_seq 
directly in writer() because it will be in STATE_CONNECTING, so it just calls 
connect() directly.
connect() doesn’t increment the connect_seq unless it successfully finishes a 
session negotiation.

Unless I’m missing something? :)
-Greg

 7. because connect_seq != 0, B can't detect the remote reset, and
 in_seq (a large value) will be exchanged and cause A to
 crash (Pipe.cc:1154)
 
 So I guess we can't increase connect_seq when reconnecting? We need to
 let the peer side detect the remote reset via connect_seq == 0.
 
 
 
 On Tue, Feb 10, 2015 at 12:00 AM, Gregory Farnum gfar...@redhat.com wrote:
 - Original Message -
 From: Haomai Wang haomaiw...@gmail.com
 To: Gregory Farnum gfar...@redhat.com
 Cc: Sage Weil sw...@redhat.com, ceph-devel@vger.kernel.org
 Sent: Friday, February 6, 2015 8:16:42 AM
 Subject: Re: About in_seq, out_seq in Messenger
 
 On Fri, Feb 6, 2015 at 10:47 PM, Gregory Farnum gfar...@redhat.com wrote:
 - Original Message -
 From: Haomai Wang haomaiw...@gmail.com
 To: Sage Weil sw...@redhat.com, Gregory Farnum g...@inktank.com
 Cc: ceph-devel@vger.kernel.org
 Sent: Friday, February 6, 2015 12:26:18 AM
 Subject: About in_seq, out_seq in Messenger
 
 Hi all,
 
 Recently we enabled an async messenger test job in the test
 lab (http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
 We hit many failed asserts, mostly:
  assert(0 == "old msgs despite reconnect_seq feature");
 
 The asserting connections are all on the cluster messenger, which means they
 are OSD-internal connections. The policy associated with these connections is
 Messenger::Policy::lossless_peer.
 
 So when I dove into this problem, I found something confusing about
 this. Suppose these steps:
 1. the lossless_peer policy is used by the connections on both sides.
 2. mark down one side (either one); the peer connection will try to reconnect
 3. then we restart the failed side; a new connection is built, but the
 initiator will think it's an old connection and so sends in_seq (10)
 4. the newly started connection has no message in its queue; it receives the
 peer connection's in_seq (10) and calls discard_requeued_up_to(10), but
 because there is no message in the queue, it doesn't modify anything
 5. now when either side issues a message, it triggers assert(0 == "old
 msgs despite reconnect_seq feature");
 
 I can replay these steps in a unit test, and it's actually hit in the test lab
 for the async messenger, which follows the simple messenger's design.
 
 Besides, if we enable reset_check here, was_session_reset will be
 called and it will randomize out_seq, so it will certainly hit assert(0
 == "skipped incoming seq").
 
 Anything wrong above?
 
 Sage covered most of this. I'll just add that the last time I checked it, I
 came to the conclusion that the code to use a random out_seq on initial
 connect was non-functional. So there definitely may be issues there.
 
 In fact, we've fixed a couple (several?) bugs in this area since Firefly
 was initially released, so if you go over the point release
 SimpleMessenger patches you might gain some insight. :)
 -Greg
 
 If we want to make random out_seq functional, I think we need to
 exchange out_seq when handshaking too. Otherwise, we need to give it
 up.
 
 Possibly. Or maybe we just need to weaken our asserts to infer it on initial 
 messages?
 
 
 Another question, do you think reset_check=true is always good for
 osd internal connection?
 
 Huh? resetcheck is false for lossless peer connections.
 
 
 Letting the Messenger rely on the upper layer may not be a good idea, so maybe we can
 enhance the in_seq exchange process (ensure that on each side
 in_seq + sent.size() == out_seq). With the current handshake implementation, it's
 not easy to insert more actions into the in_seq exchange, because the
 session has already been built regardless of the result of the in_seq
 exchange.
 
 If we enable reset_check=true, it looks like we can solve most of these
 incorrect-seq/out-of-sync problems?
 
 Oh, I see what you mean.
 Yeah, the problem here is a bit of a mismatch in the interfaces. OSDs are 
 lossless peers with each other, they should not miss any messages, and 
 they don't ever go away. Except of course sometimes they do go away, if one 
 of them dies. This is supposed to be handled by marking it down, but it 
 turns out the race conditions 
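
The invariant Haomai proposes above, in_seq + sent.size() == out_seq on each side, can be illustrated in isolation; this is a toy model with made-up names, not the Pipe handshake code:

  #include <cstdint>
  #include <deque>
  #include <iostream>

  // Illustrative only: the sender keeps every message it has not had acked in
  // 'sent'.  If the peer reports the last seq it received (peer_in_seq), then
  // for a healthy lossless session:  peer_in_seq + sent.size() == out_seq.
  struct Sender {
    uint64_t out_seq = 0;            // seq of the last message we queued/sent
    std::deque<uint64_t> sent;       // seqs sent but not yet acked

    bool seqs_consistent(uint64_t peer_in_seq) const {
      return peer_in_seq + sent.size() == out_seq;
    }

    void send() { sent.push_back(++out_seq); }
    void ack(uint64_t peer_in_seq) {
      while (!sent.empty() && sent.front() <= peer_in_seq)
        sent.pop_front();
    }
  };

  int main() {
    Sender s;
    for (int i = 0; i < 10; i++) s.send();
    s.ack(7);                                        // peer saw messages 1..7
    std::cout << s.seqs_consistent(7) << "\n";       // 1: 7 + 3 == 10
    std::cout << s.seqs_consistent(1000) << "\n";    // 0: the out-of-sync case
  }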

Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-20 Thread Greg Farnum
On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
 On 03/20/2013 02:15 PM, Sage Weil wrote:
  On Wed, 20 Mar 2013, Yan, Zheng wrote:
   On 03/20/2013 07:09 AM, Greg Farnum wrote:
Hmm, this is definitely narrowing the race (probably enough to never 
hit it), but it's not actually eliminating it (if the restart happens 
after 4 billion requests?). More importantly this kind of symptom makes 
me worry that we might be papering over more serious issues with 
colliding states in the Table on restart.
I don't have the MDSTable semantics in my head so I'll need to look 
into this later unless somebody else volunteers to do so?



   Not just 4 billion requests: an MDS restart has several stages, and the mdsmap epoch
   increases for each stage. I don't think there are any more colliding
   states in the table. The table client/server use two-phase commit; it's
   similar to a client request that involves multiple MDSes. The reqid is
   analogous to the client request id. The difference is that the client request ID is
   unique because a new client always gets a unique session id.
   
   
   
  Each time a tid is consumed (at least for an update) it is journaled in  
  the EMetaBlob::table_tids list, right? So we could actually take a max  
  from journal replay and pick up where we left off? That seems like the  
  cleanest.
   
  I'm not too worried about 2^32 tids, I guess, but it would be nicer to  
  avoid that possibility.
  
  
  
 Can we re-use the client request ID as table client request ID ?
  
 Regards
 Yan, Zheng

Not sure what you're referring to here — do you mean the ID of the filesystem 
client request which prompted the update? I don't think that would work as 
client requests actually require two parts to be unique (the client GUID and 
the request seq number), and I'm pretty sure a single client request can spawn 
multiple Table updates.

As I look over this more, it sure looks to me as if the effect of the code we 
have (when non-broken) is to rollback every non-committed request by an MDS 
which restarted — the only time it can handle the TableServer's agree with a 
different response is if the MDS was incorrectly marked out by the map. Am I 
parsing this correctly, Sage? Given that, and without having looked at the code 
more broadly, I think we want to add some sort of implicit or explicit 
handshake letting each of them know if the MDS actually disappeared. We use the 
process/address nonce to accomplish this in other places…
-Greg
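
Sketching the two options discussed above side by side (illustrative code, not the MDSTableClient implementation; function names are made up): the approach in the patch under discussion, prefixing reqids with the mdsmap epoch (last_reqid = epoch << 32), and Sage's suggestion of resuming from the highest tid seen during journal replay.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Option 1 (the patch): make reqids unique across restarts by putting the
  // mdsmap epoch in the high 32 bits, so a restarted client cannot collide
  // with ids issued before the restart (until the low 32 bits wrap).
  uint64_t first_reqid_for_epoch(uint32_t mdsmap_epoch) {
    return static_cast<uint64_t>(mdsmap_epoch) << 32;
  }

  // Option 2 (Sage's suggestion above): recover the highest tid recorded in
  // the journal (e.g. from EMetaBlob::table_tids during replay) and continue.
  uint64_t next_reqid_after_replay(const std::vector<uint64_t>& journaled_tids) {
    if (journaled_tids.empty())
      return 1;
    return *std::max_element(journaled_tids.begin(), journaled_tids.end()) + 1;
  }

  int main() {
    uint64_t a = first_reqid_for_epoch(7);            // 7 << 32
    uint64_t b = next_reqid_after_replay({3, 9, 12}); // 13
    return (a > 0 && b == 13) ? 0 : 1;
  }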



Re: [PATCH 01/39] mds: preserve subtree bounds until slave commit

2013-03-20 Thread Greg Farnum
Reviewed-by: Greg Farnum g...@inktank.com 

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 When replaying an operation that renames a directory inode into a non-auth subtree,
 if the inode has subtree bounds, we should prevent them from being trimmed
 until the slave commit.
 
 This patch also fixes a bug in ESlaveUpdate::replay(). EMetaBlob::replay()
 should be called before MDCache::finish_uncommitted_slave_update().
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDCache.cc | 21 +++--
 src/mds/Mutation.h | 5 ++---
 src/mds/journal.cc | 13 +
 3 files changed, 22 insertions(+), 17 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index fddcfc6..684e70b 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3016,10 +3016,10 @@ void 
 MDCache::add_uncommitted_slave_update(metareqid_t reqid, int master, MDSlav
 {
 assert(uncommitted_slave_updates[master].count(reqid) == 0);
 uncommitted_slave_updates[master][reqid] = su;
 - if (su-rename_olddir)
 - uncommitted_slave_rename_olddir[su-rename_olddir]++;
 + for(setCDir*::iterator p = su-olddirs.begin(); p != su-olddirs.end(); 
 ++p)
 + uncommitted_slave_rename_olddir[*p]++;
 for(setCInode*::iterator p = su-unlinked.begin(); p != su-unlinked.end(); 
 ++p)
 - uncommitted_slave_unlink[*p]++;
 + uncommitted_slave_unlink[*p]++;
 }
 
 void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
 @@ -3031,11 +3031,12 @@ void 
 MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
 if (uncommitted_slave_updates[master].empty())
 uncommitted_slave_updates.erase(master);
 // discard the non-auth subtree we renamed out of
 - if (su-rename_olddir) {
 - uncommitted_slave_rename_olddir[su-rename_olddir]--;
 - if (uncommitted_slave_rename_olddir[su-rename_olddir] == 0) {
 - uncommitted_slave_rename_olddir.erase(su-rename_olddir);
 - CDir *root = get_subtree_root(su-rename_olddir);
 + for(setCDir*::iterator p = su-olddirs.begin(); p != su-olddirs.end(); 
 ++p) {
 + CDir *dir = *p;
 + uncommitted_slave_rename_olddir[dir]--;
 + if (uncommitted_slave_rename_olddir[dir] == 0) {
 + uncommitted_slave_rename_olddir.erase(dir);
 + CDir *root = get_subtree_root(dir);
 if (root-get_dir_auth() == CDIR_AUTH_UNDEF)
 try_trim_non_auth_subtree(root);
 }
 @@ -6052,8 +6053,8 @@ bool MDCache::trim_non_auth_subtree(CDir *dir)
 {
 dout(10)  trim_non_auth_subtree(  dir  )   *dir  dendl;
 
 - // preserve the dir for rollback
 - if (uncommitted_slave_rename_olddir.count(dir))
 + if (uncommitted_slave_rename_olddir.count(dir) || // preserve the dir for 
 rollback
 + my_ambiguous_imports.count(dir-dirfrag()))
 return true;
 
 bool keep_dir = false;
 diff --git a/src/mds/Mutation.h b/src/mds/Mutation.h
 index 55b84eb..5013f04 100644
 --- a/src/mds/Mutation.h
 +++ b/src/mds/Mutation.h
 @@ -315,13 +315,12 @@ struct MDSlaveUpdate {
 bufferlist rollback;
 elistMDSlaveUpdate*::item item;
 Context *waiter;
 - CDir* rename_olddir;
 + setCDir* olddirs;
 setCInode* unlinked;
 MDSlaveUpdate(int oo, bufferlist rbl, elistMDSlaveUpdate* list) :
 origop(oo),
 item(this),
 - waiter(0),
 - rename_olddir(0) {
 + waiter(0) {
 rollback.claim(rbl);
 list.push_back(item);
 }
 diff --git a/src/mds/journal.cc b/src/mds/journal.cc
 index 5b3bd71..3375e40 100644
 --- a/src/mds/journal.cc
 +++ b/src/mds/journal.cc
 @@ -1131,10 +1131,15 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, 
 MDSlaveUpdate *slaveup)
 if (olddir) {
 if (olddir-authority() != CDIR_AUTH_UNDEF 
 renamed_diri-authority() == CDIR_AUTH_UNDEF) {
 + assert(slaveup); // auth to non-auth, must be slave prepare
 listfrag_t leaves;
 renamed_diri-dirfragtree.get_leaves(leaves);
 - for (listfrag_t::iterator p = leaves.begin(); p != leaves.end(); ++p)
 - renamed_diri-get_or_open_dirfrag(mds-mdcache, *p);
 + for (listfrag_t::iterator p = leaves.begin(); p != leaves.end(); ++p) {
 + CDir *dir = renamed_diri-get_or_open_dirfrag(mds-mdcache, *p);
 + // preserve subtree bound until slave commit
 + if (dir-authority() == CDIR_AUTH_UNDEF)
 + slaveup-olddirs.insert(dir);
 + }
 }
 
 mds-mdcache-adjust_subtree_after_rename(renamed_diri, olddir, false);
 @@ -1143,7 +1148,7 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, 
 MDSlaveUpdate *slaveup)
 CDir *root = mds-mdcache-get_subtree_root(olddir);
 if (root-get_dir_auth() == CDIR_AUTH_UNDEF) {
 if (slaveup) // preserve the old dir until slave commit
 - slaveup-rename_olddir = olddir;
 + slaveup-olddirs.insert(olddir);
 else
 mds-mdcache-try_trim_non_auth_subtree(root);
 }
 @@ -2122,10 +2127,10 @@ void ESlaveUpdate::replay(MDS *mds)
 case
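
Reduced to its pattern, the patch above pins each old directory of a slave update with a reference count and has the trim path skip anything still pinned; a minimal sketch with illustrative names (not the MDCache interfaces):

  #include <map>
  #include <set>

  struct Dir {};   // stand-in for CDir

  // Illustrative sketch of the refcount-and-skip-trim pattern.
  struct Cache {
    std::map<Dir*, int> pinned_olddirs;   // analogue of uncommitted_slave_rename_olddir

    void add_uncommitted(const std::set<Dir*>& olddirs) {
      for (Dir* d : olddirs)
        pinned_olddirs[d]++;              // pin every old dir of the slave update
    }

    void finish_uncommitted(const std::set<Dir*>& olddirs) {
      for (Dir* d : olddirs) {
        if (--pinned_olddirs[d] == 0) {
          pinned_olddirs.erase(d);        // last reference gone; now trimmable
          // ...the real code then re-checks the subtree root and trims it...
        }
      }
    }

    bool can_trim(Dir* d) const {
      return pinned_olddirs.count(d) == 0;  // trim_non_auth_subtree skips pinned dirs
    }
  };

  int main() {
    Cache c;
    Dir d;
    c.add_uncommitted({&d});
    bool before = c.can_trim(&d);   // false: preserved until the slave commit
    c.finish_uncommitted({&d});
    bool after = c.can_trim(&d);    // true
    return (!before && after) ? 0 : 1;
  }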

Re: [PATCH 03/39] mds: fix MDCache::adjust_bounded_subtree_auth()

2013-03-20 Thread Greg Farnum
Reviewed-by: Greg Farnum g...@inktank.com


Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 There are cases that need to both create a new bound and swallow an intervening
 subtree. For example: an MDS exports subtree A with bound B and imports
 subtree B with bound C at the same time. The MDS crashes; exporting
 subtree A fails, but importing subtree B succeeds. During recovery, the
 MDS may need to create the new bound C and swallow subtree B.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDCache.cc | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 684e70b..19dc60b 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -980,15 +980,21 @@ void MDCache::adjust_bounded_subtree_auth(CDir *dir, 
 setCDir* bounds, pairin
 }
 else {
 dout(10)   want bound   *bound  dendl;
 + CDir *t = get_subtree_root(bound-get_parent_dir());
 + if (subtrees[t].count(bound) == 0) {
 + assert(t != dir);
 + dout(10)   new bound   *bound  dendl;
 + adjust_subtree_auth(bound, t-authority());
 + }
 // make sure it's nested beneath ambiguous subtree(s)
 while (1) {
 - CDir *t = get_subtree_root(bound-get_parent_dir());
 - if (t == dir) break;
 while (subtrees[dir].count(t) == 0)
 t = get_subtree_root(t-get_parent_dir());
 dout(10)   swallowing intervening subtree at   *t  dendl;
 adjust_subtree_auth(t, auth);
 try_subtree_merge_at(t);
 + t = get_subtree_root(bound-get_parent_dir());
 + if (t == dir) break;
 }
 }
 }
 -- 
 1.7.11.7





Re: [PATCH 05/39] mds: send table request when peer is in proper state.

2013-03-20 Thread Greg Farnum
This and patch 6 are probably going to get dealt with as part of our 
conversation on patch 4 and restart of the TableServers. 

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 The table client/server should send a request/reply only when the peer is active.
 Anchor query is an exception: an MDS in the rejoin stage may need to
 fetch files before sending the rejoin ack, and the anchor server can also be
 in the rejoin stage.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/AnchorClient.cc | 5 -
 src/mds/MDSTableClient.cc | 9 ++---
 src/mds/MDSTableServer.cc | 3 ++-
 3 files changed, 12 insertions(+), 5 deletions(-)
 
 diff --git a/src/mds/AnchorClient.cc b/src/mds/AnchorClient.cc
 index 455e97f..d7da9d1 100644
 --- a/src/mds/AnchorClient.cc
 +++ b/src/mds/AnchorClient.cc
 @@ -80,9 +80,12 @@ void AnchorClient::lookup(inodeno_t ino, vectorAnchor 
 trace, Context *onfinis
 
 void AnchorClient::_lookup(inodeno_t ino)
 {
 + int ts = mds-mdsmap-get_tableserver();
 + if (mds-mdsmap-get_state(ts)  MDSMap::STATE_REJOIN)
 + return;
 MMDSTableRequest *req = new MMDSTableRequest(table, TABLESERVER_OP_QUERY, 0, 
 0);
 ::encode(ino, req-bl);
 - mds-send_message_mds(req, mds-mdsmap-get_tableserver());
 + mds-send_message_mds(req, ts);
 }
 
 
 diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
 index beba0a3..df0131f 100644
 --- a/src/mds/MDSTableClient.cc
 +++ b/src/mds/MDSTableClient.cc
 @@ -149,9 +149,10 @@ void MDSTableClient::_prepare(bufferlist mutation, 
 version_t *ptid, bufferlist
 void MDSTableClient::send_to_tableserver(MMDSTableRequest *req)
 {
 int ts = mds-mdsmap-get_tableserver();
 - if (mds-mdsmap-get_state(ts) = MDSMap::STATE_CLIENTREPLAY)
 + if (mds-mdsmap-get_state(ts) = MDSMap::STATE_CLIENTREPLAY) {
 mds-send_message_mds(req, ts);
 - else {
 + } else {
 + req-put();
 dout(10)   deferring request to not-yet-active tableserver mds.  ts  
 dendl;
 }
 }
 @@ -193,7 +194,9 @@ void MDSTableClient::got_journaled_ack(version_t tid)
 void MDSTableClient::finish_recovery()
 {
 dout(7)  finish_recovery  dendl;
 - resend_commits();
 + int ts = mds-mdsmap-get_tableserver();
 + if (mds-mdsmap-get_state(ts) = MDSMap::STATE_CLIENTREPLAY)
 + resend_commits();
 }
 
 void MDSTableClient::resend_commits()
 diff --git a/src/mds/MDSTableServer.cc b/src/mds/MDSTableServer.cc
 index 4f86ff1..07c7d26 100644
 --- a/src/mds/MDSTableServer.cc
 +++ b/src/mds/MDSTableServer.cc
 @@ -159,7 +159,8 @@ void MDSTableServer::handle_mds_recovery(int who)
 for (mapversion_t,mds_table_pending_t::iterator p = pending_for_mds.begin();
 p != pending_for_mds.end();
 ++p) {
 - if (who = 0  p-second.mds != who)
 + if ((who = 0  p-second.mds != who) ||
 + mds-mdsmap-get_state(p-second.mds)  MDSMap::STATE_CLIENTREPLAY)
 continue;
 MMDSTableRequest *reply = new MMDSTableRequest(table, TABLESERVER_OP_AGREE, 
 p-second.reqid, p-second.tid);
 mds-send_message_mds(reply, p-second.mds);
 -- 
 1.7.11.7





Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.

2013-03-20 Thread Greg Farnum
The idea of this patch makes sense, but I'm not sure if we guarantee that each 
daemon sees every map update — if they don't then if an MDS misses the map 
moving an MDS into CLIENTREPLAY then they won't process them as having 
recovered on the next map. Sage or Joao, what are the guarantees subscription 
provides?  
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
  
 An MDS in the clientreply state has already started serving requests. This also
 makes MDS::handle_mds_recovery() and MDS::recovery_done() match.
  
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)
 
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index 282fa64..b91dcbd 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
  
 setint oldactive, active;
 oldmap-get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
 + oldmap-get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
 mdsmap-get_mds_set(active, MDSMap::STATE_ACTIVE);
 + mdsmap-get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
 for (setint::iterator p = active.begin(); p != active.end(); ++p)  
 if (*p != whoami  // not me
 oldactive.count(*p) == 0) // newly so?
 --  
 1.7.11.7
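
Independent of the subscription question, the patch above just widens what counts as a recovered peer; a small sketch of that set computation (illustrative, not the MDSMap API):

  #include <iostream>
  #include <set>

  // Sketch: an MDS is "serving" if it is ACTIVE or CLIENTREPLAY.  The daemons
  // to run handle_mds_recovery() for are those serving now but not before.
  std::set<int> newly_serving(const std::set<int>& old_serving,
                              const std::set<int>& now_serving,
                              int whoami) {
    std::set<int> out;
    for (int rank : now_serving)
      if (rank != whoami && old_serving.count(rank) == 0)
        out.insert(rank);
    return out;
  }

  int main() {
    // rank 2 moved into clientreplay in the new map; rank 0 is us.
    std::set<int> before = {0, 1};
    std::set<int> after  = {0, 1, 2};
    for (int r : newly_serving(before, after, /*whoami=*/0))
      std::cout << "mds." << r << " newly recovered\n";   // prints mds.2
  }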





Re: [PATCH 07/39] mds: mark connection down when MDS fails

2013-03-20 Thread Greg Farnum
Reviewed-by: Greg Farnum g...@inktank.com



Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 So if the MDS restarts and uses the same address, it does not get
 old messages.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)
 
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index 859782a..282fa64 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1046,8 +1046,10 @@ void MDS::handle_mds_map(MMDSMap *m)
 oldmap-get_failed_mds_set(oldfailed);
 mdsmap-get_failed_mds_set(failed);
 for (setint::iterator p = failed.begin(); p != failed.end(); ++p)
 - if (oldfailed.count(*p) == 0)
 + if (oldfailed.count(*p) == 0) {
 + messenger-mark_down(oldmap-get_inst(*p).addr);
 mdcache-handle_mds_failure(*p);
 + }
 
 // or down then up?
 // did their addr/inst change?
 @@ -1055,8 +1057,10 @@ void MDS::handle_mds_map(MMDSMap *m)
 mdsmap-get_up_mds_set(up);
 for (setint::iterator p = up.begin(); p != up.end(); ++p) 
 if (oldmap-have_inst(*p) 
 - oldmap-get_inst(*p) != mdsmap-get_inst(*p))
 + oldmap-get_inst(*p) != mdsmap-get_inst(*p)) {
 + messenger-mark_down(oldmap-get_inst(*p).addr);
 mdcache-handle_mds_failure(*p);
 + }
 }
 if (is_clientreplay() || is_active() || is_stopping()) {
 // did anyone stop?
 -- 
 1.7.11.7





Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.

2013-03-20 Thread Greg Farnum
Oh, also: s/clientreply/clientreplay in the commit message 

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 An MDS in the clientreply state has already started serving requests. This also
 makes MDS::handle_mds_recovery() and MDS::recovery_done() match.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)
 
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index 282fa64..b91dcbd 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
 
 setint oldactive, active;
 oldmap-get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
 + oldmap-get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
 mdsmap-get_mds_set(active, MDSMap::STATE_ACTIVE);
 + mdsmap-get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
 for (setint::iterator p = active.begin(); p != active.end(); ++p) 
 if (*p != whoami  // not me
 oldactive.count(*p) == 0) // newly so?
 -- 
 1.7.11.7





Re: [PATCH 09/39] mds: defer eval gather locks when removing replica

2013-03-20 Thread Greg Farnum
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 Locks' states should not change between composing the cache rejoin ack
 messages and sending the message. If Locker::eval_gather() is called
 in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
 change locks' states.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDCache.cc | 51 ++-
 src/mds/MDCache.h | 8 +---
 2 files changed, 35 insertions(+), 24 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 19dc60b..0f6b842 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3729,6 +3729,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 // possible response(s)
 MMDSCacheRejoin *ack = 0; // if survivor
 setvinodeno_t acked_inodes; // if survivor
 + setSimpleLock * gather_locks; // if survivor
 bool survivor = false; // am i a survivor?
 
 if (mds-is_clientreplay() || mds-is_active() || mds-is_stopping()) {
 @@ -3851,7 +3852,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 assert(dnl-is_primary());
 
 if (survivor  dn-is_replica(from)) 
 - dentry_remove_replica(dn, from); // this induces a lock gather completion
 + dentry_remove_replica(dn, from, gather_locks); // this induces a lock 
 gather completion

This comment is no longer accurate :) 
 int dnonce = dn-add_replica(from);
 dout(10)   have   *dn  dendl;
 if (ack) 
 @@ -3864,7 +3865,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 assert(in);
 
 if (survivor  in-is_replica(from)) 
 - inode_remove_replica(in, from);
 + inode_remove_replica(in, from, gather_locks);
 int inonce = in-add_replica(from);
 dout(10)   have   *in  dendl;
 
 @@ -3887,7 +3888,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 CInode *in = get_inode(*p);
 assert(in); // hmm fixme wrt stray?
 if (survivor  in-is_replica(from)) 
 - inode_remove_replica(in, from); // this induces a lock gather completion
 + inode_remove_replica(in, from, gather_locks); // this induces a lock gather 
 completion

Same here. 

Other than those, looks good.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


 int inonce = in-add_replica(from);
 dout(10)   have base   *in  dendl;
 
 @@ -3909,8 +3910,11 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 ack-add_inode_base(in);
 }
 
 - rejoin_scour_survivor_replicas(from, ack, acked_inodes);
 + rejoin_scour_survivor_replicas(from, ack, gather_locks, acked_inodes);
 mds-send_message(ack, weak-get_connection());
 +
 + for (setSimpleLock*::iterator p = gather_locks.begin(); p != 
 gather_locks.end(); ++p)
 + mds-locker-eval_gather(*p);
 } else {
 // done?
 assert(rejoin_gather.count(from));
 @@ -4055,7 +4059,9 @@ bool MDCache::parallel_fetch_traverse_dir(inodeno_t 
 ino, filepath path,
 * all validated replicas are acked with a strong nonce, etc. if that isn't in 
 the
 * ack, the replica dne, and we can remove it from our replica maps.
 */
 -void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, 
 setvinodeno_t acked_inodes)
 +void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack,
 + setSimpleLock * gather_locks,
 + setvinodeno_t acked_inodes)
 {
 dout(10)  rejoin_scour_survivor_replicas from mds.  from  dendl;
 
 @@ -4070,7 +4076,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, 
 MMDSCacheRejoin *ack, set
 if (in-is_auth() 
 in-is_replica(from) 
 acked_inodes.count(p-second-vino()) == 0) {
 - inode_remove_replica(in, from);
 + inode_remove_replica(in, from, gather_locks);
 dout(10)   rem   *in  dendl;
 }
 
 @@ -4099,7 +4105,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, 
 MMDSCacheRejoin *ack, set
 if (dn-is_replica(from) 
 (ack-strong_dentries.count(dir-dirfrag()) == 0 ||
 ack-strong_dentries[dir-dirfrag()].count(string_snap_t(dn-name, dn-last)) 
 == 0)) {
 - dentry_remove_replica(dn, from);
 + dentry_remove_replica(dn, from, gather_locks);
 dout(10)   rem   *dn  dendl;
 }
 }
 @@ -6189,6 +6195,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 return;
 }
 
 + setSimpleLock * gather_locks;
 // loop over realms
 for (mapdirfrag_t,MCacheExpire::realm::iterator p = m-realms.begin();
 p != m-realms.end();
 @@ -6255,7 +6262,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 // remove from our cached_by
 dout(7)   inode expire on   *in   from mds.  from 
   cached_by was   in-get_replicas()  dendl;
 - inode_remove_replica(in, from);
 + inode_remove_replica(in, from, gather_locks);
 } 
 else {
 // this is an old nonce, ignore expire.
 @@ -6332,7 +6339,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 
 if (nonce == dn-get_replica_nonce(from)) {
 dout(7)   dentry_expire on   *dn   from mds.  from  dendl;
 - 

Re: CephFS: stable release?

2013-03-20 Thread Greg Farnum
On Wednesday, March 20, 2013 at 1:22 PM, Pascal wrote:
 On Sun, 24 Feb 2013 14:41:27 -0800,
 Gregory Farnum g...@inktank.com wrote:
  
  On Saturday, February 23, 2013 at 2:14 AM, Gandalf Corvotempesta
  wrote:
   Hi all,
   do you have an ETA about a stable realease (or something usable in
   production) for CephFS?
   
   
   
  Short answer: no.  
   
  However, we do have a team of people working on the FS again as of a
  month or so ago. We're doing a lot of stabilization (bug fixes), code
  cleanups, and utility work in the coming months; we can estimate the
  utility and cleanup work but not the bugs that we'll find, and those
  are our main concern right now. Depending on how the next couple
  months of QA and bug fixing go we should be able to publicize real
  estimates soonish. -Greg
   
  
  
  
 Hello Gregory,
  
 is your response still up-to-date?  
  
 The FAQ (http://ceph.com/docs/master/faq/) says: Ceph’s object store (RADOS) 
 is production ready.
  


We'll put out some blog posts and emails when we have anything more to report. 
:) RADOS is ready, but CephFS is a whole separate layer above it.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request

2013-03-20 Thread Greg Farnum
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 Replicated objects need to be added into the cache immediately
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
Why do we need to add them right away? Shouldn't we have a journaled replica if 
we need it?
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com
 ---
 src/mds/MDCache.cc | 12 
 src/mds/MDCache.h | 2 +-
 src/mds/MDS.cc | 6 +++---
 src/mds/Server.cc | 55 +++---
 4 files changed, 56 insertions(+), 19 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 0f6b842..b668842 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context 
 *fin, bufferlist bl, int r)
 
 /*  */
 
 +int MDCache::get_num_client_requests()
 +{
 + int count = 0;
 + for (hash_mapmetareqid_t, MDRequest*::iterator p = 
 active_requests.begin();
 + p != active_requests.end();
 + ++p) {
 + if (p-second-reqid.name.is_client()  !p-second-is_slave())
 + count++;
 + }
 + return count;
 +}
 +
 /* This function takes over the reference to the passed Message */
 MDRequest *MDCache::request_start(MClientRequest *req)
 {
 diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
 index a9f05c6..4634121 100644
 --- a/src/mds/MDCache.h
 +++ b/src/mds/MDCache.h
 @@ -240,7 +240,7 @@ protected:
 hash_mapmetareqid_t, MDRequest* active_requests; 
 
 public:
 - int get_num_active_requests() { return active_requests.size(); }
 + int get_num_client_requests();
 
 MDRequest* request_start(MClientRequest *req);
 MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index b91dcbd..e99eecc 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
 mdcache-is_open() 
 replay_queue.empty() 
 want_state == MDSMap::STATE_CLIENTREPLAY) {
 - dout(10)   still have   mdcache-get_num_active_requests()
 -   active replay requests  dendl;
 - if (mdcache-get_num_active_requests() == 0)
 + int num_requests = mdcache-get_num_client_requests();
 + dout(10)   still have   num_requests   active replay requests  
 dendl;
 + if (num_requests == 0)
 clientreplay_done();
 }
 
 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index 4c4c86b..8e89e4c 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
 (m-get_type() == CEPH_MSG_CLIENT_REQUEST 
 (static_castMClientRequest*(m))-is_replay( {
 // replaying!
 - } else if (mds-is_clientreplay()  m-get_type() == MSG_MDS_SLAVE_REQUEST 
 
 - ((static_castMMDSSlaveRequest*(m))-is_reply() ||
 - !mds-mdsmap-is_active(m-get_source().num( {
 - // slave reply or the master is also in the clientreplay stage
 + } else if (m-get_type() == MSG_MDS_SLAVE_REQUEST) {
 + // handle_slave_request() will wait if necessary
 } else {
 dout(3)  not active yet, waiting  dendl;
 mds-wait_for_active(new C_MDS_RetryMessage(mds, m));
 @@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
 if (m-is_reply())
 return handle_slave_request_reply(m);
 
 + CDentry *straydn = NULL;
 + if (m-stray.length()  0) {
 + straydn = mdcache-add_replica_stray(m-stray, from);
 + assert(straydn);
 + m-stray.clear();
 + }
 +
 // am i a new slave?
 MDRequest *mdr = NULL;
 if (mdcache-have_request(m-get_reqid())) {
 @@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
 m-put();
 return;
 }
 - mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), 
 m-get_source().num());
 + mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), from);
 }
 assert(mdr-slave_request == 0); // only one at a time, please! 
 +
 + if (straydn) {
 + mdr-pin(straydn);
 + mdr-straydn = straydn;
 + }
 +
 + if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 + dout(3)  not clientreplay|active yet, waiting  dendl;
 + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 + return;
 + } else if (mds-is_clientreplay()  !mds-mdsmap-is_clientreplay(from) 
 + mdr-locks.empty()) {
 + dout(3)  not active yet, waiting  dendl;
 + mds-wait_for_active(new C_MDS_RetryMessage(mds, m));
 + return;
 + }
 +
 mdr-slave_request = m;
 
 dispatch_slave_request(mdr);
 @@ -1339,6 +1361,12 @@ void 
 Server::handle_slave_request_reply(MMDSSlaveRequest *m)
 {
 int from = m-get_source().num();
 
 + if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 + dout(3)  not clientreplay|active yet, waiting  dendl;
 + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 + return;
 + }
 +
 if (m-get_op() == MMDSSlaveRequest::OP_COMMITTED) {
 metareqid_t r = m-get_reqid();
 mds-mdcache-committed_master_slave(r, from);
 @@ -5138,10 +5166,8 @@ void Server::handle_slave_rmdir_prep(MDRequest *mdr)
 dout(10)   dn   *dn  dendl;
 mdr-pin(dn);
 
 - assert(mdr-slave_request-stray.length()  

Re: [PATCH] ceph: fix buffer pointer advance in ceph_sync_write

2013-03-19 Thread Greg Farnum
Sage beat me to it and merged this in last night. Thanks much! 
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Monday, March 18, 2013 at 6:46 PM, Henry C Chang wrote:

 We should advance the user data pointer by _len_ instead of _written_.
 _len_ is the data length written in each iteration while _written_ is the
 accumulated data length we have written out.
 
 Signed-off-by: Henry C Chang henry.cy.ch...@gmail.com
 ---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index e51558f..4bcbcb6 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -608,7 +608,7 @@ out:
 pos += len;
 written += len;
 left -= len;
 - data += written;
 + data += len;
 if (left)
 goto more;
 
 -- 
 1.7.9.5
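
The bug is easy to see in isolation: in a chunked write loop the source pointer must advance by the bytes written in this iteration (len), not by the running total (written), otherwise every iteration after the first reads from the wrong offset. A standalone illustration (not the kernel code):

  #include <cstddef>
  #include <iostream>

  // Simulate writing 'total' bytes in chunks of at most 'chunk' bytes and
  // report the buffer offset each iteration starts from.
  void show_advance(size_t total, size_t chunk, bool buggy) {
    char buf[64];
    const char* base = buf;
    const char* data = base;
    size_t written = 0, left = total;
    while (left > 0) {
      size_t len = left < chunk ? left : chunk;   // bytes written this iteration
      std::cout << "writing " << len << " bytes from offset " << (data - base) << "\n";
      written += len;
      left -= len;
      data += buggy ? written : len;   // the bug: advancing by the running total
    }
  }

  int main() {
    std::cout << "correct:\n"; show_advance(24, 8, false);  // offsets 0, 8, 16
    std::cout << "buggy:\n";   show_advance(24, 8, true);   // offsets 0, 8, 24
  }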
 





Re: deb/rpm package purge

2013-03-19 Thread Greg Farnum
On Tuesday, March 19, 2013 at 12:48 PM, Sage Weil wrote:
 Should the package purge remove /var/lib/ceph/* (potential mon data, osd 
 data) and/or /var/log/ceph/* (logs)? Right now it does, but mysql, for 
 example, leaves /var/lib/mysql where it is (not sure about logs).
 


I'm with Mark's ticket on this (http://tracker.ceph.com/issues/4505). Config 
data in /etc/ceph and logs in /var/log/ceph are fine to remove, but storage data 
isn't. That's essentially user-generated and not something that can be 
recovered or rebuilt following the purge. Keyrings might not be unreasonable to 
delete, but I don't think they're necessary and certainly aren't worth putting 
in the work to separate from the other user data in /var/lib/ceph.

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com




Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-19 Thread Greg Farnum
Hmm, this is definitely narrowing the race (probably enough to never hit it), 
but it's not actually eliminating it (if the restart happens after 4 billion 
requests…). More importantly this kind of symptom makes me worry that we might 
be papering over more serious issues with colliding states in the Table on 
restart.
I don't have the MDSTable semantics in my head so I'll need to look into this 
later unless somebody else volunteers to do so…
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
  
 When a MDS becomes active, the table server re-sends 'agree' messages
 for old prepared request. If the recoverd MDS starts a new table request
 at the same time, The new request's ID can happen to be the same as old
 prepared request's ID, because current table client assigns request ID
 from zero after MDS restarts.
  
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 3 +++
 src/mds/MDSTableClient.cc | 5 +
 src/mds/MDSTableClient.h | 2 ++
 3 files changed, 10 insertions(+)
 
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index bb1c833..859782a 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
 dout(2)  boot_start   step  : opening snap table  dendl;  
 snapserver-load(gather.new_sub());
 }
 +
 + anchorclient-init();
 + snapclient-init();
  
 dout(2)  boot_start   step  : opening mds log  dendl;
 mdlog-open(gather.new_sub());
 diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
 index ea021f5..beba0a3 100644
 --- a/src/mds/MDSTableClient.cc
 +++ b/src/mds/MDSTableClient.cc
 @@ -34,6 +34,11 @@
 #undef dout_prefix
 #define dout_prefix *_dout  mds.  mds-get_nodeid()  .tableclient( 
  get_mdstable_name(table)  ) 
  
 +void MDSTableClient::init()
 +{
 + // make reqid unique between MDS restarts
 + last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
 +}
  
 void MDSTableClient::handle_request(class MMDSTableRequest *m)
 {
 diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
 index e15837f..78035db 100644
 --- a/src/mds/MDSTableClient.h
 +++ b/src/mds/MDSTableClient.h
 @@ -63,6 +63,8 @@ public:
 MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
 virtual ~MDSTableClient() {}
  
 + void init();
 +
 void handle_request(MMDSTableRequest *m);
  
 void _prepare(bufferlist mutation, version_t *ptid, bufferlist *pbl, Context 
 *onfinish);
 --  
 1.7.11.7





Re: Direct IO on CephFS for blocks larger than 8MB

2013-03-18 Thread Greg Farnum
On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
 The following patch should fix the problem.
 
 -Henry
 
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index e51558f..4bcbcb6 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -608,7 +608,7 @@ out:
 pos += len;
 written += len;
 left -= len;
 - data += written;
 + data += len;
 if (left)
 goto more;

This looks good to me. If you'd like to submit it as a proper patch with a 
sign-off I'll pull it into our tree. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com




Re: Crash and strange things on MDS

2013-03-15 Thread Greg Farnum
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote:
 On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote:
  On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com wrote:
   You will find the archive here: snip
   The data is not anonymized. Interesting folders/files here are
   /user_309bbd38-3cff-468d-a465-dc17c260de0c/*
   
   
   
  Sorry for the delay, but I have retrieved this archive locally at
  least so if you want to remove it from your webserver you can do so.
  :) Also, I notice when I untar it that the file name includes
  filtered — what filters did you run it through?
  
  
  
 Hi Gregory,
  
 Do you have any news about it?
  

I wrote a couple tools to do log analysis and created a number of bugs to make 
the MDS more amenable to analysis as a result of this.
Having spot-checked some of your longer-running requests, they're all getattrs 
or setattrs contending on files in what look to be shared cache and php 
libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd 
look into what your split applications are sharing across those spaces.

On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% 
of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most 
of the time. :)
-Greg



Re: Crash and strange things on MDS

2013-03-15 Thread Greg Farnum
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote:
 Thank you a lot for these explanations, looking forward for these fixes!
 Do you have some public bug reports regarding this to link us?
 
 Good luck, thank you for your great job and have a nice weekend
 
 Marc-Antoine Perennou 
Well, for now the fixes are for stuff like make analysis take less time, and 
export timing information more easily. The most immediately applicable one is 
probably http://tracker.ceph.com/issues/4354, which I hope to start on next 
week and should be done by the end of the sprint.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: CephFS Space Accounting and Quotas

2013-03-15 Thread Greg Farnum
[Putting list back on cc]

On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:

 On 03/15/2013 04:23 PM, Greg Farnum wrote:
  As I come back and look at these again, I'm not sure what the context
  for these logs is. Which test did they come from, and which behavior
  (slow or not slow, etc) did you see? :) -Greg
 
 
 
 They come from a test where I had debug mds = 20 and debug ms = 1
 on the MDS while writing files from 198 clients. It turns out that 
 for some reason I need debug mds = 20 during writing to reproduce
 the slow stat behavior later.
 
 strace.find.dirs.txt.bz2 contains the log of running 
 strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls 
 -lhd {} \;
 
 From that output, I believe that the stat of at least these files is slow:
 zero0.rc11
 zero0.rc30
 zero0.rc46
 zero0.rc8
 zero0.tc103
 zero0.tc105
 zero0.tc106
 I believe that log shows slow stats on more files, but those are the first 
 few.
 
 mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
 find command started, until just after the fifth or sixth slow stat from
 the list above.
 
 I haven't yet tried to find other ways of reproducing this, but so far
 it appears that something happens during the writing of the files that
 ends up causing the condition that results in slow stat commands.
 
 I have the full MDS log from the writing of the files, as well, but it's
 big
 
 Is that what you were after?
 
 Thanks for taking a look!
 
 -- Jim

I just was coming back to these to see what new information was available, but 
I realized we'd discussed several tests and I wasn't sure what these ones came 
from. That information is enough, yes.

If in fact you believe you've only seen this with high-level MDS debugging, I 
believe the cause is as I mentioned last time: the MDS is flapping a bit and so 
some files get marked as needsrecover, but they aren't getting recovered 
asynchronously, and the first thing that pokes them into doing a recover is the 
stat.
That's definitely not the behavior we want and so I'll be poking around the 
code a bit and generating bugs, but given that explanation it's a bit less 
scary than random slow stats are so it's not such a high priority. :) Do let me 
know if you come across it without the MDS and clients having had connection 
issues!
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: Direct IO on CephFS for blocks larger than 8MB

2013-03-14 Thread Greg Farnum
On Thursday, March 14, 2013 at 8:20 AM, Sage Weil wrote:
 On Thu, 14 Mar 2013, Huang, Xiwei wrote:
  Hi, all, 
  I noticed that CephFS fails to support Direct IO for blocks larger than 
  8MB, say:
  sudo dd if=/dev/zero of=mnt/cephfs/foo bs=16M count=1 oflag=direct
  dd: writing `mnt/cephfs/foo: Bad address
  1+0 records in
  0+0 records out
  0 bytes (0 B) copied, 0.213948 s, 0.0 kB/s
  My version of Ceph is 0.56.1.
  I also found the bug has already been reported as Bug #2657.
  Is this fixed in the new 0.58 version?
 
 
 
 I'm pretty sure this is a problem on the kernel client side of things, not 
 the server side (which by default handles writes up to ~100MB or so). I 
 suspect it isn't terribly difficult to fix, but hasn't been prioritized...
 
 sage 
My guess too. Are direct IO writes of that size a common thing or of great 
import to you?
Either way, a comment on the tracker saying you've run into it will promote it 
up when we're doing bug scrubs and backlog reviews. :) 
-Greg



Re: CephFS locality API RFC

2013-03-14 Thread Greg Farnum
On Thursday, March 14, 2013 at 11:14 AM, Noah Watkins wrote:
 The current CephFS API is used to extract locality information as follows:
  
 First we get a list of OSD IDs:
  
 ceph_get_file_extent_osds(offset) - [OSD ID]*
  
 Using the OSD IDs we can then query for the CRUSH bucket hierarchy:
  
 ceph_get_osd_crush_location(osd_id) - path
  
 The path includes hostname information, but we'd still like to get the IP. 
 The current API for doing this is:
  
 ceph_get_file_stripe_address(offset) - [sockaddr]*
  
 that returns an IP for each OSD holds replicas. The order of the output list 
 should be the same as the the OSD list, but It'd be nice to have a consistent 
 API that deals with OSD id, making the correspondence explicit.
Agreed. We should probably deprecate the get_file_stripe_address() and make 
them turn IDs into addresses on their own.

 For instance:
  
 ceph_get_file_stripe_address(osd_id) - sockaddr
How about  
ceph_get_osd_address(osd_id) - sockaddr
;)

  
 Another option is to have `ceph_get_osd_crush_location` return both the path 
 and a sockaddr.

No way — that's conflating two different things rather more than we should be. 
For one thing the sockaddr can change during a daemon restart but the crush 
location won't.
-Greg
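
For reference, this is roughly how the calls discussed in this thread would chain together from an application's point of view. The names come from the thread; the parameter lists below are simplified stand-ins (the real libcephfs calls take a mount handle and output buffers), and get_osd_address is only the proposal here, so treat this as a shape sketch rather than working libcephfs code:

  #include <netinet/in.h>
  #include <iostream>
  #include <string>
  #include <vector>

  // Stubs standing in for the calls discussed in this thread; the real
  // signatures differ, and get_osd_address does not exist yet.
  std::vector<int> get_extent_osds(int fd, long offset) { return {3, 7, 12}; }
  std::string get_osd_crush_location(int osd) { return "host=nodeX,rack=rackY"; }
  sockaddr_in get_osd_address(int osd) { sockaddr_in a{}; a.sin_family = AF_INET; return a; }

  int main() {
    // For a given file offset: OSD ids -> CRUSH path (topology) -> address (IP).
    for (int osd : get_extent_osds(/*fd=*/0, /*offset=*/0)) {
      std::cout << "osd." << osd
                << " at " << get_osd_crush_location(osd)
                << " family=" << get_osd_address(osd).sin_family << "\n";
      // A Hadoop-style consumer would build its block location list from
      // (osd id, crush path, address) tuples like these.
    }
  }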



Re: CephFS locality API RFC

2013-03-14 Thread Greg Farnum
On Thursday, March 14, 2013 at 11:33 AM, Noah Watkins wrote:
  
 On Mar 14, 2013, at 11:29 AM, Greg Farnum g...@inktank.com wrote:
  
  On Thursday, March 14, 2013 at 11:14 AM, Noah Watkins wrote:
   The current CephFS API is used to extract locality information as follows:

   First we get a list of OSD IDs:

    ceph_get_file_extent_osds(offset) -> [OSD ID]*

    Using the OSD IDs we can then query for the CRUSH bucket hierarchy:

    ceph_get_osd_crush_location(osd_id) -> path

    The path includes hostname information, but we'd still like to get the 
    IP. The current API for doing this is:

    ceph_get_file_stripe_address(offset) -> [sockaddr]*

    that returns an IP for each OSD that holds replicas. The order of the output 
    list should be the same as the OSD list, but it'd be nice to have a 
    consistent API that deals with OSD id, making the correspondence explicit.
  Agreed. We should probably deprecate the get_file_stripe_address() and make 
  them turn IDs into addresses on their own.
  
  
  
 Is there an API deprecation protocol, or just -ENOTSUPP?
Well, for the moment I was thinking sticking DEPRECATED next to it and not 
using it anywhere else — but that is probably an acceptable choice instead. I 
doubt anybody's using it outside of the old Hadoop bindings. Which I am looking 
forward to being able to purge out of all memory…. ;)
-Greg
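
If it helps, the marking can be as small as a compiler attribute on the declaration. A sketch follows; the LIBCEPHFS_DEPRECATED macro name is made up for the example and the signature is only shown approximately.

  /* Hypothetical sketch of flagging the old call in the public header. */
  #ifdef __GNUC__
  # define LIBCEPHFS_DEPRECATED __attribute__((deprecated))
  #else
  # define LIBCEPHFS_DEPRECATED
  #endif

  /* Old offset-based call: kept for compatibility, but new code gets a
   * compile-time warning nudging it toward the OSD-ID based calls. */
  LIBCEPHFS_DEPRECATED
  int ceph_get_file_stripe_address(struct ceph_mount_info *cmount, int fd,
                                   int64_t offset,
                                   struct sockaddr_storage *addr, int naddr);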



Re: [PATCH V2] ceph: use i_release_count to indicate dir's completeness

2013-03-13 Thread Greg Farnum
Looks good, thanks. :)

We'll also be testing the first patch in this series.
-Greg

On Wednesday, March 13, 2013 at 4:44 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com)
 
 Current ceph code tracks directory's completeness in two places.
 ceph_readdir() checks i_release_count to decide if it can set the
 I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
 flag. This indirection introduces locking complexity.
 
 This patch adds a new variable i_complete_count to ceph_inode_info.
 Set i_release_count's value to it when marking a directory complete.
 By comparing the two variables, we know if a directory is complete
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com 
 (mailto:zheng.z@intel.com)
 ---
 Changes since V1:
 define i_complete_count as atomic_t
 
 fs/ceph/caps.c | 4 ++--
 fs/ceph/dir.c | 25 +
 fs/ceph/inode.c | 13 +++--
 fs/ceph/mds_client.c | 10 +++---
 fs/ceph/super.h | 42 --
 5 files changed, 45 insertions(+), 49 deletions(-)
 
 diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
 index 76634f4..124e8a1 100644
 --- a/fs/ceph/caps.c
 +++ b/fs/ceph/caps.c
 @@ -490,7 +490,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, 
 struct ceph_cap *cap,
 ci-i_rdcache_gen++;
 
 /*
 - * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
 + * if we are newly issued FILE_SHARED, mark dir not complete; we
 * don't know what happened to this directory while we didn't
 * have the cap.
 */
 @@ -499,7 +499,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, 
 struct ceph_cap *cap,
 ci-i_shared_gen++;
 if (S_ISDIR(ci-vfs_inode.i_mode)) {
 dout( marking %p NOT complete\n, ci-vfs_inode);
 - ci-i_ceph_flags = ~CEPH_I_COMPLETE;
 + __ceph_dir_clear_complete(ci);
 }
 }
 }
 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 76821be..11966c4 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
 * falling back to a normal sync readdir if any dentries in the dir
 * are dropped.
 *
 - * I_COMPLETE tells indicates we have all dentries in the dir. It is
 + * Complete dir indicates that we have all dentries in the dir. It is
 * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
 * the MDS if/when the directory is modified).
 */
 @@ -198,8 +198,8 @@ more:
 filp-f_pos++;
 
 /* make sure a dentry wasn't dropped while we didn't have parent lock */
 - if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
 - dout( lost I_COMPLETE on %p; falling back to mds\n, dir);
 + if (!ceph_dir_is_complete(dir)) {
 + dout( lost dir complete on %p; falling back to mds\n, dir);
 err = -EAGAIN;
 goto out;
 }
 @@ -258,7 +258,7 @@ static int ceph_readdir(struct file *filp, void *dirent, 
 filldir_t filldir)
 if (filp-f_pos == 0) {
 /* note dir version at start of readdir so we can tell
 * if any dentries get dropped */
 - fi-dir_release_count = ci-i_release_count;
 + fi-dir_release_count = atomic_read(ci-i_release_count);
 
 dout(readdir off 0 - '.'\n);
 if (filldir(dirent, ., 1, ceph_make_fpos(0, 0),
 @@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, 
 filldir_t filldir)
 if ((filp-f_pos == 2 || fi-dentry) 
 !ceph_test_mount_opt(fsc, NOASYNCREADDIR) 
 ceph_snap(inode) != CEPH_SNAPDIR 
 - (ci-i_ceph_flags  CEPH_I_COMPLETE) 
 + __ceph_dir_is_complete(ci) 
 __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
 spin_unlock(ci-i_ceph_lock);
 err = __dcache_readdir(filp, dirent, filldir);
 @@ -350,7 +350,8 @@ more:
 
 if (!req-r_did_prepopulate) {
 dout(readdir !did_prepopulate);
 - fi-dir_release_count--; /* preclude I_COMPLETE */
 + /* preclude from marking dir complete */
 + fi-dir_release_count--;
 }
 
 /* note next offset and last dentry name */
 @@ -428,9 +429,9 @@ more:
 * the complete dir contents in our cache.
 */
 spin_lock(ci-i_ceph_lock);
 - if (ci-i_release_count == fi-dir_release_count) {
 + if (atomic_read(ci-i_release_count) == fi-dir_release_count) {
 dout( marking %p complete\n, inode);
 - ci-i_ceph_flags |= CEPH_I_COMPLETE;
 + __ceph_dir_set_complete(ci, fi-dir_release_count);
 ci-i_max_offset = filp-f_pos;
 }
 spin_unlock(ci-i_ceph_lock);
 @@ -605,7 +606,7 @@ static struct dentry *ceph_lookup(struct inode *dir, 
 struct dentry *dentry,
 fsc-mount_options-snapdir_name,
 dentry-d_name.len) 
 !is_root_ceph_dentry(dir, dentry) 
 - (ci-i_ceph_flags  CEPH_I_COMPLETE) 
 + __ceph_dir_is_complete(ci) 
 (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
 spin_unlock(ci-i_ceph_lock);
 dout( dir %p complete, -ENOENT\n, dir);
 @@ -909,7 +910,7 @@ static int ceph_rename(struct inode *old_dir, struct 
 dentry *old_dentry,
 */
 
 /* d_move screws up d_subdirs order */
 - ceph_i_clear(new_dir, CEPH_I_COMPLETE);
 + ceph_dir_clear_complete(new_dir);
 
 d_move(old_dentry, new_dentry);
 
 @@ -1079,7 +1080,7 @@ static void ceph_d_prune(struct dentry *dentry)
 if (IS_ROOT(dentry))
 return;
 
 - /* if we 

Re: OSD memory leaks?

2013-03-13 Thread Greg Farnum
It sounds like maybe you didn't rename the new pool to use the old pool's name? 
Glance is looking for a specific pool to store its data in; I believe it's 
configurable but you'll need to do one or the other.
-Greg
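
In other words, once the copy exists, something along these lines (pool names here are examples, and writers should be quiesced first); if the PermissionError persists, it is also worth re-checking that the key glance authenticates with has caps on the pool name it actually uses:

  ceph osd pool rename images images.old     # move the original aside
  ceph osd pool rename images.copy images    # give the copy the name glance expects
  ceph auth list                             # confirm the glance client's caps cover that pool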

On Wednesday, March 13, 2013 at 3:38 PM, Dave Spano wrote:

 Sebastien,
  
 I'm not totally sure yet, but everything is still working.  
  
  
 Sage and Greg,  
 I copied my glance image pool per the posting I mentioned previously, and 
 everything works when I use the ceph tools. I can export rbds from the new 
 pool and delete them as well.
  
 I noticed that the copied images pool does not work with glance.  
  
 I get this error when I try to create images in the new pool. If I put the 
 old pool back, I can create images no problem.  
  
 Is there something I'm missing in glance that I need to work with a pool 
 created in bobtail? I'm using Openstack Folsom.  
  
 File "/usr/lib/python2.7/dist-packages/glance/api/v1/images.py", line 437, in _upload  
 image_meta['size'])  
 File "/usr/lib/python2.7/dist-packages/glance/store/rbd.py", line 244, in add  
 image_size, order)  
 File "/usr/lib/python2.7/dist-packages/glance/store/rbd.py", line 207, in _create_image  
 features=rbd.RBD_FEATURE_LAYERING)  
 File "/usr/lib/python2.7/dist-packages/rbd.py", line 194, in create  
 raise make_ex(ret, 'error creating image')  
 PermissionError: error creating image
  
  
 Dave Spano  
  
  
  
  
 - Original Message -  
  
 From: Sébastien Han han.sebast...@gmail.com 
 (mailto:han.sebast...@gmail.com)  
 To: Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com)  
 Cc: Greg Farnum g...@inktank.com (mailto:g...@inktank.com), ceph-devel 
 ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org), Sage Weil 
 s...@inktank.com (mailto:s...@inktank.com), Wido den Hollander 
 w...@42on.com (mailto:w...@42on.com), Sylvain Munaut 
 s.mun...@whatever-company.com (mailto:s.mun...@whatever-company.com), 
 Samuel Just sam.j...@inktank.com (mailto:sam.j...@inktank.com), 
 Vladislav Gorbunov vadi...@gmail.com (mailto:vadi...@gmail.com)  
 Sent: Wednesday, March 13, 2013 3:59:03 PM  
 Subject: Re: OSD memory leaks?  
  
 Dave,  
  
 Just to be sure, did the log max recent=1 _completely_ stod the  
 memory leak or did it slow it down?  
  
 Thanks!  
 --  
 Regards,  
 Sébastien Han.  
  
  
 On Wed, Mar 13, 2013 at 2:12 PM, Dave Spano dsp...@optogenics.com 
 (mailto:dsp...@optogenics.com) wrote:  
  Lol. I'm totally fine with that. My glance images pool isn't used too 
  often. I'm going to give that a try today and see what happens.  
   
  I'm still crossing my fingers, but since I added log max recent=1 to 
  ceph.conf, I've been okay despite the improper pg_num, and a lot of 
  scrubbing/deep scrubbing yesterday.  
   
  Dave Spano  
   
   
   
   
  - Original Message -  
   
  From: Greg Farnum g...@inktank.com (mailto:g...@inktank.com)  
  To: Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com)  
  Cc: ceph-devel ceph-devel@vger.kernel.org 
  (mailto:ceph-devel@vger.kernel.org), Sage Weil s...@inktank.com 
  (mailto:s...@inktank.com), Wido den Hollander w...@42on.com 
  (mailto:w...@42on.com), Sylvain Munaut s.mun...@whatever-company.com 
  (mailto:s.mun...@whatever-company.com), Samuel Just 
  sam.j...@inktank.com (mailto:sam.j...@inktank.com), Vladislav Gorbunov 
  vadi...@gmail.com (mailto:vadi...@gmail.com), Sébastien Han 
  han.sebast...@gmail.com (mailto:han.sebast...@gmail.com)  
  Sent: Tuesday, March 12, 2013 5:37:37 PM  
  Subject: Re: OSD memory leaks?  
   
  Yeah. There's not anything intelligent about that cppool mechanism. :)  
  -Greg  
   
  On Tuesday, March 12, 2013 at 2:15 PM, Dave Spano wrote:  
   
   I'd rather shut the cloud down and copy the pool to a new one than take 
   any chances of corruption by using an experimental feature. My guess is 
   that there cannot be any i/o to the pool while copying, otherwise you'll 
   lose the changes that are happening during the copy, correct?  

   Dave Spano  
   Optogenics  
   Systems Administrator  



   - Original Message -  

   From: Greg Farnum g...@inktank.com (mailto:g...@inktank.com)  
   To: Sébastien Han han.sebast...@gmail.com 
   (mailto:han.sebast...@gmail.com)  
   Cc: Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com), 
   ceph-devel ceph-devel@vger.kernel.org 
   (mailto:ceph-devel@vger.kernel.org), Sage Weil s...@inktank.com 
   (mailto:s...@inktank.com), Wido den Hollander w...@42on.com 
   (mailto:w...@42on.com), Sylvain Munaut s.mun...@whatever-company.com 
   (mailto:s.mun...@whatever-company.com), Samuel Just 
   sam.j...@inktank.com (mailto:sam.j...@inktank.com), Vladislav 
   Gorbunov vadi...@gmail.com (mailto:vadi...@gmail.com)  
   Sent: Tuesday, March 12, 2013 4:20:13 PM  
   Subject: Re: OSD memory leaks?  

   On Tuesday, March 12, 2013 at 1:10 PM, Sébastien Han wrote:  
Well to avoid un necessary data movement

Re: OSD memory leaks?

2013-03-12 Thread Greg Farnum
On Tuesday, March 12, 2013 at 1:10 PM, Sébastien Han wrote:
 Well to avoid unnecessary data movement, there is also an
 _experimental_ feature to change on fly the number of PGs in a pool.
  
 ceph osd pool set poolname pg_num numpgs --allow-experimental-feature
Don't do that. We've got a set of 3 patches which fix bugs we know about that 
aren't in bobtail yet, and I'm sure there's more we aren't aware of…
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com  

  
 Cheers!
 --
 Regards,
 Sébastien Han.
  
  
 On Tue, Mar 12, 2013 at 7:09 PM, Dave Spano dsp...@optogenics.com 
 (mailto:dsp...@optogenics.com) wrote:
  Disregard my previous question. I found my answer in the post below. 
  Absolutely brilliant! I thought I was screwed!
   
  http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/8924
   
  Dave Spano
  Optogenics
  Systems Administrator
   
   
   
  - Original Message -
   
  From: Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com)
  To: Sébastien Han han.sebast...@gmail.com 
  (mailto:han.sebast...@gmail.com)
  Cc: Sage Weil s...@inktank.com (mailto:s...@inktank.com), Wido den 
  Hollander w...@42on.com (mailto:w...@42on.com), Gregory Farnum 
  g...@inktank.com (mailto:g...@inktank.com), Sylvain Munaut 
  s.mun...@whatever-company.com (mailto:s.mun...@whatever-company.com), 
  ceph-devel ceph-devel@vger.kernel.org 
  (mailto:ceph-devel@vger.kernel.org), Samuel Just sam.j...@inktank.com 
  (mailto:sam.j...@inktank.com), Vladislav Gorbunov vadi...@gmail.com 
  (mailto:vadi...@gmail.com)
  Sent: Tuesday, March 12, 2013 1:41:21 PM
  Subject: Re: OSD memory leaks?
   
   
  If one were stupid enough to have their pg_num and pgp_num set to 8 on two 
  of their pools, how could you fix that?
   
   
  Dave Spano
   
   
   
  - Original Message -
   
  From: Sébastien Han han.sebast...@gmail.com 
  (mailto:han.sebast...@gmail.com)
  To: Vladislav Gorbunov vadi...@gmail.com (mailto:vadi...@gmail.com)
  Cc: Sage Weil s...@inktank.com (mailto:s...@inktank.com), Wido den 
  Hollander w...@42on.com (mailto:w...@42on.com), Gregory Farnum 
  g...@inktank.com (mailto:g...@inktank.com), Sylvain Munaut 
  s.mun...@whatever-company.com (mailto:s.mun...@whatever-company.com), 
  Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com), 
  ceph-devel ceph-devel@vger.kernel.org 
  (mailto:ceph-devel@vger.kernel.org), Samuel Just sam.j...@inktank.com 
  (mailto:sam.j...@inktank.com)
  Sent: Tuesday, March 12, 2013 9:43:44 AM
  Subject: Re: OSD memory leaks?
   
   Sorry, i mean pg_num and pgp_num on all pools. Shown by the ceph osd
   dump | grep 'rep size'
   
   
   
  Well it's still 450 each...
   
   The default pg_num value 8 is NOT suitable for big cluster.
   
  Thanks I know, I'm not new with Ceph. What's your point here? I
  already said that pg_num was 450...
  --
  Regards,
  Sébastien Han.
   
   
  On Tue, Mar 12, 2013 at 2:00 PM, Vladislav Gorbunov vadi...@gmail.com 
  (mailto:vadi...@gmail.com) wrote:
   Sorry, i mean pg_num and pgp_num on all pools. Shown by the ceph osd
   dump | grep 'rep size'
   The default pg_num value 8 is NOT suitable for big cluster.

   2013/3/13 Sébastien Han han.sebast...@gmail.com 
   (mailto:han.sebast...@gmail.com):
Replica count has been set to 2.
 
Why?
--
Regards,
Sébastien Han.
 
 
On Tue, Mar 12, 2013 at 12:45 PM, Vladislav Gorbunov vadi...@gmail.com 
(mailto:vadi...@gmail.com) wrote:
  FYI I'm using 450 pgs for my pools.
  
  
 Please, can you show the number of object replicas?
  
 ceph osd dump | grep 'rep size'
  
 Vlad Gorbunov
  
 2013/3/5 Sébastien Han han.sebast...@gmail.com 
 (mailto:han.sebast...@gmail.com):
  FYI I'm using 450 pgs for my pools.
   
  --
  Regards,
  Sébastien Han.
   
   
  On Fri, Mar 1, 2013 at 8:10 PM, Sage Weil s...@inktank.com 
  (mailto:s...@inktank.com) wrote:

   On Fri, 1 Mar 2013, Wido den Hollander wrote:
On 02/23/2013 01:44 AM, Sage Weil wrote:
 On Fri, 22 Feb 2013, S?bastien Han wrote:
  Hi all,
   
  I finally got a core dump.
   
  I did it with a kill -SEGV on the OSD process.
   
  https://www.dropbox.com/s/ahv6hm0ipnak5rf/core-ceph-osd-11-0-0-20100-1361539008
   
  Hope we will get something out of it :-).
  
  AHA! We have a theory. The pg log isn't trimmed during scrub (because the
  old scrub code required that), but the new (deep) scrub can take a very
  long time, which means the pg log will eat ram in the meantime..
  especially under high iops.
 
 
 
Does the number of PGs influence the memory leak? So my theory 
is that when
you have a high number of PGs with a low number of objects per 
PG you don't

Re: OSD memory leaks?

2013-03-12 Thread Greg Farnum
Yeah. There's not anything intelligent about that cppool mechanism. :)
-Greg
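
For reference, the straight-copy approach boils down to something like the following (pool name and pg count are examples; as noted, all writers have to stay quiet for the duration, since anything written to the old pool mid-copy is simply lost):

  ceph osd pool create data.new 450   # create the target with the pg_num you actually want
  rados cppool data data.new          # object-by-object copy
  ceph osd pool delete data           # exact delete syntax varies by release; newer ones demand extra confirmation
  ceph osd pool rename data.new data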

On Tuesday, March 12, 2013 at 2:15 PM, Dave Spano wrote:

 I'd rather shut the cloud down and copy the pool to a new one than take any 
 chances of corruption by using an experimental feature. My guess is that 
 there cannot be any i/o to the pool while copying, otherwise you'll lose the 
 changes that are happening during the copy, correct?  
  
 Dave Spano  
 Optogenics  
 Systems Administrator  
  
  
  
 - Original Message -  
  
 From: Greg Farnum g...@inktank.com (mailto:g...@inktank.com)  
 To: Sébastien Han han.sebast...@gmail.com 
 (mailto:han.sebast...@gmail.com)  
 Cc: Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com), 
 ceph-devel ceph-devel@vger.kernel.org 
 (mailto:ceph-devel@vger.kernel.org), Sage Weil s...@inktank.com 
 (mailto:s...@inktank.com), Wido den Hollander w...@42on.com 
 (mailto:w...@42on.com), Sylvain Munaut s.mun...@whatever-company.com 
 (mailto:s.mun...@whatever-company.com), Samuel Just sam.j...@inktank.com 
 (mailto:sam.j...@inktank.com), Vladislav Gorbunov vadi...@gmail.com 
 (mailto:vadi...@gmail.com)  
 Sent: Tuesday, March 12, 2013 4:20:13 PM  
 Subject: Re: OSD memory leaks?  
  
 On Tuesday, March 12, 2013 at 1:10 PM, Sébastien Han wrote:  
   Well to avoid unnecessary data movement, there is also an  
  _experimental_ feature to change on fly the number of PGs in a pool.  
   
  ceph osd pool set poolname pg_num numpgs --allow-experimental-feature  
 Don't do that. We've got a set of 3 patches which fix bugs we know about that 
 aren't in bobtail yet, and I'm sure there's more we aren't aware of…  
 -Greg  
  
 Software Engineer #42 @ http://inktank.com | http://ceph.com  
  
   
  Cheers!  
  --  
  Regards,  
  Sébastien Han.  
   
   
  On Tue, Mar 12, 2013 at 7:09 PM, Dave Spano dsp...@optogenics.com 
  (mailto:dsp...@optogenics.com) wrote:  
   Disregard my previous question. I found my answer in the post below. 
   Absolutely brilliant! I thought I was screwed!  

   http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/8924  

   Dave Spano  
   Optogenics  
   Systems Administrator  



   - Original Message -  

   From: Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com) 

   To: Sébastien Han han.sebast...@gmail.com 
   (mailto:han.sebast...@gmail.com)  
   Cc: Sage Weil s...@inktank.com (mailto:s...@inktank.com), Wido den 
   Hollander w...@42on.com (mailto:w...@42on.com), Gregory Farnum 
   g...@inktank.com (mailto:g...@inktank.com), Sylvain Munaut 
   s.mun...@whatever-company.com (mailto:s.mun...@whatever-company.com), 
   ceph-devel ceph-devel@vger.kernel.org 
   (mailto:ceph-devel@vger.kernel.org), Samuel Just sam.j...@inktank.com 
   (mailto:sam.j...@inktank.com), Vladislav Gorbunov vadi...@gmail.com 
   (mailto:vadi...@gmail.com)  
   Sent: Tuesday, March 12, 2013 1:41:21 PM  
   Subject: Re: OSD memory leaks?  


   If one were stupid enough to have their pg_num and pgp_num set to 8 on 
   two of their pools, how could you fix that?  


   Dave Spano  



   - Original Message -  

   From: Sébastien Han han.sebast...@gmail.com 
   (mailto:han.sebast...@gmail.com)  
   To: Vladislav Gorbunov vadi...@gmail.com (mailto:vadi...@gmail.com)  
   Cc: Sage Weil s...@inktank.com (mailto:s...@inktank.com), Wido den 
   Hollander w...@42on.com (mailto:w...@42on.com), Gregory Farnum 
   g...@inktank.com (mailto:g...@inktank.com), Sylvain Munaut 
   s.mun...@whatever-company.com (mailto:s.mun...@whatever-company.com), 
   Dave Spano dsp...@optogenics.com (mailto:dsp...@optogenics.com), 
   ceph-devel ceph-devel@vger.kernel.org 
   (mailto:ceph-devel@vger.kernel.org), Samuel Just sam.j...@inktank.com 
   (mailto:sam.j...@inktank.com)  
   Sent: Tuesday, March 12, 2013 9:43:44 AM  
   Subject: Re: OSD memory leaks?  

Sorry, i mean pg_num and pgp_num on all pools. Shown by the ceph osd  
dump | grep 'rep size'  





   Well it's still 450 each...  

The default pg_num value 8 is NOT suitable for big cluster.  

   Thanks I know, I'm not new with Ceph. What's your point here? I  
   already said that pg_num was 450...  
   --  
   Regards,  
   Sébastien Han.  


   On Tue, Mar 12, 2013 at 2:00 PM, Vladislav Gorbunov vadi...@gmail.com 
   (mailto:vadi...@gmail.com) wrote:  
Sorry, i mean pg_num and pgp_num on all pools. Shown by the ceph osd  
dump | grep 'rep size'  
The default pg_num value 8 is NOT suitable for big cluster.  
 
2013/3/13 Sébastien Han han.sebast...@gmail.com 
(mailto:han.sebast...@gmail.com):  
 Replica count has been set to 2.  
  
 Why?  
 --  
 Regards,  
 Sébastien Han.  
  
  
 On Tue, Mar 12, 2013 at 12:45 PM, Vladislav Gorbunov 
 vadi...@gmail.com (mailto:vadi...@gmail.com) wrote:  
   FYI I'm using 450 pgs for my pools

Re: [PATCH 2/2] ceph: use i_release_count to indicate dir's completeness

2013-03-12 Thread Greg Farnum
On Monday, March 11, 2013 at 5:42 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 Current ceph code tracks directory's completeness in two places.
 ceph_readdir() checks i_release_count to decide if it can set the
 I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
 flag. This indirection introduces locking complexity.
 
 This patch adds a new variable i_complete_count to ceph_inode_info.
 Set i_release_count's value to it when marking a directory complete.
 By comparing the two variables, we know if a directory is complete
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com 
 (mailto:zheng.z@intel.com)
 ---
 fs/ceph/caps.c | 4 ++--
 fs/ceph/dir.c | 25 +
 fs/ceph/inode.c | 13 +++--
 fs/ceph/mds_client.c | 10 +++---
 fs/ceph/super.h | 41 +++--
 5 files changed, 44 insertions(+), 49 deletions(-)
 
 diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
 index 76634f4..124e8a1 100644
 --- a/fs/ceph/caps.c
 +++ b/fs/ceph/caps.c
 @@ -490,7 +490,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, 
 struct ceph_cap *cap,
 ci-i_rdcache_gen++;
 
 /*
 - * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
 + * if we are newly issued FILE_SHARED, mark dir not complete; we
 * don't know what happened to this directory while we didn't
 * have the cap.
 */
 @@ -499,7 +499,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, 
 struct ceph_cap *cap,
 ci-i_shared_gen++;
 if (S_ISDIR(ci-vfs_inode.i_mode)) {
 dout( marking %p NOT complete\n, ci-vfs_inode);
 - ci-i_ceph_flags = ~CEPH_I_COMPLETE;
 + __ceph_dir_clear_complete(ci);
 }
 }
 }
 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 76821be..11966c4 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
 * falling back to a normal sync readdir if any dentries in the dir
 * are dropped.
 *
 - * I_COMPLETE tells indicates we have all dentries in the dir. It is
 + * Complete dir indicates that we have all dentries in the dir. It is
 * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
 * the MDS if/when the directory is modified).
 */
 @@ -198,8 +198,8 @@ more:
 filp-f_pos++;
 
 /* make sure a dentry wasn't dropped while we didn't have parent lock */
 - if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
 - dout( lost I_COMPLETE on %p; falling back to mds\n, dir);
 + if (!ceph_dir_is_complete(dir)) {
 + dout( lost dir complete on %p; falling back to mds\n, dir);
 err = -EAGAIN;
 goto out;
 }
 @@ -258,7 +258,7 @@ static int ceph_readdir(struct file *filp, void *dirent, 
 filldir_t filldir)
 if (filp-f_pos == 0) {
 /* note dir version at start of readdir so we can tell
 * if any dentries get dropped */
 - fi-dir_release_count = ci-i_release_count;
 + fi-dir_release_count = atomic_read(ci-i_release_count);
 
 dout(readdir off 0 - '.'\n);
 if (filldir(dirent, ., 1, ceph_make_fpos(0, 0),
 @@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, 
 filldir_t filldir)
 if ((filp-f_pos == 2 || fi-dentry) 
 !ceph_test_mount_opt(fsc, NOASYNCREADDIR) 
 ceph_snap(inode) != CEPH_SNAPDIR 
 - (ci-i_ceph_flags  CEPH_I_COMPLETE) 
 + __ceph_dir_is_complete(ci) 
 __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
 spin_unlock(ci-i_ceph_lock);
 err = __dcache_readdir(filp, dirent, filldir);
 @@ -350,7 +350,8 @@ more:
 
 if (!req-r_did_prepopulate) {
 dout(readdir !did_prepopulate);
 - fi-dir_release_count--; /* preclude I_COMPLETE */
 + /* preclude from marking dir complete */
 + fi-dir_release_count--;
 }
 
 /* note next offset and last dentry name */
 @@ -428,9 +429,9 @@ more:
 * the complete dir contents in our cache.
 */
 spin_lock(ci-i_ceph_lock);
 - if (ci-i_release_count == fi-dir_release_count) {
 + if (atomic_read(ci-i_release_count) == fi-dir_release_count) {
 dout( marking %p complete\n, inode);
 - ci-i_ceph_flags |= CEPH_I_COMPLETE;
 + __ceph_dir_set_complete(ci, fi-dir_release_count);
 ci-i_max_offset = filp-f_pos;
 }
 spin_unlock(ci-i_ceph_lock);
 @@ -605,7 +606,7 @@ static struct dentry *ceph_lookup(struct inode *dir, 
 struct dentry *dentry,
 fsc-mount_options-snapdir_name,
 dentry-d_name.len) 
 !is_root_ceph_dentry(dir, dentry) 
 - (ci-i_ceph_flags  CEPH_I_COMPLETE) 
 + __ceph_dir_is_complete(ci) 
 (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
 spin_unlock(ci-i_ceph_lock);
 dout( dir %p complete, -ENOENT\n, dir);
 @@ -909,7 +910,7 @@ static int ceph_rename(struct inode *old_dir, struct 
 dentry *old_dentry,
 */
 
 /* d_move screws up d_subdirs order */
 - ceph_i_clear(new_dir, CEPH_I_COMPLETE);
 + ceph_dir_clear_complete(new_dir);
 
 d_move(old_dentry, new_dentry);
 
 @@ -1079,7 +1080,7 @@ static void ceph_d_prune(struct dentry *dentry)
 if (IS_ROOT(dentry))
 return;
 
 - /* if we are not hashed, we don't affect I_COMPLETE */
 + /* if we are not hashed, we don't affect dir's completeness */
 if (d_unhashed(dentry))
 return;
 
 @@ -1087,7 +1088,7 @@ static 

Re: CephFS Space Accounting and Quotas

2013-03-11 Thread Greg Farnum
On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
 On 03/08/2013 07:05 PM, Greg Farnum wrote:
  On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
   On 03/07/2013 08:15 AM, Jim Schutt wrote:
On 03/06/2013 05:18 PM, Greg Farnum wrote:
 On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
 





   [snip]

  Do you want the MDS log at 10 or 20?
  
 More is better. ;)
 
 
 
 
OK, thanks.


   I've sent some mds logs via private email...

   -- Jim  
   
  I'm going to need to probe into this a bit more, but on an initial
  examination I see that most of your stats are actually happening very
  quickly — it's just that occasionally they take quite a while.
  
  
  
 Interesting...
  
  Going
  through the MDS log for one of those, the inode in question is
  flagged with needsrecover from its first appearance in the log —
  that really shouldn't happen unless a client had write caps on it and
  the client disappeared. Any ideas? The slowness is being caused by
  the MDS going out and looking at every object which could be in the
  file — there are a lot since the file has a listed size of 8GB.
  
  
  
 For this run, the MDS logging slowed it down enough to cause the
 client caps to occasionally go stale. I don't think it's the cause
 of the issue, because I was having it before I turned MDS debugging
 up. My client caps never go stale at, e.g., debug mds 5.

Oh, so this might be behaviorally different than you were seeing before? Drat.

You had said before that each newfstatat was taking tens of seconds, whereas in 
the strace log you sent along most of the individual calls were taking a bit 
less than 20 milliseconds. Do you have an strace of them individually taking 
much more than that, or were you just noticing that they took a long time in 
aggregate?
I suppose if you were going to run it again then just the message logging could 
also be helpful. That way we could at least check and see the message delays 
and if the MDS is doing other work in the course of answering a request.

 Otherwise, there were no signs of trouble while writing the files.
  
 Can you suggest which kernel client debugging I might enable that
 would help understand what is happening? Also, I have the full
 MDS log from writing the files, if that will help. It's big (~10 GiB).
  
  (There are several other mysteries here that can probably be traced
  to different varieties of non-optimal and buggy code as well — there
  is a client which has write caps on the inode in question despite it
  needing recovery, but the recovery isn't triggered until the stat
  event occurs, etc).
  
  
  
 OK, thanks for taking a look. Let me know if there is other
 logging I can enable that will be helpful.

I'm going to want to spend more time with the log I've got, but I'll think 
about if there's a different set of data we can gather less disruptively.  
-Greg



Re: Estimating OSD memory requirements (was Re: stuff for v0.56.4)

2013-03-11 Thread Greg Farnum
On Monday, March 11, 2013 at 8:10 AM, Bryan K. Wright wrote:

 s...@inktank.com said:
  On Thu, 7 Mar 2013, Bryan K. Wright wrote:
  
  s...@inktank.com said:
   - pg log trimming (probably a conservative subset) to avoid memory bloat 
  
  
  
  Anything that reduces the size of OSD processes would be appreciated.
  You can probably do this with just
  log max recent = 1000
  By default it's keeping 100k lines of logs in memory, which can eat a lot of
  ram (but is great when debugging issues).
 
 
 
 Thanks for the tip about log max recent. I've made this 
 change, but it doesn't seem to significantly reduce the size of the 
 OSD processes.
 
 In general, are there some rules of thumb for estimating the
 memory requirements for OSDs? I see processes blow up to 8gb of 
 resident memory sometimes. If I need to allow for that much memory
 per OSD process, I may have to just walk away from ceph.
 
 Does the memory usage scale with the size of the disks?
 I've been trying to run 12 OSDs with 12 2TB disks on a single box.
 Would I be better off (memory-usage-wise) if I RAIDed the disks
 together and used a single OSD process?
 


Memory use depends on several things, but the most important are how many PGs 
the daemon is hosting, and whether it's undergoing recovery of some kind. 
(Absolute disk size is not involved.) If you're getting up to 8GB per, it 
sounds as if you may have a bit too many PGs.
You could try RAIDing some of your drives together instead, yes -- memory & CPU 
utilization is one of the trade offs there, balanced against larger discrete 
failure units and the loss of space or reliability (depending on the RAID 
chosen).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
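
For anyone following along, the in-memory log setting Sage mentioned goes in the [osd] (or [global]) section of ceph.conf; the value below is just the suggested starting point:

  [osd]
      ; keep 1000 recent log entries in memory instead of the default 100k
      log max recent = 1000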





Re: CephFS Space Accounting and Quotas

2013-03-11 Thread Greg Farnum
On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote:
 On 03/11/2013 09:48 AM, Greg Farnum wrote:
  On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
   
   For this run, the MDS logging slowed it down enough to cause the
   client caps to occasionally go stale. I don't think it's the cause
   of the issue, because I was having it before I turned MDS debugging
   up. My client caps never go stale at, e.g., debug mds 5.
  
  
  
  Oh, so this might be behaviorally different than you were seeing before? 
  Drat.
  
  You had said before that each newfstatat was taking tens of seconds,
  whereas in the strace log you sent along most of the individual calls
  were taking a bit less than 20 milliseconds. Do you have an strace of
  them individually taking much more than that, or were you just
  noticing that they took a long time in aggregate?
 
 
 
 When I did the first strace, I didn't turn on timestamps, and I was
 watching it scroll by. I saw several stats in a row take ~30 secs,
 at which point I got bored, and took a look at the strace man page to
 figure out how to get timestamps ;)
 
 Also, another difference is for that test, I was looking at files
 I had written the day before, whereas for the strace log I sent,
 there was only several minutes between writing and the strace of find.
 
 I thought I had eliminated the page cache issue by using fdatasync
 when writing the files. Perhaps the real issue is affected by that
 delay?

I'm not sure. I can't think of any mechanism by which waiting longer would 
increase the time lags, though, so I doubt it.

  I suppose if you were going to run it again then just the message
  logging could also be helpful. That way we could at least check and
  see the message delays and if the MDS is doing other work in the
  course of answering a request.
 
 
 
 I can do as many trials as needed to isolate the issue.
 
 What message debugging level is sufficient on the MDS; 1?
Yep, that will capture all incoming and outgoing messages. :) 
 
 If you want I can attempt to duplicate my memory of the first
 test I reported, writing the files today and doing the strace
 tomorrow (with timestamps, this time).
 
 Also, would it be helpful to write the files with minimal logging, in
 hopes of inducing minimal timing changes, then upping the logging
 for the stat phase?

Well that would give us better odds of not introducing failures of any kind 
during the write phase, and then getting accurate information on what's 
happening during the stats, so it probably would. Basically I'd like as much 
logging as possible without changing the states the system goes through. ;)
-Greg
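
In ceph.conf terms that is roughly the following on the MDS host, set for the stat phase only so the write phase stays undisturbed:

  [mds]
      debug mds = 20   ; as much MDS logging as you can tolerate
      debug ms = 1     ; log every message sent and received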



Re: CephFS Space Accounting and Quotas

2013-03-08 Thread Greg Farnum
On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
 On 03/07/2013 08:15 AM, Jim Schutt wrote:
  On 03/06/2013 05:18 PM, Greg Farnum wrote:
   On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
   
  
  
  
 [snip]
  
Do you want the MDS log at 10 or 20?
   More is better. ;)
   
   
   
  OK, thanks.
  
 I've sent some mds logs via private email...
  
 -- Jim  
I'm going to need to probe into this a bit more, but on an initial examination 
I see that most of your stats are actually happening very quickly — it's just 
that occasionally they take quite a while. Going through the MDS log for one of 
those, the inode in question is flagged with needsrecover from its first 
appearance in the log — that really shouldn't happen unless a client had write 
caps on it and the client disappeared. Any ideas? The slowness is being caused 
by the MDS going out and looking at every object which could be in the file — 
there are a lot since the file has a listed size of 8GB.
(There are several other mysteries here that can probably be traced to 
different varieties of non-optimal and buggy code as well — there is a client 
which has write caps on the inode in question despite it needing recovery, but 
the recovery isn't triggered until the stat event occurs, etc).
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com 


Re: MDS running at 100% CPU, no clients

2013-03-07 Thread Greg Farnum
This isn't bringing up anything in my brain, but I don't know what that 
_sample() function is actually doing — did you get any farther into it?
-Greg
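
For what it's worth, judging from the backtrace quoted below, a sampler like that is essentially re-reading /proc/self/maps line by line on every tick. A rough sketch of that pattern (not the actual Ceph code) is:

  /* Rough sketch, not the MemoryModel implementation: total up the writable
   * anonymous/heap mappings by scanning /proc/self/maps.  The kernel
   * regenerates that file on every read, so for a process with a huge,
   * churning address space, doing this once per tick gets expensive. */
  #include <stdio.h>
  #include <string.h>

  static long sample_heap_kb(void)
  {
          FILE *f = fopen("/proc/self/maps", "r");
          char line[512];
          long kb = 0;

          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f)) {
                  unsigned long start, end;
                  char perms[8];

                  if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3 &&
                      perms[1] == 'w' &&
                      (strstr(line, "[heap]") || !strchr(line, '/')))
                          kb += (long)(end - start) / 1024;
          }
          fclose(f);
          return kb;
  }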

On Wednesday, March 6, 2013 at 6:23 PM, Noah Watkins wrote:

 Which, looks to be in a tight loop in the memory model _sample…
  
 (gdb) bt
 #0 0x7f0270d84d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
 #1 0x7f027046dd88 in std::__basic_file<char>::xsgetn(char*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #2 0x7f027046f4c5 in std::basic_filebuf<char, std::char_traits<char> >::underflow() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #3 0x7f0270467ceb in std::basic_istream<char, std::char_traits<char> >& std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #4 0x0072bdd4 in MemoryModel::_sample(MemoryModel::snap*) ()
 #5 0x005658db in MDCache::check_memory_usage() ()
 #6 0x004ba929 in MDS::tick() ()
 #7 0x00794c65 in SafeTimer::timer_thread() ()
 #8 0x007958ad in SafeTimerThread::entry() ()
 #9 0x7f0270d7de9a in start_thread () from 
 /lib/x86_64-linux-gnu/libpthread.so.0
  
 On Mar 6, 2013, at 6:18 PM, Noah Watkins jayh...@cs.ucsc.edu 
 (mailto:jayh...@cs.ucsc.edu) wrote:
  
   
  On Mar 6, 2013, at 5:57 PM, Noah Watkins jayh...@cs.ucsc.edu 
  (mailto:jayh...@cs.ucsc.edu) wrote:
   
   The MDS process in my cluster is running at 100% CPU. In fact I thought 
   the cluster came down, but rather an ls was taking a minute. There aren't 
   any clients active. I've left the process running in case there is any 
   probing you'd like to do on it:

   virt res cpu
   4629m 88m 5260 S 92 1.1 113:32.79 ceph-mds

   Thanks,
   Noah
   
   
   
   
  This is a ceph-mds child thread under strace. The only thread
  that appears to be doing anything.
   
  root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372
  Process 3372 attached - interrupt to quit
  read(1649, "7f0203235000-7f0203236000 ---p 0"..., 8191) = 4050
  read(1649, "7f0205053000-7f0205054000 ---p 0"..., 8191) = 4050
  read(1649, "7f0206e71000-7f0206e72000 ---p 0"..., 8191) = 4050
  read(1649, "7f0214144000-7f0214244000 rw-p 0"..., 8191) = 4020
  read(1649, "7f0215f62000-7f0216062000 rw-p 0"..., 8191) = 4020
  read(1649, "7f0217d8-7f0217e8 rw-p 0"..., 8191) = 4020
  read(1649, "7f0219b9e000-7f0219c9e000 rw-p 0"..., 8191) = 4020
  ...
   
  That file looks to be:
   
  ceph-mds 3337 root 1649r REG 0,3 0 266903 /proc/3337/maps
   
  (3337 is the parent process).
  





Re: changes to rados command

2013-03-07 Thread Greg Farnum
On Thursday, March 7, 2013 at 11:25 AM, Andrew Hume wrote:
 
 in order to make the rados command more useful in scripts,
 i'd like to make a change, specifically change to
 
 rados -p pool getomapval obj key [fmt]
 
 where fmt is an optional formatting parameter.
 i've implemented 'str' which will print the value as an unadorned string.
 
 what is the process for doing this?
 
Patch submission, you mean? Github pull requests, sending a pull request with a 
git URL to the list, or sending straight patches to the list are all good. I'll 
like you more if you give me a URL of some form instead of making me get the 
patches out of email and into my git repo correctly, though. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
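
For the archives, the proposed invocation would look something like this (pool, object, and key names are examples):

  # existing form
  rados -p mypool getomapval myobject mykey
  # proposed form: print the value as an unadorned string, handy in scripts
  VALUE=$(rados -p mypool getomapval myobject mykey str)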




Re: changes to rados command

2013-03-07 Thread Greg Farnum
(Re-added the list for future reference)

Well, you'll need to learn how to use git at a basic level in order to be able 
to work effectively on Ceph (or most other open-source projects).
Some links that might be helpful:
http://www.joelonsoftware.com/items/2010/03/17.html
http://www.ibm.com/developerworks/library/l-git-subversion-1/
http://try.github.com/

I haven't been through these all thoroughly, but the first one should describe 
the mental model changes that motivate git, the second looks to be a deep 
tutorial, and the third will teach you the mechanics. :)


Github pull requests are a Github nicety; their website will teach you how to 
use them. A simple git URL just requires that your git repository be accessible 
over the internet, and then you tell us what the URL is and what branch to pull 
from, and we can do so. (This of course requires that your changes actually be 
in a branch, so you'll need to have the commits arranged nicely and such.)
-Greg
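
The mechanics end up being roughly this (repository URL, remote, and branch name are examples):

  git clone https://github.com/ceph/ceph.git
  cd ceph
  git checkout -b wip-rados-omap-fmt     # do the work on a topic branch
  # ... edit, build, test ...
  git commit -a -s                       # -s adds the Signed-off-by line
  git push <your-public-remote> wip-rados-omap-fmt
  # then mail the list the public repo URL plus that branch name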


On Thursday, March 7, 2013 at 12:42 PM, Andrew Hume wrote:

 i don't know how to do the first two (but i am a quickish learner).
 i know how to type git diff | mail already.
 if you can guide me a little on how to do the git things, i'll do those.
 
 
 On Mar 7, 2013, at 1:37 PM, Greg Farnum wrote:
  On Thursday, March 7, 2013 at 11:25 AM, Andrew Hume wrote:
   
   in order to make the rados command more useful in scripts,
   i'd like to make a change, specifically change to
   
   rados -p pool getomapval obj key [fmt]
   
   where fmt is an optional formatting parameter.
   i've implemented 'str' which will print the value as an unadorned string.
   
   what is the process for doing this?
   Patch submission, you mean? Github pull requests, sending a pull request 
   with a git URL to the list, or sending straight patches to the list are all 
   good. I'll like you more if you give me a URL of some form instead of 
  making me get the patches out of email and into my git repo correctly, 
  though. :)
  -Greg
  Software Engineer #42 @ http://inktank.com | http://ceph.com
  
  
 
 
 
 ---
 Andrew Hume
 623-551-2845 (VO and best)
 973-236-2014 (NJ)
 and...@research.att.com (mailto:and...@research.att.com)





Re: [PATCH 1/2] ceph: increase i_release_count when clear I_COMPLETE flag

2013-03-07 Thread Greg Farnum
I'm pulling this in for now to make sure this clears out that ENOENT bug we hit 
— but shouldn't we be fixing ceph_i_clear() to always bump the i_release_count? 
It doesn't seem like it would ever be correct without it, and these are the 
only two callers.  

The second one looks good to us and we'll test it but of course that can't go 
upstream through our tree.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
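
Roughly what I mean, sketched against the helpers visible in the quoted patch (field and lock names are taken from that code; this is a sketch, not a tested change):

  /* sketch: have the helper itself bump the release count whenever the
   * I_COMPLETE flag is dropped, so no caller can forget to do it */
  static inline void ceph_i_clear(struct inode *inode, unsigned mask)
  {
          struct ceph_inode_info *ci = ceph_inode(inode);

          spin_lock(&ci->i_ceph_lock);
          if (mask & CEPH_I_COMPLETE)
                  ci->i_release_count++;
          ci->i_ceph_flags &= ~mask;
          spin_unlock(&ci->i_ceph_lock);
  }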


On Thursday, March 7, 2013 at 3:36 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com)
  
 If some dentries were pruned or FILE_SHARED cap was revoked while
 readdir is in progress. make sure ceph_readdir() does not mark the
 directory as complete.
  
 Signed-off-by: Yan, Zheng zheng.z@intel.com 
 (mailto:zheng.z@intel.com)
 ---
 fs/ceph/caps.c | 1 +
 fs/ceph/dir.c | 13 +++--
 2 files changed, 12 insertions(+), 2 deletions(-)
  
 diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
 index 76634f4..35cebf3 100644
 --- a/fs/ceph/caps.c
 +++ b/fs/ceph/caps.c
 @@ -500,6 +500,7 @@ static void __check_cap_issue(struct ceph_inode_info *ci, 
 struct ceph_cap *cap,
 if (S_ISDIR(ci-vfs_inode.i_mode)) {
 dout( marking %p NOT complete\n, ci-vfs_inode);
 ci-i_ceph_flags = ~CEPH_I_COMPLETE;
 + ci-i_release_count++;
 }
 }
 }
 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 76821be..068304c 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -909,7 +909,11 @@ static int ceph_rename(struct inode *old_dir, struct 
 dentry *old_dentry,
 */
  
 /* d_move screws up d_subdirs order */
 - ceph_i_clear(new_dir, CEPH_I_COMPLETE);
 + struct ceph_inode_info *ci = ceph_inode(new_dir);
 + spin_lock(ci-i_ceph_lock);
 + ci-i_ceph_flags = ~CEPH_I_COMPLETE;
 + ci-i_release_count++;
 + spin_unlock(ci-i_ceph_lock);
  
 d_move(old_dentry, new_dentry);
  
 @@ -1073,6 +1077,7 @@ static int ceph_snapdir_d_revalidate(struct dentry 
 *dentry,
 */
 static void ceph_d_prune(struct dentry *dentry)
 {
 + struct ceph_inode_info *ci;
 dout(ceph_d_prune %p\n, dentry);
  
 /* do we have a valid parent? */
 @@ -1087,7 +1092,11 @@ static void ceph_d_prune(struct dentry *dentry)
 * we hold d_lock, so d_parent is stable, and d_fsdata is never
 * cleared until d_release
 */
 - ceph_i_clear(dentry-d_parent-d_inode, CEPH_I_COMPLETE);
 + ci = ceph_inode(dentry-d_parent-d_inode);
 + spin_lock(ci-i_ceph_lock);
 + ci-i_ceph_flags = ~CEPH_I_COMPLETE;
 + ci-i_release_count++;
 + spin_unlock(ci-i_ceph_lock);
 }
  
 /*
 --  
 1.7.11.7





CephFS Space Accounting and Quotas (was: CephFS First product release discussion)

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
 On 03/05/2013 12:33 PM, Sage Weil wrote:
 Running 'du' on each directory would be much faster with Ceph since it
 tracks the subdirectories and shows their total size with an 'ls -al'.
 
Environments with 100k users also tend to be very dynamic with adding 
and
removing users all the time, so creating separate filesystems for them 
would
be very time consuming.
 
Now, I'm not talking about enforcing soft or hard quotas, I'm just 
talking
about knowing how much space uid X and Y consume on the filesystem.

   
   
  The part I'm most unclear on is what use cases people have where uid X and  
  Y are spread around the file system (not in a single or small set of sub  
  directories) and per-user (not, say, per-project) quotas are still  
  necessary. In most environments, users get their own home directory and  
  everything lives there...
  
  
  
 Hmmm, is there a tool I should be using that will return the space
 used by a directory, and all its descendants?
  
 If it's 'du', that tool is definitely not fast for me.
  
 I'm doing an 'strace du -s path', where path has one
 subdirectory which contains ~600 files. I've got ~200 clients
 mounting the file system, and each client wrote 3 files in that
 directory.
  
 I'm doing the 'du' from one of those nodes, and the strace is showing
 me du is doing a 'newfstat' for each file. For each file that was
 written on a different client from where du is running, that 'newfstat'
 takes tens of seconds to return. Which means my 'du' has been running
 for quite some time and hasn't finished yet
  
 I'm hoping there's another tool I'm supposed to be using that I
 don't know about yet. Our use case includes tens of millions
 of files written from thousands of clients, and whatever tool
 we use to do space accounting needs to not walk an entire directory
 tree, checking each file.

Check out the directory sizes with ls -l or whatever — those numbers are 
semantically meaningful! :)

Unfortunately we can't (currently) use those recursive statistics to do 
proper hard quotas on subdirectories as they're lazily propagated following 
client ops, not as part of the updates. (Lazily in the technical sense — it's 
actually quite fast in general). But they'd work fine for soft quotas if 
somebody wrote the code, or to block writes on a slight time lag.
-Greg
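
Concretely, on a CephFS mount the size reported for a directory is the recursive byte total of everything beneath it, and the same recursive statistics are exposed as virtual xattrs (the exact xattr names here are from memory, so treat them as approximate):

  ls -ld /mnt/cephfs/someuser           # st_size of a directory = recursive bytes below it
  getfattr -n ceph.dir.rbytes /mnt/cephfs/someuser
  getfattr -n ceph.dir.rfiles /mnt/cephfs/someuser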



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 11:58 AM, Jim Schutt wrote:
 On 03/06/2013 12:13 PM, Greg Farnum wrote:
  Check out the directory sizes with ls -l or whatever — those numbers are 
  semantically meaningful! :)
  
  
 That is just exceptionally cool!
  
   
  Unfortunately we can't (currently) use those recursive statistics
  to do proper hard quotas on subdirectories as they're lazily
  propagated following client ops, not as part of the updates. (Lazily
  in the technical sense — it's actually quite fast in general). But
  they'd work fine for soft quotas if somebody wrote the code, or to
  block writes on a slight time lag.
  
  
  
 'ls -lh dir' seems to be just the thing if you already know dir.
  
 And it's perfectly suitable for our use case of not scheduling
 new jobs for users consuming too much space.
  
 I was thinking I might need to find a subtree where all the
 subdirectories are owned by the same user, on the theory that
 all the files in such a subtree would be owned by that same
 user. E.g., we might want such a capability to manage space per
 user in shared project directories.
  
 So, I tried 'find dir -type d -exec ls -lhd {} \;'
  
 Unfortunately, that ended up doing a 'newfstatat' on each file
 under dir, evidently to learn if it was a directory. The
 result was that same slowdown for files written on other clients.
  
 Is there some other way I should be looking for directories if I
 don't already know what they are?
  
 Also, this issue of stat on files created on other clients seems
 like it's going to be problematic for many interactions our users
 will have with the files created by their parallel compute jobs -
 any suggestion on how to avoid or fix it?
  

Brief background: stat is required to provide file size information, and so 
when you do a stat Ceph needs to find out the actual file size. If the file is 
currently in use by somebody, that requires gathering up the latest metadata 
from them.
Separately, while Ceph allows a client and the MDS to proceed with a bunch of 
operations (ie, mknod) without having it go to disk first, it requires anything 
which is visible to a third party (another client) be durable on disk for 
consistency reasons.

These combine to mean that if you do a stat on a file which a client currently 
has buffered writes for, that buffer must be flushed out to disk before the 
stat can return. This is the usual cause of the slow stats you're seeing. You 
should be able to adjust dirty data thresholds to encourage faster writeouts, 
do fsyncs once a client is done with a file, etc in order to minimize the 
likelihood of running into this.
Also, I'd have to check but I believe opening a file with LAZY_IO or whatever 
will weaken those requirements — it's probably not the solution you'd like here 
but it's an option, and if this turns out to be a serious issue then config 
options to reduce consistency on certain operations are likely to make their 
way into the roadmap. :)
-Greg
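
As a minimal illustration of the fsync-when-done suggestion on the writer side (plain POSIX, nothing Ceph-specific):

  #include <unistd.h>

  /* flush this client's buffered writes so a later stat from another
   * client doesn't have to wait for the dirty data to be written out */
  static int finish_file(int fd)
  {
          if (fsync(fd) < 0)      /* or fdatasync(fd) if only the data matters */
                  return -1;
          return close(fd);
  }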



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
 On 03/06/2013 01:21 PM, Greg Farnum wrote:
Also, this issue of stat on files created on other clients seems
like it's going to be problematic for many interactions our users
will have with the files created by their parallel compute jobs -
any suggestion on how to avoid or fix it?

   
   
  Brief background: stat is required to provide file size information,
  and so when you do a stat Ceph needs to find out the actual file
  size. If the file is currently in use by somebody, that requires
  gathering up the latest metadata from them. Separately, while Ceph
  allows a client and the MDS to proceed with a bunch of operations
  (ie, mknod) without having it go to disk first, it requires anything
  which is visible to a third party (another client) be durable on disk
  for consistency reasons.
   
  These combine to mean that if you do a stat on a file which a client
  currently has buffered writes for, that buffer must be flushed out to
  disk before the stat can return. This is the usual cause of the slow
  stats you're seeing. You should be able to adjust dirty data
  thresholds to encourage faster writeouts, do fsyncs once a client is
  done with a file, etc in order to minimize the likelihood of running
  into this. Also, I'd have to check but I believe opening a file with
  LAZY_IO or whatever will weaken those requirements — it's probably
  not the solution you'd like here but it's an option, and if this
  turns out to be a serious issue then config options to reduce
  consistency on certain operations are likely to make their way into
  the roadmap. :)
  
  
  
 That all makes sense.
  
 But, it turns out the files in question were written yesterday,
 and I did the stat operations today.
  
 So, shouldn't the dirty buffer issue not be in play here?
Probably not. :/


 Is there anything else that might be going on?
In that case it sounds like either there's a slowdown on disk access that is 
propagating up the chain very bizarrely, there's a serious performance issue on 
the MDS (ie, swapping for everything), or the clients are still holding onto 
capabilities for the files in question and you're running into some issues with 
the capability revocation mechanisms.
Can you describe your setup a bit more? What versions are you running, kernel 
or userspace clients, etc. What config options are you setting on the MDS? 
Assuming you're on something semi-recent, getting a perfcounter dump from the 
MDS might be illuminating as well.

We'll probably want to get a high-debug log of the MDS during these slow stats 
as well.
-Greg
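
If it helps, the perfcounter dump comes out of the MDS admin socket, along the lines of the following (the socket path is an example, and the exact command name has varied a bit between releases):

  ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok perf dump
  # older builds spell it perfcounters_dump
  ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok config show | grep debug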



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
 When I'm doing these stat operations the file system is otherwise
 idle.

What's the cluster look like? This is just one active MDS and a couple hundred 
clients?

 What is happening is that once one of these slow stat operations
 on a file completes, it never happens again for that file, from
 any client. At least, that's the case if I'm not writing to
 the file any more. I haven't checked if appending to the files
 restarts the behavior.

I assume it'll come back, but if you could verify that'd be good.

 
 On the client side I'm running with 3.8.2 + the ceph patch queue
 that was merged into 3.9-rc1.
 
 On the server side I'm running recent next branch (commit 0f42eddef5),
 with the tcp receive socket buffer option patches cherry-picked.
 I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num
 rather than pg_bits to set initial number of PGs (same for pgp_num),
 and a patch that lets me run with just one pool that contains both
 data and metadata. I'm testing data distribution uniformity with 512K PGs.
 
 My MDS tunables are all at default settings.
 
  
  We'll probably want to get a high-debug log of the MDS during these slow 
  stats as well.
 
 OK.
 
 Do you want me to try to reproduce with a more standard setup?
No, this is fine. 
 
 Also, I see Sage just pushed a patch to pgid decoding - I expect
 I need that as well, if I'm running the latest client code.

Yeah, if you've got the commit it references you'll want it.

 Do you want the MDS log at 10 or 20?
More is better. ;)



CephFS First product release discussion

2013-03-05 Thread Greg Farnum
This is a companion discussion to the blog post at 
http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!

The short and slightly alternate version: I spent most of about two weeks 
working on bugs related to snapshots in the MDS, and we started realizing that 
we could probably do our first supported release of CephFS and the related 
infrastructure much sooner if we didn't need to support all of the whizbang 
features. (This isn't to say that the base feature set is stable now, but it's 
much closer than when you turn on some of the other things.) I'd like to get 
feedback from you in the community on what minimum supported feature set would 
prompt or allow you to start using CephFS in real environments — not what you'd 
*like* to see, but what you *need* to see. This will allow us at Inktank to 
prioritize more effectively and hopefully get out a supported release much more 
quickly! :)

The current proposed feature set is basically what's left over after we've 
trimmed off everything we can think to split off, but if any of the proposed 
included features are also particularly important or don't matter, be sure to 
mention them (NFS export in particular — it works right now but isn't in great 
shape due to NFS filehandle caching).

Thanks,
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com  




Re: CephFS First product release discussion

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 10:08 AM, Wido den Hollander wrote:
 On 03/05/2013 06:03 PM, Greg Farnum wrote:
  This is a companion discussion to the blog post at 
  http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!
   
  The short and slightly alternate version: I spent most of about two weeks 
  working on bugs related to snapshots in the MDS, and we started realizing 
  that we could probably do our first supported release of CephFS and the 
  related infrastructure much sooner if we didn't need to support all of the 
  whizbang features. (This isn't to say that the base feature set is stable 
  now, but it's much closer than when you turn on some of the other things.) 
  I'd like to get feedback from you in the community on what minimum 
  supported feature set would prompt or allow you to start using CephFS in 
  real environments — not what you'd *like* to see, but what you *need* to 
  see. This will allow us at Inktank to prioritize more effectively and 
  hopefully get out a supported release much more quickly! :)
   
  The current proposed feature set is basically what's left over after we've 
  trimmed off everything we can think to split off, but if any of the 
  proposed included features are also particularly important or don't matter, 
  be sure to mention them (NFS export in particular — it works right now but 
  isn't in great shape due to NFS filehandle caching).
  
 Great news! Although RBD and RADOS itself are already great, a lot of  
 applications would still require a shared filesystem.
  
 Think about a (Cloud|Open)Stack environment with thousands of instances  
 running that also need some form of shared filesystem.
  
 One thing I'm missing though is user-quotas, have they been discussed at  
 all and what would the work to implement those involve?
  
 I know it would require a lot more tracking per file so it's not that  
 easy and would certainly not make it into a first release, but are they  
 on the roadmap at all?

Not at present. I think there are some tickets related to this in the tracker 
as feature requests, but CephFS needs more groundwork about multi-tenancy in 
general before we can do reasonable planning around a robust user quota 
feature. (Near-real-time hacks are possible now based around the rstats 
infrastructure and I believe somebody has built them, though I've never seen 
them myself.)
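(For anyone who wants to experiment with that kind of hack: the sketch below assumes your client exposes the recursive stats as virtual xattrs, with names along the lines of ceph.dir.rbytes; both the names and their availability vary by version, and the directory path and quota value are placeholders.)

  #include <sys/xattr.h>
  #include <cstdio>
  #include <cstdlib>

  int main(int argc, char **argv) {
    const char *dir = argc > 1 ? argv[1] : "/mnt/ceph/home/alice";  // placeholder path
    char buf[64];
    // ceph.dir.rbytes: recursive byte count maintained by the MDS rstats
    ssize_t n = getxattr(dir, "ceph.dir.rbytes", buf, sizeof(buf) - 1);
    if (n < 0) { perror("getxattr"); return 1; }
    buf[n] = '\0';
    long long used = atoll(buf);
    long long quota = 10LL << 30;  // made-up 10 GB "quota"
    printf("%s uses %lld bytes (%s quota)\n", dir, used,
           used > quota ? "over" : "under");
    return 0;
  }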
-Greg



Re: [PATCH] libceph: clean up skipped message logic

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 7:33 AM, Alex Elder wrote:
 (This patch is available as the top commit in branch
 review/wip-4324 in the ceph-client git repository.)
 
 In ceph_con_in_msg_alloc() it is possible for a connection's
 alloc_msg method to indicate an incoming message should be skipped.
 By default, read_partial_message() initializes the skip variable
 to 0 before it gets provided to ceph_con_in_msg_alloc().
 
 The osd client, mon client, and mds client each supply an alloc_msg
 method. The mds client always assigns skip to be 0.
 
 The other two leave the skip value as-is, or assign it to zero,
 except:
 - if no (osd or mon) request having the given tid is found, in
 which case skip is set to 1 and NULL is returned; or
 - in the osd client, if the data of the reply message is not
 adequate to hold the message to be read, it assigns skip
 value 1 and returns NULL.
 So the returned message pointer will always be NULL if skip is ever
 non-zero.
 
 Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
 state of affairs more obvious. Add a comment explaining how a null
 message pointer can mean either a message that should be skipped or
 a problem allocating a message.
 
 This resolves:
 http://tracker.ceph.com/issues/4324
 
 Reported-by: Greg Farnum g...@inktank.com (mailto:g...@inktank.com)
 Signed-off-by: Alex Elder el...@inktank.com (mailto:el...@inktank.com)
 ---
 net/ceph/messenger.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 5bf1bb5..644cb6c 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -2860,18 +2860,21 @@ static int ceph_con_in_msg_alloc(struct ceph_connection *con, int *skip)
  ceph_msg_put(msg);
  return -EAGAIN;
  }
 - con->in_msg = msg;
 - if (con->in_msg) {
 + if (msg) {
 + BUG_ON(*skip);
 + con->in_msg = msg;
  con->in_msg->con = con->ops->get(con);
  BUG_ON(con->in_msg->con == NULL);
 - }
 - if (*skip) {
 - con->in_msg = NULL;
 - return 0;
 - }
 - if (!con->in_msg) {
 - con->error_msg =
 - "error allocating memory for incoming message";
 + } else {
 + /*
 + * Null message pointer means either we should skip
 + * this message or we couldn't allocate memory. The
 + * former is not an error.
 + */
 + if (*skip)
 + return 0;
 + con->error_msg = "error allocating memory for incoming message";
 +
  return -ENOMEM;
  }
  memcpy(&con->in_msg->hdr, &con->in_hdr, sizeof(con->in_hdr));
 -- 
 1.7.9.5

Reviewed-by: Greg Farnum g...@inktank.com 



Re: When ceph synchronizes journal to disk?

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 5:54 AM, Wido den Hollander wrote:
 On 03/05/2013 05:33 AM, Xing Lin wrote:
  Hi Gregory,
   
  Thanks for your reply.
   
  On 03/04/2013 09:55 AM, Gregory Farnum wrote:
   The journal [min|max] sync interval values specify how frequently
   the OSD's FileStore sends a sync to the disk. However, data is still
   written into the normal filesystem as it comes in, and the normal
   filesystem continues to schedule normal dirty data writeouts. This is
   good — it means that when we do send a sync down you don't need to
   wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
   disk before it's completed.
   
   
   
  I do not think I understand this well. When the writeahead journal mode
  is in use, would you please explain what happens to a single 4M write
  request? I assume that an entry in the journal will be created for this
  write request and after this entry is flushed to the journal disk, Ceph
  returns successful. There should be no IO to the osd's disk. All IOs are
  supposed to go to the journal disk. At a later time, Ceph will start to
  apply these changes to the normal filesystem by reading from the first
  entry at which its previous synchronization stops. Finally, it will read
  this entry and apply this write change to the normal file system. Could
  you please point out where is wrong in my understanding? Thanks,
  
  
  
 All the data goes to the disk in write-back mode so it isn't safe yet  
 until the flush is called. That's why it goes into the journal first, to  
 be consistent at all times.
  
 If you would buffer everything in the journal and flush that at once you  
 would overload the disk for that time.
  
 Let's say you have 300MB in the journal after 10 seconds and you want to  
 flush that at once. That would mean that specific disk is unable to do  
 any other operations than writing at 60MB/sec for 5 seconds.
  
 It's better to always write in write-back mode to the disk and flush at  
 a certain point.
  
 In the meantime the scheduler can do its job to balance between the
 reads and the writes.
  
 Wido
Yep, what Wido said. Specifically, we do force the data to the journal with an 
fsync or equivalent before responding to the client, but once it's stable on 
the journal we give it to the filesystem (without doing any sort of forced 
sync). This is necessary — all reads are served from the filesystem.
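To make the ordering concrete, here's a deliberately simplified sketch of that flow; the types are stand-ins, not the real FileJournal/FileStore classes:

  #include <string>

  struct Journal {
    void append(const std::string &entry) { /* write the entry to the journal device */ }
    void fsync() { /* force journal writes to stable storage */ }
  };
  struct StoreFS {
    void queue_apply(const std::string &txn) { /* normal filesystem write, write-back */ }
    void periodic_sync() { /* driven by the journal [min|max] sync intervals */ }
  };

  void handle_write(Journal &j, StoreFS &fs, const std::string &txn) {
    j.append(txn);        // 1. write-ahead entry
    j.fsync();            // 2. forced stable before we reply
    // 3. ack the client here: the op is durable
    fs.queue_apply(txn);  // 4. handed to the filesystem with no forced sync
    // reads are always served from the filesystem, never from the journal
  }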
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com  




Re: When ceph synchronizes journal to disk? / read request

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 12:37 AM, Dieter Kasper wrote:
 Hi Gregory,
 
 another interesting aspect for me is:
 How will a read-request for this block/sub-block (pending between journal and 
 OSD)
 be satisfied (assuming the client will not cache) ?
 Will this read go to the journal or to the OSD ?
 
All read requests are satisfied from the main OSD store filesystem. Satisfying 
reads from the journal would be extraordinarily complicated and not buy us 
anything that I can think of. (In fact the journal is only read during 
recovery).
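If it helps to see it spelled out, a read boils down to something like this against the object's backing file (illustrative only; the real FileStore path handling is more involved):

  #include <fcntl.h>
  #include <unistd.h>
  #include <string>

  // The journal is never consulted; we just read the file backing the object.
  ssize_t read_object(const std::string &object_file, char *buf, size_t len, off_t off) {
    int fd = open(object_file.c_str(), O_RDONLY);
    if (fd < 0)
      return -1;
    ssize_t r = pread(fd, buf, len, off);
    close(fd);
    return r;
  }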
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com 




Re: [PATCH 6/7] ceph: don't early drop Fw cap

2013-03-05 Thread Greg Farnum
On Monday, March 4, 2013 at 5:57 PM, Yan, Zheng wrote:
 On 03/05/2013 02:26 AM, Gregory Farnum wrote:
  On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng zheng.z@intel.com 
  (mailto:zheng.z@intel.com) wrote:
   From: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com)

   ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
   cap dirty before data is copied to page cache and inode size is
   updated. The optimization avoids slow cap revocation caused by
   balance_dirty_pages(), but introduces inode size update race. If
   ceph_check_caps() flushes the dirty cap before the inode size is
   updated, MDS can miss the new inode size. So just remove the
   optimization.

   Signed-off-by: Yan, Zheng zheng.z@intel.com 
   (mailto:zheng.z@intel.com)
   ---
   fs/ceph/file.c | 42 +-
   1 file changed, 17 insertions(+), 25 deletions(-)

    diff --git a/fs/ceph/file.c b/fs/ceph/file.c
    index a949805..28ef273 100644
    --- a/fs/ceph/file.c
    +++ b/fs/ceph/file.c
    @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
    if (ceph_snap(inode) != CEPH_NOSNAP)
    return -EROFS;

    + sb_start_write(inode->i_sb);
    retry_snap:
    - if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
    - return -ENOSPC;
    + if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
    + ret = -ENOSPC;
    + goto out;
    + }
    __ceph_do_pending_vmtruncate(inode);
    dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
    inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
    @@ -750,29 +753,10 @@ retry_snap:
    ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
    &iocb->ki_pos);
    } else {
    - /*
    - * buffered write; drop Fw early to avoid slow
    - * revocation if we get stuck on balance_dirty_pages
    - */
    - int dirty;
    -
    - spin_lock(&ci->i_ceph_lock);
    - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
    - spin_unlock(&ci->i_ceph_lock);
    - ceph_put_cap_refs(ci, got);
    -
    - ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
    - if ((ret >= 0 || ret == -EIOCBQUEUED) &&
    - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
    - || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
    - err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
    - if (err < 0)
    - ret = err;
    - }
    -
    - if (dirty)
    - __mark_inode_dirty(inode, dirty);
    - goto out;
    + mutex_lock(&inode->i_mutex);
    + ret = __generic_file_aio_write(iocb, iov, nr_segs,
    + &iocb->ki_pos);
    + mutex_unlock(&inode->i_mutex);
   
   
   
  Hmm, you're here passing in a different value than the removed
  generic_file_aio_write() call did — iocb->ki_pos instead of pos.
  Everything else is using the pos parameter so I rather expect that
  should still be used here?
  
  
  
 They always have the same value, see the BUG_ON in generic_file_aio_write()
  
  Also a quick skim of the interfaces makes me think that the two
  versions aren't interchangeable — __generic_file_aio_write() also
  handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?
  
  
  
 ceph has its own code that handles the O_SYNC case. I want to make sb_start_write()
 cover ceph_sync_write(); that's the reason I use __generic_file_aio_write()
 here.
  
 Regards
 Yan, Zheng
  
Ah, yep — sounds good!
-Greg  




Re: [PATCH 0/7] ceph: misc fixes

2013-03-05 Thread Greg Farnum
I've merged this series into the testing branch, with appropriate Reviewed-by 
tags from me (and Sage on #4). Thanks much for the code and helping me go 
through it. :) 
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Monday, March 4, 2013 at 7:38 PM, Yan, Zheng wrote:

 On 03/05/2013 02:49 AM, Gregory Farnum wrote:
  On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng zheng.z@intel.com wrote:
   From: Yan, Zheng zheng.z@intel.com
   
   These patches are also in:
   git://github.com/ukernel/linux.git wip-ceph
  
  
  
  1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's
  concerns on 4 and my questions on 6 I'll be happy to pull these in. :)
 
 
 
 I rechecked locking assumption in patch 4, nothing goes wrong.
 
 Regards
 Yan, Zheng
 
  -Greg 




Re: [PATCH 3/5] ceph: only set message data pointers if non-empty

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 5:53 AM, Alex Elder wrote:
 The ceph file system doesn't typically send information in the
 data portion of a message. (It relies on some functionality
 exported by the osd client to read and write page data.)
 
 There are two spots it does send data though. The value assigned to
 an extended attribute is held in one or more pages allocated by
 ceph_sync_setxattr(). Eventually those pages are assigned to a
 request message in create_request_message().
 
 The second spot is when sending a reconnect message, where a
 ceph pagelist is used to build up an array of snaprealm_reconnect
 structures to send to the mds.
 
 Change it so we only assign the outgoing data information for
 these messages if there is outgoing data to send.
 
 This is related to:
 http://tracker.ceph.com/issues/4284
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
 fs/ceph/mds_client.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)
 
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index 42400ce..ae83aa9 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -1718,7 +1718,12 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
  msg->front.iov_len = p - msg->front.iov_base;
  msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 
 - ceph_msg_data_set_pages(msg, req->r_pages, req->r_num_pages, 0);
 + if (req->r_num_pages) {
 + /* outbound data set only by ceph_sync_setxattr() */
 + BUG_ON(!req->r_pages);
 + ceph_msg_data_set_pages(msg, req->r_pages,
 + req->r_num_pages, 0);
 + }
 
  msg->hdr.data_len = cpu_to_le32(req->r_data_len);
  msg->hdr.data_off = cpu_to_le16(0);
 @@ -2599,10 +2604,13 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
  goto fail;
  }
 
 - ceph_msg_data_set_pagelist(reply, pagelist);
  if (recon_state.flock)
  reply->hdr.version = cpu_to_le16(2);
 - reply->hdr.data_len = cpu_to_le32(pagelist->length);
 + if (pagelist->length) {
 + /* set up outbound data if we have any */
 + reply->hdr.data_len = cpu_to_le32(pagelist->length);
 + ceph_msg_data_set_pagelist(reply, pagelist);
 + }
  ceph_con_send(session->s_con, reply);
 
  mutex_unlock(&session->s_mutex);
 
Reviewed-by: Greg Farnum g...@inktank.com
Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: [PATCH 5/5] libceph: activate message data assignment checks

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 5:53 AM, Alex Elder wrote:
 The mds client no longer tries to assign zero-length message data,
 and the osd client no longer sets its data info more than once.
 This allows us to activate assertions in the messenger to verify
 these things never happen.
 
 This resolves both of these:
 http://tracker.ceph.com/issues/4263
 http://tracker.ceph.com/issues/4284
 
 Signed-off-by: Alex Elder el...@inktank.com (mailto:el...@inktank.com)
 ---
 net/ceph/messenger.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 97506ac..5bf1bb5 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -2677,10 +2677,10 @@ EXPORT_SYMBOL(ceph_con_keepalive);
 void ceph_msg_data_set_pages(struct ceph_msg *msg, struct page **pages,
 unsigned int page_count, size_t alignment)
 {
 - /* BUG_ON(!pages); */
 - /* BUG_ON(!page_count); */
 - /* BUG_ON(msg->pages); */
 - /* BUG_ON(msg->page_count); */
 + BUG_ON(!pages);
 + BUG_ON(!page_count);
 + BUG_ON(msg->pages);
 + BUG_ON(msg->page_count);
 
 msg->pages = pages;
 msg->page_count = page_count;
 @@ -2691,8 +2691,8 @@ EXPORT_SYMBOL(ceph_msg_data_set_pages);
 void ceph_msg_data_set_pagelist(struct ceph_msg *msg,
 struct ceph_pagelist *pagelist)
 {
 - /* BUG_ON(!pagelist); */
 - /* BUG_ON(msg->pagelist); */
 + BUG_ON(!pagelist);
 + BUG_ON(msg->pagelist);
 
 msg->pagelist = pagelist;
 }
 @@ -2700,8 +2700,8 @@ EXPORT_SYMBOL(ceph_msg_data_set_pagelist);
 
 void ceph_msg_data_set_bio(struct ceph_msg *msg, struct bio *bio)
 {
 - /* BUG_ON(!bio); */
 - /* BUG_ON(msg->bio); */
 + BUG_ON(!bio);
 + BUG_ON(msg->bio);
 
 msg->bio = bio;
 }
 @@ -2709,8 +2709,8 @@ EXPORT_SYMBOL(ceph_msg_data_set_bio);
 
 void ceph_msg_data_set_trail(struct ceph_msg *msg, struct ceph_pagelist
 *trail)
 {
 - /* BUG_ON(!trail); */
 - /* BUG_ON(msg->trail); */
 + BUG_ON(!trail);
 + BUG_ON(msg->trail);
 
 msg->trail = trail;
 }
 -- 
 1.7.9.5
 


Reviewed-by: Greg Farnum g...@inktank.com

I'll leave #4 for Josh to review. :)

Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: rbd locking and handling broken clients

2012-06-13 Thread Greg Farnum
On Wednesday, June 13, 2012 at 1:37 PM, Florian Haas wrote:
 Greg,
  
 My understanding of Ceph code internals is far too limited to comment on
 your specific points, but allow me to ask a naive question.
  
 Couldn't you be stealing a lot of ideas from SCSI-3 Persistent
 Reservations? If you had server-side (OSD) persistence of information of
 the this device is in use by X type (where anything other than X would
 get an I/O error when attempting to access data), and you had a manual,
 authenticated override akin to SCSI PR preemption, plus key
 registration/exchange for that authentication, then you would at least
 have to have the combination of a misbehaving OSD plus a malicious
 client for data corruption. A non-malicious but just broken client
 probably won't do.
  
 Clearly I may be totally misguided, as Ceph is fundamentally
 decentralized and SCSI isn't, but if PR-ish behavior comes even close to
 what you're looking for, grabbing those ideas would look better to me
 than designing your own wheel.

Yeah, the problem here is exactly that Ceph (and RBD) are fundamentally 
decentralized. :) I'm not familiar with the SCSI PR mechanism either, but it 
looks to me like it deals in entirely local information — the equivalent with 
RBD would require performing a locking operation on every object in the RBD 
image before you accessed it. We could do that, but then opening an image would 
take time linear in its size… :(
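To put a number on "linear in its size": a PR-style reservation would have to visit every data object backing the image, roughly as sketched below. The block-name format here is an assumption in the spirit of the format-1 naming (prefix plus a 12-hex-digit block index); the point is just the object count.

  #include <cstdint>
  #include <cstdio>
  #include <string>
  #include <vector>

  std::vector<std::string> rbd_data_objects(const std::string &block_prefix,
                                            uint64_t image_size, uint64_t obj_size) {
    std::vector<std::string> names;
    uint64_t count = (image_size + obj_size - 1) / obj_size;
    for (uint64_t i = 0; i < count; ++i) {
      char idx[32];
      snprintf(idx, sizeof(idx), "%012llx", (unsigned long long)i);
      names.push_back(block_prefix + "." + idx);  // one RADOS object per chunk
    }
    return names;  // a 1TB image with 4MB objects means 262144 objects to lock
  }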


On Wednesday, June 13, 2012 at 4:14 PM, Tommi Virtanen wrote:
 On Wed, Jun 13, 2012 at 10:40 AM, Gregory Farnum g...@inktank.com 
 (mailto:g...@inktank.com) wrote:
  2) Client fencing. See http://tracker.newdream.net/issues/2531. There
  is an existing blacklist functionality in the OSDs/OSDMap, where you
  can specify an entity_addr_t (consisting of an IP, a port, and a
  nonce — so essentially unique per-process) which is not allowed to
  communicate with the cluster any longer. The problem with this is that
  
 Does that work even after a TCP connection close  re-establish, where
 the client now has a new source port address? (Perhaps the port is 0
 for clients?)

Precisely — client ports are 0 since they never accept incoming connections.
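For reference, the thing the blacklist keys on is conceptually just this triple (an illustrative struct, not Ceph's actual entity_addr_t):

  #include <cstdint>
  #include <string>

  struct blacklist_key {
    std::string ip;   // client IP
    uint16_t port;    // 0 for clients, since they never accept connections
    uint32_t nonce;   // per-process value, so the entry pins down one client instance
  };
  // Blacklisting one of these fences a single client process without
  // affecting other clients on the same host.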



 You know, I'd be really happy if this could be achieved by means of
 removing cephx keys.

Unfortunately, that wouldn't really solve the problem without dramatically 
decreasing the rotation interval for cluster access keys which cephx shares. 
Alternative (entirely theoretical) security schemes might, but they're well 
behind what's feasible for us to work on any time soon...





Re: mds dump

2012-06-08 Thread Greg Farnum
On Friday, June 8, 2012 at 12:31 PM, Tommi Virtanen wrote:
 On Fri, Jun 8, 2012 at 12:22 PM, Martin Wilderoth
 martin.wilder...@linserv.se (mailto:martin.wilder...@linserv.se) wrote:
  I have removed the data and metadata pool. Do I need to create them again, 
  or
  will they be created automaticly.
  Maybe I need the undocumented way of creating the mds map ?. I would like to
  get an empty cephfs to play with again.
  
  
  
 Just create them again with rados mkpool, that gets you back to square one.
Actually, this doesn't — the MDS uses pool IDs for its access, not pool names, 
so you need to do a bit more (Sage illustrated the simplest route for handling 
that).
-Greg




Re: mount: 10.0.6.10:/: can't read superblock

2012-06-05 Thread Greg Farnum
For future reference, that error was because the active MDS server was in 
replay. I can't tell why it didn't move on to active from what you posted, but 
I imagine it just got a little stuck since restarting made it work out. 
-Greg


On Tuesday, June 5, 2012 at 1:05 PM, Martin Wilderoth wrote:

 Hello Again,
 
 I restarted the mds on all servers and then it worked again
 
 /Regards Martin
 
  Hello 
  
   Hi Martin, 
   
   On 06/05/2012 08:07 PM, Martin Wilderoth wrote: 
Hello 

Is there a way to recover this error. 

mount -t ceph 10.0.6.10:/ /mnt -vv -o 
name=admin,secret=XXX 
[ 506.640433] libceph: loaded (mon/osd proto 15/24, osdmap 5/6 5/6) 
[ 506.650594] ceph: loaded (mds proto 32) 
[ 506.652353] libceph: client0 fsid 
a9d5f9e1-4bb9-4fab-b79b-ba4457631b01 
[ 506.670876] Intel AES-NI instructions are not detected. 
[ 506.678861] libceph: mon0 10.0.6.10:6789 session established 
mount: 10.0.6.10:/: can't read superblock 
   
   
   
   Could you share some more information? For example the output from: ceph 
   -s 
  
  2012-06-05 20:25:05.307914 pg v1189604: 1152 pgs: 1152 active+clean; 191 GB 
  data, 393 GB used, 973 GB / 1379 GB avail 
  2012-06-05 20:25:05.315871 mds e60: 1/1/1 up {0=c=up:replay}, 2 up:standby 
  2012-06-05 20:25:05.315965 osd e1106: 8 osds: 8 up, 8 in 
  2012-06-05 20:25:05.316165 log 2012-06-05 20:24:50.425527 mon.0 
  10.0.6.10:6789/0 75 : [INF] mds.? 10.0.6.11:6800/22974 up:boot 
  2012-06-05 20:25:05.316371 mon e1: 3 mons at 
  {a=10.0.6.10:6789/0,b=10.0.6.11:6789/0,c=10.0.6.12:6789/0} 
  
  
   
   Did you change anything to the cluster since it worked? And what version 
   are you running? 
  
  
  
  I have not done any changes installed at version 0.46 upgraded earlier and 
  have been testing with 
  ceph and ceph-fuse and backuppc. It was during the ceph-fuse it hanged. 
  
  Current version 
  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372) 
  
One of my mds logs has 24G of data. 
   
   Is it still running? 
  I have restarted mds.a and mds.b they seems to be running. But not 
  everything. 
  mds.a was stoped not sure mds.b but it has a big logfile. 
  
   

I have some rbd devices that I would like to keep. 
   
   RBD doesn't use the MDS nor the POSIX filesystem, so you will probably 
   be fine, but we need the output of ceph -s first. 
   
   Does this work? 
   $ rbd ls 
  
  
  this works I'm still using the rbd with no problem 
   $ rados -p rbd ls 
  
  
  seems to work reports something simmilar to 
  rb.0.2.052e 
  rb.0.0.02f2 
  rb.0.7.0345 
  rb.0.7.0896 
  rb.0.0.0102 
  rb.0.9.0172 
  rb.0.1.0350 
  rb.0.4.0180 
  rb.0.4.068b 
  rb.0.5.054c 
  rb.0.2.01e1 
  
   Wido 
   

/Regards Martin 





Re: domino-style OSD crash

2012-06-04 Thread Greg Farnum
This is probably the same/similar to http://tracker.newdream.net/issues/2462, 
no? There's a log there, though I've no idea how helpful it is.


On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:

 Can you send the osd logs? The merge_log crashes are probably fixable
 if I can see the logs.
 
 The leveldb crash is almost certainly a result of memory corruption.
 
 Thanks
 -Sam
 
 On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com 
 (mailto:t...@inktank.com) wrote:
  On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr 
  (mailto:yann.dup...@univ-nantes.fr) wrote:
   Results : Worked like a charm during two days, apart btrfs warn messages
   then OSD begin to crash 1 after all 'domino style'.
  
  
  
  Sorry to hear that. Reading through your message, there seem to be
  several problems; whether they are because of the same root cause, I
  can't tell.
  
  Quick triage to benefit the other devs:
  
  #1: kernel crash, no details available
   1 of the physical machine was in kernel oops state - Nothing was remote
  
  
  
  #2: leveldb corruption? may be memory corruption that started
  elsewhere.. Sam, does this look like the leveldb issue you saw?
   [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
   0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
   (Aborted) **
  
  
  ...
    13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
    leveldb::Slice const&) const+0x4d) [0x6ef69d]
    14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
    const&)+0x9f) [0x6fdd9f]
  
  
  
  #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
    0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
    'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
    thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
  
  
  
  #4: unknown btrfs warnings, there should an actual message above this
  traceback; believed fixed in latest kernel
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
   [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
   [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
   [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
   [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
   [8105a9f0] ? add_wait_queue+0x60/0x60
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
   [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
   [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
  
  
 
 





Re: MDS crash, wont startup again

2012-06-04 Thread Greg Farnum
On Thursday, May 24, 2012 at 5:29 AM, Felix Feinhals wrote:
 Hi,
  
 I was using the Debian packages, but I have now tried building from source.
 I used the same version from GIT
 (cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash
 report.
 Then i applied your patch but again the same crash, i think the
 backtrace is also the same:
  
 (gdb) thread 1
 [Switching to thread 1 (Thread 9564)]#0 0x7f33a3e58ebb in raise
 (sig=<value optimized out>)
 at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
 41 in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
 (gdb) backtrace
 #0 0x7f33a3e58ebb in raise (sig=<value optimized out>)
 at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
 #1 0x0081423e in reraise_fatal (signum=11) at global/signal_handler.cc:58
 #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104
 #3 <signal handler called>
 #4 SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
 at mds/snap.cc:112
 #5 0x0055d58b in MDCache::check_realm_past_parents (this=0x27a7200, realm=0x0)
 at mds/MDCache.cc:4495
 #6 0x00572eec in MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
 at mds/MDCache.cc:4533
 #7 0x005931a0 in MDCache::rejoin_gather_finish (this=0x27a7200) at mds/MDCache.cc:
 #8 0x0059b9d5 in MDCache::rejoin_send_rejoins (this=0x27a7200) at mds/MDCache.cc:3388
 #9 0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at mds/MDS.cc:1404
 #10 0x004c253a in MDS::handle_mds_map (this=0x27bc000, m=<value optimized out>)
 at mds/MDS.cc:968
 #11 0x004c4513 in MDS::handle_core_message (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1651
 #12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1790
 #13 0x004c628b in MDS::ms_dispatch (this=0x27bc000, m=0x27ab800) at mds/MDS.cc:1602
 #14 0x00732609 in Messenger::ms_deliver_dispatch (this=0x279f680) at msg/Messenger.h:178
 #15 SimpleMessenger::dispatch_entry (this=0x279f680) at msg/SimpleMessenger.cc:363
 #16 0x007207ad in SimpleMessenger::DispatchThread::entry() ()
 #17 0x7f33a3e508ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
 #18 0x7f33a26d892d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
 #19 0x in ?? ()
  
 Any more ideas? :)
 Or can i get you more debugging output?

Sorry for the delay — I'm afraid that's a hazard of using the MDS before we're 
ready to support it. :(
Anyway, I haven't had a lot of time to look into this, but that makes it look 
like there's an actual problem, where one of the inodes can't find the 
SnapRealm which it lives in. Things that will make this easier to diagnose 
(in the event that somebody gets the time) include generating high-level debug 
logs and placing them somewhere accessible (start up the MDS with debug mds = 
20 added to the config file); if you want you could also try the below patch 
(which will cause the MDS to dump its full inode cache upon triggering this 
bug) and we can see if there's anything really obvious.
(This is a fine thing to make bug reports on at tracker.newdream.net, btw — and 
that allows attachments of things like log files.)
-Greg

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 143faca..6aa5923 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4527,6 +4527,11 @@ void MDCache::choose_lock_states_and_reconnect_caps()
 dout(15) << " chose lock states on " << *in << dendl;
 
 SnapRealm *realm = in->find_snaprealm();
+ if (!realm) {
+ dout(0) << "serious error, could not find snaprealm for in " << *in
+   << ", triggering cache dump" << dendl;
+ dump_cache();
+ }

check_realm_past_parents(realm);





Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-06-04 Thread Greg Farnum
On Thursday, May 31, 2012 at 4:58 PM, Noah Watkins wrote:
 
 On May 31, 2012, at 3:39 PM, Greg Farnum wrote:
   
   Nevermind to my last comment. Hmm, I've seen this, but very rarely.
  Noah, do you have any leads on this? Do you think it's a bug in your Java 
  code or in the C/++ libraries?
 
 
 
 I _think_ this is because the JVM uses its own threading library, and Ceph 
 assumes pthreads and pthread-compatible mutexes--is that assumption about 
 Ceph correct? Hence the error that looks like Mutex::lock(bool) being 
 referenced for context during the segfault. To verify this all that is needed 
 is some synchronization added to the Java.
I'm not quite sure what you mean here. Ceph is definitely using pthread 
threading and mutexes, but I don't see how the use of a different threading 
library can break pthread mutexes (which are just using the kernel futex stuff, 
AFAIK).
But I admit I'm not real good at handling those sorts of interactions, so maybe 
I'm missing something?

 There are only two segfaults that I've ever encountered, one in which the C 
 wrappers are used with an unmounted client, and the error Nam is seeing 
 (although they could be related). I will re-submit an updated patch for the 
 former, which should rule that out as the culprit.
 
 Nam: where are you grabbing the Java patches from? I'll push some updates.
 
 
 The only other scenario that comes to mind is related to signaling:
 
 The RADOS Java wrappers suffered from an interaction between the JVM and 
 RADOS client signal handlers, in which either the JVM or RADOS would replace 
 the handlers for the other (not sure which order). Anyway, the solution was 
 to link in the JVM libjsig.so signal chaining library. This might be the same 
 thing we are seeing here, but I'm betting it is the first theory I mentioned.
Hmm. I think that's an issue we've run into but I thought it got fixed for 
librados. Perhaps I'm mixing that up with libceph, or just pulling past 
scenarios out of thin air. It never manifested as Mutex count bugs, though!
-Greg



Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-06-04 Thread Greg Farnum
On Monday, June 4, 2012 at 1:47 PM, Noah Watkins wrote:
 On Mon, Jun 4, 2012 at 1:17 PM, Greg Farnum g...@inktank.com 
 (mailto:g...@inktank.com) wrote:
 
  I'm not quite sure what you mean here. Ceph is definitely using pthread 
  threading and mutexes, but I don't see how the use of a different threading 
  library can break pthread mutexes (which are just using the kernel futex 
  stuff, AFAIK).
  But I admit I'm not real good at handling those sorts of interactions, so 
  maybe I'm missing something?
 
 
 
 The basic idea was that threads in Java did not map 1:1 with kernel
 threads (think co-routines), which would break a lot of stuff,
 especially futex. Looking at some documentation, old JVMs had
 something called Green Threads, but have now been abandoned in favor
 of native threads. So maybe this theory is now irrelevant, and
 evidence seems to suggest you're right and Java is using native
 threads.

Gotcha, that makes sense.
 
 
   The RADOS Java wrappers suffered from an interaction between the JVM and 
   RADOS client signal handlers, in which either the JVM or RADOS would 
   replace the handlers for the other (not sure which order). Anyway, the 
   solution was to link in the JVM libjsig.so signal chaining library. This 
   might be the same thing we are seeing here, but I'm betting it is the 
   first theory I mentioned.
 
  Hmm. I think that's an issue we've run into but I thought it got fixed for 
  librados. Perhaps I'm mixing that up with libceph, or just pulling past 
  scenarios out of thin air. It never manifested as Mutex count bugs, though!
 
 I haven't tested the Rados wrappers in a while. I've never had to link
 in the signal chaining library for libcephfs.
 
 I wonder if the Mutex::lock(bool) being printed out is a red herring... 
Well, it's a SIGSEGV. So my guess is that's the frame that happens to be going 
outside its allowed bounds, probably because it's the first frame actually 
accessing memory off of a bad (probably NULL) pointer. For instance, what if it 
failed not only to mount the client, but even to create the context object? 



Re: iozone test crashed on ceph

2012-06-01 Thread Greg Farnum
On Thursday, May 31, 2012 at 5:58 PM, udit agarwal wrote:
 Hi,
 I have set up ceph system with a client, mon and mds on one system which is
 connected to 2 osds. I ran iozone test with a 10G file and it ran fine. But 
 when
 I ran iozone test with a 5G file, the process got killed and our ceph system
 hanged. Can anyone please help me with this.

What do you mean, the process got killed? It hung and some task watcher 
killed it? Or it got OOMed?
How did you determine that the ceph system hung? The cluster stopped 
responding to requests, or just the local mount point?
-Greg



Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Greg Farnum
On Thursday, May 31, 2012 at 7:43 AM, Noah Watkins wrote:
 
 On May 31, 2012, at 6:20 AM, Nam Dang wrote:
 
   Stack: [0x7ff6aa828000,0x7ff6aa929000],
   sp=0x7ff6aa9274f0, free space=1021k
   Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
   code)
   C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9
   
   Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
   j com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
   j com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
   j 
   Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
   j Benchmark$StatsDaemon.benchmarkOne()V+22
   j Benchmark$StatsDaemon.run()V+26
   v ~StubRoutines::call_stub
  
 
 
 
 Nevermind to my last comment. Hmm, I've seen this, but very rarely.
Noah, do you have any leads on this? Do you think it's a bug in your Java code 
or in the C/++ libraries?
Nam: it definitely shouldn't be segfaulting just because a monitor went down. :)
-Greg



Re: Multiple named clusters on same nodes

2012-05-29 Thread Greg Farnum
On Thursday, May 24, 2012 at 1:58 AM, Amon Ott wrote:
 On Thursday 24 May 2012 wrote Amon Ott:
  Attached is a patch based on current git stable that makes mkcephfs work
  fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon
  data (default ceph instead of supplied cluster name), so I put in a
  workaround.
  
  Please have a look and consider inclusion as well as fixing mon data path.
  Thanks.
 
 
 
 And another patch for the init script to handle multiple clusters.

Amon:
Thanks for the patches! Unfortunately nobody who's competent to review these 
(ie, not me) has time to look into them right now, but they're on the queue 
when TV or Sage gets some time. :)
-Greg




Re: Error 5 when trying to mount Ceph 0.47.1

2012-05-25 Thread Greg Farnum
On Thursday, May 24, 2012 at 10:58 PM, Nam Dang wrote:
 Hi,
 
 I've just started working with Ceph for a couple of weeks. At the
 moment, I'm trying to setup a small cluster with 1 monitor, 1 MDS and
 6 OSDs. However, I cannot mount ceph to the system no matter which
 node I'm executing the mounting command on.
 
 My nodes run Ubuntu 11.10 with kernel 3.0.0-12
 Seeing some other people also faced similar problems, I attached the
 result of running ceph -s as followed:
 
 2012-05-25 23:52:17.802590 pg v434: 1152 pgs: 189 active+clean, 963
 stale+active+clean; 8730 bytes data, 3667 MB used, 844 GB / 893 GB
 avail
 2012-05-25 23:52:17.806759 mds e12: 1/1/1 up {0=1=up:replay}
 2012-05-25 23:52:17.806827 osd e30: 6 osds: 1 up, 1 in
 2012-05-25 23:52:17.806966 log 2012-05-25 23:44:14.584879 mon.0
 192.168.172.178:6789/0 2 : [INF] mds.? 192.168.172.179:6800/6515
 up:boot
 2012-05-25 23:52:17.807086 mon e1: 1 mons at {0=192.168.172.178:6789/0}
 
 I tried to use the mount -t ceph node:port:/ [destination] but I keep
 getting mount error 5 = Input/output error
 
 I also check if the firewall is blocking anything with nmap -sT -p
 6789 [monNode]
 
 My ceph version is 0.47.1, installed with sudo apt-get on the system.
 I've spent a couple of days googling to no avail, and the
 documentation does not address this issue at all.
 
 Thank you very much for your help,

Notice how the MDS status is up:replay? That means it restarted at some point 
and is currently replaying the journal, which is why your client can't connect.

Ordinarily journal replay happens very quickly (a couple to several seconds, 
depending mostly on length), so if it's been in that state for a while 
something has gone wrong. And indeed only 1 out of 6 of your OSDs is up, and 
most of your PGs are stale because the people responsible for them aren't 
running. This is preventing the MDS from retrieving objects.

So we need to figure out why your OSDs are down. Did you fail to start them? 
Have they crashed and left behind backtraces or core dumps?
-Greg



Re: 'rbd map' asynchronous behavior

2012-05-25 Thread Greg Farnum
That looks like a bug that isn't familiar to Josh or I. Can you create a report 
in the tracker and provide as much debug info as you can come up with? :)


On Friday, May 25, 2012 at 3:15 AM, Andrey Korolyov wrote:

 Hi,
 
 Newer kernel rbd driver throws a quite strange messages on map|unmap,
 comparing to 3.2 branch:
 
 rbd map 'path' # device appears as /dev/rbd1 instead of rbd0, then
 rbd unmap /dev/rbd1 # causes following trace, w/ vanilla 3.4.0 from 
 kernel.org:
 
 [ 99.700802] BUG: scheduling while atomic: rbd/3846/0x0002
 [ 99.700857] Modules linked in: btrfs ip6table_filter ip6_tables
 iptable_filter ip_tables ebtable_nat ebtables x_tables iscsi_tcp
 libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl
 auth_rpcgss lockd sunrpc kvm_intel kvm bridge stp llc ipv6 rbd libceph
 loop 8250_pnp pcspkr firewire_ohci coretemp firewire_core hwmon 8250
 serial_core
 [ 99.700899] Pid: 3846, comm: rbd Not tainted 3.4.0 #3
 [ 99.700902] Call Trace:
 [ 99.700910] [81464d68] ? __schedule+0x96/0x625
 [ 99.700916] [8105f98a] ? __queue_work+0x254/0x27c
 [ 99.700921] [81465d39] ? _raw_spin_lock_irqsave+0x2a/0x32
 [ 99.700926] [81069b6d] ? complete+0x31/0x40
 [ 99.700931] [8105f10a] ? flush_workqueue_prep_cwqs+0x16e/0x180
 [ 99.700947] [81463bd8] ? schedule_timeout+0x21/0x1af
 [ 99.700951] [8107165d] ? enqueue_entity+0x67/0x13d
 [ 99.700955] [81464ad9] ? wait_for_common+0xc5/0x143
 [ 99.700959] [8106d5fc] ? try_to_wake_up+0x217/0x217
 [ 99.700963] [81063952] ? kthread_stop+0x30/0x50
 [ 99.700967] [81060979] ? destroy_workqueue+0x148/0x16b
 [ 99.700977] [a004ce07] ? ceph_osdc_stop+0x1f/0xaa [libceph]
 [ 99.700984] [a00463b4] ? ceph_destroy_client+0x10/0x44 [libceph]
 [ 99.700989] [a00652ae] ? rbd_client_release+0x38/0x4b [rbd]
 [ 99.700993] [a0065719] ? rbd_put_client.isra.10+0x28/0x3d [rbd]
 [ 99.700998] [a006609d] ? rbd_dev_release+0xc3/0x157 [rbd]
 [ 99.701003] [81287387] ? device_release+0x41/0x72
 [ 99.701007] [81202b95] ? kobject_release+0x4e/0x6a
 [ 99.701025] [a0065156] ? rbd_remove+0x102/0x11e [rbd]
 [ 99.701035] [8114b058] ? sysfs_write_file+0xd3/0x10f
 [ 99.701044] [810f8796] ? vfs_write+0xaa/0x136
 [ 99.701052] [810f8a07] ? sys_write+0x45/0x6e
 [ 99.701062] [8146a839] ? system_call_fastpath+0x16/0x1b
 [ 99.701170] BUG: scheduling while atomic: rbd/3846/0x0002
 [ 99.701220] Modules linked in: btrfs ip6table_filter ip6_tables
 iptable_filter ip_tables ebtable_nat ebtables x_tables iscsi_tcp
 libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl
 auth_rpcgss lockd sunrpc kvm_intel kvm bridge stp llc ipv6 rbd libceph
 loop 8250_pnp pcspkr firewire_ohci coretemp firewire_core hwmon 8250
 serial_core
 [ 99.701251] Pid: 3846, comm: rbd Not tainted 3.4.0 #3
 [ 99.701253] Call Trace:
 [ 99.701257] [81464d68] ? __schedule+0x96/0x625
 [ 99.701261] [81465ef9] ? _raw_spin_unlock_irq+0x5/0x2e
 [ 99.701265] [81069f92] ? finish_task_switch+0x4c/0xc1
 [ 99.701268] [8146525b] ? __schedule+0x589/0x625
 [ 99.701272] [812084b2] ? ip4_string+0x5a/0xc8
 [ 99.701276] [81208cbd] ? string.isra.3+0x39/0x9f
 [ 99.701281] [81208e33] ? ip4_addr_string.isra.5+0x5a/0x76
 [ 99.701285] [81208b7a] ? number.isra.1+0x10e/0x218
 [ 99.701290] [81463bd8] ? schedule_timeout+0x21/0x1af
 [ 99.701294] [81464ad9] ? wait_for_common+0xc5/0x143
 [ 99.701298] [8106d5fc] ? try_to_wake_up+0x217/0x217
 [ 99.701303] [8105f24c] ? flush_workqueue+0x130/0x2a5
 [ 99.701309] [a00463b9] ? ceph_destroy_client+0x15/0x44 [libceph]
 [ 99.701314] [a00652ae] ? rbd_client_release+0x38/0x4b [rbd]
 [ 99.701319] [a0065719] ? rbd_put_client.isra.10+0x28/0x3d [rbd]
 [ 99.701324] [a006609d] ? rbd_dev_release+0xc3/0x157 [rbd]
 [ 99.701328] [81287387] ? device_release+0x41/0x72
 [ 99.701334] [81202b95] ? kobject_release+0x4e/0x6a
 [ 99.701343] [a0065156] ? rbd_remove+0x102/0x11e [rbd]
 [ 99.701352] [8114b058] ? sysfs_write_file+0xd3/0x10f
 [ 99.701361] [810f8796] ? vfs_write+0xaa/0x136
 [ 99.701369] [810f8a07] ? sys_write+0x45/0x6e
 [ 99.701377] [8146a839] ? system_call_fastpath+0x16/0x1b
 
 
 On Wed, May 16, 2012 at 12:24 PM, Andrey Korolyov and...@xdel.ru 
 (mailto:and...@xdel.ru) wrote:
   This is most likely due to a recently-fixed problem.
   The fix is found in this commit, although there were
   other changes that led up to it:
   32eec68d2f rbd: don't drop the rbd_id too early
   It is present starting in Linux kernel 3.3; it appears
   you are running 2.6?
  
  
  
  Nope, it`s just Debian kernel naming - they continue to name 3.x with
  2.6 and I`m following them at own build naming. I have tried that on
  3.2 first time, and just a couple of minutes ago on 

Re: RBD format changes and layering

2012-05-25 Thread Greg Farnum
On Thursday, May 24, 2012 at 4:05 PM, Josh Durgin wrote:

snip 
 
 One thing that's not addressed in the earlier design is how to make
 images read-only. The simplest way would be to only support layering
 on top of snapshots, which are read-only by definition.
 
 Another way would be to allow images to be set read-only or
 read-write, and disallow setting images with children read-write. Are
 there many use cases that would justify this second, more complicated
 way?

I'm pretty sure we want to require images to be based on snapshots. It's 
actually more flexible than read-write flags: service providers could provide 
several Ubuntu 12.04 installs with different packages available by simply 
snapshotting as they go through the install procedure. If they instead had to 
go to an endpoint and then mark the image read-only, they would need to 
duplicate all the shared data.
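To make that concrete, a child image in this scheme would conceptually carry a parent reference along these lines (purely illustrative, not the proposed on-disk format):

  #include <cstdint>
  #include <string>

  struct parent_ref {
    int64_t pool_id;       // pool holding the parent image
    std::string image_id;  // the parent image itself
    uint64_t snap_id;      // always a snapshot, never a writable image head
  };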

 
 Copy-up
 ===
 
 Another feature we want to include with layering is the ability to
 copy all remaining data from the parent image to the child image, to
 break the dependency of the latter on the former. This does not change
 snapshots that were taken earlier though - they still rely on the
 parent image. Thus, the children of a parent image will need to
 include snapshots as well, and the reference to the parent image will
 be needed to interact with snapshots. Thus, we can't just remove the
 information pointing the parent. Instead, we can add a boolean
 has_parent field that is stored in the header and with each snapshot,
 since some snapshots may be taken when the parent was still used, and
 some after all the data has been copied to the child.

I understand why you're maintaining a reference to the parent image for old 
snapshots, but it makes me a little uneasy. This limitation means that you 
either need to delete snapshots or you need to maintain access to the parent 
image, which makes me a sad panda.
Have you looked into options for doing a full local copy of the needed parent 
data? I realize there are several tricky problems, but given some of the usage 
scenarios for layering (ie, migration) it would be an advantage.

My last question is about recursive layering. I know it's been discussed some, 
and *hopefully* it won't impact the actual on-disk layout of RBD images; do you 
have enough of a design sketched out to be sure? (One example: given the 
security concerns you've raised, I think layered images are going to need to 
list themselves as a child of each of their ancestors, rather than letting that 
information be absorbed by the intermediate image. Can the plan for storing 
parent pointers handle that?)
-Greg


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to free space from rados bench comman?

2012-05-24 Thread Greg Farnum
On Thursday, May 24, 2012 at 1:51 AM, Stefan Priebe - Profihost AG wrote:
 Am 24.05.2012 10:22, schrieb Wido den Hollander:
  On 24-05-12 09:38, Stefan Priebe - Profihost AG wrote:
   
   ~# rados -p data ls|wc -l
   46631
  
  
  
  That is weird, I thought the bench tool cleaned up it's mess.
  
  Imho it should cleanup after it's done, but there might be a reason why
  it's not. Did you abort the benchmark or did you let it do the whole run?
 
 
 No it doesn't BUG?
It doesn't because you might want to leave around the data for read 
benchmarking (or so that your cluster is full of data).
There should probably be an option to clean up bench data, though! I've created 
a bug: http://tracker.newdream.net/issues/2477
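In the meantime, if you'd rather do the cleanup programmatically, here's a hedged sketch using the librados C++ API; it reads object names from stdin (e.g. piped from rados -p data ls), and you should double-check the call signatures against your installed librados headers:

  #include <rados/librados.hpp>
  #include <iostream>
  #include <string>

  int main() {
    librados::Rados cluster;
    if (cluster.init(NULL) < 0) return 1;                // default client.admin identity
    cluster.conf_read_file(NULL);                        // /etc/ceph/ceph.conf
    if (cluster.connect() < 0) return 1;
    librados::IoCtx io;
    if (cluster.ioctx_create("data", io) < 0) return 1;  // pool the bench wrote into
    std::string oid;
    while (std::getline(std::cin, oid))
      io.remove(oid);                                    // delete each leftover object
    io.close();
    cluster.shutdown();
    return 0;
  }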
 
 
 ~# rados -p data ls
 ~#
 ~# rados -p data bench 20 write -t 16
 ...
 ~# rados -p data ls| wc -l
 589
 
   I do not use the data pool so it is seperate ;-) i only use the rbd pool
   for block devices.
   
   So i will free the space with:
   for i in `rados -p data ls`; do echo $i; rados -p data rm $i; done
  
  
  
  rados -p data ls|xargs -n 1 rados -p data rm
  
  I love shorter commands ;)
 me too i just tried it without -n and hoped that this works but rados
 didn't support more than 1 file per command and i didn't remembered -n1 ;)
 
 Stefan





Re: I have some problem to mount ceph file system

2012-05-24 Thread Greg Farnum
That's not an option any more, since malicious clients can fake it so easily. 
:(  


On Wednesday, May 23, 2012 at 10:35 PM, FrankWOO Su wrote:

 So in this version, can I configure anything to limit the mount command by IP?
  
 Any example?
  
 Thanks
 -Frank
  
 2012/5/24 Sage Weil s...@inktank.com
  On Wed, 23 May 2012, Gregory Farnum wrote:
   On Wed, May 23, 2012 at 1:51 AM, Frank frankwoo@gmail.com wrote:
Hello
I have a question about ceph.
 
When I mount ceph, I do the command as follow :
 
# mount -t ceph -o name=admin,secret=XX 10.1.0.1:6789/ /mnt/ceph -vv
 
now I create an user foo and make a secretkey by ceph-authtool like 
that :
 
# ceph-authtool /etc/ceph/keyring.bin -n client.foo --gen-key
 
then I add the key into ceph :
 
# ceph auth add client.foo osd 'allow *' mon 'allow *' mds 'allow' -i
/etc/ceph/keyring.bin
 
so i can mount ceph by foo :
 
# mount -t ceph -o name=foo,secret=XOXOXO 10.1.0.1:6789/ /mnt/ceph -vv
 
my question is: what if I don't want foo to have permission to mount
10.1.0.1:6789/
 
HOW TO DO IT?
 
if there is a directory foo
 
I want him to be able to mount 10.1.0.1:6789:/foo/
 
but have no access to mount 10.1.0.1:6789:/

   I'm afraid that's not an option with Ceph right now, that I'm aware
   of. It was built and designed for a trusted set of servers and
   clients, and while we're slowly carving out areas of security, this
   isn't one we've done yet.
   If it's an important feature for you, you should create a feature
    request in the tracker (tracker.newdream.net) for it, which we will
   prioritize and work on once we've moved to focus on the full
   filesystem. :)
   
   
  http://tracker.newdream.net/issues/1237
   
  (tho the final config will probably not look like that; suggestions
  welcome.)
   
  sag


Re: NFS re-exporting CEPH cluster

2012-05-24 Thread Greg Farnum
On Wednesday, May 23, 2012 at 10:14 PM, Madhusudhana U wrote:
 Hi all,
 Has anyone tried re-exporting a CEPH cluster via NFS with success (I mean to 
 say, mount the CEPH cluster on one of the machines and then export that via 
 NFS to clients)? I need to do this because of my client kernel version and some
 EDA tools' compatibility. Can someone suggest how I can successfully 
 re-export CEPH over NFS?

Have you tried something and it failed? Or are you looking for suggestions?
If the former, please report the failure. :)
If the latter: http://ceph.com/wiki/Re-exporting_NFS
-Greg




Re: OSDMap::apply_incremental not updating crush map

2012-05-24 Thread Greg Farnum
On Thursday, May 24, 2012 at 10:58 AM, Adam Crume wrote:
 I'm trying to simulate adding an OSD to a cluster. I set up an
 OSDMap::Incremental and apply it, but nothing ever gets mapped to the
 new OSD. Apparently, the crush map never gets updated. Do I have to
 do that manually?

Yes. If you need help, check out the crush section of the
OSDMonitor::prepare_command code. :)
 
 It seems like apply_incremental should do it
 automatically.

apply_incremental has no idea where the new ID is located in terms of failure 
domains.
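For your simulation, something along these lines should get the new OSD into the crush map as part of the incremental. Treat it as a rough sketch: CrushWrapper::insert_item() and its exact signature, and the root/host location keys, are assumptions you should check against the CrushWrapper.h in your tree.

  #include "crush/CrushWrapper.h"
  #include "osd/OSDMap.h"
  #include <cstdio>
  #include <map>
  #include <string>

  void add_osd_to_crush(OSDMap *osdmap, OSDMap::Incremental &inc,
                        CephContext *cct, int osd_num, float weight) {
    // copy the current crush map
    bufferlist bl;
    osdmap->crush->encode(bl);
    bufferlist::iterator p = bl.begin();
    CrushWrapper newcrush;
    newcrush.decode(p);

    // place the new osd somewhere in the hierarchy (bucket names are hypothetical)
    std::map<std::string, std::string> loc;
    loc["root"] = "default";
    loc["host"] = "host0";
    char name[32];
    snprintf(name, sizeof(name), "osd.%d", osd_num);
    newcrush.insert_item(cct, osd_num, weight, name, loc);

    // ship the updated crush map along with the rest of the incremental
    newcrush.encode(inc.crush);
  }

Call it on the Incremental before apply_incremental(), and the new osd should start picking up mappings once it sits in a bucket with nonzero weight.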
 
 My test case is below. It shows that the OSDMap is
 updated to have 11 OSDs, but the crush map still shows only 10.
 
 Thanks,
 Adam Crume
 
 #include <assert.h>
 #include <stdio.h>
 #include "osd/OSDMap.h"
 #include "common/code_environment.h"
 
 int main() {
 OSDMap *osdmap = new OSDMap();
 CephContext *cct = new CephContext(CODE_ENVIRONMENT_UTILITY);
 uuid_d fsid;
 int num_osds = 10;
 osdmap->build_simple(cct, 1, fsid, num_osds, 7, 8);
 for(int i = 0; i < num_osds; i++) {
 osdmap->set_state(i, osdmap->get_state(i) | CEPH_OSD_UP |
 CEPH_OSD_EXISTS);
 osdmap->set_weight(i, CEPH_OSD_IN);
 }
 
 int osd_num = 10;
 OSDMap::Incremental inc(osdmap->get_epoch() + 1);
 inc.new_max_osd = osdmap->get_max_osd() + 1;
 inc.new_weight[osd_num] = CEPH_OSD_IN;
 inc.new_state[osd_num] = CEPH_OSD_UP | CEPH_OSD_EXISTS;
 inc.new_up_client[osd_num] = entity_addr_t();
 inc.new_up_internal[osd_num] = entity_addr_t();
 inc.new_hb_up[osd_num] = entity_addr_t();
 inc.new_up_thru[osd_num] = inc.epoch;
 uuid_d new_uuid;
 new_uuid.generate_random();
 inc.new_uuid[osd_num] = new_uuid;
 int e = osdmap->apply_incremental(inc);
 assert(e == 0);
 printf("State for 10: %d, State for 0: %d\n",
 osdmap->get_state(10), osdmap->get_state(0));
 printf("10 exists: %s\n", osdmap->exists(10) ? "yes" : "no");
 printf("10 is in: %s\n", osdmap->is_in(10) ? "yes" : "no");
 printf("10 is up: %s\n", osdmap->is_up(10) ? "yes" : "no");
 printf("OSDMap max OSD: %d\n", osdmap->get_max_osd());
 printf("CRUSH max devices: %d\n", osdmap->crush->get_max_devices());
 }
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to free space from rados bench comman?

2012-05-24 Thread Greg Farnum
On Thursday, May 24, 2012 at 11:05 AM, Josh Durgin wrote:
 Why not have the read benchmark write data itself, and then benchmark
 reading? Then both read and write benchmarks can clean up after
 themselves.
 
 It's a bit odd to have the read benchmark depend on you running a write
 benchmark first.
 
 Josh 
We've talked about that and decided we didn't like it. I think it was about 
being able to repeat large read benchmarks without having to wait for all the 
data to get written out first, and also (although this was never implemented) 
being able to implement random read benchmarks and things in ways that allowed 
you to make the cache cold first.
Which is not to say that changing it is a bad idea; I could be talked into that 
or somebody else could do it. :)
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS crash, wont startup again

2012-05-22 Thread Greg Farnum


On Tuesday, May 22, 2012 at 3:12 AM, Felix Feinhals wrote:

 I am not quite sure on how to get you the coredump infos. I installed
 all ceph-dbg packages and executed:
 
 gdb /usr/bin/ceph-mds core
 
 snip
 
 GNU gdb (GDB) 7.0.1-debian
 Copyright (C) 2009 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law. Type show copying
 and show warranty for details.
 This GDB was configured as x86_64-linux-gnu.
 For bug reporting instructions, please see:
 http://www.gnu.org/software/gdb/bugs/...
 Reading symbols from /usr/bin/ceph-mds...Reading symbols from
 /usr/lib/debug/usr/bin/ceph-mds...done.
 (no debugging symbols found)...done.
 [New Thread 22980]
 [New Thread 22984]
 [New Thread 22986]
 [New Thread 22979]
 [New Thread 22970]
 [New Thread 22981]
 [New Thread 22971]
 [New Thread 22976]
 [New Thread 22973]
 [New Thread 22975]
 [New Thread 22974]
 [New Thread 22972]
 [New Thread 22978]
 [New Thread 22982]
 
 warning: Can't read pathname for load map: Input/output error.
 Reading symbols from /lib/libpthread.so.0...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libpthread.so.0
 Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libcrypto++.so.8
 Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/libuuid.so.1
 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/librt.so.1
 Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libtcmalloc.so.0
 Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libstdc++.so.6
 Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
 Loaded symbols for /lib/libm.so.6
 Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/libgcc_s.so.1
 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
 Loaded symbols for /lib/libc.so.6
 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
 symbols found)...done.
 Loaded symbols for /lib64/ld-linux-x86-64.so.2
 Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libunwind.so.7
 Core was generated by `/usr/bin/ceph-mds -i c --pid-file
 /var/run/ceph/mds.c.pid -c /etc/ceph/ceph.con'.
 Program terminated with signal 11, Segmentation fault.
 #0 0x7f10c00d2ebb in raise () from /lib/libpthread.so.0
 

Argh. This is finicky and annoying; don't feel bad. :) There are two 
possibilities here:
1) If I remember correctly, PATH and the actual debug symbol install locations 
often don't match up. Check out where the debug packages actually installed to, 
and make sure that directory is in PATH when running gdb.
2) The default thread you're getting a backtrace on doesn't look to be the one 
we actually care about (notice how the backtrace is through completely 
different parts of the code); it's conceivable that there just aren't any debug 
symbols for those libraries. Try running thread apply all bt (I think that's 
the right command) and looking for one that matches the backtrace in the log 
file. Then switch to it (thread x where x is the thread number) and get the 
backtrace of that.
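Concretely, from the same gdb session (the thread number below is just an example; pick whichever backtrace matches the assert/crash lines in the mds log):

  (gdb) thread apply all bt
  (gdb) thread 5
  (gdb) bt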
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to mount a specific pool in cephs

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 5:02 AM, Grant Ashman wrote:
 Tommi Virtanen tommi.virtanen at dreamhost.com (http://dreamhost.com) 
 writes:
  
  You don't mount pools directly; there's filesystem metadata (as
  managed by metadata servers) that is needed too.
  
  What you probably want is to specify that a subtree of your ceph dfs
  stores the file data in a separate pool, using cephfs
  /mnt/ceph/some/subtree set_layout --pool 6. Note that a numerical
  pool id is currently required.
  
  http://ceph.newdream.net/docs/master/man/8/cephfs/
  
  You can mount any subtree of the ceph dfs directly, using
  10.32.0.10:6789:/some/subtree in your mount command.
 
 
 
 Hi Tommi,
 
 We have tried setting the layout as described below with:
 'cephfs /mnt/ceph-backup/ set_layout --pool 3'
 
 However, I only ever receive the following output;
 'Error setting layout: Invalid argument'
 
 I can run other view the current layout of the mount 
 point, but cannot change the pool layout.
 
 My understanding of the numeric pool is as follows:
 
 root@dsan-test:/mnt# ceph osd dump -o -|grep 'pool'
 pool 0 'data' rep size 2 crush_ruleset 0
 pool 1 'metadata' rep size 2 crush_ruleset 1
 pool 2 'rbd' rep size 2 crush_ruleset 2
 pool 3 'backup' rep size 2 crush_ruleset 0 
 
 (omitted detail I thought unnecessary)
 
 Therefore the backup pool which we specifically want to mount is 3?
 
 Are you able to assist with the syntax for cephfs set_layout?
That's the right pool ID; yes. I believe the problem is that the cephfs tool 
currently requires you to fill in all the fields, not just the one you wish to 
change. Try that (setting all the other values to match what you see when you 
view the layout). :)
-Greg


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
What does your test look like? With multiple large IOs in flight we can 
regularly fill up a 1GbE link on our test clusters. With smaller or fewer IOs 
in flight performance degrades accordingly. 


On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote:

 Hi list,
 
 my ceph block testcluster is now running fine.
 
 Setup:
 4x ceph servers
 - 3x mon with /mon on local os SATA disk
 - 4x OSD with /journal on tmpfs and /srv on intel ssd
 
 all of them use 2x 1Gbit/s lacp trunk.
 
 1x KVM Host system (2x 1Gbit/s lacp trunk)
 
 With one KVM i do not get more than 40MB/s and my network link is just
 at 40% of 1Gbit/s.
 
 Is this expected? If not where can i start searching / debugging?
 
 Thanks,
 Stefan
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
 Am 22.05.2012 21:35, schrieb Greg Farnum:
  What does your test look like? With multiple large IOs in flight we can 
  regularly fill up a 1GbE link on our test clusters. With smaller or fewer 
  IOs in flight performance degrades accordingly.
 
 
 
 iperf shows 950Mbit/s so this is OK (from KVM host to OSDs)
 
 sorry:
 dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M 
 count=1000;
 1000+0 records in
 1000+0 records out
 4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s
 
 1000+0 records in
 1000+0 records out
 4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s

Huh. That's less than I would expect. Especially since it ought to be going 
through the page cache.
What version of RBD is KVM using here?

Can you (from the KVM host) run
rados -p data bench seq 60 -t 1
rados -p data bench seq 60 -t 16
and paste the final output from both?

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph osd crush add - uknown command crush

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 1:15 PM, Sławomir Skowron wrote:
 /usr/bin/ceph -v
 ceph version 0.47.1 (commit:f5a9404445e2ed5ec2ee828aa53d73d4a002f7a5)
  
 root@obs-10-177-66-4:/# /usr/bin/ceph osd crush add 1 osd.1 1.0
 pool=default rack=unknownrack host=obs-10-177-66-4
 root@obs-10-177-66-4:/# unknown command crush
  
 Has something changed (there is no change in the docs), or is it a bug?
Gah, something changed.

Use ceph osd crush set…  

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 1:30 PM, Stefan Priebe wrote:
 Am 22.05.2012 21:52, schrieb Greg Farnum:
  On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
  Huh. That's less than I would expect. Especially since it ought to be going 
  through the page cache.
  What version of RBD is KVM using here?
  
  
 v0.47.1
  
  Can you (from the KVM host) run
  rados -p data bench seq 60 -t 1
  rados -p data bench seq 60 -t 16
  and paste the final output from both?
  
  
 OK here it is first with write then with seq read.
  
 # rados -p data bench 60 write -t 1
 # rados -p data bench 60 write -t 16
 # rados -p data bench 60 seq -t 1
 # rados -p data bench 60 seq -t 16
  
 Output is here:
 http://pastebin.com/iFy8GS7i

Heh, yep, sorry about the commands — haven't run them personally in a while. :)

Anyway, it looks like you're just paying a synchronous write penalty, since 
with 1 write at a time you're getting 30-40MB/s out of rados bench, but with 16 
you're getting 100MB/s. (If you bump up past 16 or increase the size of each 
with -b you may find yourself getting even more.)
So try enabling RBD writeback caching — see 
http://marc.info/?l=ceph-develm=133758599712768w=2
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to debug slow rbd block device

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
 Am 22.05.2012 22:49, schrieb Greg Farnum:
  Anyway, it looks like you're just paying a synchronous write penalty
  
  
 What does that exactly mean? Shouldn't one threaded write to four  
 260MB/s devices give at least 100MB/s?

Well, with dd you've got a single thread issuing synchronous IO requests to the 
kernel. We could have it set up so that those synchronous requests get split 
up, but they aren't, and between the kernel and KVM it looks like when it needs 
to make a write out to disk it sends one request at a time to the Ceph backend. 
So you aren't writing to four 260MB/s devices; you are writing to one 260MB/s 
device without any pipelining — meaning you send off a 4MB write, then wait 
until it's done, then send off a second 4MB write, then wait until it's done, 
etc.
Frankly I'm surprised you aren't getting a bit more throughput than you're 
seeing (I remember other people getting much more out of less beefy boxes), but 
it doesn't much matter because what you really want to do is enable the 
client-side writeback cache in RBD, which will dispatch multiple requests at 
once and not force writes to be committed before reporting back to the kernel. 
Then you should indeed be writing to four 260MB/s devices at once. :)
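If it helps to picture the difference, here is a rough librados illustration of one-at-a-time versus many-in-flight writes. It is not what dd/KVM literally do internally, just the pattern; it assumes a pool called "data" and skips error handling:

  #include <rados/librados.hpp>
  #include <stdint.h>
  #include <list>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init(NULL);               // default client.admin identity
    cluster.conf_read_file(NULL);     // pick up ceph.conf from the usual places
    cluster.connect();
    librados::IoCtx io;
    cluster.ioctx_create("data", io);

    librados::bufferlist bl;
    bl.append(std::string(4 * 1024 * 1024, 'x'));   // one 4MB chunk

    // (a) synchronous: each write waits for the OSDs before the next is sent
    for (int i = 0; i < 16; i++)
      io.write("sync_obj", bl, bl.length(), (uint64_t)i * bl.length());

    // (b) pipelined: all writes in flight at once, then wait; this is roughly
    // the behaviour the writeback cache gives the guest
    std::list<librados::AioCompletion*> cs;
    for (int i = 0; i < 16; i++) {
      librados::AioCompletion *c = cluster.aio_create_completion();
      io.aio_write("aio_obj", c, bl, bl.length(), (uint64_t)i * bl.length());
      cs.push_back(c);
    }
    for (std::list<librados::AioCompletion*>::iterator p = cs.begin(); p != cs.end(); ++p) {
      (*p)->wait_for_safe();
      (*p)->release();
    }
    cluster.shutdown();
    return 0;
  }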

  
  since with 1 write at a time you're getting 30-40MB/s out of rados bench, 
  but with 16 you're getting100MB/s.
  (If you bump up past 16 or increase the size of each with -b you may  
  
 find yourself getting even more.)
 yep noticed that.
  
  So try enabling RBD writeback caching — see 
  http://marc.info/?l=ceph-develm=133758599712768w=2
 will test tomorrow. Thanks.
  
 Stefan  


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to mount a specific pool in cephs

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 2:12 PM, Grant Ashman wrote:
  That's the right pool ID; yes. I believe the problem is that the cephfs 
  tool currently requires you to fill in all the fields, not ?just the one 
  you wish to change. Try that (setting all the other values to match what 
  you see when you view the layout). :) - Greg
  
  
  
 Hi,
  
 I've tried setting all values the same as the current layout - changing only 
 the pool number and I still get;
  
 root@dsan-test:/mnt/ceph-test# cephfs /mnt/ceph-backup show_layout
 layout.data_pool: 0
 layout.object_size: 4194304
 layout.stripe_unit: 4194304
 layout.stripe_count: 1
 root@dsan-test:/mnt/ceph-test#  
 root@dsan-test:/mnt/ceph-test# cephfs /mnt/ceph-backup set_layout -p 3 -s 
 4194304 -u 4194304 -c 1
 Error setting layout: Invalid argument
  
 If I leave all values exactly the same i.e '-p 0' the command runs without 
 any error output. However, changing the pool from anything but 0 results in 
 'Error setting layout: Invalid argument'
  
 Any ideas?
Oh, I got this conversation confused with another one. You also need to specify 
the pool as a valid pool to store filesystem data in, if you haven't done that 
already:
ceph mds add_data_pool poolname


And you may not actually need to specify all options — I'd been assuming so since 
it broke, but I don't remember if that's actually the case (I think it's changed 
over the versions). 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to mount a specific pool in cephs

2012-05-22 Thread Greg Farnum
On Tuesday, May 22, 2012 at 2:31 PM, Grant Ashman wrote:
 Greg Farnum greg at inktank.com (http://inktank.com) writes:
  Oh, I got this conversation confused with another one.
  You also need to specify the pool as a valid pool to
  store filesystem data in, if you haven't done that already:
  ceph mds add_data_pool poolname
  
  
  
 Thanks Greg,
  
 However, I still get the same error :(
 When I specify the add data pool I get the following:
 (with or without the additional values)
  
 root@dsan-test:~# ceph mds add_data_pool backup
 added data pool 0 to mdsmap

Okay, that's not right — it should say pool 3.
According to the docs I found you ran that correctly, but let's try running 
ceph mds add_data_pool 3 and see if that resolves correctly.
*goes to look at code*
Argh, yep, it's expecting a pool ID, not a pool name. Gah.
  
 root@dsan-test:~# cephfs /mnt/ceph-backup set_layout -p 3
 Error setting layout: Invalid argument
  
 Seems to be pointing at pool 0 instead of pool 3 like it should?
  
 Sorry if this has all been covered before, I've not found any  
 resolution and the ability to mount specific pools I.E data
 for production data and backup for 1N backup data is a huge
 priority to begin using Ceph.

No problem — we haven't generated docs for this yet and obviously we need to at 
some point.

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible memory leak in mon?

2012-05-03 Thread Greg Farnum
On Wednesday, May 2, 2012 at 11:24 PM, Vladimir Bashkirtsev wrote:
 Greg,
  
 Apologies for multiple emails: my mail server is backed by ceph now and  
 it struggled this morning (separate issue). So my mail server reported  
 back to my mailer that sending of email failed when obviously it was not  
 the case.

Interesting — I presume you're using the file system? That's not something 
we've heard of anybody doing with Ceph before. :)

  
 [root@gamma ~]# ceph -s
 2012-05-03 15:46:55.640951 mds e2666: 1/1/1 up {0=1=up:active}, 1  
 up:standby
 2012-05-03 15:46:55.647106 osd e10728: 6 osds: 6 up, 6 in
 2012-05-03 15:46:55.654052 log 2012-05-03 15:46:26.557084 mon.2  
 172.16.64.202:6789/0 2878 : [INF] mon.2 calling new monitor election
 2012-05-03 15:46:55.654425 mon e7: 3 mons at  
 {0=172.16.64.200:6789/0,1=172.16.64.201:6789/0,2=172.16.64.202:6789/0}
 2012-05-03 15:46:56.961624 pg v1251669: 600 pgs: 2 creating, 598  
 active+clean; 309 GB data, 963 GB used, 1098 GB / 2145 GB avail
  
 Logging is on but nothing obvious in there: the logs are quite small. Number of  
 ceph health logged (ceph monitored by nagios and so this record appears  
 every 5 minutes), monitors periodically call for election (different  
 periods between 1 to 15 minutes as it looks). That's it.

Hrm. Generally speaking the monitors shouldn't call for elections unless 
something changes (one of them crashes) or the leader monitor is slowing down.
Can you increase the debug_mon to 20, the debug_ms to 1, and post one of the 
logs somewhere? The Live Debugging section of http://ceph.com/wiki/Debugging 
should give you what you need. :)
  
  
 Regards,
 Vladimir


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: weighted distributed processing.

2012-05-02 Thread Greg Farnum


On Wednesday, May 2, 2012 at 3:42 PM, Clint Byrum wrote:

 Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
  Hello All,
  First off, I'm sending this email to three discussion groups:
  gear...@googlegroups.com (mailto:gear...@googlegroups.com) - distributed 
  processing library
  ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org) - 
  distributed file system
  archivemat...@googlegroups.com (mailto:archivemat...@googlegroups.com) - my 
  project's discussion list, a  
  distributed processing system.
   
  I'd like to start a discussion about something I'll refer to as weighted  
  distributed task based processing.
   Presently, we are using gearman's libraries to meet our distributed  
  processing needs. The majority of our processing is file based, and our  
  processing stations are accessing the files over an nfs share. We are  
  looking at replacing the nfs server share with a distributed file  
  systems, like ceph.
   
  It occurs to me that our processing times could theoretically be reduced  
   by assigning tasks to processing clients where the file resides, over  
  places where it would need to be copied over the network. In order for  
  this to happen, the gearman server would need to get file location  
  information from the ceph system.
  
  
  
 If I understand the design of CEPH completely, it spreads I/O at the
 block level, not the file level.
  
 So there is little point in weighting since it seeks to spread the whole
 file across all the machines/block devices in the cluster. Even if you
 do ask ceph which servers is file X on, which I'm sure it could tell
 you, You will end up with high weights for most of the servers, and no
 real benefit.
  
 In this scenario, you're just better off having a really powerful network
 and CEPH will balance the I/O enough that you can scale out the I/O
 independently of the compute resources. This seems like a huge win, as
 I don't believe most workloads scale at a 1:1 I/O:CPU ratio. 10Gigabit
 switches are still not super cheap, but they are probably cheaper than
 software engineer hours.
  
 If your network is not up to the task of transferring all those blocks
 around, you probably need to focus instead on something that keeps whole
 files in a certain place. One such system would be MogileFS. This has a
 database with a list of keys that say where the data lives, and in fact
 the protocol the MogileFS tracker uses will tell you all the places a
 key lives. You could then place a hint in the payload and have 2 levels
 of workers. The pseudo becomes:
  
 -workers register two queues. 'dispatch_foo', and 'do_foo_$hostname'
 -client sends task w/ filename to 'dispatch_foo'  
 -dispatcher looks at filename, asks mogile where the file is, looks at
 recent queue lengths in gearman, and decides whether or not it is enough
 of a win to direct the job to the host where the file is, or to farm it
 out to somewhere that is less busy.
  
 This will take a lot of poking at to get tuned right, but it should be
 tunable to a single number, the ratio of localized queue length versus
 non-localized queue length.
  
  pseudo:
  gearman client creates a task  includes a weight, of type ceph file
  gearman server identifies the file  polls the ceph system for clients  
  that have this file
  ceph system returns a list of clients that have the file locally
  gearman assigns the task
  . if there is a client available for processing that has the file locally
  . assign it there
  . (that client has local access to the file, still on the ceph  
  system)
  . else
  . assign to other client
  . (that processing client will pull the file from the ceph system  
  over the network)
   
   
  I call it a weighted distributed processing system, because it reminds  
  me of a weighted die: The outcome is influenced to a certain direction  
  (in the task assignment).
   
  I wanted to start this as a discussion, rather than filing feature  
  requests, because of the complex nature of the requests, and the nicer  
  medium for feedback, clarification and refinement.
   
  I'd be very interested to hear feedback on the idea,
  Joseph Perry
  

https://groups.google.com/group/gearman/browse_thread/thread/12a1b3aa64f103d1
^ is the Google Groups link for this (ceph-devel doesn't seem to have gotten 
the original email — at least I didn't!).

Clint is mostly correct: Ceph does not store files in a single location. It's 
not block-based in the sense of 4K disk blocks though — instead it breaks up 
files into (by default) 4MB chunks. It's possible to change this default to a 
larger number though; our Hadoop bindings break files into 64MB chunks. And it 
is possible to retrieve this location data using the cephfs tool:
./cephfs  
not enough parameters!
usage: cephfs path command [options]*
Commands:
show_layout -- view the layout information on a file or dir
set_layout -- set the layout on an empty file,
or the default layout on a 

Re: weighted distributed processing.

2012-05-02 Thread Greg Farnum
(Trimmed CC:) apparently neither Gearman nor Archivematica lists allow posting 
from non-members, which leads to some wonderful spam from Google and is going 
to make holding a cross-list conversation…difficult.  


On Wednesday, May 2, 2012 at 4:26 PM, Greg Farnum wrote:

  
  
 On Wednesday, May 2, 2012 at 3:42 PM, Clint Byrum wrote:
  
  Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
   Hello All,
   First off, I'm sending this email to three discussion groups:
   gear...@googlegroups.com (mailto:gear...@googlegroups.com) - distributed 
   processing library
   ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org) - 
   distributed file system
   archivemat...@googlegroups.com (mailto:archivemat...@googlegroups.com) - 
   my project's discussion list, a  
   distributed processing system.

   I'd like to start a discussion about something I'll refer to as weighted  
   distributed task based processing.
    Presently, we are using gearman's libraries to meet our distributed  
   processing needs. The majority of our processing is file based, and our  
   processing stations are accessing the files over an nfs share. We are  
   looking at replacing the nfs server share with a distributed file  
   systems, like ceph.

   It occurs to me that our processing times could theoretically be reduced  
    by assigning tasks to processing clients where the file resides, over  
   places where it would need to be copied over the network. In order for  
   this to happen, the gearman server would need to get file location  
   information from the ceph system.
   
   
   
   
   
  If I understand the design of CEPH completely, it spreads I/O at the
  block level, not the file level.
   
  So there is little point in weighting since it seeks to spread the whole
  file across all the machines/block devices in the cluster. Even if you
  do ask ceph which servers is file X on, which I'm sure it could tell
  you, You will end up with high weights for most of the servers, and no
  real benefit.
   
  In this scenario, you're just better off having a really powerful network
  and CEPH will balance the I/O enough that you can scale out the I/O
  independently of the compute resources. This seems like a huge win, as
  I don't believe most workloads scale at a 1:1 I/O:CPU ratio. 10Gigabit
  switches are still not super cheap, but they are probably cheaper than
  software engineer hours.
   
  If your network is not up to the task of transferring all those blocks
  around, you probably need to focus instead on something that keeps whole
  files in a certain place. One such system would be MogileFS. This has a
  database with a list of keys that say where the data lives, and in fact
  the protocol the MogileFS tracker uses will tell you all the places a
  key lives. You could then place a hint in the payload and have 2 levels
  of workers. The pseudo becomes:
   
  -workers register two queues. 'dispatch_foo', and 'do_foo_$hostname'
  -client sends task w/ filename to 'dispatch_foo'  
  -dispatcher looks at filename, asks mogile where the file is, looks at
  recent queue lengths in gearman, and decides whether or not it is enough
  of a win to direct the job to the host where the file is, or to farm it
  out to somewhere that is less busy.
   
  This will take a lot of poking at to get tuned right, but it should be
  tunable to a single number, the ratio of localized queue length versus
  non-localized queue length.
   
   pseudo:
   gearman client creates a task  includes a weight, of type ceph file
   gearman server identifies the file  polls the ceph system for clients  
   that have this file
   ceph system returns a list of clients that have the file locally
   gearman assigns the task
   . if there is a client available for processing that has the file locally
   . assign it there
   . (that client has local access to the file, still on the ceph  
   system)
   . else
   . assign to other client
   . (that processing client will pull the file from the ceph system  
   over the network)


   I call it a weighted distributed processing system, because it reminds  
   me of a weighted die: The outcome is influenced to a certain direction  
   (in the task assignment).

   I wanted to start this as a discussion, rather than filing feature  
   requests, because of the complex nature of the requests, and the nicer  
   medium for feedback, clarification and refinement.

   I'd be very interested to hear feedback on the idea,
   Joseph Perry
   
  
  
  
 https://groups.google.com/group/gearman/browse_thread/thread/12a1b3aa64f103d1
 ^ is the Google Groups link for this (ceph-devel doesn't seem to have gotten 
 the original email — at least I didn't!).
  
 Clint is mostly correct: Ceph does not store files in a single location. It's 
 not block-based in the sense of 4K disk blocks though — instead it breaks up 
 files into (by default) 4MB chunks. It's possible

Re: Possible memory leak in mon?

2012-05-02 Thread Greg Farnum
On Wednesday, May 2, 2012 at 3:28 PM, Vladimir Bashkirtsev wrote:
 Dear devs,
 
 I have three mons and two of them suddenly consumed around 4G of RAM 
 while the third one happily lived with 150M. This immediately prompts a few 
 questions:
 
 1. What is expected memory use of mon? I believed that mon merely 
 directs clients to relevant OSDs and should not consume a lot of 
 resources - please correct me if I am wrong.
 2. In both cases where mon consumed a lot of memory it was preceded by 
 disk-full condition and both machines where incidents happened are 64 
 bit, rest of cluster 32 bit. mon fs and log files happened to be in the 
 same partition - ceph osd produced a lot of messages, filled up disk, 
 mon crashed (no core as disk was full), manually deleted logs, restarted 
 mon without any issue, some time later found mon using 4G of RAM. 
 Running 0.45. Should I deliberately recreate conditions and crash mon to 
 get more debug info (if you need it of course, and if yes then what)?
 3. Does the 4G-per-process figure come from 32 bit pointers in the mon? Or can 
 the mon potentially consume more than 4G?
 
 Regards,
 Vladimir
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html

First: one email is enough. 

Second: in normal use your monitors should not consume very much memory. It 
sounds like something's wrong. Can you please provide the output of ceph -s?
Also, do you have any monitor logging on? My best guess is that for some reason 
the monitors aren't all communicating with each other and so they are buffering 
messages.
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: global_init fails when only specifying monitor address

2012-04-26 Thread Greg Farnum
On Thursday, April 26, 2012 at 9:33 AM, Sage Weil wrote:
 On Thu, 26 Apr 2012, Wido den Hollander wrote:
  Hi,
  
  I tried to connect to a small Ceph setup on my desktop without cephx and 
  that
  failed:
  
  root@stack01:~# ceph -m wido-desktop.widodh.nl:6789 
  (http://wido-desktop.widodh.nl:6789) -s
  global_init: unable to open config file.
  root@stack01:~#
  
  I however worked with:
  
  root@stack01:~# ceph -m wido-desktop.widodh.nl:6789 
  (http://wido-desktop.widodh.nl:6789) -c /dev/null -s
  2012-04-26 14:55:33.828524 pg v148: 594 pgs: 594 active+clean; 0 bytes
  data, 7740 KB used, 70571 MB / 76800 MB avail
  2012-04-26 14:55:33.829622 mds e1: 0/0/1 up
  2012-04-26 14:55:33.836144 osd e14: 3 osds: 3 up, 3 in
  2012-04-26 14:55:33.886429 log 2012-04-26 14:52:50.674430 osd.1
  [2a00:f10:11c:ab:52e5:49ff:fec2:c976]:6807/28366 12 : [INF] 1.2b scrub ok
  2012-04-26 14:55:33.892423 mon e1: 1 mons at
  {desktop=[2a00:f10:11c:ab:52e5:49ff:fec2:c976]:6789/0}
  root@stack01:~#
  
  I quick look at global_init.cc (http://global_init.cc) showed me why this 
  happened, it simply looks
  for a configuration file to open and when it can't it fails.
  
  But if a monitor address is set, a config file shouldn't be mandatory.
  
  It could be accomplished rather simple by setting the flag
  CINIT_FLAG_NO_DEFAULT_CONFIG_FILE if a mon_host has been set, but to do that
  conf-parse_argv(args); should move a few lines up.
  
  Comments? Thoughts?
 
 I wonder if the simplest thing to do is:
 
 - never error out on missing config in the default search path
 - always error out on missing config if it was explicitly specified via 
 -c foo or CEPH_CONF in environment.
 
 ?
 
 sage
I think this is probably right. I think that we may even error out correctly if 
we don't have values specified that we need, but we'll need to check that.
I'm working on similar stuff as I look at monitor cluster additions for Carl, 
so I'll look at this today.
-Greg 


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Log files with 0.45

2012-04-20 Thread Greg Farnum
I checked with Sam on this and it turns out he created a new subsystem whose 
output you can control with the debug optracker (or --debug-optracker) option 
(in the same way as the other debug log systems). In 0.45 the output for that 
system was at inappropriately high levels (1) and it's fixed in our current 
master (5 now), but you probably want to set debug optracker = 0 in your 
config. That should restore things to the way they used to be!
(And sorry for the long  wait, Danny.)
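For reference, that's just:

  [osd]
      debug optracker = 0

(or put it in [global] if you prefer; either section should work.)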
-Greg


On Friday, April 20, 2012 at 1:05 PM, Nick Bartos wrote:

 Is there a recommended log config for production systems? I'm also
 trying to decrease the verbosity in 0.45, using the options specified
 here: http://ceph.newdream.net/wiki/Debugging. Setting them down to
 '1' doesn't end the insane log sprawl I'm seeing.
 
 On Tue, Apr 17, 2012 at 10:09 PM, Greg Farnum
 gregory.far...@dreamhost.com (mailto:gregory.far...@dreamhost.com) wrote:
  On Tuesday, April 17, 2012 at 9:53 PM, Danny Kukawka wrote:
   Hi,
   
   did something change with the default log levels for OSDs on v0.45? With
   0.43 and IIRC also 0.44* the logfiles had an acceptable size, but now I
   get by default ~3Gbyte per OSD over 12 hours without any change in the
   config file.
   
   Danny
  I think some extra event notifications got stuck in the logs for OSD 
  operations; they're nice for debugging but may well have a log level higher 
  than they should. They should be easy to compress a lot, though!
  
  Can you comment on this, Sam?
  -Greg
  
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org 
  (mailto:majord...@vger.kernel.org)
  More majordomo info at http://vger.kernel.org/majordomo-info.html
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pgs stuck inactive

2012-04-19 Thread Greg Farnum
On Thursday, April 19, 2012 at 2:07 AM, Damien Churchill wrote:
 On 18 April 2012 21:41, Greg Farnum gregory.far...@dreamhost.com 
 (mailto:gregory.far...@dreamhost.com) wrote:
  That should get everything back up and running. The one sour note is that 
  due to the bug in the past, your data (ie, filesystem) and vmimages pools 
  have gotten conflated. It shouldn't cause any issues (they use very 
  different naming schemes), but they're tied together in terms of 
  replication and the raw pool statistics.
  (If that's important you can create a new pool and move the rbd images to 
  it.)
 
 
 
 Thanks a lot Greg!
 
 All back up and running now. What negative side effects could having
 the pools mixed together have, given that I'm not doing any special
 placement of them?

There shouldn't be any negative side effects from it at all. It just means that 
you've got a mixed namespace, and if you don't care about that none of our 
current tools do either. (Something like the still-entirely-theoretical 
ceph-fsck probably wouldn't appreciate it, but we don't have anything like that 
right now.) 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: wip-librbd-caching

2012-04-18 Thread Greg Farnum
On Wednesday, April 18, 2012 at 5:50 AM, Martin Mailand wrote:
 Hi,
  
 I changed the values and the performance is still very good and the  
 memory footprint is much smaller.
  
 OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n
 OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty  
 OR tx.. bigish)
 OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty  
 (keep this smallish)
 // note: the max amount of in flight dirty data is roughly (max - target)
  
 But I am not quite sure about the meaning of the values.
 client_oc_size Max size of the cache?
 client_oc_max_dirty max dirty value before the writeback starts?
 client_oc_target_dirty ???
  

Right now the cache writeout algorithms are based on amount of dirty data, 
rather than something like how long the data has been dirty.  
client_oc_size is the max (and therefore typical) size of the cache.
client_oc_max_dirty is the largest amount of dirty data in the cache — if this 
much is dirty and you try to dirty more, the operation doing the dirtying (a write 
of some kind) will block until some of the other dirty data has been committed.
client_oc_target_dirty is the amount of dirty data that will trigger the cache 
to start flushing data out.
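If a toy model helps, the policy is roughly this (names mirror the options above; the logic is a simplified sketch, not the actual ObjectCacher code):

  #include <stdint.h>
  #include <stdio.h>

  struct CacheModel {
    uint64_t size, max_dirty, target_dirty, dirty;

    void write(uint64_t len) {
      if (dirty + len > max_dirty) {
        printf("writer would block at %llu dirty bytes\n",
               (unsigned long long)(dirty + len));
        flush(dirty + len - max_dirty);   // pretend the flusher caught up
      }
      dirty += len;
      if (dirty > target_dirty)
        printf("flusher kicks in at %llu dirty bytes\n",
               (unsigned long long)dirty);
    }
    void flush(uint64_t len) { dirty -= (len > dirty ? dirty : len); }
  };

  int main() {
    CacheModel c = { 50u<<20, 25u<<20, 8u<<20, 0 };   // the values you set above
    for (int i = 0; i < 10; i++)
      c.write(4u<<20);                                // 4MB writes, like RBD objects
    return 0;
  }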

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pgs stuck inactive

2012-04-18 Thread Greg Farnum
On Tuesday, April 17, 2012 at 11:41 PM, Damien Churchill wrote:
 On 17 April 2012 17:49, Greg Farnum gregory.far...@dreamhost.com 
 (mailto:gregory.far...@dreamhost.com) wrote:
  Do you know what version this was created with, and what upgrades you've 
  been through? My best guess right now is that there's a problem with the 
  encoding and decoding that I'm going to have to track down, and more 
  context will make it a lot easier. :)
 
 
 
 Hmmm that's testing my memory, I'd say that cluster has been alive at
 least since 0.34. Occasionally I think there was a version skipped,
 not sure if that could cause any issues?

Okay. So the good news is that we can see what's broken now and have a kludge 
to prevent it happening to others; the bad news is we still have no idea how it 
actually occurred. :( But I don't think it's worth investing the time given 
what we have available, so all we can do is repair your cluster. 

Are you building your binaries from source, and can you run a patched version 
of the monitors? If you can I'll give you a patch to enable a simple command 
that should make things work; otherwise we'll need to start editing things by 
hand. (Yucky)
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pgs stuck inactive

2012-04-18 Thread Greg Farnum
On Wednesday, April 18, 2012 at 12:04 PM, Damien Churchill wrote:
 On 18 April 2012 19:41, Greg Farnum gregory.far...@dreamhost.com 
 (mailto:gregory.far...@dreamhost.com) wrote:
  
  Are you building your binaries from source, and can you run a patched 
  version of the monitors? If you can I'll give you a patch to enable a 
  simple command that should make things work; otherwise we'll need to start 
  editing things by hand. (Yucky)
  -Greg
 
 
 
 I was using the Ubuntu packages but I can quite happily build my own
 packages if you give me the patch :-)
 
 I agree it's a waste of time if it's not obvious what's caused it,
 could be some obscure cause occurred due to upgrading between older
 versions.

Okay, assuming you're still on 0.41.1, can you checkout the git branch 
for-damien and build it? 
Then shut down your monitors, replace their executables with the freshly-built 
ones, and run
ceph osd pool set vmimages pg_num 320 --a-dev-told-me-to
and
ceph osd pool set vmimages pgp_num 320



That should get everything back up and running. The one sour note is that due 
to the bug in the past, your data (ie, filesystem) and vmimages pools have 
gotten conflated. It shouldn't cause any issues (they use very different naming 
schemes), but they're tied together in terms of replication and the raw pool 
statistics.
(If that's important you can create a new pool and move the rbd images to it.)
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pgs stuck inactive

2012-04-17 Thread Greg Farnum
On Monday, April 16, 2012 at 10:32 PM, Damien Churchill wrote:
 On 17 April 2012 01:06, Greg Farnum gregory.far...@dreamhost.com 
 (mailto:gregory.far...@dreamhost.com) wrote:
  Yep!
  We looked into this more today and have discovered some definite oddness. 
  Have you by any chance tried to change the number of PGs in your pools?
 
 
 
 I haven't no (at least certainly not on purpose!). All I've really
 done is copy a bit of stuff onto the unix fs and create a few rbd
 volumes, as well as upgrade it when a new version comes out.

Drat, that means there's actually a problem to track down somewhere. 
Do you know what version this was created with, and what upgrades you've been 
through? My best guess right now is that there's a problem with the encoding 
and decoding that I'm going to have to track down, and more context will make 
it a lot easier. :)
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Log files with 0.45

2012-04-17 Thread Greg Farnum
On Tuesday, April 17, 2012 at 9:53 PM, Danny Kukawka wrote:
 Hi,
 
 did something change with the default log levels for OSDs on v0.45? With
 0.43 and IIRC also 0.44* the logfiles had an acceptable size, but now I
 get by default ~3Gbyte per OSD over 12 hours without any change in the
 config file.
 
 Danny 
I think some extra event notifications got stuck in the logs for OSD 
operations; they're nice for debugging but may well have a log level higher 
than they should. They should be easy to compress a lot, though!

Can you comment on this, Sam? 
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: librados aio completion

2012-04-16 Thread Greg Farnum
On Sunday, April 15, 2012 at 9:44 PM, Sage Weil wrote:
 We just switched the completion callbacks so that they are called 
 asynchronously from another thread. This makes the locking less weird for 
 certain callers and lets you call back into librados in your callback 
 safely.
 
 This breaks one of the functional tests, which sets a bool in the 
 callback, does wait_for_complete() on the aio handle, and then asserts 
 that the bool is set. There's now a race between the caller's thread and 
 the completion thread.
 
 Do we just call this a broken test, or do we want some way of blocking on 
 the aio handle until the completion has been called?
 


I think we have to block on the aio handle until the completion has been called. 
Expecting users to (constantly re-)implement that themselves is just silly, and 
anybody who needs to wait_for_complete() is going to expect the completion to 
have been called.
I don't remember exactly how wait_for_complete() is triggered, but it can't be 
too difficult to move it into the completion thread.
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: librados aio completion

2012-04-16 Thread Greg Farnum
On Monday, April 16, 2012 at 2:07 PM, Sage Weil wrote:
 On Mon, 16 Apr 2012, Greg Farnum wrote:
  Or set the bool to true, then do the callback, then signal? 
 
 
 That's sort of what I was getting at with 
 wait_for_complete_and_callback_returned(). We could make 
 wait_for_complete() do that, although it should be a second bool because 
 cond.Wait() can wake up nondeterministically (because of a signal or 
 something). For example I could clear the callback pointer after it 
 returns, and make the wait loop check for the bool and callback_ptr == 
 NULL.
 
 It just means the wait_for_complete() does not actually wait for 
 is_complete(), but is_complete() && did callback.


Okay. I'm just thinking that we need wait_for_complete() to be the stronger 
variant, since that's how it previously behaved. (Whereas I believe that 
previously is_complete() and is_safe() functioned correctly inside callbacks, 
correct? So we need to preserve that behavior as well.)
If we want the weaker guarantee for some reason, we can add a 
wait_for_complete_response() or similar.
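To be concrete, the stronger variant I mean is basically this; member and function names are made up for the sketch, not the real librados internals:

  #include <pthread.h>

  struct Completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool complete;        // the rados op finished
    bool callback_done;   // the user callback returned (or there was none)

    Completion() : complete(false), callback_done(false) {
      pthread_mutex_init(&lock, NULL);
      pthread_cond_init(&cond, NULL);
    }

    // called from the completion/finisher thread
    void finish(void (*cb)(Completion*)) {
      pthread_mutex_lock(&lock);
      complete = true;
      pthread_mutex_unlock(&lock);
      if (cb)
        cb(this);                      // run the callback outside the lock
      pthread_mutex_lock(&lock);
      callback_done = true;
      pthread_cond_broadcast(&cond);
      pthread_mutex_unlock(&lock);
    }

    // called from the user thread
    void wait_for_complete() {
      pthread_mutex_lock(&lock);
      while (!complete || !callback_done)   // loop: cond waits can wake spuriously
        pthread_cond_wait(&cond, &lock);
      pthread_mutex_unlock(&lock);
    }
  };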

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pgs stuck inactive

2012-04-16 Thread Greg Farnum


On Friday, April 13, 2012 at 12:42 PM, Damien Churchill wrote:

 Hi,
  
 On 13 April 2012 20:30, Greg Farnum gregory.far...@dreamhost.com 
 (mailto:gregory.far...@dreamhost.com) wrote:
  On Thursday, April 12, 2012 at 8:29 AM, Damien Churchill wrote:
   On 11 April 2012 00:40, Greg Farnum gregory.far...@dreamhost.com 
   (mailto:gregory.far...@dreamhost.com) wrote:
 
A quick glance through these shows that all the pg_temp requests aren't 
actually requesting any changes from the monitor. It's either a very 
serious mon bug which happened a while ago (unlikely, given the 
restarts and ongoing map changes, etc), or an OSD bug. I think we want 
logs from both osd.0 and osd.3 at the same time, from what I'm seeing. 
:)
-Greg





   Just to make sure all bases are covered:

   http://damoxc.net/ceph/ceph-logs-20120412142537.tar.gz

   This contains all 5 osd logs and all 3 monitor logs, everything
   restarted with debug logging prior to capturing the logs.
   
   
   
  I (and Sam) spent some time looking at this very closely. It continues to 
  tell me that the OSD and the monitor are disagreeing on whether osd 3 
  should be in the pg temp set for some things, but they seem to agree on 
  everything else….
  Can you zip up for me:
  1) The files matching osdmap* of osd0's store from the current/meta/ 
  directory,
  2) The contents of your lead monitor's osdmap and osdmap_full directories?
  
  
  
 Here they are
  
 http://damoxc.net/ceph/osdmap.0.tar.gz
 http://damoxc.net/ceph/mon.node21.osdmap.tar.gz
  
 Hopefully I got the right files :)
Yep!
We looked into this more today and have discovered some definite oddness. Have 
you by any chance tried to change the number of PGs in your pools?

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pgs stuck inactive

2012-04-13 Thread Greg Farnum
On Thursday, April 12, 2012 at 8:29 AM, Damien Churchill wrote:
 On 11 April 2012 00:40, Greg Farnum gregory.far...@dreamhost.com 
 (mailto:gregory.far...@dreamhost.com) wrote:
   
  A quick glance through these shows that all the pg_temp requests aren't 
  actually requesting any changes from the monitor. It's either a very 
  serious mon bug which happened a while ago (unlikely, given the restarts 
  and ongoing map changes, etc), or an OSD bug. I think we want logs from 
  both osd.0 and osd.3 at the same time, from what I'm seeing. :)
  -Greg
  
  
  
 Just to make sure all bases are covered:
  
 http://damoxc.net/ceph/ceph-logs-20120412142537.tar.gz
  
 This contains all 5 osd logs and all 3 monitor logs, everything
 restarted with debug logging prior to capturing the logs.

I (and Sam) spent some time looking at this very closely. It continues to tell 
me that the OSD and the monitor are disagreeing on whether osd 3 should be in 
the pg temp set for some things, but they seem to agree on everything else….  
Can you zip up for me:
1) The files matching osdmap* of osd0's store from the current/meta/ directory,
2) The contents of your lead monitor's osdmap and osdmap_full directories?

We can check these for differences and then run them through some of our tools 
and stuff to try and identify the issue.
Thanks!
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: wip-librbd-caching

2012-04-12 Thread Greg Farnum
On Thursday, April 12, 2012 at 12:45 PM, Sage Weil wrote:
 On Thu, 12 Apr 2012, Martin Mailand wrote:
  The other point is, that the cache is not KSM enabled, therefore identical
  pages will not be merged, could that be changed, what would be the downside?
   
  So maybe we could reduce the memory footprint of the cache, but keep it's
  performance.
  
  
  
 I'm not familiar with the performance implications of KSM, but the  
 objectcacher doesn't modify existing buffers in place, so I suspect it's a  
 good candidate. And it looks like there's minimal effort in enabling  
 it...


But if you're supposed to advise the kernel that the memory is a good 
candidate, then probably we shouldn't be making that madvise call on every 
buffer (I imagine it's doing a sha1 on each page and then examining a tree) — 
especially since we (probably) flush all that data out relatively quickly. And 
RBD doesn't currently have any information about whether the data is OS or user 
data… (I guess in future, with layering, we could call madvise on pages which 
were read from an underlying gold image.)
Also, TV is wondering if the data is even page-aligned or not? I can't recall 
off-hand.
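For reference, opting a buffer into KSM is just an madvise() call, and madvise() wants a page-aligned address, which is why the alignment question matters. A minimal sketch:

  #include <sys/mman.h>
  #include <stdlib.h>

  int main() {
    size_t len = 4 << 20;                  // e.g. one 4MB cache buffer
    void *buf;
    if (posix_memalign(&buf, 4096, len))   // page-aligned allocation
      return 1;
    // mark the region as a KSM merge candidate; the kernel scans it and
    // COW-merges pages that turn out to be identical to others
    if (madvise(buf, len, MADV_MERGEABLE))
      return 1;                            // fails if the kernel lacks KSM
    free(buf);
    return 0;
  }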
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Statically binding ports for ceph-osd

2012-04-11 Thread Greg Farnum
You're unlikely to hit it since you're setting all addresses, but we somehow 
managed to introduce an error even in that small patch -- you may want to pull 
in commit cd4a760e9b22047fa5a45d0211ec4130809d725e as well.
-Greg

On Tuesday, April 10, 2012 at 5:13 PM, Nick Bartos wrote:
 Good enough for me, I'll just patch it for the short term.
 
 Thanks!
 
 On Tue, Apr 10, 2012 at 4:51 PM, Sage Weil s...@newdream.net 
 (mailto:s...@newdream.net) wrote:
  On Tue, 10 Apr 2012, Nick Bartos wrote:
   Awesome, thanks so much!  Can I assume this will make it into the next
   ceph stable release?  I'll probably just backport it now before we
   actually start using it, so I don't have to change the config later.
   
  
  
  v0.45 is out today/tomorrow, but it'll be in v0.46.
  
  sage
  
  
   
   On Tue, Apr 10, 2012 at 4:16 PM, Greg Farnum
   gregory.far...@dreamhost.com (mailto:gregory.far...@dreamhost.com) 
   wrote:
Yep, you're absolutely correct. Might as well let users specify the 
whole address rather than just the port, though — since your patch 
won't apply to current upstream due to some heartbeating changes, I 
whipped up another one which adds the osd heartbeat addr option. I've 
pushed it to master in commit 6fbac10dc68e67d1c700421f311cf5e26991d39c, 
but you'll want to backport (easy) or carry your change until you 
upgrade (and remember to change the config!). :)
Thanks for the report!
-Greg


On Tuesday, April 10, 2012 at 12:56 PM, Nick Bartos wrote:

 After doing some more looking at the code, it appears that this option
 is not supported. I created a small patch (attached) which adds the
 functionality. Is there any way we could get this, or something like
 this, applied upstream? I think this is important functionality for
 firewalled environments, and seems like a simple fix since all the
 other services (including ones for ceph-mon and ceph-mds) already
 allow you to specify a static port.
 
 
 On Mon, Apr 9, 2012 at 5:27 PM, Nick Bartos n...@pistoncloud.com 
 (mailto:n...@pistoncloud.com) wrote:
  I'm trying to get ceph-osd's listening ports to be set statically 
  for
  firewall reasons. I am able to get 2 of the 3 ports set statically,
  however the 3rd one is still getting set dynamically.
  
  I am using:
  
  [osd.48]
  host = 172.16.0.13
  cluster addr = 172.16.0.13:6944
  public addr = 172.16.0.13:6945
  
  The daemon will successfully bind to 6944 and 6945, but also binds 
  to
  6800. What additional option do I need? I started looking at the
  code and thought hb addr = 172.16.0.13:6946 would do it, but
  specifying that option seems to have no effect (or at least does not
  achieve the desired result).
  
 
 
 
 
 Attachments:
 - ceph-0.41-osd_hb_port.patch
 


   
   --
   To unsubscribe from this list: send the line unsubscribe ceph-devel in
   the body of a message to majord...@vger.kernel.org 
   (mailto:majord...@vger.kernel.org)
   More majordomo info at  http://vger.kernel.org/majordomo-info.html
   
  
  
 
 
 
 



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Make libcephfs return error when unmounted?

2012-04-11 Thread Greg Farnum
On Wednesday, April 11, 2012 at 11:25 AM, Noah Watkins wrote:
 
 On Apr 11, 2012, at 11:22 AM, Greg Farnum wrote:
 
  On Wednesday, April 11, 2012 at 11:12 AM, Noah Watkins wrote:
   Hi all,
   
   -Noah 
  I'm not sure where the -1004 came from,
 
 ceph_mount(..) seems to return some random error codes (-1000, 1001) already 
 :)

<grumble>legacy undocumented grr</grumble>
Let's try to use standard error codes where available, and (if we have to 
create our own) document any new ones with user-accessible names and 
explanations. I don't know which one is best but I see a lot of applicable 
choices when scanning errno-base et al.
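For example, -ENOTCONN reads reasonably well for the not-mounted case. A minimal sketch of the guard (ceph_do_something and the struct body are stand-ins, not the real libcephfs code):

  #include <errno.h>

  struct ceph_mount_info {          // stand-in for the real class
    bool mounted;
    bool is_mounted() const { return mounted; }
  };

  int ceph_do_something(struct ceph_mount_info *cmount) {
    if (!cmount || !cmount->is_mounted())
      return -ENOTCONN;             // a standard errno instead of an ad hoc -1004
    /* ... forward to the Client ... */
    return 0;
  }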

Also, what Yehuda said. :)
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Make libcephfs return error when unmounted?

2012-04-11 Thread Greg Farnum
On Wednesday, April 11, 2012 at 2:59 PM, Noah Watkins wrote:
 
 On Apr 11, 2012, at 11:22 AM, Yehuda Sadeh Weinraub wrote:
 
  Also need to check that cmount is initialized. I'd add a helper:
  
   Client *ceph_get_client(struct ceph_mount_info *cmount)
   {
       if (cmount && cmount->is_mounted())
           return cmount->get_client();
   
       return NULL;
   }
 
 
 
 How useful is checking cmount != NULL here? This defensive check depends on 
 users initializing their cmount pointers to NULL, but the API doesn't do 
 anything to require this initialization assumption.
 
 - Noah 
I had a whole email going until I realized you were just right. So, yeah, that 
wouldn't do anything since a cmount they forgot to have the API initialize is 
just going to hold random data. Urgh.
-Greg
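
To spell out why that NULL check buys so little, here is a purely illustrative C fragment (not the real libcephfs code): the guard only helps callers who actually initialized the pointer, and a forgotten create call leaves it holding stack garbage that is usually not NULL.

    #include <stddef.h>

    struct ceph_mount_info;   /* opaque to the caller */

    /* A library-side guard like this ... */
    int looks_initialized(struct ceph_mount_info *cmount)
    {
        return cmount != NULL;
    }
    /* ... only protects callers who set the pointer to NULL (or had the API
     * initialize it) first. A caller that declares the pointer and never
     * initializes it passes an indeterminate value, which is usually non-NULL,
     * so the check silently passes. */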



Re: Make libcephfs return error when unmounted?

2012-04-11 Thread Greg Farnum


On Wednesday, April 11, 2012 at 3:34 PM, Yehuda Sadeh Weinraub wrote:

 On Wed, Apr 11, 2012 at 3:13 PM, Greg Farnum
 gregory.far...@dreamhost.com (mailto:gregory.far...@dreamhost.com) wrote:
  On Wednesday, April 11, 2012 at 2:59 PM, Noah Watkins wrote:
   
   On Apr 11, 2012, at 11:22 AM, Yehuda Sadeh Weinraub wrote:
   
Also need to check that cmount is initialized. I'd add a helper:

Client *ceph_get_client(struct ceph_mount_info *cmount)
{
    if (cmount && cmount->is_mounted())
        return cmount->get_client();

    return NULL;
}
   
   
   
   
   
   How useful is checking cmount != NULL here? This defensive check depends 
   on users initializing their cmount pointers to NULL, but the API doesn't 
   do anything to require this initialization assumption.
   
   - Noah
  I had a whole email going until I realized you were just right. So, yeah, 
  that wouldn't do anything since a cmount they forgot to have the API 
  initialize is just going to hold random data. Urgh.
 
 
 
 There's no destructor either, maybe it's a good time to add one?
 
 Yehuda 
Actually, there is. The problem is that to the client it's an opaque pointer 
under many(most?) circumstances, so that it can be used by C users. 
-Greg
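
For readers unfamiliar with the pattern Greg is describing: the public C header only forward-declares the handle, so C callers can hold and pass the pointer but never see the C++ object behind it, and construction and cleanup go through plain functions. A generic, hypothetical sketch of that shape (declarations are illustrative, not the exact libcephfs header):

    /* public C header: the type stays incomplete on purpose */
    struct ceph_mount_info;

    /* lifetime is managed through functions rather than C++ new/delete */
    int  ceph_create(struct ceph_mount_info **cmount, const char *id);   /* allocates the handle */
    int  ceph_mount(struct ceph_mount_info *cmount, const char *root);
    void ceph_destroy(struct ceph_mount_info *cmount);   /* hypothetical name for the cleanup call */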



Re: [PATCH] Statically binding ports for ceph-osd

2012-04-10 Thread Greg Farnum
Yep, you're absolutely correct. Might as well let users specify the whole 
address rather than just the port, though — since your patch won't apply to 
current upstream due to some heartbeating changes I whipped up another one 
which adds the osd heartbeat addr option. It's pushed to master in commit 
6fbac10dc68e67d1c700421f311cf5e26991d39c, but you'll want to backport (easy) or 
carry your change until you upgrade (and remember to change the config!). :)
Thanks for the report!
-Greg


On Tuesday, April 10, 2012 at 12:56 PM, Nick Bartos wrote:

 After doing some more looking at the code, it appears that this option
 is not supported. I created a small patch (attached) which adds the
 functionality. Is there any way we could get this, or something like
 this, applied upstream? I think this is important functionality for
 firewalled environments, and seems like a simple fix since all the
 other services (including ones for ceph-mon and ceph-mds) already
 allow you to specify a static port.
  
  
 On Mon, Apr 9, 2012 at 5:27 PM, Nick Bartos n...@pistoncloud.com 
 (mailto:n...@pistoncloud.com) wrote:
  I'm trying to get ceph-osd's listening ports to be set statically for
  firewall reasons. I am able to get 2 of the 3 ports set statically,
  however the 3rd one is still getting set dynamically.
   
  I am using:
   
  [osd.48]
  host = 172.16.0.13
  cluster addr = 172.16.0.13:6944
  public addr = 172.16.0.13:6945
   
  The daemon will successfully bind to 6944 and 6945, but also binds to
  6800. What additional option do I need? I started looking at the
  code and thought hb addr = 172.16.0.13:6946 would do it, but
  specifying that option seems to have no effect (or at least does not
  achieve the desired result).
  
  
  
 Attachments:  
 - ceph-0.41-osd_hb_port.patch
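
For anyone wanting to pin all three sockets once the option lands, the resulting config would presumably look something like this (option spelling taken from the description above; ports are just examples):

    [osd.48]
        host = 172.16.0.13
        cluster addr = 172.16.0.13:6944
        public addr = 172.16.0.13:6945
        # new option from the commit above: pin the heartbeat socket as well
        osd heartbeat addr = 172.16.0.13:6946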
  





Re: pgs stuck inactive

2012-04-10 Thread Greg Farnum
On Tuesday, April 10, 2012 at 4:00 PM, Samuel Just wrote:
 Can you send along the osd log as well for comparison?
 -Sam
 
 On Tue, Apr 10, 2012 at 3:03 PM, Damien Churchill dam...@gmail.com 
 (mailto:dam...@gmail.com) wrote:
  Here are the monitor logs, they're from the monitor starts, however I
  restarted the osd shortly afterwards and left it to run for 5 or so
  minutes.
  
  http://damoxc.net/ceph/mon.node21.log.gz
  http://damoxc.net/ceph/mon.node22.log.gz
  http://damoxc.net/ceph/mon.node23.log.gz
  
  On 10 April 2012 22:49, Samuel Just sam.j...@dreamhost.com 
  (mailto:sam.j...@dreamhost.com) wrote:
   Nothing apparent from the backtrace. I need monitor logs from when
   the osd is sending pg_temp requests. Can you restart the osd and post
   the osd and all three monitor logs from when you restarted the osd?
   You'll have to enable monitor logging on all three.
   -Sam
  
 


A quick glance through these shows that all the pg_temp requests aren't 
actually requesting any changes from the monitor. It's either a very serious 
mon bug which happened a while ago (unlikely, given the restarts and ongoing 
map changes, etc), or an OSD bug. I think we want logs from both osd.0 and 
osd.3 at the same time, from what I'm seeing. :)
-Greg
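
For anyone trying to reproduce this, one common way to capture the requested logs is to turn up the debug levels in ceph.conf before restarting the daemons (the levels shown are just a typical choice):

    [mon]
        debug mon = 20
        debug ms = 1

    [osd]
        debug osd = 20
        debug ms = 1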



Re: [PATCH] Statically binding ports for ceph-osd

2012-04-10 Thread Greg Farnum
I think we've already branched off 0.45, so it'll have to wait until 0.46 
unless we decide to pull it over. Sage could probably be talked into it if you 
ask nicely?  
-Greg


On Tuesday, April 10, 2012 at 4:45 PM, Nick Bartos wrote:

 Awesome, thanks so much! Can I assume this will make it into the next
 ceph stable release? I'll probably just backport it now before we
 actually start using it, so I don't have to change the config later.
  
 On Tue, Apr 10, 2012 at 4:16 PM, Greg Farnum
 gregory.far...@dreamhost.com (mailto:gregory.far...@dreamhost.com) wrote:
  Yep, you're absolutely correct. Might as well let users specify the whole 
  address rather than just the port, though — since your patch won't apply to 
  current upstream due to some heartbeating changes I whipped up another one 
   which adds the osd heartbeat addr option. It's pushed to master in 
  commit 6fbac10dc68e67d1c700421f311cf5e26991d39c, but you'll want to 
  backport (easy) or carry your change until you upgrade (and remember to 
  change the config!). :)
  Thanks for the report!
  -Greg
   
   
  On Tuesday, April 10, 2012 at 12:56 PM, Nick Bartos wrote:
   
   After doing some more looking at the code, it appears that this option
   is not supported. I created a small patch (attached) which adds the
   functionality. Is there any way we could get this, or something like
   this, applied upstream? I think this is important functionality for
   firewalled environments, and seems like a simple fix since all the
   other services (including ones for ceph-mon and ceph-mds) already
   allow you to specify a static port.


   On Mon, Apr 9, 2012 at 5:27 PM, Nick Bartos n...@pistoncloud.com 
   (mailto:n...@pistoncloud.com) wrote:
I'm trying to get ceph-osd's listening ports to be set statically for
firewall reasons. I am able to get 2 of the 3 ports set statically,
however the 3rd one is still getting set dynamically.
 
I am using:
 
[osd.48]
host = 172.16.0.13
cluster addr = 172.16.0.13:6944
public addr = 172.16.0.13:6945
 
The daemon will successfully bind to 6944 and 6945, but also binds to
6800. What additional option do I need? I started looking at the
code and thought hb addr = 172.16.0.13:6946 would do it, but
specifying that option seems to have no effect (or at least does not
achieve the desired result).





   Attachments:
   - ceph-0.41-osd_hb_port.patch
   
  




