Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-20 Thread Sage Weil
On Wed, 20 Mar 2013, Yan, Zheng wrote:
 On 03/20/2013 07:09 AM, Greg Farnum wrote:
  Hmm, this is definitely narrowing the race (probably enough to never hit 
  it), but it's not actually eliminating it (if the restart happens after 4 
  billion requests?). More importantly this kind of symptom makes me worry 
  that we might be papering over more serious issues with colliding states in 
  the Table on restart.
  I don't have the MDSTable semantics in my head so I'll need to look into 
  this later unless somebody else volunteers to do so?
 
 Not just 4 billion requests; MDS restart has several stages, and the mdsmap
 epoch increases for each stage. I don't think there are any more colliding
 states in the table. The table client/server use two-phase commit. It's
 similar to a client request that involves multiple MDSes. The reqid is
 analogous to the client request ID. The difference is that the client request
 ID is unique because a new client always gets a unique session ID.

Each time a tid is consumed (at least for an update) it is journaled in 
the EMetaBlob::table_tids list, right?  So we could actually take a max 
from journal replay and pick up where we left off?  That seems like the 
cleanest.

I'm not too worried about 2^32 tids, I guess, but it would be nicer to 
avoid that possibility.

sage

 
 Thanks
 Yan, Zheng
 
  -Greg
  
  Software Engineer #42 @ http://inktank.com | http://ceph.com
  
  
  On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
  
From: Yan, Zheng zheng.z@intel.com

When a MDS becomes active, the table server re-sends 'agree' messages
for old prepared requests. If the recovered MDS starts a new table request
at the same time, the new request's ID can happen to be the same as an old
prepared request's ID, because the current table client assigns request IDs
from zero after the MDS restarts.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc            | 3 +++
 src/mds/MDSTableClient.cc | 5 +++++
 src/mds/MDSTableClient.h  | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index bb1c833..859782a 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
       dout(2) << "boot_start " << step << ": opening snap table" << dendl;
       snapserver->load(gather.new_sub());
     }
+
+    anchorclient->init();
+    snapclient->init();
 
     dout(2) << "boot_start " << step << ": opening mds log" << dendl;
     mdlog->open(gather.new_sub());
diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index ea021f5..beba0a3 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -34,6 +34,11 @@
 #undef dout_prefix
 #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
 
+void MDSTableClient::init()
+{
+  // make reqid unique between MDS restarts
+  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
+}
 
 void MDSTableClient::handle_request(class MMDSTableRequest *m)
 {
diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
index e15837f..78035db 100644
--- a/src/mds/MDSTableClient.h
+++ b/src/mds/MDSTableClient.h
@@ -63,6 +63,8 @@ public:
   MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
   virtual ~MDSTableClient() {}
 
+  void init();
+
   void handle_request(MMDSTableRequest *m);
 
   void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl,
                 Context *onfinish);
-- 
1.7.11.7
  
  
  
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-20 Thread Yan, Zheng
On 03/20/2013 02:15 PM, Sage Weil wrote:
 On Wed, 20 Mar 2013, Yan, Zheng wrote:
 On 03/20/2013 07:09 AM, Greg Farnum wrote:
 Hmm, this is definitely narrowing the race (probably enough to never hit 
 it), but it's not actually eliminating it (if the restart happens after 4 
 billion requests?). More importantly this kind of symptom makes me worry 
 that we might be papering over more serious issues with colliding states in 
 the Table on restart.
 I don't have the MDSTable semantics in my head so I'll need to look into 
 this later unless somebody else volunteers to do so?

 Not just 4 billion requests; MDS restart has several stages, and the mdsmap
 epoch increases for each stage. I don't think there are any more colliding
 states in the table. The table client/server use two-phase commit. It's
 similar to a client request that involves multiple MDSes. The reqid is
 analogous to the client request ID. The difference is that the client request
 ID is unique because a new client always gets a unique session ID.
 
 Each time a tid is consumed (at least for an update) it is journaled in 
 the EMetaBlob::table_tids list, right?  So we could actually take a max 
 from journal replay and pick up where we left off?  That seems like the 
 cleanest.

This approach works only if the client journals the reqid before sending the
request to the server, but in the current implementation the client journals
the reqid when it receives the server's 'agree' message.

Regards
Yan, Zheng 
 
 I'm not too worried about 2^32 tids, I guess, but it would be nicer to 
 avoid that possibility.
 
 sage
 

 Thanks
 Yan, Zheng

 -Greg

 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com

 When a MDS becomes active, the table server re-sends 'agree' messages
 for old prepared requests. If the recovered MDS starts a new table request
 at the same time, the new request's ID can happen to be the same as an old
 prepared request's ID, because the current table client assigns request IDs
 from zero after the MDS restarts.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDS.cc            | 3 +++
  src/mds/MDSTableClient.cc | 5 +++++
  src/mds/MDSTableClient.h  | 2 ++
  3 files changed, 10 insertions(+)

 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index bb1c833..859782a 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
        dout(2) << "boot_start " << step << ": opening snap table" << dendl;
        snapserver->load(gather.new_sub());
      }
 +
 +    anchorclient->init();
 +    snapclient->init();
  
      dout(2) << "boot_start " << step << ": opening mds log" << dendl;
      mdlog->open(gather.new_sub());
 diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
 index ea021f5..beba0a3 100644
 --- a/src/mds/MDSTableClient.cc
 +++ b/src/mds/MDSTableClient.cc
 @@ -34,6 +34,11 @@
  #undef dout_prefix
  #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
  
 +void MDSTableClient::init()
 +{
 +  // make reqid unique between MDS restarts
 +  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
 +}
  
  void MDSTableClient::handle_request(class MMDSTableRequest *m)
  {
 diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
 index e15837f..78035db 100644
 --- a/src/mds/MDSTableClient.h
 +++ b/src/mds/MDSTableClient.h
 @@ -63,6 +63,8 @@ public:
    MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
    virtual ~MDSTableClient() {}
  
 +  void init();
 +
    void handle_request(MMDSTableRequest *m);
  
    void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl,
                  Context *onfinish);
 -- 
 1.7.11.7




 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deb/rpm package purge

2013-03-20 Thread Laszlo Boszormenyi (GCS)
On Tue, 2013-03-19 at 15:59 -0700, Sage Weil wrote:
 On Tue, 19 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
  On the other hand, 'dpkg --purge' is to remove everything the package
  has installed and/or generated. This includes debconf answers as well.
  In other words, purge is used to make the system totally clean of the
  package. As such, if the sysadmin installs the package again, all debconf
  questions will be asked again and all generated files will be generated
  again from scratch.
 
 I understand that part, but the policy isn't very clear about files that 
 are not part of the package but are generated as a result of the package 
 being installed (i.e., user data).
 Forgive me, I just learnt English and my wording may not be that clear
for a native speaker.

 As a point of comparison, mysql removes the config files but not 
 /var/lib/mysql.
 As I remember, MySQL asks if /var/lib/mysql/ should be purged or not; I
may be mixing it up with another (database-related) package.

 The question is, is that okay/typical/desirable/recommended/a bad idea?
 I can rephrase my words. Purge removes the (binary) package files, its
configuration and logs (its generated files). To emphasize, user files
do _not_ fall into this category and must remain as-is, _intact_.
Some packages print a console message on purge that 'your files remain at xxx,
they were not removed'. Others just leave the dpkg warning 'directory not
empty so not removed', which means user files may have been left there and
that may be the reason the directory is not empty.
 I'm in a rush, but hopefully will be able to note policy parts in the
afternoon (CET).

 The less important question is whether /var/log/ceph should be removed; 
 I'm assuming yes?
 Yes, logs are going to be removed.

Hope I could answer your question now. Please let me know if I should clarify
more parts of my answer.
Laszlo/GCS

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-20 Thread Yan, Zheng
On 03/20/2013 02:15 PM, Sage Weil wrote:
 On Wed, 20 Mar 2013, Yan, Zheng wrote:
 On 03/20/2013 07:09 AM, Greg Farnum wrote:
 Hmm, this is definitely narrowing the race (probably enough to never hit 
 it), but it's not actually eliminating it (if the restart happens after 4 
 billion requests?). More importantly this kind of symptom makes me worry 
 that we might be papering over more serious issues with colliding states in 
 the Table on restart.
 I don't have the MDSTable semantics in my head so I'll need to look into 
 this later unless somebody else volunteers to do so?

 Not just 4 billion requests; MDS restart has several stages, and the mdsmap
 epoch increases for each stage. I don't think there are any more colliding
 states in the table. The table client/server use two-phase commit. It's
 similar to a client request that involves multiple MDSes. The reqid is
 analogous to the client request ID. The difference is that the client request
 ID is unique because a new client always gets a unique session ID.
 
 Each time a tid is consumed (at least for an update) it is journaled in 
 the EMetaBlob::table_tids list, right?  So we could actually take a max 
 from journal replay and pick up where we left off?  That seems like the 
 cleanest.
 
 I'm not too worried about 2^32 tids, I guess, but it would be nicer to 
 avoid that possibility.
 

Can we re-use the client request ID as the table client request ID?

Regards
Yan, Zheng

 sage
 

 Thanks
 Yan, Zheng

 -Greg

 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com

 When a MDS becomes active, the table server re-sends 'agree' messages
 for old prepared requests. If the recovered MDS starts a new table request
 at the same time, the new request's ID can happen to be the same as an old
 prepared request's ID, because the current table client assigns request IDs
 from zero after the MDS restarts.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDS.cc            | 3 +++
  src/mds/MDSTableClient.cc | 5 +++++
  src/mds/MDSTableClient.h  | 2 ++
  3 files changed, 10 insertions(+)

 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index bb1c833..859782a 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
        dout(2) << "boot_start " << step << ": opening snap table" << dendl;
        snapserver->load(gather.new_sub());
      }
 +
 +    anchorclient->init();
 +    snapclient->init();
  
      dout(2) << "boot_start " << step << ": opening mds log" << dendl;
      mdlog->open(gather.new_sub());
 diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
 index ea021f5..beba0a3 100644
 --- a/src/mds/MDSTableClient.cc
 +++ b/src/mds/MDSTableClient.cc
 @@ -34,6 +34,11 @@
  #undef dout_prefix
  #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
  
 +void MDSTableClient::init()
 +{
 +  // make reqid unique between MDS restarts
 +  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
 +}
  
  void MDSTableClient::handle_request(class MMDSTableRequest *m)
  {
 diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
 index e15837f..78035db 100644
 --- a/src/mds/MDSTableClient.h
 +++ b/src/mds/MDSTableClient.h
 @@ -63,6 +63,8 @@ public:
    MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
    virtual ~MDSTableClient() {}
  
 +  void init();
 +
    void handle_request(MMDSTableRequest *m);
  
    void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl,
                  Context *onfinish);
 -- 
 1.7.11.7




 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deb/rpm package purge

2013-03-20 Thread Sage Weil
On Wed, 20 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
 On Tue, 2013-03-19 at 15:59 -0700, Sage Weil wrote:
  On Tue, 19 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
   On the other hand, 'dpkg --purge' is to remove everything the package
   has installed and/or generated. This includes debconf answers as well.
   In other words, purge is used to make the system totally clean of the
   package. As such, if the sysadmin installs the package again, all debconf
   questions will be asked again and all generated files will be generated
   again from scratch.
  
  I understand that part, but the policy isn't very clear about files that 
  are not part of the package but are generated as a result of the package 
  being installed (i.e., user data).
  Forgive me, I just learnt English and my wording may not be that clear
 for a native speaker.
 
  As a point of comparison, mysql removes the config files but not 
  /var/lib/mysql.
  As I remember, MySQL asks if /var/lib/mysql/ should be purged or not; I
 may be mixing it up with another (database-related) package.
 
  The question is, is that okay/typical/desirable/recommended/a bad idea?
  I can rephrase my words. Purge removes the (binary) package files, its
 configuration and logs (its generated files). To emphasize, user files
 do _not_ fall into this category and must remain as-is, _intact_.
 Some packages print a console message on purge that 'your files remain at xxx,
 they were not removed'. Others just leave the dpkg warning 'directory not
 empty so not removed', which means user files may have been left there and
 that may be the reason the directory is not empty.
  I'm in a rush, but hopefully will be able to note policy parts in the
 afternoon (CET).

Thanks, Laszlo, that's exactly what I was after!  Sorry for the confusing 
exchange.  :)

Sounds like in this case, the fix is simply to leave /var/lib/ceph 
untouched.

We'll need to update teuthology ceph.py and nuke to clean up /var/lib/ceph 
(for qa runs), and I think we should add a ceph-deploy 'purgedata' command 
to clean out /var/lib/ceph on a given host.

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


waiting for 1 open ops to drain

2013-03-20 Thread Dave (Bob)
I am using ceph 0.58 and kernel 3.9-rc2 and btrfs on my osds.

I have an osd that starts up but blocks with the log message 'waiting
for 1 open ops to drain'.

This never happens, and I can't get the osd 'up'.

I need to clear this problem. I have recently had an osd go problematic
and I have recreated a fresh btrfs filesystem on the problem osd drive.
I have also added a completely new osd.

The 'waiting for 1 open ops to drain' problem has occurred before the
cluster has recovered from the earlier surgery and I need to get the
data from this osd.

I have increased the number of copies from 2 to 3 to give me more
resilience in the future, but that has not taken effect yet.

Once I get the cluster back to health, I will mkfs.btrfs and rebuild
this osd and one other that is a legacy from earlier kernel/ceph versions.

How can I tell the osd not to bother with waiting for its open ops to drain?

Thank you in anticipation.

David

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deb/rpm package purge

2013-03-20 Thread Mark Nelson

On 03/20/2013 07:48 AM, Sage Weil wrote:

On Wed, 20 Mar 2013, Laszlo Boszormenyi (GCS) wrote:

On Tue, 2013-03-19 at 15:59 -0700, Sage Weil wrote:

On Tue, 19 Mar 2013, Laszlo Boszormenyi (GCS) wrote:

On the other hand, 'dpkg --purge' is to remove everything the package
has installed and/or generated. This includes debconf answers as well.
In other words, purge is used to make the system totally clean of the
package. As such, if the sysadmin installs the package again, all debconf
questions will be asked again and all generated files will be generated
again from scratch.


I understand that part, but the policy isn't very clear about files that
are not part of the package but are generated as a result of the package
being installed (i.e., user data).

  Forgive me, I just learnt English and my wording may not be that clear
for a native speaker.


As a point of comparison, mysql removes the config files but not
/var/lib/mysql.

  As I remember, MySQL asks if /var/lib/mysql/ should be purged or not; I
may be mixing it up with another (database-related) package.


The question is, is that okay/typical/desirable/recommended/a bad idea?

  I can rephrase my words. Purge removes the (binary) package files, its
configuration and logs (its generated files). To emphasize, user files
do _not_ fall into this category and must remain as-is, _intact_.
Some packages print a console message on purge that 'your files remain at xxx,
they were not removed'. Others just leave the dpkg warning 'directory not
empty so not removed', which means user files may have been left there and
that may be the reason the directory is not empty.
  I'm in a rush, but hopefully will be able to note policy parts in the
afternoon (CET).


Thanks, Laszlo, that's exactly what I was after!  Sorry for the confusing
exchange.  :)

Sounds like in this case, the fix is simply to leave /var/lib/ceph
untouched.

We'll need to update teuthology ceph.py and nuke to clean up /var/lib/ceph
(for qa runs), and I think we should add a ceph-deploy 'purgedata' command
to clean out /var/lib/ceph on a given host.


It's not as important given that it won't outright destroy the cluster, 
but perhaps we should also leave /etc/ceph untouched on purge if a 
ceph.conf file has been placed in it (since that also was not installed 
by the package, but rather by a user?).  I figure we should probably try 
to get it right now.  The message about the directory not being empty 
sounds good.


My thought here is:

- remove anything created by the packages in /var/lib/ceph that has been 
untouched since package installation.

- remove /var/lib/ceph if it has been untouched
- remove /etc/ceph if it has been untouched

Thoughts?



Thanks!
sage



Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deb/rpm package purge

2013-03-20 Thread James Page
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

On 20/03/13 14:48, Mark Nelson wrote:
 
 We'll need to update teuthology ceph.py and nuke to clean up 
 /var/lib/ceph (for qa runs), and I think we should add a
 ceph-deploy 'purgedata' command to clean out /var/lib/ceph on a
 given host.
 
 It's not as important given that it won't outright destroy the
 cluster, but perhaps we should also leave /etc/ceph untouched on
 purge if a ceph.conf file has been placed in it (since that also
 was not installed by the package, but rather by a user?).  I figure
 we should probably try to get it right now.  The message about the
 directory not being empty sounds good.
 
 My thought here is:
 
 - remove anything created by the packages in /var/lib/ceph that has
 been untouched since package installation. - remove /var/lib/ceph
 if it has been untouched - remove /etc/ceph if it has been
 untouched

If those directories are created by dpkg rather than maintainer
scripts (i.e. in ceph.dirs rather than ceph.postinst) you should
get that all for free; if the directories contain anything dpkg
does not know about it will just not remove them.


- -- 
James Page
Ubuntu Core Developer
Debian Maintainer
james.p...@ubuntu.com
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQIcBAEBCAAGBQJRSfFoAAoJEL/srsug59jDJgsP/0zD4aKFtimFPh02/zdfJ3X0
BEY8Jmnmt3HcCxSPbGleZ+2p/38iLfLz8HdM1ZpOwGDVIv13N45vG0wW9zF/843R
8vKoGHJY7gAt/uY1fqa115m9txeNXAIZoaxwjrd6Zd31pgPvTBmfZhVsFKUnk7E5
9JmUs/K8gjjAZajhkUKgddp2ID70n/WGdHR+iu5cy72TyuvXVBQV1OmyYi9lMIxM
yHGnCM/X7x5DE1g61x532VP2D0gAegA2WWURoqQ6vAM3IZfVGVuvat+HdzZ8Ej8z
LfTk+8n2alTj6s1Xp3KGbb/D231MIi3VBaFMQx5pBlM29lAv8OYKidRQpZc9bZe2
5m5vhDutp4ZOZmqxDvDdayZgb/s8uVuodT2XK7qn4KbBRDEJN5aJKiUzXH0wTVTZ
Lg/A/criFuzRP+ZH/Sh1pfSnLkNrrLMbdTglUv4krM2L6ZPOmEU3fh+UIXkr2u9t
f6lnu4fVBwikDy/4hDztVL76IqB3wjnxYlJ1uHMHOrCugDeLRsGHdbCdFcoZonRW
rUhdrtqzuuSbPHkzs/dEMCm4vF439YdmuL4WGsfzxEu+djcESDtzuAw4D1LRO12V
JbEc9s8+L84oBJUv5dCSDG333jWc/8eihSs1ZG33NKZsWsNheppLR6aeXbg/nGi1
53R63uM1Lv7f1UaTxvH8
=6adU
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-20 Thread Greg Farnum
On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
 On 03/20/2013 02:15 PM, Sage Weil wrote:
  On Wed, 20 Mar 2013, Yan, Zheng wrote:
   On 03/20/2013 07:09 AM, Greg Farnum wrote:
Hmm, this is definitely narrowing the race (probably enough to never 
hit it), but it's not actually eliminating it (if the restart happens 
after 4 billion requests?). More importantly this kind of symptom makes 
me worry that we might be papering over more serious issues with 
colliding states in the Table on restart.
I don't have the MDSTable semantics in my head so I'll need to look 
into this later unless somebody else volunteers to do so?



    Not just 4 billion requests; MDS restart has several stages, and the mdsmap
    epoch increases for each stage. I don't think there are any more colliding
    states in the table. The table client/server use two-phase commit. It's
    similar to a client request that involves multiple MDSes. The reqid is
    analogous to the client request ID. The difference is that the client
    request ID is unique because a new client always gets a unique session ID.
   
   
   
  Each time a tid is consumed (at least for an update) it is journaled in  
  the EMetaBlob::table_tids list, right? So we could actually take a max  
  from journal replay and pick up where we left off? That seems like the  
  cleanest.
   
  I'm not too worried about 2^32 tids, I guess, but it would be nicer to  
  avoid that possibility.
  
  
  
 Can we re-use the client request ID as table client request ID ?
  
 Regards
 Yan, Zheng

Not sure what you're referring to here — do you mean the ID of the filesystem 
client request which prompted the update? I don't think that would work as 
client requests actually require two parts to be unique (the client GUID and 
the request seq number), and I'm pretty sure a single client request can spawn 
multiple Table updates.
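
(For illustration only, this is not an existing Ceph type, just a sketch of
the shape I mean: an id that is only unique when you take both parts
together, the way (client GUID, request seq) is.)

    #include <cstdint>

    // Hypothetical sketch: pair a per-incarnation identifier (e.g. a startup
    // nonce or global id) with a per-incarnation sequence number.
    struct TableReqId {
      uint64_t instance;  // unique per MDS incarnation
      uint64_t seq;       // monotonically increasing within that incarnation
    };

    inline bool operator<(const TableReqId& a, const TableReqId& b) {
      if (a.instance != b.instance)
        return a.instance < b.instance;
      return a.seq < b.seq;
    }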

As I look over this more, it sure looks to me as if the effect of the code we 
have (when non-broken) is to rollback every non-committed request by an MDS 
which restarted — the only time it can handle the TableServer's agree with a 
different response is if the MDS was incorrectly marked out by the map. Am I 
parsing this correctly, Sage? Given that, and without having looked at the code 
more broadly, I think we want to add some sort of implicit or explicit 
handshake letting each of them know if the MDS actually disappeared. We use the 
process/address nonce to accomplish this in other places…
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/39] mds: preserve subtree bounds until slave commit

2013-03-20 Thread Greg Farnum
Reviewed-by: Greg Farnum g...@inktank.com 

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 When replaying an operation that renames a directory inode into a non-auth
 subtree, if the inode has subtree bounds, we should prevent them from being
 trimmed until slave commit.
 
 This patch also fixes a bug in ESlaveUpdate::replay(). EMetaBlob::replay()
 should be called before MDCache::finish_uncommitted_slave_update().
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDCache.cc | 21 +++++++++++----------
 src/mds/Mutation.h |  5 ++---
 src/mds/journal.cc | 13 +++++++++----
 3 files changed, 22 insertions(+), 17 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index fddcfc6..684e70b 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3016,10 +3016,10 @@ void MDCache::add_uncommitted_slave_update(metareqid_t reqid, int master, MDSlav
 {
   assert(uncommitted_slave_updates[master].count(reqid) == 0);
   uncommitted_slave_updates[master][reqid] = su;
 -  if (su->rename_olddir)
 -    uncommitted_slave_rename_olddir[su->rename_olddir]++;
 +  for(set<CDir*>::iterator p = su->olddirs.begin(); p != su->olddirs.end(); ++p)
 +    uncommitted_slave_rename_olddir[*p]++;
   for(set<CInode*>::iterator p = su->unlinked.begin(); p != su->unlinked.end(); ++p)
 -   uncommitted_slave_unlink[*p]++;
 +    uncommitted_slave_unlink[*p]++;
 }
 
 void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
 @@ -3031,11 +3031,12 @@ void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
   if (uncommitted_slave_updates[master].empty())
     uncommitted_slave_updates.erase(master);
   // discard the non-auth subtree we renamed out of
 -  if (su->rename_olddir) {
 -    uncommitted_slave_rename_olddir[su->rename_olddir]--;
 -    if (uncommitted_slave_rename_olddir[su->rename_olddir] == 0) {
 -      uncommitted_slave_rename_olddir.erase(su->rename_olddir);
 -      CDir *root = get_subtree_root(su->rename_olddir);
 +  for(set<CDir*>::iterator p = su->olddirs.begin(); p != su->olddirs.end(); ++p) {
 +    CDir *dir = *p;
 +    uncommitted_slave_rename_olddir[dir]--;
 +    if (uncommitted_slave_rename_olddir[dir] == 0) {
 +      uncommitted_slave_rename_olddir.erase(dir);
 +      CDir *root = get_subtree_root(dir);
       if (root->get_dir_auth() == CDIR_AUTH_UNDEF)
         try_trim_non_auth_subtree(root);
     }
 @@ -6052,8 +6053,8 @@ bool MDCache::trim_non_auth_subtree(CDir *dir)
 {
   dout(10) << "trim_non_auth_subtree(" << dir << ") " << *dir << dendl;
 
 -  // preserve the dir for rollback
 -  if (uncommitted_slave_rename_olddir.count(dir))
 +  if (uncommitted_slave_rename_olddir.count(dir) || // preserve the dir for rollback
 +      my_ambiguous_imports.count(dir->dirfrag()))
     return true;
 
   bool keep_dir = false;
 diff --git a/src/mds/Mutation.h b/src/mds/Mutation.h
 index 55b84eb..5013f04 100644
 --- a/src/mds/Mutation.h
 +++ b/src/mds/Mutation.h
 @@ -315,13 +315,12 @@ struct MDSlaveUpdate {
   bufferlist rollback;
   elist<MDSlaveUpdate*>::item item;
   Context *waiter;
 -  CDir* rename_olddir;
 +  set<CDir*> olddirs;
   set<CInode*> unlinked;
   MDSlaveUpdate(int oo, bufferlist &rbl, elist<MDSlaveUpdate*> &list) :
     origop(oo),
     item(this),
 -    waiter(0),
 -    rename_olddir(0) {
 +    waiter(0) {
     rollback.claim(rbl);
     list.push_back(item);
   }
 diff --git a/src/mds/journal.cc b/src/mds/journal.cc
 index 5b3bd71..3375e40 100644
 --- a/src/mds/journal.cc
 +++ b/src/mds/journal.cc
 @@ -1131,10 +1131,15 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, MDSlaveUpdate *slaveup)
     if (olddir) {
       if (olddir->authority() != CDIR_AUTH_UNDEF &&
           renamed_diri->authority() == CDIR_AUTH_UNDEF) {
 +        assert(slaveup); // auth to non-auth, must be slave prepare
         list<frag_t> leaves;
         renamed_diri->dirfragtree.get_leaves(leaves);
 -        for (list<frag_t>::iterator p = leaves.begin(); p != leaves.end(); ++p)
 -          renamed_diri->get_or_open_dirfrag(mds->mdcache, *p);
 +        for (list<frag_t>::iterator p = leaves.begin(); p != leaves.end(); ++p) {
 +          CDir *dir = renamed_diri->get_or_open_dirfrag(mds->mdcache, *p);
 +          // preserve subtree bound until slave commit
 +          if (dir->authority() == CDIR_AUTH_UNDEF)
 +            slaveup->olddirs.insert(dir);
 +        }
       }
 
       mds->mdcache->adjust_subtree_after_rename(renamed_diri, olddir, false);
 @@ -1143,7 +1148,7 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, MDSlaveUpdate *slaveup)
       CDir *root = mds->mdcache->get_subtree_root(olddir);
       if (root->get_dir_auth() == CDIR_AUTH_UNDEF) {
         if (slaveup) // preserve the old dir until slave commit
 -          slaveup->olddirs.insert(olddir);
 +          slaveup->olddirs.insert(olddir);
         else
           mds->mdcache->try_trim_non_auth_subtree(root);
       }
 @@ -2122,10 +2127,10 @@ void ESlaveUpdate::replay(MDS *mds)
   case 

Re: [PATCH 03/39] mds: fix MDCache::adjust_bounded_subtree_auth()

2013-03-20 Thread Greg Farnum
Reviewed-by: Greg Farnum g...@inktank.com


Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 There are cases that need to both create a new bound and swallow an
 intervening subtree. For example: an MDS exports subtree A with bound B and
 imports subtree B with bound C at the same time. The MDS crashes, exporting
 subtree A fails, but importing subtree B succeeds. During recovery, the
 MDS may create the new bound C and swallow subtree B.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDCache.cc | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 684e70b..19dc60b 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -980,15 +980,21 @@ void MDCache::adjust_bounded_subtree_auth(CDir *dir, set<CDir*>& bounds, pair<in
     }
     else {
       dout(10) << "  want bound " << *bound << dendl;
 +      CDir *t = get_subtree_root(bound->get_parent_dir());
 +      if (subtrees[t].count(bound) == 0) {
 +        assert(t != dir);
 +        dout(10) << "  new bound " << *bound << dendl;
 +        adjust_subtree_auth(bound, t->authority());
 +      }
       // make sure it's nested beneath ambiguous subtree(s)
       while (1) {
 -        CDir *t = get_subtree_root(bound->get_parent_dir());
 -        if (t == dir) break;
         while (subtrees[dir].count(t) == 0)
           t = get_subtree_root(t->get_parent_dir());
         dout(10) << "  swallowing intervening subtree at " << *t << dendl;
         adjust_subtree_auth(t, auth);
         try_subtree_merge_at(t);
 +        t = get_subtree_root(bound->get_parent_dir());
 +        if (t == dir) break;
       }
     }
   }
 -- 
 1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/39] mds: send table request when peer is in proper state.

2013-03-20 Thread Greg Farnum
This and patch 6 are probably going to get dealt with as part of our 
conversation on patch 4 and restart of the TableServers. 

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 Table client/server should send requests/replies only when the peer is active.
 Anchor query is an exception, because an MDS in the rejoin stage may need to
 fetch files before sending the rejoin ack; the anchor server can also be
 in the rejoin stage.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/AnchorClient.cc   | 5 ++++-
 src/mds/MDSTableClient.cc | 9 ++++++---
 src/mds/MDSTableServer.cc | 3 ++-
 3 files changed, 12 insertions(+), 5 deletions(-)
 
 diff --git a/src/mds/AnchorClient.cc b/src/mds/AnchorClient.cc
 index 455e97f..d7da9d1 100644
 --- a/src/mds/AnchorClient.cc
 +++ b/src/mds/AnchorClient.cc
 @@ -80,9 +80,12 @@ void AnchorClient::lookup(inodeno_t ino, vector<Anchor>& trace, Context *onfinis
 
 void AnchorClient::_lookup(inodeno_t ino)
 {
 +  int ts = mds->mdsmap->get_tableserver();
 +  if (mds->mdsmap->get_state(ts) < MDSMap::STATE_REJOIN)
 +    return;
   MMDSTableRequest *req = new MMDSTableRequest(table, TABLESERVER_OP_QUERY, 0, 0);
   ::encode(ino, req->bl);
 -  mds->send_message_mds(req, mds->mdsmap->get_tableserver());
 +  mds->send_message_mds(req, ts);
 }
 
 
 diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
 index beba0a3..df0131f 100644
 --- a/src/mds/MDSTableClient.cc
 +++ b/src/mds/MDSTableClient.cc
 @@ -149,9 +149,10 @@ void MDSTableClient::_prepare(bufferlist& mutation, version_t *ptid, bufferlist
 void MDSTableClient::send_to_tableserver(MMDSTableRequest *req)
 {
   int ts = mds->mdsmap->get_tableserver();
 -  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY)
 +  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY) {
     mds->send_message_mds(req, ts);
 -  else {
 +  } else {
 +    req->put();
     dout(10) << " deferring request to not-yet-active tableserver mds." << ts << dendl;
   }
 }
 @@ -193,7 +194,9 @@ void MDSTableClient::got_journaled_ack(version_t tid)
 void MDSTableClient::finish_recovery()
 {
   dout(7) << "finish_recovery" << dendl;
 -  resend_commits();
 +  int ts = mds->mdsmap->get_tableserver();
 +  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY)
 +    resend_commits();
 }
 
 void MDSTableClient::resend_commits()
 diff --git a/src/mds/MDSTableServer.cc b/src/mds/MDSTableServer.cc
 index 4f86ff1..07c7d26 100644
 --- a/src/mds/MDSTableServer.cc
 +++ b/src/mds/MDSTableServer.cc
 @@ -159,7 +159,8 @@ void MDSTableServer::handle_mds_recovery(int who)
   for (map<version_t,mds_table_pending_t>::iterator p = pending_for_mds.begin();
        p != pending_for_mds.end();
        ++p) {
 -    if (who >= 0 && p->second.mds != who)
 +    if ((who >= 0 && p->second.mds != who) ||
 +        mds->mdsmap->get_state(p->second.mds) < MDSMap::STATE_CLIENTREPLAY)
       continue;
     MMDSTableRequest *reply = new MMDSTableRequest(table, TABLESERVER_OP_AGREE, p->second.reqid, p->second.tid);
     mds->send_message_mds(reply, p->second.mds);
 -- 
 1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.

2013-03-20 Thread Greg Farnum
The idea of this patch makes sense, but I'm not sure if we guarantee that each 
daemon sees every map update — if they don't, then if an MDS misses the map 
moving an MDS into CLIENTREPLAY, it won't process that MDS as having 
recovered on the next map. Sage or Joao, what are the guarantees subscription 
provides?  
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
  
 MDS in clientreply state already starts serving requests. It also
 makes MDS::handle_mds_recovery() and MDS::recovery_done() match.
  
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)
  
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index 282fa64..b91dcbd 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
  
    set<int> oldactive, active;
    oldmap->get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
 +  oldmap->get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
    mdsmap->get_mds_set(active, MDSMap::STATE_ACTIVE);
 +  mdsmap->get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
    for (set<int>::iterator p = active.begin(); p != active.end(); ++p)  
      if (*p != whoami &&          // not me
          oldactive.count(*p) == 0) // newly so?
 --  
 1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/39] mds: mark connection down when MDS fails

2013-03-20 Thread Greg Farnum
Reviewed-by: Greg Farnum g...@inktank.com



Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 So if the MDS restarts and uses the same address, it does not get
 old messages.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
 
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index 859782a..282fa64 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1046,8 +1046,10 @@ void MDS::handle_mds_map(MMDSMap *m)
    oldmap->get_failed_mds_set(oldfailed);
    mdsmap->get_failed_mds_set(failed);
    for (set<int>::iterator p = failed.begin(); p != failed.end(); ++p)
 -    if (oldfailed.count(*p) == 0)
 +    if (oldfailed.count(*p) == 0) {
 +      messenger->mark_down(oldmap->get_inst(*p).addr);
        mdcache->handle_mds_failure(*p);
 +    }
  
    // or down then up?
    // did their addr/inst change?
 @@ -1055,8 +1057,10 @@ void MDS::handle_mds_map(MMDSMap *m)
    mdsmap->get_up_mds_set(up);
    for (set<int>::iterator p = up.begin(); p != up.end(); ++p) 
      if (oldmap->have_inst(*p) &&
 -        oldmap->get_inst(*p) != mdsmap->get_inst(*p))
 +        oldmap->get_inst(*p) != mdsmap->get_inst(*p)) {
 +      messenger->mark_down(oldmap->get_inst(*p).addr);
        mdcache->handle_mds_failure(*p);
 +    }
  }
  if (is_clientreplay() || is_active() || is_stopping()) {
    // did anyone stop?
 -- 
 1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Bad Blocks

2013-03-20 Thread Dyweni - Ceph-Devel

Hi All,

I would like to understand how Ceph handles and recovers from bad 
blocks.  Would someone mind explaining this to me?  It wasn't very 
apparent from the docs.


My ultimate goal is to be able to get some extra life out of my disks, 
after I detect that they may be failing.  (I'm talking about those disks 
that may have a small number of bad blocks, but otherwise seem fine and 
still perform well).


Here's what I've put together:

1. BBR Hardware
- All hard disks come with a set number of blocks that are reserved 
for remapping of failed blocks.  This is handled transparently by the 
hard disk.  The hard disk may not begin reporting failed blocks until 
all the reserved blocks are used up.


2. BBR Device Mapper Target
- Back in the EVMS days, IBM wrote a kernel module (dm-bbr) and a 
evms plugin to manage that kernel module.  I have updated that kernel 
module to work with the 3.6.11 kernel.  I have also rewritten some 
portions of the evms plugin as a standalone bash script to allow me to 
initialize the BBR layer and start the BBR device mapper target on that 
layer.  (So far it seems to run fine, but requires more testing).


3. BTRFS
- I've read that BTRFS can perform data scrubbing and repair 
damaged files from redundant copies.


4. CEPH
- I've read that CEPH can perform a deep scrub to find damaged 
copies.  I assume by the distributed nature of CEPH, it can repair the 
damaged copy from the other OSDs.


One thing I am not clear on is when BTRFS / CEPH finds damaged data, 
what do they do to prevent data from being written to the same area?


Also, I'm wondering if any parts to my layered approach are redundant / 
unnecessary...  For instance if BTRFS marks the block bad internally, 
then perhaps the BBR DM Target isn't needed...



In my testing recently, I had the following setup:
  Disk -> DM-Crypt -> DM-BBR -> BTRFS -> OSD

When the OSD hit a bad block, the DM-BBR target successfully remapped 
it to one of its own reserved blocks, BTRFS then reported data 
corruption, and the OSD daemon crashed.



--
Thanks,
Dyweni
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.

2013-03-20 Thread Greg Farnum
Oh, also: s/clientreply/clientreplay in the commit message 

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 MDS in clientreply state already starts serving requests. It also
 makes MDS::handle_mds_recovery() and MDS::recovery_done() match.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)
 
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index 282fa64..b91dcbd 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
 
    set<int> oldactive, active;
    oldmap->get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
 +  oldmap->get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
    mdsmap->get_mds_set(active, MDSMap::STATE_ACTIVE);
 +  mdsmap->get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
    for (set<int>::iterator p = active.begin(); p != active.end(); ++p) 
      if (*p != whoami &&          // not me
          oldactive.count(*p) == 0) // newly so?
 -- 
 1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/39] mds: defer eval gather locks when removing replica

2013-03-20 Thread Greg Farnum
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 Locks' states should not change between composing the cache rejoin ack
 messages and sending the message. If Locker::eval_gather() is called
 in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
 change locks' states.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 src/mds/MDCache.cc | 51 ++++++++++++++++++++++++++-----------------------
 src/mds/MDCache.h  |  8 +++++---
 2 files changed, 35 insertions(+), 24 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 19dc60b..0f6b842 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3729,6 +3729,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
   // possible response(s)
   MMDSCacheRejoin *ack = 0;      // if survivor
   set<vinodeno_t> acked_inodes;  // if survivor
 +  set<SimpleLock *> gather_locks;  // if survivor
   bool survivor = false;  // am i a survivor?
 
   if (mds->is_clientreplay() || mds->is_active() || mds->is_stopping()) {
 @@ -3851,7 +3852,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
       assert(dnl->is_primary());
 
       if (survivor && dn->is_replica(from)) 
 -        dentry_remove_replica(dn, from);  // this induces a lock gather completion
 +        dentry_remove_replica(dn, from, gather_locks);  // this induces a lock gather completion

This comment is no longer accurate :) 
       int dnonce = dn->add_replica(from);
       dout(10) << " have " << *dn << dendl;
       if (ack) 
 @@ -3864,7 +3865,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
       assert(in);
 
       if (survivor && in->is_replica(from)) 
 -        inode_remove_replica(in, from);
 +        inode_remove_replica(in, from, gather_locks);
       int inonce = in->add_replica(from);
       dout(10) << " have " << *in << dendl;
 
 @@ -3887,7 +3888,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
       CInode *in = get_inode(*p);
       assert(in);   // hmm fixme wrt stray?
       if (survivor && in->is_replica(from)) 
 -        inode_remove_replica(in, from);    // this induces a lock gather completion
 +        inode_remove_replica(in, from, gather_locks);    // this induces a lock gather completion

Same here. 

Other than those, looks good.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


       int inonce = in->add_replica(from);
       dout(10) << " have base " << *in << dendl;
 
 @@ -3909,8 +3910,11 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
       ack->add_inode_base(in);
     }
 
 -    rejoin_scour_survivor_replicas(from, ack, acked_inodes);
 +    rejoin_scour_survivor_replicas(from, ack, gather_locks, acked_inodes);
     mds->send_message(ack, weak->get_connection());
 +
 +    for (set<SimpleLock*>::iterator p = gather_locks.begin(); p != gather_locks.end(); ++p)
 +      mds->locker->eval_gather(*p);
   } else {
     // done?
     assert(rejoin_gather.count(from));
 @@ -4055,7 +4059,9 @@ bool MDCache::parallel_fetch_traverse_dir(inodeno_t ino, filepath& path,
  * all validated replicas are acked with a strong nonce, etc. if that isn't in the
  * ack, the replica dne, and we can remove it from our replica maps.
  */
 -void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set<vinodeno_t>& acked_inodes)
 +void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack,
 +                                             set<SimpleLock *>& gather_locks,
 +                                             set<vinodeno_t>& acked_inodes)
 {
   dout(10) << "rejoin_scour_survivor_replicas from mds." << from << dendl;
 
 @@ -4070,7 +4076,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set
     if (in->is_auth() &&
         in->is_replica(from) &&
         acked_inodes.count(p->second->vino()) == 0) {
 -      inode_remove_replica(in, from);
 +      inode_remove_replica(in, from, gather_locks);
       dout(10) << " rem " << *in << dendl;
     }
 
 @@ -4099,7 +4105,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set
       if (dn->is_replica(from) &&
           (ack->strong_dentries.count(dir->dirfrag()) == 0 ||
            ack->strong_dentries[dir->dirfrag()].count(string_snap_t(dn->name, dn->last)) == 0)) {
 -        dentry_remove_replica(dn, from);
 +        dentry_remove_replica(dn, from, gather_locks);
         dout(10) << " rem " << *dn << dendl;
       }
     }
 @@ -6189,6 +6195,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
     return;
   }
 
 +  set<SimpleLock *> gather_locks;
   // loop over realms
   for (map<dirfrag_t,MCacheExpire::realm>::iterator p = m->realms.begin();
        p != m->realms.end();
 @@ -6255,7 +6262,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
       // remove from our cached_by
       dout(7) << " inode expire on " << *in << " from mds." << from 
               << " cached_by was " << in->get_replicas() << dendl;
 -      inode_remove_replica(in, from);
 +      inode_remove_replica(in, from, gather_locks);
     } 
     else {
       // this is an old nonce, ignore expire.
 @@ -6332,7 +6339,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 
       if (nonce == dn->get_replica_nonce(from)) {
         dout(7) << "  dentry_expire on " << *dn << " from mds." << from << dendl;
 -        

Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Travis Rhoden
Hey folks,

We were hoping this one was fixed.  I upgraded all my nodes to the
latest bobtail branch, but still hit this today:

osdc/ObjectCacher.cc: In function 'void
ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
tid_t, int)' thread 7f650e62f700 time 2013-03-20 19:34:39.952616
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
 ceph version 0.56.3-42-ga30903c (a30903c6adaa023587d3147179d6038ad37ca520)
 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned
long, unsigned long, int)+0xd68) [0x7f651d0ada48]
 2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f651d0b460b]
 3: (Context::complete(int)+0xa) [0x7f651d06c9fa]
 4: (librbd::C_Request::finish(int)+0x85) [0x7f651d09c315]
 5: (Context::complete(int)+0xa) [0x7f651d06c9fa]
 6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f651d081387]
 7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f651c43163d]
 8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f651c49c920]
 9: (()+0x7e9a) [0x7f6519cffe9a]
 10: (clone()+0x6d) [0x7f6519a2bcbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Is this occurring in librbd caching?  If so, I could disable it for the
time being.

First saw this mentioned on-list here:
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13577

Will be happy to provide anything I can for this one -- definitely
critical for my use case.  It happens with about 10% of the VMs I
create.  Always within the first 60 seconds of the VM booting and
being network accessible.

 - Travis
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Stefan Priebe

Hi,

strange i've never seen this. Which qemu version?

Stefan
Am 20.03.2013 20:49, schrieb Travis Rhoden:

Hey folks,

We were hoping this one was fixed.  I upgraded all my nodes to the
latest bobtail branch, but still hit this today:

osdc/ObjectCacher.cc: In function 'void
ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
tid_t, int)' thread 7f650e62f700 time 2013-03-20 19:34:39.952616
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
  ceph version 0.56.3-42-ga30903c (a30903c6adaa023587d3147179d6038ad37ca520)
  1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned
long, unsigned long, int)+0xd68) [0x7f651d0ada48]
  2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f651d0b460b]
  3: (Context::complete(int)+0xa) [0x7f651d06c9fa]
  4: (librbd::C_Request::finish(int)+0x85) [0x7f651d09c315]
  5: (Context::complete(int)+0xa) [0x7f651d06c9fa]
  6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f651d081387]
  7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f651c43163d]
  8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f651c49c920]
  9: (()+0x7e9a) [0x7f6519cffe9a]
  10: (clone()+0x6d) [0x7f6519a2bcbd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Is this occurring in librbd caching?  If so, I could disable it for the
time being.

First saw this mentioned on-list here:
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13577

Will be happy to provide anything I can for this one -- definitely
critical for my use case.  It happens with about 10% of the VMs I
create.  Always within the first 60 seconds of the VM booting and
being network accessible.

  - Travis
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Campbell, Bill
Travis, are you using format 1 or 2 images?  I've seen the same behavior on 
format 2 images using cloned snapshots, but haven't run into this issue on any 
normal format 2 images.

- Original Message -
From: Travis Rhoden trho...@gmail.com
To: ceph-devel ceph-devel@vger.kernel.org
Sent: Wednesday, March 20, 2013 3:49:23 PM
Subject: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

Hey folks,

We were hoping this one was fixed.  I upgraded all my nodes to the
latest bobtail branch, but still hit this today:

osdc/ObjectCacher.cc: In function 'void
ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
tid_t, int)' thread 7f650e62f700 time 2013-03-20 19:34:39.952616
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
 ceph version 0.56.3-42-ga30903c (a30903c6adaa023587d3147179d6038ad37ca520)
 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned
long, unsigned long, int)+0xd68) [0x7f651d0ada48]
 2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f651d0b460b]
 3: (Context::complete(int)+0xa) [0x7f651d06c9fa]
 4: (librbd::C_Request::finish(int)+0x85) [0x7f651d09c315]
 5: (Context::complete(int)+0xa) [0x7f651d06c9fa]
 6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f651d081387]
 7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f651c43163d]
 8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f651c49c920]
 9: (()+0x7e9a) [0x7f6519cffe9a]
 10: (clone()+0x6d) [0x7f6519a2bcbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Is this occurring in librbd caching?  If so, I could disable it for the
time being.

First saw this mentioned on-list here:
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13577

Will be happy to provide anything I can for this one -- definitely
critical for my use case.  It happens with about 10% of the VMs I
create.  Always within the first 60 seconds of the VM booting and
being network accessible.

 - Travis
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NOTICE: Protect the information in this message in accordance with the 
company's security policies. If you received this message in error, immediately 
notify the sender and destroy all copies.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Travis Rhoden
Hello.

 Travis, are you using format 1 or 2 images?  I've seen the same behavior on 
 format 2 images using cloned snapshots, but haven't run into this issue on 
 any normal format 2 images.

In this case, they are format 2. And they are from cloned snapshots.
Exactly like the following:

# rbd ls -l -p volumes
NAME                                          SIZE PARENT                                            FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8  5120M images/b8bdda90-664b-4906-86d6-dd33735441f2@snap    2

I'm doing an OpenStack boot-from-volume setup.
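
For reference, a clone like that is normally produced with a sequence along
these lines (the pool and image names below are placeholders, not the actual
ones from this deployment):

  # rbd snap create images/base-image@snap
  # rbd snap protect images/base-image@snap
  # rbd clone images/base-image@snap volumes/volume-xyz

so every such volume keeps a parent link back to the protected snapshot.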

 strange i've never seen this. Which qemu version?

# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard

that's coming from Ubuntu 12.04 apt repos.

 - Travis

On Wed, Mar 20, 2013 at 3:53 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi,

 strange i've never seen this. Which qemu version?

 Stefan
 Am 20.03.2013 20:49, schrieb Travis Rhoden:

 Hey folks,

 We were hoping this one was fixed.  I upgraded all my nodes to the
 latest bobtail branch, but still hit this today:

 osdc/ObjectCacher.cc: In function 'void
 ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
 tid_t, int)' thread 7f650e62f700 time 2013-03-20 19:34:39.952616
 osdc/ObjectCacher.cc: 834: FAILED assert(ob-last_commit_tid  tid)
   ceph version 0.56.3-42-ga30903c
 (a30903c6adaa023587d3147179d6038ad37ca520)
   1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned
 long, unsigned long, int)+0xd68) [0x7f651d0ada48]
   2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f651d0b460b]
   3: (Context::complete(int)+0xa) [0x7f651d06c9fa]
   4: (librbd::C_Request::finish(int)+0x85) [0x7f651d09c315]
   5: (Context::complete(int)+0xa) [0x7f651d06c9fa]
   6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f651d081387]
   7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f651c43163d]
   8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f651c49c920]
   9: (()+0x7e9a) [0x7f6519cffe9a]
   10: (clone()+0x6d) [0x7f6519a2bcbd]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 Is this occurring in librbd caching?  If so, I could disable it for the
 time being.

 First saw this mentioned on-list here:
 http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13577

 Will be happy to provide anything I can for this one -- definitely
 critical for my use case.  It happens with about 10% of the VMs I
 create.  Always within the first 60 seconds of the VM booting and
 being network accessible.

   - Travis


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Stefan Priebe

Hi,


In this case, they are format 2. And they are from cloned snapshots.
Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE
PARENT   FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

I'm doing an OpenStack boot-from-volume setup.


OK i've never used cloned snapshots so maybe this is the reason.


strange i've never seen this. Which qemu version?


# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard

that's coming from Ubuntu 12.04 apt repos.


Maybe you should try qemu 1.4; there are a LOT of bugfixes. qemu-kvm does 
not exist anymore; it was merged into qemu with 1.3 or 1.4.


Stefan


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Travis Rhoden
On Wed, Mar 20, 2013 at 4:14 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi,


 In this case, they are format 2. And they are from cloned snapshots.
 Exactly like the following:

 # rbd ls -l -p volumes
 NAME SIZE
 PARENT   FMT PROT LOCK
 volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
 images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

 I'm doing an OpenStack boot-from-volume setup.


 OK i've never used cloned snapshots so maybe this is the reason.


 strange i've never seen this. Which qemu version?


 # qemu-x86_64 -version
 qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice
 Bellard

 that's coming from Ubuntu 12.04 apt repos.


 maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does not
 exist anymore it was merged into qemu with 1.3 or 1.4.

Since the crash is in librbd, would an update of qemu help anything?

 Stefan


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Josh Durgin

On 03/20/2013 01:14 PM, Stefan Priebe wrote:

Hi,


In this case, they are format 2. And they are from cloned snapshots.
Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE
PARENT   FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

I'm doing an OpenStack boot-from-volume setup.


OK i've never used cloned snapshots so maybe this is the reason.


strange i've never seen this. Which qemu version?


# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
Fabrice Bellard

that's coming from Ubuntu 12.04 apt repos.


maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does
not exist anymore it was merged into qemu with 1.3 or 1.4.


This particular problem won't be solved by upgrading qemu. It's a ceph
bug. Disabling caching would work around the issue.

Travis, could you get a log from qemu of this happening with:

debug ms = 20
debug objectcacher = 20
debug rbd = 20
log file = /path/writeable/by/qemu

From those we can tell whether the issue is on the client side at least,
and hopefully what's causing it.
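
As a rough sketch, those settings might sit together in the client's ceph.conf
like this (the log path is only an example; it just needs to be writable by
the qemu process):

  [client]
      debug ms = 20
      debug objectcacher = 20
      debug rbd = 20
      log file = /var/log/ceph/qemu-rbd.log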

Thanks!
Josh


Re: CephFS: stable release?

2013-03-20 Thread Greg Farnum
On Wednesday, March 20, 2013 at 1:22 PM, Pascal wrote:
 On Sun, 24 Feb 2013 14:41:27 -0800, Gregory Farnum g...@inktank.com wrote:
  
  On Saturday, February 23, 2013 at 2:14 AM, Gandalf Corvotempesta
  wrote:
   Hi all,
   do you have an ETA about a stable realease (or something usable in
   production) for CephFS?
   
   
   
  Short answer: no.  
   
  However, we do have a team of people working on the FS again as of a
  month or so ago. We're doing a lot of stabilization (bug fixes), code
  cleanups, and utility work in the coming months; we can estimate the
  utility and cleanup work but not the bugs that we'll find, and those
  are our main concern right now. Depending on how the next couple
  months of QA and bug fixing go we should be able to publicize real
  estimates soonish. -Greg
   
  
  
  
 Hello Gregory,
  
 is your response still up-to-date?  
  
 The FAQ (http://ceph.com/docs/master/faq/) says: Ceph’s object store (RADOS) 
 is production ready.
  


We'll put out some blog posts and emails when we have anything more to report. 
:) RADOS is ready, but CephFS is a whole separate layer above it.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Josh Durgin

On 03/20/2013 01:19 PM, Josh Durgin wrote:

On 03/20/2013 01:14 PM, Stefan Priebe wrote:

Hi,


In this case, they are format 2. And they are from cloned snapshots.
Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE
PARENT   FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

I'm doing an OpenStack boot-from-volume setup.


OK i've never used cloned snapshots so maybe this is the reason.


strange i've never seen this. Which qemu version?


# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
Fabrice Bellard

that's coming from Ubuntu 12.04 apt repos.


maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does
not exist anymore it was merged into qemu with 1.3 or 1.4.


This particular problem won't be solved by upgrading qemu. It's a ceph
bug. Disabling caching would work around the issue.

Travis, could you get a log from qemu of this happening with:

debug ms = 20
debug objectcacher = 20
debug rbd = 20
log file = /path/writeable/by/qemu


If it doesn't reproduce with those settings, try changing debug ms to 1
instead of 20.


 From those we can tell whether the issue is on the client side at least,
and hopefully what's causing it.

Thanks!
Josh




Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Travis Rhoden
Thanks Josh.  I will respond when I have something useful!

On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 03/20/2013 01:19 PM, Josh Durgin wrote:

 On 03/20/2013 01:14 PM, Stefan Priebe wrote:

 Hi,

 In this case, they are format 2. And they are from cloned snapshots.
 Exactly like the following:

 # rbd ls -l -p volumes
 NAME SIZE
 PARENT   FMT PROT LOCK
 volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
 images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

 I'm doing an OpenStack boot-from-volume setup.


 OK i've never used cloned snapshots so maybe this is the reason.

 strange i've never seen this. Which qemu version?


 # qemu-x86_64 -version
 qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
 Fabrice Bellard

 that's coming from Ubuntu 12.04 apt repos.


 maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does
 not exist anymore it was merged into qemu with 1.3 or 1.4.


 This particular problem won't be solved by upgrading qemu. It's a ceph
 bug. Disabling caching would work around the issue.

 Travis, could you get a log from qemu of this happening with:

 debug ms = 20
 debug objectcacher = 20
 debug rbd = 20
 log file = /path/writeable/by/qemu


 If it doesn't reproduce with those settings, try changing debug ms to 1
 instead of 20.


  From those we can tell whether the issue is on the client side at least,
 and hopefully what's causing it.

 Thanks!
 Josh




Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request

2013-03-20 Thread Greg Farnum
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 Replicated objects need to be added into the cache immediately
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
Why do we need to add them right away? Shouldn't we have a journaled replica if 
we need it?
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com
 ---
 src/mds/MDCache.cc | 12 
 src/mds/MDCache.h | 2 +-
 src/mds/MDS.cc | 6 +++---
 src/mds/Server.cc | 55 +++---
 4 files changed, 56 insertions(+), 19 deletions(-)
 
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 0f6b842..b668842 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context 
 *fin, bufferlist bl, int r)
 
 /*  */
 
 +int MDCache::get_num_client_requests()
 +{
 + int count = 0;
 + for (hash_mapmetareqid_t, MDRequest*::iterator p = 
 active_requests.begin();
 + p != active_requests.end();
 + ++p) {
 + if (p-second-reqid.name.is_client()  !p-second-is_slave())
 + count++;
 + }
 + return count;
 +}
 +
 /* This function takes over the reference to the passed Message */
 MDRequest *MDCache::request_start(MClientRequest *req)
 {
 diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
 index a9f05c6..4634121 100644
 --- a/src/mds/MDCache.h
 +++ b/src/mds/MDCache.h
 @@ -240,7 +240,7 @@ protected:
 hash_mapmetareqid_t, MDRequest* active_requests; 
 
 public:
 - int get_num_active_requests() { return active_requests.size(); }
 + int get_num_client_requests();
 
 MDRequest* request_start(MClientRequest *req);
 MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index b91dcbd..e99eecc 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
 mdcache-is_open() 
 replay_queue.empty() 
 want_state == MDSMap::STATE_CLIENTREPLAY) {
 - dout(10)   still have   mdcache-get_num_active_requests()
 -   active replay requests  dendl;
 - if (mdcache-get_num_active_requests() == 0)
 + int num_requests = mdcache-get_num_client_requests();
 + dout(10)   still have   num_requests   active replay requests  
 dendl;
 + if (num_requests == 0)
 clientreplay_done();
 }
 
 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index 4c4c86b..8e89e4c 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
 (m-get_type() == CEPH_MSG_CLIENT_REQUEST 
 (static_castMClientRequest*(m))-is_replay( {
 // replaying!
 - } else if (mds-is_clientreplay()  m-get_type() == MSG_MDS_SLAVE_REQUEST 
 
 - ((static_castMMDSSlaveRequest*(m))-is_reply() ||
 - !mds-mdsmap-is_active(m-get_source().num( {
 - // slave reply or the master is also in the clientreplay stage
 + } else if (m-get_type() == MSG_MDS_SLAVE_REQUEST) {
 + // handle_slave_request() will wait if necessary
 } else {
 dout(3)  not active yet, waiting  dendl;
 mds-wait_for_active(new C_MDS_RetryMessage(mds, m));
 @@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
 if (m-is_reply())
 return handle_slave_request_reply(m);
 
 + CDentry *straydn = NULL;
 + if (m-stray.length()  0) {
 + straydn = mdcache-add_replica_stray(m-stray, from);
 + assert(straydn);
 + m-stray.clear();
 + }
 +
 // am i a new slave?
 MDRequest *mdr = NULL;
 if (mdcache-have_request(m-get_reqid())) {
 @@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
 m-put();
 return;
 }
 - mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), 
 m-get_source().num());
 + mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), from);
 }
 assert(mdr-slave_request == 0); // only one at a time, please! 
 +
 + if (straydn) {
 + mdr-pin(straydn);
 + mdr-straydn = straydn;
 + }
 +
 + if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 + dout(3)  not clientreplay|active yet, waiting  dendl;
 + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 + return;
 + } else if (mds-is_clientreplay()  !mds-mdsmap-is_clientreplay(from) 
 + mdr-locks.empty()) {
 + dout(3)  not active yet, waiting  dendl;
 + mds-wait_for_active(new C_MDS_RetryMessage(mds, m));
 + return;
 + }
 +
 mdr-slave_request = m;
 
 dispatch_slave_request(mdr);
 @@ -1339,6 +1361,12 @@ void 
 Server::handle_slave_request_reply(MMDSSlaveRequest *m)
 {
 int from = m-get_source().num();
 
 + if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 + dout(3)  not clientreplay|active yet, waiting  dendl;
 + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 + return;
 + }
 +
 if (m-get_op() == MMDSSlaveRequest::OP_COMMITTED) {
 metareqid_t r = m-get_reqid();
 mds-mdcache-committed_master_slave(r, from);
 @@ -5138,10 +5166,8 @@ void Server::handle_slave_rmdir_prep(MDRequest *mdr)
 dout(10)   dn   *dn  dendl;
 mdr-pin(dn);
 
 - assert(mdr-slave_request-stray.length()  

Re: [PATCH 12/39] mds: compose and send resolve messages in batch

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Resolve messages for all MDS are the same, so we can compose and
 send them in batch.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 181 
 +
  src/mds/MDCache.h  |  11 ++--
  2 files changed, 93 insertions(+), 99 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index b668842..c455a20 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -2432,10 +2432,6 @@ void MDCache::resolve_start()
  if (rootdir)
adjust_subtree_auth(rootdir, CDIR_AUTH_UNKNOWN);
}
 -
 -  for (mapint, mapmetareqid_t, MDSlaveUpdate* ::iterator p = 
 uncommitted_slave_updates.begin();
 -   p != uncommitted_slave_updates.end(); ++p)
 -need_resolve_ack.insert(p-first);
  }

  void MDCache::send_resolves()
 @@ -2444,9 +2440,10 @@ void MDCache::send_resolves()
got_resolve.clear();
other_ambiguous_imports.clear();

 -  if (!need_resolve_ack.empty()) {
 -for (setint::iterator p = need_resolve_ack.begin(); p != 
 need_resolve_ack.end(); ++p)
 -  send_slave_resolve(*p);
 +  send_slave_resolves();
 +  if (!resolve_ack_gather.empty()) {
 +dout(10)  send_resolves still waiting for resolve ack from (
 +  need_resolve_ack  )  dendl;
  return;
}
if (!need_resolve_rollback.empty()) {
 @@ -2454,95 +2451,74 @@ void MDCache::send_resolves()
   need_resolve_rollback  )  dendl;
  return;
}
 -  assert(uncommitted_slave_updates.empty());
 -  for (setint::iterator p = recovery_set.begin(); p != recovery_set.end(); 
 ++p) {
 -int who = *p;
 -if (who == mds-whoami)
 -  continue;
 -if (migrator-is_importing() ||
 -   migrator-is_exporting())
 -  send_resolve_later(who);
 -else
 -  send_resolve_now(who);
 -  }
 -}
 -
 -void MDCache::send_resolve_later(int who)
 -{
 -  dout(10)  send_resolve_later to mds.  who  dendl;
 -  wants_resolve.insert(who);
 +  send_subtree_resolves();
  }

 -void MDCache::maybe_send_pending_resolves()
 +void MDCache::send_slave_resolves()
  {
 -  if (wants_resolve.empty())
 -return;  // nothing to send.
 -
 -  // only if it's appropriate!
 -  if (migrator-is_exporting() ||
 -  migrator-is_importing()) {
 -dout(7)  maybe_send_pending_resolves waiting, imports/exports still 
 in progress  dendl;
 -migrator-show_importing();
 -migrator-show_exporting();
 -return;  // not now
 -  }
 -
 -  // ok, send them.
 -  for (setint::iterator p = wants_resolve.begin();
 -   p != wants_resolve.end();
 -   ++p)
 -send_resolve_now(*p);
 -  wants_resolve.clear();
 -}
 +  dout(10)  send_slave_resolves  dendl;

 +  mapint, MMDSResolve* resolves;

 -class C_MDC_SendResolve : public Context {
 -  MDCache *mdc;
 -  int who;
 -public:
 -  C_MDC_SendResolve(MDCache *c, int w) : mdc(c), who(w) { }
 -  void finish(int r) {
 -mdc-send_resolve_now(who);
 -  }
 -};
 -
 -void MDCache::send_slave_resolve(int who)
 -{
 -  dout(10)  send_slave_resolve to mds.  who  dendl;
 -  MMDSResolve *m = new MMDSResolve;
 -
 -  // list prepare requests lacking a commit
 -  // [active survivor]
 -  for (hash_mapmetareqid_t, MDRequest*::iterator p = 
 active_requests.begin();
 -  p != active_requests.end();
 -  ++p) {
 -if (p-second-is_slave()  p-second-slave_to_mds == who) {
 -  dout(10)   including uncommitted   *p-second  dendl;
 -  m-add_slave_request(p-first);
 +  if (mds-is_resolve()) {
 +for (mapint, mapmetareqid_t, MDSlaveUpdate* ::iterator p = 
 uncommitted_slave_updates.begin();
 +p != uncommitted_slave_updates.end();
 +++p) {
 +  resolves[p-first] = new MMDSResolve;
 +  for (mapmetareqid_t, MDSlaveUpdate*::iterator q = p-second.begin();
 +  q != p-second.end();
 +  ++q) {
 +   dout(10)   including uncommitted   q-first  dendl;
 +   resolves[p-first]-add_slave_request(q-first);
 +  }
  }
 -  }
 -  // [resolving]
 -  if (uncommitted_slave_updates.count(who) 
 -  !uncommitted_slave_updates[who].empty()) {
 -for (mapmetareqid_t, MDSlaveUpdate*::iterator p = 
 uncommitted_slave_updates[who].begin();
 -   p != uncommitted_slave_updates[who].end();
 -   ++p) {
 -  dout(10)   including uncommitted   p-first  dendl;
 -  m-add_slave_request(p-first);
 +  } else {
 +setint resolve_set;
 +mds-mdsmap-get_mds_set(resolve_set, MDSMap::STATE_RESOLVE);
 +for (hash_mapmetareqid_t, MDRequest*::iterator p = 
 active_requests.begin();
 +p != active_requests.end();
 +++p) {
 +  if (!p-second-is_slave() || !p-second-slave_did_prepare())
 +   continue;
 +  int master = p-second-slave_to_mds;
 +  if (resolve_set.count(master)) {
 +   dout(10)   including uncommitted   

Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Travis Rhoden
Didn't take long to re-create with the detailed debugging (ms = 20).
I'm sending Josh a link to the gzip'd log off-list; I'm not sure if
the log will contain any CephX keys or anything like that.

On Wed, Mar 20, 2013 at 4:39 PM, Travis Rhoden trho...@gmail.com wrote:
 Thanks Josh.  I will respond when I have something useful!

 On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 03/20/2013 01:19 PM, Josh Durgin wrote:

 On 03/20/2013 01:14 PM, Stefan Priebe wrote:

 Hi,

 In this case, they are format 2. And they are from cloned snapshots.
 Exactly like the following:

 # rbd ls -l -p volumes
 NAME SIZE
 PARENT   FMT PROT LOCK
 volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
 images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

 I'm doing an OpenStack boot-from-volume setup.


 OK i've never used cloned snapshots so maybe this is the reason.

 strange i've never seen this. Which qemu version?


 # qemu-x86_64 -version
 qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
 Fabrice Bellard

 that's coming from Ubuntu 12.04 apt repos.


 maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does
 not exist anymore it was merged into qemu with 1.3 or 1.4.


 This particular problem won't be solved by upgrading qemu. It's a ceph
 bug. Disabling caching would work around the issue.

 Travis, could you get a log from qemu of this happening with:

 debug ms = 20
 debug objectcacher = 20
 debug rbd = 20
 log file = /path/writeable/by/qemu


 If it doesn't reproduce with those settings, try changing debug ms to 1
 instead of 20.


  From those we can tell whether the issue is on the client side at least,
 and hopefully what's causing it.

 Thanks!
 Josh




Re: [PATCH 13/39] mds: don't send resolve message between active MDS

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 When the MDS cluster is resolving, the current behavior is to send the subtree
 resolve message to all other MDS and wait for all other MDS' resolve messages.
 The problem is that an active MDS can have a different subtree map due to rename.
 Besides, gathering active MDS' resolve messages is also racy. The only
 function of these messages is to disambiguate other MDS' imports. We can
 replace it with an import finish notification.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc  | 12 +---
  src/mds/Migrator.cc | 25 +++--
  src/mds/Migrator.h  |  3 ++-
  3 files changed, 34 insertions(+), 6 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index c455a20..73c1d59 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -2517,7 +2517,8 @@ void MDCache::send_subtree_resolves()
 ++p) {
  if (*p == mds-whoami)
continue;
 -resolves[*p] = new MMDSResolve;
 +if (mds-is_resolve() || mds-mdsmap-is_resolve(*p))
 +  resolves[*p] = new MMDSResolve;
}

// known
 @@ -2837,7 +2838,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
   migrator-import_reverse(dir);
 } else {
   dout(7)  ambiguous import succeeded on   *dir  dendl;
 - migrator-import_finish(dir);
 + migrator-import_finish(dir, true);
 }
 my_ambiguous_imports.erase(p);  // no longer ambiguous.
}
 @@ -3432,7 +3433,12 @@ void MDCache::rejoin_send_rejoins()
 ++p) {
  CDir *dir = p-first;
  assert(dir-is_subtree_root());
 -assert(!dir-is_ambiguous_dir_auth());
 +if (dir-is_ambiguous_dir_auth()) {
 +  // exporter is recovering, importer is survivor.

The importer has to be the MDS this code is running on, right?

 +  assert(rejoins.count(dir-authority().first));
 +  assert(!rejoins.count(dir-authority().second));
 +  continue;
 +}

  // my subtree?
  if (dir-is_auth())
 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 5e53803..833df12 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -2088,6 +2088,23 @@ void Migrator::import_reverse(CDir *dir)
}
  }

 +void Migrator::import_notify_finish(CDir *dir, setCDir* bounds)
 +{
 +  dout(7)  import_notify_finish   *dir  dendl;
 +
 +  for (setint::iterator p = import_bystanders[dir].begin();
 +   p != import_bystanders[dir].end();
 +   ++p) {
 +MExportDirNotify *notify =
 +  new MExportDirNotify(dir-dirfrag(), false,
 +  pairint,int(import_peer[dir-dirfrag()], 
 mds-get_nodeid()),
 +  pairint,int(mds-get_nodeid(), 
 CDIR_AUTH_UNKNOWN));

I don't think this is quite right — we're notifying them that we've
just finished importing data from somebody, right? And so we know that
we're the auth node...

 +for (setCDir*::iterator i = bounds.begin(); i != bounds.end(); i++)
 +  notify-get_bounds().push_back((*i)-dirfrag());
 +mds-send_message_mds(notify, *p);
 +  }
 +}
 +
  void Migrator::import_notify_abort(CDir *dir, setCDir* bounds)
  {
dout(7)  import_notify_abort   *dir  dendl;
 @@ -2183,11 +2200,11 @@ void Migrator::handle_export_finish(MExportDirFinish 
 *m)
CDir *dir = cache-get_dirfrag(m-get_dirfrag());
assert(dir);
dout(7)  handle_export_finish on   *dir  dendl;
 -  import_finish(dir);
 +  import_finish(dir, false);
m-put();
  }

 -void Migrator::import_finish(CDir *dir)
 +void Migrator::import_finish(CDir *dir, bool notify)
  {
dout(7)  import_finish on   *dir  dendl;

 @@ -2205,6 +,10 @@ void Migrator::import_finish(CDir *dir)
// remove pins
setCDir* bounds;
cache-get_subtree_bounds(dir, bounds);
 +
 +  if (notify)
 +import_notify_finish(dir, bounds);
 +
import_remove_pins(dir, bounds);

mapCInode*, mapclient_t,Capability::Export  cap_imports;
 diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
 index 7988f32..2889a74 100644
 --- a/src/mds/Migrator.h
 +++ b/src/mds/Migrator.h
 @@ -273,12 +273,13 @@ protected:
void import_reverse_unfreeze(CDir *dir);
void import_reverse_final(CDir *dir);
void import_notify_abort(CDir *dir, setCDir* bounds);
 +  void import_notify_finish(CDir *dir, setCDir* bounds);
void import_logged_start(dirfrag_t df, CDir *dir, int from,
mapclient_t,entity_inst_t imported_client_map,
mapclient_t,uint64_t sseqmap);
void handle_export_finish(MExportDirFinish *m);
  public:
 -  void import_finish(CDir *dir);
 +  void import_finish(CDir *dir, bool notify);
  protected:

void handle_export_caps(MExportCaps *m);
 --
 1.7.11.7



Re: [PATCH 14/39] mds: set resolve/rejoin gather MDS set in advance

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 For active MDS, it may receive resolve/resolve message before receiving

resolve/rejoin, maybe?
Other than that,
Reviewed-by: Greg Farnum g...@inktank.com

 the mdsmap message that claims the MDS cluster is in resolving/rejoining
 state. So instead of setting the gather MDS set when receiving the mdsmap,
 set them in advance when detecting an MDS failure.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 41 +++--
  src/mds/MDCache.h  |  5 ++---
  2 files changed, 21 insertions(+), 25 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 73c1d59..69db1dd 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -2432,18 +2432,17 @@ void MDCache::resolve_start()
  if (rootdir)
adjust_subtree_auth(rootdir, CDIR_AUTH_UNKNOWN);
}
 +  resolve_gather = recovery_set;
 +  resolve_gather.erase(mds-get_nodeid());
 +  rejoin_gather = resolve_gather;
  }

  void MDCache::send_resolves()
  {
 -  // reset resolve state
 -  got_resolve.clear();
 -  other_ambiguous_imports.clear();
 -
send_slave_resolves();
if (!resolve_ack_gather.empty()) {
  dout(10)  send_resolves still waiting for resolve ack from (
 -  need_resolve_ack  )  dendl;
 + resolve_ack_gather  )  dendl;
  return;
}
if (!need_resolve_rollback.empty()) {
 @@ -2495,7 +2494,7 @@ void MDCache::send_slave_resolves()
 ++p) {
  dout(10)  sending slave resolve to mds.  p-first  dendl;
  mds-send_message_mds(p-second, p-first);
 -need_resolve_ack.insert(p-first);
 +resolve_ack_gather.insert(p-first);
}
  }

 @@ -2598,16 +2597,15 @@ void MDCache::handle_mds_failure(int who)
recovery_set.erase(mds-get_nodeid());
dout(1)  handle_mds_failure mds.  who   : recovery peers are   
 recovery_set  dendl;

 -  // adjust my recovery lists
 -  wants_resolve.erase(who);   // MDS will ask again
 -  got_resolve.erase(who); // i'll get another.
 +  resolve_gather.insert(who);
discard_delayed_resolve(who);

 +  rejoin_gather.insert(who);
rejoin_sent.erase(who);// i need to send another
rejoin_ack_gather.erase(who);  // i'll need/get another.

 -  dout(10)   wants_resolve   wants_resolve  dendl;
 -  dout(10)   got_resolve   got_resolve  dendl;
 +  dout(10)   resolve_gather   resolve_gather  dendl;
 +  dout(10)   resolve_ack_gather   resolve_ack_gather  dendl;
dout(10)   rejoin_sent   rejoin_sent  dendl;
dout(10)   rejoin_gather   rejoin_gather  dendl;
dout(10)   rejoin_ack_gather   rejoin_ack_gather  dendl;
 @@ -2788,7 +2786,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
  return;
}

 -  if (!need_resolve_ack.empty() || !need_resolve_rollback.empty()) {
 +  if (!resolve_ack_gather.empty() || !need_resolve_rollback.empty()) {
  dout(10)  delay processing subtree resolve  dendl;
  discard_delayed_resolve(from);
  delayed_resolve[from] = m;
 @@ -2875,7 +2873,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
}

// did i get them all?
 -  got_resolve.insert(from);
 +  resolve_gather.erase(from);

maybe_resolve_finish();

 @@ -2901,12 +2899,12 @@ void MDCache::discard_delayed_resolve(int who)

  void MDCache::maybe_resolve_finish()
  {
 -  assert(need_resolve_ack.empty());
 +  assert(resolve_ack_gather.empty());
assert(need_resolve_rollback.empty());

 -  if (got_resolve != recovery_set) {
 -dout(10)  maybe_resolve_finish still waiting for more resolves, got (
 - got_resolve  ), need (  recovery_set  )  dendl;
 +  if (!resolve_gather.empty()) {
 +dout(10)  maybe_resolve_finish still waiting for resolves (
 + resolve_gather  )  dendl;
  return;
} else {
  dout(10)  maybe_resolve_finish got all resolves+resolve_acks, done. 
  dendl;
 @@ -2926,7 +2924,7 @@ void MDCache::handle_resolve_ack(MMDSResolveAck *ack)
dout(10)  handle_resolve_ack   *ack   from   ack-get_source() 
  dendl;
int from = ack-get_source().num();

 -  if (!need_resolve_ack.count(from)) {
 +  if (!resolve_ack_gather.count(from)) {
  ack-put();
  return;
}
 @@ -3001,8 +2999,8 @@ void MDCache::handle_resolve_ack(MMDSResolveAck *ack)
assert(p-second-slave_to_mds != from);
}

 -  need_resolve_ack.erase(from);
 -  if (need_resolve_ack.empty()  need_resolve_rollback.empty()) {
 +  resolve_ack_gather.erase(from);
 +  if (resolve_ack_gather.empty()  need_resolve_rollback.empty()) {
  send_subtree_resolves();
  process_delayed_resolve();
}
 @@ -3069,7 +3067,7 @@ void MDCache::finish_rollback(metareqid_t reqid) {
if (mds-is_resolve())
  finish_uncommitted_slave_update(reqid, need_resolve_rollback[reqid]);
need_resolve_rollback.erase(reqid);
 -  if (need_resolve_ack.empty()  need_resolve_rollback.empty()) {
 +  if (resolve_ack_gather.empty()  need_resolve_rollback.empty()) {
 

Re: [PATCH 15/39] mds: don't send MDentry{Link,Unlink} before receiving cache rejoin

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it
 receives the cache rejoin message. The function will remove the objects
 replicated by MDentry{Link,Unlink} from replica map.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 69db1dd..f102205 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3893,6 +3893,8 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
  }
}

 +  assert(rejoin_gather.count(from));
 +  rejoin_gather.erase(from);
if (survivor) {
  // survivor.  do everything now.
  for (mapinodeno_t,MMDSCacheRejoin::lock_bls::iterator p = 
 weak-inode_scatterlocks.begin();
 @@ -3911,8 +3913,6 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
mds-locker-eval_gather(*p);
} else {
  // done?
 -assert(rejoin_gather.count(from));
 -rejoin_gather.erase(from);
  if (rejoin_gather.empty()) {
rejoin_gather_finish();
  } else {
 @@ -9582,7 +9582,9 @@ void MDCache::send_dentry_link(CDentry *dn)
for (mapint,int::iterator p = dn-replicas_begin();
 p != dn-replicas_end();
 ++p) {
 -if (mds-mdsmap-get_state(p-first)  MDSMap::STATE_REJOIN)
 +if (mds-mdsmap-get_state(p-first)  MDSMap::STATE_REJOIN ||
 +   (mds-mdsmap-get_state(p-first) == MDSMap::STATE_REJOIN 
 +rejoin_gather.count(p-first)))
continue;
  CDentry::linkage_t *dnl = dn-get_linkage();
  MDentryLink *m = new MDentryLink(subtree-dirfrag(), 
 dn-get_dir()-dirfrag(),
 @@ -9668,6 +9670,11 @@ void MDCache::send_dentry_unlink(CDentry *dn, CDentry 
 *straydn, MDRequest *mdr)
  if (mdr  mdr-more()-witnessed.count(it-first))
continue;

 +if (mds-mdsmap-get_state(it-first)  MDSMap::STATE_REJOIN ||
 +   (mds-mdsmap-get_state(it-first) == MDSMap::STATE_REJOIN 
 +rejoin_gather.count(it-first)))
 +  continue;
 +
  MDentryUnlink *unlink = new MDentryUnlink(dn-get_dir()-dirfrag(), 
 dn-name);
  if (straydn)
replicate_stray(straydn, it-first, unlink-straybl);
 --
 1.7.11.7



Re: [PATCH 10/39] mds: unify slave request waiting

2013-03-20 Thread Sage Weil
Much simpler!

Reviewed-by: Sage Weil s...@inktank.com

On Sun, 17 Mar 2013, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 When requesting remote xlock or remote wrlock, the master request is
 put into lock object's REMOTEXLOCK waiting queue. The problem is that
 remote wrlock's target can be different from lock's auth MDS. When
 the lock's auth MDS recovers, MDCache::handle_mds_recovery() may wake
 incorrect request. So just unify slave request waiting, dispatch the
 master request when receiving slave request reply.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Locker.cc | 49 ++---
  src/mds/Server.cc | 12 ++--
  2 files changed, 32 insertions(+), 29 deletions(-)
 
 diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
 index d06a9cc..0055a19 100644
 --- a/src/mds/Locker.cc
 +++ b/src/mds/Locker.cc
 @@ -544,8 +544,6 @@ void Locker::cancel_locking(Mutation *mut, setCInode* 
 *pneed_issue)
if (need_issue)
   pneed_issue-insert(static_castCInode *(lock-get_parent()));
  }
 -  } else {
 -lock-finish_waiters(SimpleLock::WAIT_REMOTEXLOCK);
}
mut-finish_locking(lock);
  }
 @@ -1326,18 +1324,16 @@ void Locker::remote_wrlock_start(SimpleLock *lock, 
 int target, MDRequest *mut)
}
  
// send lock request
 -  if (!lock-is_waiter_for(SimpleLock::WAIT_REMOTEXLOCK)) {
 -mut-start_locking(lock, target);
 -mut-more()-slaves.insert(target);
 -MMDSSlaveRequest *r = new MMDSSlaveRequest(mut-reqid, mut-attempt,
 -MMDSSlaveRequest::OP_WRLOCK);
 -r-set_lock_type(lock-get_type());
 -lock-get_parent()-set_object_info(r-get_object_info());
 -mds-send_message_mds(r, target);
 -  }
 -  
 -  // wait
 -  lock-add_waiter(SimpleLock::WAIT_REMOTEXLOCK, new 
 C_MDS_RetryRequest(mdcache, mut));
 +  mut-start_locking(lock, target);
 +  mut-more()-slaves.insert(target);
 +  MMDSSlaveRequest *r = new MMDSSlaveRequest(mut-reqid, mut-attempt,
 +  MMDSSlaveRequest::OP_WRLOCK);
 +  r-set_lock_type(lock-get_type());
 +  lock-get_parent()-set_object_info(r-get_object_info());
 +  mds-send_message_mds(r, target);
 +
 +  assert(mut-more()-waiting_on_slave.count(target) == 0);
 +  mut-more()-waiting_on_slave.insert(target);
  }
  
  void Locker::remote_wrlock_finish(SimpleLock *lock, int target, Mutation 
 *mut)
 @@ -1411,19 +1407,18 @@ bool Locker::xlock_start(SimpleLock *lock, MDRequest 
 *mut)
  }
  
  // send lock request
 -if (!lock-is_waiter_for(SimpleLock::WAIT_REMOTEXLOCK)) {
 -  int auth = lock-get_parent()-authority().first;
 -  mut-more()-slaves.insert(auth);
 -  mut-start_locking(lock, auth);
 -  MMDSSlaveRequest *r = new MMDSSlaveRequest(mut-reqid, mut-attempt,
 -  MMDSSlaveRequest::OP_XLOCK);
 -  r-set_lock_type(lock-get_type());
 -  lock-get_parent()-set_object_info(r-get_object_info());
 -  mds-send_message_mds(r, auth);
 -}
 -
 -// wait
 -lock-add_waiter(SimpleLock::WAIT_REMOTEXLOCK, new 
 C_MDS_RetryRequest(mdcache, mut));
 +int auth = lock-get_parent()-authority().first;
 +mut-more()-slaves.insert(auth);
 +mut-start_locking(lock, auth);
 +MMDSSlaveRequest *r = new MMDSSlaveRequest(mut-reqid, mut-attempt,
 +MMDSSlaveRequest::OP_XLOCK);
 +r-set_lock_type(lock-get_type());
 +lock-get_parent()-set_object_info(r-get_object_info());
 +mds-send_message_mds(r, auth);
 +
 +assert(mut-more()-waiting_on_slave.count(auth) == 0);
 +mut-more()-waiting_on_slave.insert(auth);
 +
  return false;
}
  }
 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index 6d0519f..4c4c86b 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -1371,7 +1371,11 @@ void 
 Server::handle_slave_request_reply(MMDSSlaveRequest *m)
mdr-locks.insert(lock);
mdr-finish_locking(lock);
lock-get_xlock(mdr, mdr-get_client());
 -  lock-finish_waiters(SimpleLock::WAIT_REMOTEXLOCK);
 +
 +  assert(mdr-more()-waiting_on_slave.count(from));
 +  mdr-more()-waiting_on_slave.erase(from);
 +  assert(mdr-more()-waiting_on_slave.empty());
 +  dispatch_client_request(mdr);
  }
  break;
  
 @@ -1385,7 +1389,11 @@ void 
 Server::handle_slave_request_reply(MMDSSlaveRequest *m)
mdr-remote_wrlocks[lock] = from;
mdr-locks.insert(lock);
mdr-finish_locking(lock);
 -  lock-finish_waiters(SimpleLock::WAIT_REMOTEXLOCK);
 +
 +  assert(mdr-more()-waiting_on_slave.count(from));
 +  mdr-more()-waiting_on_slave.erase(from);
 +  assert(mdr-more()-waiting_on_slave.empty());
 +  dispatch_client_request(mdr);
  }
  break;
  
 -- 
 1.7.11.7
 

Re: [PATCH 16/39] mds: send cache rejoin messages after gathering all resolves

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 10 ++
  src/mds/MDCache.h  |  5 +
  2 files changed, 15 insertions(+)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index f102205..6853bf1 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -2914,6 +2914,8 @@ void MDCache::maybe_resolve_finish()
recalc_auth_bits();
trim_non_auth();
mds-resolve_done();
 +} else {
 +  maybe_send_pending_rejoins();
  }
}
  }
 @@ -3398,6 +3400,13 @@ void MDCache::rejoin_send_rejoins()
  {
dout(10)  rejoin_send_rejoins with recovery_set   recovery_set  
 dendl;

 +  if (!resolve_gather.empty()) {
 +dout(7)  rejoin_send_rejoins still waiting for resolves (
 +resolve_gather  )  dendl;
 +rejoins_pending = true;
 +return;
 +  }
 +
mapint, MMDSCacheRejoin* rejoins;

// encode cap list once.
 @@ -3571,6 +3580,7 @@ void MDCache::rejoin_send_rejoins()
  mds-send_message_mds(p-second, p-first);
}
rejoin_ack_gather.insert(mds-whoami);   // we need to complete 
 rejoin_gather_finish, too
 +  rejoins_pending = false;

// nothing?
if (mds-is_rejoin()  rejoins.empty()) {
 diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
 index 278debf..379f715 100644
 --- a/src/mds/MDCache.h
 +++ b/src/mds/MDCache.h
 @@ -383,6 +383,7 @@ public:

  protected:
// [rejoin]
 +  bool rejoins_pending;
setint rejoin_gather;  // nodes from whom i need a rejoin
setint rejoin_sent;// nodes i sent a rejoin to
setint rejoin_ack_gather;  // nodes from whom i need a rejoin ack
 @@ -417,6 +418,10 @@ protected:
void handle_cache_rejoin_full(MMDSCacheRejoin *m);
void rejoin_send_acks();
void rejoin_trim_undef_inodes();
 +  void maybe_send_pending_rejoins() {
 +if (rejoins_pending)
 +  rejoin_send_rejoins();
 +  }
  public:
void rejoin_gather_finish();
void rejoin_send_rejoins();
 --
 1.7.11.7



Re: [PATCH 17/39] mds: send resolve acks after master updates are safely logged

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 33 +
  src/mds/MDCache.h  |  7 ++-
  src/mds/Server.cc  |  9 +
  src/mds/journal.cc |  2 +-
  4 files changed, 45 insertions(+), 6 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 6853bf1..9b37b1e 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -2177,6 +2177,17 @@ void MDCache::committed_master_slave(metareqid_t r, 
 int from)
  log_master_commit(r);
  }

 +void MDCache::logged_master_update(metareqid_t reqid)
 +{
 +  dout(10)  logged_master_update   reqid  dendl;
 +  assert(uncommitted_masters.count(reqid));
 +  uncommitted_masters[reqid].safe = true;
 +  if (pending_masters.count(reqid)) {
 +pending_masters.erase(reqid);
 +if (pending_masters.empty())
 +  process_delayed_resolve();
 +  }
 +}

  /*
   * The mds could crash after receiving all slaves' commit acknowledgement,
 @@ -2764,8 +2775,23 @@ void MDCache::handle_resolve(MMDSResolve *m)
  return;
}

 +  discard_delayed_resolve(from);
 +
// ambiguous slave requests?
if (!m-slave_requests.empty()) {
 +for (vectormetareqid_t::iterator p = m-slave_requests.begin();
 +p != m-slave_requests.end();
 +++p) {
 +  if (uncommitted_masters.count(*p)  !uncommitted_masters[*p].safe)
 +   pending_masters.insert(*p);
 +}
 +
 +if (!pending_masters.empty()) {
 +  dout(10)   still have pending updates, delay processing slave 
 resolve  dendl;
 +  delayed_resolve[from] = m;
 +  return;
 +}
 +
  MMDSResolveAck *ack = new MMDSResolveAck;
  for (vectormetareqid_t::iterator p = m-slave_requests.begin();
  p != m-slave_requests.end();
 @@ -2788,7 +2814,6 @@ void MDCache::handle_resolve(MMDSResolve *m)

if (!resolve_ack_gather.empty() || !need_resolve_rollback.empty()) {
  dout(10)  delay processing subtree resolve  dendl;
 -discard_delayed_resolve(from);
  delayed_resolve[from] = m;
  return;
}
 @@ -2883,10 +2908,10 @@ void MDCache::handle_resolve(MMDSResolve *m)
  void MDCache::process_delayed_resolve()
  {
dout(10)  process_delayed_resolve  dendl;
 -  for (mapint, MMDSResolve *::iterator p = delayed_resolve.begin();
 -   p != delayed_resolve.end(); ++p)
 +  mapint, MMDSResolve* tmp;
 +  tmp.swap(delayed_resolve);
 +  for (mapint, MMDSResolve*::iterator p = tmp.begin(); p != tmp.end(); ++p)
  handle_resolve(p-second);
 -  delayed_resolve.clear();
  }

  void MDCache::discard_delayed_resolve(int who)
 diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
 index 379f715..8f262b9 100644
 --- a/src/mds/MDCache.h
 +++ b/src/mds/MDCache.h
 @@ -281,14 +281,16 @@ public:
 snapid_t follows=CEPH_NOSNAP);

// slaves
 -  void add_uncommitted_master(metareqid_t reqid, LogSegment *ls, setint 
 slaves) {
 +  void add_uncommitted_master(metareqid_t reqid, LogSegment *ls, setint 
 slaves, bool safe=false) {
  uncommitted_masters[reqid].ls = ls;
  uncommitted_masters[reqid].slaves = slaves;
 +uncommitted_masters[reqid].safe = safe;
}
void wait_for_uncommitted_master(metareqid_t reqid, Context *c) {
  uncommitted_masters[reqid].waiters.push_back(c);
}
void log_master_commit(metareqid_t reqid);
 +  void logged_master_update(metareqid_t reqid);
void _logged_master_commit(metareqid_t reqid, LogSegment *ls, 
 listContext* waiters);
void committed_master_slave(metareqid_t r, int from);
void finish_committed_masters();
 @@ -320,9 +322,12 @@ protected:
  setint slaves;
  LogSegment *ls;
  listContext* waiters;
 +bool safe;
};
mapmetareqid_t, umaster uncommitted_masters; // 
 master: req - slave set

 +  setmetareqid_t pending_masters;
 +
//mapmetareqid_t, bool ambiguous_slave_updates; // for log 
 trimming.
//mapmetareqid_t, Context* waiting_for_slave_update_commit;
friend class ESlaveUpdate;
 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index 8e89e4c..1330f11 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -4463,6 +4463,9 @@ void Server::_link_remote_finish(MDRequest *mdr, bool 
 inc,

assert(g_conf-mds_kill_link_at != 3);

 +  if (!mdr-more()-witnessed.empty())
 +mdcache-logged_master_update(mdr-reqid);
 +
if (inc) {
  // link the new dentry
  dn-pop_projected_linkage();
 @@ -5073,6 +5076,9 @@ void Server::_unlink_local_finish(MDRequest *mdr,
  {
dout(10)  _unlink_local_finish   *dn  dendl;

 +  if (!mdr-more()-witnessed.empty())
 +mdcache-logged_master_update(mdr-reqid);
 +
// unlink main dentry
dn-get_dir()-unlink_inode(dn);
dn-pop_projected_linkage();
 @@ -5881,6 +5887,9 @@ void Server::_rename_finish(MDRequest *mdr, CDentry 
 

Re: [PATCH 19/39] mds: remove MDCache::rejoin_fetch_dirfrags()

2013-03-20 Thread Gregory Farnum
Nice.
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 In commit 77946dcdae (mds: fetch missing inodes from disk), I introduced
 MDCache::rejoin_fetch_dirfrags(). But it basically duplicates the function
 of MDCache::open_undef_dirfrags(), so just remove rejoin_fetch_dirfrags()
 and make open_undef_dirfrags() also handle undefined inodes.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/CDir.cc|  70 +++
  src/mds/MDCache.cc | 193 
 +
  src/mds/MDCache.h  |   5 +-
  3 files changed, 107 insertions(+), 161 deletions(-)

 diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
 index 231630e..af0ae9c 100644
 --- a/src/mds/CDir.cc
 +++ b/src/mds/CDir.cc
 @@ -1553,33 +1553,32 @@ void CDir::_fetched(bufferlist bl, const string 
 want_dn)
if (stale)
 continue;

 +  bool undef_inode = false;
if (dn) {
 -if (dn-get_linkage()-get_inode() == 0) {
 -  dout(12)  _fetched  had NEG dentry   *dn  dendl;
 -} else {
 -  dout(12)  _fetched  had dentry   *dn  dendl;
 -}
 -  } else {
 +   CInode *in = dn-get_linkage()-get_inode();
 +   if (in) {
 + dout(12)  _fetched  had dentry   *dn  dendl;
 + if (in-state_test(CInode::STATE_REJOINUNDEF)) {
 +   assert(cache-mds-is_rejoin());
 +   assert(in-vino() == vinodeno_t(inode.ino, last));
 +   in-state_clear(CInode::STATE_REJOINUNDEF);
 +   cache-opened_undef_inode(in);
 +   undef_inode = true;
 + }
 +   } else
 + dout(12)  _fetched  had NEG dentry   *dn  dendl;
 +  }
 +
 +  if (!dn || undef_inode) {
 // add inode
 CInode *in = cache-get_inode(inode.ino, last);
 -   if (in) {
 - dout(0)  _fetched  badness: got (but i already had)   *in
 -   mode   in-inode.mode
 -   mtime   in-inode.mtime  dendl;
 - string dirpath, inopath;
 - this-inode-make_path_string(dirpath);
 - in-make_path_string(inopath);
 - clog.error()  loaded dup inode   inode.ino
 - [  first  ,  last  ] v  inode.version
 - at   dirpath  /  dname
 -, but inode   in-vino()   v  in-inode.version
 - already exists at   inopath  \n;
 - continue;
 -   } else {
 - // inode
 - in = new CInode(cache, true, first, last);
 - in-inode = inode;
 +   if (!in || undef_inode) {
 + if (undef_inode)
 +   in-first = first;
 + else
 +   in = new CInode(cache, true, first, last);

 + in-inode = inode;
   // symlink?
   if (in-is_symlink())
 in-symlink = symlink;
 @@ -1591,11 +1590,13 @@ void CDir::_fetched(bufferlist bl, const string 
 want_dn)
   if (snaps)
 in-purge_stale_snap_data(*snaps);

 - // add
 - cache-add_inode( in );
 -
 - // link
 - dn = add_primary_dentry(dname, in, first, last);
 + if (undef_inode) {
 +   if (inode.anchored)
 + dn-adjust_nested_anchors(1);
 + } else {
 +   cache-add_inode( in ); // add
 +   dn = add_primary_dentry(dname, in, first, last); // link
 + }
   dout(12)  _fetched  got   *dn *in  dendl;

   if (in-inode.is_dirty_rstat())
 @@ -1604,6 +1605,19 @@ void CDir::_fetched(bufferlist bl, const string 
 want_dn)
   //in-hack_accessed = false;
   //in-hack_load_stamp = ceph_clock_now(g_ceph_context);
   //num_new_inodes_loaded++;
 +   } else {
 + dout(0)  _fetched  badness: got (but i already had)   *in
 +   mode   in-inode.mode
 +   mtime   in-inode.mtime  dendl;
 + string dirpath, inopath;
 + this-inode-make_path_string(dirpath);
 + in-make_path_string(inopath);
 + clog.error()  loaded dup inode   inode.ino
 + [  first  ,  last  ] v  inode.version
 + at   dirpath  /  dname
 +, but inode   in-vino()   v  in-inode.version
 + already exists at   inopath  \n;
 + continue;
 }
}
  } else {
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index d934020..008a8a2 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -4178,7 +4178,6 @@ void MDCache::rejoin_scour_survivor_replicas(int from, 
 MMDSCacheRejoin *ack,

  CInode *MDCache::rejoin_invent_inode(inodeno_t ino, snapid_t last)
  {
 -  assert(0);
CInode *in = new CInode(this, true, 1, last);
in-inode.ino = ino;
in-state_set(CInode::STATE_REJOINUNDEF);
 @@ -4190,16 +4189,13 @@ CInode *MDCache::rejoin_invent_inode(inodeno_t ino, 
 snapid_t last)

  CDir *MDCache::rejoin_invent_dirfrag(dirfrag_t df)
  {
 -  assert(0);
CInode *in = get_inode(df.ino);
 

Re: deb/rpm package purge

2013-03-20 Thread Laszlo Boszormenyi (GCS)
On Wed, 2013-03-20 at 05:48 -0700, Sage Weil wrote:
 On Wed, 20 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
  On Tue, 2013-03-19 at 15:59 -0700, Sage Weil wrote:
   As a point of comparison, mysql removes the config files but not 
   /var/lib/mysql.
   The question is, is that okay/typical/desirable/recommended/a bad idea?
 I should have asked this sooner. Do you know _any_ program that removes
your favorite music collection, your family photos or your business
emails when you uninstall it?
I suspect that your question was theoretical instead.

On Wed, 2013-03-20 at 09:48 -0500, Mark Nelson wrote:
On 03/20/2013 07:48 AM, Sage Weil wrote:
 It's not as important given that it won't outright destroy the cluster, 
 but perhaps we should also leave /etc/ceph untouched on purge if a 
 ceph.conf file has been placed in it (since that also was not installed 
 by the package, but rather by a user?).  I figure we should probably try 
 to get it right now.  The message about the directory not being empty 
 sounds good.
 Sure, personal user data must be kept. If it's a large amount of data
left in a non-standard location (i.e., not under his/her $HOME), then
s/he should be informed on purge where those files are located.

 My thought here is:
 - remove anything created by the packages in /var/lib/ceph that has been 
 untouched since package installation.
 - remove /var/lib/ceph if it has been untouched
 Please note that you have to store some kind of checksum for the files
then. Probably md5sum is enough.

 - remove /etc/ceph if it has been untouched
 This is another case. dpkg itself handles package files here, called
conffiles. I should check the method (md5sum and/or sha1 variants) used
for the checksum on these files. On upgrade it's used to avoid overwriting
local changes made by the user. It may be worth reading a bit more about it[1]
from Raphaël Hertzog. He is the co-author of the Debian Administrator's
Handbook[2], BTW.
 On purge dpkg will remove the package conffiles no matter what. It
won't check if those were changed or not. You may not mark the files
under /etc as conffiles, but then you'll lose the mentioned
merge logic on upgrades; dpkg will just overwrite those.
 In short, files under /var/lib/ceph are the only candidates for
in-package checksumming. How many files under there are essential for
the packages?
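
As a sketch of the kind of guard this implies -- illustrative only, not the
actual ceph maintainer scripts, and the checksum manifest path is an
assumption -- a postrm purge step could look like:

  case "$1" in
    purge)
      # remove the state directory only if it is unchanged since installation
      # (a real implementation would also need to notice files added later)
      if [ -f /var/lib/ceph/.install-md5sums ] && \
         md5sum --status -c /var/lib/ceph/.install-md5sums; then
          rm -rf /var/lib/ceph
      else
          echo "/var/lib/ceph contains local data; leaving it in place" >&2
      fi
      ;;
  esac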

Laszlo/GCS
[1] 
http://raphaelhertzog.com/2010/09/21/debian-conffile-configuration-file-managed-by-dpkg/
[2] http://debian-handbook.info/



Re: [PATCH 20/39] mds: include replica nonce in MMDSCacheRejoin::inode_strong

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 So the recovering MDS can properly handle cache expire messages.
 Also increase the nonce value when sending the cache rejoin acks.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 35 +++
  src/messages/MMDSCacheRejoin.h | 11 +++
  2 files changed, 30 insertions(+), 16 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 008a8a2..8ba676e 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3538,6 +3538,7 @@ void MDCache::rejoin_send_rejoins()
if (p-first == 0  root) {
 p-second-add_weak_inode(root-vino());
 p-second-add_strong_inode(root-vino(),
 +   root-get_replica_nonce(),
 root-get_caps_wanted(),
 root-filelock.get_state(),
 root-nestlock.get_state(),
 @@ -3551,6 +3552,7 @@ void MDCache::rejoin_send_rejoins()
if (CInode *in = get_inode(MDS_INO_MDSDIR(p-first))) {
 p-second-add_weak_inode(in-vino());
 p-second-add_strong_inode(in-vino(),
 +   in-get_replica_nonce(),
 in-get_caps_wanted(),
 in-filelock.get_state(),
 in-nestlock.get_state(),
 @@ -3709,6 +3711,7 @@ void MDCache::rejoin_walk(CDir *dir, MMDSCacheRejoin 
 *rejoin)
 CInode *in = dnl-get_inode();
 dout(15)   add_strong_inode   *in  dendl;
 rejoin-add_strong_inode(in-vino(),
 +in-get_replica_nonce(),
  in-get_caps_wanted(),
  in-filelock.get_state(),
  in-nestlock.get_state(),
 @@ -4248,7 +4251,7 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
 dir = rejoin_invent_dirfrag(p-first);
  }
  if (dir) {
 -  dir-add_replica(from);
 +  dir-add_replica(from, p-second.nonce);
dir-dir_rep = p-second.dir_rep;
  } else {
dout(10)   frag   p-first   doesn't match dirfragtree   
 *diri  dendl;
 @@ -4263,7 +4266,7 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
   dir = rejoin_invent_dirfrag(p-first);
 else
   dout(10)   have(approx)   *dir  dendl;
 -   dir-add_replica(from);
 +   dir-add_replica(from, p-second.nonce);
 dir-dir_rep = p-second.dir_rep;
}
refragged = true;
 @@ -4327,7 +4330,7 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
 mdr-locks.insert(dn-lock);
}

 -  dn-add_replica(from);
 +  dn-add_replica(from, q-second.nonce);
dout(10)   have   *dn  dendl;

// inode?
 @@ -4412,7 +4415,7 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
   dout(10)   sender has dentry but not inode, adding them as a 
 replica  dendl;
 }

 -   in-add_replica(from);
 +   in-add_replica(from, p-second.nonce);
 dout(10)   have   *in  dendl;
}
  }
 @@ -5176,7 +5179,7 @@ void MDCache::rejoin_send_acks()
for (mapint,int::iterator r = dir-replicas_begin();
r != dir-replicas_end();
++r)
 -   ack[r-first]-add_strong_dirfrag(dir-dirfrag(), r-second, 
 dir-dir_rep);
 +   ack[r-first]-add_strong_dirfrag(dir-dirfrag(), ++r-second, 
 dir-dir_rep);

for (CDir::map_t::iterator q = dir-items.begin();
q != dir-items.end();
 @@ -5192,7 +5195,7 @@ void MDCache::rejoin_send_acks()
dnl-is_primary() ? 
 dnl-get_inode()-ino():inodeno_t(0),
dnl-is_remote() ? 
 dnl-get_remote_ino():inodeno_t(0),
dnl-is_remote() ? 
 dnl-get_remote_d_type():0,
 -  r-second,
 +  ++r-second,
dn-lock.get_replica_state());

 if (!dnl-is_primary())
 @@ -5205,7 +5208,7 @@ void MDCache::rejoin_send_acks()
  r != in-replicas_end();
  ++r) {
   ack[r-first]-add_inode_base(in);
 - ack[r-first]-add_inode_locks(in, r-second);
 + ack[r-first]-add_inode_locks(in, ++r-second);
 }

 // subdirs in this subtree?
 @@ -5220,14 +5223,14 @@ void MDCache::rejoin_send_acks()
  r != root-replicas_end();
  ++r) {
ack[r-first]-add_inode_base(root);
 -  ack[r-first]-add_inode_locks(root, r-second);
 +  ack[r-first]-add_inode_locks(root, ++r-second);
  }
if (myin)
  for (mapint,int::iterator r = myin-replicas_begin();
  r != myin-replicas_end();
  

Re: [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack

2013-03-20 Thread Gregory Farnum
This needs to handle versioning the encoding based on peer feature bits too.

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Cache rejoin ack message already encodes inode base, make it also encode
 dirfrag base. This allows the message to replicate stray dentries like the
 MDentryUnlink message does. The function will be used by a later patch.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/CDir.h | 20 +---
  src/mds/MDCache.cc | 20 ++--
  src/messages/MMDSCacheRejoin.h | 12 +++-
  3 files changed, 42 insertions(+), 10 deletions(-)

 diff --git a/src/mds/CDir.h b/src/mds/CDir.h
 index 79946f1..f4a3a3d 100644
 --- a/src/mds/CDir.h
 +++ b/src/mds/CDir.h
 @@ -437,23 +437,29 @@ private:
  ::encode(dist, bl);
}

 -  void encode_replica(int who, bufferlist bl) {
 -__u32 nonce = add_replica(who);
 -::encode(nonce, bl);
 +  void _encode_base(bufferlist bl) {
  ::encode(first, bl);
  ::encode(fnode, bl);
  ::encode(dir_rep, bl);
  ::encode(dir_rep_by, bl);
}
 -  void decode_replica(bufferlist::iterator p) {
 -__u32 nonce;
 -::decode(nonce, p);
 -replica_nonce = nonce;
 +  void _decode_base(bufferlist::iterator p) {
  ::decode(first, p);
  ::decode(fnode, p);
  ::decode(dir_rep, p);
  ::decode(dir_rep_by, p);
}
 +  void encode_replica(int who, bufferlist bl) {
 +__u32 nonce = add_replica(who);
 +::encode(nonce, bl);
 +_encode_base(bl);
 +  }
 +  void decode_replica(bufferlist::iterator p) {
 +__u32 nonce;
 +::decode(nonce, p);
 +replica_nonce = nonce;
 +_decode_base(p);
 +  }



 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 8ba676e..344777e 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -4510,8 +4510,22 @@ void MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin 
 *ack)
  }
}

 +  // full dirfrags
 +  bufferlist::iterator p = ack-dirfrag_base.begin();
 +  while (!p.end()) {
 +dirfrag_t df;
 +bufferlist basebl;
 +::decode(df, p);
 +::decode(basebl, p);
 +CDir *dir = get_dirfrag(df);
 +assert(dir);
 +bufferlist::iterator q = basebl.begin();
 +dir-_decode_base(q);
 +dout(10)   got dir replica   *dir  dendl;
 +  }
 +
// full inodes
 -  bufferlist::iterator p = ack-inode_base.begin();
 +  p = ack-inode_base.begin();
while (!p.end()) {
  inodeno_t ino;
  snapid_t last;
 @@ -5178,8 +5192,10 @@ void MDCache::rejoin_send_acks()
// dir
for (mapint,int::iterator r = dir-replicas_begin();
r != dir-replicas_end();
 -  ++r)
 +  ++r) {
 ack[r-first]-add_strong_dirfrag(dir-dirfrag(), ++r-second, 
 dir-dir_rep);
 +   ack[r-first]-add_dirfrag_base(dir);
 +  }

for (CDir::map_t::iterator q = dir-items.begin();
q != dir-items.end();
 diff --git a/src/messages/MMDSCacheRejoin.h b/src/messages/MMDSCacheRejoin.h
 index b88f551..7c37ab4 100644
 --- a/src/messages/MMDSCacheRejoin.h
 +++ b/src/messages/MMDSCacheRejoin.h
 @@ -20,6 +20,7 @@
  #include include/types.h

  #include mds/CInode.h
 +#include mds/CDir.h

  // sent from replica to auth

 @@ -169,6 +170,7 @@ class MMDSCacheRejoin : public Message {
// full
bufferlist inode_base;
bufferlist inode_locks;
 +  bufferlist dirfrag_base;

// authpins, xlocks
struct slave_reqid {
 @@ -258,7 +260,13 @@ public:
void add_strong_dirfrag(dirfrag_t df, int n, int dr) {
  strong_dirfrags[df] = dirfrag_strong(n, dr);
}
 -
 +  void add_dirfrag_base(CDir *dir) {
 +::encode(dir-dirfrag(), dirfrag_base);
 +bufferlist bl;
 +dir-_encode_base(bl);
 +::encode(bl, dirfrag_base);
 +  }

We are guilty of doing this in other places, but we should avoid
implicit encodings like this one, especially when the decode happens
somewhere else like it does here. We can make a vector dirfrag_bases
and add to that, and then encode and decode it along with the rest of
the message — would that work for your purposes?
-Greg

 +
// dentries
void add_weak_dirfrag(dirfrag_t df) {
  weak_dirfrags.insert(df);
 @@ -294,6 +302,7 @@ public:
  ::encode(wrlocked_inodes, payload);
  ::encode(cap_export_bl, payload);
  ::encode(strong_dirfrags, payload);
 +::encode(dirfrag_base, payload);
  ::encode(weak, payload);
  ::encode(weak_dirfrags, payload);
  ::encode(weak_inodes, payload);
 @@ -319,6 +328,7 @@ public:
::decode(cap_export_paths, q);
  }
  ::decode(strong_dirfrags, p);
 +::decode(dirfrag_base, p);
  ::decode(weak, p);
  ::decode(weak_dirfrags, p);
  ::decode(weak_inodes, p);
 --
 1.7.11.7



Re: [PATCH 20/39] mds: include replica nonce in MMDSCacheRejoin::inode_strong

2013-03-20 Thread Sage Weil
On Wed, 20 Mar 2013, Gregory Farnum wrote:
 On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
  From: Yan, Zheng zheng.z@intel.com
 
  So the recovering MDS can properly handle cache expire messages.
  Also increase the nonce value when sending the cache rejoin acks.
 
  Signed-off-by: Yan, Zheng zheng.z@intel.com
  ---
   src/mds/MDCache.cc | 35 +++
   src/messages/MMDSCacheRejoin.h | 11 +++
   2 files changed, 30 insertions(+), 16 deletions(-)
 
  diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
  index 008a8a2..8ba676e 100644
  --- a/src/mds/MDCache.cc
  +++ b/src/mds/MDCache.cc
  @@ -3538,6 +3538,7 @@ void MDCache::rejoin_send_rejoins()
 if (p-first == 0  root) {
  p-second-add_weak_inode(root-vino());
  p-second-add_strong_inode(root-vino(),
  +   root-get_replica_nonce(),
  root-get_caps_wanted(),
  root-filelock.get_state(),
  root-nestlock.get_state(),
  @@ -3551,6 +3552,7 @@ void MDCache::rejoin_send_rejoins()
 if (CInode *in = get_inode(MDS_INO_MDSDIR(p-first))) {
  p-second-add_weak_inode(in-vino());
  p-second-add_strong_inode(in-vino(),
  +   in-get_replica_nonce(),
  in-get_caps_wanted(),
  in-filelock.get_state(),
  in-nestlock.get_state(),
  @@ -3709,6 +3711,7 @@ void MDCache::rejoin_walk(CDir *dir, MMDSCacheRejoin 
  *rejoin)
  CInode *in = dnl-get_inode();
  dout(15)   add_strong_inode   *in  dendl;
  rejoin-add_strong_inode(in-vino(),
  +in-get_replica_nonce(),
   in-get_caps_wanted(),
   in-filelock.get_state(),
   in-nestlock.get_state(),
  @@ -4248,7 +4251,7 @@ void 
  MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
  dir = rejoin_invent_dirfrag(p-first);
   }
   if (dir) {
  -  dir-add_replica(from);
  +  dir-add_replica(from, p-second.nonce);
 dir-dir_rep = p-second.dir_rep;
   } else {
 dout(10)   frag   p-first   doesn't match dirfragtree   
  *diri  dendl;
  @@ -4263,7 +4266,7 @@ void 
  MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
dir = rejoin_invent_dirfrag(p-first);
  else
dout(10)   have(approx)   *dir  dendl;
  -   dir-add_replica(from);
  +   dir-add_replica(from, p-second.nonce);
  dir-dir_rep = p-second.dir_rep;
 }
 refragged = true;
  @@ -4327,7 +4330,7 @@ void 
  MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
  mdr-locks.insert(dn-lock);
 }
 
  -  dn-add_replica(from);
  +  dn-add_replica(from, q-second.nonce);
 dout(10)   have   *dn  dendl;
 
 // inode?
  @@ -4412,7 +4415,7 @@ void 
  MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
dout(10)   sender has dentry but not inode, adding them as a 
  replica  dendl;
  }
 
  -   in-add_replica(from);
  +   in-add_replica(from, p-second.nonce);
  dout(10)   have   *in  dendl;
 }
   }
  @@ -5176,7 +5179,7 @@ void MDCache::rejoin_send_acks()
 for (mapint,int::iterator r = dir-replicas_begin();
 r != dir-replicas_end();
 ++r)
  -   ack[r-first]-add_strong_dirfrag(dir-dirfrag(), r-second, 
  dir-dir_rep);
  +   ack[r-first]-add_strong_dirfrag(dir-dirfrag(), ++r-second, 
  dir-dir_rep);
 
 for (CDir::map_t::iterator q = dir-items.begin();
 q != dir-items.end();
  @@ -5192,7 +5195,7 @@ void MDCache::rejoin_send_acks()
 dnl-is_primary() ? 
  dnl-get_inode()-ino():inodeno_t(0),
 dnl-is_remote() ? 
  dnl-get_remote_ino():inodeno_t(0),
 dnl-is_remote() ? 
  dnl-get_remote_d_type():0,
  -  r-second,
  +  ++r-second,
 dn-lock.get_replica_state());
 
  if (!dnl-is_primary())
  @@ -5205,7 +5208,7 @@ void MDCache::rejoin_send_acks()
   r != in-replicas_end();
   ++r) {
ack[r-first]-add_inode_base(in);
  - ack[r-first]-add_inode_locks(in, r-second);
  + ack[r-first]-add_inode_locks(in, ++r-second);
  }
 
  // subdirs in this subtree?
  @@ -5220,14 +5223,14 @@ void MDCache::rejoin_send_acks()
   r != root-replicas_end();
   ++r) {
 ack[r-first]-add_inode_base(root);
  -  ack[r-first]-add_inode_locks(root, r-second);
  +  

Re: [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack

2013-03-20 Thread Gregory Farnum
On Wed, Mar 20, 2013 at 4:33 PM, Gregory Farnum g...@inktank.com wrote:
 This needs to handle versioning the encoding based on peer feature bits too.

 On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 +  void add_dirfrag_base(CDir *dir) {
 +::encode(dir-dirfrag(), dirfrag_base);
 +bufferlist bl;
 +dir-_encode_base(bl);
 +::encode(bl, dirfrag_base);
 +  }

 We are guilty of doing this in other places, but we should avoid
 implicit encodings like this one, especially when the decode happens
 somewhere else like it does here. We can make a vector dirfrag_bases
 and add to that, and then encode and decode it along with the rest of
 the message — would that work for your purposes?
 -Greg

Sorry, a vector (called dirfrag_bases) of pair<dirfrag_t, bl> where bl
is the encoded base. Or something like that. :)
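
A minimal sketch of what that could look like, assuming a
vector<pair<dirfrag_t, bufferlist> > member named dirfrag_bases on
MMDSCacheRejoin (the names here are only illustrative, not the final patch):

  // Explicit container: the (dirfrag, encoded base) pairs are part of the
  // message's own layout, so encode/decode of the whole message stays in
  // one place instead of hand-rolling a raw bufferlist on both ends.
  vector<pair<dirfrag_t, bufferlist> > dirfrag_bases;

  void add_dirfrag_base(CDir *dir) {
    bufferlist bl;
    dir->_encode_base(bl);
    dirfrag_bases.push_back(make_pair(dir->dirfrag(), bl));
  }

  // encode_payload()/decode_payload() then just do
  //   ::encode(dirfrag_bases, payload);   and   ::decode(dirfrag_bases, p);
  // and handle_cache_rejoin_ack() iterates the vector, calling
  // dir->_decode_base() on each element's bufferlist.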


Re: [PATCH 23/39] mds: reqid for rejoinning authpin/wrlock need to be list

2013-03-20 Thread Gregory Farnum
I think Sage is right, we can just bump the MDS protocol instead of
spending a feature bit on OTW changes — but this is another message we
should update to the new encoding macros while we're making that bump.
The rest looks good!
-Greg
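
For reference, the versioned-encoding pattern those macros give is roughly
the following; the struct and fields below are invented purely for
illustration, assuming the ENCODE_START/DECODE_START helpers from
include/encoding.h:

  struct example_reqid_t {
    uint64_t reqid;
    uint32_t attempt;

    void encode(bufferlist &bl) const {
      ENCODE_START(2, 1, bl);     // struct version 2, compatible back to 1
      ::encode(reqid, bl);
      ::encode(attempt, bl);      // field added in version 2
      ENCODE_FINISH(bl);
    }
    void decode(bufferlist::iterator &p) {
      DECODE_START(2, p);
      ::decode(reqid, p);
      if (struct_v >= 2)          // older peers simply omit the new field
        ::decode(attempt, p);
      DECODE_FINISH(p);
    }
  };

Bumping the MDS protocol plus a per-struct version like this keeps
on-the-wire changes between MDSes from consuming a cluster-wide feature bit.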

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 78 
 --
  src/messages/MMDSCacheRejoin.h | 12 +++
  2 files changed, 50 insertions(+), 40 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 38b1fdf..f4622de 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -4305,16 +4305,19 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
// dn auth_pin?
if (strong-authpinned_dentries.count(p-first) 
   strong-authpinned_dentries[p-first].count(q-first)) {
 -   MMDSCacheRejoin::slave_reqid r = 
 strong-authpinned_dentries[p-first][q-first];
 -   dout(10)   dn authpin by   r   on   *dn  dendl;
 -
 -   // get/create slave mdrequest
 -   MDRequest *mdr;
 -   if (have_request(r.reqid))
 - mdr = request_get(r.reqid);
 -   else
 - mdr = request_start_slave(r.reqid, r.attempt, from);
 -   mdr-auth_pin(dn);
 +   for (listMMDSCacheRejoin::slave_reqid::iterator r = 
 strong-authpinned_dentries[p-first][q-first].begin();
 +r != strong-authpinned_dentries[p-first][q-first].end();
 +++r) {
 + dout(10)   dn authpin by   *r   on   *dn  dendl;
 +
 + // get/create slave mdrequest
 + MDRequest *mdr;
 + if (have_request(r-reqid))
 +   mdr = request_get(r-reqid);
 + else
 +   mdr = request_start_slave(r-reqid, r-attempt, from);
 + mdr-auth_pin(dn);
 +   }
}

// dn xlock?
 @@ -4389,22 +4392,25 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)

  // auth pin?
  if (strong-authpinned_inodes.count(in-vino())) {
 -  MMDSCacheRejoin::slave_reqid r = strong-authpinned_inodes[in-vino()];
 -  dout(10)   inode authpin by   r   on   *in  dendl;
 +  for (listMMDSCacheRejoin::slave_reqid::iterator r = 
 strong-authpinned_inodes[in-vino()].begin();
 +  r != strong-authpinned_inodes[in-vino()].end();
 +  ++r) {
 +   dout(10)   inode authpin by   *r   on   *in  dendl;

 -  // get/create slave mdrequest
 -  MDRequest *mdr;
 -  if (have_request(r.reqid))
 -   mdr = request_get(r.reqid);
 -  else
 -   mdr = request_start_slave(r.reqid, r.attempt, from);
 -  if (strong-frozen_authpin_inodes.count(in-vino())) {
 -   assert(!in-get_num_auth_pins());
 -   mdr-freeze_auth_pin(in);
 -  } else {
 -   assert(!in-is_frozen_auth_pin());
 +   // get/create slave mdrequest
 +   MDRequest *mdr;
 +   if (have_request(r-reqid))
 + mdr = request_get(r-reqid);
 +   else
 + mdr = request_start_slave(r-reqid, r-attempt, from);
 +   if (strong-frozen_authpin_inodes.count(in-vino())) {
 + assert(!in-get_num_auth_pins());
 + mdr-freeze_auth_pin(in);
 +   } else {
 + assert(!in-is_frozen_auth_pin());
 +   }
 +   mdr-auth_pin(in);
}
 -  mdr-auth_pin(in);
  }
  // xlock(s)?
  if (strong-xlocked_inodes.count(in-vino())) {
 @@ -4427,19 +4433,23 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
  }
  // wrlock(s)?
  if (strong-wrlocked_inodes.count(in-vino())) {
 -  for (mapint,MMDSCacheRejoin::slave_reqid::iterator q = 
 strong-wrlocked_inodes[in-vino()].begin();
 +  for (mapint, listMMDSCacheRejoin::slave_reqid ::iterator q = 
 strong-wrlocked_inodes[in-vino()].begin();
q != strong-wrlocked_inodes[in-vino()].end();
++q) {
 SimpleLock *lock = in-get_lock(q-first);
 -   dout(10)   inode wrlock by   q-second   on   *lock   
 on   *in  dendl;
 -   MDRequest *mdr = request_get(q-second.reqid);  // should have this 
 from auth_pin above.
 -   assert(mdr-is_auth_pinned(in));
 -   lock-set_state(LOCK_LOCK);
 -   if (lock == in-filelock)
 - in-loner_cap = -1;
 -   lock-get_wrlock(true);
 -   mdr-wrlocks.insert(lock);
 -   mdr-locks.insert(lock);
 +   for (listMMDSCacheRejoin::slave_reqid::iterator r = 
 q-second.begin();
 +r != q-second.end();
 +++r) {
 + dout(10)   inode wrlock by   *r   on   *lock   on  
  *in  dendl;
 + MDRequest *mdr = request_get(r-reqid);  // should have this from 
 auth_pin above.
 + assert(mdr-is_auth_pinned(in));
 + lock-set_state(LOCK_MIX);
 + if (lock == in-filelock)
 +   in-loner_cap = -1;
 + lock-get_wrlock(true);
 + mdr-wrlocks.insert(lock);
 + mdr-locks.insert(lock);
 +   }
}
  }

Re: [PATCH 24/39] mds: take object's versionlock when rejoinning xlock

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 12 
  1 file changed, 12 insertions(+)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index f4622de..194f983 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -4327,6 +4327,12 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
 dout(10)   dn xlock by   r   on   *dn  dendl;
 MDRequest *mdr = request_get(r.reqid);  // should have this from 
 auth_pin above.
 assert(mdr-is_auth_pinned(dn));
 +   if (!mdr-xlocks.count(dn-versionlock)) {
 + assert(dn-versionlock.can_xlock_local());
 + dn-versionlock.get_xlock(mdr, mdr-get_client());
 + mdr-xlocks.insert(dn-versionlock);
 + mdr-locks.insert(dn-versionlock);
 +   }
 if (dn-lock.is_stable())
   dn-auth_pin(dn-lock);
 dn-lock.set_state(LOCK_XLOCK);
 @@ -4421,6 +4427,12 @@ void 
 MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
 dout(10)   inode xlock by   q-second   on   *lock   
 on   *in  dendl;
 MDRequest *mdr = request_get(q-second.reqid);  // should have this 
 from auth_pin above.
 assert(mdr-is_auth_pinned(in));
 +   if (!mdr-xlocks.count(in-versionlock)) {
 + assert(in-versionlock.can_xlock_local());
 + in-versionlock.get_xlock(mdr, mdr-get_client());
 + mdr-xlocks.insert(in-versionlock);
 + mdr-locks.insert(in-versionlock);
 +   }
 if (lock-is_stable())
   in-auth_pin(lock);
 lock-set_state(LOCK_XLOCK);
 --
 1.7.11.7



Re: [PATCH 25/39] mds: share inode max size after MDS recovers

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 The MDS may crash after journaling the new max size, but before sending
 the new max size to the client. Later when the MDS recovers, the client
 re-requests the new max size, but the MDS finds max size unchanged. So
 the client waits for the new max size forever. This issue can be avoided
 by checking client cap's last_sent, share inode max size if it is zero.
 (reconnected cap's last_sent is zero)

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Locker.cc  | 18 ++
  src/mds/Locker.h   |  2 +-
  src/mds/MDCache.cc |  2 ++
  3 files changed, 17 insertions(+), 5 deletions(-)

 diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
 index 0055a19..4d45f99 100644
 --- a/src/mds/Locker.cc
 +++ b/src/mds/Locker.cc
 @@ -2089,7 +2089,7 @@ bool Locker::check_inode_max_size(CInode *in, bool 
 force_wrlock,
  }


 -void Locker::share_inode_max_size(CInode *in)
 +void Locker::share_inode_max_size(CInode *in, Capability *only_cap)
  {
/*
 * only share if currently issued a WR cap.  if client doesn't have it,
 @@ -2097,9 +2097,12 @@ void Locker::share_inode_max_size(CInode *in)
 * the cap later.
 */
dout(10)  share_inode_max_size on   *in  dendl;
 -  for (mapclient_t,Capability*::iterator it = in-client_caps.begin();
 -   it != in-client_caps.end();
 -   ++it) {
 +  mapclient_t, Capability*::iterator it;
 +  if (only_cap)
 +it = in-client_caps.find(only_cap-get_client());
 +  else
 +it = in-client_caps.begin();
 +  for (; it != in-client_caps.end(); ++it) {
  const client_t client = it-first;
  Capability *cap = it-second;
  if (cap-is_suppress())
 @@ -2115,6 +2118,8 @@ void Locker::share_inode_max_size(CInode *in)
in-encode_cap_message(m, cap);
mds-send_message_client_counted(m, client);
  }
 +if (only_cap)
 +  break;
}
  }

 @@ -2398,6 +2403,11 @@ void Locker::handle_client_caps(MClientCaps *m)
bool did_issue = eval(in, CEPH_CAP_LOCKS);
if (!did_issue  (cap-wanted()  ~cap-pending()))
 issue_caps(in, cap);
 +  if (cap-get_last_seq() == 0 
 + (cap-pending()  (CEPH_CAP_FILE_WR|CEPH_CAP_FILE_BUFFER))) {
 +   cap-issue_norevoke(cap-issued());
 +   share_inode_max_size(in, cap);
 +  }
  }
}

 diff --git a/src/mds/Locker.h b/src/mds/Locker.h
 index 3f79996..d98104f 100644
 --- a/src/mds/Locker.h
 +++ b/src/mds/Locker.h
 @@ -276,7 +276,7 @@ public:
void calc_new_client_ranges(CInode *in, uint64_t size, mapclient_t, 
 client_writeable_range_t new_ranges);
bool check_inode_max_size(CInode *in, bool force_wrlock=false, bool 
 update_size=false, uint64_t newsize=0,
 utime_t mtime=utime_t());
 -  void share_inode_max_size(CInode *in);
 +  void share_inode_max_size(CInode *in, Capability *only_cap=0);

  private:
friend class C_MDL_CheckMaxSize;
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 194f983..459b400 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -5073,6 +5073,8 @@ void MDCache::do_cap_import(Session *session, CInode 
 *in, Capability *cap)
SnapRealm *realm = in-find_snaprealm();
if (realm-have_past_parents_open()) {
  dout(10)  do_cap_import   session-info.inst.name   mseq   
 cap-get_mseq()   on   *in  dendl;
 +if (cap-get_last_seq() == 0)
 +  cap-issue_norevoke(cap-issued()); // reconnected cap
  cap-set_last_issue();
  MClientCaps *reap = new MClientCaps(CEPH_CAP_OP_IMPORT,
 in-ino(),
 --
 1.7.11.7



Re: [PATCH 26/39] mds: issue caps when lock state in replica become SYNC

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 because client can request READ caps from non-auth MDS.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Locker.cc | 2 ++
  1 file changed, 2 insertions(+)

 diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
 index 4d45f99..28920d4 100644
 --- a/src/mds/Locker.cc
 +++ b/src/mds/Locker.cc
 @@ -4403,6 +4403,8 @@ void Locker::handle_file_lock(ScatterLock *lock, MLock 
 *m)
  lock-set_state(LOCK_SYNC);

  lock-get_rdlock();
 +if (caps)
 +  issue_caps(in);
  lock-finish_waiters(SimpleLock::WAIT_RD|SimpleLock::WAIT_STABLE);
  lock-put_rdlock();
  break;
 --
 1.7.11.7



RE: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-20 Thread Jacky.He


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Stefan Priebe
 Sent: Thursday, March 21, 2013 4:14 AM
 To: Travis Rhoden
 Cc: bcampb...@axcess-financial.com; ceph-devel
 Subject: Re: Latest bobtail branch still crashing KVM VMs in
bh_write_commit()
 
 Hi,
 
  In this case, they are format 2. And they are from cloned snapshots.
  Exactly like the following:
 
  # rbd ls -l -p volumes
  NAME SIZE
  PARENT   FMT PROT LOCK
  volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
  images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2
 
  I'm doing an OpenStack boot-from-volume setup.
 
 OK i've never used cloned snapshots so maybe this is the reason.
 
  strange i've never seen this. Which qemu version?
 
  # qemu-x86_64 -version
  qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
  Fabrice Bellard
 
  that's coming from Ubuntu 12.04 apt repos.
 
  maybe you should try qemu 1.4; there are a LOT of bugfixes. qemu-kvm does
 not
  exist anymore; it was merged into qemu with 1.3 or 1.4.
 
[jacky_he] I also encountered the same issue; ceph version is 0.56.3.
I have tried Qemu 1.3.1 and Qemu 1.4.0; a KVM VM with a format 2 cloned image
crashes.
My host OS is Ubuntu 12.04, guest OSes are CentOS 6.3 and Windows XP/Windows 7.

 Stefan



Re: deb/rpm package purge

2013-03-20 Thread Dan Mick



As a point of comparison, mysql removes the config files but not
/var/lib/mysql.
The question is, is that okay/typical/desirable/recommended/a bad idea?

  I should have asked this sooner. Do you know _any_ program that removes
your favorite music collection, your family photos or your business
emails when you uninstall it?
I suspect that your question was theoretical instead.


It's somewhat different in that the data is not owned by one user, but 
there are clear parallels.  The thing to be careful about here, IMO, is 
not only to preserve the data, but the associated files that allow

(reasonably-easy) access to that data.  (It's no good preserving the OSD
filestore if the keys, monmap, or osdmap are gone or hard to recover.)


- remove /etc/ceph if it has been untouched

  This is another case. dpkg itself handles package files here, called
conffiles. I should check the method (md5sum and/or sha1 variants) used
for the checksum on these files. On upgrade it is used to avoid overwriting
local changes made by the user. It may be worth reading a bit more about it[1]
from Raphaël Hertzog. He is the co-author of the Debian Administrator's
Handbook[2], BTW.


Excellent reference; thanks for the pointer.


  On purge dpkg will remove the package conffiles no matter what. It
won't check if those were changed or not. You may not mark the files
under /etc as conffiles, but then you'll lose the mentioned
merge logic on upgrades; dpkg will just overwrite those.
  In short, files under /var/lib/ceph are the only candidates for
in-package checksumming. How many files under there are essential for
the packages?

Laszlo/GCS
[1] 
http://raphaelhertzog.com/2010/09/21/debian-conffile-configuration-file-managed-by-dpkg/
[2] http://debian-handbook.info/




Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.

2013-03-20 Thread Yan, Zheng
On 03/21/2013 02:40 AM, Greg Farnum wrote:
 The idea of this patch makes sense, but I'm not sure if we guarantee that 
 each daemon sees every map update — if they don't, and an MDS misses the 
 map that moves another MDS into CLIENTREPLAY, it won't process that MDS as having 
 recovered on the next map. Sage or Joao, what are the guarantees subscription 
 provides?  
 -Greg

See MDS::active_start(), it also kicks clientreplay waiters. And I will fix the 
'clientreply' typo in my git tree.

Thanks
Yan, Zheng

 
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 
 From: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com)
  
 MDS in clientreply state already start servering requests. It also
 make MDS::handle_mds_recovery() and MDS::recovery_done() match.
  
 Signed-off-by: Yan, Zheng zheng.z@intel.com 
 (mailto:zheng.z@intel.com)
 ---
 src/mds/MDS.cc (http://MDS.cc) | 2 ++
 1 file changed, 2 insertions(+)
  
 diff --git a/src/mds/MDS.cc (http://MDS.cc) b/src/mds/MDS.cc (http://MDS.cc)
 index 282fa64..b91dcbd 100644
 --- a/src/mds/MDS.cc (http://MDS.cc)
 +++ b/src/mds/MDS.cc (http://MDS.cc)
 @@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
  
 setint oldactive, active;
 oldmap-get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
 + oldmap-get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
 mdsmap-get_mds_set(active, MDSMap::STATE_ACTIVE);
 + mdsmap-get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
 for (setint::iterator p = active.begin(); p != active.end(); ++p)  
 if (*p != whoami  // not me
 oldactive.count(*p) == 0) // newly so?
 --  
 1.7.11.7
 
 
 



Re: [PATCH 09/39] mds: defer eval gather locks when removing replica

2013-03-20 Thread Yan, Zheng
Will update my git tree.

Thanks
Yan, Zheng

On 03/21/2013 03:36 AM, Greg Farnum wrote:
 On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com

 Locks' states should not change between composing the cache rejoin ack
 messages and sending the message. If Locker::eval_gather() is called
 in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
 change locks' states.

 Signed-off-by: Yan, Zheng zheng.z@intel.com 
 (mailto:zheng.z@intel.com)
 ---
 src/mds/MDCache.cc (http://MDCache.cc) | 51 
 ++-
 src/mds/MDCache.h | 8 +---
 2 files changed, 35 insertions(+), 24 deletions(-)

 diff --git a/src/mds/MDCache.cc (http://MDCache.cc) b/src/mds/MDCache.cc 
 (http://MDCache.cc)
 index 19dc60b..0f6b842 100644
 --- a/src/mds/MDCache.cc (http://MDCache.cc)
 +++ b/src/mds/MDCache.cc (http://MDCache.cc)
 @@ -3729,6 +3729,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 // possible response(s)
 MMDSCacheRejoin *ack = 0; // if survivor
 setvinodeno_t acked_inodes; // if survivor
 + setSimpleLock * gather_locks; // if survivor
 bool survivor = false; // am i a survivor?

 if (mds-is_clientreplay() || mds-is_active() || mds-is_stopping()) {
 @@ -3851,7 +3852,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 assert(dnl-is_primary());

 if (survivor  dn-is_replica(from)) 
 - dentry_remove_replica(dn, from); // this induces a lock gather completion
 + dentry_remove_replica(dn, from, gather_locks); // this induces a lock 
 gather completion
 
 This comment is no longer accurate :) 
 int dnonce = dn-add_replica(from);
 dout(10)   have   *dn  dendl;
 if (ack) 
 @@ -3864,7 +3865,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 assert(in);

 if (survivor  in-is_replica(from)) 
 - inode_remove_replica(in, from);
 + inode_remove_replica(in, from, gather_locks);
 int inonce = in-add_replica(from);
 dout(10)   have   *in  dendl;

 @@ -3887,7 +3888,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin 
 *weak)
 CInode *in = get_inode(*p);
 assert(in); // hmm fixme wrt stray?
 if (survivor  in-is_replica(from)) 
 - inode_remove_replica(in, from); // this induces a lock gather completion
 + inode_remove_replica(in, from, gather_locks); // this induces a lock 
 gather completion
 
 Same here. 
 
 Other than those, looks good.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 int inonce = in-add_replica(from);
 dout(10)   have base   *in  dendl;

 @@ -3909,8 +3910,11 @@ void 
 MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
 ack-add_inode_base(in);
 }

 - rejoin_scour_survivor_replicas(from, ack, acked_inodes);
 + rejoin_scour_survivor_replicas(from, ack, gather_locks, acked_inodes);
 mds-send_message(ack, weak-get_connection());
 +
 + for (setSimpleLock*::iterator p = gather_locks.begin(); p != 
 gather_locks.end(); ++p)
 + mds-locker-eval_gather(*p);
 } else {
 // done?
 assert(rejoin_gather.count(from));
 @@ -4055,7 +4059,9 @@ bool MDCache::parallel_fetch_traverse_dir(inodeno_t 
 ino, filepath path,
 * all validated replicas are acked with a strong nonce, etc. if that isn't 
 in the
 * ack, the replica dne, and we can remove it from our replica maps.
 */
 -void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin 
 *ack, setvinodeno_t acked_inodes)
 +void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack,
 + setSimpleLock * gather_locks,
 + setvinodeno_t acked_inodes)
 {
 dout(10)  rejoin_scour_survivor_replicas from mds.  from  dendl;

 @@ -4070,7 +4076,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, 
 MMDSCacheRejoin *ack, set
 if (in-is_auth() 
 in-is_replica(from) 
 acked_inodes.count(p-second-vino()) == 0) {
 - inode_remove_replica(in, from);
 + inode_remove_replica(in, from, gather_locks);
 dout(10)   rem   *in  dendl;
 }

 @@ -4099,7 +4105,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, 
 MMDSCacheRejoin *ack, set
 if (dn-is_replica(from) 
 (ack-strong_dentries.count(dir-dirfrag()) == 0 ||
 ack-strong_dentries[dir-dirfrag()].count(string_snap_t(dn-name, 
 dn-last)) == 0)) {
 - dentry_remove_replica(dn, from);
 + dentry_remove_replica(dn, from, gather_locks);
 dout(10)   rem   *dn  dendl;
 }
 }
 @@ -6189,6 +6195,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 return;
 }

 + setSimpleLock * gather_locks;
 // loop over realms
 for (mapdirfrag_t,MCacheExpire::realm::iterator p = m-realms.begin();
 p != m-realms.end();
 @@ -6255,7 +6262,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 // remove from our cached_by
 dout(7)   inode expire on   *in   from mds.  from 
   cached_by was   in-get_replicas()  dendl;
 - inode_remove_replica(in, from);
 + inode_remove_replica(in, from, gather_locks);
 } 
 else {
 // this is an old nonce, ignore expire.
 @@ -6332,7 +6339,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)

 if (nonce == 

Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request

2013-03-20 Thread Yan, Zheng
On 03/21/2013 05:19 AM, Greg Farnum wrote:
 On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com

 Replicated objects need to be added into the cache immediately

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 Why do we need to add them right away? Shouldn't we have a journaled replica 
 if we need it?
 -Greg

The issue I encountered is that a lock action message was received, but the
replicated objects weren't in the
cache because the slave request was delayed.

Thanks
Yan, Zheng


 
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 ---
 src/mds/MDCache.cc | 12 
 src/mds/MDCache.h | 2 +-
 src/mds/MDS.cc | 6 +++---
 src/mds/Server.cc | 55 +++---
 4 files changed, 56 insertions(+), 19 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 0f6b842..b668842 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context 
 *fin, bufferlist bl, int r)

 /*  */

 +int MDCache::get_num_client_requests()
 +{
 + int count = 0;
 + for (hash_mapmetareqid_t, MDRequest*::iterator p = 
 active_requests.begin();
 + p != active_requests.end();
 + ++p) {
 + if (p-second-reqid.name.is_client()  !p-second-is_slave())
 + count++;
 + }
 + return count;
 +}
 +
 /* This function takes over the reference to the passed Message */
 MDRequest *MDCache::request_start(MClientRequest *req)
 {
 diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
 index a9f05c6..4634121 100644
 --- a/src/mds/MDCache.h
 +++ b/src/mds/MDCache.h
 @@ -240,7 +240,7 @@ protected:
 hash_mapmetareqid_t, MDRequest* active_requests; 

 public:
 - int get_num_active_requests() { return active_requests.size(); }
 + int get_num_client_requests();

 MDRequest* request_start(MClientRequest *req);
 MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);
 diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
 index b91dcbd..e99eecc 100644
 --- a/src/mds/MDS.cc
 +++ b/src/mds/MDS.cc
 @@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
 mdcache-is_open() 
 replay_queue.empty() 
 want_state == MDSMap::STATE_CLIENTREPLAY) {
 - dout(10)   still have   mdcache-get_num_active_requests()
 -   active replay requests  dendl;
 - if (mdcache-get_num_active_requests() == 0)
 + int num_requests = mdcache-get_num_client_requests();
 + dout(10)   still have   num_requests   active replay requests  
 dendl;
 + if (num_requests == 0)
 clientreplay_done();
 }

 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index 4c4c86b..8e89e4c 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
 (m-get_type() == CEPH_MSG_CLIENT_REQUEST 
 (static_castMClientRequest*(m))-is_replay( {
 // replaying!
 - } else if (mds-is_clientreplay()  m-get_type() == 
 MSG_MDS_SLAVE_REQUEST 
 - ((static_castMMDSSlaveRequest*(m))-is_reply() ||
 - !mds-mdsmap-is_active(m-get_source().num( {
 - // slave reply or the master is also in the clientreplay stage
 + } else if (m-get_type() == MSG_MDS_SLAVE_REQUEST) {
 + // handle_slave_request() will wait if necessary
 } else {
 dout(3)  not active yet, waiting  dendl;
 mds-wait_for_active(new C_MDS_RetryMessage(mds, m));
 @@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
 if (m-is_reply())
 return handle_slave_request_reply(m);

 + CDentry *straydn = NULL;
 + if (m-stray.length()  0) {
 + straydn = mdcache-add_replica_stray(m-stray, from);
 + assert(straydn);
 + m-stray.clear();
 + }
 +
 // am i a new slave?
 MDRequest *mdr = NULL;
 if (mdcache-have_request(m-get_reqid())) {
 @@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
 m-put();
 return;
 }
 - mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), 
 m-get_source().num());
 + mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), from);
 }
 assert(mdr-slave_request == 0); // only one at a time, please! 
 +
 + if (straydn) {
 + mdr-pin(straydn);
 + mdr-straydn = straydn;
 + }
 +
 + if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 + dout(3)  not clientreplay|active yet, waiting  dendl;
 + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 + return;
 + } else if (mds-is_clientreplay()  !mds-mdsmap-is_clientreplay(from) 
 + mdr-locks.empty()) {
 + dout(3)  not active yet, waiting  dendl;
 + mds-wait_for_active(new C_MDS_RetryMessage(mds, m));
 + return;
 + }
 +
 mdr-slave_request = m;

 dispatch_slave_request(mdr);
 @@ -1339,6 +1361,12 @@ void 
 Server::handle_slave_request_reply(MMDSSlaveRequest *m)
 {
 int from = m-get_source().num();

 + if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 + dout(3)  not clientreplay|active yet, waiting  dendl;
 + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 + return;
 + }
 +
 if (m-get_op() == MMDSSlaveRequest::OP_COMMITTED) {
 metareqid_t r = m-get_reqid();
 

Re: [PATCH 13/39] mds: don't send resolve message between active MDS

2013-03-20 Thread Yan, Zheng
On 03/21/2013 05:56 AM, Gregory Farnum wrote:
 On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 When MDS cluster is resolving, current behavior is sending subtree resolve
 message to all other MDS and waiting for all other MDS' resolve message.
 The problem is that an active MDS can have a different subtree map due to rename.
 Besides, gathering the active MDSes' resolve messages is also racy. The only
 function of these messages is to disambiguate other MDSes' imports. We can
 replace that with an import-finish notification.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc  | 12 +---
  src/mds/Migrator.cc | 25 +++--
  src/mds/Migrator.h  |  3 ++-
  3 files changed, 34 insertions(+), 6 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index c455a20..73c1d59 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -2517,7 +2517,8 @@ void MDCache::send_subtree_resolves()
 ++p) {
  if (*p == mds-whoami)
continue;
 -resolves[*p] = new MMDSResolve;
 +if (mds-is_resolve() || mds-mdsmap-is_resolve(*p))
 +  resolves[*p] = new MMDSResolve;
}

// known
 @@ -2837,7 +2838,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
   migrator-import_reverse(dir);
 } else {
   dout(7)  ambiguous import succeeded on   *dir  dendl;
 - migrator-import_finish(dir);
 + migrator-import_finish(dir, true);
 }
 my_ambiguous_imports.erase(p);  // no longer ambiguous.
}
 @@ -3432,7 +3433,12 @@ void MDCache::rejoin_send_rejoins()
 ++p) {
  CDir *dir = p-first;
  assert(dir-is_subtree_root());
 -assert(!dir-is_ambiguous_dir_auth());
 +if (dir-is_ambiguous_dir_auth()) {
 +  // exporter is recovering, importer is survivor.
 
 The importer has to be the MDS this code is running on, right?

This code is for bystanders. The exporter is recovering, and its resolve 
message didn't claim
the subtree. So the export must succeed.

 
 +  assert(rejoins.count(dir-authority().first));
 +  assert(!rejoins.count(dir-authority().second));
 +  continue;
 +}

  // my subtree?
  if (dir-is_auth())
 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 5e53803..833df12 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -2088,6 +2088,23 @@ void Migrator::import_reverse(CDir *dir)
}
  }

 +void Migrator::import_notify_finish(CDir *dir, setCDir* bounds)
 +{
 +  dout(7)  import_notify_finish   *dir  dendl;
 +
 +  for (setint::iterator p = import_bystanders[dir].begin();
 +   p != import_bystanders[dir].end();
 +   ++p) {
 +MExportDirNotify *notify =
 +  new MExportDirNotify(dir-dirfrag(), false,
 +  pairint,int(import_peer[dir-dirfrag()], 
 mds-get_nodeid()),
 +  pairint,int(mds-get_nodeid(), 
 CDIR_AUTH_UNKNOWN));
 
 I don't think this is quite right — we're notifying them that we've
 just finished importing data from somebody, right? And so we know that
 we're the auth node...

Yes. In the normal case, the exporter notifies the bystanders. But if the exporter crashes,
the importer notifies
the bystanders after it confirms the ambiguous import succeeded.

Thanks
Yan, Zheng

 
 +for (setCDir*::iterator i = bounds.begin(); i != bounds.end(); i++)
 +  notify-get_bounds().push_back((*i)-dirfrag());
 +mds-send_message_mds(notify, *p);
 +  }
 +}
 +
  void Migrator::import_notify_abort(CDir *dir, setCDir* bounds)
  {
dout(7)  import_notify_abort   *dir  dendl;
 @@ -2183,11 +2200,11 @@ void Migrator::handle_export_finish(MExportDirFinish 
 *m)
CDir *dir = cache-get_dirfrag(m-get_dirfrag());
assert(dir);
dout(7)  handle_export_finish on   *dir  dendl;
 -  import_finish(dir);
 +  import_finish(dir, false);
m-put();
  }

 -void Migrator::import_finish(CDir *dir)
 +void Migrator::import_finish(CDir *dir, bool notify)
  {
dout(7)  import_finish on   *dir  dendl;

 @@ -2205,6 +,10 @@ void Migrator::import_finish(CDir *dir)
// remove pins
setCDir* bounds;
cache-get_subtree_bounds(dir, bounds);
 +
 +  if (notify)
 +import_notify_finish(dir, bounds);
 +
import_remove_pins(dir, bounds);

mapCInode*, mapclient_t,Capability::Export  cap_imports;
 diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
 index 7988f32..2889a74 100644
 --- a/src/mds/Migrator.h
 +++ b/src/mds/Migrator.h
 @@ -273,12 +273,13 @@ protected:
void import_reverse_unfreeze(CDir *dir);
void import_reverse_final(CDir *dir);
void import_notify_abort(CDir *dir, setCDir* bounds);
 +  void import_notify_finish(CDir *dir, setCDir* bounds);
void import_logged_start(dirfrag_t df, CDir *dir, int from,
mapclient_t,entity_inst_t imported_client_map,
mapclient_t,uint64_t sseqmap);
void handle_export_finish(MExportDirFinish *m);
  public:
 -  

Re: [PATCH 27/39] mds: send lock action message when auth MDS is in proper state.

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 For rejoining object, don't send lock ACK message because lock states
 are still uncertain. The lock ACK may confuse object's auth MDS and
 trigger assertion.

 If object's auth MDS is not active, just skip sending NUDGE, REQRDLOCK
 and REQSCATTER messages. MDCache::handle_mds_recovery() will take care
 of them.

 Also defer caps release message until clientreplay or active

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Locker.cc  | 46 ++
  src/mds/MDCache.cc | 13 +++--
  2 files changed, 41 insertions(+), 18 deletions(-)

 diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
 index 28920d4..ece39e3 100644
 --- a/src/mds/Locker.cc
 +++ b/src/mds/Locker.cc
 @@ -658,6 +658,13 @@ void Locker::eval_gather(SimpleLock *lock, bool first, 
 bool *pneed_issue, listC
// replica: tell auth
int auth = lock-get_parent()-authority().first;

 +  if (lock-get_parent()-is_rejoining() 
 + mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) {
 +   dout(7)  eval_gather finished gather, but still rejoining 
 +*lock-get_parent()  dendl;
 +   return;
 +  }
 +
if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN) {
 switch (lock-get_state()) {
 case LOCK_SYNC_LOCK:
 @@ -1050,9 +1057,11 @@ bool Locker::_rdlock_kick(SimpleLock *lock, bool 
 as_anon)
  } else {
// request rdlock state change from auth
int auth = lock-get_parent()-authority().first;
 -  dout(10)  requesting rdlock from auth on 
 -   *lock   on   *lock-get_parent()  dendl;
 -  mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, 
 mds-get_nodeid()), auth);
 +  if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) {
 +   dout(10)  requesting rdlock from auth on 
 + *lock   on   *lock-get_parent()  dendl;
 +   mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, 
 mds-get_nodeid()), auth);
 +  }
return false;
  }
}
 @@ -1272,9 +1281,11 @@ bool Locker::wrlock_start(SimpleLock *lock, MDRequest 
 *mut, bool nowait)
// replica.
// auth should be auth_pinned (see acquire_locks wrlock weird mustpin 
 case).
int auth = lock-get_parent()-authority().first;
 -  dout(10)  requesting scatter from auth on 
 -   *lock   on   *lock-get_parent()  dendl;
 -  mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, 
 mds-get_nodeid()), auth);
 +  if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) {
 +   dout(10)  requesting scatter from auth on 
 + *lock   on   *lock-get_parent()  dendl;
 +   mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, 
 mds-get_nodeid()), auth);
 +  }
break;
  }
}
 @@ -1899,13 +1910,19 @@ void Locker::request_inode_file_caps(CInode *in)
  }

  int auth = in-authority().first;
 +if (in-is_rejoining() 
 +   mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) {
 +  mds-wait_for_active_peer(auth, new C_MDL_RequestInodeFileCaps(this, 
 in));
 +  return;
 +}
 +
  dout(7)  request_inode_file_caps   ccap_string(wanted)
was   ccap_string(in-replica_caps_wanted)
on   *in   to mds.  auth  dendl;

  in-replica_caps_wanted = wanted;

 -if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN)
 +if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth))
mds-send_message_mds(new MInodeFileCaps(in-ino(), 
 in-replica_caps_wanted),
 auth);
}
 @@ -1924,14 +1941,6 @@ void Locker::handle_inode_file_caps(MInodeFileCaps *m)
assert(in);
assert(in-is_auth());

 -  if (mds-is_rejoin() 
 -  in-is_rejoining()) {
 -dout(7)  handle_inode_file_caps still rejoining   *in  , 
 dropping   *m  dendl;
 -m-put();
 -return;
 -  }

This is okay since we catch it in the follow-on functions (I assume
that's why you removed it, to avoid checks at more levels than
necessary), but if you could note that's why in the commit message
it'll prevent anyone else from needing to go check like I did. :)

The code looks good.
Reviewed-by: Greg Farnum g...@inktank.com

 -
 -
dout(7)  handle_inode_file_caps replica mds.  from   wants caps  
  ccap_string(m-get_caps())   on   *in  dendl;

if (m-get_caps())
 @@ -2850,6 +2859,11 @@ void 
 Locker::handle_client_cap_release(MClientCapRelease *m)
client_t client = m-get_source().num();
dout(10)  handle_client_cap_release   *m  dendl;

 +  if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 +mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 +return;
 +  }
 +
for (vectorceph_mds_cap_item::iterator p = m-caps.begin(); p != 
 m-caps.end(); ++p) {
  inodeno_t ino((uint64_t)p-ino);
  CInode *in = mdcache-get_inode(ino);
 

Re: [PATCH 28/39] mds: add dirty imported dirfrag to LogSegment

2013-03-20 Thread Gregory Farnum
Whoops!
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/CDir.cc | 7 +--
  src/mds/CDir.h  | 2 +-
  src/mds/Migrator.cc | 2 +-
  3 files changed, 7 insertions(+), 4 deletions(-)

 diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
 index af0ae9c..34bd8d3 100644
 --- a/src/mds/CDir.cc
 +++ b/src/mds/CDir.cc
 @@ -2164,7 +2164,7 @@ void CDir::finish_export(utime_t now)
dirty_old_rstat.clear();
  }

 -void CDir::decode_import(bufferlist::iterator blp, utime_t now)
 +void CDir::decode_import(bufferlist::iterator blp, utime_t now, LogSegment 
 *ls)
  {
::decode(first, blp);
::decode(fnode, blp);
 @@ -2177,7 +2177,10 @@ void CDir::decode_import(bufferlist::iterator blp, 
 utime_t now)
::decode(s, blp);
state = MASK_STATE_IMPORT_KEPT;
state |= (s  MASK_STATE_EXPORTED);
 -  if (is_dirty()) get(PIN_DIRTY);
 +  if (is_dirty()) {
 +get(PIN_DIRTY);
 +_mark_dirty(ls);
 +  }

::decode(dir_rep, blp);

 diff --git a/src/mds/CDir.h b/src/mds/CDir.h
 index f4a3a3d..7e1db73 100644
 --- a/src/mds/CDir.h
 +++ b/src/mds/CDir.h
 @@ -550,7 +550,7 @@ public:
void abort_export() {
  put(PIN_TEMPEXPORTING);
}
 -  void decode_import(bufferlist::iterator blp, utime_t now);
 +  void decode_import(bufferlist::iterator blp, utime_t now, LogSegment *ls);


// -- auth pins --
 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 833df12..d626cb1 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -2397,7 +2397,7 @@ int Migrator::decode_import_dir(bufferlist::iterator 
 blp,
dout(7)  decode_import_dir   *dir  dendl;

// assimilate state
 -  dir-decode_import(blp, now);
 +  dir-decode_import(blp, now, ls);

// mark  (may already be marked from get_or_open_dir() above)
if (!dir-is_auth())
 --
 1.7.11.7



Re: [PATCH 29/39] mds: avoid double auth pin for file recovery

2013-03-20 Thread Gregory Farnum
This looks good on its face but I haven't had the chance to dig
through the recovery queue stuff yet (it's on my list following some
issues with recovery speed). How'd you run across this? If it's being
added to the recovery queue multiple times I want to make sure we
don't have some other machinery trying to dequeue it multiple times,
or a single waiter which needs to be a list or something.
-Greg

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 973a4d0..e9a79cd 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -5502,8 +5502,10 @@ void MDCache::_queue_file_recover(CInode *in)
dout(15)  _queue_file_recover   *in  dendl;
assert(in-is_auth());
in-state_clear(CInode::STATE_NEEDSRECOVER);
 -  in-state_set(CInode::STATE_RECOVERING);
 -  in-auth_pin(this);
 +  if (!in-state_test(CInode::STATE_RECOVERING)) {
 +in-state_set(CInode::STATE_RECOVERING);
 +in-auth_pin(this);
 +  }
file_recover_queue.insert(in);
  }

 --
 1.7.11.7



Re: [PATCH 27/39] mds: send lock action message when auth MDS is in proper state.

2013-03-20 Thread Yan, Zheng
On 03/21/2013 11:12 AM, Gregory Farnum wrote:
 On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 For rejoining object, don't send lock ACK message because lock states
 are still uncertain. The lock ACK may confuse object's auth MDS and
 trigger assertion.

 If object's auth MDS is not active, just skip sending NUDGE, REQRDLOCK
 and REQSCATTER messages. MDCache::handle_mds_recovery() will take care
 of them.

 Also defer caps release message until clientreplay or active

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Locker.cc  | 46 ++
  src/mds/MDCache.cc | 13 +++--
  2 files changed, 41 insertions(+), 18 deletions(-)

 diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
 index 28920d4..ece39e3 100644
 --- a/src/mds/Locker.cc
 +++ b/src/mds/Locker.cc
 @@ -658,6 +658,13 @@ void Locker::eval_gather(SimpleLock *lock, bool first, 
 bool *pneed_issue, listC
// replica: tell auth
int auth = lock-get_parent()-authority().first;

 +  if (lock-get_parent()-is_rejoining() 
 + mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) {
 +   dout(7)  eval_gather finished gather, but still rejoining 
 +*lock-get_parent()  dendl;
 +   return;
 +  }
 +
if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN) {
 switch (lock-get_state()) {
 case LOCK_SYNC_LOCK:
 @@ -1050,9 +1057,11 @@ bool Locker::_rdlock_kick(SimpleLock *lock, bool 
 as_anon)
  } else {
// request rdlock state change from auth
int auth = lock-get_parent()-authority().first;
 -  dout(10)  requesting rdlock from auth on 
 -   *lock   on   *lock-get_parent()  dendl;
 -  mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, 
 mds-get_nodeid()), auth);
 +  if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) {
 +   dout(10)  requesting rdlock from auth on 
 + *lock   on   *lock-get_parent()  dendl;
 +   mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, 
 mds-get_nodeid()), auth);
 +  }
return false;
  }
}
 @@ -1272,9 +1281,11 @@ bool Locker::wrlock_start(SimpleLock *lock, MDRequest 
 *mut, bool nowait)
// replica.
// auth should be auth_pinned (see acquire_locks wrlock weird mustpin 
 case).
int auth = lock-get_parent()-authority().first;
 -  dout(10)  requesting scatter from auth on 
 -   *lock   on   *lock-get_parent()  dendl;
 -  mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, 
 mds-get_nodeid()), auth);
 +  if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) {
 +   dout(10)  requesting scatter from auth on 
 + *lock   on   *lock-get_parent()  dendl;
 +   mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, 
 mds-get_nodeid()), auth);
 +  }
break;
  }
}
 @@ -1899,13 +1910,19 @@ void Locker::request_inode_file_caps(CInode *in)
  }

  int auth = in-authority().first;
 +if (in-is_rejoining() 
 +   mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) {
 +  mds-wait_for_active_peer(auth, new C_MDL_RequestInodeFileCaps(this, 
 in));
 +  return;
 +}
 +
  dout(7)  request_inode_file_caps   ccap_string(wanted)
was   ccap_string(in-replica_caps_wanted)
on   *in   to mds.  auth  dendl;

  in-replica_caps_wanted = wanted;

 -if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN)
 +if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth))
mds-send_message_mds(new MInodeFileCaps(in-ino(), 
 in-replica_caps_wanted),
 auth);
}
 @@ -1924,14 +1941,6 @@ void Locker::handle_inode_file_caps(MInodeFileCaps *m)
assert(in);
assert(in-is_auth());

 -  if (mds-is_rejoin() 
 -  in-is_rejoining()) {
 -dout(7)  handle_inode_file_caps still rejoining   *in  , 
 dropping   *m  dendl;
 -m-put();
 -return;
 -  }
 
 This is okay since we catch it in the follow-on functions (I assume
 that's why you removed it, to avoid checks at more levels than
 necessary), but if you could note that's why in the commit message
 it'll prevent anyone else from needing to go check like I did. :)
 

If an inode is auth, it cannot be rejoining. That's why I removed it.

Thanks
Yan, Zheng


 The code looks good.
 Reviewed-by: Greg Farnum g...@inktank.com
 
 -
 -
dout(7)  handle_inode_file_caps replica mds.  from   wants caps 
   ccap_string(m-get_caps())   on   *in  dendl;

if (m-get_caps())
 @@ -2850,6 +2859,11 @@ void 
 Locker::handle_client_cap_release(MClientCapRelease *m)
client_t client = m-get_source().num();
dout(10)  handle_client_cap_release   *m  dendl;

 +  if (!mds-is_clientreplay()  !mds-is_active()  !mds-is_stopping()) {
 +mds-wait_for_replay(new C_MDS_RetryMessage(mds, m));
 +return;
 +  }
 +
for 

Re: [PATCH 30/39] mds: check MDS peer's state through mdsmap

2013-03-20 Thread Gregory Farnum
Yep.
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Migrator.cc | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index d626cb1..143d71e 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -238,7 +238,7 @@ void Migrator::handle_mds_failure_or_stop(int who)
 export_unlock(dir);
 export_locks.erase(dir);
 dir-state_clear(CDir::STATE_EXPORTING);
 -   if (export_peer[dir] != who) // tell them.
 +   if 
 (mds-mdsmap-is_clientreplay_or_active_or_stopping(export_peer[dir])) // 
 tell them.
   mds-send_message_mds(new MExportDirCancel(dir-dirfrag()), 
 export_peer[dir]);
 break;

 @@ -247,7 +247,7 @@ void Migrator::handle_mds_failure_or_stop(int who)
 dir-unfreeze_tree();  // cancel the freeze
 export_state.erase(dir); // clean up
 dir-state_clear(CDir::STATE_EXPORTING);
 -   if (export_peer[dir] != who) // tell them.
 +   if 
 (mds-mdsmap-is_clientreplay_or_active_or_stopping(export_peer[dir])) // 
 tell them.
   mds-send_message_mds(new MExportDirCancel(dir-dirfrag()), 
 export_peer[dir]);
 break;

 @@ -278,7 +278,7 @@ void Migrator::handle_mds_failure_or_stop(int who)
 export_unlock(dir);
 export_locks.erase(dir);
 dir-state_clear(CDir::STATE_EXPORTING);
 -   if (export_peer[dir] != who) // tell them.
 +   if 
 (mds-mdsmap-is_clientreplay_or_active_or_stopping(export_peer[dir])) // 
 tell them.
   mds-send_message_mds(new MExportDirCancel(dir-dirfrag()), 
 export_peer[dir]);
 break;

 --
 1.7.11.7



Re: [PATCH 31/39] mds: unfreeze subtree if import aborts in PREPPED state

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Migrator.cc | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 143d71e..963706c 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -1658,11 +1658,14 @@ void Migrator::handle_export_cancel(MExportDirCancel *m)
      CInode *in = cache->get_inode(df.ino);
      assert(in);
      import_reverse_discovered(df, in);
 -  } else if (import_state[df] == IMPORT_PREPPING ||
 -             import_state[df] == IMPORT_PREPPED) {
 +  } else if (import_state[df] == IMPORT_PREPPING) {
      CDir *dir = mds->mdcache->get_dirfrag(df);
      assert(dir);
      import_reverse_prepping(dir);
 +  } else if (import_state[df] == IMPORT_PREPPED) {
 +    CDir *dir = mds->mdcache->get_dirfrag(df);
 +    assert(dir);
 +    import_reverse_unfreeze(dir);
    } else {
      assert(0 == "got export_cancel in weird state");
    }
 --
 1.7.11.7



Re: [PATCH 32/39] mds: fix export cancel notification

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 The comment says that if the importer is dead, bystanders think the exporter
 is the only auth, as per mdcache->handle_mds_failure(). But there is no such
 code in MDCache::handle_mds_failure().

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Migrator.cc | 20 +---
  1 file changed, 5 insertions(+), 15 deletions(-)

 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 963706c..40a5394 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -1390,17 +1390,9 @@ void Migrator::export_logged_finish(CDir *dir)
    for (set<int>::iterator p = export_notify_ack_waiting[dir].begin();
         p != export_notify_ack_waiting[dir].end();
         ++p) {
 -    MExportDirNotify *notify;
 -    if (mds->mdsmap->is_clientreplay_or_active_or_stopping(export_peer[dir]))
 -      // dest is still alive.
 -      notify = new MExportDirNotify(dir->dirfrag(), true,
 -                                    pair<int,int>(mds->get_nodeid(), dest),
 -                                    pair<int,int>(dest, CDIR_AUTH_UNKNOWN));
 -    else
 -      // dest is dead.  bystanders will think i am only auth, as per mdcache->handle_mds_failure()
 -      notify = new MExportDirNotify(dir->dirfrag(), true,
 -                                    pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN),
 -                                    pair<int,int>(dest, CDIR_AUTH_UNKNOWN));
 +    MExportDirNotify *notify = new MExportDirNotify(dir->dirfrag(), true,
 +                                    pair<int,int>(mds->get_nodeid(), dest),
 +                                    pair<int,int>(dest, CDIR_AUTH_UNKNOWN));

      for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); i++)
        notify->get_bounds().push_back((*i)->dirfrag());
 @@ -2115,11 +2107,9 @@ void Migrator::import_notify_abort(CDir *dir, set<CDir*>& bounds)
    for (set<int>::iterator p = import_bystanders[dir].begin();
         p != import_bystanders[dir].end();
         ++p) {
 -    // NOTE: the bystander will think i am _only_ auth, because they will have seen
 -    // the exporter's failure and updated the subtree auth.  see mdcache->handle_mds_failure().
 -    MExportDirNotify *notify =
 +    MExportDirNotify *notify =
        new MExportDirNotify(dir->dirfrag(), true,
 -                           pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN),
 +                           pair<int,int>(import_peer[dir->dirfrag()], mds->get_nodeid()),
                             pair<int,int>(import_peer[dir->dirfrag()], CDIR_AUTH_UNKNOWN));
      for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); i++)
        notify->get_bounds().push_back((*i)->dirfrag());
 --
 1.7.11.7



Re: [PATCH 29/39] mds: avoid double auth pin for file recovery

2013-03-20 Thread Yan, Zheng
On 03/21/2013 11:20 AM, Gregory Farnum wrote:
 This looks good on its face but I haven't had the chance to dig
 through the recovery queue stuff yet (it's on my list following some
 issues with recovery speed). How'd you run across this? If it's being
 added to the recovery queue multiple times I want to make sure we
 don't have some other machinery trying to dequeue it multiple times,
 or a single waiter which needs to be a list or something.
 -Greg

Two clients that were writing the same file crashed successively.

Thanks,
Yan, Zheng

 
 On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 973a4d0..e9a79cd 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -5502,8 +5502,10 @@ void MDCache::_queue_file_recover(CInode *in)
    dout(15) << "_queue_file_recover " << *in << dendl;
    assert(in->is_auth());
    in->state_clear(CInode::STATE_NEEDSRECOVER);
 -  in->state_set(CInode::STATE_RECOVERING);
 -  in->auth_pin(this);
 +  if (!in->state_test(CInode::STATE_RECOVERING)) {
 +    in->state_set(CInode::STATE_RECOVERING);
 +    in->auth_pin(this);
 +  }
    file_recover_queue.insert(in);
  }

 --
 1.7.11.7




Re: [PATCH 33/39] mds: notify bystanders if export aborts

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 So bystanders know the subtree is single auth earlier.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Migrator.cc | 34 ++
  src/mds/Migrator.h  |  1 +
  2 files changed, 27 insertions(+), 8 deletions(-)

 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 40a5394..0672d03 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -251,25 +251,28 @@ void Migrator::handle_mds_failure_or_stop(int who)
          mds->send_message_mds(new MExportDirCancel(dir->dirfrag()), export_peer[dir]);
        break;

 -      // NOTE: state order reversal, warning comes after loggingstart+prepping
 +      // NOTE: state order reversal, warning comes after prepping
      case EXPORT_WARNING:
        dout(10) << "export state=warning : unpinning bounds, unfreezing, notifying" << dendl;
        // fall-thru

      case EXPORT_PREPPING:
        if (p->second != EXPORT_WARNING)
 -        dout(10) << "export state=loggingstart|prepping : unpinning bounds, unfreezing" << dendl;
 +        dout(10) << "export state=prepping : unpinning bounds, unfreezing" << dendl;
        {
          // unpin bounds
          set<CDir*> bounds;
          cache->get_subtree_bounds(dir, bounds);
 -        for (set<CDir*>::iterator p = bounds.begin();
 -             p != bounds.end();
 -             ++p) {
 -          CDir *bd = *p;
 +        for (set<CDir*>::iterator q = bounds.begin();
 +             q != bounds.end();
 +             ++q) {
 +          CDir *bd = *q;
            bd->put(CDir::PIN_EXPORTBOUND);
            bd->state_clear(CDir::STATE_EXPORTBOUND);
          }
 +        // notify bystanders
 +        if (p->second == EXPORT_WARNING)
 +          export_notify_abort(dir, bounds);
        }
        dir->unfreeze_tree();
        export_state.erase(dir); // clean up
 @@ -1307,9 +1310,21 @@ void Migrator::handle_export_ack(MExportDirAck *m)
    m->put();
  }

 +void Migrator::export_notify_abort(CDir *dir, set<CDir*>& bounds)
 +{
 +  dout(7) << "export_notify_abort " << *dir << dendl;

 -
 -
 +  for (set<int>::iterator p = export_notify_ack_waiting[dir].begin();
 +       p != export_notify_ack_waiting[dir].end();
 +       ++p) {
 +    MExportDirNotify *notify = new MExportDirNotify(dir->dirfrag(), false,
 +                                    pair<int,int>(mds->get_nodeid(),export_peer[dir]),
 +                                    pair<int,int>(mds->get_nodeid(),CDIR_AUTH_UNKNOWN));
 +    for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); ++i)
 +      notify->get_bounds().push_back((*i)->dirfrag());
 +    mds->send_message_mds(notify, *p);
 +  }
 +}

  /*
   * this happens if hte dest failes after i send teh export data but before it is acked
 @@ -1356,6 +1371,9 @@ void Migrator::export_reverse(CDir *dir)
      bd->state_clear(CDir::STATE_EXPORTBOUND);
    }

 +  // notify bystanders
 +  export_notify_abort(dir, bounds);
 +
    // process delayed expires
    cache->process_delayed_expire(dir);

 diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
 index 2889a74..f395bc1 100644
 --- a/src/mds/Migrator.h
 +++ b/src/mds/Migrator.h
 @@ -227,6 +227,7 @@ public:
    void export_go(CDir *dir);
    void export_go_synced(CDir *dir);
    void export_reverse(CDir *dir);
 +  void export_notify_abort(CDir *dir, set<CDir*>& bounds);
    void handle_export_ack(MExportDirAck *m);
    void export_logged_finish(CDir *dir);
    void handle_export_notify_ack(MExportDirNotifyAck *m);
 --
 1.7.11.7



Re: [PATCH 34/39] mds: don't open dirfrag while subtree is frozen

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index e9a79cd..30687ec 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -7101,9 +7101,9 @@ int MDCache::path_traverse(MDRequest *mdr, Message *req, Context *fin, // wh
      if (!curdir) {
        if (cur->is_auth()) {
          // parent dir frozen_dir?
 -        if (cur->is_frozen_dir()) {
 -          dout(7) << "traverse: " << *cur->get_parent_dir() << " is frozen_dir, waiting" << dendl;
 -          cur->get_parent_dn()->get_dir()->add_waiter(CDir::WAIT_UNFREEZE, _get_waiter(mdr, req, fin));
 +        if (cur->is_frozen()) {
 +          dout(7) << "traverse: " << *cur << " is frozen, waiting" << dendl;
 +          cur->add_waiter(CDir::WAIT_UNFREEZE, _get_waiter(mdr, req, fin));
            return 1;
          }
          curdir = cur->get_or_open_dirfrag(this, fg);
 --
 1.7.11.7



Re: [PATCH 35/39] mds: clear dirty inode rstat if import fails

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/CDir.cc | 1 +
  src/mds/Migrator.cc | 2 ++
  2 files changed, 3 insertions(+)

 diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
 index 34bd8d3..47b6753 100644
 --- a/src/mds/CDir.cc
 +++ b/src/mds/CDir.cc
 @@ -1022,6 +1022,7 @@ void CDir::assimilate_dirty_rstat_inodes()
    for (elist<CInode*>::iterator p = dirty_rstat_inodes.begin_use_current();
         !p.end(); ++p) {
      CInode *in = *p;
 +    assert(in->is_auth());
      if (in->is_frozen())
        continue;

 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 0672d03..f563b8d 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -2052,6 +2052,8 @@ void Migrator::import_reverse(CDir *dir)
        in->clear_replica_map();
        if (in->is_dirty())
          in->mark_clean();
 +      in->clear_dirty_rstat();
 +
        in->authlock.clear_gather();
        in->linklock.clear_gather();
        in->dirfragtreelock.clear_gather();
 --
 1.7.11.7



Re: [PATCH 36/39] mds: try merging subtree after clear EXPORTBOUND

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Migrator.cc | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index f563b8d..9cbad87 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -1340,10 +1340,6 @@ void Migrator::export_reverse(CDir *dir)
    set<CDir*> bounds;
    cache->get_subtree_bounds(dir, bounds);

 -  // adjust auth, with possible subtree merge.
 -  cache->adjust_subtree_auth(dir, mds->get_nodeid());
 -  cache->try_subtree_merge(dir);  // NOTE: may journal subtree_map as side-effect
 -
    // remove exporting pins
    list<CDir*> rq;
    rq.push_back(dir);
 @@ -1371,6 +1367,10 @@ void Migrator::export_reverse(CDir *dir)
      bd->state_clear(CDir::STATE_EXPORTBOUND);
    }

 +  // adjust auth, with possible subtree merge.
 +  cache->adjust_subtree_auth(dir, mds->get_nodeid());
 +  cache->try_subtree_merge(dir);  // NOTE: may journal subtree_map as side-effect
 +
    // notify bystanders
    export_notify_abort(dir, bounds);

 --
 1.7.11.7



Re: [PATCH 37/39] mds: eval inodes with caps imported by cache rejoin message

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 1 +
  1 file changed, 1 insertion(+)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 30687ec..24f1109 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3823,6 +3823,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
        dout(10) << " claiming cap import " << p->first << " client." << q->first << " on " << *in << dendl;
        rejoin_import_cap(in, q->first, q->second, from);
      }
 +    mds->locker->eval(in, CEPH_CAP_LOCKS, true);
    }
  } else {
    assert(mds->is_rejoin());
 --
 1.7.11.7



Re: [PATCH 38/39] mds: don't replicate purging dentry

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 open_remote_ino is racy: it's possible that someone deletes the inode's
 last linkage while the MDS is discovering the inode.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/MDCache.cc | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index 24f1109..d730ff1 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -9225,8 +9225,15 @@ void MDCache::handle_discover(MDiscover *dis)
      if (dis->get_want_ino()) {
        // lookup by ino
        CInode *in = get_inode(dis->get_want_ino(), snapid);
 -      if (in && in->is_auth() && in->get_parent_dn()->get_dir() == curdir)
 +      if (in && in->is_auth() && in->get_parent_dn()->get_dir() == curdir) {
          dn = in->get_parent_dn();
 +        if (dn->state_test(CDentry::STATE_PURGING)) {
 +          // set error flag in reply
 +          dout(7) << "dentry " << *dn << " is purging, flagging error ino" << dendl;
 +          reply->set_flag_error_ino();
 +          break;
 +        }
 +      }
      } else if (dis->get_want().depth() > 0) {
        // lookup dentry
        dn = curdir->lookup(dis->get_dentry(i), snapid);
 --
 1.7.11.7



Re: [PATCH 39/39] mds: clear scatter dirty if replica inode has no auth subtree

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
 From: Yan, Zheng zheng.z@intel.com

 This avoids sending superfluous scatterlock state to the recovering MDS.

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/CInode.cc   |  5 +++--
  src/mds/CInode.h|  2 +-
  src/mds/MDCache.cc  | 13 ++---
  src/mds/Migrator.cc | 15 +++
  4 files changed, 25 insertions(+), 10 deletions(-)

 diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
 index 42137f3..25cb6c1 100644
 --- a/src/mds/CInode.cc
 +++ b/src/mds/CInode.cc
 @@ -615,12 +615,13 @@ void CInode::close_dirfrags()
      close_dirfrag(dirfrags.begin()->first);
  }

 -bool CInode::has_subtree_root_dirfrag()
 +bool CInode::has_subtree_root_dirfrag(int auth)
  {
    for (map<frag_t,CDir*>::iterator p = dirfrags.begin();
         p != dirfrags.end();
         ++p)
 -    if (p->second->is_subtree_root())
 +    if (p->second->is_subtree_root() &&
 +        (auth == -1 || p->second->dir_auth.first == auth))
        return true;
    return false;
  }
 diff --git a/src/mds/CInode.h b/src/mds/CInode.h
 index f7b8f33..bea7430 100644
 --- a/src/mds/CInode.h
 +++ b/src/mds/CInode.h
 @@ -344,7 +344,7 @@ public:
CDir *add_dirfrag(CDir *dir);
void close_dirfrag(frag_t fg);
void close_dirfrags();
 -  bool has_subtree_root_dirfrag();
 +  bool has_subtree_root_dirfrag(int auth=-1);

void force_dirfrags();
void verify_dirfrags();
 diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
 index d730ff1..75c7ded 100644
 --- a/src/mds/MDCache.cc
 +++ b/src/mds/MDCache.cc
 @@ -3330,8 +3330,10 @@ void MDCache::recalc_auth_bits()
    set<CInode*> subtree_inodes;
    for (map<CDir*,set<CDir*> >::iterator p = subtrees.begin();
         p != subtrees.end();
 -       ++p)
 -    subtree_inodes.insert(p->first->inode);
 +       ++p) {
 +    if (p->first->dir_auth.first == mds->get_nodeid())
 +      subtree_inodes.insert(p->first->inode);
 +  }

    for (map<CDir*,set<CDir*> >::iterator p = subtrees.begin();
         p != subtrees.end();
 @@ -3390,11 +3392,8 @@ void MDCache::recalc_auth_bits()
          if (dnl->get_inode()->is_dirty())
            dnl->get_inode()->mark_clean();
          // avoid touching scatterlocks for our subtree roots!
 -        if (subtree_inodes.count(dnl->get_inode()) == 0) {
 -          dnl->get_inode()->filelock.remove_dirty();
 -          dnl->get_inode()->nestlock.remove_dirty();
 -          dnl->get_inode()->dirfragtreelock.remove_dirty();
 -        }
 +        if (subtree_inodes.count(dnl->get_inode()) == 0)
 +          dnl->get_inode()->clear_scatter_dirty();
        }

        // recurse?
 diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
 index 9cbad87..49d21ab 100644
 --- a/src/mds/Migrator.cc
 +++ b/src/mds/Migrator.cc
 @@ -1095,6 +1095,10 @@ void Migrator::finish_export_inode(CInode *in, utime_t now, list<Context*>& fini

    in->clear_dirty_rstat();

 +  // no more auth subtree? clear scatter dirty
 +  if (!in->has_subtree_root_dirfrag(mds->get_nodeid()))
 +    in->clear_scatter_dirty();
 +
    in->item_open_file.remove_myself();

    // waiters
 @@ -1534,6 +1538,11 @@ void Migrator::export_finish(CDir *dir)
    cache->adjust_subtree_auth(dir, export_peer[dir]);
    cache->try_subtree_merge(dir);  // NOTE: may journal subtree_map as side-effect

 +  // no more auth subtree? clear scatter dirty
 +  if (!dir->get_inode()->is_auth() &&
 +      !dir->get_inode()->has_subtree_root_dirfrag(mds->get_nodeid()))
 +    dir->get_inode()->clear_scatter_dirty();
 +
    // unpin path
    export_unlock(dir);

 @@ -2020,6 +2029,10 @@ void Migrator::import_reverse(CDir *dir)
      cache->trim_non_auth_subtree(dir);
    cache->adjust_subtree_auth(dir, import_peer[dir->dirfrag()]);

 +  if (!dir->get_inode()->is_auth() &&
 +      !dir->get_inode()->has_subtree_root_dirfrag(mds->get_nodeid()))
 +    dir->get_inode()->clear_scatter_dirty();
 +
    // adjust auth bits.
    list<CDir*> q;
    q.push_back(dir);
 @@ -2053,6 +2066,8 @@ void Migrator::import_reverse(CDir *dir)
        if (in->is_dirty())
          in->mark_clean();
        in->clear_dirty_rstat();
 +      if (!in->has_subtree_root_dirfrag(mds->get_nodeid()))
 +        in->clear_scatter_dirty();

        in->authlock.clear_gather();
        in->linklock.clear_gather();
        in->dirfragtreelock.clear_gather();
 1.7.11.7



Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request

2013-03-20 Thread Sage Weil
On Thu, 21 Mar 2013, Yan, Zheng wrote:
 On 03/21/2013 05:19 AM, Greg Farnum wrote:
  On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
  From: Yan, Zheng zheng.z@intel.com
 
  Replicated objects need to be added into the cache immediately
 
  Signed-off-by: Yan, Zheng zheng.z@intel.com
  Why do we need to add them right away? Shouldn't we have a journaled 
  replica if we need it?
  -Greg
 
 The issue I encountered is lock action message received, but replicated 
 objects wasn't in the
 cache because slave request was delayed.

This makes sense to me; the add_replica_*() methods that create and push 
replicas of cache objects to other nodes need to always be applied 
immediately, or else the cache coherency falls apart.

There are similar games played between the client and mds with the caps 
protocol, although in that case IIRC there are certain limited 
circumstances where we can delay processing the message.  For mds-mds 
traffic, I don't think that's possible, unless *all* potentially dependent 
traffic is also delayed to preserve ordering and so forth.

[That said, I didn't review the actual patch :)]

sage
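
To make the ordering constraint concrete, here is a small, self-contained C++
sketch (illustrative names only, not the actual Ceph types or code paths): the
replicas carried in the message go into the cache right away, while only the
request processing is deferred, which is what the quoted handle_slave_request()
change below does with the stray dentry.

// Toy model, not Ceph code: apply replicas immediately, defer only the request.
#include <cassert>
#include <map>
#include <queue>
#include <string>

struct SlaveRequest {
  std::string replica_object;   // e.g. a stray dentry pushed by the master
};

int main() {
  std::map<std::string, bool> cache;    // replicated objects this MDS knows about
  std::queue<SlaveRequest> deferred;    // requests waiting for clientreplay/active

  SlaveRequest req;
  req.replica_object = "stray_dentry";

  // Apply the replica immediately, even though this MDS is not yet active...
  cache[req.replica_object] = true;
  // ...and defer only the request processing itself.
  deferred.push(req);

  // A later lock-action message that names the same object now finds it.
  assert(cache.count("stray_dentry") == 1);
  return 0;
}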

 
 Thanks
 Yan, Zheng
 
 
  
  Software Engineer #42 @ http://inktank.com | http://ceph.com
  ---
  src/mds/MDCache.cc | 12 
  src/mds/MDCache.h | 2 +-
  src/mds/MDS.cc | 6 +++---
  src/mds/Server.cc | 55 
  +++---
  4 files changed, 56 insertions(+), 19 deletions(-)
 
  diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
  index 0f6b842..b668842 100644
  --- a/src/mds/MDCache.cc
  +++ b/src/mds/MDCache.cc
  @@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context *fin, bufferlist& bl, int r)

  /*  */

  +int MDCache::get_num_client_requests()
  +{
  +  int count = 0;
  +  for (hash_map<metareqid_t, MDRequest*>::iterator p = active_requests.begin();
  +       p != active_requests.end();
  +       ++p) {
  +    if (p->second->reqid.name.is_client() && !p->second->is_slave())
  +      count++;
  +  }
  +  return count;
  +}
  +
  /* This function takes over the reference to the passed Message */
  MDRequest *MDCache::request_start(MClientRequest *req)
  {
  diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
  index a9f05c6..4634121 100644
  --- a/src/mds/MDCache.h
  +++ b/src/mds/MDCache.h
  @@ -240,7 +240,7 @@ protected:
    hash_map<metareqid_t, MDRequest*> active_requests;

  public:
  -  int get_num_active_requests() { return active_requests.size(); }
  +  int get_num_client_requests();

    MDRequest* request_start(MClientRequest *req);
    MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);
  diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
  index b91dcbd..e99eecc 100644
  --- a/src/mds/MDS.cc
  +++ b/src/mds/MDS.cc
  @@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
        mdcache->is_open() &&
        replay_queue.empty() &&
        want_state == MDSMap::STATE_CLIENTREPLAY) {
  -    dout(10) << " still have " << mdcache->get_num_active_requests()
  -             << " active replay requests" << dendl;
  -    if (mdcache->get_num_active_requests() == 0)
  +    int num_requests = mdcache->get_num_client_requests();
  +    dout(10) << " still have " << num_requests << " active replay requests" << dendl;
  +    if (num_requests == 0)
        clientreplay_done();
    }
 
  diff --git a/src/mds/Server.cc b/src/mds/Server.cc
  index 4c4c86b..8e89e4c 100644
  --- a/src/mds/Server.cc
  +++ b/src/mds/Server.cc
  @@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
               (m->get_type() == CEPH_MSG_CLIENT_REQUEST &&
                (static_cast<MClientRequest*>(m))->is_replay())) {
        // replaying!
  -    } else if (mds->is_clientreplay() && m->get_type() == MSG_MDS_SLAVE_REQUEST &&
  -               ((static_cast<MMDSSlaveRequest*>(m))->is_reply() ||
  -                !mds->mdsmap->is_active(m->get_source().num()))) {
  -      // slave reply or the master is also in the clientreplay stage
  +    } else if (m->get_type() == MSG_MDS_SLAVE_REQUEST) {
  +      // handle_slave_request() will wait if necessary
      } else {
        dout(3) << "not active yet, waiting" << dendl;
        mds->wait_for_active(new C_MDS_RetryMessage(mds, m));
  @@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
    if (m->is_reply())
      return handle_slave_request_reply(m);

  +  CDentry *straydn = NULL;
  +  if (m->stray.length() > 0) {
  +    straydn = mdcache->add_replica_stray(m->stray, from);
  +    assert(straydn);
  +    m->stray.clear();
  +  }
  +
    // am i a new slave?
    MDRequest *mdr = NULL;
    if (mdcache->have_request(m->get_reqid())) {
  @@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
        m->put();
        return;
      }
  -    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), m->get_source().num());
  +    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), from);
    }
    assert(mdr->slave_request == 0); // only one at a time, please!
  +
  +  if (straydn) {
  +    mdr->pin(straydn);
  +    mdr->straydn = straydn;
  +  }
  +
  +  if (!mds->is_clientreplay() && !mds->is_active() && !mds->is_stopping()) {
  +    dout(3) << "not clientreplay|active

Re: [PATCH 29/39] mds: avoid double auth pin for file recovery

2013-03-20 Thread Sage Weil
On Thu, 21 Mar 2013, Yan, Zheng wrote:
 On 03/21/2013 11:20 AM, Gregory Farnum wrote:
  This looks good on its face but I haven't had the chance to dig
  through the recovery queue stuff yet (it's on my list following some
  issues with recovery speed). How'd you run across this? If it's being
  added to the recovery queue multiple times I want to make sure we
  don't have some other machinery trying to dequeue it multiple times,
  or a single waiter which needs to be a list or something.
  -Greg
 
 Two clients that were writing the same file crashed successively.

Hmm, I would love to have a test case for this.  It should be pretty easy 
to construct some tests with libcephfs that fork, connect and do some 
operations, and are then killed by the parent, who verifies the resulting 
recovery occurs.  This is some of the more fragile code, not just because it is 
rarely tested.

sage
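
A rough sketch of such a libcephfs test, assuming the C API
(ceph_create/ceph_conf_read_file/ceph_mount/ceph_open/ceph_write/ceph_stat);
the path, buffer sizes, and the exact kill/verify steps are illustrative, not
a ready-made teuthology case:

// Sketch only: two forked clients write the same file and are SIGKILLed in
// turn, so their caps are never released cleanly; the parent then mounts
// and checks that MDS file recovery produced a sane result.  Error handling
// is omitted for brevity.
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cephfs/libcephfs.h>

static pid_t spawn_writer(const char *path, const char *buf, size_t len)
{
  pid_t pid = fork();
  if (pid == 0) {
    struct ceph_mount_info *cmount;
    ceph_create(&cmount, NULL);
    ceph_conf_read_file(cmount, NULL);
    ceph_mount(cmount, "/");
    int fd = ceph_open(cmount, path, O_CREAT | O_WRONLY, 0644);
    ceph_write(cmount, fd, buf, len, 0);
    pause();                        // hold the file caps until killed
    _exit(0);
  }
  return pid;
}

int main(void)
{
  const char *path = "/recovery_test";   // hypothetical test file
  char buf[4096] = {0};

  // two clients writing the same file crash one after the other
  pid_t a = spawn_writer(path, buf, sizeof(buf));
  sleep(2);
  pid_t b = spawn_writer(path, buf, sizeof(buf));
  sleep(2);
  kill(a, SIGKILL); waitpid(a, NULL, 0);
  kill(b, SIGKILL); waitpid(b, NULL, 0);

  // parent mounts and verifies the file is still usable after recovery
  struct ceph_mount_info *cmount;
  ceph_create(&cmount, NULL);
  ceph_conf_read_file(cmount, NULL);
  ceph_mount(cmount, "/");
  struct stat st;
  assert(ceph_stat(cmount, path, &st) == 0);
  ceph_shutdown(cmount);
  return 0;
}

Killing the writers while they still hold caps is what forces the MDS down the
RecoveryQueue path discussed in this thread.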



 
 Thanks,
 Yan, Zheng
 
  
  On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
  From: Yan, Zheng zheng.z@intel.com
 
  Signed-off-by: Yan, Zheng zheng.z@intel.com
  ---
   src/mds/MDCache.cc | 6 --
   1 file changed, 4 insertions(+), 2 deletions(-)
 
  diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
  index 973a4d0..e9a79cd 100644
  --- a/src/mds/MDCache.cc
  +++ b/src/mds/MDCache.cc
  @@ -5502,8 +5502,10 @@ void MDCache::_queue_file_recover(CInode *in)
    dout(15) << "_queue_file_recover " << *in << dendl;
    assert(in->is_auth());
    in->state_clear(CInode::STATE_NEEDSRECOVER);
  -  in->state_set(CInode::STATE_RECOVERING);
  -  in->auth_pin(this);
  +  if (!in->state_test(CInode::STATE_RECOVERING)) {
  +    in->state_set(CInode::STATE_RECOVERING);
  +    in->auth_pin(this);
  +  }
    file_recover_queue.insert(in);
  }
 
  --
  1.7.11.7
 
 
 
 