Re: [PATCH 04/39] mds: make sure table request id unique
On Wed, 20 Mar 2013, Yan, Zheng wrote:
> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>> Hmm, this is definitely narrowing the race (probably enough to never
>> hit it), but it's not actually eliminating it (if the restart happens
>> after 4 billion requests?). More importantly, this kind of symptom
>> makes me worry that we might be papering over more serious issues with
>> colliding states in the Table on restart. I don't have the MDSTable
>> semantics in my head, so I'll need to look into this later unless
>> somebody else volunteers to do so?
>
> Not just 4 billion requests: MDS restart has several stages, and the
> mdsmap epoch increases for each stage. I don't think there are any more
> colliding states in the table. The table client/server use two-phase
> commit; it's similar to a client request that involves multiple MDSes.
> The reqid is analogous to the client request id. The difference is that
> the client request ID is unique because a new client always gets a
> unique session id.

Each time a tid is consumed (at least for an update) it is journaled in
the EMetaBlob::table_tids list, right?  So we could actually take a max
from journal replay and pick up where we left off?  That seems like the
cleanest.

I'm not too worried about 2^32 tids, I guess, but it would be nicer to
avoid that possibility.

sage

> Thanks
> Yan, Zheng
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>> On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

When an MDS becomes active, the table server re-sends 'agree' messages
for old prepared requests. If the recovered MDS starts a new table
request at the same time, the new request's ID can happen to be the same
as an old prepared request's ID, because the current table client
assigns request IDs from zero after the MDS restarts.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc            | 3 +++
 src/mds/MDSTableClient.cc | 5 +++++
 src/mds/MDSTableClient.h  | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index bb1c833..859782a 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
       dout(2) << "boot_start " << step << ": opening snap table" << dendl;
       snapserver->load(gather.new_sub());
     }
+
+    anchorclient->init();
+    snapclient->init();
 
     dout(2) << "boot_start " << step << ": opening mds log" << dendl;
     mdlog->open(gather.new_sub());
diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index ea021f5..beba0a3 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -34,6 +34,11 @@
 #undef dout_prefix
 #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
 
+void MDSTableClient::init()
+{
+  // make reqid unique between MDS restarts
+  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
+}
 
 void MDSTableClient::handle_request(class MMDSTableRequest *m)
 {
diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
index e15837f..78035db 100644
--- a/src/mds/MDSTableClient.h
+++ b/src/mds/MDSTableClient.h
@@ -63,6 +63,8 @@ public:
   MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
   virtual ~MDSTableClient() {}
 
+  void init();
+
   void handle_request(MMDSTableRequest *m);
   void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
--
1.7.11.7
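To make the collision and the fix concrete, here is a minimal,
self-contained sketch of the reqid scheme (illustrative only: TableClient
below is a toy stand-in, not the real MDSTableClient):

    #include <cstdint>
    #include <cassert>

    // Toy stand-in for the table client's reqid allocation.
    struct TableClient {
      uint64_t last_reqid = 0;

      // The patch: seed the counter with the current mdsmap epoch in the
      // high 32 bits instead of restarting from zero.
      void init(uint32_t mdsmap_epoch) {
        last_reqid = (uint64_t)mdsmap_epoch << 32;
      }

      uint64_t alloc_reqid() { return ++last_reqid; }
    };

    int main() {
      TableClient before_restart;
      before_restart.init(10);          // epoch 10 at first startup
      uint64_t old_prepared = before_restart.alloc_reqid();

      TableClient after_restart;
      after_restart.init(14);           // a restart bumps the epoch several times
      uint64_t fresh = after_restart.alloc_reqid();

      // The epoch in the high bits keeps new reqids disjoint from old
      // ones -- unless one epoch issues more than 2^32 requests, which is
      // exactly Greg's residual-race caveat.
      assert(fresh != old_prepared);
      return 0;
    }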
Re: [PATCH 04/39] mds: make sure table request id unique
On 03/20/2013 02:15 PM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>> Hmm, this is definitely narrowing the race (probably enough to never
>>> hit it), but it's not actually eliminating it (if the restart happens
>>> after 4 billion requests?). More importantly, this kind of symptom
>>> makes me worry that we might be papering over more serious issues
>>> with colliding states in the Table on restart.
>>
>> Not just 4 billion requests: MDS restart has several stages, and the
>> mdsmap epoch increases for each stage. I don't think there are any
>> more colliding states in the table. The table client/server use
>> two-phase commit; it's similar to a client request that involves
>> multiple MDSes. The reqid is analogous to the client request id. The
>> difference is that the client request ID is unique because a new
>> client always gets a unique session id.
>
> Each time a tid is consumed (at least for an update) it is journaled in
> the EMetaBlob::table_tids list, right?  So we could actually take a max
> from journal replay and pick up where we left off?  That seems like the
> cleanest.

This approach works only if the client journals the reqid before it
sends the request to the server, but in the current implementation the
client journals the reqid when it receives the server's 'agree' message.

Regards
Yan, Zheng

> I'm not too worried about 2^32 tids, I guess, but it would be nicer to
> avoid that possibility.
>
> sage
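For contrast, a rough sketch of Sage's replay-max idea, with Yan's
caveat marked (note_replayed_tid is an invented name for illustration,
not an actual Ceph hook):

    #include <algorithm>
    #include <cstdint>

    struct TableClientSketch {
      uint64_t last_reqid = 0;

      // Feed every table tid found in a replayed EMetaBlob through this,
      // so the counter resumes past anything already used.
      void note_replayed_tid(uint64_t tid) {
        last_reqid = std::max(last_reqid, tid);
      }
    };

    // Yan's caveat: the reqid only reaches the journal once the server's
    // 'agree' comes back.  A prepare that was sent but not yet
    // acknowledged at crash time never appears in the journal, so the
    // recovered max can sit below a reqid the server still remembers --
    // and the collision returns.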
Re: deb/rpm package purge
On Tue, 2013-03-19 at 15:59 -0700, Sage Weil wrote:
> On Tue, 19 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
>> On the other hand, 'dpkg --purge' is to remove everything the package
>> has installed and/or generated. This includes debconf answers as well.
>> In other words, purge is used to make the system totally clean of the
>> package. As such, if the sysadmin installs the package again, all
>> debconf questions will be asked again and all generated files will be
>> generated again from scratch.
>
> I understand that part, but the policy isn't very clear about files
> that are not part of the package but are generated as a result of the
> package being installed (i.e., user data).

Forgive me, I just learnt English and my wording may not be that clear
for a native speaker.

> As a point of comparison, mysql removes the config files but not
> /var/lib/mysql.

As I remember, MySQL asks if /var/lib/mysql/ should be purged or not; I
may be mixing it up with another (database-related) package.

> The question is, is that okay/typical/desirable/recommended/a bad idea?

I can rephrase my words. Purge removes the (binary) package files, its
configuration and logs (its generated files). To emphasize, user files
do _not_ fall into this category and must remain as-is, _intact_. Some
packages print a console message on purge that 'your files remain at
xxx, they were not removed'. Others just leave the dpkg warning
'directory not empty so not removed', which means user files may have
been left there and that may be the reason the directory is not empty.
I'm in a rush, but hopefully I will be able to note the relevant policy
sections in the afternoon (CET).

> The less important question is whether /var/log/ceph should be removed;
> I'm assuming yes?

Yes, logs are going to be removed.

Hope I could answer your question now. Please tell me if I should
clarify more parts of my answer.

Laszlo/GCS
Re: [PATCH 04/39] mds: make sure table request id unique
On 03/20/2013 02:15 PM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>> Hmm, this is definitely narrowing the race (probably enough to never
>>> hit it), but it's not actually eliminating it (if the restart happens
>>> after 4 billion requests?). More importantly, this kind of symptom
>>> makes me worry that we might be papering over more serious issues
>>> with colliding states in the Table on restart.
>>
>> Not just 4 billion requests: MDS restart has several stages, and the
>> mdsmap epoch increases for each stage. I don't think there are any
>> more colliding states in the table. The table client/server use
>> two-phase commit; it's similar to a client request that involves
>> multiple MDSes. The reqid is analogous to the client request id. The
>> difference is that the client request ID is unique because a new
>> client always gets a unique session id.
>
> Each time a tid is consumed (at least for an update) it is journaled in
> the EMetaBlob::table_tids list, right?  So we could actually take a max
> from journal replay and pick up where we left off?  That seems like the
> cleanest.
>
> I'm not too worried about 2^32 tids, I guess, but it would be nicer to
> avoid that possibility.

Can we re-use the client request ID as the table client request ID?

Regards
Yan, Zheng

> sage
Re: deb/rpm package purge
On Wed, 20 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
> I can rephrase my words. Purge removes the (binary) package files, its
> configuration and logs (its generated files). To emphasize, user files
> do _not_ fall into this category and must remain as-is, _intact_. Some
> packages print a console message on purge that 'your files remain at
> xxx, they were not removed'. Others just leave the dpkg warning
> 'directory not empty so not removed', which means user files may have
> been left there and that may be the reason the directory is not empty.
> I'm in a rush, but hopefully I will be able to note the relevant policy
> sections in the afternoon (CET).

Thanks, Laszlo, that's exactly what I was after!  Sorry for the
confusing exchange. :)

Sounds like in this case, the fix is simply to leave /var/lib/ceph
untouched.

We'll need to update teuthology ceph.py and nuke to clean up
/var/lib/ceph (for qa runs), and I think we should add a ceph-deploy
'purgedata' command to clean out /var/lib/ceph on a given host.

Thanks!
sage
waiting for 1 open ops to drain
I am using ceph 0.58 and kernel 3.9-rc2 and btrfs on my osds.

I have an osd that starts up but blocks with the log message 'waiting
for 1 open ops to drain'. This never happens, and I can't get the osd
'up'. I need to clear this problem.

I have recently had an osd go problematic and I have recreated a fresh
btrfs filesystem on the problem osd drive. I have also added a
completely new osd. The 'waiting for 1 open ops to drain' problem has
occurred before the cluster has recovered from the earlier surgery, and
I need to get the data from this osd.

I have increased the number of copies from 2 to 3 to give me more
resilience in the future, but that has not taken effect yet. Once I get
the cluster back to health, I will mkfs.btrfs and rebuild this osd and
one other that is a legacy from earlier kernel/ceph versions.

How can I tell the osd not to bother with waiting for its open ops to
drain?

Thank you in anticipation.

David
Re: deb/rpm package purge
On 03/20/2013 07:48 AM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
>> I can rephrase my words. Purge removes the (binary) package files, its
>> configuration and logs (its generated files). To emphasize, user files
>> do _not_ fall into this category and must remain as-is, _intact_.
>
> Thanks, Laszlo, that's exactly what I was after!  Sorry for the
> confusing exchange. :)
>
> Sounds like in this case, the fix is simply to leave /var/lib/ceph
> untouched.
>
> We'll need to update teuthology ceph.py and nuke to clean up
> /var/lib/ceph (for qa runs), and I think we should add a ceph-deploy
> 'purgedata' command to clean out /var/lib/ceph on a given host.

It's not as important given that it won't outright destroy the cluster,
but perhaps we should also leave /etc/ceph untouched on purge if a
ceph.conf file has been placed in it (since that also was not installed
by the package, but rather by a user?).

I figure we should probably try to get it right now. The message about
the directory not being empty sounds good. My thought here is:

- remove anything created by the packages in /var/lib/ceph that has been
  untouched since package installation.
- remove /var/lib/ceph if it has been untouched
- remove /etc/ceph if it has been untouched

Thoughts?

> Thanks!
> sage

Mark
Re: deb/rpm package purge
On 20/03/13 14:48, Mark Nelson wrote:
>> We'll need to update teuthology ceph.py and nuke to clean up
>> /var/lib/ceph (for qa runs), and I think we should add a ceph-deploy
>> 'purgedata' command to clean out /var/lib/ceph on a given host.
>
> It's not as important given that it won't outright destroy the cluster,
> but perhaps we should also leave /etc/ceph untouched on purge if a
> ceph.conf file has been placed in it (since that also was not installed
> by the package, but rather by a user?). I figure we should probably try
> to get it right now. The message about the directory not being empty
> sounds good. My thought here is:
>
> - remove anything created by the packages in /var/lib/ceph that has
>   been untouched since package installation.
> - remove /var/lib/ceph if it has been untouched
> - remove /etc/ceph if it has been untouched

If those directories are created by dpkg rather than maintainer scripts
(i.e. in ceph.dirs rather than ceph.postinst), you should get all of
that for free; if the directories contain anything dpkg does not know
about, it will just not remove them.

--
James Page
Ubuntu Core Developer
Debian Maintainer
james.p...@ubuntu.com
Re: [PATCH 04/39] mds: make sure table request id unique
On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
> On 03/20/2013 02:15 PM, Sage Weil wrote:
>> Each time a tid is consumed (at least for an update) it is journaled
>> in the EMetaBlob::table_tids list, right?  So we could actually take a
>> max from journal replay and pick up where we left off?  That seems
>> like the cleanest.
>>
>> I'm not too worried about 2^32 tids, I guess, but it would be nicer to
>> avoid that possibility.
>
> Can we re-use the client request ID as the table client request ID?

Not sure what you're referring to here -- do you mean the ID of the
filesystem client request which prompted the update? I don't think that
would work, as client requests actually require two parts to be unique
(the client GUID and the request seq number), and I'm pretty sure a
single client request can spawn multiple Table updates.

As I look over this more, it sure looks to me as if the effect of the
code we have (when non-broken) is to roll back every non-committed
request by an MDS which restarted -- the only time it can handle the
TableServer's 'agree' with a different response is if the MDS was
incorrectly marked out by the map. Am I parsing this correctly, Sage?
Given that, and without having looked at the code more broadly, I think
we want to add some sort of implicit or explicit handshake letting each
of them know if the MDS actually disappeared. We use the process/address
nonce to accomplish this in other places...

-Greg
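For reference, a small sketch of the two-part uniqueness Greg describes
(an illustrative struct, not Ceph's actual metareqid_t definition):

    #include <cstdint>
    #include <tuple>

    // A two-part request id is unique as long as no two sessions ever
    // share a guid, no matter where each session's seq counter starts.
    // A bare per-client counter restarting at zero has no such guarantee.
    struct ClientReqId {
      uint64_t client_guid;  // handed out once per session, never reused
      uint64_t seq;          // per-session counter, may restart at zero

      bool operator==(const ClientReqId &o) const {
        return std::tie(client_guid, seq) == std::tie(o.client_guid, o.seq);
      }
    };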
Re: [PATCH 01/39] mds: preserve subtree bounds until slave commit
Reviewed-by: Greg Farnum g...@inktank.com

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

When replaying an operation that renames a directory inode into a
non-auth subtree, if the inode has subtree bounds, we should prevent
them from being trimmed until slave commit.

This patch also fixes a bug in ESlaveUpdate::replay().
EMetaBlob::replay() should be called before
MDCache::finish_uncommitted_slave_update().

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 21 +++++++++++----------
 src/mds/Mutation.h |  5 ++---
 src/mds/journal.cc | 13 +++++++++----
 3 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index fddcfc6..684e70b 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3016,10 +3016,10 @@ void MDCache::add_uncommitted_slave_update(metareqid_t reqid, int master, MDSlav
 {
   assert(uncommitted_slave_updates[master].count(reqid) == 0);
   uncommitted_slave_updates[master][reqid] = su;
-  if (su->rename_olddir)
-    uncommitted_slave_rename_olddir[su->rename_olddir]++;
+  for(set<CDir*>::iterator p = su->olddirs.begin(); p != su->olddirs.end(); ++p)
+    uncommitted_slave_rename_olddir[*p]++;
   for(set<CInode*>::iterator p = su->unlinked.begin(); p != su->unlinked.end(); ++p)
-    uncommitted_slave_unlink[*p]++;
+    uncommitted_slave_unlink[*p]++;
 }
 
 void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
@@ -3031,11 +3031,12 @@ void MDCache::finish_uncommitted_slave_update(metareqid_t reqid, int master)
   if (uncommitted_slave_updates[master].empty())
     uncommitted_slave_updates.erase(master);
   // discard the non-auth subtree we renamed out of
-  if (su->rename_olddir) {
-    uncommitted_slave_rename_olddir[su->rename_olddir]--;
-    if (uncommitted_slave_rename_olddir[su->rename_olddir] == 0) {
-      uncommitted_slave_rename_olddir.erase(su->rename_olddir);
-      CDir *root = get_subtree_root(su->rename_olddir);
+  for(set<CDir*>::iterator p = su->olddirs.begin(); p != su->olddirs.end(); ++p) {
+    CDir *dir = *p;
+    uncommitted_slave_rename_olddir[dir]--;
+    if (uncommitted_slave_rename_olddir[dir] == 0) {
+      uncommitted_slave_rename_olddir.erase(dir);
+      CDir *root = get_subtree_root(dir);
       if (root->get_dir_auth() == CDIR_AUTH_UNDEF)
 	try_trim_non_auth_subtree(root);
     }
@@ -6052,8 +6053,8 @@ bool MDCache::trim_non_auth_subtree(CDir *dir)
 {
   dout(10) << "trim_non_auth_subtree(" << dir << ") " << *dir << dendl;
 
-  // preserve the dir for rollback
-  if (uncommitted_slave_rename_olddir.count(dir))
+  if (uncommitted_slave_rename_olddir.count(dir) ||  // preserve the dir for rollback
+      my_ambiguous_imports.count(dir->dirfrag()))
     return true;
 
   bool keep_dir = false;
diff --git a/src/mds/Mutation.h b/src/mds/Mutation.h
index 55b84eb..5013f04 100644
--- a/src/mds/Mutation.h
+++ b/src/mds/Mutation.h
@@ -315,13 +315,12 @@ struct MDSlaveUpdate {
   bufferlist rollback;
   elist<MDSlaveUpdate*>::item item;
   Context *waiter;
-  CDir* rename_olddir;
+  set<CDir*> olddirs;
   set<CInode*> unlinked;
   MDSlaveUpdate(int oo, bufferlist &rbl, elist<MDSlaveUpdate*> &list) :
     origop(oo), item(this),
-    waiter(0),
-    rename_olddir(0) {
+    waiter(0) {
     rollback.claim(rbl);
     list.push_back(item);
   }
diff --git a/src/mds/journal.cc b/src/mds/journal.cc
index 5b3bd71..3375e40 100644
--- a/src/mds/journal.cc
+++ b/src/mds/journal.cc
@@ -1131,10 +1131,15 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, MDSlaveUpdate *slaveup)
       if (olddir) {
 	if (olddir->authority() != CDIR_AUTH_UNDEF &&
 	    renamed_diri->authority() == CDIR_AUTH_UNDEF) {
+	  assert(slaveup); // auth to non-auth, must be slave prepare
 	  list<frag_t> leaves;
 	  renamed_diri->dirfragtree.get_leaves(leaves);
-	  for (list<frag_t>::iterator p = leaves.begin(); p != leaves.end(); ++p)
-	    renamed_diri->get_or_open_dirfrag(mds->mdcache, *p);
+	  for (list<frag_t>::iterator p = leaves.begin(); p != leaves.end(); ++p) {
+	    CDir *dir = renamed_diri->get_or_open_dirfrag(mds->mdcache, *p);
+	    // preserve subtree bound until slave commit
+	    if (dir->authority() == CDIR_AUTH_UNDEF)
+	      slaveup->olddirs.insert(dir);
+	  }
 	}
 
 	mds->mdcache->adjust_subtree_after_rename(renamed_diri, olddir, false);
@@ -1143,7 +1148,7 @@ void EMetaBlob::replay(MDS *mds, LogSegment *logseg, MDSlaveUpdate *slaveup)
 	  CDir *root = mds->mdcache->get_subtree_root(olddir);
 	  if (root->get_dir_auth() == CDIR_AUTH_UNDEF) {
 	    if (slaveup) // preserve the old dir until slave commit
-	      slaveup->rename_olddir = olddir;
+	      slaveup->olddirs.insert(olddir);
 	    else
 	      mds->mdcache->try_trim_non_auth_subtree(root);
 	  }
@@ -2122,10 +2127,10 @@ void ESlaveUpdate::replay(MDS *mds)
   case
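An aside on the bookkeeping this patch touches:
uncommitted_slave_rename_olddir acts as a refcounted pin table, a
pattern that can be sketched generically like so (toy code, not the
MDCache types):

    #include <map>
    #include <cassert>

    // An object may be pinned by several uncommitted updates; it only
    // becomes trimmable when the last pin drops.
    template <typename T>
    struct PinTable {
      std::map<T, int> pins;

      void pin(const T &obj) { pins[obj]++; }

      // Returns true when the last pin was dropped and obj may be trimmed.
      bool unpin(const T &obj) {
        auto it = pins.find(obj);
        assert(it != pins.end());
        if (--it->second == 0) {
          pins.erase(it);
          return true;
        }
        return false;
      }

      bool pinned(const T &obj) const { return pins.count(obj) != 0; }
    };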
Re: [PATCH 03/39] mds: fix MDCache::adjust_bounded_subtree_auth()
Reviewed-by: Greg Farnum g...@inktank.com

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

There are cases that need both creating a new bound and swallowing an
intervening subtree. For example: an MDS exports subtree A with bound B
and imports subtree B with bound C at the same time. The MDS crashes;
exporting subtree A fails, but importing subtree B succeeds. During
recovery, the MDS may create new bound C and swallow subtree B.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 684e70b..19dc60b 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -980,15 +980,21 @@ void MDCache::adjust_bounded_subtree_auth(CDir *dir, set<CDir*>& bounds, pair<in
       } else {
 	dout(10) << "  want bound " << *bound << dendl;
+	CDir *t = get_subtree_root(bound->get_parent_dir());
+	if (subtrees[t].count(bound) == 0) {
+	  assert(t != dir);
+	  dout(10) << "  new bound " << *bound << dendl;
+	  adjust_subtree_auth(bound, t->authority());
+	}
 	// make sure it's nested beneath ambiguous subtree(s)
 	while (1) {
-	  CDir *t = get_subtree_root(bound->get_parent_dir());
-	  if (t == dir) break;
 	  while (subtrees[dir].count(t) == 0)
 	    t = get_subtree_root(t->get_parent_dir());
 	  dout(10) << "  swallowing intervening subtree at " << *t << dendl;
 	  adjust_subtree_auth(t, auth);
 	  try_subtree_merge_at(t);
+	  t = get_subtree_root(bound->get_parent_dir());
+	  if (t == dir) break;
 	}
       }
     }
--
1.7.11.7
Re: [PATCH 05/39] mds: send table request when peer is in proper state.
This and patch 6 are probably going to get dealt with as part of our
conversation on patch 4 and restart of the TableServers.

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

Table client/server should send requests/replies when the peer is
active. Anchor query is an exception, because an MDS in the rejoin
stage may need to fetch files before sending the rejoin ack; the anchor
server can also be in the rejoin stage.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/AnchorClient.cc   | 5 ++++-
 src/mds/MDSTableClient.cc | 9 ++++++---
 src/mds/MDSTableServer.cc | 3 ++-
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/mds/AnchorClient.cc b/src/mds/AnchorClient.cc
index 455e97f..d7da9d1 100644
--- a/src/mds/AnchorClient.cc
+++ b/src/mds/AnchorClient.cc
@@ -80,9 +80,12 @@ void AnchorClient::lookup(inodeno_t ino, vector<Anchor>& trace, Context *onfinis
 
 void AnchorClient::_lookup(inodeno_t ino)
 {
+  int ts = mds->mdsmap->get_tableserver();
+  if (mds->mdsmap->get_state(ts) < MDSMap::STATE_REJOIN)
+    return;
   MMDSTableRequest *req = new MMDSTableRequest(table, TABLESERVER_OP_QUERY, 0, 0);
   ::encode(ino, req->bl);
-  mds->send_message_mds(req, mds->mdsmap->get_tableserver());
+  mds->send_message_mds(req, ts);
 }
 
diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index beba0a3..df0131f 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -149,9 +149,10 @@ void MDSTableClient::_prepare(bufferlist& mutation, version_t *ptid, bufferlist
 void MDSTableClient::send_to_tableserver(MMDSTableRequest *req)
 {
   int ts = mds->mdsmap->get_tableserver();
-  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY)
+  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY) {
     mds->send_message_mds(req, ts);
-  else {
+  } else {
+    req->put();
     dout(10) << " deferring request to not-yet-active tableserver mds." << ts << dendl;
   }
 }
@@ -193,7 +194,9 @@ void MDSTableClient::got_journaled_ack(version_t tid)
 void MDSTableClient::finish_recovery()
 {
   dout(7) << "finish_recovery" << dendl;
-  resend_commits();
+  int ts = mds->mdsmap->get_tableserver();
+  if (mds->mdsmap->get_state(ts) >= MDSMap::STATE_CLIENTREPLAY)
+    resend_commits();
 }
 
 void MDSTableClient::resend_commits()
diff --git a/src/mds/MDSTableServer.cc b/src/mds/MDSTableServer.cc
index 4f86ff1..07c7d26 100644
--- a/src/mds/MDSTableServer.cc
+++ b/src/mds/MDSTableServer.cc
@@ -159,7 +159,8 @@ void MDSTableServer::handle_mds_recovery(int who)
   for (map<version_t,mds_table_pending_t>::iterator p = pending_for_mds.begin();
        p != pending_for_mds.end();
        ++p) {
-    if (who >= 0 && p->second.mds != who)
+    if ((who >= 0 && p->second.mds != who) ||
+	mds->mdsmap->get_state(p->second.mds) < MDSMap::STATE_CLIENTREPLAY)
       continue;
     MMDSTableRequest *reply = new MMDSTableRequest(table, TABLESERVER_OP_AGREE, p->second.reqid, p->second.tid);
     mds->send_message_mds(reply, p->second.mds);
--
1.7.11.7
Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.
The idea of this patch makes sense, but I'm not sure if we guarantee
that each daemon sees every map update -- if they don't, then if an MDS
misses the map moving an MDS into CLIENTREPLAY, they won't process it as
having recovered on the next map. Sage or Joao, what are the guarantees
subscription provides?
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

MDS in clientreply state already start servering requests. It also make
MDS::handle_mds_recovery() and MDS::recovery_done() match.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 282fa64..b91dcbd 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
 
     set<int> oldactive, active;
     oldmap->get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
+    oldmap->get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
     mdsmap->get_mds_set(active, MDSMap::STATE_ACTIVE);
+    mdsmap->get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
     for (set<int>::iterator p = active.begin(); p != active.end(); ++p)
       if (*p != whoami &&            // not me
 	  oldactive.count(*p) == 0)  // newly so?
--
1.7.11.7
Re: [PATCH 07/39] mds: mark connection down when MDS fails
Reviewed-by: Greg Farnum g...@inktank.com

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

So if the MDS restarts and uses the same address, it does not get old
messages.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDS.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 859782a..282fa64 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1046,8 +1046,10 @@ void MDS::handle_mds_map(MMDSMap *m)
     oldmap->get_failed_mds_set(oldfailed);
     mdsmap->get_failed_mds_set(failed);
     for (set<int>::iterator p = failed.begin(); p != failed.end(); ++p)
-      if (oldfailed.count(*p) == 0)
+      if (oldfailed.count(*p) == 0) {
+	messenger->mark_down(oldmap->get_inst(*p).addr);
 	mdcache->handle_mds_failure(*p);
+      }
 
     // or down then up?
     //  did their addr/inst change?
@@ -1055,8 +1057,10 @@ void MDS::handle_mds_map(MMDSMap *m)
     mdsmap->get_up_mds_set(up);
     for (set<int>::iterator p = up.begin(); p != up.end(); ++p)
       if (oldmap->have_inst(*p) &&
-	  oldmap->get_inst(*p) != mdsmap->get_inst(*p))
+	  oldmap->get_inst(*p) != mdsmap->get_inst(*p)) {
+	messenger->mark_down(oldmap->get_inst(*p).addr);
 	mdcache->handle_mds_failure(*p);
+      }
   }
 
   if (is_clientreplay() || is_active() || is_stopping()) {
     // did anyone stop?
--
1.7.11.7
Bad Blocks
Hi All,

I would like to understand how Ceph handles and recovers from bad
blocks. Would someone mind explaining this to me? It wasn't very
apparent from the docs.

My ultimate goal is to be able to get some extra life out of my disks
after I detect that they may be failing. (I'm talking about those disks
that may have a small number of bad blocks, but otherwise seem fine and
still perform well.)

Here's what I've put together:

1. BBR Hardware - All hard disks come with a set number of blocks that
   are reserved for remapping of failed blocks. This is handled
   transparently by the hard disk. The hard disk may not begin reporting
   failed blocks until all the reserved blocks are used up.

2. BBR Device Mapper Target - Back in the EVMS days, IBM wrote a kernel
   module (dm-bbr) and an EVMS plugin to manage that kernel module. I
   have updated that kernel module to work with the 3.6.11 kernel. I
   have also rewritten some portions of the EVMS plugin as a standalone
   bash script to allow me to initialize the BBR layer and start the BBR
   device mapper target on that layer. (So far it seems to run fine, but
   requires more testing.)

3. BTRFS - I've read that BTRFS can perform data scrubbing and repair
   damaged files from redundant copies.

4. CEPH - I've read that Ceph can perform a deep scrub to find damaged
   copies. I assume, given the distributed nature of Ceph, that it can
   repair the damaged copy from the other OSDs.

One thing I am not clear on is: when BTRFS / Ceph finds damaged data,
what do they do to prevent data from being written to the same area?

Also, I'm wondering if any parts of my layered approach are redundant /
unnecessary... For instance, if BTRFS marks the block bad internally,
then perhaps the BBR DM target isn't needed...

In my testing recently, I had the following setup:

    Disk -> DM-Crypt -> DM-BBR -> BTRFS -> OSD

When the OSD hit a bad block, the DM-BBR target successfully remapped it
to one of its own reserved blocks, BTRFS then reported data corruption,
and the OSD daemon crashed.

--
Thanks,
Dyweni
Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.
Oh, also: s/clientreply/clientreplay in the commit message

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

MDS in clientreply state already start servering requests. It also make
MDS::handle_mds_recovery() and MDS::recovery_done() match.
Re: [PATCH 09/39] mds: defer eval gather locks when removing replica
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
> From: Yan, Zheng zheng.z@intel.com
>
> Locks' states should not change between composing the cache rejoin ack
> messages and sending the message. If Locker::eval_gather() is called in
> MDCache::{inode,dentry}_remove_replica(), it may wake requests and
> change locks' states.
>
> Signed-off-by: Yan, Zheng zheng.z@intel.com
> ---
>  src/mds/MDCache.cc | 51 ++++++++++++++++++++++++-----------------------
>  src/mds/MDCache.h  |  8 +++++---
>  2 files changed, 35 insertions(+), 24 deletions(-)
>
> diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
> index 19dc60b..0f6b842 100644
> --- a/src/mds/MDCache.cc
> +++ b/src/mds/MDCache.cc
> @@ -3729,6 +3729,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
>    // possible response(s)
>    MMDSCacheRejoin *ack = 0;      // if survivor
>    set<vinodeno_t> acked_inodes;  // if survivor
> +  set<SimpleLock *> gather_locks;  // if survivor
>    bool survivor = false;  // am i a survivor?
>
>    if (mds->is_clientreplay() || mds->is_active() || mds->is_stopping()) {
> @@ -3851,7 +3852,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
> 	assert(dnl->is_primary());
>
> 	if (survivor && dn->is_replica(from))
> -	  dentry_remove_replica(dn, from);  // this induces a lock gather completion
> +	  dentry_remove_replica(dn, from, gather_locks);  // this induces a lock gather completion

This comment is no longer accurate :)

> 	int dnonce = dn->add_replica(from);
> 	dout(10) << " have " << *dn << dendl;
> 	if (ack)
> @@ -3864,7 +3865,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
> 	assert(in);
>
> 	if (survivor && in->is_replica(from))
> -	  inode_remove_replica(in, from);
> +	  inode_remove_replica(in, from, gather_locks);
> 	int inonce = in->add_replica(from);
> 	dout(10) << " have " << *in << dendl;
>
> @@ -3887,7 +3888,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
>       CInode *in = get_inode(*p);
>       assert(in);   // hmm fixme wrt stray?
>       if (survivor && in->is_replica(from))
> -	inode_remove_replica(in, from);    // this induces a lock gather completion
> +	inode_remove_replica(in, from, gather_locks);    // this induces a lock gather completion

Same here. Other than those, looks good.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>       int inonce = in->add_replica(from);
>       dout(10) << " have base " << *in << dendl;
>
> @@ -3909,8 +3910,11 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
>       ack->add_inode_base(in);
>     }
>
> -    rejoin_scour_survivor_replicas(from, ack, acked_inodes);
> +    rejoin_scour_survivor_replicas(from, ack, gather_locks, acked_inodes);
>      mds->send_message(ack, weak->get_connection());
> +
> +    for (set<SimpleLock*>::iterator p = gather_locks.begin(); p != gather_locks.end(); ++p)
> +      mds->locker->eval_gather(*p);
>    } else {
>      // done?
>      assert(rejoin_gather.count(from));
> @@ -4055,7 +4059,9 @@ bool MDCache::parallel_fetch_traverse_dir(inodeno_t ino, filepath& path,
>   * all validated replicas are acked with a strong nonce, etc.  if that isn't in the
>   * ack, the replica dne, and we can remove it from our replica maps.
>   */
> -void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set<vinodeno_t>& acked_inodes)
> +void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack,
> +					     set<SimpleLock *>& gather_locks,
> +					     set<vinodeno_t>& acked_inodes)
>  {
>    dout(10) << "rejoin_scour_survivor_replicas from mds." << from << dendl;
>
> @@ -4070,7 +4076,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set
>      if (in->is_auth() &&
> 	 in->is_replica(from) &&
> 	 acked_inodes.count(p->second->vino()) == 0) {
> -      inode_remove_replica(in, from);
> +      inode_remove_replica(in, from, gather_locks);
>        dout(10) << " rem " << *in << dendl;
>      }
>
> @@ -4099,7 +4105,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set
> 	if (dn->is_replica(from) &&
> 	    (ack->strong_dentries.count(dir->dirfrag()) == 0 ||
> 	     ack->strong_dentries[dir->dirfrag()].count(string_snap_t(dn->name, dn->last)) == 0)) {
> -	  dentry_remove_replica(dn, from);
> +	  dentry_remove_replica(dn, from, gather_locks);
> 	  dout(10) << " rem " << *dn << dendl;
> 	}
>      }
> @@ -6189,6 +6195,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
>      return;
>    }
>
> +  set<SimpleLock *> gather_locks;
>    // loop over realms
>    for (map<dirfrag_t,MCacheExpire::realm>::iterator p = m->realms.begin();
>         p != m->realms.end();
> @@ -6255,7 +6262,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
> 	// remove from our cached_by
> 	dout(7) << " inode expire on " << *in << " from mds." << from
> 		<< " cached_by was " << in->get_replicas() << dendl;
> -	inode_remove_replica(in, from);
> +	inode_remove_replica(in, from, gather_locks);
>      }
>      else {
> 	// this is an old nonce, ignore expire.
> @@ -6332,7 +6339,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
> 	if (nonce == dn->get_replica_nonce(from)) {
> 	  dout(7) << " dentry_expire on " << *dn << " from mds." << from << dendl;
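The shape of this fix -- collect the affected locks, send the ack, then
run the deferred evaluations -- can be sketched generically (toy types
below, not the real Locker/SimpleLock API):

    #include <set>
    #include <functional>

    struct Lock;  // opaque stand-in for a lock object

    // Removing a replica may make a lock eligible for gather-completion,
    // but evaluating it immediately could change lock states while the
    // rejoin ack is still being composed.  So callers only record it.
    void remove_replica(Lock *lock, std::set<Lock*> &gather_locks) {
      // ... drop the replica here; do NOT eval ...
      gather_locks.insert(lock);
    }

    // Once the ack (a frozen snapshot of lock states) is on the wire,
    // it is safe to wake waiters.
    void send_ack_then_eval(const std::set<Lock*> &gather_locks,
                            const std::function<void()> &send_ack,
                            const std::function<void(Lock*)> &eval_gather) {
      send_ack();
      for (Lock *l : gather_locks)
        eval_gather(l);
    }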
Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Hey folks,

We were hoping this one was fixed.  I upgraded all my nodes to the
latest bobtail branch, but still hit this today:

osdc/ObjectCacher.cc: In function 'void ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t, tid_t, int)' thread 7f650e62f700 time 2013-03-20 19:34:39.952616
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
 ceph version 0.56.3-42-ga30903c (a30903c6adaa023587d3147179d6038ad37ca520)
 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long, unsigned long, int)+0xd68) [0x7f651d0ada48]
 2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f651d0b460b]
 3: (Context::complete(int)+0xa) [0x7f651d06c9fa]
 4: (librbd::C_Request::finish(int)+0x85) [0x7f651d09c315]
 5: (Context::complete(int)+0xa) [0x7f651d06c9fa]
 6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f651d081387]
 7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f651c43163d]
 8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f651c49c920]
 9: (()+0x7e9a) [0x7f6519cffe9a]
 10: (clone()+0x6d) [0x7f6519a2bcbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Is this occurring in librbd caching?  If so, I could disable it for the
time being.

First saw this mentioned on-list here:
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13577

Will be happy to provide anything I can for this one -- definitely
critical for my use case.  It happens with about 10% of the VMs I
create.  Always within the first 60 seconds of the VM booting and being
network accessible.

 - Travis
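For context, a simplified model of the invariant that assert guards (toy
code, not the actual ObjectCacher): commit acks for a given object are
expected to carry strictly increasing tids, so the assert fires when an
ack arrives with a tid at or below the last committed one (a duplicate
or reordered ack):

    #include <cstdint>
    #include <cassert>

    struct CachedObject {
      uint64_t last_commit_tid = 0;

      // Commit callback for one object: tids must be strictly increasing.
      void bh_write_commit(uint64_t tid) {
        assert(last_commit_tid < tid);  // what failed in the report above
        last_commit_tid = tid;
      }
    };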
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Hi,

strange, i've never seen this. Which qemu version?

Stefan

On 20.03.2013 20:49, Travis Rhoden wrote:
> Hey folks,
>
> We were hoping this one was fixed.  I upgraded all my nodes to the
> latest bobtail branch, but still hit this today:
>
> osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
>
> Is this occurring in librbd caching?  If so, I could disable it for the
> time being.
>
> Will be happy to provide anything I can for this one -- definitely
> critical for my use case.  It happens with about 10% of the VMs I
> create.  Always within the first 60 seconds of the VM booting and being
> network accessible.
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Travis, are you using format 1 or 2 images?  I've seen the same behavior
on format 2 images using cloned snapshots, but haven't run into this
issue on any normal format 2 images.

----- Original Message -----
From: Travis Rhoden trho...@gmail.com
To: ceph-devel ceph-devel@vger.kernel.org
Sent: Wednesday, March 20, 2013 3:49:23 PM
Subject: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

Hey folks,

We were hoping this one was fixed.  I upgraded all my nodes to the
latest bobtail branch, but still hit this today:

osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)

Will be happy to provide anything I can for this one -- definitely
critical for my use case.  It happens with about 10% of the VMs I
create.  Always within the first 60 seconds of the VM booting and being
network accessible.

- Travis
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Hello,

> Travis, are you using format 1 or 2 images?  I've seen the same
> behavior on format 2 images using cloned snapshots, but haven't run
> into this issue on any normal format 2 images.

In this case, they are format 2.  And they are from cloned snapshots.
Exactly like the following:

    # rbd ls -l -p volumes
    NAME                                         SIZE  PARENT                                            FMT PROT LOCK
    volume-099a6d74-05bd-4f00-a12e-009d60629aa8  5120M images/b8bdda90-664b-4906-86d6-dd33735441f2@snap  2

I'm doing an OpenStack boot-from-volume setup.

> strange, i've never seen this. Which qemu version?

    # qemu-x86_64 -version
    qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard

That's coming from the Ubuntu 12.04 apt repos.

 - Travis

On Wed, Mar 20, 2013 at 3:53 PM, Stefan Priebe s.pri...@profihost.ag wrote:
> Hi,
>
> strange, i've never seen this. Which qemu version?
>
> Stefan
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Hi,

In this case, they are format 2. And they are from cloned snapshots. Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE PARENT FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M images/b8bdda90-664b-4906-86d6-dd33735441f2@snap 2

I'm doing an OpenStack boot-from-volume setup.

OK, I've never used cloned snapshots, so maybe that is the reason.

Strange, I've never seen this. Which qemu version?

# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard

that's coming from the Ubuntu 12.04 apt repos.

Maybe you should try qemu 1.4; there are a LOT of bugfixes. qemu-kvm does not exist anymore, it was merged into qemu with 1.3 or 1.4.

Stefan
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
On Wed, Mar 20, 2013 at 4:14 PM, Stefan Priebe s.pri...@profihost.ag wrote:
Hi,
[...]
Maybe you should try qemu 1.4; there are a LOT of bugfixes. qemu-kvm does not exist anymore, it was merged into qemu with 1.3 or 1.4.

Since the crash is in librbd, would an update of qemu help anything?

Stefan
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
On 03/20/2013 01:14 PM, Stefan Priebe wrote:
Hi,
[...]
Maybe you should try qemu 1.4; there are a LOT of bugfixes. qemu-kvm does not exist anymore, it was merged into qemu with 1.3 or 1.4.

This particular problem won't be solved by upgrading qemu. It's a ceph bug. Disabling caching would work around the issue.

Travis, could you get a log from qemu of this happening with:

debug ms = 20
debug objectcacher = 20
debug rbd = 20
log file = /path/writeable/by/qemu

From those we can tell whether the issue is on the client side at least, and hopefully what's causing it.

Thanks!
Josh
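For anyone else wanting to capture the same data: assuming the guest's librbd reads the standard /etc/ceph/ceph.conf on the hypervisor, Josh's settings can go in a [client] section like the sketch below. The path is only an example; it must be writeable by the user the qemu process runs as, and with several VMs on one host you will want distinct paths per VM.

  [client]
      debug ms = 20
      debug objectcacher = 20
      debug rbd = 20
      log file = /var/log/ceph/qemu-rbd.log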
Re: CephFS: stable release?
On Wednesday, March 20, 2013 at 1:22 PM, Pascal wrote:
On Sun, 24 Feb 2013 14:41:27 -0800, Gregory Farnum g...@inktank.com (mailto:g...@inktank.com) wrote:
On Saturday, February 23, 2013 at 2:14 AM, Gandalf Corvotempesta wrote:
Hi all, do you have an ETA about a stable release (or something usable in production) for CephFS?

Short answer: no. However, we do have a team of people working on the FS again as of a month or so ago. We're doing a lot of stabilization (bug fixes), code cleanups, and utility work in the coming months; we can estimate the utility and cleanup work but not the bugs that we'll find, and those are our main concern right now. Depending on how the next couple months of QA and bug fixing go we should be able to publicize real estimates soonish.
-Greg

Hello Gregory, is your response still up-to-date? The FAQ (http://ceph.com/docs/master/faq/) says: Ceph’s object store (RADOS) is production ready.

We'll put out some blog posts and emails when we have anything more to report. :) RADOS is ready, but CephFS is a whole separate layer above it.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
On 03/20/2013 01:19 PM, Josh Durgin wrote:
[...]
Travis, could you get a log from qemu of this happening with:

debug ms = 20
debug objectcacher = 20
debug rbd = 20
log file = /path/writeable/by/qemu

If it doesn't reproduce with those settings, try changing debug ms to 1 instead of 20.

From those we can tell whether the issue is on the client side at least, and hopefully what's causing it.

Thanks!
Josh
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Thanks Josh. I will respond when I have something useful!

On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin josh.dur...@inktank.com wrote:
[...]
Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request
On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
From: Yan, Zheng zheng.z@intel.com

Replicated objects need to be added into the cache immediately.

Signed-off-by: Yan, Zheng zheng.z@intel.com

Why do we need to add them right away? Shouldn't we have a journaled replica if we need it?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

---
src/mds/MDCache.cc | 12
src/mds/MDCache.h | 2 +-
src/mds/MDS.cc | 6 +++---
src/mds/Server.cc | 55 +++---
4 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 0f6b842..b668842 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context *fin, bufferlist& bl, int r)
 /* */

+int MDCache::get_num_client_requests()
+{
+  int count = 0;
+  for (hash_map<metareqid_t, MDRequest*>::iterator p = active_requests.begin();
+       p != active_requests.end();
+       ++p) {
+    if (p->second->reqid.name.is_client() && !p->second->is_slave())
+      count++;
+  }
+  return count;
+}
+
 /* This function takes over the reference to the passed Message */
 MDRequest *MDCache::request_start(MClientRequest *req)
 {
diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index a9f05c6..4634121 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -240,7 +240,7 @@ protected:
   hash_map<metareqid_t, MDRequest*> active_requests;

 public:
-  int get_num_active_requests() { return active_requests.size(); }
+  int get_num_client_requests();

   MDRequest* request_start(MClientRequest *req);
   MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);
diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index b91dcbd..e99eecc 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
       mdcache->is_open() &&
       replay_queue.empty() &&
       want_state == MDSMap::STATE_CLIENTREPLAY) {
-    dout(10) << " still have " << mdcache->get_num_active_requests()
-            << " active replay requests" << dendl;
-    if (mdcache->get_num_active_requests() == 0)
+    int num_requests = mdcache->get_num_client_requests();
+    dout(10) << " still have " << num_requests << " active replay requests" << dendl;
+    if (num_requests == 0)
       clientreplay_done();
   }
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4c4c86b..8e89e4c 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
       (m->get_type() == CEPH_MSG_CLIENT_REQUEST &&
        (static_cast<MClientRequest*>(m))->is_replay()))) {
     // replaying!
-  } else if (mds->is_clientreplay() && m->get_type() == MSG_MDS_SLAVE_REQUEST &&
-            ((static_cast<MMDSSlaveRequest*>(m))->is_reply() ||
-             !mds->mdsmap->is_active(m->get_source().num()))) {
-    // slave reply or the master is also in the clientreplay stage
+  } else if (m->get_type() == MSG_MDS_SLAVE_REQUEST) {
+    // handle_slave_request() will wait if necessary
   } else {
     dout(3) << "not active yet, waiting" << dendl;
     mds->wait_for_active(new C_MDS_RetryMessage(mds, m));
@@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
   if (m->is_reply())
     return handle_slave_request_reply(m);

+  CDentry *straydn = NULL;
+  if (m->stray.length() > 0) {
+    straydn = mdcache->add_replica_stray(m->stray, from);
+    assert(straydn);
+    m->stray.clear();
+  }
+
   // am i a new slave?
   MDRequest *mdr = NULL;
   if (mdcache->have_request(m->get_reqid())) {
@@ -1326,9 +1331,26 @@
       m->put();
       return;
     }
-    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), m->get_source().num());
+    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), from);
   }
   assert(mdr->slave_request == 0);     // only one at a time, please!
+
+  if (straydn) {
+    mdr->pin(straydn);
+    mdr->straydn = straydn;
+  }
+
+  if (!mds->is_clientreplay() && !mds->is_active() && !mds->is_stopping()) {
+    dout(3) << "not clientreplay|active yet, waiting" << dendl;
+    mds->wait_for_replay(new C_MDS_RetryMessage(mds, m));
+    return;
+  } else if (mds->is_clientreplay() && !mds->mdsmap->is_clientreplay(from) &&
+            mdr->locks.empty()) {
+    dout(3) << "not active yet, waiting" << dendl;
+    mds->wait_for_active(new C_MDS_RetryMessage(mds, m));
+    return;
+  }
+
   mdr->slave_request = m;

   dispatch_slave_request(mdr);
@@ -1339,6 +1361,12 @@ void Server::handle_slave_request_reply(MMDSSlaveRequest *m)
 {
   int from = m->get_source().num();

+  if (!mds->is_clientreplay() && !mds->is_active() && !mds->is_stopping()) {
+    dout(3) << "not clientreplay|active yet, waiting" << dendl;
+    mds->wait_for_replay(new C_MDS_RetryMessage(mds, m));
+    return;
+  }
+
   if (m->get_op() == MMDSSlaveRequest::OP_COMMITTED) {
     metareqid_t r = m->get_reqid();
     mds->mdcache->committed_master_slave(r, from);
@@ -5138,10 +5166,8 @@ void Server::handle_slave_rmdir_prep(MDRequest *mdr)
   dout(10) << " dn " << *dn << dendl;
   mdr->pin(dn);

-  assert(mdr->slave_request->stray.length()
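All of the waiting added by this patch follows one pattern: if the MDS is not yet in a state where the message can be processed, wrap the message in a retry context, park it on a wait list, and redispatch when the state machine advances. A stripped-down sketch of that pattern (illustrative types only, not the real C_MDS_RetryMessage or MDS definitions):

  #include <list>

  struct Message;

  struct Context {
    virtual ~Context() {}
    virtual void finish(int r) = 0;  // called when the awaited event fires
  };

  struct RetryMessage : public Context {
    void (*dispatch)(Message*);  // re-entry point into the dispatch path
    Message *m;
    RetryMessage(void (*d)(Message*), Message *msg) : dispatch(d), m(msg) {}
    void finish(int) { dispatch(m); }  // feed the message back in unchanged
  };

  // drained (finish(0) on each entry) when the MDS reaches the target state
  std::list<Context*> waiting_for_active;

  void wait_for_active(Context *c) {
    waiting_for_active.push_back(c);
  }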
Re: [PATCH 12/39] mds: compose and send resolve messages in batch
Reviewed-by: Greg Farnum g...@inktank.com
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

Resolve messages for all MDS are the same, so we can compose and send them in batch.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 181 +
src/mds/MDCache.h | 11 ++--
2 files changed, 93 insertions(+), 99 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index b668842..c455a20 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2432,10 +2432,6 @@ void MDCache::resolve_start()
     if (rootdir)
       adjust_subtree_auth(rootdir, CDIR_AUTH_UNKNOWN);
   }
-
-  for (map<int, map<metareqid_t, MDSlaveUpdate*> >::iterator p = uncommitted_slave_updates.begin();
-       p != uncommitted_slave_updates.end(); ++p)
-    need_resolve_ack.insert(p->first);
 }

 void MDCache::send_resolves()
@@ -2444,9 +2440,10 @@ void MDCache::send_resolves()
   got_resolve.clear();
   other_ambiguous_imports.clear();

-  if (!need_resolve_ack.empty()) {
-    for (set<int>::iterator p = need_resolve_ack.begin(); p != need_resolve_ack.end(); ++p)
-      send_slave_resolve(*p);
+  send_slave_resolves();
+  if (!resolve_ack_gather.empty()) {
+    dout(10) << "send_resolves still waiting for resolve ack from ("
+            << need_resolve_ack << ")" << dendl;
     return;
   }
   if (!need_resolve_rollback.empty()) {
@@ -2454,95 +2451,74 @@
            << need_resolve_rollback << ")" << dendl;
     return;
   }
-  assert(uncommitted_slave_updates.empty());
-  for (set<int>::iterator p = recovery_set.begin(); p != recovery_set.end(); ++p) {
-    int who = *p;
-    if (who == mds->whoami)
-      continue;
-    if (migrator->is_importing() ||
-       migrator->is_exporting())
-      send_resolve_later(who);
-    else
-      send_resolve_now(who);
-  }
-}
-
-void MDCache::send_resolve_later(int who)
-{
-  dout(10) << "send_resolve_later to mds." << who << dendl;
-  wants_resolve.insert(who);
+  send_subtree_resolves();
 }

-void MDCache::maybe_send_pending_resolves()
+void MDCache::send_slave_resolves()
 {
-  if (wants_resolve.empty())
-    return;  // nothing to send.
-
-  // only if it's appropriate!
-  if (migrator->is_exporting() ||
-      migrator->is_importing()) {
-    dout(7) << "maybe_send_pending_resolves waiting, imports/exports still in progress" << dendl;
-    migrator->show_importing();
-    migrator->show_exporting();
-    return;  // not now
-  }
-
-  // ok, send them.
-  for (set<int>::iterator p = wants_resolve.begin();
-       p != wants_resolve.end();
-       ++p)
-    send_resolve_now(*p);
-  wants_resolve.clear();
-}
+  dout(10) << "send_slave_resolves" << dendl;
+  map<int, MMDSResolve*> resolves;

-class C_MDC_SendResolve : public Context {
-  MDCache *mdc;
-  int who;
-public:
-  C_MDC_SendResolve(MDCache *c, int w) : mdc(c), who(w) { }
-  void finish(int r) {
-    mdc->send_resolve_now(who);
-  }
-};
-
-void MDCache::send_slave_resolve(int who)
-{
-  dout(10) << "send_slave_resolve to mds." << who << dendl;
-  MMDSResolve *m = new MMDSResolve;
-
-  // list prepare requests lacking a commit
-  // [active survivor]
-  for (hash_map<metareqid_t, MDRequest*>::iterator p = active_requests.begin();
-       p != active_requests.end();
-       ++p) {
-    if (p->second->is_slave() && p->second->slave_to_mds == who) {
-      dout(10) << " including uncommitted " << *p->second << dendl;
-      m->add_slave_request(p->first);
+  if (mds->is_resolve()) {
+    for (map<int, map<metareqid_t, MDSlaveUpdate*> >::iterator p = uncommitted_slave_updates.begin();
+        p != uncommitted_slave_updates.end();
+        ++p) {
+      resolves[p->first] = new MMDSResolve;
+      for (map<metareqid_t, MDSlaveUpdate*>::iterator q = p->second.begin();
+          q != p->second.end();
+          ++q) {
+       dout(10) << " including uncommitted " << q->first << dendl;
+       resolves[p->first]->add_slave_request(q->first);
+      }
     }
-  }
-  // [resolving]
-  if (uncommitted_slave_updates.count(who) &&
-      !uncommitted_slave_updates[who].empty()) {
-    for (map<metareqid_t, MDSlaveUpdate*>::iterator p = uncommitted_slave_updates[who].begin();
-        p != uncommitted_slave_updates[who].end();
-        ++p) {
-      dout(10) << " including uncommitted " << p->first << dendl;
-      m->add_slave_request(p->first);
+  } else {
+    set<int> resolve_set;
+    mds->mdsmap->get_mds_set(resolve_set, MDSMap::STATE_RESOLVE);
+    for (hash_map<metareqid_t, MDRequest*>::iterator p = active_requests.begin();
+        p != active_requests.end();
+        ++p) {
+      if (!p->second->is_slave() || !p->second->slave_did_prepare())
+       continue;
+      int master = p->second->slave_to_mds;
+      if (resolve_set.count(master)) {
+       dout(10) << " including uncommitted
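The batching pattern here is simply: build one message per recovering peer in a map keyed by MDS rank, fill all of them in a single pass over the shared state, then send. Schematically (types simplified; the message payload is elided):

  #include <map>
  #include <set>

  struct MMDSResolveSketch { /* payload elided */ };

  void send_subtree_resolves_sketch(const std::set<int>& recovery_set, int whoami) {
    std::map<int, MMDSResolveSketch*> resolves;
    for (std::set<int>::const_iterator p = recovery_set.begin();
         p != recovery_set.end(); ++p) {
      if (*p == whoami)
        continue;
      resolves[*p] = new MMDSResolveSketch;  // one message per peer
    }
    // ... populate every message from the same walk over the subtree map ...
    for (std::map<int, MMDSResolveSketch*>::iterator p = resolves.begin();
         p != resolves.end(); ++p) {
      // send_message_mds(p->second, p->first) in the real code
    }
  }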
Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
Didn't take long to re-create with the detailed debugging (ms = 20). I'm sending Josh a link to the gzip'd log off-list; I'm not sure if the log will contain any CephX keys or anything like that.

On Wed, Mar 20, 2013 at 4:39 PM, Travis Rhoden trho...@gmail.com wrote:
Thanks Josh. I will respond when I have something useful!
[...]
Re: [PATCH 13/39] mds: don't send resolve message between active MDS
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

When the MDS cluster is resolving, the current behavior is to send a subtree resolve message to all other MDS and wait for all other MDS' resolve messages. The problem is that active MDS can have different subtree maps due to rename. Besides, gathering active MDS' resolve messages is also racy. The only function of these messages is to disambiguate other MDS' imports. We can replace it by an import finish notification.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 12 +---
src/mds/Migrator.cc | 25 +++--
src/mds/Migrator.h | 3 ++-
3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index c455a20..73c1d59 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2517,7 +2517,8 @@ void MDCache::send_subtree_resolves()
        ++p) {
     if (*p == mds->whoami)
       continue;
-    resolves[*p] = new MMDSResolve;
+    if (mds->is_resolve() || mds->mdsmap->is_resolve(*p))
+      resolves[*p] = new MMDSResolve;
   }

   // known
@@ -2837,7 +2838,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
        migrator->import_reverse(dir);
       } else {
        dout(7) << "ambiguous import succeeded on " << *dir << dendl;
-       migrator->import_finish(dir);
+       migrator->import_finish(dir, true);
       }
       my_ambiguous_imports.erase(p);  // no longer ambiguous.
     }
@@ -3432,7 +3433,12 @@ void MDCache::rejoin_send_rejoins()
        ++p) {
     CDir *dir = p->first;
     assert(dir->is_subtree_root());
-    assert(!dir->is_ambiguous_dir_auth());
+    if (dir->is_ambiguous_dir_auth()) {
+      // exporter is recovering, importer is survivor.

The importer has to be the MDS this code is running on, right?

+      assert(rejoins.count(dir->authority().first));
+      assert(!rejoins.count(dir->authority().second));
+      continue;
+    }

     // my subtree?
     if (dir->is_auth())
diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 5e53803..833df12 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2088,6 +2088,23 @@ void Migrator::import_reverse(CDir *dir)
   }
 }

+void Migrator::import_notify_finish(CDir *dir, set<CDir*>& bounds)
+{
+  dout(7) << "import_notify_finish " << *dir << dendl;
+
+  for (set<int>::iterator p = import_bystanders[dir].begin();
+       p != import_bystanders[dir].end();
+       ++p) {
+    MExportDirNotify *notify =
+      new MExportDirNotify(dir->dirfrag(), false,
+                          pair<int,int>(import_peer[dir->dirfrag()], mds->get_nodeid()),
+                          pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN));

I don't think this is quite right — we're notifying them that we've just finished importing data from somebody, right? And so we know that we're the auth node...

+    for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); i++)
+      notify->get_bounds().push_back((*i)->dirfrag());
+    mds->send_message_mds(notify, *p);
+  }
+}
+
 void Migrator::import_notify_abort(CDir *dir, set<CDir*>& bounds)
 {
   dout(7) << "import_notify_abort " << *dir << dendl;
@@ -2183,11 +2200,11 @@ void Migrator::handle_export_finish(MExportDirFinish *m)
   CDir *dir = cache->get_dirfrag(m->get_dirfrag());
   assert(dir);
   dout(7) << "handle_export_finish on " << *dir << dendl;
-  import_finish(dir);
+  import_finish(dir, false);
   m->put();
 }

-void Migrator::import_finish(CDir *dir)
+void Migrator::import_finish(CDir *dir, bool notify)
 {
   dout(7) << "import_finish on " << *dir << dendl;

@@ -2205,6 +2222,10 @@ void Migrator::import_finish(CDir *dir)
   // remove pins
   set<CDir*> bounds;
   cache->get_subtree_bounds(dir, bounds);
+
+  if (notify)
+    import_notify_finish(dir, bounds);
+
   import_remove_pins(dir, bounds);

   map<CInode*, map<client_t,Capability::Export> > cap_imports;
diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
index 7988f32..2889a74 100644
--- a/src/mds/Migrator.h
+++ b/src/mds/Migrator.h
@@ -273,12 +273,13 @@ protected:
   void import_reverse_unfreeze(CDir *dir);
   void import_reverse_final(CDir *dir);
   void import_notify_abort(CDir *dir, set<CDir*>& bounds);
+  void import_notify_finish(CDir *dir, set<CDir*>& bounds);
   void import_logged_start(dirfrag_t df, CDir *dir, int from,
                           map<client_t,entity_inst_t>& imported_client_map,
                           map<client_t,uint64_t>& sseqmap);
   void handle_export_finish(MExportDirFinish *m);
 public:
-  void import_finish(CDir *dir);
+  void import_finish(CDir *dir, bool notify);
 protected:
   void handle_export_caps(MExportCaps *m);
--
1.7.11.7
Re: [PATCH 14/39] mds: set resolve/rejoin gather MDS set in advance
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

For active MDS, it may receive resolve/resolve message before receiving

resolve/rejoin, maybe? Other than that, Reviewed-by: Greg Farnum g...@inktank.com

the mdsmap message that claims the MDS cluster is in resolving/rejoining state. So instead of setting the gather MDS sets when receiving the mdsmap, set them in advance when detecting an MDS failure.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 41 +++--
src/mds/MDCache.h | 5 ++---
2 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 73c1d59..69db1dd 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2432,18 +2432,17 @@ void MDCache::resolve_start()
     if (rootdir)
       adjust_subtree_auth(rootdir, CDIR_AUTH_UNKNOWN);
   }
+  resolve_gather = recovery_set;
+  resolve_gather.erase(mds->get_nodeid());
+  rejoin_gather = resolve_gather;
 }

 void MDCache::send_resolves()
 {
-  // reset resolve state
-  got_resolve.clear();
-  other_ambiguous_imports.clear();
-
   send_slave_resolves();
   if (!resolve_ack_gather.empty()) {
     dout(10) << "send_resolves still waiting for resolve ack from ("
-            << need_resolve_ack << ")" << dendl;
+            << resolve_ack_gather << ")" << dendl;
     return;
   }
   if (!need_resolve_rollback.empty()) {
@@ -2495,7 +2494,7 @@ void MDCache::send_slave_resolves()
        ++p) {
     dout(10) << "sending slave resolve to mds." << p->first << dendl;
     mds->send_message_mds(p->second, p->first);
-    need_resolve_ack.insert(p->first);
+    resolve_ack_gather.insert(p->first);
   }
 }
@@ -2598,16 +2597,15 @@ void MDCache::handle_mds_failure(int who)
   recovery_set.erase(mds->get_nodeid());
   dout(1) << "handle_mds_failure mds." << who << " : recovery peers are " << recovery_set << dendl;

-  // adjust my recovery lists
-  wants_resolve.erase(who);  // MDS will ask again
-  got_resolve.erase(who);    // i'll get another.
+  resolve_gather.insert(who);
   discard_delayed_resolve(who);

+  rejoin_gather.insert(who);
   rejoin_sent.erase(who);        // i need to send another
   rejoin_ack_gather.erase(who);  // i'll need/get another.

-  dout(10) << " wants_resolve " << wants_resolve << dendl;
-  dout(10) << " got_resolve " << got_resolve << dendl;
+  dout(10) << " resolve_gather " << resolve_gather << dendl;
+  dout(10) << " resolve_ack_gather " << resolve_ack_gather << dendl;
   dout(10) << " rejoin_sent " << rejoin_sent << dendl;
   dout(10) << " rejoin_gather " << rejoin_gather << dendl;
   dout(10) << " rejoin_ack_gather " << rejoin_ack_gather << dendl;
@@ -2788,7 +2786,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
     return;
   }

-  if (!need_resolve_ack.empty() || !need_resolve_rollback.empty()) {
+  if (!resolve_ack_gather.empty() || !need_resolve_rollback.empty()) {
     dout(10) << "delay processing subtree resolve" << dendl;
     discard_delayed_resolve(from);
     delayed_resolve[from] = m;
@@ -2875,7 +2873,7 @@
   }

   // did i get them all?
-  got_resolve.insert(from);
+  resolve_gather.erase(from);

   maybe_resolve_finish();

@@ -2901,12 +2899,12 @@ void MDCache::discard_delayed_resolve(int who)

 void MDCache::maybe_resolve_finish()
 {
-  assert(need_resolve_ack.empty());
+  assert(resolve_ack_gather.empty());
   assert(need_resolve_rollback.empty());

-  if (got_resolve != recovery_set) {
-    dout(10) << "maybe_resolve_finish still waiting for more resolves, got ("
-            << got_resolve << "), need (" << recovery_set << ")" << dendl;
+  if (!resolve_gather.empty()) {
+    dout(10) << "maybe_resolve_finish still waiting for resolves ("
+            << resolve_gather << ")" << dendl;
     return;
   } else {
     dout(10) << "maybe_resolve_finish got all resolves+resolve_acks, done." << dendl;
@@ -2926,7 +2924,7 @@ void MDCache::handle_resolve_ack(MMDSResolveAck *ack)
   dout(10) << "handle_resolve_ack " << *ack << " from " << ack->get_source() << dendl;
   int from = ack->get_source().num();

-  if (!need_resolve_ack.count(from)) {
+  if (!resolve_ack_gather.count(from)) {
     ack->put();
     return;
   }
@@ -3001,8 +2999,8 @@
       assert(p->second->slave_to_mds != from);
   }

-  need_resolve_ack.erase(from);
-  if (need_resolve_ack.empty() && need_resolve_rollback.empty()) {
+  resolve_ack_gather.erase(from);
+  if (resolve_ack_gather.empty() && need_resolve_rollback.empty()) {
     send_subtree_resolves();
     process_delayed_resolve();
   }
@@ -3069,7 +3067,7 @@ void MDCache::finish_rollback(metareqid_t reqid) {
   if (mds->is_resolve())
     finish_uncommitted_slave_update(reqid, need_resolve_rollback[reqid]);
   need_resolve_rollback.erase(reqid);
-  if (need_resolve_ack.empty() && need_resolve_rollback.empty()) {
+  if (resolve_ack_gather.empty() && need_resolve_rollback.empty()) {
Re: [PATCH 15/39] mds: don't send MDentry{Link,Unlink} before receiving cache rejoin
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it receives the cache rejoin message. The function will remove the objects replicated by MDentry{Link,Unlink} from the replica map.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 13 ++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 69db1dd..f102205 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3893,6 +3893,8 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
     }
   }

+  assert(rejoin_gather.count(from));
+  rejoin_gather.erase(from);
   if (survivor) {
     // survivor.  do everything now.
     for (map<inodeno_t,MMDSCacheRejoin::lock_bls>::iterator p = weak->inode_scatterlocks.begin();
@@ -3911,8 +3913,6 @@
       mds->locker->eval_gather(*p);
   } else {
     // done?
-    assert(rejoin_gather.count(from));
-    rejoin_gather.erase(from);
     if (rejoin_gather.empty()) {
       rejoin_gather_finish();
     } else {
@@ -9582,7 +9582,9 @@ void MDCache::send_dentry_link(CDentry *dn)
   for (map<int,int>::iterator p = dn->replicas_begin();
        p != dn->replicas_end();
        ++p) {
-    if (mds->mdsmap->get_state(p->first) < MDSMap::STATE_REJOIN)
+    if (mds->mdsmap->get_state(p->first) < MDSMap::STATE_REJOIN ||
+       (mds->mdsmap->get_state(p->first) == MDSMap::STATE_REJOIN &&
+        rejoin_gather.count(p->first)))
       continue;
     CDentry::linkage_t *dnl = dn->get_linkage();
     MDentryLink *m = new MDentryLink(subtree->dirfrag(), dn->get_dir()->dirfrag(),
@@ -9668,6 +9670,11 @@ void MDCache::send_dentry_unlink(CDentry *dn, CDentry *straydn, MDRequest *mdr)
     if (mdr && mdr->more()->witnessed.count(it->first))
       continue;

+    if (mds->mdsmap->get_state(it->first) < MDSMap::STATE_REJOIN ||
+       (mds->mdsmap->get_state(it->first) == MDSMap::STATE_REJOIN &&
+        rejoin_gather.count(it->first)))
+      continue;
+
     MDentryUnlink *unlink = new MDentryUnlink(dn->get_dir()->dirfrag(), dn->name);
     if (straydn)
       replicate_stray(straydn, it->first, unlink->straybl);
--
1.7.11.7
Re: [PATCH 10/39] mds: unify slave request waiting
Much simpler! Reviewed-by: Sage Weil s...@inktank.com

On Sun, 17 Mar 2013, Yan, Zheng wrote:
From: Yan, Zheng zheng.z@intel.com

When requesting remote xlock or remote wrlock, the master request is put into the lock object's REMOTEXLOCK waiting queue. The problem is that remote wrlock's target can be different from the lock's auth MDS. When the lock's auth MDS recovers, MDCache::handle_mds_recovery() may wake the incorrect request. So just unify slave request waiting: dispatch the master request when receiving the slave request reply.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Locker.cc | 49 ++---
src/mds/Server.cc | 12 ++--
2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index d06a9cc..0055a19 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -544,8 +544,6 @@ void Locker::cancel_locking(Mutation *mut, set<CInode*> *pneed_issue)
       if (need_issue)
        pneed_issue->insert(static_cast<CInode *>(lock->get_parent()));
     }
-  } else {
-    lock->finish_waiters(SimpleLock::WAIT_REMOTEXLOCK);
   }
   mut->finish_locking(lock);
 }
@@ -1326,18 +1324,16 @@ void Locker::remote_wrlock_start(SimpleLock *lock, int target, MDRequest *mut)
   }

   // send lock request
-  if (!lock->is_waiter_for(SimpleLock::WAIT_REMOTEXLOCK)) {
-    mut->start_locking(lock, target);
-    mut->more()->slaves.insert(target);
-    MMDSSlaveRequest *r = new MMDSSlaveRequest(mut->reqid, mut->attempt,
-                                              MMDSSlaveRequest::OP_WRLOCK);
-    r->set_lock_type(lock->get_type());
-    lock->get_parent()->set_object_info(r->get_object_info());
-    mds->send_message_mds(r, target);
-  }
-
-  // wait
-  lock->add_waiter(SimpleLock::WAIT_REMOTEXLOCK, new C_MDS_RetryRequest(mdcache, mut));
+  mut->start_locking(lock, target);
+  mut->more()->slaves.insert(target);
+  MMDSSlaveRequest *r = new MMDSSlaveRequest(mut->reqid, mut->attempt,
+                                            MMDSSlaveRequest::OP_WRLOCK);
+  r->set_lock_type(lock->get_type());
+  lock->get_parent()->set_object_info(r->get_object_info());
+  mds->send_message_mds(r, target);
+
+  assert(mut->more()->waiting_on_slave.count(target) == 0);
+  mut->more()->waiting_on_slave.insert(target);
 }

 void Locker::remote_wrlock_finish(SimpleLock *lock, int target, Mutation *mut)
@@ -1411,19 +1407,18 @@ bool Locker::xlock_start(SimpleLock *lock, MDRequest *mut)
     }

     // send lock request
-    if (!lock->is_waiter_for(SimpleLock::WAIT_REMOTEXLOCK)) {
-      int auth = lock->get_parent()->authority().first;
-      mut->more()->slaves.insert(auth);
-      mut->start_locking(lock, auth);
-      MMDSSlaveRequest *r = new MMDSSlaveRequest(mut->reqid, mut->attempt,
-                                                MMDSSlaveRequest::OP_XLOCK);
-      r->set_lock_type(lock->get_type());
-      lock->get_parent()->set_object_info(r->get_object_info());
-      mds->send_message_mds(r, auth);
-    }
-
-    // wait
-    lock->add_waiter(SimpleLock::WAIT_REMOTEXLOCK, new C_MDS_RetryRequest(mdcache, mut));
+    int auth = lock->get_parent()->authority().first;
+    mut->more()->slaves.insert(auth);
+    mut->start_locking(lock, auth);
+    MMDSSlaveRequest *r = new MMDSSlaveRequest(mut->reqid, mut->attempt,
+                                              MMDSSlaveRequest::OP_XLOCK);
+    r->set_lock_type(lock->get_type());
+    lock->get_parent()->set_object_info(r->get_object_info());
+    mds->send_message_mds(r, auth);
+
+    assert(mut->more()->waiting_on_slave.count(auth) == 0);
+    mut->more()->waiting_on_slave.insert(auth);
+
     return false;
   }
 }
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 6d0519f..4c4c86b 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -1371,7 +1371,11 @@ void Server::handle_slave_request_reply(MMDSSlaveRequest *m)
       mdr->locks.insert(lock);
       mdr->finish_locking(lock);
       lock->get_xlock(mdr, mdr->get_client());
-      lock->finish_waiters(SimpleLock::WAIT_REMOTEXLOCK);
+
+      assert(mdr->more()->waiting_on_slave.count(from));
+      mdr->more()->waiting_on_slave.erase(from);
+      assert(mdr->more()->waiting_on_slave.empty());
+      dispatch_client_request(mdr);
     }
     break;
@@ -1385,7 +1389,11 @@
       mdr->remote_wrlocks[lock] = from;
       mdr->locks.insert(lock);
       mdr->finish_locking(lock);
-      lock->finish_waiters(SimpleLock::WAIT_REMOTEXLOCK);
+
+      assert(mdr->more()->waiting_on_slave.count(from));
+      mdr->more()->waiting_on_slave.erase(from);
+      assert(mdr->more()->waiting_on_slave.empty());
+      dispatch_client_request(mdr);
     }
     break;
--
1.7.11.7
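The bookkeeping this patch converges on is a single per-request set of slave MDS ranks, with one invariant: at most one outstanding slave op per target, and the master request is redispatched only when a reply empties the set. Schematically (simplified, not the actual MDRequest definition):

  #include <cassert>
  #include <set>

  struct MDRequestSketch {
    std::set<int> waiting_on_slave;  // ranks we still expect a reply from
  };

  void send_remote_lock(MDRequestSketch *mdr, int target) {
    assert(mdr->waiting_on_slave.count(target) == 0);  // one op per target
    mdr->waiting_on_slave.insert(target);
    // ... send the MMDSSlaveRequest to target ...
  }

  void handle_slave_reply(MDRequestSketch *mdr, int from) {
    assert(mdr->waiting_on_slave.count(from));
    mdr->waiting_on_slave.erase(from);
    if (mdr->waiting_on_slave.empty()) {
      // dispatch_client_request(mdr) in the real code
    }
  }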
Re: [PATCH 16/39] mds: send cache rejoin messages after gathering all resolves
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 10 ++
src/mds/MDCache.h | 5 +
2 files changed, 15 insertions(+)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index f102205..6853bf1 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2914,6 +2914,8 @@ void MDCache::maybe_resolve_finish()
       recalc_auth_bits();
       trim_non_auth();
       mds->resolve_done();
+    } else {
+      maybe_send_pending_rejoins();
     }
   }
 }
@@ -3398,6 +3400,13 @@ void MDCache::rejoin_send_rejoins()
 {
   dout(10) << "rejoin_send_rejoins with recovery_set " << recovery_set << dendl;

+  if (!resolve_gather.empty()) {
+    dout(7) << "rejoin_send_rejoins still waiting for resolves ("
+           << resolve_gather << ")" << dendl;
+    rejoins_pending = true;
+    return;
+  }
+
   map<int, MMDSCacheRejoin*> rejoins;

   // encode cap list once.
@@ -3571,6 +3580,7 @@
     mds->send_message_mds(p->second, p->first);
   }
   rejoin_ack_gather.insert(mds->whoami);   // we need to complete rejoin_gather_finish, too
+  rejoins_pending = false;

   // nothing?
   if (mds->is_rejoin() && rejoins.empty()) {
diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index 278debf..379f715 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -383,6 +383,7 @@ public:
 protected:
   // [rejoin]
+  bool rejoins_pending;
   set<int> rejoin_gather;      // nodes from whom i need a rejoin
   set<int> rejoin_sent;        // nodes i sent a rejoin to
   set<int> rejoin_ack_gather;  // nodes from whom i need a rejoin ack
@@ -417,6 +418,10 @@
   void handle_cache_rejoin_full(MMDSCacheRejoin *m);
   void rejoin_send_acks();
   void rejoin_trim_undef_inodes();
+  void maybe_send_pending_rejoins() {
+    if (rejoins_pending)
+      rejoin_send_rejoins();
+  }
 public:
   void rejoin_gather_finish();
   void rejoin_send_rejoins();
--
1.7.11.7
Re: [PATCH 17/39] mds: send resolve acks after master updates are safely logged
Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 33 +
src/mds/MDCache.h | 7 ++-
src/mds/Server.cc | 9 +
src/mds/journal.cc | 2 +-
4 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 6853bf1..9b37b1e 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2177,6 +2177,17 @@ void MDCache::committed_master_slave(metareqid_t r, int from)
     log_master_commit(r);
 }

+void MDCache::logged_master_update(metareqid_t reqid)
+{
+  dout(10) << "logged_master_update " << reqid << dendl;
+  assert(uncommitted_masters.count(reqid));
+  uncommitted_masters[reqid].safe = true;
+  if (pending_masters.count(reqid)) {
+    pending_masters.erase(reqid);
+    if (pending_masters.empty())
+      process_delayed_resolve();
+  }
+}

 /*
  * The mds could crash after receiving all slaves' commit acknowledgement,
@@ -2764,8 +2775,23 @@ void MDCache::handle_resolve(MMDSResolve *m)
     return;
   }

+  discard_delayed_resolve(from);
+
   // ambiguous slave requests?
   if (!m->slave_requests.empty()) {
+    for (vector<metareqid_t>::iterator p = m->slave_requests.begin();
+        p != m->slave_requests.end();
+        ++p) {
+      if (uncommitted_masters.count(*p) && !uncommitted_masters[*p].safe)
+       pending_masters.insert(*p);
+    }
+
+    if (!pending_masters.empty()) {
+      dout(10) << "still have pending updates, delay processing slave resolve" << dendl;
+      delayed_resolve[from] = m;
+      return;
+    }
+
     MMDSResolveAck *ack = new MMDSResolveAck;
     for (vector<metareqid_t>::iterator p = m->slave_requests.begin();
         p != m->slave_requests.end();
@@ -2788,7 +2814,6 @@
   if (!resolve_ack_gather.empty() || !need_resolve_rollback.empty()) {
     dout(10) << "delay processing subtree resolve" << dendl;
-    discard_delayed_resolve(from);
     delayed_resolve[from] = m;
     return;
   }
@@ -2883,10 +2908,10 @@
 void MDCache::process_delayed_resolve()
 {
   dout(10) << "process_delayed_resolve" << dendl;
-  for (map<int, MMDSResolve*>::iterator p = delayed_resolve.begin();
-       p != delayed_resolve.end(); ++p)
+  map<int, MMDSResolve*> tmp;
+  tmp.swap(delayed_resolve);
+  for (map<int, MMDSResolve*>::iterator p = tmp.begin(); p != tmp.end(); ++p)
     handle_resolve(p->second);
-  delayed_resolve.clear();
 }

 void MDCache::discard_delayed_resolve(int who)
diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index 379f715..8f262b9 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -281,14 +281,16 @@ public:
                          snapid_t follows=CEPH_NOSNAP);

   // slaves
-  void add_uncommitted_master(metareqid_t reqid, LogSegment *ls, set<int> &slaves) {
+  void add_uncommitted_master(metareqid_t reqid, LogSegment *ls, set<int> &slaves, bool safe=false) {
     uncommitted_masters[reqid].ls = ls;
     uncommitted_masters[reqid].slaves = slaves;
+    uncommitted_masters[reqid].safe = safe;
   }
   void wait_for_uncommitted_master(metareqid_t reqid, Context *c) {
     uncommitted_masters[reqid].waiters.push_back(c);
   }
   void log_master_commit(metareqid_t reqid);
+  void logged_master_update(metareqid_t reqid);
   void _logged_master_commit(metareqid_t reqid, LogSegment *ls, list<Context*> &waiters);
   void committed_master_slave(metareqid_t r, int from);
   void finish_committed_masters();
@@ -320,9 +322,12 @@ protected:
     set<int> slaves;
     LogSegment *ls;
     list<Context*> waiters;
+    bool safe;
   };
   map<metareqid_t, umaster> uncommitted_masters;  // master: req -> slave set

+  set<metareqid_t> pending_masters;
+
   //map<metareqid_t, bool> ambiguous_slave_updates;         // for log trimming.
   //map<metareqid_t, Context*> waiting_for_slave_update_commit;
   friend class ESlaveUpdate;
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 8e89e4c..1330f11 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -4463,6 +4463,9 @@ void Server::_link_remote_finish(MDRequest *mdr, bool inc,
   assert(g_conf->mds_kill_link_at != 3);

+  if (!mdr->more()->witnessed.empty())
+    mdcache->logged_master_update(mdr->reqid);
+
   if (inc) {
     // link the new dentry
     dn->pop_projected_linkage();
@@ -5073,6 +5076,9 @@ void Server::_unlink_local_finish(MDRequest *mdr,
 {
   dout(10) << "_unlink_local_finish " << *dn << dendl;

+  if (!mdr->more()->witnessed.empty())
+    mdcache->logged_master_update(mdr->reqid);
+
   // unlink main dentry
   dn->get_dir()->unlink_inode(dn);
   dn->pop_projected_linkage();
@@ -5881,6 +5887,9 @@ void Server::_rename_finish(MDRequest *mdr, CDentry
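The delay logic reduces to: remember which master updates have been journaled but are not yet safe, park any resolve that references one, and replay the parked resolves once the journal commit lands. A rough sketch (types simplified to ints; the MDS dispatch loop is single-threaded, so no locking is shown):

  #include <map>
  #include <set>

  typedef int metareqid_sketch_t;    // stands in for metareqid_t
  typedef int resolve_msg_sketch_t;  // stands in for MMDSResolve*

  std::set<metareqid_sketch_t> pending_masters;         // journaled, not yet safe
  std::map<int, resolve_msg_sketch_t> delayed_resolve;  // sender rank -> parked message

  void process_delayed_resolve() {
    std::map<int, resolve_msg_sketch_t> tmp;
    tmp.swap(delayed_resolve);  // swap first: handling may park messages again
    // for each entry in tmp: handle_resolve(entry.second);
  }

  void logged_master_update(metareqid_sketch_t reqid) {
    pending_masters.erase(reqid);
    if (pending_masters.empty())
      process_delayed_resolve();
  }

The swap-before-iterate in process_delayed_resolve() mirrors the tmp.swap(delayed_resolve) in the patch: a message that gets re-parked during handling must land in the fresh map rather than invalidate the iteration in progress.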
Re: [PATCH 19/39] mds: remove MDCache::rejoin_fetch_dirfrags()
Nice. Reviewed-by: Greg Farnum g...@inktank.com

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

In commit 77946dcdae (mds: fetch missing inodes from disk), I introduced MDCache::rejoin_fetch_dirfrags(). But it basically duplicates the function of MDCache::open_undef_dirfrags(), so just remove rejoin_fetch_dirfrags() and make open_undef_dirfrags() also handle undefined inodes.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/CDir.cc | 70 +++
src/mds/MDCache.cc | 193 +
src/mds/MDCache.h | 5 +-
3 files changed, 107 insertions(+), 161 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 231630e..af0ae9c 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1553,33 +1553,32 @@ void CDir::_fetched(bufferlist &bl, const string& want_dn)
       if (stale)
        continue;

+      bool undef_inode = false;
       if (dn) {
-       if (dn->get_linkage()->get_inode() == 0) {
-         dout(12) << "_fetched  had NEG dentry " << *dn << dendl;
-       } else {
-         dout(12) << "_fetched  had dentry " << *dn << dendl;
-       }
-      } else {
+       CInode *in = dn->get_linkage()->get_inode();
+       if (in) {
+         dout(12) << "_fetched  had dentry " << *dn << dendl;
+         if (in->state_test(CInode::STATE_REJOINUNDEF)) {
+           assert(cache->mds->is_rejoin());
+           assert(in->vino() == vinodeno_t(inode.ino, last));
+           in->state_clear(CInode::STATE_REJOINUNDEF);
+           cache->opened_undef_inode(in);
+           undef_inode = true;
+         }
+       } else
+         dout(12) << "_fetched  had NEG dentry " << *dn << dendl;
+      }
+
+      if (!dn || undef_inode) {
        // add inode
        CInode *in = cache->get_inode(inode.ino, last);
-       if (in) {
-         dout(0) << "_fetched  badness: got (but i already had) " << *in
-                 << " mode " << in->inode.mode
-                 << " mtime " << in->inode.mtime << dendl;
-         string dirpath, inopath;
-         this->inode->make_path_string(dirpath);
-         in->make_path_string(inopath);
-         clog.error() << "loaded dup inode " << inode.ino
-           << " [" << first << "," << last << "] v" << inode.version
-           << " at " << dirpath << "/" << dname
-           << ", but inode " << in->vino() << " v" << in->inode.version
-           << " already exists at " << inopath << "\n";
-         continue;
-       } else {
-         // inode
-         in = new CInode(cache, true, first, last);
-         in->inode = inode;
+       if (!in || undef_inode) {
+         if (undef_inode)
+           in->first = first;
+         else
+           in = new CInode(cache, true, first, last);
+         in->inode = inode;

          // symlink?
          if (in->is_symlink())
            in->symlink = symlink;
@@ -1591,11 +1590,13 @@
          if (snaps)
            in->purge_stale_snap_data(*snaps);

-         // add
-         cache->add_inode( in );
-
-         // link
-         dn = add_primary_dentry(dname, in, first, last);
+         if (undef_inode) {
+           if (inode.anchored)
+             dn->adjust_nested_anchors(1);
+         } else {
+           cache->add_inode( in );                              // add
+           dn = add_primary_dentry(dname, in, first, last);     // link
+         }

          dout(12) << "_fetched  got " << *dn << " " << *in << dendl;

          if (in->inode.is_dirty_rstat())
@@ -1604,6 +1605,19 @@
          //in->hack_accessed = false;
          //in->hack_load_stamp = ceph_clock_now(g_ceph_context);
          //num_new_inodes_loaded++;
+       } else {
+         dout(0) << "_fetched  badness: got (but i already had) " << *in
+                 << " mode " << in->inode.mode
+                 << " mtime " << in->inode.mtime << dendl;
+         string dirpath, inopath;
+         this->inode->make_path_string(dirpath);
+         in->make_path_string(inopath);
+         clog.error() << "loaded dup inode " << inode.ino
+           << " [" << first << "," << last << "] v" << inode.version
+           << " at " << dirpath << "/" << dname
+           << ", but inode " << in->vino() << " v" << in->inode.version
+           << " already exists at " << inopath << "\n";
+         continue;
        }
       }
     } else {
diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index d934020..008a8a2 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4178,7 +4178,6 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack,

 CInode *MDCache::rejoin_invent_inode(inodeno_t ino, snapid_t last)
 {
-  assert(0);
   CInode *in = new CInode(this, true, 1, last);
   in->inode.ino = ino;
   in->state_set(CInode::STATE_REJOINUNDEF);
@@ -4190,16 +4189,13 @@ CInode *MDCache::rejoin_invent_inode(inodeno_t ino, snapid_t last)

 CDir *MDCache::rejoin_invent_dirfrag(dirfrag_t df)
 {
-  assert(0);
   CInode *in = get_inode(df.ino);
Re: deb/rpm package purge
On Wed, 2013-03-20 at 05:48 -0700, Sage Weil wrote:
On Wed, 20 Mar 2013, Laszlo Boszormenyi (GCS) wrote:
On Tue, 2013-03-19 at 15:59 -0700, Sage Weil wrote:
As a point of comparison, mysql removes the config files but not /var/lib/mysql. The question is, is that okay/typical/desirable/recommended/a bad idea?

I should have asked this sooner. Do you know _any_ program that removes your favorite music collection, your family photos or your business emails when you uninstall it? I suspect that your question was theoretical instead.

On Wed, 2013-03-20 at 09:48 -0500, Mark Nelson wrote:
On 03/20/2013 07:48 AM, Sage Weil wrote:
It's not as important given that it won't outright destroy the cluster, but perhaps we should also leave /etc/ceph untouched on purge if a ceph.conf file has been placed in it (since that also was not installed by the package, but rather by a user?).

I figure we should probably try to get it right now. The message about the directory not being empty sounds good.

Sure, personal user data must be kept. If it's a big amount of data and left under a non-standard location (i.e., not under his/her $HOME), then s/he should be informed on purge where those files are located.

My thought here is:
- remove anything created by the packages in /var/lib/ceph that has been untouched since package installation.
- remove /var/lib/ceph if it has been untouched

Please note that you then have to store some kind of checksum for the files. Probably md5sum is enough.

- remove /etc/ceph if it has been untouched

This is another case. dpkg itself handles package files here, called conffiles. I should check the method (md5sum and/or sha1 variants) used for the checksum on these files. On upgrade it's used to avoid overwriting local changes by the user. It may be worth reading a bit more about it[1] from Raphaël Hertzog. He is the co-author of the Debian Administrator's Handbook[2], BTW.

On purge dpkg will remove the package conffiles no matter what. It won't check whether those were changed or not. You may choose not to mark the files under /etc as conffiles, but then you'll lose the mentioned merge logic on upgrades; dpkg will just overwrite those.

In short, files under /var/lib/ceph are the only candidates for in-package checksumming. How many files under there are essential for the packages?

Laszlo/GCS
[1] http://raphaelhertzog.com/2010/09/21/debian-conffile-configuration-file-managed-by-dpkg/
[2] http://debian-handbook.info/
Re: [PATCH 20/39] mds: include replica nonce in MMDSCacheRejoin::inode_strong
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

So the recovering MDS can properly handle cache expire messages. Also increase the nonce value when sending the cache rejoin acks.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 35 +++
src/messages/MMDSCacheRejoin.h | 11 +++
2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 008a8a2..8ba676e 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3538,6 +3538,7 @@ void MDCache::rejoin_send_rejoins()
     if (p->first == 0 && root) {
       p->second->add_weak_inode(root->vino());
       p->second->add_strong_inode(root->vino(),
+                                 root->get_replica_nonce(),
                                  root->get_caps_wanted(),
                                  root->filelock.get_state(),
                                  root->nestlock.get_state(),
@@ -3551,6 +3552,7 @@
     if (CInode *in = get_inode(MDS_INO_MDSDIR(p->first))) {
       p->second->add_weak_inode(in->vino());
       p->second->add_strong_inode(in->vino(),
+                                 in->get_replica_nonce(),
                                  in->get_caps_wanted(),
                                  in->filelock.get_state(),
                                  in->nestlock.get_state(),
@@ -3709,6 +3711,7 @@ void MDCache::rejoin_walk(CDir *dir, MMDSCacheRejoin *rejoin)
       CInode *in = dnl->get_inode();
       dout(15) << " add_strong_inode " << *in << dendl;
       rejoin->add_strong_inode(in->vino(),
+                              in->get_replica_nonce(),
                               in->get_caps_wanted(),
                               in->filelock.get_state(),
                               in->nestlock.get_state(),
@@ -4248,7 +4251,7 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
        dir = rejoin_invent_dirfrag(p->first);
     }
     if (dir) {
-      dir->add_replica(from);
+      dir->add_replica(from, p->second.nonce);
       dir->dir_rep = p->second.dir_rep;
     } else {
       dout(10) << " frag " << p->first << " doesn't match dirfragtree " << *diri << dendl;
@@ -4263,7 +4266,7 @@
          dir = rejoin_invent_dirfrag(p->first);
        else
          dout(10) << " have(approx) " << *dir << dendl;
-       dir->add_replica(from);
+       dir->add_replica(from, p->second.nonce);
        dir->dir_rep = p->second.dir_rep;
       }
       refragged = true;
@@ -4327,7 +4330,7 @@
        mdr->locks.insert(&dn->lock);
       }

-      dn->add_replica(from);
+      dn->add_replica(from, q->second.nonce);
       dout(10) << " have " << *dn << dendl;

       // inode?
@@ -4412,7 +4415,7 @@
       dout(10) << " sender has dentry but not inode, adding them as a replica" << dendl;
     }

-    in->add_replica(from);
+    in->add_replica(from, p->second.nonce);
     dout(10) << " have " << *in << dendl;
   }
 }
@@ -5176,7 +5179,7 @@ void MDCache::rejoin_send_acks()
       for (map<int,int>::iterator r = dir->replicas_begin();
           r != dir->replicas_end();
           ++r)
-       ack[r->first]->add_strong_dirfrag(dir->dirfrag(), r->second, dir->dir_rep);
+       ack[r->first]->add_strong_dirfrag(dir->dirfrag(), ++r->second, dir->dir_rep);

       for (CDir::map_t::iterator q = dir->items.begin();
           q != dir->items.end();
@@ -5192,7 +5195,7 @@
                                          dnl->is_primary() ? dnl->get_inode()->ino():inodeno_t(0),
                                          dnl->is_remote() ? dnl->get_remote_ino():inodeno_t(0),
                                          dnl->is_remote() ? dnl->get_remote_d_type():0,
-                                         r->second,
+                                         ++r->second,
                                          dn->lock.get_replica_state());

        if (!dnl->is_primary())
@@ -5205,7 +5208,7 @@
             r != in->replicas_end();
             ++r) {
          ack[r->first]->add_inode_base(in);
-         ack[r->first]->add_inode_locks(in, r->second);
+         ack[r->first]->add_inode_locks(in, ++r->second);
        }

       // subdirs in this subtree?
@@ -5220,14 +5223,14 @@
        r != root->replicas_end();
        ++r) {
     ack[r->first]->add_inode_base(root);
-    ack[r->first]->add_inode_locks(root, r->second);
+    ack[r->first]->add_inode_locks(root, ++r->second);
   }

   if (myin)
     for (map<int,int>::iterator r = myin->replicas_begin();
        r != myin->replicas_end();
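The ++r->second in the ack path is the crux: each time the auth re-replicates an object during rejoin, the replica nonce it hands out is bumped, so a stale cache-expire message still carrying the old nonce can be recognized and dropped instead of expiring a live replica. In miniature (illustrative model of the nonce mechanism, not the MDSCacheObject API):

  #include <map>

  std::map<int, int> replicas;  // mds rank -> nonce currently handed to that replica

  // auth side: re-replicating to 'who' bumps the nonce it must echo back
  int replicate_to(int who) {
    return ++replicas[who];
  }

  // expire side: an expire stamped with an older nonce is stale; drop it
  bool expire_is_stale(int who, int expire_nonce) {
    return expire_nonce < replicas[who];
  }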
Re: [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack
This needs to handle versioning the encoding based on peer feature bits too.

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
From: Yan, Zheng zheng.z@intel.com

Cache rejoin ack message already encodes inode base, make it also encode dirfrag base. This allows the message to replicate stray dentries like the MDentryUnlink message. The function will be used by a later patch.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/CDir.h | 20 +---
src/mds/MDCache.cc | 20 ++--
src/messages/MMDSCacheRejoin.h | 12 +++-
3 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/src/mds/CDir.h b/src/mds/CDir.h
index 79946f1..f4a3a3d 100644
--- a/src/mds/CDir.h
+++ b/src/mds/CDir.h
@@ -437,23 +437,29 @@ private:
     ::encode(dist, bl);
   }

-  void encode_replica(int who, bufferlist& bl) {
-    __u32 nonce = add_replica(who);
-    ::encode(nonce, bl);
+  void _encode_base(bufferlist& bl) {
     ::encode(first, bl);
     ::encode(fnode, bl);
     ::encode(dir_rep, bl);
     ::encode(dir_rep_by, bl);
   }
-  void decode_replica(bufferlist::iterator& p) {
-    __u32 nonce;
-    ::decode(nonce, p);
-    replica_nonce = nonce;
+  void _decode_base(bufferlist::iterator& p) {
     ::decode(first, p);
     ::decode(fnode, p);
     ::decode(dir_rep, p);
     ::decode(dir_rep_by, p);
   }
+  void encode_replica(int who, bufferlist& bl) {
+    __u32 nonce = add_replica(who);
+    ::encode(nonce, bl);
+    _encode_base(bl);
+  }
+  void decode_replica(bufferlist::iterator& p) {
+    __u32 nonce;
+    ::decode(nonce, p);
+    replica_nonce = nonce;
+    _decode_base(p);
+  }
diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 8ba676e..344777e 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4510,8 +4510,22 @@ void MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin *ack)
     }
   }

+  // full dirfrags
+  bufferlist::iterator p = ack->dirfrag_base.begin();
+  while (!p.end()) {
+    dirfrag_t df;
+    bufferlist basebl;
+    ::decode(df, p);
+    ::decode(basebl, p);
+    CDir *dir = get_dirfrag(df);
+    assert(dir);
+    bufferlist::iterator q = basebl.begin();
+    dir->_decode_base(q);
+    dout(10) << " got dir replica " << *dir << dendl;
+  }
+
   // full inodes
-  bufferlist::iterator p = ack->inode_base.begin();
+  p = ack->inode_base.begin();
   while (!p.end()) {
     inodeno_t ino;
     snapid_t last;
@@ -5178,8 +5192,10 @@ void MDCache::rejoin_send_acks()
       // dir
       for (map<int,int>::iterator r = dir->replicas_begin();
           r != dir->replicas_end();
-          ++r)
+          ++r) {
        ack[r->first]->add_strong_dirfrag(dir->dirfrag(), ++r->second, dir->dir_rep);
+       ack[r->first]->add_dirfrag_base(dir);
+      }

       for (CDir::map_t::iterator q = dir->items.begin();
           q != dir->items.end();
diff --git a/src/messages/MMDSCacheRejoin.h b/src/messages/MMDSCacheRejoin.h
index b88f551..7c37ab4 100644
--- a/src/messages/MMDSCacheRejoin.h
+++ b/src/messages/MMDSCacheRejoin.h
@@ -20,6 +20,7 @@
 #include "include/types.h"
 #include "mds/CInode.h"
+#include "mds/CDir.h"

 // sent from replica to auth

@@ -169,6 +170,7 @@ class MMDSCacheRejoin : public Message {
   // full
   bufferlist inode_base;
   bufferlist inode_locks;
+  bufferlist dirfrag_base;

   // authpins, xlocks
   struct slave_reqid {
@@ -258,7 +260,13 @@ public:
   void add_strong_dirfrag(dirfrag_t df, int n, int dr) {
     strong_dirfrags[df] = dirfrag_strong(n, dr);
   }
-
+  void add_dirfrag_base(CDir *dir) {
+    ::encode(dir->dirfrag(), dirfrag_base);
+    bufferlist bl;
+    dir->_encode_base(bl);
+    ::encode(bl, dirfrag_base);
+  }

We are guilty of doing this in other places, but we should avoid implicit encodings like this one, especially when the decode happens somewhere else like it does here.
We can make a vector dirfrag_bases and add to that, and then encode and decode it along with the rest of the message — would that work for your purposes?
-Greg

+  // dentries
   void add_weak_dirfrag(dirfrag_t df) {
     weak_dirfrags.insert(df);
@@ -294,6 +302,7 @@ public:
     ::encode(wrlocked_inodes, payload);
     ::encode(cap_export_bl, payload);
     ::encode(strong_dirfrags, payload);
+    ::encode(dirfrag_base, payload);
     ::encode(weak, payload);
     ::encode(weak_dirfrags, payload);
     ::encode(weak_inodes, payload);
@@ -319,6 +328,7 @@
       ::decode(cap_export_paths, q);
     }
     ::decode(strong_dirfrags, p);
+    ::decode(dirfrag_base, p);
     ::decode(weak, p);
     ::decode(weak_dirfrags, p);
     ::decode(weak_inodes, p);
--
1.7.11.7
Re: [PATCH 20/39] mds: include replica nonce in MMDSCacheRejoin::inode_strong
On Wed, 20 Mar 2013, Gregory Farnum wrote:
[...]
Re: [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack
On Wed, Mar 20, 2013 at 4:33 PM, Gregory Farnum g...@inktank.com wrote: This needs to handle versioning the encoding based on peer feature bits too. On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote:
+ void add_dirfrag_base(CDir *dir) {
+   ::encode(dir->dirfrag(), dirfrag_base);
+   bufferlist bl;
+   dir->_encode_base(bl);
+   ::encode(bl, dirfrag_base);
+ }
We are guilty of doing this in other places, but we should avoid implicit encodings like this one, especially when the decode happens somewhere else like it does here. We can make a vector dirfrag_bases and add to that, and then encode and decode it along with the rest of the message — would that work for your purposes? -Greg
Sorry, a vector (called dirfrag_bases) of pair<dirfrag_t, bufferlist>, where the bufferlist is the encoded base. Or something like that. :)
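A sketch of the explicit shape suggested above (member and method names are illustrative, with std::string standing in for bufferlist; a real version would also need the feature-bit versioning Greg mentions): declare the container in the message and let the payload encode/decode handle it as a single unit.

#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins: dirfrag_t identifies a directory fragment;
// std::string stands in for Ceph's bufferlist byte container.
struct dirfrag_t { unsigned long long ino; unsigned frag; };
using bufferlist = std::string;

struct CacheRejoinAck {
    // A declared member instead of an ad-hoc, implicitly structured blob:
    std::vector<std::pair<dirfrag_t, bufferlist>> dirfrag_bases;

    void add_dirfrag_base(dirfrag_t df, bufferlist encoded_base) {
        dirfrag_bases.emplace_back(df, std::move(encoded_base));
    }
    // encode_payload()/decode_payload() would then (de)serialize
    // dirfrag_bases as one unit, so the wire layout is defined in exactly
    // one place instead of split across the encode site and a distant
    // decode site.
};

int main() {
    CacheRejoinAck ack;
    ack.add_dirfrag_base({0x10000000000ULL, 0}, "encoded fnode bytes");
    return ack.dirfrag_bases.size() == 1 ? 0 : 1;
}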
Re: [PATCH 23/39] mds: reqid for rejoining authpin/wrlock needs to be a list
I think Sage is right, we can just bump the MDS protocol instead of spending a feature bit on OTW changes — but this is another message we should update to the new encoding macros while we're making that bump. The rest looks good! -Greg On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 78 -- src/messages/MMDSCacheRejoin.h | 12 +++ 2 files changed, 50 insertions(+), 40 deletions(-) diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc index 38b1fdf..f4622de 100644 --- a/src/mds/MDCache.cc +++ b/src/mds/MDCache.cc @@ -4305,16 +4305,19 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong) // dn auth_pin? if (strong-authpinned_dentries.count(p-first) strong-authpinned_dentries[p-first].count(q-first)) { - MMDSCacheRejoin::slave_reqid r = strong-authpinned_dentries[p-first][q-first]; - dout(10) dn authpin by r on *dn dendl; - - // get/create slave mdrequest - MDRequest *mdr; - if (have_request(r.reqid)) - mdr = request_get(r.reqid); - else - mdr = request_start_slave(r.reqid, r.attempt, from); - mdr-auth_pin(dn); + for (listMMDSCacheRejoin::slave_reqid::iterator r = strong-authpinned_dentries[p-first][q-first].begin(); +r != strong-authpinned_dentries[p-first][q-first].end(); +++r) { + dout(10) dn authpin by *r on *dn dendl; + + // get/create slave mdrequest + MDRequest *mdr; + if (have_request(r-reqid)) + mdr = request_get(r-reqid); + else + mdr = request_start_slave(r-reqid, r-attempt, from); + mdr-auth_pin(dn); + } } // dn xlock? @@ -4389,22 +4392,25 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong) // auth pin? if (strong-authpinned_inodes.count(in-vino())) { - MMDSCacheRejoin::slave_reqid r = strong-authpinned_inodes[in-vino()]; - dout(10) inode authpin by r on *in dendl; + for (listMMDSCacheRejoin::slave_reqid::iterator r = strong-authpinned_inodes[in-vino()].begin(); + r != strong-authpinned_inodes[in-vino()].end(); + ++r) { + dout(10) inode authpin by *r on *in dendl; - // get/create slave mdrequest - MDRequest *mdr; - if (have_request(r.reqid)) - mdr = request_get(r.reqid); - else - mdr = request_start_slave(r.reqid, r.attempt, from); - if (strong-frozen_authpin_inodes.count(in-vino())) { - assert(!in-get_num_auth_pins()); - mdr-freeze_auth_pin(in); - } else { - assert(!in-is_frozen_auth_pin()); + // get/create slave mdrequest + MDRequest *mdr; + if (have_request(r-reqid)) + mdr = request_get(r-reqid); + else + mdr = request_start_slave(r-reqid, r-attempt, from); + if (strong-frozen_authpin_inodes.count(in-vino())) { + assert(!in-get_num_auth_pins()); + mdr-freeze_auth_pin(in); + } else { + assert(!in-is_frozen_auth_pin()); + } + mdr-auth_pin(in); } - mdr-auth_pin(in); } // xlock(s)? if (strong-xlocked_inodes.count(in-vino())) { @@ -4427,19 +4433,23 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong) } // wrlock(s)? if (strong-wrlocked_inodes.count(in-vino())) { - for (mapint,MMDSCacheRejoin::slave_reqid::iterator q = strong-wrlocked_inodes[in-vino()].begin(); + for (mapint, listMMDSCacheRejoin::slave_reqid ::iterator q = strong-wrlocked_inodes[in-vino()].begin(); q != strong-wrlocked_inodes[in-vino()].end(); ++q) { SimpleLock *lock = in-get_lock(q-first); - dout(10) inode wrlock by q-second on *lock on *in dendl; - MDRequest *mdr = request_get(q-second.reqid); // should have this from auth_pin above. 
- assert(mdr-is_auth_pinned(in)); - lock-set_state(LOCK_LOCK); - if (lock == in-filelock) - in-loner_cap = -1; - lock-get_wrlock(true); - mdr-wrlocks.insert(lock); - mdr-locks.insert(lock); + for (listMMDSCacheRejoin::slave_reqid::iterator r = q-second.begin(); +r != q-second.end(); +++r) { + dout(10) inode wrlock by *r on *lock on *in dendl; + MDRequest *mdr = request_get(r-reqid); // should have this from auth_pin above. + assert(mdr-is_auth_pinned(in)); + lock-set_state(LOCK_MIX); + if (lock == in-filelock) + in-loner_cap = -1; + lock-get_wrlock(true); + mdr-wrlocks.insert(lock); + mdr-locks.insert(lock); + } } }
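The structural point of the patch, reduced to a standalone sketch (the object keys and request ids are invented placeholders): turning the per-object value into a list means a second slave_reqid no longer silently overwrites the first, and rejoin can replay every authpin.

#include <iostream>
#include <list>
#include <map>
#include <string>

struct slave_reqid { unsigned long long reqid; unsigned attempt; };

int main() {
    // Before: one reqid per object -- a second recovering master's
    // authpin on the same object overwrites the first.
    std::map<std::string, slave_reqid> authpinned_old;
    authpinned_old["inode 0x1000"] = {101, 0};
    authpinned_old["inode 0x1000"] = {102, 0};  // req 101 silently lost

    // After: every reqid survives, and rejoin replays each of them.
    std::map<std::string, std::list<slave_reqid>> authpinned;
    authpinned["inode 0x1000"].push_back({101, 0});
    authpinned["inode 0x1000"].push_back({102, 0});

    for (const auto& entry : authpinned)
        for (const auto& r : entry.second)
            std::cout << entry.first << " authpinned by req "
                      << r.reqid << "." << r.attempt << "\n";
    return 0;
}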
Re: [PATCH 24/39] mds: take object's versionlock when rejoining xlock
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index f4622de..194f983 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4327,6 +4327,12 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
       dout(10) << " dn xlock by " << r << " on " << *dn << dendl;
       MDRequest *mdr = request_get(r.reqid);  // should have this from auth_pin above.
       assert(mdr->is_auth_pinned(dn));
+      if (!mdr->xlocks.count(&dn->versionlock)) {
+        assert(dn->versionlock.can_xlock_local());
+        dn->versionlock.get_xlock(mdr, mdr->get_client());
+        mdr->xlocks.insert(&dn->versionlock);
+        mdr->locks.insert(&dn->versionlock);
+      }
       if (dn->lock.is_stable())
         dn->auth_pin(&dn->lock);
       dn->lock.set_state(LOCK_XLOCK);
@@ -4421,6 +4427,12 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
       dout(10) << " inode xlock by " << q->second << " on " << *lock << " on " << *in << dendl;
       MDRequest *mdr = request_get(q->second.reqid);  // should have this from auth_pin above.
       assert(mdr->is_auth_pinned(in));
+      if (!mdr->xlocks.count(&in->versionlock)) {
+        assert(in->versionlock.can_xlock_local());
+        in->versionlock.get_xlock(mdr, mdr->get_client());
+        mdr->xlocks.insert(&in->versionlock);
+        mdr->locks.insert(&in->versionlock);
+      }
       if (lock->is_stable())
         in->auth_pin(lock);
       lock->set_state(LOCK_XLOCK);
--
1.7.11.7
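A compressed sketch of the guard being added (minimal invented types, not the real SimpleLock/MDRequest): during rejoin, take the object's local versionlock at most once per slave request, mirroring what the normal lock-acquisition path would have done before the restart.

#include <cassert>
#include <set>

// Invented minimal types standing in for SimpleLock and MDRequest.
struct Lock { bool xlocked = false; };
struct Request { std::set<Lock*> xlocks; };

// Re-take a local versionlock during rejoin, but at most once per request.
void rejoin_xlock_versionlock(Request& mdr, Lock& versionlock) {
    if (mdr.xlocks.count(&versionlock))
        return;                        // this mdr already holds it
    assert(!versionlock.xlocked);      // analogue of can_xlock_local()
    versionlock.xlocked = true;
    mdr.xlocks.insert(&versionlock);
}

int main() {
    Request mdr;
    Lock versionlock;
    rejoin_xlock_versionlock(mdr, versionlock);
    rejoin_xlock_versionlock(mdr, versionlock);  // second call is a no-op
    assert(mdr.xlocks.size() == 1);
    return 0;
}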
Re: [PATCH 25/39] mds: share inode max size after MDS recovers
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com The MDS may crash after journaling the new max size, but before sending the new max size to the client. Later when the MDS recovers, the client re-requests the new max size, but the MDS finds the max size unchanged. So the client waits for the new max size forever. This issue can be avoided by checking the client cap's last_sent and sharing the inode max size if it is zero (a reconnected cap's last_sent is zero). Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Locker.cc | 18 ++++++++++++++----
src/mds/Locker.h | 2 +-
src/mds/MDCache.cc | 2 ++
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 0055a19..4d45f99 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -2089,7 +2089,7 @@ bool Locker::check_inode_max_size(CInode *in, bool force_wrlock,
 }

-void Locker::share_inode_max_size(CInode *in)
+void Locker::share_inode_max_size(CInode *in, Capability *only_cap)
 {
   /*
    * only share if currently issued a WR cap. if client doesn't have it,
@@ -2097,9 +2097,12 @@ void Locker::share_inode_max_size(CInode *in)
    * the cap later.
    */
   dout(10) << "share_inode_max_size on " << *in << dendl;
-  for (map<client_t,Capability*>::iterator it = in->client_caps.begin();
-       it != in->client_caps.end();
-       ++it) {
+  map<client_t, Capability*>::iterator it;
+  if (only_cap)
+    it = in->client_caps.find(only_cap->get_client());
+  else
+    it = in->client_caps.begin();
+  for (; it != in->client_caps.end(); ++it) {
     const client_t client = it->first;
     Capability *cap = it->second;
     if (cap->is_suppress())
@@ -2115,6 +2118,8 @@ void Locker::share_inode_max_size(CInode *in)
       in->encode_cap_message(m, cap);
       mds->send_message_client_counted(m, client);
     }
+    if (only_cap)
+      break;
   }
 }

@@ -2398,6 +2403,11 @@ void Locker::handle_client_caps(MClientCaps *m)
     bool did_issue = eval(in, CEPH_CAP_LOCKS);
     if (!did_issue && (cap->wanted() & ~cap->pending()))
       issue_caps(in, cap);
+    if (cap->get_last_seq() == 0 &&
+        (cap->pending() & (CEPH_CAP_FILE_WR|CEPH_CAP_FILE_BUFFER))) {
+      cap->issue_norevoke(cap->issued());
+      share_inode_max_size(in, cap);
+    }
   }
 }

diff --git a/src/mds/Locker.h b/src/mds/Locker.h
index 3f79996..d98104f 100644
--- a/src/mds/Locker.h
+++ b/src/mds/Locker.h
@@ -276,7 +276,7 @@ public:
   void calc_new_client_ranges(CInode *in, uint64_t size, map<client_t, client_writeable_range_t>& new_ranges);
   bool check_inode_max_size(CInode *in, bool force_wrlock=false, bool update_size=false, uint64_t newsize=0, utime_t mtime=utime_t());
-  void share_inode_max_size(CInode *in);
+  void share_inode_max_size(CInode *in, Capability *only_cap=0);
 private:
   friend class C_MDL_CheckMaxSize;
diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 194f983..459b400 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -5073,6 +5073,8 @@ void MDCache::do_cap_import(Session *session, CInode *in, Capability *cap)
   SnapRealm *realm = in->find_snaprealm();
   if (realm->have_past_parents_open()) {
     dout(10) << "do_cap_import " << session->info.inst.name << " mseq " << cap->get_mseq() << " on " << *in << dendl;
+    if (cap->get_last_seq() == 0)
+      cap->issue_norevoke(cap->issued()); // reconnected cap
     cap->set_last_issue();
     MClientCaps *reap = new MClientCaps(CEPH_CAP_OP_IMPORT, in->ino(),
--
1.7.11.7
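The detection logic, as a standalone sketch (field names simplified from the cap API; the constants are illustrative, the real ones live elsewhere): a cap rebuilt from a client reconnect has never had a message sequenced on it, which is what makes a zero sequence a reliable marker for "this client may be waiting on a max_size update the old MDS never sent".

#include <cstdint>
#include <iostream>

// Minimal stand-in for a client capability; only the fields the fix reads.
struct Capability {
    uint64_t last_seq = 0;  // stays 0 right after a client reconnect
    unsigned pending = 0;   // caps currently issued to the client
};

// Illustrative constants, not the real CEPH_CAP_* values.
const unsigned CAP_FILE_WR = 1, CAP_FILE_BUFFER = 2;

// A reconnected cap (last_seq == 0) holding write caps may belong to a
// client still waiting for a max_size update the pre-crash MDS journaled
// but never sent.
bool should_share_max_size(const Capability& cap) {
    return cap.last_seq == 0 &&
           (cap.pending & (CAP_FILE_WR | CAP_FILE_BUFFER));
}

int main() {
    Capability cap;
    cap.pending = CAP_FILE_WR;
    std::cout << should_share_max_size(cap) << "\n";  // 1: resend max_size
    return 0;
}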
Re: [PATCH 26/39] mds: issue caps when lock state in replica become SYNC
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com ...because the client can request READ caps from a non-auth MDS. Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Locker.cc | 2 ++
1 file changed, 2 insertions(+)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 4d45f99..28920d4 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -4403,6 +4403,8 @@ void Locker::handle_file_lock(ScatterLock *lock, MLock *m)
     lock->set_state(LOCK_SYNC);
     lock->get_rdlock();
+    if (caps)
+      issue_caps(in);
     lock->finish_waiters(SimpleLock::WAIT_RD|SimpleLock::WAIT_STABLE);
     lock->put_rdlock();
     break;
--
1.7.11.7
RE: Latest bobtail branch still crashing KVM VMs in bh_write_commit()
-----Original Message----- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Stefan Priebe Sent: Thursday, March 21, 2013 4:14 AM To: Travis Rhoden Cc: bcampb...@axcess-financial.com; ceph-devel Subject: Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

Hi,

In this case, they are format 2, and they are from cloned snapshots. Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE PARENT FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M images/b8bdda90-664b-4906-86d6-dd33735441f2@snap 2

I'm doing an OpenStack boot-from-volume setup.

OK, I've never used cloned snapshots, so maybe this is the reason; strange that I've never seen this. Which qemu version?

# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard

That's coming from the Ubuntu 12.04 apt repos.

Maybe you should try qemu 1.4; there are a LOT of bugfixes. qemu-kvm does not exist anymore; it was merged into qemu with 1.3 or 1.4.

[jacky_he] I also encountered the same issue; the ceph version is 0.56.3. I have tried Qemu 1.3.1 and Qemu 1.4.0, and a KVM VM with a format 2 cloned image crashes. My host OS is Ubuntu 12.04; guest OSes are CentOS 6.3 and Windows XP/Windows 7.

Stefan
Re: deb/rpm package purge
As a point of comparison, mysql removes the config files but not /var/lib/mysql. The question is: is that okay/typical/desirable/recommended/a bad idea?

I should have asked this sooner. Do you know of _any_ program that removes your favorite music collection, your family photos, or your business emails when you uninstall it? I suspect your question was theoretical.

It's somewhat different in that the data is not owned by one user, but there are clear parallels. The thing to be careful about here, IMO, is not only to preserve the data, but also the associated files that allow (reasonably easy) access to that data. (It's no good preserving the OSD filestore if the keys, monmap, or osdmap are gone or hard to recover.)

- remove /etc/ceph if it has been untouched

This is another case. dpkg itself handles package files here, called conffiles. I should check the method (md5sum and/or sha1 variants) used for the checksum on these files. On upgrade it's used to avoid overwriting local changes made by the user. It may be worth reading a bit more about it[1] from Raphaël Hertzog. He is the co-author of the Debian Administrator's Handbook[2], BTW.

Excellent reference; thanks for the pointer.

On purge, dpkg will remove the package conffiles no matter what. It won't check whether those were changed or not. You may choose not to mark the files under /etc as conffiles, but then you'll lose the mentioned merge logic on upgrades; dpkg will just overwrite those. In short, files under /var/lib/ceph are the only candidates for in-package checksumming. How many files under there are essential for the packages?

Laszlo/GCS
[1] http://raphaelhertzog.com/2010/09/21/debian-conffile-configuration-file-managed-by-dpkg/
[2] http://debian-handbook.info/
Re: [PATCH 08/39] mds: consider MDS as recovered when it reaches clientreply state.
On 03/21/2013 02:40 AM, Greg Farnum wrote: The idea of this patch makes sense, but I'm not sure if we guarantee that each daemon sees every map update — if they don't, then if an MDS misses the map moving an MDS into CLIENTREPLAY, they won't process it as having recovered on the next map. Sage or Joao, what are the guarantees subscription provides? -Greg

See MDS::active_start(), it also kicks clientreplay waiters. And I will fix the 'clientreply' typo in my git tree. Thanks, Yan, Zheng

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com An MDS in the clientreply state has already started serving requests. This also makes MDS::handle_mds_recovery() and MDS::recovery_done() match. Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDS.cc | 2 ++
1 file changed, 2 insertions(+)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 282fa64..b91dcbd 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1032,7 +1032,9 @@ void MDS::handle_mds_map(MMDSMap *m)
   set<int> oldactive, active;
   oldmap->get_mds_set(oldactive, MDSMap::STATE_ACTIVE);
+  oldmap->get_mds_set(oldactive, MDSMap::STATE_CLIENTREPLAY);
   mdsmap->get_mds_set(active, MDSMap::STATE_ACTIVE);
+  mdsmap->get_mds_set(active, MDSMap::STATE_CLIENTREPLAY);
   for (set<int>::iterator p = active.begin(); p != active.end(); ++p)
     if (*p != whoami &&           // not me
         oldactive.count(*p) == 0) // newly so?
--
1.7.11.7
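The map-delta computation involved is plain set arithmetic, sketched standalone below (ranks and membership are invented for illustration): a rank counts as newly recovered when it appears in the active-or-clientreplay union of the new map but not in the old one.

#include <iostream>
#include <set>

int main() {
    std::set<int> oldactive{0};   // active/clientreplay ranks in the old map
    std::set<int> active{0, 1};   // the same union taken over the new map
    int whoami = 0;

    for (int rank : active)
        if (rank != whoami && !oldactive.count(rank))
            std::cout << "mds." << rank << " has recovered\n";  // kick its waiters
    return 0;
}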
Re: [PATCH 09/39] mds: defer eval gather locks when removing replica
Will update my git tree. Thanks Yan, Zheng On 03/21/2013 03:36 AM, Greg Farnum wrote: On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com Locks' states should not change between composing the cache rejoin ack messages and sending the message. If Locker::eval_gather() is called in MDCache::{inode,dentry}_remove_replica(), it may wake requests and change locks' states. Signed-off-by: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com) --- src/mds/MDCache.cc (http://MDCache.cc) | 51 ++- src/mds/MDCache.h | 8 +--- 2 files changed, 35 insertions(+), 24 deletions(-) diff --git a/src/mds/MDCache.cc (http://MDCache.cc) b/src/mds/MDCache.cc (http://MDCache.cc) index 19dc60b..0f6b842 100644 --- a/src/mds/MDCache.cc (http://MDCache.cc) +++ b/src/mds/MDCache.cc (http://MDCache.cc) @@ -3729,6 +3729,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak) // possible response(s) MMDSCacheRejoin *ack = 0; // if survivor setvinodeno_t acked_inodes; // if survivor + setSimpleLock * gather_locks; // if survivor bool survivor = false; // am i a survivor? if (mds-is_clientreplay() || mds-is_active() || mds-is_stopping()) { @@ -3851,7 +3852,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak) assert(dnl-is_primary()); if (survivor dn-is_replica(from)) - dentry_remove_replica(dn, from); // this induces a lock gather completion + dentry_remove_replica(dn, from, gather_locks); // this induces a lock gather completion This comment is no longer accurate :) int dnonce = dn-add_replica(from); dout(10) have *dn dendl; if (ack) @@ -3864,7 +3865,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak) assert(in); if (survivor in-is_replica(from)) - inode_remove_replica(in, from); + inode_remove_replica(in, from, gather_locks); int inonce = in-add_replica(from); dout(10) have *in dendl; @@ -3887,7 +3888,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak) CInode *in = get_inode(*p); assert(in); // hmm fixme wrt stray? if (survivor in-is_replica(from)) - inode_remove_replica(in, from); // this induces a lock gather completion + inode_remove_replica(in, from, gather_locks); // this induces a lock gather completion Same here. Other than those, looks good. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com int inonce = in-add_replica(from); dout(10) have base *in dendl; @@ -3909,8 +3910,11 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak) ack-add_inode_base(in); } - rejoin_scour_survivor_replicas(from, ack, acked_inodes); + rejoin_scour_survivor_replicas(from, ack, gather_locks, acked_inodes); mds-send_message(ack, weak-get_connection()); + + for (setSimpleLock*::iterator p = gather_locks.begin(); p != gather_locks.end(); ++p) + mds-locker-eval_gather(*p); } else { // done? assert(rejoin_gather.count(from)); @@ -4055,7 +4059,9 @@ bool MDCache::parallel_fetch_traverse_dir(inodeno_t ino, filepath path, * all validated replicas are acked with a strong nonce, etc. if that isn't in the * ack, the replica dne, and we can remove it from our replica maps. */ -void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, setvinodeno_t acked_inodes) +void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, + setSimpleLock * gather_locks, + setvinodeno_t acked_inodes) { dout(10) rejoin_scour_survivor_replicas from mds. 
from dendl; @@ -4070,7 +4076,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set if (in-is_auth() in-is_replica(from) acked_inodes.count(p-second-vino()) == 0) { - inode_remove_replica(in, from); + inode_remove_replica(in, from, gather_locks); dout(10) rem *in dendl; } @@ -4099,7 +4105,7 @@ void MDCache::rejoin_scour_survivor_replicas(int from, MMDSCacheRejoin *ack, set if (dn-is_replica(from) (ack-strong_dentries.count(dir-dirfrag()) == 0 || ack-strong_dentries[dir-dirfrag()].count(string_snap_t(dn-name, dn-last)) == 0)) { - dentry_remove_replica(dn, from); + dentry_remove_replica(dn, from, gather_locks); dout(10) rem *dn dendl; } } @@ -6189,6 +6195,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m) return; } + setSimpleLock * gather_locks; // loop over realms for (mapdirfrag_t,MCacheExpire::realm::iterator p = m-realms.begin(); p != m-realms.end(); @@ -6255,7 +6262,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m) // remove from our cached_by dout(7) inode expire on *in from mds. from cached_by was in-get_replicas() dendl; - inode_remove_replica(in, from); + inode_remove_replica(in, from, gather_locks); } else { // this is an old nonce, ignore expire. @@ -6332,7 +6339,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m) if (nonce ==
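The shape of the fix, reduced to a standalone sketch (invented minimal types, not the real Locker/MDCache interfaces): replica removal only records which locks would need re-evaluation, and eval_gather() runs after the ack has been sent, so lock states can no longer change between composing and sending the message.

#include <iostream>
#include <set>
#include <string>

struct SimpleLock { std::string name; };

void eval_gather(SimpleLock* l) { std::cout << "eval " << l->name << "\n"; }

// Phase 1: mutate replica state but only *collect* the affected locks.
void remove_replica(SimpleLock& l, std::set<SimpleLock*>& gather_locks) {
    gather_locks.insert(&l);  // instead of calling eval_gather(&l) here
}

int main() {
    SimpleLock filelock{"filelock"}, nestlock{"nestlock"};
    std::set<SimpleLock*> gather_locks;

    remove_replica(filelock, gather_locks);
    remove_replica(nestlock, gather_locks);
    // ... compose and send the rejoin ack here; states are still frozen ...
    for (SimpleLock* l : gather_locks)  // Phase 2: now evaluation is safe
        eval_gather(l);
    return 0;
}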
Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request
On 03/21/2013 05:19 AM, Greg Farnum wrote: On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com Replicated objects need to be added into the cache immediately Signed-off-by: Yan, Zheng zheng.z@intel.com Why do we need to add them right away? Shouldn't we have a journaled replica if we need it? -Greg The issue I encountered is lock action message received, but replicated objects wasn't in the cache because slave request was delayed. Thanks Yan, Zheng Software Engineer #42 @ http://inktank.com | http://ceph.com --- src/mds/MDCache.cc | 12 src/mds/MDCache.h | 2 +- src/mds/MDS.cc | 6 +++--- src/mds/Server.cc | 55 +++--- 4 files changed, 56 insertions(+), 19 deletions(-) diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc index 0f6b842..b668842 100644 --- a/src/mds/MDCache.cc +++ b/src/mds/MDCache.cc @@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context *fin, bufferlist bl, int r) /* */ +int MDCache::get_num_client_requests() +{ + int count = 0; + for (hash_mapmetareqid_t, MDRequest*::iterator p = active_requests.begin(); + p != active_requests.end(); + ++p) { + if (p-second-reqid.name.is_client() !p-second-is_slave()) + count++; + } + return count; +} + /* This function takes over the reference to the passed Message */ MDRequest *MDCache::request_start(MClientRequest *req) { diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h index a9f05c6..4634121 100644 --- a/src/mds/MDCache.h +++ b/src/mds/MDCache.h @@ -240,7 +240,7 @@ protected: hash_mapmetareqid_t, MDRequest* active_requests; public: - int get_num_active_requests() { return active_requests.size(); } + int get_num_client_requests(); MDRequest* request_start(MClientRequest *req); MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by); diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc index b91dcbd..e99eecc 100644 --- a/src/mds/MDS.cc +++ b/src/mds/MDS.cc @@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m) mdcache-is_open() replay_queue.empty() want_state == MDSMap::STATE_CLIENTREPLAY) { - dout(10) still have mdcache-get_num_active_requests() - active replay requests dendl; - if (mdcache-get_num_active_requests() == 0) + int num_requests = mdcache-get_num_client_requests(); + dout(10) still have num_requests active replay requests dendl; + if (num_requests == 0) clientreplay_done(); } diff --git a/src/mds/Server.cc b/src/mds/Server.cc index 4c4c86b..8e89e4c 100644 --- a/src/mds/Server.cc +++ b/src/mds/Server.cc @@ -107,10 +107,8 @@ void Server::dispatch(Message *m) (m-get_type() == CEPH_MSG_CLIENT_REQUEST (static_castMClientRequest*(m))-is_replay( { // replaying! - } else if (mds-is_clientreplay() m-get_type() == MSG_MDS_SLAVE_REQUEST - ((static_castMMDSSlaveRequest*(m))-is_reply() || - !mds-mdsmap-is_active(m-get_source().num( { - // slave reply or the master is also in the clientreplay stage + } else if (m-get_type() == MSG_MDS_SLAVE_REQUEST) { + // handle_slave_request() will wait if necessary } else { dout(3) not active yet, waiting dendl; mds-wait_for_active(new C_MDS_RetryMessage(mds, m)); @@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m) if (m-is_reply()) return handle_slave_request_reply(m); + CDentry *straydn = NULL; + if (m-stray.length() 0) { + straydn = mdcache-add_replica_stray(m-stray, from); + assert(straydn); + m-stray.clear(); + } + // am i a new slave? 
MDRequest *mdr = NULL; if (mdcache-have_request(m-get_reqid())) { @@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m) m-put(); return; } - mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), m-get_source().num()); + mdr = mdcache-request_start_slave(m-get_reqid(), m-get_attempt(), from); } assert(mdr-slave_request == 0); // only one at a time, please! + + if (straydn) { + mdr-pin(straydn); + mdr-straydn = straydn; + } + + if (!mds-is_clientreplay() !mds-is_active() !mds-is_stopping()) { + dout(3) not clientreplay|active yet, waiting dendl; + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m)); + return; + } else if (mds-is_clientreplay() !mds-mdsmap-is_clientreplay(from) + mdr-locks.empty()) { + dout(3) not active yet, waiting dendl; + mds-wait_for_active(new C_MDS_RetryMessage(mds, m)); + return; + } + mdr-slave_request = m; dispatch_slave_request(mdr); @@ -1339,6 +1361,12 @@ void Server::handle_slave_request_reply(MMDSSlaveRequest *m) { int from = m-get_source().num(); + if (!mds-is_clientreplay() !mds-is_active() !mds-is_stopping()) { + dout(3) not clientreplay|active yet, waiting dendl; + mds-wait_for_replay(new C_MDS_RetryMessage(mds, m)); + return; + } + if (m-get_op() == MMDSSlaveRequest::OP_COMMITTED) { metareqid_t r = m-get_reqid();
Re: [PATCH 13/39] mds: don't send resolve message between active MDS
On 03/21/2013 05:56 AM, Gregory Farnum wrote: On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com

When the MDS cluster is resolving, the current behavior is to send the subtree resolve message to all other MDSes and wait for all other MDSes' resolve messages. The problem is that active MDSes can have different subtree maps due to rename. Besides, gathering active MDSes' resolve messages is also racy. The only function of these messages is to disambiguate other MDSes' imports. We can replace them with an import finish notification.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 12 +++++++++---
src/mds/Migrator.cc | 25 +++++++++++++++++++++++--
src/mds/Migrator.h | 3 ++-
3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index c455a20..73c1d59 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2517,7 +2517,8 @@ void MDCache::send_subtree_resolves()
        ++p) {
     if (*p == mds->whoami)
       continue;
-    resolves[*p] = new MMDSResolve;
+    if (mds->is_resolve() || mds->mdsmap->is_resolve(*p))
+      resolves[*p] = new MMDSResolve;
   }

   // known
@@ -2837,7 +2838,7 @@ void MDCache::handle_resolve(MMDSResolve *m)
         migrator->import_reverse(dir);
       } else {
         dout(7) << "ambiguous import succeeded on " << *dir << dendl;
-        migrator->import_finish(dir);
+        migrator->import_finish(dir, true);
       }
       my_ambiguous_imports.erase(p);  // no longer ambiguous.
     }
@@ -3432,7 +3433,12 @@ void MDCache::rejoin_send_rejoins()
        ++p) {
     CDir *dir = p->first;
     assert(dir->is_subtree_root());
-    assert(!dir->is_ambiguous_dir_auth());
+    if (dir->is_ambiguous_dir_auth()) {
+      // exporter is recovering, importer is survivor.

The importer has to be the MDS this code is running on, right?

This code is for bystanders. The exporter is recovering, and its resolve message didn't claim the subtree, so the export must have succeeded.

+      assert(rejoins.count(dir->authority().first));
+      assert(!rejoins.count(dir->authority().second));
+      continue;
+    }

     // my subtree?
     if (dir->is_auth())
diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 5e53803..833df12 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2088,6 +2088,23 @@ void Migrator::import_reverse(CDir *dir)
   }
 }

+void Migrator::import_notify_finish(CDir *dir, set<CDir*>& bounds)
+{
+  dout(7) << "import_notify_finish " << *dir << dendl;
+
+  for (set<int>::iterator p = import_bystanders[dir].begin();
+       p != import_bystanders[dir].end();
+       ++p) {
+    MExportDirNotify *notify =
+      new MExportDirNotify(dir->dirfrag(), false,
+                           pair<int,int>(import_peer[dir->dirfrag()], mds->get_nodeid()),
+                           pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN));

I don't think this is quite right — we're notifying them that we've just finished importing data from somebody, right? And so we know that we're the auth node...

Yes. In the normal case, the exporter notifies the bystanders. But if the exporter crashes, the importer notifies the bystanders after it confirms the ambiguous import succeeded.
Thanks Yan, Zheng +for (setCDir*::iterator i = bounds.begin(); i != bounds.end(); i++) + notify-get_bounds().push_back((*i)-dirfrag()); +mds-send_message_mds(notify, *p); + } +} + void Migrator::import_notify_abort(CDir *dir, setCDir* bounds) { dout(7) import_notify_abort *dir dendl; @@ -2183,11 +2200,11 @@ void Migrator::handle_export_finish(MExportDirFinish *m) CDir *dir = cache-get_dirfrag(m-get_dirfrag()); assert(dir); dout(7) handle_export_finish on *dir dendl; - import_finish(dir); + import_finish(dir, false); m-put(); } -void Migrator::import_finish(CDir *dir) +void Migrator::import_finish(CDir *dir, bool notify) { dout(7) import_finish on *dir dendl; @@ -2205,6 +,10 @@ void Migrator::import_finish(CDir *dir) // remove pins setCDir* bounds; cache-get_subtree_bounds(dir, bounds); + + if (notify) +import_notify_finish(dir, bounds); + import_remove_pins(dir, bounds); mapCInode*, mapclient_t,Capability::Export cap_imports; diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h index 7988f32..2889a74 100644 --- a/src/mds/Migrator.h +++ b/src/mds/Migrator.h @@ -273,12 +273,13 @@ protected: void import_reverse_unfreeze(CDir *dir); void import_reverse_final(CDir *dir); void import_notify_abort(CDir *dir, setCDir* bounds); + void import_notify_finish(CDir *dir, setCDir* bounds); void import_logged_start(dirfrag_t df, CDir *dir, int from, mapclient_t,entity_inst_t imported_client_map, mapclient_t,uint64_t sseqmap); void handle_export_finish(MExportDirFinish *m); public: -
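For readers untangling the notify arguments: each MExportDirNotify carries two (first_auth, second_auth) pairs describing the authority transition the bystander should apply. A standalone sketch of the finish case under discussion, with invented ranks (3 = recovering exporter, 5 = surviving importer) and a plain pair standing in for the real type:

#include <iostream>
#include <utility>

using mds_authority_t = std::pair<int, int>;
const int CDIR_AUTH_UNKNOWN = -2;  // sentinel: no second authority

int main() {
    int exporter = 3, importer = 5;  // invented ranks for illustration

    // What the bystander currently believes about the ambiguous subtree:
    mds_authority_t before(exporter, importer);
    // What import_notify_finish tells it to record instead -- the import
    // succeeded, so the importer alone is authoritative:
    mds_authority_t after(importer, CDIR_AUTH_UNKNOWN);

    std::cout << "subtree auth (" << before.first << "," << before.second
              << ") -> (" << after.first << "," << after.second << ")\n";
    return 0;
}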
Re: [PATCH 27/39] mds: send lock action message when auth MDS is in proper state.
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com For rejoining object, don't send lock ACK message because lock states are still uncertain. The lock ACK may confuse object's auth MDS and trigger assertion. If object's auth MDS is not active, just skip sending NUDGE, REQRDLOCK and REQSCATTER messages. MDCache::handle_mds_recovery() will take care of them. Also defer caps release message until clientreplay or active Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Locker.cc | 46 ++ src/mds/MDCache.cc | 13 +++-- 2 files changed, 41 insertions(+), 18 deletions(-) diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc index 28920d4..ece39e3 100644 --- a/src/mds/Locker.cc +++ b/src/mds/Locker.cc @@ -658,6 +658,13 @@ void Locker::eval_gather(SimpleLock *lock, bool first, bool *pneed_issue, listC // replica: tell auth int auth = lock-get_parent()-authority().first; + if (lock-get_parent()-is_rejoining() + mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) { + dout(7) eval_gather finished gather, but still rejoining +*lock-get_parent() dendl; + return; + } + if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN) { switch (lock-get_state()) { case LOCK_SYNC_LOCK: @@ -1050,9 +1057,11 @@ bool Locker::_rdlock_kick(SimpleLock *lock, bool as_anon) } else { // request rdlock state change from auth int auth = lock-get_parent()-authority().first; - dout(10) requesting rdlock from auth on - *lock on *lock-get_parent() dendl; - mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, mds-get_nodeid()), auth); + if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) { + dout(10) requesting rdlock from auth on + *lock on *lock-get_parent() dendl; + mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, mds-get_nodeid()), auth); + } return false; } } @@ -1272,9 +1281,11 @@ bool Locker::wrlock_start(SimpleLock *lock, MDRequest *mut, bool nowait) // replica. // auth should be auth_pinned (see acquire_locks wrlock weird mustpin case). int auth = lock-get_parent()-authority().first; - dout(10) requesting scatter from auth on - *lock on *lock-get_parent() dendl; - mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, mds-get_nodeid()), auth); + if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) { + dout(10) requesting scatter from auth on + *lock on *lock-get_parent() dendl; + mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, mds-get_nodeid()), auth); + } break; } } @@ -1899,13 +1910,19 @@ void Locker::request_inode_file_caps(CInode *in) } int auth = in-authority().first; +if (in-is_rejoining() + mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) { + mds-wait_for_active_peer(auth, new C_MDL_RequestInodeFileCaps(this, in)); + return; +} + dout(7) request_inode_file_caps ccap_string(wanted) was ccap_string(in-replica_caps_wanted) on *in to mds. 
auth dendl; in-replica_caps_wanted = wanted; -if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN) +if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) mds-send_message_mds(new MInodeFileCaps(in-ino(), in-replica_caps_wanted), auth); } @@ -1924,14 +1941,6 @@ void Locker::handle_inode_file_caps(MInodeFileCaps *m) assert(in); assert(in-is_auth()); - if (mds-is_rejoin() - in-is_rejoining()) { -dout(7) handle_inode_file_caps still rejoining *in , dropping *m dendl; -m-put(); -return; - } This is okay since we catch it in the follow-on functions (I assume that's why you removed it, to avoid checks at more levels than necessary), but if you could note that's why in the commit message it'll prevent anyone else from needing to go check like I did. :) The code looks good. Reviewed-by: Greg Farnum g...@inktank.com - - dout(7) handle_inode_file_caps replica mds. from wants caps ccap_string(m-get_caps()) on *in dendl; if (m-get_caps()) @@ -2850,6 +2859,11 @@ void Locker::handle_client_cap_release(MClientCapRelease *m) client_t client = m-get_source().num(); dout(10) handle_client_cap_release *m dendl; + if (!mds-is_clientreplay() !mds-is_active() !mds-is_stopping()) { +mds-wait_for_replay(new C_MDS_RetryMessage(mds, m)); +return; + } + for (vectorceph_mds_cap_item::iterator p = m-caps.begin(); p != m-caps.end(); ++p) { inodeno_t ino((uint64_t)p-ino); CInode *in = mdcache-get_inode(ino);
Re: [PATCH 28/39] mds: add dirty imported dirfrag to LogSegment
Whoops! Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/CDir.cc | 7 +++++--
src/mds/CDir.h | 2 +-
src/mds/Migrator.cc | 2 +-
3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index af0ae9c..34bd8d3 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -2164,7 +2164,7 @@ void CDir::finish_export(utime_t now)
   dirty_old_rstat.clear();
 }

-void CDir::decode_import(bufferlist::iterator& blp, utime_t now)
+void CDir::decode_import(bufferlist::iterator& blp, utime_t now, LogSegment *ls)
 {
   ::decode(first, blp);
   ::decode(fnode, blp);
@@ -2177,7 +2177,10 @@ void CDir::decode_import(bufferlist::iterator& blp, utime_t now)
   ::decode(s, blp);
   state &= MASK_STATE_IMPORT_KEPT;
   state |= (s & MASK_STATE_EXPORTED);
-  if (is_dirty()) get(PIN_DIRTY);
+  if (is_dirty()) {
+    get(PIN_DIRTY);
+    _mark_dirty(ls);
+  }

   ::decode(dir_rep, blp);

diff --git a/src/mds/CDir.h b/src/mds/CDir.h
index f4a3a3d..7e1db73 100644
--- a/src/mds/CDir.h
+++ b/src/mds/CDir.h
@@ -550,7 +550,7 @@ public:
   void abort_export() {
     put(PIN_TEMPEXPORTING);
   }
-  void decode_import(bufferlist::iterator& blp, utime_t now);
+  void decode_import(bufferlist::iterator& blp, utime_t now, LogSegment *ls);

   // -- auth pins --

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 833df12..d626cb1 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2397,7 +2397,7 @@ int Migrator::decode_import_dir(bufferlist::iterator& blp,
   dout(7) << "decode_import_dir " << *dir << dendl;

   // assimilate state
-  dir->decode_import(blp, now);
+  dir->decode_import(blp, now, ls);

   // mark (may already be marked from get_or_open_dir() above)
   if (!dir->is_auth())
--
1.7.11.7
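Why decode_import() now needs a LogSegment, as a schematic sketch (invented stand-in types, not the real elist machinery): marking a dirfrag dirty is only half the job; it must also be registered on the current log segment so journal trimming knows to commit it.

#include <iostream>
#include <vector>

struct CDirStub;  // invented stand-ins for CDir and LogSegment
struct LogSegment { std::vector<CDirStub*> dirty_dirfrags; };

struct CDirStub {
    bool dirty = false;
    void mark_dirty(LogSegment* ls) {
        dirty = true;
        ls->dirty_dirfrags.push_back(this);  // trimming this segment will
                                             // now commit the dirfrag first
    }
};

int main() {
    LogSegment ls;
    CDirStub imported;
    imported.mark_dirty(&ls);  // what the fixed decode_import now does
    std::cout << "segment tracks " << ls.dirty_dirfrags.size()
              << " dirty dirfrag(s)\n";
    return 0;
}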
Re: [PATCH 29/39] mds: avoid double auth pin for file recovery
This looks good on its face, but I haven't had the chance to dig through the recovery queue stuff yet (it's on my list following some issues with recovery speed). How'd you run across this? If it's being added to the recovery queue multiple times, I want to make sure we don't have some other machinery trying to dequeue it multiple times, or a single waiter which needs to be a list or something. -Greg

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 973a4d0..e9a79cd 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -5502,8 +5502,10 @@ void MDCache::_queue_file_recover(CInode *in)
   dout(15) << "_queue_file_recover " << *in << dendl;
   assert(in->is_auth());
   in->state_clear(CInode::STATE_NEEDSRECOVER);
-  in->state_set(CInode::STATE_RECOVERING);
-  in->auth_pin(this);
+  if (!in->state_test(CInode::STATE_RECOVERING)) {
+    in->state_set(CInode::STATE_RECOVERING);
+    in->auth_pin(this);
+  }
   file_recover_queue.insert(in);
 }
--
1.7.11.7
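The hazard and the guard, distilled into a standalone sketch (hypothetical minimal types): the RECOVERING flag makes requeueing idempotent, so two successive crashes of writers to the same file pin the inode only once.

#include <cassert>
#include <set>

// Hypothetical minimal inode; only the fields the guard touches.
struct Inode {
    bool recovering = false;
    int auth_pins = 0;
};

void queue_file_recover(std::set<Inode*>& queue, Inode& in) {
    if (!in.recovering) {   // pin only on the first enqueue
        in.recovering = true;
        ++in.auth_pins;
    }
    queue.insert(&in);      // std::set insertion is idempotent anyway
}

int main() {
    std::set<Inode*> file_recover_queue;
    Inode in;
    queue_file_recover(file_recover_queue, in);  // first writer crashes
    queue_file_recover(file_recover_queue, in);  // second writer crashes
    assert(in.auth_pins == 1 && file_recover_queue.size() == 1);
    return 0;
}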
Re: [PATCH 27/39] mds: send lock action message when auth MDS is in proper state.
On 03/21/2013 11:12 AM, Gregory Farnum wrote: On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com For rejoining object, don't send lock ACK message because lock states are still uncertain. The lock ACK may confuse object's auth MDS and trigger assertion. If object's auth MDS is not active, just skip sending NUDGE, REQRDLOCK and REQSCATTER messages. MDCache::handle_mds_recovery() will take care of them. Also defer caps release message until clientreplay or active Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Locker.cc | 46 ++ src/mds/MDCache.cc | 13 +++-- 2 files changed, 41 insertions(+), 18 deletions(-) diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc index 28920d4..ece39e3 100644 --- a/src/mds/Locker.cc +++ b/src/mds/Locker.cc @@ -658,6 +658,13 @@ void Locker::eval_gather(SimpleLock *lock, bool first, bool *pneed_issue, listC // replica: tell auth int auth = lock-get_parent()-authority().first; + if (lock-get_parent()-is_rejoining() + mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) { + dout(7) eval_gather finished gather, but still rejoining +*lock-get_parent() dendl; + return; + } + if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN) { switch (lock-get_state()) { case LOCK_SYNC_LOCK: @@ -1050,9 +1057,11 @@ bool Locker::_rdlock_kick(SimpleLock *lock, bool as_anon) } else { // request rdlock state change from auth int auth = lock-get_parent()-authority().first; - dout(10) requesting rdlock from auth on - *lock on *lock-get_parent() dendl; - mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, mds-get_nodeid()), auth); + if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) { + dout(10) requesting rdlock from auth on + *lock on *lock-get_parent() dendl; + mds-send_message_mds(new MLock(lock, LOCK_AC_REQRDLOCK, mds-get_nodeid()), auth); + } return false; } } @@ -1272,9 +1281,11 @@ bool Locker::wrlock_start(SimpleLock *lock, MDRequest *mut, bool nowait) // replica. // auth should be auth_pinned (see acquire_locks wrlock weird mustpin case). int auth = lock-get_parent()-authority().first; - dout(10) requesting scatter from auth on - *lock on *lock-get_parent() dendl; - mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, mds-get_nodeid()), auth); + if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) { + dout(10) requesting scatter from auth on + *lock on *lock-get_parent() dendl; + mds-send_message_mds(new MLock(lock, LOCK_AC_REQSCATTER, mds-get_nodeid()), auth); + } break; } } @@ -1899,13 +1910,19 @@ void Locker::request_inode_file_caps(CInode *in) } int auth = in-authority().first; +if (in-is_rejoining() + mds-mdsmap-get_state(auth) == MDSMap::STATE_REJOIN) { + mds-wait_for_active_peer(auth, new C_MDL_RequestInodeFileCaps(this, in)); + return; +} + dout(7) request_inode_file_caps ccap_string(wanted) was ccap_string(in-replica_caps_wanted) on *in to mds. 
auth dendl; in-replica_caps_wanted = wanted; -if (mds-mdsmap-get_state(auth) = MDSMap::STATE_REJOIN) +if (mds-mdsmap-is_clientreplay_or_active_or_stopping(auth)) mds-send_message_mds(new MInodeFileCaps(in-ino(), in-replica_caps_wanted), auth); } @@ -1924,14 +1941,6 @@ void Locker::handle_inode_file_caps(MInodeFileCaps *m) assert(in); assert(in-is_auth()); - if (mds-is_rejoin() - in-is_rejoining()) { -dout(7) handle_inode_file_caps still rejoining *in , dropping *m dendl; -m-put(); -return; - } This is okay since we catch it in the follow-on functions (I assume that's why you removed it, to avoid checks at more levels than necessary), but if you could note that's why in the commit message it'll prevent anyone else from needing to go check like I did. :) if an inode is auth, it can not be rejoining. that's why I removed it. Thanks Yan, Zheng The code looks good. Reviewed-by: Greg Farnum g...@inktank.com - - dout(7) handle_inode_file_caps replica mds. from wants caps ccap_string(m-get_caps()) on *in dendl; if (m-get_caps()) @@ -2850,6 +2859,11 @@ void Locker::handle_client_cap_release(MClientCapRelease *m) client_t client = m-get_source().num(); dout(10) handle_client_cap_release *m dendl; + if (!mds-is_clientreplay() !mds-is_active() !mds-is_stopping()) { +mds-wait_for_replay(new C_MDS_RetryMessage(mds, m)); +return; + } + for
Re: [PATCH 30/39] mds: check MDS peer's state through mdsmap
Yep. Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Migrator.cc | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index d626cb1..143d71e 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -238,7 +238,7 @@ void Migrator::handle_mds_failure_or_stop(int who)
       export_unlock(dir);
       export_locks.erase(dir);
       dir->state_clear(CDir::STATE_EXPORTING);
-      if (export_peer[dir] != who) // tell them.
+      if (mds->mdsmap->is_clientreplay_or_active_or_stopping(export_peer[dir])) // tell them.
         mds->send_message_mds(new MExportDirCancel(dir->dirfrag()), export_peer[dir]);
       break;

@@ -247,7 +247,7 @@ void Migrator::handle_mds_failure_or_stop(int who)
       dir->unfreeze_tree();  // cancel the freeze
       export_state.erase(dir); // clean up
       dir->state_clear(CDir::STATE_EXPORTING);
-      if (export_peer[dir] != who) // tell them.
+      if (mds->mdsmap->is_clientreplay_or_active_or_stopping(export_peer[dir])) // tell them.
         mds->send_message_mds(new MExportDirCancel(dir->dirfrag()), export_peer[dir]);
       break;

@@ -278,7 +278,7 @@ void Migrator::handle_mds_failure_or_stop(int who)
       export_unlock(dir);
       export_locks.erase(dir);
       dir->state_clear(CDir::STATE_EXPORTING);
-      if (export_peer[dir] != who) // tell them.
+      if (mds->mdsmap->is_clientreplay_or_active_or_stopping(export_peer[dir])) // tell them.
         mds->send_message_mds(new MExportDirCancel(dir->dirfrag()), export_peer[dir]);
       break;
--
1.7.11.7
Re: [PATCH 31/39] mds: unfreeze subtree if import aborts in PREPPED state
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Migrator.cc | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 143d71e..963706c 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -1658,11 +1658,14 @@ void Migrator::handle_export_cancel(MExportDirCancel *m)
     CInode *in = cache->get_inode(df.ino);
     assert(in);
     import_reverse_discovered(df, in);
-  } else if (import_state[df] == IMPORT_PREPPING ||
-             import_state[df] == IMPORT_PREPPED) {
+  } else if (import_state[df] == IMPORT_PREPPING) {
     CDir *dir = mds->mdcache->get_dirfrag(df);
     assert(dir);
     import_reverse_prepping(dir);
+  } else if (import_state[df] == IMPORT_PREPPED) {
+    CDir *dir = mds->mdcache->get_dirfrag(df);
+    assert(dir);
+    import_reverse_unfreeze(dir);
   } else {
     assert(0 == "got export_cancel in weird state");
   }
--
1.7.11.7
Re: [PATCH 32/39] mds: fix export cancel notification
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com The comment says that if the importer is dead, bystanders think the exporter is the only auth, as per mdcache->handle_mds_failure(). But there is no such code in MDCache::handle_mds_failure(). Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Migrator.cc | 20 +++++---------------
1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 963706c..40a5394 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -1390,17 +1390,9 @@ void Migrator::export_logged_finish(CDir *dir)
   for (set<int>::iterator p = export_notify_ack_waiting[dir].begin();
        p != export_notify_ack_waiting[dir].end();
        ++p) {
-    MExportDirNotify *notify;
-    if (mds->mdsmap->is_clientreplay_or_active_or_stopping(export_peer[dir]))
-      // dest is still alive.
-      notify = new MExportDirNotify(dir->dirfrag(), true,
-                                    pair<int,int>(mds->get_nodeid(), dest),
-                                    pair<int,int>(dest, CDIR_AUTH_UNKNOWN));
-    else
-      // dest is dead. bystanders will think i am only auth, as per mdcache->handle_mds_failure()
-      notify = new MExportDirNotify(dir->dirfrag(), true,
-                                    pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN),
-                                    pair<int,int>(dest, CDIR_AUTH_UNKNOWN));
+    MExportDirNotify *notify = new MExportDirNotify(dir->dirfrag(), true,
+                                                    pair<int,int>(mds->get_nodeid(), dest),
+                                                    pair<int,int>(dest, CDIR_AUTH_UNKNOWN));

     for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); i++)
       notify->get_bounds().push_back((*i)->dirfrag());
@@ -2115,11 +2107,9 @@ void Migrator::import_notify_abort(CDir *dir, set<CDir*>& bounds)
   for (set<int>::iterator p = import_bystanders[dir].begin();
        p != import_bystanders[dir].end();
        ++p) {
-    // NOTE: the bystander will think i am _only_ auth, because they will have seen
-    // the exporter's failure and updated the subtree auth. see mdcache->handle_mds_failure().
-    MExportDirNotify *notify =
+    MExportDirNotify *notify =
       new MExportDirNotify(dir->dirfrag(), true,
-                           pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN),
+                           pair<int,int>(import_peer[dir->dirfrag()], mds->get_nodeid()),
                            pair<int,int>(import_peer[dir->dirfrag()], CDIR_AUTH_UNKNOWN));
     for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); i++)
       notify->get_bounds().push_back((*i)->dirfrag());
--
1.7.11.7
Re: [PATCH 29/39] mds: avoid double auth pin for file recovery
On 03/21/2013 11:20 AM, Gregory Farnum wrote: This looks good on its face, but I haven't had the chance to dig through the recovery queue stuff yet (it's on my list following some issues with recovery speed). How'd you run across this? If it's being added to the recovery queue multiple times, I want to make sure we don't have some other machinery trying to dequeue it multiple times, or a single waiter which needs to be a list or something. -Greg

Two clients that were writing to the same file crashed in succession. Thanks, Yan, Zheng
Re: [PATCH 33/39] mds: notify bystanders if export aborts
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com So bystanders know the subtree is single auth earlier. Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Migrator.cc | 34 ++++++++++++++++++++++++----------
src/mds/Migrator.h | 1 +
2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 40a5394..0672d03 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -251,25 +251,28 @@ void Migrator::handle_mds_failure_or_stop(int who)
         mds->send_message_mds(new MExportDirCancel(dir->dirfrag()), export_peer[dir]);
       break;

-      // NOTE: state order reversal, warning comes after loggingstart+prepping
+      // NOTE: state order reversal, warning comes after prepping
     case EXPORT_WARNING:
       dout(10) << "export state=warning : unpinning bounds, unfreezing, notifying" << dendl;
       // fall-thru

     case EXPORT_PREPPING:
       if (p->second != EXPORT_WARNING)
-        dout(10) << "export state=loggingstart|prepping : unpinning bounds, unfreezing" << dendl;
+        dout(10) << "export state=prepping : unpinning bounds, unfreezing" << dendl;
       {
         // unpin bounds
         set<CDir*> bounds;
         cache->get_subtree_bounds(dir, bounds);
-        for (set<CDir*>::iterator p = bounds.begin();
-             p != bounds.end();
-             ++p) {
-          CDir *bd = *p;
+        for (set<CDir*>::iterator q = bounds.begin();
+             q != bounds.end();
+             ++q) {
+          CDir *bd = *q;
           bd->put(CDir::PIN_EXPORTBOUND);
           bd->state_clear(CDir::STATE_EXPORTBOUND);
         }
+        // notify bystanders
+        if (p->second == EXPORT_WARNING)
+          export_notify_abort(dir, bounds);
       }
       dir->unfreeze_tree();
       export_state.erase(dir); // clean up
@@ -1307,9 +1310,21 @@ void Migrator::handle_export_ack(MExportDirAck *m)
   m->put();
 }

+void Migrator::export_notify_abort(CDir *dir, set<CDir*>& bounds)
+{
+  dout(7) << "export_notify_abort " << *dir << dendl;
+
+  for (set<int>::iterator p = export_notify_ack_waiting[dir].begin();
+       p != export_notify_ack_waiting[dir].end();
+       ++p) {
+    MExportDirNotify *notify = new MExportDirNotify(dir->dirfrag(), false,
+                                                    pair<int,int>(mds->get_nodeid(), export_peer[dir]),
+                                                    pair<int,int>(mds->get_nodeid(), CDIR_AUTH_UNKNOWN));
+    for (set<CDir*>::iterator i = bounds.begin(); i != bounds.end(); ++i)
+      notify->get_bounds().push_back((*i)->dirfrag());
+    mds->send_message_mds(notify, *p);
+  }
+}

 /*
  * this happens if the dest fails after i send the export data but before it is acked
@@ -1356,6 +1371,9 @@ void Migrator::export_reverse(CDir *dir)
     bd->state_clear(CDir::STATE_EXPORTBOUND);
   }

+  // notify bystanders
+  export_notify_abort(dir, bounds);
+
   // process delayed expires
   cache->process_delayed_expire(dir);

diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
index 2889a74..f395bc1 100644
--- a/src/mds/Migrator.h
+++ b/src/mds/Migrator.h
@@ -227,6 +227,7 @@ public:
   void export_go(CDir *dir);
   void export_go_synced(CDir *dir);
   void export_reverse(CDir *dir);
+  void export_notify_abort(CDir *dir, set<CDir*>& bounds);
   void handle_export_ack(MExportDirAck *m);
   void export_logged_finish(CDir *dir);
   void handle_export_notify_ack(MExportDirNotifyAck *m);
--
1.7.11.7
Re: [PATCH 34/39] mds: don't open dirfrag while subtree is frozen
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/MDCache.cc | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index e9a79cd..30687ec 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7101,9 +7101,9 @@ int MDCache::path_traverse(MDRequest *mdr, Message *req, Context *fin, // wh
     if (!curdir) {
       if (cur->is_auth()) {
         // parent dir frozen_dir?
-        if (cur->is_frozen_dir()) {
-          dout(7) << "traverse: " << *cur->get_parent_dir() << " is frozen_dir, waiting" << dendl;
-          cur->get_parent_dn()->get_dir()->add_waiter(CDir::WAIT_UNFREEZE, _get_waiter(mdr, req, fin));
+        if (cur->is_frozen()) {
+          dout(7) << "traverse: " << *cur << " is frozen, waiting" << dendl;
+          cur->add_waiter(CDir::WAIT_UNFREEZE, _get_waiter(mdr, req, fin));
           return 1;
         }
         curdir = cur->get_or_open_dirfrag(this, fg);
--
1.7.11.7
Re: [PATCH 35/39] mds: clear dirty inode rstat if import fails
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/CDir.cc | 1 +
src/mds/Migrator.cc | 2 ++
2 files changed, 3 insertions(+)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 34bd8d3..47b6753 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1022,6 +1022,7 @@ void CDir::assimilate_dirty_rstat_inodes()
   for (elist<CInode*>::iterator p = dirty_rstat_inodes.begin_use_current();
        !p.end(); ++p) {
     CInode *in = *p;
+    assert(in->is_auth());
     if (in->is_frozen())
       continue;

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 0672d03..f563b8d 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2052,6 +2052,8 @@ void Migrator::import_reverse(CDir *dir)
       in->clear_replica_map();
       if (in->is_dirty())
         in->mark_clean();
+      in->clear_dirty_rstat();
+
       in->authlock.clear_gather();
       in->linklock.clear_gather();
       in->dirfragtreelock.clear_gather();
--
1.7.11.7
Re: [PATCH 36/39] mds: try merging subtree after clear EXPORTBOUND
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com
---
src/mds/Migrator.cc | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index f563b8d..9cbad87 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -1340,10 +1340,6 @@ void Migrator::export_reverse(CDir *dir)
   set<CDir*> bounds;
   cache->get_subtree_bounds(dir, bounds);

-  // adjust auth, with possible subtree merge.
-  cache->adjust_subtree_auth(dir, mds->get_nodeid());
-  cache->try_subtree_merge(dir);  // NOTE: may journal subtree_map as side-effect
-
   // remove exporting pins
   list<CDir*> rq;
   rq.push_back(dir);
@@ -1371,6 +1367,10 @@ void Migrator::export_reverse(CDir *dir)
     bd->state_clear(CDir::STATE_EXPORTBOUND);
   }

+  // adjust auth, with possible subtree merge.
+  cache->adjust_subtree_auth(dir, mds->get_nodeid());
+  cache->try_subtree_merge(dir);  // NOTE: may journal subtree_map as side-effect
+
   // notify bystanders
   export_notify_abort(dir, bounds);
--
1.7.11.7
Re: [PATCH 37/39] mds: eval inodes with caps imported by cache rejoin message
Reviewed-by: Greg Farnum <g...@inktank.com>

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng <zheng.z@intel.com> wrote:

From: Yan, Zheng <zheng.z@intel.com>

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/MDCache.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 30687ec..24f1109 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3823,6 +3823,7 @@ void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin *weak)
         dout(10) << " claiming cap import " << p->first << " client." << q->first << " on " << *in << dendl;
         rejoin_import_cap(in, q->first, q->second, from);
       }
+      mds->locker->eval(in, CEPH_CAP_LOCKS, true);
     }
   } else {
     assert(mds->is_rejoin());
--
1.7.11.7
Re: [PATCH 38/39] mds: don't replicate purging dentry
Reviewed-by: Greg Farnum <g...@inktank.com>

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng <zheng.z@intel.com> wrote:

From: Yan, Zheng <zheng.z@intel.com>

open_remote_ino is racy: it's possible someone deletes the inode's last
linkage while the MDS is discovering the inode.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/MDCache.cc | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 24f1109..d730ff1 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -9225,8 +9225,15 @@ void MDCache::handle_discover(MDiscover *dis)
     if (dis->get_want_ino()) {
       // lookup by ino
       CInode *in = get_inode(dis->get_want_ino(), snapid);
-      if (in && in->is_auth() && in->get_parent_dn()->get_dir() == curdir)
+      if (in && in->is_auth() && in->get_parent_dn()->get_dir() == curdir) {
         dn = in->get_parent_dn();
+        if (dn->state_test(CDentry::STATE_PURGING)) {
+          // set error flag in reply
+          dout(7) << "dentry " << *dn << " is purging, flagging error ino" << dendl;
+          reply->set_flag_error_ino();
+          break;
+        }
+      }
     } else if (dis->get_want().depth() > 0) {
       // lookup dentry
       dn = curdir->lookup(dis->get_dentry(i), snapid);
--
1.7.11.7
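The shape of the fix is easy to see in isolation: a lookup-by-ino hit is only replied to if its dentry is not mid-purge; otherwise the reply carries an error flag so the requester can cope with the vanished linkage instead of replicating a dying dentry. A minimal, self-contained sketch of that decision (hypothetical types, not the real Ceph classes):

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct Dentry {
  bool purging = false;
};

struct Reply {
  bool error_ino = false;    // analogous to set_flag_error_ino()
  const Dentry *dn = nullptr;
};

// Only replicate the dentry if it is not being purged; otherwise flag
// an error so the peer retries or fails the lookup cleanly.
Reply handle_lookup_by_ino(const std::unordered_map<uint64_t, Dentry> &cache,
                           uint64_t ino) {
  Reply r;
  auto it = cache.find(ino);
  if (it != cache.end()) {
    if (it->second.purging) {
      r.error_ino = true;
      return r;
    }
    r.dn = &it->second;      // safe to replicate
  }
  return r;
}

int main() {
  std::unordered_map<uint64_t, Dentry> cache = {{42, Dentry{true}}};
  Reply r = handle_lookup_by_ino(cache, 42);
  std::cout << "error_ino=" << r.error_ino << "\n";  // prints error_ino=1
}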
Re: [PATCH 39/39] mds: clear scatter dirty if replica inode has no auth subtree
Reviewed-by: Greg Farnum <g...@inktank.com>

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng <zheng.z@intel.com> wrote:

From: Yan, Zheng <zheng.z@intel.com>

This avoids sending superfluous scatterlock state to the recovering MDS.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/CInode.cc   |  5 +++--
 src/mds/CInode.h    |  2 +-
 src/mds/MDCache.cc  | 13 ++++++-------
 src/mds/Migrator.cc | 15 +++++++++++++++
 4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 42137f3..25cb6c1 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -615,12 +615,13 @@ void CInode::close_dirfrags()
     close_dirfrag(dirfrags.begin()->first);
 }

-bool CInode::has_subtree_root_dirfrag()
+bool CInode::has_subtree_root_dirfrag(int auth)
 {
   for (map<frag_t,CDir*>::iterator p = dirfrags.begin();
        p != dirfrags.end();
        ++p)
-    if (p->second->is_subtree_root())
+    if (p->second->is_subtree_root() &&
+        (auth == -1 || p->second->dir_auth.first == auth))
       return true;
   return false;
 }

diff --git a/src/mds/CInode.h b/src/mds/CInode.h
index f7b8f33..bea7430 100644
--- a/src/mds/CInode.h
+++ b/src/mds/CInode.h
@@ -344,7 +344,7 @@ public:
   CDir *add_dirfrag(CDir *dir);
   void close_dirfrag(frag_t fg);
   void close_dirfrags();
-  bool has_subtree_root_dirfrag();
+  bool has_subtree_root_dirfrag(int auth=-1);

   void force_dirfrags();
   void verify_dirfrags();

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index d730ff1..75c7ded 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -3330,8 +3330,10 @@ void MDCache::recalc_auth_bits()
   set<CInode*> subtree_inodes;
   for (map<CDir*,set<CDir*> >::iterator p = subtrees.begin();
        p != subtrees.end();
-       ++p)
-    subtree_inodes.insert(p->first->inode);
+       ++p) {
+    if (p->first->dir_auth.first == mds->get_nodeid())
+      subtree_inodes.insert(p->first->inode);
+  }

   for (map<CDir*,set<CDir*> >::iterator p = subtrees.begin();
        p != subtrees.end();
@@ -3390,11 +3392,8 @@ void MDCache::recalc_auth_bits()
           if (dnl->get_inode()->is_dirty())
             dnl->get_inode()->mark_clean();
           // avoid touching scatterlocks for our subtree roots!
-          if (subtree_inodes.count(dnl->get_inode()) == 0) {
-            dnl->get_inode()->filelock.remove_dirty();
-            dnl->get_inode()->nestlock.remove_dirty();
-            dnl->get_inode()->dirfragtreelock.remove_dirty();
-          }
+          if (subtree_inodes.count(dnl->get_inode()) == 0)
+            dnl->get_inode()->clear_scatter_dirty();
         }

         // recurse?

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 9cbad87..49d21ab 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -1095,6 +1095,10 @@ void Migrator::finish_export_inode(CInode *in, utime_t now, list<Context*>& finished)
   in->clear_dirty_rstat();

+  // no more auth subtree? clear scatter dirty
+  if (!in->has_subtree_root_dirfrag(mds->get_nodeid()))
+    in->clear_scatter_dirty();
+
   in->item_open_file.remove_myself();

   // waiters
@@ -1534,6 +1538,11 @@ void Migrator::export_finish(CDir *dir)
   cache->adjust_subtree_auth(dir, export_peer[dir]);
   cache->try_subtree_merge(dir);  // NOTE: may journal subtree_map as sideeffect

+  // no more auth subtree? clear scatter dirty
+  if (!dir->get_inode()->is_auth() &&
+      !dir->get_inode()->has_subtree_root_dirfrag(mds->get_nodeid()))
+    dir->get_inode()->clear_scatter_dirty();
+
   // unpin path
   export_unlock(dir);

@@ -2020,6 +2029,10 @@ void Migrator::import_reverse(CDir *dir)
   cache->trim_non_auth_subtree(dir);
   cache->adjust_subtree_auth(dir, import_peer[dir->dirfrag()]);

+  if (!dir->get_inode()->is_auth() &&
+      !dir->get_inode()->has_subtree_root_dirfrag(mds->get_nodeid()))
+    dir->get_inode()->clear_scatter_dirty();
+
   // adjust auth bits.
   list<CDir*> q;
   q.push_back(dir);
@@ -2053,6 +2066,8 @@ void Migrator::import_reverse(CDir *dir)
       if (in->is_dirty())
         in->mark_clean();
       in->clear_dirty_rstat();
+      if (!in->has_subtree_root_dirfrag(mds->get_nodeid()))
+        in->clear_scatter_dirty();

       in->authlock.clear_gather();
       in->linklock.clear_gather();
--
1.7.11.7
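The heart of this patch is the new auth argument: an inode may still hold subtree-root dirfrags, but if none of them is owned by this rank there is no reason to keep dirty scatterlock state. A self-contained sketch of the predicate (hypothetical types, not the real CInode/CDir):

#include <iostream>
#include <map>

struct Dirfrag {
  bool subtree_root;
  int auth;   // owning MDS rank
};

// With auth == -1, any subtree root counts; with a concrete rank, only
// subtree roots owned by that rank count.
bool has_subtree_root_dirfrag(const std::map<int, Dirfrag> &dirfrags,
                              int auth = -1) {
  for (const auto &p : dirfrags) {
    if (p.second.subtree_root &&
        (auth == -1 || p.second.auth == auth))
      return true;
  }
  return false;
}

int main() {
  // one subtree root, owned by rank 1
  std::map<int, Dirfrag> frags = {{0, {true, 1}}, {1, {false, 0}}};
  std::cout << has_subtree_root_dirfrag(frags)      // 1: some root exists
            << has_subtree_root_dirfrag(frags, 0)   // 0: none owned by rank 0
            << "\n";
}

So after an export or a reverted import, an inode whose remaining subtree roots all belong to other ranks can safely drop its scatterlock dirty flags, which is exactly what the Migrator.cc hunks above do.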
Re: [PATCH 11/39] mds: don't delay processing replica buffer in slave request
On Thu, 21 Mar 2013, Yan, Zheng wrote:

On 03/21/2013 05:19 AM, Greg Farnum wrote:

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

From: Yan, Zheng <zheng.z@intel.com>

Replicated objects need to be added into the cache immediately.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>

Why do we need to add them right away? Shouldn't we have a journaled replica if we need it?
-Greg

The issue I encountered was a lock action message arriving when the replicated objects were not yet in the cache, because the slave request carrying them had been delayed.

This makes sense to me; the add_replica_*() methods that create and push replicas of cache objects to other nodes need to always be applied immediately, or else the cache coherency falls apart. There are similar games played between the client and MDS with the caps protocol, although in that case IIRC there are certain limited circumstances where we can delay processing the message. For mds-mds traffic, I don't think that's possible, unless *all* potentially dependent traffic is also delayed to preserve ordering and so forth.

[That said, I didn't review the actual patch :)]
sage

Thanks,
Yan, Zheng

Software Engineer #42 @ http://inktank.com | http://ceph.com

---
 src/mds/MDCache.cc | 12 ++++++++++++
 src/mds/MDCache.h  |  2 +-
 src/mds/MDS.cc     |  6 +++---
 src/mds/Server.cc  | 55 ++++++++++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 0f6b842..b668842 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7722,6 +7722,18 @@ void MDCache::_find_ino_dir(inodeno_t ino, Context *fin, bufferlist& bl, int r)

 /* */

+int MDCache::get_num_client_requests()
+{
+  int count = 0;
+  for (hash_map<metareqid_t, MDRequest*>::iterator p = active_requests.begin();
+       p != active_requests.end();
+       ++p) {
+    if (p->second->reqid.name.is_client() && !p->second->is_slave())
+      count++;
+  }
+  return count;
+}
+
 /* This function takes over the reference to the passed Message */
 MDRequest *MDCache::request_start(MClientRequest *req)
 {

diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index a9f05c6..4634121 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -240,7 +240,7 @@ protected:
   hash_map<metareqid_t, MDRequest*> active_requests;

 public:
-  int get_num_active_requests() { return active_requests.size(); }
+  int get_num_client_requests();

   MDRequest* request_start(MClientRequest *req);
   MDRequest* request_start_slave(metareqid_t rid, __u32 attempt, int by);

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index b91dcbd..e99eecc 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1900,9 +1900,9 @@ bool MDS::_dispatch(Message *m)
       mdcache->is_open() &&
       replay_queue.empty() &&
       want_state == MDSMap::STATE_CLIENTREPLAY) {
-    dout(10) << " still have " << mdcache->get_num_active_requests()
-             << " active replay requests" << dendl;
-    if (mdcache->get_num_active_requests() == 0)
+    int num_requests = mdcache->get_num_client_requests();
+    dout(10) << " still have " << num_requests << " active replay requests" << dendl;
+    if (num_requests == 0)
       clientreplay_done();
   }

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4c4c86b..8e89e4c 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -107,10 +107,8 @@ void Server::dispatch(Message *m)
         (m->get_type() == CEPH_MSG_CLIENT_REQUEST &&
          (static_cast<MClientRequest*>(m))->is_replay()))) {
       // replaying!
-    } else if (mds->is_clientreplay() && m->get_type() == MSG_MDS_SLAVE_REQUEST &&
-               ((static_cast<MMDSSlaveRequest*>(m))->is_reply() ||
-                !mds->mdsmap->is_active(m->get_source().num()))) {
-      // slave reply or the master is also in the clientreplay stage
+    } else if (m->get_type() == MSG_MDS_SLAVE_REQUEST) {
+      // handle_slave_request() will wait if necessary
     } else {
       dout(3) << "not active yet, waiting" << dendl;
       mds->wait_for_active(new C_MDS_RetryMessage(mds, m));
@@ -1291,6 +1289,13 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
   if (m->is_reply())
     return handle_slave_request_reply(m);

+  CDentry *straydn = NULL;
+  if (m->stray.length() > 0) {
+    straydn = mdcache->add_replica_stray(m->stray, from);
+    assert(straydn);
+    m->stray.clear();
+  }
+
   // am i a new slave?
   MDRequest *mdr = NULL;
   if (mdcache->have_request(m->get_reqid())) {
@@ -1326,9 +1331,26 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
       m->put();
       return;
     }
-    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), m->get_source().num());
+    mdr = mdcache->request_start_slave(m->get_reqid(), m->get_attempt(), from);
   }
   assert(mdr->slave_request == 0); // only one at a time, please!
+
+  if (straydn) {
+    mdr->pin(straydn);
+    mdr->straydn = straydn;
+  }
+
+  if (!mds->is_clientreplay() && !mds->is_active() && !mds->is_stopping()) {
+    dout(3) << "not clientreplay|active
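To make the ordering constraint from the discussion concrete: replicas piggybacked on a slave request must land in the cache before any later lock traffic that references them, even if the request body itself has to wait for the MDS to reach a serviceable state. A minimal sketch of that split, with hypothetical types rather than the real Ceph message classes:

#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct SlaveRequest {
  std::vector<std::string> replicas;  // piggybacked cache objects
  std::string op;
};

struct Cache {
  std::vector<std::string> objects;
  void add_replica(const std::string &o) { objects.push_back(o); }
};

struct MDS {
  bool active = false;
  Cache cache;
  std::queue<SlaveRequest> waiting;

  void handle_slave_request(SlaveRequest m) {
    // 1. Always apply replicas first; otherwise later lock messages
    //    referencing these objects would not find them in the cache.
    for (const auto &o : m.replicas)
      cache.add_replica(o);
    m.replicas.clear();

    // 2. The request body itself may wait for the right MDS state.
    if (!active) {
      waiting.push(std::move(m));
      return;
    }
    process(m);
  }

  void process(const SlaveRequest &m) { std::cout << "op: " << m.op << "\n"; }
};

int main() {
  MDS mds;
  mds.handle_slave_request({{"inode:100", "dentry:foo"}, "rename"});
  std::cout << "cached=" << mds.cache.objects.size()
            << " deferred=" << mds.waiting.size() << "\n";  // cached=2 deferred=1
}

This mirrors the patch's structure: handle_slave_request() extracts the stray dentry replica up front, then decides whether to defer the rest.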
Re: [PATCH 29/39] mds: avoid double auth pin for file recovery
On Thu, 21 Mar 2013, Yan, Zheng wrote:

On 03/21/2013 11:20 AM, Gregory Farnum wrote:

This looks good on its face, but I haven't had the chance to dig through the recovery queue stuff yet (it's on my list following some issues with recovery speed). How'd you run across this? If it's being added to the recovery queue multiple times, I want to make sure we don't have some other machinery trying to dequeue it multiple times, or a single waiter which needs to be a list or something.
-Greg

Two clients that were writing the same file crashed successively.

Hmm, I would love to have a test case for this. It should be pretty easy to construct some tests with libcephfs that fork, connect and do some operations, and are then killed by the parent, who verifies the resulting recovery occurs. This is some of the more fragile code, not least because it is rarely tested.

sage

Thanks,
Yan, Zheng

On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng <zheng.z@intel.com> wrote:

From: Yan, Zheng <zheng.z@intel.com>

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/MDCache.cc | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 973a4d0..e9a79cd 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -5502,8 +5502,10 @@ void MDCache::_queue_file_recover(CInode *in)
   dout(15) << "_queue_file_recover " << *in << dendl;
   assert(in->is_auth());
   in->state_clear(CInode::STATE_NEEDSRECOVER);
-  in->state_set(CInode::STATE_RECOVERING);
-  in->auth_pin(this);
+  if (!in->state_test(CInode::STATE_RECOVERING)) {
+    in->state_set(CInode::STATE_RECOVERING);
+    in->auth_pin(this);
+  }
   file_recover_queue.insert(in);
 }
--
1.7.11.7
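A rough cut of the libcephfs test Sage describes might look like the following: fork a writer, SIGKILL it so no session close happens, then verify from the parent that recovery produces a sane file size. This is a sketch under assumptions (a reachable cluster, the default ceph.conf search path, and "/recover_me" as a hypothetical scratch path), not a polished test; error handling is trimmed for brevity.

/* build with something like: g++ recover_test.cc -lcephfs */
#include <cephfs/libcephfs.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

int main(void)
{
  pid_t pid = fork();
  if (pid < 0)
    die("fork");

  if (pid == 0) {                       /* child: the "crashing" client */
    struct ceph_mount_info *cmount;
    ceph_create(&cmount, NULL);
    ceph_conf_read_file(cmount, NULL);  /* default config search path */
    if (ceph_mount(cmount, "/") < 0)
      die("child mount");
    int fd = ceph_open(cmount, "/recover_me", O_CREAT|O_WRONLY, 0644);
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    for (int64_t off = 0; ; off += sizeof(buf))  /* write until killed */
      ceph_write(cmount, fd, buf, sizeof(buf), off);
  }

  sleep(5);                     /* let the child get some writes out */
  kill(pid, SIGKILL);           /* simulate a crash: no session close */
  waitpid(pid, NULL, 0);

  /* parent: mount and check that the MDS recovers the file */
  struct ceph_mount_info *cmount;
  ceph_create(&cmount, NULL);
  ceph_conf_read_file(cmount, NULL);
  if (ceph_mount(cmount, "/") < 0)
    die("parent mount");
  struct stat st;
  /* may block until the MDS finishes recovering the file size/mtime */
  if (ceph_stat(cmount, "/recover_me", &st) < 0)
    die("stat");
  printf("recovered size: %lld\n", (long long)st.st_size);
  ceph_shutdown(cmount);
  return 0;
}

Running two such writers against the same file and killing them in quick succession would approximate the double-crash scenario Yan hit, where the inode risked being queued for recovery twice.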