Re: ceph:rgw issue #10698
Hi Yehuda, Would it be possible to have the bug fixes #10698 + #10062 for the S3 POST issue backported for the new firefly release? This feature is very important for us; our video conversion engine relies on the user's S3 browser POST. Best regards, Valery On 17/02/15 18:08, Yehuda Sadeh-Weinraub wrote: Subject: ceph:rgw issue #10698 Hello Yehuda, The issue http://tracker.ceph.com/issues/10698 "rgw: not failing POST requests if keystone not configured" is marked as resolved, but I don't think it has been backported to firefly. Issue http://tracker.ceph.com/issues/10062 should already be backported, if I'm not wrong... Neither made it in. The bug wasn't set to be backported to firefly. We can set it to get backported if there's demand; however, I'm not sure that it's going to be a trivial backport. Yehuda -- SWITCH -- Valery Tschopp, Software Engineer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland email: valery.tsch...@switch.ch phone: +41 44 268 1544
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Nice work Mark! I don't see any tuning for sharding in the sample config file (osd_op_num_threads_per_shard, osd_op_num_shards, ...). Since you only use 1 ssd for the bench, I think it should improve results for hammer? - Original Message - From: Mark Nelson mnel...@redhat.com To: ceph-devel ceph-devel@vger.kernel.org Cc: ceph-users ceph-us...@lists.ceph.com Sent: Tuesday, 17 February 2015 18:37:01 Subject: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison Hi All, I wrote up a short document describing some tests I ran recently to look at how SSD backed OSD performance has changed across our LTS releases. This is just looking at RADOS performance and not RBD or RGW. It also doesn't offer any real explanations regarding the results. It's just a first high level step toward understanding some of the behaviors folks on the mailing list have reported over the last couple of releases. I hope you find it useful. Mark
Re: dumpling integration branch for v0.67.12 ready for QE
On 18/02/2015 18:38, Yuri Weinstein wrote: Hi all I updated all issues in http://tracker.ceph.com/issues/10560 Based on what is listed there, we have http://tracker.ceph.com/issues/10801 - Yehuda pls comment http://tracker.ceph.com/issues/10694 - Sam pls re-confirm rbd - Josh, I understood that we are good to go, pls re-confirm. I can re-run some suites if you'd like and we can make a call on this release. Loic - back to you, let me know what you think. As long as you're satisfied with the test results, I have no further comment :-) Cheers Thx YuriW [snip - earlier thread history, quoted in full in Yuri's message below]
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
I don't have really good insight yet into how tweaking these would affect single-osd performance. I know the PCIe SSDs do have multiple controllers on-board so perhaps increasing the number of shards would improve things, but I suspect that going too high could maybe start hurting performance as well. Have you done any testing here? It could be an interesting follow-up paper. I think it should be tuned according to the number of osds and the number of cores you have. I have done tests in the past with Somnath's values: osd_op_num_threads_per_shard = 1 osd_op_num_shards = 25 filestore_fd_cache_size = 64 filestore_fd_cache_shards = 32 But I haven't taken the time to try different values. If I remember, I was able to reach around 120k iops 4k read with 3 osds (but I was limited by the client cpu). I'm going to do a big benchmark next month (3 nodes (20 cores) with 6 ssds each), so I'll try to test different sharding values with different numbers of osds. - Original Message - From: Mark Nelson mnel...@redhat.com To: aderumier aderum...@odiso.com Cc: ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com Sent: Wednesday, 18 February 2015 15:56:44 Subject: Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison [snip - Mark's reply, quoted in full below]
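For anyone who wants to experiment with these, they go in the [osd] section of ceph.conf and take effect on OSD restart. A sketch using Alexandre's numbers above, which are one data point from his testing rather than a recommendation:

  [osd]
  osd_op_num_threads_per_shard = 1
  osd_op_num_shards = 25
  filestore_fd_cache_size = 64
  filestore_fd_cache_shards = 32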
[radosgw] inconsistency between bucket and bucket.instance metadata
Hi all, Context: Firefly 0.80.8, Ubuntu 14.04 LTS, lab cluster. Yesterday I successfully deleted an s3 bucket Bucket001ghis after removing its contents. Today, while browsing the radosgw system metadata, I discovered a discrepancy between the bucket metadata and the bucket.instance metadata, as follows. radosgw-admin --name client.radosgw.fr-rennes-radosgw1 metadata list bucket [ "bucket001ghis", "ghis", "bucket001johndoe", "bucket001transfert", "myb1", "mybucket"] radosgw-admin --name client.radosgw.fr-rennes-radosgw1 metadata list bucket.instance [ "bucket001ghis:fr-rennes-radosgw1.247011.1", "Bucket001ghis:fr-rennes-radosgw1.244654.2", "myb1:fr-rennes-radosgw1.246846.1", "mybucket:fr-rennes-radosgw1.244748.1", "bucket001transfert:fr-rennes-radosgw1.244654.1", "bucket001johndoe:fr-rennes-radosgw1.244742.1", "ghis:fr-rennes-radosgw1.246056.1"] Bucket001ghis:fr-rennes-radosgw1.244654.2 is still referenced in the bucket.instance metadata even though the bucket was deleted. What could be causing this? Best regards
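For anyone hitting the same thing, a minimal way to inspect and, if it is confirmed stale, clean up the leftover instance entry - a sketch assuming a Firefly-era radosgw-admin, with the instance id taken from the listing above:

  radosgw-admin --name client.radosgw.fr-rennes-radosgw1 \
      metadata get bucket.instance:Bucket001ghis:fr-rennes-radosgw1.244654.2
  # if the entry is confirmed stale, remove it:
  radosgw-admin --name client.radosgw.fr-rennes-radosgw1 \
      metadata rm bucket.instance:Bucket001ghis:fr-rennes-radosgw1.244654.2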
Re: dumpling integration branch for v0.67.12 ready for QE
Yup, 10694 is a known bug in dumpling which we probably don't want to fix. The rados tests look ok to me I think. -Sam On Wed, Feb 18, 2015 at 9:38 AM, Yuri Weinstein ywein...@redhat.com wrote: Hi all I updated all issues in http://tracker.ceph.com/issues/10560 Based on what is listed there, we have http://tracker.ceph.com/issues/10801 - Yehuda pls comment http://tracker.ceph.com/issues/10694 - Sam pls re-confirm rbd - Josh, I understood that we are good to go, pls re-confirm. I can re-run some suites if you'd like and we can make a call on this release. Loic - back to you, let me know what you think. Thx YuriW [snip - earlier thread history, quoted in full in Yuri's message below]
Re: ceph:rgw issue #10698
- Original Message - From: Valery Tschopp valery.tsch...@switch.ch To: Yehuda Sadeh-Weinraub yeh...@redhat.com Cc: ceph-devel ceph-devel@vger.kernel.org Sent: Wednesday, February 18, 2015 12:32:47 AM Subject: Re: ceph:rgw issue #10698 Hi Yehuda, Would it be possible to have the bug fixes #10698 + #10062 for the S3 POST issue backported for the new firefly release? This feature is very important for us; our video conversion engine relies on the user's S3 browser POST. I reopened the issues and set them as pending for backport. I pushed the wip-firefly-rgw-backports branch with these fixes in it. Yehuda [snip - rest of quoted thread]
Re: dumpling integration branch for v0.67.12 ready for QE
Hi all I updated all issues in http://tracker.ceph.com/issues/10560 Based on what is listed there, we have http://tracker.ceph.com/issues/10801 - Yehuda pls comment http://tracker.ceph.com/issues/10694 - Sam pls re-confirm rbd - Josh, I understood that we are good to go, pls re-confirm. I can re-run some suites if you'd like and we can make a call on this release. Loic - back to you, let me know what you think. Thx YuriW - Original Message - From: Loic Dachary l...@dachary.org To: Yuri Weinstein ywein...@redhat.com Cc: Ceph Development ceph-devel@vger.kernel.org, Sage Weil s...@redhat.com, Tamil Muthamizhan tmuth...@redhat.com, Zack Cerza z...@redhat.com, Sandon Van Ness svann...@redhat.com Sent: Thursday, February 12, 2015 2:17:49 PM Subject: Re: dumpling integration branch for v0.67.12 ready for QE On 12/02/2015 23:06, Yuri Weinstein wrote: I linked all issues related to this release testing to the ticket http://tracker.ceph.com/issues/10560 After the team leads make a call on those, including environment issues, I suggest re-running the suites that failed. Loic, I'd re-run them in the Octo, since we already started there, if you agree? Sure :-) Thx YuriW - Original Message - From: Yuri Weinstein ywein...@redhat.com To: Loic Dachary l...@dachary.org Cc: Ceph Development ceph-devel@vger.kernel.org, Sage Weil s...@redhat.com, Tamil Muthamizhan tmuth...@redhat.com Sent: Wednesday, February 11, 2015 2:24:33 PM Subject: Re: dumpling integration branch for v0.67.12 ready for QE I replied to individual suite runs, but just wanted to summarize QE validation status. The following suites were executed in the Octo lab (we will use Sepia in the future if nobody objects). upgrade:dumpling ['45493'] http://tracker.ceph.com/issues/10694 - Known Won't fix Assertion: osd/Watch.cc: 290: FAILED assert(!cb) *** Sam - pls confirm the Won't fix status. ['45495', '45496', '45498', '45499', '45500'] http://tracker.ceph.com/issues/10838 s3tests failed *** Yehuda - need your verdict on s3tests. fs All green ! rados ['45054'] http://tracker.ceph.com/issues/10841 Issued certificate has expired *** Sandon pls comment. ['45168', '45169'] http://tracker.ceph.com/issues/10840 coredump ceph_test_filestore_idempotent_sequence *** Sam - pls comment ['45215'] Missing packages - no ticket FYI Failed to fetch http://apt-mirror.front.sepia.ceph.com/archive.ubuntu.com/ubuntu/dists/trusty-updates/universe/binary-i386/Packages Hash Sum mismatch *** Zack, Sandon ? ceph-deploy Travis - pls suggest In general I am not sure if we needed to test this - Sage? rbd ['45365', '45366', '45367'] http://tracker.ceph.com/issues/10842 unable to connect to apt-mirror.front.sepia.ceph.com ['45349', '45350', '45351', '45355', '45356', '45357', '45363'] http://tracker.ceph.com/issues/10802 error: image still has watchers (duplicate of 10680) *** Zack, Sandon, Josh - all environment noise, pls comment. rgw ['45382', '45390'] http://tracker.ceph.com/issues/10843 s3tests failed - could be related or duplicate of 10838 *** Yehuda - same as issues in upgrades? I am standing by for your analysis/replies and recommendations for next steps. Loic - let me know if you want to follow specific items in our backport testing process that I missed here. PS: I would think that you could've wanted to assign the release ticket to QE (me) for validation and at this point I could've re-assigned it back to devel (you), eh?
Thx YuriW - Original Message - From: Loic Dachary l...@dachary.org To: Yuri Weinstein ywein...@redhat.com Cc: Ceph Development ceph-devel@vger.kernel.org Sent: Tuesday, February 10, 2015 9:05:31 AM Subject: dumpling integration branch for v0.67.12 ready for QE Hi Yuri, The dumpling integration branch for v0.67.12 as found at https://github.com/ceph/ceph/commits/dumpling-backports has been approved by Yehuda, Josh and Sam and is ready for QE. For the record, the head is https://github.com/ceph/ceph/commit/3944c77c404c4a05886fe8276d5d0dd7e4f20410 I think it would be best for the QE tests to use the dumpling-backports branch. The alternative would be to merge dumpling-backports into dumpling. However, since testing may take a long time and require more patches, it probably is better not to do that iterative process on the dumpling branch itself. As it is now, there already are a number of commits in the dumpling branch that should really be in dumpling-backports: they do not belong to v0.67.11 and are going to be released in v0.67.12. In the future though, the dumpling branch will only receive commits that have been carefully tested and all the integration work will be on the dumpling-backports branch exclusively. So that third parties do not have
12 March - Ceph Day San Francisco
Hey cephers, We still have a couple of speaking slots open for Ceph Day San Francisco on 12 March. I'm open to both high-level "what have you been doing with Ceph" type talks as well as more technical "here is what we're writing and/or integrating with Ceph" talks. I know many folks will be at VAULT, but we figured there would still be plenty of folks left on the west coast, so let me know if you'd be interested in speaking. Thanks! -- Best Regards, Patrick McGarry Director Ceph Community || Red Hat http://ceph.com || http://community.redhat.com @scuttlemonkey || @ceph
Re: dumpling integration branch for v0.67.12 ready for QE
- Original Message - From: Loic Dachary l...@dachary.org To: Yuri Weinstein ywein...@redhat.com Cc: Ceph Development ceph-devel@vger.kernel.org, Tamil Muthamizhan tmuth...@redhat.com Sent: Wednesday, February 18, 2015 9:56:14 AM Subject: Re: dumpling integration branch for v0.67.12 ready for QE On 18/02/2015 18:38, Yuri Weinstein wrote: Hi all I updated all issues in http://tracker.ceph.com/issues/10560 Based on what is listed there, we have http://tracker.ceph.com/issues/10801 - Yehuda pls comment http://tracker.ceph.com/issues/10694 - Sam pls re-confirm rbd - Josh, I understood that we are good to go, pls re-confirm. Yes, good to go from my perspective.
poll: If calamari could monitor and alert on 5 things...
What would they be? Please respond with your top 5 "things I want to know about a Ceph cluster". I want Calamari to have improved monitoring of Ceph, and I would like to focus on getting a few things exposed really well through the calamari-api. If you need some inspiration, there is http://redhatstorage.redhat.com/2015/02/12/10-commands-every-ceph-administrator-should-know/ Calamari already exposes a number of these. regards, Gregory
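As a starting point for suggestions, here are a few of the stock CLI checks admins lean on today - a sketch of the kind of data points Calamari could surface, not a statement of what it currently exposes:

  ceph status          # overall health, mon quorum, PG states
  ceph df              # raw and per-pool capacity usage
  ceph osd tree        # OSD up/down/in/out, arranged by failure domain
  ceph osd perf        # per-OSD commit and apply latency
  ceph health detail   # expanded warnings, e.g. which OSDs are nearfull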
full_ratios - please explain?
Can someone explain the interaction and effects of all of these full_ratio parameters? I haven't found any really good explanation of how they affect the distribution of data once the cluster gets above the nearfull ratio and close to the full ratio. mon_osd_full_ratio mon_osd_nearfull_ratio osd_backfill_full_ratio osd_failsafe_full_ratio osd_failsafe_nearfull_ratio We have a cluster with about 144 OSDs (518 TB) and are trying to get it to a 90% full rate for testing purposes. We've found that when some of the OSDs get above the mon_osd_full_ratio value (.95 in our system), the cluster stops accepting any new data, even though there is plenty of space left on other OSDs that are not yet even up to 90%. Tweaking the osd_failsafe ratios enabled data to move again for a bit, but eventually it becomes unbalanced and stops working again. Is there a recommended combination of values to use that will allow the cluster to continue accepting data and rebalancing correctly above 90%? thanks, Wyllys Ingersoll
[PATCH] libceph: kfree() in put_osd() shouldn't depend on authorizer
a255651d4cad ("ceph: ensure auth ops are defined before use") made
kfree() in put_osd() conditional on the authorizer.  A mechanical
mistake most likely - fix it.

Cc: Alex Elder el...@linaro.org
Signed-off-by: Ilya Dryomov idryo...@gmail.com
---
 net/ceph/osd_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index f693a2f8ac86..41a4abc7e98e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1035,10 +1035,11 @@ static void put_osd(struct ceph_osd *osd)
 {
 	dout("put_osd %p %d -> %d\n", osd, atomic_read(&osd->o_ref),
 	     atomic_read(&osd->o_ref) - 1);
-	if (atomic_dec_and_test(&osd->o_ref) && osd->o_auth.authorizer) {
+	if (atomic_dec_and_test(&osd->o_ref)) {
 		struct ceph_auth_client *ac = osd->o_osdc->client->monc.auth;

-		ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer);
+		if (osd->o_auth.authorizer)
+			ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer);
 		kfree(osd);
 	}
 }
--
1.9.3
Re: [PATCH] libceph: kfree() in put_osd() shouldn't depend on authorizer
On Wed, Feb 18, 2015 at 4:27 PM, Ilya Dryomov idryo...@gmail.com wrote: [snip - patch quoted in full] Sorry, this is a dup - ignore it. Thanks, Ilya
Re: [PATCH] libceph: kfree() in put_osd() shouldn't depend on authorizer
On 02/18/2015 07:27 AM, Ilya Dryomov wrote: a255651d4cad ("ceph: ensure auth ops are defined before use") made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it. You are generous in suggesting it's a mechanical mistake. But it is a mistake nevertheless. The fix looks good. Reviewed-by: Alex Elder el...@linaro.org [snip - patch quoted in full]
Re: full_ratios - please explain?
On Wed, 18 Feb 2015, Wyllys Ingersoll wrote: Thanks! More below inline... On Wed, Feb 18, 2015 at 10:05 AM, Wido den Hollander w...@42on.com wrote: On 18-02-15 15:39, Wyllys Ingersoll wrote: Can someone explain the interaction and effects of all of these full_ratio parameters? I haven't found any really good explanation of how they affect the distribution of data once the cluster gets above the nearfull ratio and close to the full ratio. When only ONE (1) OSD goes over the mon_osd_nearfull_ratio the cluster goes from HEALTH_OK into HEALTH_WARN state. mon_osd_full_ratio mon_osd_nearfull_ratio osd_backfill_full_ratio osd_failsafe_full_ratio osd_failsafe_nearfull_ratio We have a cluster with about 144 OSDs (518 TB) and are trying to get it to a 90% full rate for testing purposes. We've found that when some of the OSDs get above the mon_osd_full_ratio value (.95 in our system), the cluster stops accepting any new data, even though there is plenty of space left on other OSDs that are not yet even up to 90%. Tweaking the osd_failsafe ratios enabled data to move again for a bit, but eventually it becomes unbalanced and stops working again. Yes, that is because with Ceph safety goes first. When only one OSD goes over the full ratio the whole cluster stops I/O. Which full_ratio? The problem is that there are at least 3 full_ratios - mon_osd_full_ratio, osd_failsafe_full_ratio, and osd_backfill_full_ratio - how do they interact? What is the consequence of having one be higher than the others? mon_osd_full_ratio (.95) ... when any OSD reaches this threshold the monitor marks the cluster as 'full' and client writes are not accepted. mon_osd_nearfull_ratio (.85) ... when any OSD reaches this threshold the cluster goes HEALTH_WARN and calls out near-full OSDs. osd_backfill_full_ratio (.85) ... when an OSD locally reaches this threshold it will refuse to migrate a PG to itself. This prevents rebalancing or repair from overfilling an OSD. It should be lower than the full ratio. The osd_failsafe_full_ratio (.97) is a final sanity check that makes the OSD throw out writes if it is really close to full. It's bad news if an OSD fills up completely so we do what we can to prevent it. It seems extreme that 1 full osd out of potentially hundreds would cause all IO into the cluster to stop when there are literally 10s or 100s of terabytes of space left on other, less-full OSDs. Yes, but the nature of hash-based distribution is that you don't know where a write will go, so you don't want to let the cluster fill up. 85% is pretty conservative; you could increase it if you're comfortable. Just be aware that file systems over 80% start to get very slow so it is a bad idea to run them this full anyway. The confusion for me (and probably for others) is the proliferation of full_ratio parameters and a lack of clarity on how they all affect the cluster health and ability to balance when things start to fill up. CRUSH does not take OSD utilization into account when placing data, so it's almost impossible to predict which I/O can continue. Data safety and integrity is priority number 1. Full disks are a danger to those priorities, so I/O is stopped. Understood, but 1 full disk out of hundreds should not cause the entire system to stop accepting new data or even balancing out the data that it already has, especially when there is room to grow yet on other OSDs. The proper response to this currently is that if an OSD reaches the lower nearfull threshold the admin gets a warning and triggers some rebalancing.
That's why it's 10% lower than the actual full cutoff--there is plenty of time to adjust weights and/or expand the cluster. It's not an ideal approach, perhaps, but it's simple and works well enough. And it's not clear that there is anything better we can do that isn't also very complicated... sage
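To make that "admin gets a warning and triggers some rebalancing" step concrete, a typical reaction to a nearfull warning might look like the following sketch (the osd id and the weights are hypothetical; set_full_ratio is the pre-Luminous CLI of this era):

  ceph health detail              # shows which OSDs are near full
  ceph osd reweight 12 0.9        # temporarily shift PGs off the offending OSD
  ceph pg set_full_ratio 0.97     # or, at your own risk, raise the full cutoff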
Re: full_ratios - please explain?
OK, thanks for the clarifications! -Wyllys On Wed, Feb 18, 2015 at 10:52 AM, Sage Weil s...@newdream.net wrote: [snip - full thread quoted above]
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Hi Alex, Thanks! I didn't tweak the sharding settings at all, so they are just at the default values: OPTION(osd_op_num_threads_per_shard, OPT_INT, 2) OPTION(osd_op_num_shards, OPT_INT, 5) I don't have really good insight yet into how tweaking these would affect single-osd performance. I know the PCIe SSDs do have multiple controllers on-board so perhaps increasing the number of shards would improve things, but I suspect that going too high could maybe start hurting performance as well. Have you done any testing here? It could be an interesting follow-up paper. Mark On 02/18/2015 02:34 AM, Alexandre DERUMIER wrote: [snip - message quoted in full above]
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Hi Andrei, On 02/18/2015 09:08 AM, Andrei Mikhailovsky wrote: Mark, many thanks for your effort and ceph performance tests. This puts things in perspective. Looking at the results, I was a bit concerned that the IOPS performance in neither release comes even marginally close to the capabilities of the underlying ssd device. Even the fastest PCIe ssds have only managed to achieve about 1/6th the IOPS of the raw device. Perspective is definitely good! Any time you are dealing with latency sensitive workloads, there are a lot of bottlenecks that can limit your performance. There's a world of difference between streaming data to a raw SSD as fast as possible and writing data out to a distributed storage system that is calculating data placement, invoking the TCP stack, doing CRC checks, journaling writes, and invoking the VM layer to cache data in case it's hot (which in this case it's not). I guess there is a great deal more optimisation to be done in the upcoming LTS releases to make the IOPS rate close to the raw device performance. There is definitely still room for improvement! It's important to remember though that there is always going to be a trade off between flexibility, data integrity, and performance. If low latency is your number one need before anything else, you are probably best off eliminating as much software as possible between you and the device (except possibly if you can make clever use of caching). While Ceph itself is sometimes the bottleneck, in many cases we've found that bottlenecks in the software that surrounds Ceph are just as big obstacles (filesystem, VM layer, TCP stack, leveldb, etc). I have done some testing in the past and noticed that despite the server having a lot of unused resources (about 40-50% server idle and about 60-70% ssd idle) Ceph would not perform well when used with ssds. I was testing with Firefly + auth and my IOPS rate was around the 3K mark. Something is holding Ceph back from performing well with ssds ((( Out of curiosity, did you try the same tests directly on the SSD? Andrei [snip - original announcement quoted in full]
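For the raw-device baseline Mark asks about, a common way to measure it is a direct-I/O fio run against the SSD - a sketch only, with the device path, runtime and queue depth as placeholders (randread is non-destructive, but be careful before switching to write tests):

  fio --name=raw-randread --filename=/dev/sdX --ioengine=libaio \
      --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
      --runtime=60 --time_based --group_reporting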
Re: [PATCH] libceph: fix double __remove_osd() problem
On 02/18/2015 07:25 AM, Ilya Dryomov wrote: It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. [snip - reproduction scenario] A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 So now you rely on the RB node in the osd getting cleared as a signal that it has been removed already. Yes, that sounds like refcounting isn't working as desired... The mutex around all calls to (now) remove_osd() avoids races. I like the RB_CLEAR_NODE() call anyway. OK, enough chit chat. This looks OK to me. Reviewed-by: Alex Elder el...@linaro.org [snip - patch quoted in full]
[PATCH] libceph: fix double __remove_osd() problem
It turns out it's possible to get __remove_osd() called twice on the
same OSD.  That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory.  One scenario that I was able to reproduce is as follows:

            <osd3 is idle, on the osd lru list>
<con reset - osd3>
con_fault_finish()
  osd_reset()
                          <osdmap - osd3 down>
                          ceph_osdc_handle_map()
                            <takes map_sem>
                            kick_requests()
                              <takes request_mutex>
                              reset_changed_osds()
                                __reset_osd()
                                  __remove_osd()
                              <releases request_mutex>
                            <releases map_sem>
    <takes map_sem>
    <takes request_mutex>
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().

Fixes: http://tracker.ceph.com/issues/8087

Cc: Sage Weil sw...@redhat.com
Cc: sta...@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd()
Cc: sta...@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts
Cc: sta...@vger.kernel.org # 3.9+
Signed-off-by: Ilya Dryomov idryo...@gmail.com
---
 net/ceph/osd_client.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 53299c7b0ca4..f693a2f8ac86 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1048,14 +1048,24 @@ static void put_osd(struct ceph_osd *osd)
  */
 static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
 {
-	dout("__remove_osd %p\n", osd);
+	dout("%s %p osd%d\n", __func__, osd, osd->o_osd);
 	WARN_ON(!list_empty(&osd->o_requests));
 	WARN_ON(!list_empty(&osd->o_linger_requests));

-	rb_erase(&osd->o_node, &osdc->osds);
 	list_del_init(&osd->o_osd_lru);
-	ceph_con_close(&osd->o_con);
-	put_osd(osd);
+	rb_erase(&osd->o_node, &osdc->osds);
+	RB_CLEAR_NODE(&osd->o_node);
+}
+
+static void remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
+{
+	dout("%s %p osd%d\n", __func__, osd, osd->o_osd);
+
+	if (!RB_EMPTY_NODE(&osd->o_node)) {
+		ceph_con_close(&osd->o_con);
+		__remove_osd(osdc, osd);
+		put_osd(osd);
+	}
 }

 static void remove_all_osds(struct ceph_osd_client *osdc)
@@ -1065,7 +1075,7 @@ static void remove_all_osds(struct ceph_osd_client *osdc)
 	while (!RB_EMPTY_ROOT(&osdc->osds)) {
 		struct ceph_osd *osd = rb_entry(rb_first(&osdc->osds),
 						struct ceph_osd, o_node);
-		__remove_osd(osdc, osd);
+		remove_osd(osdc, osd);
 	}
 	mutex_unlock(&osdc->request_mutex);
 }
@@ -1106,7 +1116,7 @@ static void remove_old_osds(struct ceph_osd_client *osdc)
 	list_for_each_entry_safe(osd, nosd, &osdc->osd_lru, o_osd_lru) {
 		if (time_before(jiffies, osd->lru_ttl))
 			break;
-		__remove_osd(osdc, osd);
+		remove_osd(osdc, osd);
 	}
 	mutex_unlock(&osdc->request_mutex);
 }
@@ -1121,8 +1131,7 @@ static int __reset_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
 	dout("__reset_osd %p osd%d\n", osd, osd->o_osd);
 	if (list_empty(&osd->o_requests) &&
 	    list_empty(&osd->o_linger_requests)) {
-		__remove_osd(osdc, osd);
-
+		remove_osd(osdc, osd);
 		return -ENODEV;
 	}

@@ -1926,6 +1935,7 @@ static void reset_changed_osds(struct ceph_osd_client *osdc)
 {
 	struct rb_node *p, *n;

+	dout("%s %p\n", __func__, osdc);
 	for (p = rb_first(&osdc->osds); p; p = n) {
 		struct ceph_osd *osd = rb_entry(p, struct ceph_osd, o_node);
--
1.9.3
Re: full_ratios - please explain?
On 18-02-15 15:39, Wyllys Ingersoll wrote: Can someone explain the interaction and effects of all of these full_ratio parameters? I haven't found any really good explanation of how they affect the distribution of data once the cluster gets above the nearfull ratio and close to the full ratio. When only ONE (1) OSD goes over the mon_osd_nearfull_ratio the cluster goes from HEALTH_OK into HEALTH_WARN state. mon_osd_full_ratio mon_osd_nearfull_ratio osd_backfill_full_ratio osd_failsafe_full_ratio osd_failsafe_nearfull_ratio We have a cluster with about 144 OSDs (518 TB) and are trying to get it to a 90% full rate for testing purposes. We've found that when some of the OSDs get above the mon_osd_full_ratio value (.95 in our system), the cluster stops accepting any new data, even though there is plenty of space left on other OSDs that are not yet even up to 90%. Tweaking the osd_failsafe ratios enabled data to move again for a bit, but eventually it becomes unbalanced and stops working again. Yes, that is because with Ceph safety goes first. When only one OSD goes over the full ratio the whole cluster stops I/O. CRUSH does not take OSD utilization into account when placing data, so it's almost impossible to predict which I/O can continue. Data safety and integrity is priority number 1. Full disks are a danger to those priorities, so I/O is stopped. Is there a recommended combination of values to use that will allow the cluster to continue accepting data and rebalancing correctly above 90%? No, not with those values. Monitor your filesystems so that they stay below those values. If one OSD becomes too full you can weigh it down using CRUSH to have some data move away from it. thanks, Wyllys Ingersoll -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
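As a sketch of that CRUSH reweight step (the osd id and weight below are hypothetical - check the current value with ceph osd tree first):

  ceph osd tree                        # shows the current crush weight of each OSD
  ceph osd crush reweight osd.12 1.0   # lower the weight so data migrates away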
Re: [PATCH] libceph: fix double __remove_osd() problem
On Wed, 18 Feb 2015, Ilya Dryomov wrote: It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows: osd3 is idle, on the osd lru list con reset - osd3 con_fault_finish() osd_reset() osdmap - osd3 down ceph_osdc_handle_map() takes map_sem kick_requests() takes request_mutex reset_changed_osds() __reset_osd() __remove_osd() releases request_mutex releases map_sem takes map_sem takes request_mutex __kick_osd_requests() __reset_osd() __remove_osd() -- !!! A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil sw...@redhat.com Cc: sta...@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd() Cc: sta...@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts Cc: sta...@vger.kernel.org # 3.9+ Signed-off-by: Ilya Dryomov idryo...@gmail.com Reviewed-by: Sage Weil s...@redhat.com --- net/ceph/osd_client.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 53299c7b0ca4..f693a2f8ac86 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1048,14 +1048,24 @@ static void put_osd(struct ceph_osd *osd) */ static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) { - dout(__remove_osd %p\n, osd); + dout(%s %p osd%d\n, __func__, osd, osd-o_osd); WARN_ON(!list_empty(osd-o_requests)); WARN_ON(!list_empty(osd-o_linger_requests)); - rb_erase(osd-o_node, osdc-osds); list_del_init(osd-o_osd_lru); - ceph_con_close(osd-o_con); - put_osd(osd); + rb_erase(osd-o_node, osdc-osds); + RB_CLEAR_NODE(osd-o_node); +} + +static void remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) +{ + dout(%s %p osd%d\n, __func__, osd, osd-o_osd); + + if (!RB_EMPTY_NODE(osd-o_node)) { + ceph_con_close(osd-o_con); + __remove_osd(osdc, osd); + put_osd(osd); + } } static void remove_all_osds(struct ceph_osd_client *osdc) @@ -1065,7 +1075,7 @@ static void remove_all_osds(struct ceph_osd_client *osdc) while (!RB_EMPTY_ROOT(osdc-osds)) { struct ceph_osd *osd = rb_entry(rb_first(osdc-osds), struct ceph_osd, o_node); - __remove_osd(osdc, osd); + remove_osd(osdc, osd); } mutex_unlock(osdc-request_mutex); } @@ -1106,7 +1116,7 @@ static void remove_old_osds(struct ceph_osd_client *osdc) list_for_each_entry_safe(osd, nosd, osdc-osd_lru, o_osd_lru) { if (time_before(jiffies, osd-lru_ttl)) break; - __remove_osd(osdc, osd); + remove_osd(osdc, osd); } mutex_unlock(osdc-request_mutex); } @@ -1121,8 +1131,7 @@ static int __reset_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) dout(__reset_osd %p osd%d\n, osd, osd-o_osd); if (list_empty(osd-o_requests) list_empty(osd-o_linger_requests)) { - __remove_osd(osdc, osd); - + remove_osd(osdc, osd); return -ENODEV; } @@ -1926,6 +1935,7 @@ static void reset_changed_osds(struct ceph_osd_client *osdc) { struct rb_node *p, *n; + dout(%s %p\n, __func__, osdc); for (p = rb_first(osdc-osds); p; p = n) { struct ceph_osd *osd = rb_entry(p, struct ceph_osd, o_node); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
Re: [PATCH] libceph: kfree() in put_osd() shouldn't depend on authorizer
On Wed, 18 Feb 2015, Ilya Dryomov wrote:

a255651d4cad ("ceph: ensure auth ops are defined before use") made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it.

Cc: Alex Elder el...@linaro.org
Signed-off-by: Ilya Dryomov idryo...@gmail.com
Reviewed-by: Sage Weil s...@redhat.com
---
 net/ceph/osd_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index f693a2f8ac86..41a4abc7e98e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1035,10 +1035,11 @@ static void put_osd(struct ceph_osd *osd)
 {
 	dout("put_osd %p %d -> %d\n", osd, atomic_read(&osd->o_ref),
 	     atomic_read(&osd->o_ref) - 1);
-	if (atomic_dec_and_test(&osd->o_ref) && osd->o_auth.authorizer) {
+	if (atomic_dec_and_test(&osd->o_ref)) {
 		struct ceph_auth_client *ac = osd->o_osdc->client->monc.auth;
 
-		ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer);
+		if (osd->o_auth.authorizer)
+			ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer);
 		kfree(osd);
 	}
 }
--
1.9.3
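The mistake generalizes beyond libceph: gate the entire final-put cleanup on an optional member and every object lacking that member leaks. A minimal standalone sketch with hypothetical toy types (not the kernel ones):

#include <stdio.h>
#include <stdlib.h>

struct obj {
	int ref;
	void *authorizer;	/* optional; NULL when auth isn't configured */
};

/* The pre-fix shape: objects without an authorizer are never freed. */
static void put_obj_buggy(struct obj *o)
{
	if (--o->ref == 0 && o->authorizer) {
		free(o->authorizer);
		free(o);
	}
}

/* The fixed shape: always free on the final put; only the optional
 * resource teardown stays conditional. */
static void put_obj(struct obj *o)
{
	if (--o->ref == 0) {
		if (o->authorizer)
			free(o->authorizer);
		free(o);
	}
}

int main(void)
{
	struct obj *leaky = calloc(1, sizeof(*leaky));
	struct obj *ok = calloc(1, sizeof(*ok));

	leaky->ref = 1;		/* no authorizer attached */
	put_obj_buggy(leaky);	/* ref hits 0, but the object leaks */

	ok->ref = 1;
	put_obj(ok);		/* freed unconditionally on the final put */

	printf("the fixed put frees objects that never had an authorizer\n");
	return 0;
}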
02/18/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual! Please add an agenda item if there is something you want to talk about. I'll be talking a little bit about some of the SSD testing we posted about on the list yesterday. Here's the links:

Etherpad URL: http://pad.ceph.com/p/performance_weekly
To join the Meeting: https://bluejeans.com/268261044
To join via Browser: https://bluejeans.com/268261044/browser
To join with Lync: https://bluejeans.com/268261044/lync
To join via Room System:
   Video Conferencing System: bjn.vc -or- 199.48.152.152
   Meeting ID: 268261044
To join via Phone:
   1) Dial: +1 408 740 7256
            +1 888 240 2560 (US Toll Free)
            +1 408 317 9253 (Alternate Number)
            (see all numbers - http://bluejeans.com/numbers)
   2) Enter Conference ID: 268261044

Mark
Re: full_ratios - please explain?
Thanks! More below inline...

On Wed, Feb 18, 2015 at 10:05 AM, Wido den Hollander w...@42on.com wrote:

On 18-02-15 15:39, Wyllys Ingersoll wrote:

Can someone explain the interaction and effects of all of these full_ratio parameters? I haven't found any really good explanation of how they affect the distribution of data once the cluster gets above the nearfull and close to the full ratios.

When only ONE (1) OSD goes over the mon_osd_nearfull_ratio, the cluster goes from HEALTH_OK into HEALTH_WARN state.

mon_osd_full_ratio
mon_osd_nearfull_ratio
osd_backfill_full_ratio
osd_failsafe_full_ratio
osd_failsafe_nearfull_ratio

We have a cluster with about 144 OSDs (518 TB) and are trying to get it to a 90% full rate for testing purposes. We've found that when some of the OSDs get above the mon_osd_full_ratio value (.95 in our system), the cluster stops accepting any new data, even though there is plenty of space left on other OSDs that are not yet even up to 90%. Tweaking the osd_failsafe ratios enabled data to move again for a bit, but eventually it becomes unbalanced and stops working again.

Yes, that is because with Ceph safety goes first. When only one OSD goes over the full ratio, the whole cluster stops I/O.

Which full_ratio? The problem is that there are at least 3 full_ratios - mon_osd_full_ratio, osd_failsafe_full_ratio, and osd_backfill_full_ratio - how do they interact? What is the consequence of having one be higher than the others? It seems extreme that 1 full OSD out of potentially hundreds would cause all I/O into the cluster to stop when there are literally 10s or 100s of terabytes of space left on other, less-full OSDs. The confusion for me (and probably for others) is the proliferation of full_ratio parameters and a lack of clarity on how they all affect cluster health and the ability to rebalance when things start to fill up.

CRUSH does not take OSD utilization into account when placing data, so it's almost impossible to predict which I/O can continue. Data safety and integrity are priority number 1. Full disks are a danger to those priorities, so I/O is stopped.

Understood, but 1 full disk out of hundreds should not cause the entire system to stop accepting new data, or even stop balancing out the data that it already has, especially when there is room to grow yet on other OSDs. If 1 disk reaches the full_ratio but 99 (or 999) others are still well below that value, why doesn't it get balanced out (assuming the CRUSH map considers all OSDs equal and all the pools have similar pg_num values)?
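For anyone trying to keep the knobs straight, here is a sketch of where they live in ceph.conf. The values shown are the commonly documented defaults around the firefly/hammer era - treat them as illustrative and confirm against your own build (e.g. with `ceph daemon osd.N config show`):

[global]
    mon osd nearfull ratio = .85    # any OSD past this -> HEALTH_WARN
    mon osd full ratio = .95        # any OSD past this -> cluster stops accepting writes

[osd]
    osd backfill full ratio = .85        # OSD refuses new backfills above this
    osd failsafe nearfull ratio = .90    # OSD-local warning threshold
    osd failsafe full ratio = .97        # OSD-local hard stop; last line of defense

The usual operational answer to the imbalance question above is to lower the weights of the fullest OSDs (e.g. with `ceph osd reweight-by-utilization`) before anything crosses the full ratio, since CRUSH will not shift data based on utilization on its own.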
disk failure prediction
Interesting paper at FAST: https://www.usenix.org/system/files/conference/fast15/fast15-paper-ma.pdf Short version: reallocated sectors correlate with impending disk failures (this sounds like what Sandon has been telling us for ages), and by preemptively replacing disks with impending failures EMC reduced its rate of triple-failures by 80%; looking at the joint failure probability within each RAID set reduces the failure rate by 98%. We wouldn't see quite the same results since our raid sets are effectively entire pools, but this seems like a strong case for adding SMART monitoring to the osds or to calamari already and doing some preemptive disk replacement. sage
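A concrete (if simplified) sketch of what such per-disk polling could look like, assuming smartmontools is installed: it scrapes the `smartctl -A` text table for the attribute the paper keys on. A production monitor would use a proper library or machine-readable output rather than text scraping, and would feed the joint probability model the paper describes instead of a raw threshold:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sda";
	char cmd[256], line[512];
	FILE *p;

	snprintf(cmd, sizeof(cmd), "smartctl -A %s", dev);
	p = popen(cmd, "r");
	if (!p) {
		perror("popen");
		return 1;
	}
	while (fgets(line, sizeof(line), p)) {
		/* SMART attribute 5; the raw value is the last column */
		if (strstr(line, "Reallocated_Sector_Ct")) {
			char *last = strrchr(line, ' ');
			long raw = last ? strtol(last + 1, NULL, 10) : -1;

			printf("%s reallocated sectors: %ld%s\n", dev, raw,
			       raw > 0 ? " (candidate for preemptive replacement)" : "");
		}
	}
	return pclose(p) == 0 ? 0 : 1;
}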