[PATCH] crc32c: add aarch64 optimized crc32c implementation
ARMv8 defines a set of optional CRC32/CRC32C instructions. This patch adds
an optimized function that uses these instructions when available, rather
than table-based lookup. The optimized function is based on a Hadoop patch
by Ed Nevill. Autotools is updated to check for compiler support, and the
optimized function is selected at runtime based on HWCAP_CRC32.

Added a crc32c performance unit test and an arch unit test. Tested on AMD
Seattle. Passes all crc32c unit tests. The unit test shows a ~4x
performance increase versus the sctp table-based implementation.

Signed-off-by: Yazen Ghannam yazen.ghan...@linaro.org
Reviewed-by: Steve Capper steve.cap...@linaro.org
---
 configure.ac                   |  1 +
 m4/ax_arm.m4                   | 18 ++--
 src/arch/arm.c                 |  2 ++
 src/arch/arm.h                 |  1 +
 src/common/Makefile.am         | 10 -
 src/common/crc32c.cc           |  6 ++
 src/common/crc32c_aarch64.c    | 47 ++
 src/common/crc32c_aarch64.h    | 27 
 src/test/common/test_crc32c.cc | 10 +
 src/test/test_arch.cc          | 14 +
 10 files changed, 133 insertions(+), 3 deletions(-)
 create mode 100644 src/common/crc32c_aarch64.c
 create mode 100644 src/common/crc32c_aarch64.h

diff --git a/configure.ac b/configure.ac
index d836b02..60e4feb 100644
--- a/configure.ac
+++ b/configure.ac
@@ -575,6 +575,7 @@ AC_LANG_POP([C++])
 # Find supported SIMD / NEON / SSE extensions supported by the compiler
 AX_ARM_FEATURES()
 AM_CONDITIONAL(HAVE_NEON, [ test x$ax_cv_support_neon_ext = xyes])
+AM_CONDITIONAL(HAVE_ARMV8_CRC, [ test x$ax_cv_support_crc_ext = xyes])
 AX_INTEL_FEATURES()
 AM_CONDITIONAL(HAVE_SSSE3, [ test x$ax_cv_support_ssse3_ext = xyes])
 AM_CONDITIONAL(HAVE_SSE4_PCLMUL, [ test x$ax_cv_support_pclmuldq_ext = xyes])

diff --git a/m4/ax_arm.m4 b/m4/ax_arm.m4
index 2ccc9a9..37ea0aa 100644
--- a/m4/ax_arm.m4
+++ b/m4/ax_arm.m4
@@ -13,13 +13,27 @@ AC_DEFUN([AX_ARM_FEATURES],
       fi
       ;;
     aarch64*)
+      AX_CHECK_COMPILE_FLAG(-march=armv8-a, ax_cv_support_armv8=yes, [])
+      if test x$ax_cv_support_armv8 = xyes; then
+        ARM_ARCH_FLAGS=-march=armv8-a
+        ARM_DEFINE_FLAGS=-DARCH_AARCH64
+      fi
       AX_CHECK_COMPILE_FLAG(-march=armv8-a+simd, ax_cv_support_neon_ext=yes, [])
       if test x$ax_cv_support_neon_ext = xyes; then
+        ARM_ARCH_FLAGS=$ARM_ARCH_FLAGS+simd
+        ARM_DEFINE_FLAGS="$ARM_DEFINE_FLAGS -DARM_NEON"
         ARM_NEON_FLAGS="-march=armv8-a+simd -DARCH_AARCH64 -DARM_NEON"
-        AC_SUBST(ARM_NEON_FLAGS)
-        ARM_FLAGS="$ARM_FLAGS $ARM_NEON_FLAGS"
         AC_DEFINE(HAVE_NEON,,[Support NEON instructions])
+        AC_SUBST(ARM_NEON_FLAGS)
+      fi
+      AX_CHECK_COMPILE_FLAG(-march=armv8-a+crc, ax_cv_support_crc_ext=yes, [])
+      if test x$ax_cv_support_crc_ext = xyes; then
+        ARM_ARCH_FLAGS=$ARM_ARCH_FLAGS+crc
+        ARM_CRC_FLAGS="-march=armv8-a+crc -DARCH_AARCH64"
+        AC_DEFINE(HAVE_ARMV8_CRC,,[Support ARMv8 CRC instructions])
+        AC_SUBST(ARM_CRC_FLAGS)
       fi
+      ARM_FLAGS="$ARM_ARCH_FLAGS $ARM_DEFINE_FLAGS"
       ;;
   esac

diff --git a/src/arch/arm.c b/src/arch/arm.c
index 93d079a..5a47e33 100644
--- a/src/arch/arm.c
+++ b/src/arch/arm.c
@@ -2,6 +2,7 @@
 
 /* flags we export */
 int ceph_arch_neon = 0;
+int ceph_arch_aarch64_crc32 = 0;
 
 #include <stdio.h>
 
@@ -47,6 +48,7 @@ int ceph_arch_arm_probe(void)
   ceph_arch_neon = (get_hwcap() & HWCAP_NEON) == HWCAP_NEON;
 #elif __aarch64__ && __linux__
   ceph_arch_neon = (get_hwcap() & HWCAP_ASIMD) == HWCAP_ASIMD;
+  ceph_arch_aarch64_crc32 = (get_hwcap() & HWCAP_CRC32) == HWCAP_CRC32;
 #else
   if (0)
     get_hwcap();  // make compiler shut up

diff --git a/src/arch/arm.h b/src/arch/arm.h
index f613438..1659b2e 100644
--- a/src/arch/arm.h
+++ b/src/arch/arm.h
@@ -6,6 +6,7 @@ extern "C" {
 #endif
 
 extern int ceph_arch_neon;  /* true if we have ARM NEON or ASIMD abilities */
+extern int ceph_arch_aarch64_crc32;  /* true if we have AArch64 CRC32/CRC32C abilities */
 
 extern int ceph_arch_arm_probe(void);

diff --git a/src/common/Makefile.am b/src/common/Makefile.am
index 2888194..37d1404 100644
--- a/src/common/Makefile.am
+++ b/src/common/Makefile.am
@@ -112,11 +112,19 @@ endif
 LIBCOMMON_DEPS += libcommon_crc.la
 noinst_LTLIBRARIES += libcommon_crc.la
 
+if HAVE_ARMV8_CRC
+libcommon_crc_aarch64_la_SOURCES = common/crc32c_aarch64.c
+libcommon_crc_aarch64_la_CFLAGS = $(AM_CFLAGS) $(ARM_CRC_FLAGS)
+LIBCOMMON_DEPS += libcommon_crc_aarch64.la
+noinst_LTLIBRARIES += libcommon_crc_aarch64.la
+endif
+
 noinst_HEADERS += \
	common/bloom_filter.hpp \
	common/sctp_crc32.h \
	common/crc32c_intel_baseline.h \
-	common/crc32c_intel_fast.h
+	common/crc32c_intel_fast.h \
+	common/crc32c_aarch64.h
 
 # important; libmsg before libauth!

diff --git a/src/common/crc32c.cc b/src/common/crc32c.cc
index e2e81a4..45432f5 100644
--- a/src/common/crc32c.cc
+++ b/src/common/crc32c.cc
ceph branch status
-- All Branches --
Adam Crume adamcr...@gmail.com
  2014-12-01 20:45:58 -0800  wip-doc-rbd-replay
Alfredo Deza alfredo.d...@inktank.com
  2014-07-08 13:58:35 -0400  wip-8679
  2014-09-04 13:58:14 -0400  wip-8366
  2014-10-13 11:10:10 -0400  wip-9730
Andreas-Joachim Peters andreas.joachim.pet...@cern.ch
  2014-10-15 15:09:24 +0200  apeters1971-wip-table-formatter
Andrew Shewmaker ags...@gmail.com
  2014-11-12 14:00:10 -0800  wip-blkin
Backports backpo...@workbench.dachary.org
  2015-01-07 13:29:24 +      giant-backports
Boris Ranto bra...@redhat.com
  2014-11-12 14:41:33 +0100  wip-devel-python-split
Dan Mick dan.m...@inktank.com
  2013-07-16 23:00:06 -0700  wip-5634
Dan Mick dan.m...@redhat.com
  2014-11-12 21:35:09 -0800  wip-cli-threads
  2014-11-18 15:19:32 -0800  wip-10114-firefly
  2014-12-09 19:28:49 -0800  wip-10010
  2014-12-10 15:09:32 -0800  wip-8797
  2014-12-10 21:30:11 -0800  wip-8797-giant
  2014-12-10 21:35:14 -0800  wip-8797-firefly
Danny Al-Gaaf danny.al-g...@bisect.de
  2014-08-16 12:26:19 +0200  wip-da-cherry-pick-firefly
  2014-11-14 19:58:43 +0100  wip-da-SCA-20141114
  2015-01-23 17:54:40 +0100  wip-da-SCA-20150107
David Zafman dzaf...@redhat.com
  2014-08-29 10:41:23 -0700  wip-libcommon-rebase
  2014-11-26 09:41:50 -0800  wip-9403
  2014-12-02 21:20:17 -0800  wip-zafman-docfix
  2015-01-08 15:07:45 -0800  wip-vstart-kvs
  2015-01-20 15:58:33 -0800  wip-10534
Dongmao Zhang deanracc...@gmail.com
  2014-11-14 19:14:34 +0800  thesues-master
Greg Farnum gfar...@redhat.com
  2014-11-04 06:55:49 -0800  firefly-7-9869
Greg Farnum g...@inktank.com
  2014-10-22 17:30:02 -0700  wip-9869-dumpling
  2014-10-23 13:33:44 -0700  wip-forward-scrub
Guang Yang ygu...@yahoo-inc.com
  2014-08-08 10:41:12 +      wip-guangyy-pg-splitting
  2014-09-25 00:47:46 +      wip-9008
  2014-09-30 10:36:39 +      guangyy-wip-9614
Haomai Wang haomaiw...@gmail.com
  2014-07-27 13:37:49 +0800  wip-flush-set
Ilya Dryomov ilya.dryo...@inktank.com
  2014-09-05 16:15:10 +0400  wip-rbd-notify-errors
James Page james.p...@ubuntu.com
  2013-02-27 22:50:38 +      wip-debhelper-8
Jason Dillaman dilla...@redhat.com
  2014-11-06 07:13:44 -0500  wip-8901
  2014-11-26 16:53:46 -0500  wip-librados-symbols
  2014-12-15 23:25:04 -0500  wip-10299
  2014-12-19 10:56:50 -0500  wip-librbd-cleanup-aio
  2015-01-17 01:53:49 -0500  wip-copy-on-read
  2015-01-19 10:28:56 -0500  wip-10270-giant
  2015-01-19 10:30:50 -0500  wip-10270-firefly
  2015-01-19 11:25:16 -0500  wip-10299-giant
  2015-01-19 11:51:07 -0500  wip-10299-firefly
  2015-01-19 12:12:19 -0500  wip-9854-giant
  2015-01-19 12:47:28 -0500  wip-9854-firefly
  2015-01-19 18:47:27 -0500  wip-8902
  2015-01-21 15:25:10 -0500  wip-10462
  2015-01-21 15:28:16 -0500  wip-10590-giant
  2015-01-21 16:57:16 -0500  dumpling
  2015-01-21 17:23:28 -0500  wip-10270-dumpling
  2015-01-24 02:23:08 -0500  wip-4087
  2015-01-25 20:26:27 -0500  wip-gmock
Jenkins jenk...@inktank.com
  2014-07-29 05:24:39 -0700  wip-nhm-hang
  2015-01-13 12:10:22 -0800  last
Joao Eduardo Luis jec...@gmail.com
  2014-09-10 09:39:23 +0100  wip-leveldb-get.dumpling
Joao Eduardo Luis joao.l...@gmail.com
  2014-07-22 15:41:42 +0100  wip-leveldb-misc
Joao Eduardo Luis joao.l...@inktank.com
  2014-09-02 17:19:52 +0100  wip-leveldb-get
  2014-10-17 16:20:11 +0100  wip-paxos-fix
  2014-10-21 21:32:46 +0100  wip-9675.dumpling
Joao Eduardo Luis j...@redhat.com
  2014-11-17 16:43:53 +      wip-mon-osdmap-cleanup
  2014-12-15 16:18:56 +      wip-giant-mon-backports
  2014-12-17 17:13:57 +      wip-mon-backports.firefly
  2014-12-17 23:15:10 +      wip-mon-sync-fix.dumpling
  2015-01-07 23:01:00 +      wip-mon-blackhole-mlog-0.87.7
  2015-01-10 02:40:42 +      wip-dho-joao
  2015-01-10 02:46:31 +      wip-mon-paxos-fix
  2015-01-22 11:41:59 +      wip-mon-pgtemp
  2015-01-26 13:00:09 +      wip-mon-datahealth-fix
John Spray jcsp...@gmail.com
  2014-03-03 13:10:05 +      wip-mds-stop-rank-0
John Spray john.sp...@redhat.com
  2014-06-25 22:54:13 -0400  wip-mds-sessions
  2014-07-29 00:15:21 +0100  wip-objecter-rebase
  2014-08-15 02:33:49 +0100  wip-mds-contexts
  2014-08-28 12:40:20 +0100  wip-9152
  2014-08-28 23:34:43 +0100  wip-typed-contexts
  2014-09-08 01:49:57 +0100  wip-jcsp-test
  2014-09-12 18:42:02 +0100  wip-9280
  2014-09-15 16:14:15 +0100  wip-9375
  2014-09-24 17:56:02 +0100  wip-continuation
  2014-11-08 16:02:33 +      wip-9977-backport
upcoming dumpling v0.67.12
Hi Yuri,

Here is a short update on the progress of the upcoming dumpling v0.67.12.
It is tracked with http://tracker.ceph.com/issues/10560. In the inventory
part, there is a list of all pull requests that are already merged in the
dumpling branch. There is only one pull request waiting to be merged and
three issues waiting for backports. While these last three are being
worked on, I started the rbd, rgw and rados suites.

I chose to display the inventory by pull request because I figured it
would be more convenient to read, since a single pull request sometimes
spans multiple issues (https://github.com/ceph/ceph/pull/2611, for
instance, fixes two).

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Re: upcoming dumpling v0.67.12
Loic,

Here is the run from sepia:
http://pulpito.front.sepia.ceph.com/ubuntu-2015-01-26_09:26:27-upgrade:dumpling-dumpling-distro-basic-vps/

Two failures seem like env noise.

Thx
YuriW

On Mon, Jan 26, 2015 at 9:49 AM, Loic Dachary l...@dachary.org wrote:
> Thanks for letting me know about the upgrade tests results, it's
> encouraging :-)  I'll let you know when the tests make progress.
>
> On 26/01/2015 18:00, Yuri Weinstein wrote:
> > Loic,
> >
> > Thanks for the update.
> >
> > I ran upgrade/dumpling last week (and all 42 jobs passed in octo and
> > sepia) to establish a base line.  And today I am running another one,
> > assuming it will pick up the already merged pull requests.
> >
> > Let me know when you are ready for next steps.
> >
> > Thx
> > YuriW
> >
> > On Mon, Jan 26, 2015 at 7:37 AM, Loic Dachary l...@dachary.org wrote:
> > > Hi Yuri,
> > >
> > > Here is a short update on the progress of the upcoming dumpling
> > > v0.67.12.  It is tracked with http://tracker.ceph.com/issues/10560.
> > > In the inventory part, there is a list of all pull requests that are
> > > already merged in the dumpling branch.  There only is one pull
> > > request waiting to be merged and three issues waiting for backports.
> > > While these last three are being worked on, I started rbd, rgw and
> > > rados suites.
> > >
> > > I chose to display the inventory by pull request because I figured
> > > it would be more convenient to read because sometimes a single pull
> > > request spans multiple issues
> > > ( https://github.com/ceph/ceph/pull/2611 for instance fixes two
> > > issues ).
> > >
> > > Cheers
> > > --
> > > Loïc Dachary, Artisan Logiciel Libre

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: upcoming dumpling v0.67.12
Loic,

Thanks for the update.

I ran upgrade/dumpling last week (and all 42 jobs passed in octo and
sepia) to establish a base line.  And today I am running another one,
assuming it will pick up the already merged pull requests.

Let me know when you are ready for next steps.

Thx
YuriW

On Mon, Jan 26, 2015 at 7:37 AM, Loic Dachary l...@dachary.org wrote:
> Hi Yuri,
>
> Here is a short update on the progress of the upcoming dumpling
> v0.67.12.  It is tracked with http://tracker.ceph.com/issues/10560.  In
> the inventory part, there is a list of all pull requests that are
> already merged in the dumpling branch.  There only is one pull request
> waiting to be merged and three issues waiting for backports.  While
> these last three are being worked on, I started rbd, rgw and rados
> suites.
>
> I chose to display the inventory by pull request because I figured it
> would be more convenient to read because sometimes a single pull request
> spans multiple issues ( https://github.com/ceph/ceph/pull/2611 for
> instance fixes two issues ).
>
> Cheers
> --
> Loïc Dachary, Artisan Logiciel Libre
RE: idempotent op (esp delete)
On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
> The downside of this approach is that we may need to search the pg_log
> for a specific object in every write io?

Not quite.  IndexedLog maintains a hash_map of all of the request ids in
the log, so it's just a hash lookup on each IO.  (Well, now 2 hash
lookups, because I put the additional request IDs in a second auxiliary
map to handle dups properly.  I think we can avoid that lookup if we use
the request flags carefully, though... the RETRY and REDIRECTED flags, I
think?  Need to check carefully.)

> Maybe we can combine this approach and the changes in PR 3447.  For the
> flush case when the object is deleted in the base, we search the pg_log
> for the dup op.  This should be a rare case.  Otherwise the object
> exists, and we check the reqid list in the object_info_t for the dup op.

We could do a hybrid approach, but there is some cost to the per-object
tracking: a tiny bit more memory, and an O(n) search of the items in that
list (~10 or 20?) for the dup check.  I suspect the hash lookup is
cheaper?  And simpler.

sage

> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
> Sent: Monday, January 26, 2015 10:35 AM
> To: Sage Weil; Gregory Farnum
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: idempotent op (esp delete)
>
> This method puts the reqid list in the pg_log instead of the
> object_info_t, so that it's preserved even in the delete case, which
> sounds more reasonable.
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Saturday, January 24, 2015 6:19 AM
> To: Gregory Farnum
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: idempotent op (esp delete)
>
> On Fri, 23 Jan 2015, Gregory Farnum wrote:
> > On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:
> > > Background:
> > >
> > > 1) Way back when, we made a task that would thrash the cache modes
> > > by adding and removing the cache tier while ceph_test_rados was
> > > running.  This mostly worked, but would occasionally fail because we
> > > would:
> > >  - delete an object from the cache tier
> > >  - a network failure injection would lose the reply
> > >  - we'd disable the cache
> > >  - the delete would resend to the base tier, not get recognized as
> > >    a dup (different pool, different pg log)
> > >  -> -ENOENT instead of 0
> > >
> > > 2) The proxy write code hits a similar problem:
> > >  - delete gets proxied
> > >  - we initiate async promote
> > >  - a network failure injection loses the delete reply
> > >  - delete resends and blocks on promote (or arrives after it
> > >    finishes)
> > >  - promote finishes
> > >  - delete is handled
> > >  -> -ENOENT instead of 0
> > >
> > > The ticket is http://tracker.ceph.com/issues/8935
> > >
> > > The problem is partially addressed by
> > > https://github.com/ceph/ceph/pull/3447 by logging a few request ids
> > > on every object_info_t and preserving that on promote and flush.
> > > However, it doesn't solve the problem for delete because we throw
> > > out object_info_t so that reqid_t is lost.
> > >
> > > I think we have two options, not necessarily mutually exclusive:
> > >
> > > 1) When promoting an object that doesn't exist (to create a
> > > whiteout), pull reqids out of the base tier's pg log so that the
> > > whiteout is primed with request ids.
> > >
> > > 1.5) When flushing... well, that is harder because we have nowhere
> > > to put the reqids.  Unless we make a way to cram a list of reqid's
> > > into a single PG log entry...?  In that case, we wouldn't strictly
> > > need the per-object list since we could pile the base tier's reqids
> > > into the promote log entry in the cache tier.
> > >
> > > 2) Make delete idempotent (0 instead of ENOENT if the object doesn't
> > > exist).  This will require a delicate compat transition (let's
> > > ignore that a moment) but you can preserve the old behavior for
> > > callers that care by preceding the delete with an assert_exists op.
> > > Most callers don't care, but a handful do.  This simplifies the
> > > semantics we need to support going forward.
> > >
> > > Of course, it's all a bit delicate.  The idempotent op semantics
> > > have a time horizon so it's all a bit wishy-washy... :/
> > >
> > > Thoughts?
> >
> > Do we have other cases that we're worried about which would be
> > improved by maintaining reqids across pool cache transitions?  I'm not
> > a big fan of maintaining those per-op lists (they sound really
> > expensive?), but if we need them for something else that's a point in
> > their favor.
>
> I don't think they're *too* expensive (say, a vector of 20 per
> object_info_t?).  But the only thing I can think of beyond the cache
> tiering stuff would be cases where the pg log isn't long enough for a
> very laggy client.  In general ops will be distributed across objects,
> so it will catch the dup from another angle.
>
> However.. I just hacked up a patch that lets us cram lots of reqids into
> a single pg_log_entry_t and I think that may be a
Re: idempotent op (esp delete)
The pg_log_t variant does seem to be cleaner.
-Sam

On Mon, Jan 26, 2015 at 9:21 AM, Sage Weil sw...@redhat.com wrote:
> On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
> > The downside of this approach is that we may need to search the pg_log
> > for a specific object in every write io?
>
> Not quite.  IndexedLog maintains a hash_map of all of the request ids in
> the log, so it's just a hash lookup on each IO.  (Well, now 2 hash
> lookups, because I put the additional request IDs in a second auxiliary
> map to handle dups properly.  I think we can avoid that lookup if we use
> the request flags carefully, though... the RETRY and REDIRECTED flags, I
> think?  Need to check carefully.)
>
> > Maybe we can combine this approach and the changes in PR 3447.  For
> > the flush case when the object is deleted in the base, we search the
> > pg_log for the dup op.  This should be a rare case.  Otherwise the
> > object exists, and we check the reqid list in the object_info_t for
> > the dup op.
>
> We could do a hybrid approach, but there is some cost to the per-object
> tracking: a tiny bit more memory, and an O(n) search of the items in
> that list (~10 or 20?) for the dup check.  I suspect the hash lookup is
> cheaper?  And simpler.
>
> sage
>
> > -----Original Message-----
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
> > Sent: Monday, January 26, 2015 10:35 AM
> > To: Sage Weil; Gregory Farnum
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: idempotent op (esp delete)
> >
> > This method puts the reqid list in the pg_log instead of the
> > object_info_t, so that it's preserved even in the delete case, which
> > sounds more reasonable.
> >
> > -----Original Message-----
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Saturday, January 24, 2015 6:19 AM
> > To: Gregory Farnum
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: idempotent op (esp delete)
> >
> > On Fri, 23 Jan 2015, Gregory Farnum wrote:
> > > On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:
> > > > Background:
> > > >
> > > > 1) Way back when, we made a task that would thrash the cache modes
> > > > by adding and removing the cache tier while ceph_test_rados was
> > > > running.  This mostly worked, but would occasionally fail because
> > > > we would:
> > > >  - delete an object from the cache tier
> > > >  - a network failure injection would lose the reply
> > > >  - we'd disable the cache
> > > >  - the delete would resend to the base tier, not get recognized
> > > >    as a dup (different pool, different pg log)
> > > >  -> -ENOENT instead of 0
> > > >
> > > > 2) The proxy write code hits a similar problem:
> > > >  - delete gets proxied
> > > >  - we initiate async promote
> > > >  - a network failure injection loses the delete reply
> > > >  - delete resends and blocks on promote (or arrives after it
> > > >    finishes)
> > > >  - promote finishes
> > > >  - delete is handled
> > > >  -> -ENOENT instead of 0
> > > >
> > > > The ticket is http://tracker.ceph.com/issues/8935
> > > >
> > > > The problem is partially addressed by
> > > > https://github.com/ceph/ceph/pull/3447 by logging a few request
> > > > ids on every object_info_t and preserving that on promote and
> > > > flush.  However, it doesn't solve the problem for delete because
> > > > we throw out object_info_t so that reqid_t is lost.
> > > >
> > > > I think we have two options, not necessarily mutually exclusive:
> > > >
> > > > 1) When promoting an object that doesn't exist (to create a
> > > > whiteout), pull reqids out of the base tier's pg log so that the
> > > > whiteout is primed with request ids.
> > > >
> > > > 1.5) When flushing... well, that is harder because we have nowhere
> > > > to put the reqids.  Unless we make a way to cram a list of
> > > > reqid's into a single PG log entry...?  In that case, we wouldn't
> > > > strictly need the per-object list since we could pile the base
> > > > tier's reqids into the promote log entry in the cache tier.
> > > >
> > > > 2) Make delete idempotent (0 instead of ENOENT if the object
> > > > doesn't exist).  This will require a delicate compat transition
> > > > (let's ignore that a moment) but you can preserve the old behavior
> > > > for callers that care by preceding the delete with an
> > > > assert_exists op.  Most callers don't care, but a handful do.
> > > > This simplifies the semantics we need to support going forward.
> > > >
> > > > Of course, it's all a bit delicate.  The idempotent op semantics
> > > > have a time horizon so it's all a bit wishy-washy... :/
> > > >
> > > > Thoughts?
> > >
> > > Do we have other cases that we're worried about which would be
> > > improved by maintaining reqids across pool cache transitions?  I'm
> > > not a big fan of maintaining those per-op lists (they sound really
> > > expensive?), but if we need them for something else that's a point
> > > in their favor.
> >
> > I don't think they're *too* expensive (say, a vector of 20 per
> > object_info_t?).  But the only thing I can think of beyond the cache
> > tiering stuff would be cases where the pg log isn't long enough for a
> > very laggy client.  In general ops will be distributed across
> > objects, so it will catch the dup from another angle.
> >
> > However.. I just hacked up a patch
Re: upcoming dumpling v0.67.12
Thanks for letting me know about the upgrade tests results, it's
encouraging :-)  I'll let you know when the tests make progress.

On 26/01/2015 18:00, Yuri Weinstein wrote:
> Loic,
>
> Thanks for the update.
>
> I ran upgrade/dumpling last week (and all 42 jobs passed in octo and
> sepia) to establish a base line.  And today I am running another one,
> assuming it will pick up the already merged pull requests.
>
> Let me know when you are ready for next steps.
>
> Thx
> YuriW
>
> On Mon, Jan 26, 2015 at 7:37 AM, Loic Dachary l...@dachary.org wrote:
> > Hi Yuri,
> >
> > Here is a short update on the progress of the upcoming dumpling
> > v0.67.12.  It is tracked with http://tracker.ceph.com/issues/10560.
> > In the inventory part, there is a list of all pull requests that are
> > already merged in the dumpling branch.  There only is one pull request
> > waiting to be merged and three issues waiting for backports.  While
> > these last three are being worked on, I started rbd, rgw and rados
> > suites.
> >
> > I chose to display the inventory by pull request because I figured it
> > would be more convenient to read because sometimes a single pull
> > request spans multiple issues ( https://github.com/ceph/ceph/pull/2611
> > for instance fixes two issues ).
> >
> > Cheers
> > --
> > Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
RE: wip-auth
Good to know, I was wondering why the spec file defaulted to lib-nss..
the dpkg-build for debian packages just uses whatever configuration you
had built, and I believe that will use libcryptopp if the dependency is
installed on the build machine (last I looked).

I forgot to mention the numbers below were based on v0.91.

Thanks,
Stephen

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, January 26, 2015 10:24 AM
To: Blinick, Stephen L
Cc: andreas.blue...@itxperts.de; ceph-devel@vger.kernel.org
Subject: RE: wip-auth

On Mon, 26 Jan 2015, Blinick, Stephen L wrote:
> I noticed that the spec file for building RPM's defaults to building
> with libnss, instead of libcrypto++.  Since the measurements I'd done
> so far were from those RPM's, I rebuilt with libcrypto++.. so FWIW here
> is the difference between those two on my system: memstore backend with
> a single OSD, and a single client.
>
> Dual socket Xeon E5 2620v3, 64GB Memory, RHEL7
> Kernel: 3.10.0-123.13.2.el7
>
> 100% 4K Writes, 1x OSD w/ Rados Bench
>        libnss               | Cryptopp
>  QD    IOPS      Lat(ms)    | IOPS      Lat(ms)    IOPS Improvement %
>  16    14432.57  1.11       | 18896.60  0.85       30.93%
>
> 100% 4K Reads, 1x OSD w/ Rados Bench
>        libnss               | Cryptopp
>  QD    IOPS      Lat(ms)    | IOPS      Lat(ms)    IOPS Improvement %
>  16    19532.53  0.82       | 25708.70  0.62       31.62%

Yikes, 30%!  I think this is definitely worth some effort.  We switched
to libnss because it has the weird government certifications that
everyone wants and is more prevalent.  crypto++ is also not packaged for
Red Hat distros at all (presumably for that reason).

I suspect that most of the overhead is in the encryption context setup
and can be avoided with a bit of effort..

sage

> Thanks,
> Stephen
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, January 22, 2015 4:56 PM
> To: andreas.blue...@itxperts.de
> Cc: ceph-devel@vger.kernel.org
> Subject: wip-auth
>
> Hi Andreas,
>
> I took a look at the wip-auth I mentioned in the security call last
> week... and the patch didn't work at all.  Sorry if you wasted any time
> trying it.  Anyway, I fixed it up so that it actually worked and made
> one other optimization.  It would be great to hear what latencies you
> measure with the changes in place.  Also, it might be worth trying
> --with-cryptopp (or --with-nss if you built cryptopp by default) to see
> if there is a difference.
>
> There is a ton of boilerplate setting up encryption contexts and key
> structures and so on that I suspect could be cached (perhaps stashed in
> the CryptoKey struct?) with a bit of effort.  See
> https://github.com/ceph/ceph/blob/master/src/auth/Crypto.cc#L99-L213
>
> sage
RE: wip-auth
On Mon, 26 Jan 2015, Blinick, Stephen L wrote:
> I noticed that the spec file for building RPM's defaults to building
> with libnss, instead of libcrypto++.  Since the measurements I'd done
> so far were from those RPM's, I rebuilt with libcrypto++.. so FWIW here
> is the difference between those two on my system: memstore backend with
> a single OSD, and a single client.
>
> Dual socket Xeon E5 2620v3, 64GB Memory, RHEL7
> Kernel: 3.10.0-123.13.2.el7
>
> 100% 4K Writes, 1x OSD w/ Rados Bench
>        libnss               | Cryptopp
>  QD    IOPS      Lat(ms)    | IOPS      Lat(ms)    IOPS Improvement %
>  16    14432.57  1.11       | 18896.60  0.85       30.93%
>
> 100% 4K Reads, 1x OSD w/ Rados Bench
>        libnss               | Cryptopp
>  QD    IOPS      Lat(ms)    | IOPS      Lat(ms)    IOPS Improvement %
>  16    19532.53  0.82       | 25708.70  0.62       31.62%

Yikes, 30%!  I think this is definitely worth some effort.  We switched
to libnss because it has the weird government certifications that
everyone wants and is more prevalent.  crypto++ is also not packaged for
Red Hat distros at all (presumably for that reason).

I suspect that most of the overhead is in the encryption context setup
and can be avoided with a bit of effort..

sage

> Thanks,
> Stephen
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, January 22, 2015 4:56 PM
> To: andreas.blue...@itxperts.de
> Cc: ceph-devel@vger.kernel.org
> Subject: wip-auth
>
> Hi Andreas,
>
> I took a look at the wip-auth I mentioned in the security call last
> week... and the patch didn't work at all.  Sorry if you wasted any time
> trying it.  Anyway, I fixed it up so that it actually worked and made
> one other optimization.  It would be great to hear what latencies you
> measure with the changes in place.  Also, it might be worth trying
> --with-cryptopp (or --with-nss if you built cryptopp by default) to see
> if there is a difference.
>
> There is a ton of boilerplate setting up encryption contexts and key
> structures and so on that I suspect could be cached (perhaps stashed in
> the CryptoKey struct?) with a bit of effort.  See
> https://github.com/ceph/ceph/blob/master/src/auth/Crypto.cc#L99-L213
>
> sage
Re: wip-auth
Hi Stephen,

Does this explain the results you were seeing earlier with the memstore
testing?

Mark

On 01/26/2015 12:00 PM, Blinick, Stephen L wrote:
> Good to know, I was wondering why the spec file defaulted to lib-nss..
> the dpkg-build for debian packages just uses whatever configuration you
> had built, and I believe that will use libcryptopp if the dependency is
> installed on the build machine (last I looked).
>
> I forgot to mention the numbers below were based on v0.91.
>
> Thanks,
> Stephen
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, January 26, 2015 10:24 AM
> To: Blinick, Stephen L
> Cc: andreas.blue...@itxperts.de; ceph-devel@vger.kernel.org
> Subject: RE: wip-auth
>
> On Mon, 26 Jan 2015, Blinick, Stephen L wrote:
> > I noticed that the spec file for building RPM's defaults to building
> > with libnss, instead of libcrypto++.  Since the measurements I'd done
> > so far were from those RPM's, I rebuilt with libcrypto++.. so FWIW
> > here is the difference between those two on my system: memstore
> > backend with a single OSD, and a single client.
> >
> > Dual socket Xeon E5 2620v3, 64GB Memory, RHEL7
> > Kernel: 3.10.0-123.13.2.el7
> >
> > 100% 4K Writes, 1x OSD w/ Rados Bench
> >        libnss               | Cryptopp
> >  QD    IOPS      Lat(ms)    | IOPS      Lat(ms)    IOPS Improvement %
> >  16    14432.57  1.11       | 18896.60  0.85       30.93%
> >
> > 100% 4K Reads, 1x OSD w/ Rados Bench
> >        libnss               | Cryptopp
> >  QD    IOPS      Lat(ms)    | IOPS      Lat(ms)    IOPS Improvement %
> >  16    19532.53  0.82       | 25708.70  0.62       31.62%
>
> Yikes, 30%!  I think this is definitely worth some effort.  We switched
> to libnss because it has the weird government certifications that
> everyone wants and is more prevalent.  crypto++ is also not packaged
> for Red Hat distros at all (presumably for that reason).
>
> I suspect that most of the overhead is in the encryption context setup
> and can be avoided with a bit of effort..
>
> sage
>
> > Thanks,
> > Stephen
> >
> > -----Original Message-----
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Thursday, January 22, 2015 4:56 PM
> > To: andreas.blue...@itxperts.de
> > Cc: ceph-devel@vger.kernel.org
> > Subject: wip-auth
> >
> > Hi Andreas,
> >
> > I took a look at the wip-auth I mentioned in the security call last
> > week... and the patch didn't work at all.  Sorry if you wasted any
> > time trying it.  Anyway, I fixed it up so that it actually worked and
> > made one other optimization.  It would be great to hear what
> > latencies you measure with the changes in place.  Also, it might be
> > worth trying --with-cryptopp (or --with-nss if you built cryptopp by
> > default) to see if there is a difference.
> >
> > There is a ton of boilerplate setting up encryption contexts and key
> > structures and so on that I suspect could be cached (perhaps stashed
> > in the CryptoKey struct?) with a bit of effort.  See
> > https://github.com/ceph/ceph/blob/master/src/auth/Crypto.cc#L99-L213
> >
> > sage
Re: [PATCH 1/3] rbd: fix rbd_dev_parent_get() when parent_overlap == 0
On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
The comment for rbd_dev_parent_get() said

 * We must get the reference before checking for the overlap to
 * coordinate properly with zeroing the parent overlap in
 * rbd_dev_v2_parent_info() when an image gets flattened.  We
 * drop it again if there is no overlap.

but the "drop it again if there is no overlap" part was missing from the implementation. This led to absurd parent_ref values for images with parent_overlap == 0, as parent_ref was incremented for each img_request and virtually never decremented.

You're right about this. If the image had a parent with no overlap this would leak a reference to the parent image. The code should have said:

	counter = atomic_inc_return_safe(&rbd_dev->parent_ref);
	if (counter > 0) {
		if (rbd_dev->parent_overlap)
			return true;
		atomic_dec(&rbd_dev->parent_ref);
	} else if (counter < 0) {
		rbd_warn(rbd_dev, "parent reference overflow");
	}

Fix this by leveraging the fact that the refresh path calls rbd_dev_v2_parent_info() under header_rwsem and use it for read in rbd_dev_parent_get(), instead of messing around with atomics. Get rid of barriers in rbd_dev_v2_parent_info() while at it - I don't see what they'd pair with now and I suspect we are in a pretty miserable situation as far as proper locking goes regardless.

The point of the memory barrier was to ensure that when parent_overlap gets zeroed, this code sees the zero rather than the old non-zero value. The atomic_inc_return_safe() call has an implicit memory barrier to match the smp_mb() call. It allowed the synchronization to occur without the use of a lock. We're trying to atomically determine whether an image request needs to be marked as layered, to know how to handle ENOENT on parent reads. If it is a write to an image with a parent having a non-zero overlap, it's layered; otherwise we can treat it as a simple request.
I think in this particular case, this is just an optimization, trying very hard to avoid having to do layered image handling if the parent has become flattened. I think that even if it got old information (suggesting non-zero overlap) things would behave correctly, just less efficiently. Using the semaphore adds a lock to this path and therefore implements whatever barriers are being removed. I'm not sure how often this is hit--maybe the optimization isn't buying much after all. I am getting a little rusty on some of the details of what precisely happens when a layered image gets flattened. But I think this looks OK. Maybe just watch for small (perhaps insignificant) performance regressions with this change in place...

Reviewed-by: Alex Elder el...@linaro.org

Cc: sta...@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov idryo...@redhat.com
---
 drivers/block/rbd.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 31fa00f0d707..2990a1c75159 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2098,32 +2098,26 @@ static void rbd_dev_parent_put(struct rbd_device *rbd_dev)
  * If an image has a non-zero parent overlap, get a reference to its
  * parent.
  *
- * We must get the reference before checking for the overlap to
- * coordinate properly with zeroing the parent overlap in
- * rbd_dev_v2_parent_info() when an image gets flattened.  We
- * drop it again if there is no overlap.
- *
  * Returns true if the rbd device has a parent with a non-zero
  * overlap and a reference for it was successfully taken, or
  * false otherwise.
  */
 static bool rbd_dev_parent_get(struct rbd_device *rbd_dev)
 {
-	int counter;
+	int counter = 0;
 
 	if (!rbd_dev->parent_spec)
 		return false;
 
-	counter = atomic_inc_return_safe(&rbd_dev->parent_ref);
-	if (counter > 0 && rbd_dev->parent_overlap)
-		return true;
-
-	/* Image was flattened, but parent is not yet torn down */
+	down_read(&rbd_dev->header_rwsem);
+	if (rbd_dev->parent_overlap)
+		counter = atomic_inc_return_safe(&rbd_dev->parent_ref);
+	up_read(&rbd_dev->header_rwsem);
 
 	if (counter < 0)
 		rbd_warn(rbd_dev, "parent reference overflow");
 
-	return false;
+	return counter > 0;
 }
 
 /*
@@ -4238,7 +4232,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
 	 */
 	if (rbd_dev->parent_overlap) {
 		rbd_dev->parent_overlap = 0;
-		smp_mb();
 		rbd_dev_parent_put(rbd_dev);
 		pr_info("%s: clone image has been flattened\n",
 			rbd_dev->disk->disk_name);
@@ -4284,7 +4277,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
 	 * treat it specially.
 	 */
 	rbd_dev->parent_overlap = overlap;
Re: [PATCH 3/3] rbd: do not treat standalone as flatten
On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
If the clone is resized down to 0, it becomes standalone. If such a resize is carried out while the image is mapped, we would detect this and call rbd_dev_parent_put(), which means "let go of all parent state, including the spec(s) of parent image(s)". This leads to a mismatch between rbd info and the sysfs parent fields, so a fix is in order.

    # rbd create --image-format 2 --size 1 foo
    # rbd snap create foo@snap
    # rbd snap protect foo@snap
    # rbd clone foo@snap bar
    # DEV=$(rbd map bar)
    # rbd resize --allow-shrink --size 0 bar
    # rbd resize --size 1 bar
    # rbd info bar | grep parent
            parent: rbd/foo@snap

Before:

    # cat /sys/bus/rbd/devices/0/parent
    (no parent image)

After:

    # cat /sys/bus/rbd/devices/0/parent
    pool_id 0
    pool_name rbd
    image_id 10056b8b4567
    image_name foo
    snap_id 2
    snap_name snap
    overlap 0

Signed-off-by: Ilya Dryomov idryo...@redhat.com

Hmm. Interesting. I think that a parent with an overlap of 0 is of no real use. So in the last patch I was suggesting it should just go away. But now, looking at it from this perspective, the fact that an image *came from* a particular parent, but which has no more overlap, could be useful information. The parent shouldn't simply go away without the user requesting that. I haven't completely followed through the logic of keeping the reference around, but I understand what you're doing and it looks OK to me.

Reviewed-by: Alex Elder el...@linaro.org
---
 drivers/block/rbd.c | 30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b85d52005a21..e818c2a6ffb1 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4273,32 +4273,22 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
 	}
 
 	/*
-	 * We always update the parent overlap. If it's zero we
-	 * treat it specially.
+	 * We always update the parent overlap. If it's zero we issue
+	 * a warning, as we will proceed as if there was no parent.
 	 */
-	rbd_dev->parent_overlap = overlap;
 	if (!overlap) {
-
-		/* A null parent_spec indicates it's the initial probe */
-		if (parent_spec) {
-			/*
-			 * The overlap has become zero, so the clone
-			 * must have been resized down to 0 at some
-			 * point. Treat this the same as a flatten.
-			 */
-			rbd_dev_parent_put(rbd_dev);
-			pr_info("%s: clone image now standalone\n",
-				rbd_dev->disk->disk_name);
+		/* refresh, careful to warn just once */
+		if (rbd_dev->parent_overlap)
+			rbd_warn(rbd_dev,
+				 "clone now standalone (overlap became 0)");
 		} else {
-			/*
-			 * For the initial probe, if we find the
-			 * overlap is zero we just pretend there was
-			 * no parent image.
-			 */
-			rbd_warn(rbd_dev, "ignoring parent with overlap 0");
+			/* initial probe */
+			rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
 		}
 	}
+	rbd_dev->parent_overlap = overlap;
+
 out:
 	ret = 0;
 out_err:
[Questions]Can client know which OSDs are storing the data?
Hello Guys,

My question is very rude and direct, a little bit stupid maybe ;-)

Question a: a client writes a file to the cluster (supposing replica = 3), so the data will be stored in 3 OSDs within the cluster. Can I find out on the client side which OSDs are storing the file data?

Question b: can the object data still be replicated if I store an object from a client with the RADOS API?

Thank you guys!

--
Den
RE: idempotent op (esp delete)
Not sure if it is correct, but below is my understanding. Correct me if I'm wrong. Yes, in the current code, a hash lookup on each IO is used to check for dup ops. But when adding extra_reqids as a log entry in the pg_log, 2 hash lookups may not be sufficient. The extra_reqids is organized by object id as in your code. We may have other ops on an object just promoted. If a dup op comes in after these ops, then we have to search for log entries of this object in the pg_log.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, January 27, 2015 1:21 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org; Gregory Farnum
Subject: RE: idempotent op (esp delete)

On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
The downside of this approach is that we may need to search the pg_log for a specific object on every write IO?

Not quite. IndexedLog maintains a hash_map of all of the request ids in the log, so it's just a hash lookup on each IO. (Well, now 2 hash lookups, because I put the additional request IDs in a second auxiliary map to handle dups properly. I think we can avoid that lookup if we use the request flags carefully, though.. the RETRY and REDIRECTED flags I think? Need to check carefully.)

Maybe we can combine this approach and the changes in PR 3447. For the flush case when the object is deleted in the base, we search the pg_log for the dup op. This should be a rare case. Otherwise the object exists, and we check the reqid list in the object_info_t for the dup op.

We could do a hybrid approach, but there is some cost to the per-object tracking: a tiny bit more memory, and an O(n) search of the items in that list (~10 or 20?) for the dup check. I suspect the hash lookup is cheaper? And simpler.
sage

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
Sent: Monday, January 26, 2015 10:35 AM
To: Sage Weil; Gregory Farnum
Cc: ceph-devel@vger.kernel.org
Subject: RE: idempotent op (esp delete)

This method puts the reqid list in the pg_log instead of the object_info_t, so that it's preserved even in the delete case, which sounds more reasonable.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Saturday, January 24, 2015 6:19 AM
To: Gregory Farnum
Cc: ceph-devel@vger.kernel.org
Subject: Re: idempotent op (esp delete)

On Fri, 23 Jan 2015, Gregory Farnum wrote:
On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:

Background:

1) Way back when we made a task that would thrash the cache modes by adding and removing the cache tier while ceph_test_rados was running. This mostly worked, but would occasionally fail because we would
 - delete an object from the cache tier
 - a network failure injection would lose the reply
 - we'd disable the cache
 - the delete would resend to the base tier, not get recognized as a dup (different pool, different pg log)
 - -ENOENT instead of 0

2) The proxy write code hits a similar problem:
 - delete gets proxied
 - we initiate async promote
 - a network failure injection loses the delete reply
 - delete resends and blocks on promote (or arrives after it finishes)
 - promote finishes
 - delete is handled
 - -ENOENT instead of 0

The ticket is http://tracker.ceph.com/issues/8935

The problem is partially addressed by

https://github.com/ceph/ceph/pull/3447

by logging a few request ids on every object_info_t and preserving that on promote and flush. However, it doesn't solve the problem for delete because we throw out object_info_t so that reqid_t is lost.
I think we have two options, not necessarily mutually exclusive:

1) When promoting an object that doesn't exist (to create a whiteout), pull reqids out of the base tier's pg log so that the whiteout is primed with request ids.

1.5) When flushing... well, that is harder because we have nowhere to put the reqids. Unless we make a way to cram a list of reqid's into a single PG log entry...? In that case, we wouldn't strictly need the per-object list since we could pile the base tier's reqids into the promote log entry in the cache tier.

2) Make delete idempotent (0 instead of ENOENT if the object doesn't exist). This will require a delicate compat transition (let's ignore that a moment) but you can preserve the old behavior for callers that care by preceding the delete with an assert_exists op. Most callers don't care, but a handful do. This simplifies the semantics we need to support going forward.

Of course, it's all a bit delicate. The idempotent op semantics have a time horizon so it's all a bit wishy-washy... :/
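The two ingredients discussed above -- a reqid index consulted with a hash lookup, and a resent delete replaying its recorded result instead of returning -ENOENT -- can be sketched in a toy model. This is illustrative Python with invented names (the IndexedLog class here is not the actual C++ IndexedLog), not the OSD code:

```python
import errno
from collections import OrderedDict

class IndexedLog:
    """Toy stand-in for the PG log's reqid index: recent request ids
    map to their original result, so a resent op is one hash lookup."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self.completed = OrderedDict()  # reqid -> result code

    def lookup(self, reqid):
        return self.completed.get(reqid)  # None if not a known dup

    def record(self, reqid, result: int) -> None:
        self.completed[reqid] = result
        while len(self.completed) > self.max_entries:
            self.completed.popitem(last=False)  # trim oldest entries

def handle_delete(log: IndexedLog, store: set, reqid: str, obj: str) -> int:
    prior = log.lookup(reqid)
    if prior is not None:
        return prior  # dup: replay the original result
    result = 0 if obj in store else -errno.ENOENT
    store.discard(obj)
    log.record(reqid, result)
    return result

log, store = IndexedLog(), {"rbd_data.0"}
assert handle_delete(log, store, "client.4121:7", "rbd_data.0") == 0
# The resent delete hits the index and returns 0, not -ENOENT.
assert handle_delete(log, store, "client.4121:7", "rbd_data.0") == 0
```

The failure mode in the background section corresponds to the resend arriving at a log (base tier vs cache tier) that never recorded the reqid, in which case the lookup misses and the second delete sees -ENOENT.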
Re: [PATCH 2/3] rbd: drop parent_ref in rbd_dev_unprobe() unconditionally
On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
This effectively reverts the last hunk of 392a9dad7e77 ("rbd: detect when clone image is flattened"). The problem with the parent_overlap != 0 condition is that it's possible and completely valid to have an image with parent_overlap == 0 whose parent state needs to be cleaned up on unmap. The next commit, which drops the "clone image now standalone" logic, opens up another window of opportunity to hit this, but even without it

    # cat parent-ref.sh
    #!/bin/bash
    rbd create --image-format 2 --size 1 foo
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd resize --allow-shrink --size 0 bar
    rbd resize --size 1 bar
    DEV=$(rbd map bar)
    rbd unmap $DEV

leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client hanging around.

I'm not sure why the last reference to the parent doesn't get dropped (and state cleaned up) as soon as the overlap becomes 0. I suspect it's the original reference taken when there's a parent; we don't get rid of it until it's torn down. (I think we should.) It seems to me the test here should be for a non-null parent_spec pointer rather than non-zero parent_overlap. And that's done inside rbd_dev_parent_put(), so your change looks reasonable to me.

Reviewed-by: Alex Elder el...@linaro.org

My thinking behind calling rbd_dev_parent_put() unconditionally is that there shouldn't be any requests in flight at that point in time as we are deep into the unmap sequence. Hence, even if rbd_dev_unparent() caused by flatten is delayed by in-flight requests, it will have finished by the time we reach rbd_dev_unprobe() caused by unmap, thus turning the unconditional rbd_dev_parent_put() into a no-op.
Fixes: http://tracker.ceph.com/issues/10352
Cc: sta...@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov idryo...@redhat.com
---
 drivers/block/rbd.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 2990a1c75159..b85d52005a21 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5075,10 +5075,7 @@ static void rbd_dev_unprobe(struct rbd_device *rbd_dev)
 {
 	struct rbd_image_header *header;
 
-	/* Drop parent reference unless it's already been done (or none) */
-
-	if (rbd_dev->parent_overlap)
-		rbd_dev_parent_put(rbd_dev);
+	rbd_dev_parent_put(rbd_dev);
 
 	/* Free dynamic fields from the header, then zero it out */
RE: [Questions]Can client know which OSDs are storing the data?
-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dennis Chen
Sent: Tuesday, January 27, 2015 3:07 PM
To: ceph-devel@vger.kernel.org; Dennis Chen
Subject: [Questions]Can client know which OSDs are storing the data?

> Hello Guys, My question is very rude and direct, a little bit stupid maybe ;-)
>
> Question a: a client writes a file to the cluster (supposing replica = 3), so the data will be stored in 3 OSDs within the cluster. Can I find out on the client side which OSDs are storing the file data?

    ceph osd map <poolname> <objectname>

displays the PG and the OSDs which store the object.

> Question b: can the object data still be replicated if I store an object from a client with the RADOS API?

Yes, the RADOS API is also a client.

> Thank you guys!
> --
> Den
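The reason `ceph osd map` can answer this on any client is that placement is computed from the object name and the cluster map rather than looked up in a central directory, so any client holding the map can do the computation itself. Here is a toy Python model of that idea using rendezvous hashing -- emphatically not the real CRUSH algorithm, and every name in it is made up:

```python
import hashlib

def pg_of(name: str, pg_num: int) -> int:
    # The object name hashes to a placement group within the pool.
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
    return h % pg_num

def acting_set(pool_id: int, pg: int, osds, replicas: int = 3):
    # Rendezvous (highest-random-weight) hashing stands in for CRUSH:
    # every client with the same OSD list ranks the OSDs identically,
    # so no lookup service is needed.
    def score(osd: int) -> bytes:
        return hashlib.md5(f"{pool_id}.{pg}.{osd}".encode()).digest()
    return sorted(osds, key=score, reverse=True)[:replicas]

pg = pg_of("myobject", pg_num=128)
osds = acting_set(pool_id=0, pg=pg, osds=range(10))
assert len(set(osds)) == 3  # three distinct OSDs, computed locally
```

Two clients running this against the same map always compute the same acting set, which is exactly the property that makes `ceph osd map` (and direct client-to-OSD IO) possible.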
RE: Deadline of Github pull request for Hammer release (question)
Hi Loic,

I have noticed that your repository ceph-erasure-code-corpus is forked for us, so I created a new pull request.

Update non-regression.sh #1
https://github.com/t-miyamae/ceph-erasure-code-corpus/pull/1

Best regards,
Takeshi Miyamae

-----Original Message-----
From: Miyamae, Takeshi/宮前 剛
Sent: Monday, January 26, 2015 2:44 PM
To: 'Loic Dachary'
Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
Subject: RE: Deadline of Github pull request for Hammer release (question)

Hi Loic,

> Note that you also need to update

We have prepared mSHEC's parameter sets which we think will be commonly used. Because I'm not sure how to update another person's repository, we will write down those parameter sets in this mail. If we are required to do something, please let us know.

while read k m c ; do
  for stripe_width in $STRIPE_WIDTHS ; do
    ceph_erasure_code_non_regression --stripe-width $stripe_width \
      --plugin shec --parameter technique=multiple \
      --parameter k=$k --parameter m=$m --parameter c=$c \
      $ACTION $VERBOSE $MYDIR
  done
done <<EOF
1 1 1
2 1 1
3 2 1
3 2 2
3 3 2
4 1 1
4 2 2
4 3 2
5 2 1
6 3 2
6 4 2
6 4 3
7 2 1
8 3 2
8 4 2
8 4 3
9 4 2
9 5 3
12 7 4
EOF

Best regards,
Takeshi Miyamae

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Friday, January 23, 2015 10:47 PM
To: Miyamae, Takeshi/宮前 剛
Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
Subject: Re: Deadline of Github pull request for Hammer release (question)

Hi,

Note that you also need to update

https://github.com/dachary/ceph-erasure-code-corpus/blob/master/v0.85-764-gf3a1532/non-regression.sh

to include non-regression tests for the most common cases of the SHEC plugin encoding / decoding. This is run by make check (this repository is a submodule of Ceph). It helps make sure that content encoded / decoded with a given version of the plugin can be encoded / decoded exactly in the same way by all future versions.
Cheers

On 06/01/2015 12:49, Miyamae, Takeshi wrote:
Dear Loic,

I'm Takeshi Miyamae, one of the authors of SHEC's blueprint.

Shingled Erasure Code (SHEC)
https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC)

We have revised our blueprint shown in the last CDS to extend our erasure code layouts and describe the guideline for choosing SHEC among various EC plugins. We believe the blueprint now answers all the comments given at the CDS. In addition, we would like to ask for your advice on the schedule of our github pull request. More specifically, we would like to know its deadline for the Hammer release. (As we have not really completed our verification of SHEC, we are wondering if we should make it open for early preview.)

Thank you in advance,
Takeshi Miyamae

--
Loïc Dachary, Artisan Logiciel Libre
Re: Deadline of Github pull request for Hammer release (question)
Hi,

Thanks for the snippet, I'll add it to the non-regression from the pull request you sent :-)

Cheers

On 26/01/2015 06:43, Miyamae, Takeshi wrote:
Hi Loic,

> Note that you also need to update

We have prepared mSHEC's parameter sets which we think will be commonly used. Because I'm not sure how to update another person's repository, we will write down those parameter sets in this mail. If we are required to do something, please let us know.

while read k m c ; do
  for stripe_width in $STRIPE_WIDTHS ; do
    ceph_erasure_code_non_regression --stripe-width $stripe_width \
      --plugin shec --parameter technique=multiple \
      --parameter k=$k --parameter m=$m --parameter c=$c \
      $ACTION $VERBOSE $MYDIR
  done
done <<EOF
1 1 1
2 1 1
3 2 1
3 2 2
3 3 2
4 1 1
4 2 2
4 3 2
5 2 1
6 3 2
6 4 2
6 4 3
7 2 1
8 3 2
8 4 2
8 4 3
9 4 2
9 5 3
12 7 4
EOF

Best regards,
Takeshi Miyamae

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Friday, January 23, 2015 10:47 PM
To: Miyamae, Takeshi/宮前 剛
Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
Subject: Re: Deadline of Github pull request for Hammer release (question)

Hi,

Note that you also need to update

https://github.com/dachary/ceph-erasure-code-corpus/blob/master/v0.85-764-gf3a1532/non-regression.sh

to include non-regression tests for the most common cases of the SHEC plugin encoding / decoding. This is run by make check (this repository is a submodule of Ceph). It helps make sure that content encoded / decoded with a given version of the plugin can be encoded / decoded exactly in the same way by all future versions.

Cheers

On 06/01/2015 12:49, Miyamae, Takeshi wrote:
Dear Loic,

I'm Takeshi Miyamae, one of the authors of SHEC's blueprint.

Shingled Erasure Code (SHEC)
https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC)

We have revised our blueprint shown in the last CDS to extend our erasure code layouts and describe the guideline for choosing SHEC among various EC plugins.
We believe the blueprint now answers all the comments given at the CDS. In addition, we would like to ask for your advice on the schedule of our github pull request. More specifically, we would like to know its deadline for the Hammer release. (As we have not really completed our verification of SHEC, we are wondering if we should make it open for early preview.)

Thank you in advance,
Takeshi Miyamae

--
Loïc Dachary, Artisan Logiciel Libre
Re: idempotent op (esp delete)
On Mon, 26 Jan 2015, Samuel Just wrote:
The pg_log_t variant does seem to be cleaner.

I forgot, here is the danger: on promote and flush (copy-from) we do an O(n) scan of the pg log to assemble the reqids for that object. The current default is 1000 entries in the log. We could do better than that in many cases by skipping along the prior_version values (ignoring creates for now) if we could jump to a log entry by eversion_t, but it's a list, not a map. Perhaps we could change it to a deque in memory to allow that sort of semi-random access? Or maybe it's not worth trying to optimize that at all given the frequency of promote/flush...?

sage

-Sam

On Mon, Jan 26, 2015 at 9:21 AM, Sage Weil sw...@redhat.com wrote:
On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
The downside of this approach is that we may need to search the pg_log for a specific object on every write IO?

Not quite. IndexedLog maintains a hash_map of all of the request ids in the log, so it's just a hash lookup on each IO. (Well, now 2 hash lookups, because I put the additional request IDs in a second auxiliary map to handle dups properly. I think we can avoid that lookup if we use the request flags carefully, though.. the RETRY and REDIRECTED flags I think? Need to check carefully.)

Maybe we can combine this approach and the changes in PR 3447. For the flush case when the object is deleted in the base, we search the pg_log for the dup op. This should be a rare case. Otherwise the object exists, and we check the reqid list in the object_info_t for the dup op.

We could do a hybrid approach, but there is some cost to the per-object tracking: a tiny bit more memory, and an O(n) search of the items in that list (~10 or 20?) for the dup check. I suspect the hash lookup is cheaper? And simpler.
sage

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
Sent: Monday, January 26, 2015 10:35 AM
To: Sage Weil; Gregory Farnum
Cc: ceph-devel@vger.kernel.org
Subject: RE: idempotent op (esp delete)

This method puts the reqid list in the pg_log instead of the object_info_t, so that it's preserved even in the delete case, which sounds more reasonable.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Saturday, January 24, 2015 6:19 AM
To: Gregory Farnum
Cc: ceph-devel@vger.kernel.org
Subject: Re: idempotent op (esp delete)

On Fri, 23 Jan 2015, Gregory Farnum wrote:
On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:

Background:

1) Way back when we made a task that would thrash the cache modes by adding and removing the cache tier while ceph_test_rados was running. This mostly worked, but would occasionally fail because we would
 - delete an object from the cache tier
 - a network failure injection would lose the reply
 - we'd disable the cache
 - the delete would resend to the base tier, not get recognized as a dup (different pool, different pg log)
 - -ENOENT instead of 0

2) The proxy write code hits a similar problem:
 - delete gets proxied
 - we initiate async promote
 - a network failure injection loses the delete reply
 - delete resends and blocks on promote (or arrives after it finishes)
 - promote finishes
 - delete is handled
 - -ENOENT instead of 0

The ticket is http://tracker.ceph.com/issues/8935

The problem is partially addressed by

https://github.com/ceph/ceph/pull/3447

by logging a few request ids on every object_info_t and preserving that on promote and flush. However, it doesn't solve the problem for delete because we throw out object_info_t so that reqid_t is lost.
I think we have two options, not necessarily mutually exclusive:

1) When promoting an object that doesn't exist (to create a whiteout), pull reqids out of the base tier's pg log so that the whiteout is primed with request ids.

1.5) When flushing... well, that is harder because we have nowhere to put the reqids. Unless we make a way to cram a list of reqid's into a single PG log entry...? In that case, we wouldn't strictly need the per-object list since we could pile the base tier's reqids into the promote log entry in the cache tier.

2) Make delete idempotent (0 instead of ENOENT if the object doesn't exist). This will require a delicate compat transition (let's ignore that a moment) but you can preserve the old behavior for callers that care by preceding the delete with an assert_exists op. Most callers don't care, but a handful do. This simplifies the semantics we need to support going forward.

Of course, it's all a bit delicate. The idempotent op semantics
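Sage's suggestion earlier in this thread -- skipping along prior_version values via a version-indexed structure instead of scanning the whole log -- can be sketched as a toy model. This is illustrative Python with invented types (not pg_log_t/eversion_t); note that in the real OSD the newest entry for an object would come from the object index, whereas this sketch finds it with another scan:

```python
from collections import namedtuple

# Toy log entry: monotonically increasing version, plus a pointer to
# the previous version of the same object (0 = no prior version).
Entry = namedtuple("Entry", "version reqid obj prior_version")

def reqids_scan(log, obj):
    # Current approach: O(n) pass over the whole log (~1000 entries).
    return [e.reqid for e in log if e.obj == obj]

def reqids_skip(log, obj):
    # Hypothetical: with the log indexed by version (e.g. a deque plus
    # a map), hop along prior_version pointers and touch only this
    # object's entries once its newest entry is known.
    by_version = {e.version: e for e in log}
    newest = max((e for e in log if e.obj == obj),
                 key=lambda e: e.version, default=None)
    out = []
    while newest is not None:
        out.append(newest.reqid)
        newest = by_version.get(newest.prior_version)
    out.reverse()
    return out

log = [
    Entry(1, "c1:1", "A", 0),
    Entry(2, "c2:1", "B", 0),
    Entry(3, "c1:2", "A", 1),
    Entry(4, "c2:2", "B", 2),
    Entry(5, "c1:3", "A", 3),
]
assert reqids_skip(log, "A") == reqids_scan(log, "A") == ["c1:1", "c1:2", "c1:3"]
```

Both functions assemble the same reqid list; the skipping variant only pays for the entries belonging to the object, which is the saving being weighed against the (low) frequency of promote/flush.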