[PATCH] crc32c: add aarch64 optimized crc32c implementation

2015-01-26 Thread Yazen Ghannam
ARMv8 defines a set of optional CRC32/CRC32C instructions.
This patch defines an optimized function that uses these
instructions when available rather than table-based lookup.
Optimized function based on a Hadoop patch by Ed Nevill.

Autotools updated to check for compiler support.
Optimized function is selected at runtime based on HWCAP_CRC32.
Added crc32c performance unit test and arch unit test.

Tested on AMD Seattle.
Passes all crc32c unit tests.
Unit test shows a ~4x performance increase versus the table-based sctp_crc32 implementation.

Signed-off-by: Yazen Ghannam <yazen.ghan...@linaro.org>
Reviewed-by: Steve Capper <steve.cap...@linaro.org>
---
 configure.ac   |  1 +
 m4/ax_arm.m4   | 18 ++--
 src/arch/arm.c |  2 ++
 src/arch/arm.h |  1 +
 src/common/Makefile.am | 10 -
 src/common/crc32c.cc   |  6 ++
 src/common/crc32c_aarch64.c| 47 ++
 src/common/crc32c_aarch64.h| 27 
 src/test/common/test_crc32c.cc | 10 +
 src/test/test_arch.cc  | 14 +
 10 files changed, 133 insertions(+), 3 deletions(-)
 create mode 100644 src/common/crc32c_aarch64.c
 create mode 100644 src/common/crc32c_aarch64.h

diff --git a/configure.ac b/configure.ac
index d836b02..60e4feb 100644
--- a/configure.ac
+++ b/configure.ac
@@ -575,6 +575,7 @@ AC_LANG_POP([C++])
 # Find supported SIMD / NEON / SSE extensions supported by the compiler
 AX_ARM_FEATURES()
 AM_CONDITIONAL(HAVE_NEON, [ test x$ax_cv_support_neon_ext = xyes])
+AM_CONDITIONAL(HAVE_ARMV8_CRC, [ test x$ax_cv_support_crc_ext = xyes])
 AX_INTEL_FEATURES()
 AM_CONDITIONAL(HAVE_SSSE3, [ test x$ax_cv_support_ssse3_ext = xyes])
 AM_CONDITIONAL(HAVE_SSE4_PCLMUL, [ test x$ax_cv_support_pclmuldq_ext = xyes])
diff --git a/m4/ax_arm.m4 b/m4/ax_arm.m4
index 2ccc9a9..37ea0aa 100644
--- a/m4/ax_arm.m4
+++ b/m4/ax_arm.m4
@@ -13,13 +13,27 @@ AC_DEFUN([AX_ARM_FEATURES],
   fi
 ;;
 aarch64*)
+  AX_CHECK_COMPILE_FLAG(-march=armv8-a, ax_cv_support_armv8=yes, [])
+  if test x$ax_cv_support_armv8 = xyes; then
+ARM_ARCH_FLAGS=-march=armv8-a
+ARM_DEFINE_FLAGS=-DARCH_AARCH64
+  fi
   AX_CHECK_COMPILE_FLAG(-march=armv8-a+simd, ax_cv_support_neon_ext=yes, [])
   if test x$ax_cv_support_neon_ext = xyes; then
+ARM_ARCH_FLAGS="$ARM_ARCH_FLAGS+simd"
+ARM_DEFINE_FLAGS="$ARM_DEFINE_FLAGS -DARM_NEON"
 ARM_NEON_FLAGS="-march=armv8-a+simd -DARCH_AARCH64 -DARM_NEON"
-AC_SUBST(ARM_NEON_FLAGS)
-ARM_FLAGS="$ARM_FLAGS $ARM_NEON_FLAGS"
 AC_DEFINE(HAVE_NEON,,[Support NEON instructions])
+AC_SUBST(ARM_NEON_FLAGS)
+  fi
+  AX_CHECK_COMPILE_FLAG(-march=armv8-a+crc, ax_cv_support_crc_ext=yes, [])
+  if test x$ax_cv_support_crc_ext = xyes; then
+ARM_ARCH_FLAGS="$ARM_ARCH_FLAGS+crc"
+ARM_CRC_FLAGS="-march=armv8-a+crc -DARCH_AARCH64"
+AC_DEFINE(HAVE_ARMV8_CRC,,[Support ARMv8 CRC instructions])
+AC_SUBST(ARM_CRC_FLAGS)
   fi
+ARM_FLAGS="$ARM_ARCH_FLAGS $ARM_DEFINE_FLAGS"
 ;;
   esac
 
diff --git a/src/arch/arm.c b/src/arch/arm.c
index 93d079a..5a47e33 100644
--- a/src/arch/arm.c
+++ b/src/arch/arm.c
@@ -2,6 +2,7 @@
 
 /* flags we export */
 int ceph_arch_neon = 0;
+int ceph_arch_aarch64_crc32 = 0;
 
 #include <stdio.h>
 
@@ -47,6 +48,7 @@ int ceph_arch_arm_probe(void)
	ceph_arch_neon = (get_hwcap() & HWCAP_NEON) == HWCAP_NEON;
 #elif __aarch64__ && __linux__
	ceph_arch_neon = (get_hwcap() & HWCAP_ASIMD) == HWCAP_ASIMD;
+	ceph_arch_aarch64_crc32 = (get_hwcap() & HWCAP_CRC32) == HWCAP_CRC32;
 #else
if (0)
get_hwcap();  // make compiler shut up
diff --git a/src/arch/arm.h b/src/arch/arm.h
index f613438..1659b2e 100644
--- a/src/arch/arm.h
+++ b/src/arch/arm.h
@@ -6,6 +6,7 @@ extern "C" {
 #endif
 
 extern int ceph_arch_neon;  /* true if we have ARM NEON or ASIMD abilities */
+extern int ceph_arch_aarch64_crc32;  /* true if we have AArch64 CRC32/CRC32C abilities */
 
 extern int ceph_arch_arm_probe(void);
 
diff --git a/src/common/Makefile.am b/src/common/Makefile.am
index 2888194..37d1404 100644
--- a/src/common/Makefile.am
+++ b/src/common/Makefile.am
@@ -112,11 +112,19 @@ endif
 LIBCOMMON_DEPS += libcommon_crc.la
 noinst_LTLIBRARIES += libcommon_crc.la
 
+if HAVE_ARMV8_CRC
+libcommon_crc_aarch64_la_SOURCES = common/crc32c_aarch64.c
+libcommon_crc_aarch64_la_CFLAGS = $(AM_CFLAGS) $(ARM_CRC_FLAGS)
+LIBCOMMON_DEPS += libcommon_crc_aarch64.la
+noinst_LTLIBRARIES += libcommon_crc_aarch64.la
+endif
+
 noinst_HEADERS += \
common/bloom_filter.hpp \
common/sctp_crc32.h \
common/crc32c_intel_baseline.h \
-   common/crc32c_intel_fast.h
+   common/crc32c_intel_fast.h \
+   common/crc32c_aarch64.h
 
 
 # important; libmsg before libauth!
diff --git a/src/common/crc32c.cc b/src/common/crc32c.cc
index e2e81a4..45432f5 100644
--- a/src/common/crc32c.cc
+++ 

ceph branch status

2015-01-26 Thread ceph branch robot
-- All Branches --

Adam Crume adamcr...@gmail.com
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza alfredo.d...@inktank.com
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Andreas-Joachim Peters andreas.joachim.pet...@cern.ch
2014-10-15 15:09:24 +0200   apeters1971-wip-table-formatter

Andrew Shewmaker ags...@gmail.com
2014-11-12 14:00:10 -0800   wip-blkin

Backports backpo...@workbench.dachary.org
2015-01-07 13:29:24 +   giant-backports

Boris Ranto bra...@redhat.com
2014-11-12 14:41:33 +0100   wip-devel-python-split

Dan Mick dan.m...@inktank.com
2013-07-16 23:00:06 -0700   wip-5634

Dan Mick dan.m...@redhat.com
2014-11-12 21:35:09 -0800   wip-cli-threads
2014-11-18 15:19:32 -0800   wip-10114-firefly
2014-12-09 19:28:49 -0800   wip-10010
2014-12-10 15:09:32 -0800   wip-8797
2014-12-10 21:30:11 -0800   wip-8797-giant
2014-12-10 21:35:14 -0800   wip-8797-firefly

Danny Al-Gaaf danny.al-g...@bisect.de
2014-08-16 12:26:19 +0200   wip-da-cherry-pick-firefly
2014-11-14 19:58:43 +0100   wip-da-SCA-20141114
2015-01-23 17:54:40 +0100   wip-da-SCA-20150107

David Zafman dzaf...@redhat.com
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2014-11-26 09:41:50 -0800   wip-9403
2014-12-02 21:20:17 -0800   wip-zafman-docfix
2015-01-08 15:07:45 -0800   wip-vstart-kvs
2015-01-20 15:58:33 -0800   wip-10534

Dongmao Zhang deanracc...@gmail.com
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum gfar...@redhat.com
2014-11-04 06:55:49 -0800   firefly-7-9869

Greg Farnum g...@inktank.com
2014-10-22 17:30:02 -0700   wip-9869-dumpling
2014-10-23 13:33:44 -0700   wip-forward-scrub

Guang Yang ygu...@yahoo-inc.com
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang haomaiw...@gmail.com
2014-07-27 13:37:49 +0800   wip-flush-set

Ilya Dryomov ilya.dryo...@inktank.com
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

James Page james.p...@ubuntu.com
2013-02-27 22:50:38 +   wip-debhelper-8

Jason Dillaman dilla...@redhat.com
2014-11-06 07:13:44 -0500   wip-8901
2014-11-26 16:53:46 -0500   wip-librados-symbols
2014-12-15 23:25:04 -0500   wip-10299
2014-12-19 10:56:50 -0500   wip-librbd-cleanup-aio
2015-01-17 01:53:49 -0500   wip-copy-on-read
2015-01-19 10:28:56 -0500   wip-10270-giant
2015-01-19 10:30:50 -0500   wip-10270-firefly
2015-01-19 11:25:16 -0500   wip-10299-giant
2015-01-19 11:51:07 -0500   wip-10299-firefly
2015-01-19 12:12:19 -0500   wip-9854-giant
2015-01-19 12:47:28 -0500   wip-9854-firefly
2015-01-19 18:47:27 -0500   wip-8902
2015-01-21 15:25:10 -0500   wip-10462
2015-01-21 15:28:16 -0500   wip-10590-giant
2015-01-21 16:57:16 -0500   dumpling
2015-01-21 17:23:28 -0500   wip-10270-dumpling
2015-01-24 02:23:08 -0500   wip-4087
2015-01-25 20:26:27 -0500   wip-gmock

Jenkins jenk...@inktank.com
2014-07-29 05:24:39 -0700   wip-nhm-hang
2015-01-13 12:10:22 -0800   last

Joao Eduardo Luis jec...@gmail.com
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis joao.l...@gmail.com
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis joao.l...@inktank.com
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling

Joao Eduardo Luis j...@redhat.com
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   wip-mon-paxos-fix
2015-01-22 11:41:59 +   wip-mon-pgtemp
2015-01-26 13:00:09 +   wip-mon-datahealth-fix

John Spray jcsp...@gmail.com
2014-03-03 13:10:05 +   wip-mds-stop-rank-0

John Spray john.sp...@redhat.com
2014-06-25 22:54:13 -0400   wip-mds-sessions
2014-07-29 00:15:21 +0100   wip-objecter-rebase
2014-08-15 02:33:49 +0100   wip-mds-contexts
2014-08-28 12:40:20 +0100   wip-9152
2014-08-28 23:34:43 +0100   wip-typed-contexts
2014-09-08 01:49:57 +0100   wip-jcsp-test
2014-09-12 18:42:02 +0100   wip-9280
2014-09-15 16:14:15 +0100   wip-9375
2014-09-24 17:56:02 +0100   wip-continuation
2014-11-08 16:02:33 +   wip-9977-backport

upcoming dumpling v0.67.12

2015-01-26 Thread Loic Dachary
Hi Yuri,

Here is a short update on the progress of the upcoming dumpling v0.67.12.

It is tracked with http://tracker.ceph.com/issues/10560. The inventory part 
lists all pull requests that are already merged in the dumpling branch. There 
is only one pull request waiting to be merged and three issues waiting for 
backports. While these last three are being worked on, I started the rbd, rgw 
and rados suites.

I chose to display the inventory by pull request because I figured it would be 
more convenient to read, since a single pull request sometimes spans multiple 
issues (https://github.com/ceph/ceph/pull/2611, for instance, fixes two issues).

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: upcoming dumpling v0.67.12

2015-01-26 Thread Yuri Weinstein
Loic,

Here is the run from sepia
http://pulpito.front.sepia.ceph.com/ubuntu-2015-01-26_09:26:27-upgrade:dumpling-dumpling-distro-basic-vps/

Two failures seem like env noise.

Thx
YuriW

On Mon, Jan 26, 2015 at 9:49 AM, Loic Dachary l...@dachary.org wrote:
 Thanks for letting me know about the upgrade tests results, it's encouraging 
 :-) I'll let you know when the tests make progress.

 On 26/01/2015 18:00, Yuri Weinstein wrote:
 Loic,

 Thanks for the update.
 I ran upgrade/dumpling last week (and all 42 jobs passed in octo and
 sepia) to establish a base line.  And today running another one,
 assuming it will pick up the already merged pull requests.

 Let me know when you ready for next steps.

 Thx
 YuriW

 On Mon, Jan 26, 2015 at 7:37 AM, Loic Dachary l...@dachary.org wrote:
 Hi Yuri,

 Here is a short update on the progress of the upcoming dumpling v0.67.12.

 It is tracked with http://tracker.ceph.com/issues/10560. In the inventory 
 part, there is a list of all pull requests that are already merged in the 
 dumpling branch. There only is one pull request waiting to be merged and 
 three issues waiting for backports. While these last three are being worked 
 on, I started rbd, rgw and rados suites.

 I chose to display the inventory by pull request because I figured it would 
 be more convenient to read because sometimes a single pull request spans 
 multiple issues ( https://github.com/ceph/ceph/pull/2611 for instance fixes 
 two issues ).

 Cheers

 --
 Loïc Dachary, Artisan Logiciel Libre

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 Loïc Dachary, Artisan Logiciel Libre



Re: upcoming dumpling v0.67.12

2015-01-26 Thread Yuri Weinstein
Loic,

Thanks for the update.
I ran upgrade/dumpling last week (and all 42 jobs passed in octo and
sepia) to establish a baseline.  And today I'm running another one,
assuming it will pick up the already merged pull requests.

Let me know when you're ready for next steps.

Thx
YuriW

On Mon, Jan 26, 2015 at 7:37 AM, Loic Dachary l...@dachary.org wrote:
 Hi Yuri,

 Here is a short update on the progress of the upcoming dumpling v0.67.12.

 It is tracked with http://tracker.ceph.com/issues/10560. In the inventory 
 part, there is a list of all pull requests that are already merged in the 
 dumpling branch. There only is one pull request waiting to be merged and 
 three issues waiting for backports. While these last three are being worked 
 on, I started rbd, rgw and rados suites.

 I chose to display the inventory by pull request because I figured it would 
 be more convenient to read because sometimes a single pull request spans 
 multiple issues ( https://github.com/ceph/ceph/pull/2611 for instance fixes 
 two issues ).

 Cheers

 --
 Loïc Dachary, Artisan Logiciel Libre



RE: idempotent op (esp delete)

2015-01-26 Thread Sage Weil
On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
 The downside of this approach is that we may need to search the pg_log 
 for a specific object in every write io? 

Not quite.  IndexedLog maintains a hash_map of all of the request ids in 
the log, so it's just a hash lookup on each IO.  (Well, now 2 hash 
lookups, because I put the additional request IDs in a second auxiliary 
map to handle dups properly.  I think we can avoid that lookup if we use 
the request flags carefully, though.. the RETRY and REDIRECTED flags 
I think?  Need to check carefully.)

 Maybe we can combine this 
 approach and the changes in PR 3447. For the flush case when the object 
 is deleted in the base, we search the pg_log for dup op. This should be 
 rare cases. Otherwise the object exists, we check the reqid list in the 
 object_info_t for dup op.

We could do a hybrid approach, but there is some cost to the per-object 
tracking: a tiny bit more memory, and an O(n) search of the items in that 
list (~10 or 20?) for the dup check.  I suspect the hash lookup is 
cheaper?  And simpler.

sage


 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
 Sent: Monday, January 26, 2015 10:35 AM
 To: Sage Weil; Gregory Farnum
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: idempotent op (esp delete)
 
 This method puts the reqid list in the pg_log instead of the object_info_t, 
 so that it's preserved even in the delete case, which sounds more reasonable.
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Saturday, January 24, 2015 6:19 AM
 To: Gregory Farnum
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: idempotent op (esp delete)
 
 On Fri, 23 Jan 2015, Gregory Farnum wrote:
  On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:
   Background:
  
   1) Way back when we made a task that would thrash the cache modes by 
   adding and removing the cache tier while ceph_test_rados was running.
   This mostly worked, but would occasionally fail because we would
  
- delete an object from the cache tier
- a network failure injection would lose the reply
- we'd disable the cache
- the delete would resend to the base tier, not get recognized as a 
   dup (different pool, different pg log)
  - -ENOENT instead of 0
  
   2) The proxy write code hits a similar problem:
  
- delete gets proxied
- we initiate async promote
- a network failure injection loses the delete reply
- delete resends and blocks on promote (or arrives after it
   finishes)
- promote finishes
- delete is handled
 - -ENOENT instead of 0
  
   The ticket is http://tracker.ceph.com/issues/8935
  
   The problem is partially addressed by
  
   https://github.com/ceph/ceph/pull/3447
  
   by logging a few request ids on every object_info_t and preserving 
   that on promote and flush.
  
   However, it doesn't solve the problem for delete because we throw 
   out object_info_t so that reqid_t is lost.
  
   I think we have two options, not necessarily mutually exclusive:
  
   1) When promoting an object that doesn't exist (to create a 
   whiteout), pull reqids out of the base tier's pg log so that the 
   whiteout is primed with request ids.
  
   1.5) When flushing... well, that is harder because we have nowhere 
   to put the reqids.  Unless we make a way to cram a list of reqid's 
   into a single PG log entry...?  In that case, we wouldn't strictly 
   need the per-object list since we could pile the base tier's reqids 
   into the promote log entry in the cache tier.
  
   2) Make delete idempotent (0 instead of ENOENT if the object doesn't 
   exist).  This will require a delicate compat transition (let's 
   ignore that a moment) but you can preserve the old behavior for 
   callers that care by preceding the delete with an assert_exists op.
   Most callers don't care, but a handful do.  This simplifies the 
   semantics we need to support going forward.
  
   Of course, it's all a bit delicate.  The idempotent op semantics 
   have a time horizon so it's all a bit wishy-washy... :/
  
   Thoughts?
  
  Do we have other cases that we're worried about which would be 
  improved by maintaining reqids across pool cache transitions? I'm not 
  a big fan of maintaining those per-op lists (they sound really 
  expensive?), but if we need them for something else that's a point in 
  their favor.
 
 I don't think they're *too* expensive (say, a vector of 20 per object_info_t?). 
 But the only thing I can think of beyond the cache tiering stuff would be 
 cases where the pg log isn't long enough for a very laggy client.  In general 
 ops will be distributed across objects so the dup will be caught from another 
 angle.
 
 However.. I just hacked up a patch that lets us cram lots of reqids into a 
 single pg_log_entry_t and I think that may be a 

Re: idempotent op (esp delete)

2015-01-26 Thread Samuel Just
The pg_log_t variant does seem to be cleaner.
-Sam

On Mon, Jan 26, 2015 at 9:21 AM, Sage Weil sw...@redhat.com wrote:
 On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
 The downside of this approach is that we may need to search the pg_log
 for a specific object in every write io?

 Not quite.  IndexedLog maintains a hash_map of all of the request ids in
 the log, so it's just a hash lookup on each IO.  (Well, now 2 hash
 lookups, because I put the additional request IDs in a second auxiliary
 map to handle dups properly.  I think we can avoid that lookup if we use
 the request flags carefully, though.. the RETRY and REDIRECTED flags
 I think?  Need to check carefully.)

 Maybe we can combine this
 approach and the changes in PR 3447. For the flush case when the object
 is deleted in the base, we search the pg_log for dup op. This should be
 rare cases. Otherwise the object exists, we check the reqid list in the
 object_info_t for dup op.

 We could do a hybrid approach, but there is some cost to the per-object
 tracking: a tiny bit more memory, and an O(n) search of the items in that
 list (~10 or 20?) for the dup check.  I suspect the hash lookup is
 cheaper?  And simpler.

 sage



 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
 Sent: Monday, January 26, 2015 10:35 AM
 To: Sage Weil; Gregory Farnum
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: idempotent op (esp delete)

 This method puts the reqid list in the pg_log instead of the object_info_t, 
 so that it's preserved even in the delete case, which sounds more reasonable.

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Saturday, January 24, 2015 6:19 AM
 To: Gregory Farnum
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: idempotent op (esp delete)

 On Fri, 23 Jan 2015, Gregory Farnum wrote:
  On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:
   Background:
  
   1) Way back when we made a task that would thrash the cache modes by
   adding and removing the cache tier while ceph_test_rados was running.
   This mostly worked, but would occasionally fail because we would
  
- delete an object from the cache tier
- a network failure injection would lose the reply
- we'd disable the cache
- the delete would resend to the base tier, not get recognized as a
   dup (different pool, different pg log)
  - -ENOENT instead of 0
  
   2) The proxy write code hits a similar problem:
  
- delete gets proxied
- we initiate async promote
- a network failure injection loses the delete reply
- delete resends and blocks on promote (or arrives after it
   finishes)
- promote finishes
- delete is handled
 - -ENOENT instead of 0
  
   The ticket is http://tracker.ceph.com/issues/8935
  
   The problem is partially addressed by
  
   https://github.com/ceph/ceph/pull/3447
  
   by logging a few request ids on every object_info_t and preserving
   that on promote and flush.
  
   However, it doesn't solve the problem for delete because we throw
   out object_info_t so that reqid_t is lost.
  
   I think we have two options, not necessarily mutually exclusive:
  
   1) When promoting an object that doesn't exist (to create a
   whiteout), pull reqids out of the base tier's pg log so that the
   whiteout is primed with request ids.
  
   1.5) When flushing... well, that is harder because we have nowhere
   to put the reqids.  Unless we make a way to cram a list of reqid's
   into a single PG log entry...?  In that case, we wouldn't strictly
   need the per-object list since we could pile the base tier's reqids
   into the promote log entry in the cache tier.
  
   2) Make delete idempotent (0 instead of ENOENT if the object doesn't
   exist).  This will require a delicate compat transition (let's
   ignore that a moment) but you can preserve the old behavior for
   callers that care by preceding the delete with an assert_exists op.
   Most callers don't care, but a handful do.  This simplifies the
   semantics we need to support going forward.
  
   Of course, it's all a bit delicate.  The idempotent op semantics
   have a time horizon so it's all a bit wishy-washy... :/
  
   Thoughts?
 
  Do we have other cases that we're worried about which would be
  improved by maintaining reqids across pool cache transitions? I'm not
  a big fan of maintaining those per-op lists (they sound really
  expensive?), but if we need them for something else that's a point in
  their favor.

 I don't think they're *too* expensive (say, a vector of 20 per 
 object_info_t?).  But the only thing I can think of beyond the cache tiering 
 stuff would be cases where the pg log isn't long enough for a very laggy 
 client.  In general ops will be distributed across objects so the dup will be 
 caught from another angle.

 However.. I just hacked up a patch 

Re: upcoming dumpling v0.67.12

2015-01-26 Thread Loic Dachary
Thanks for letting me know about the upgrade tests results, it's encouraging 
:-) I'll let you know when the tests make progress.

On 26/01/2015 18:00, Yuri Weinstein wrote:
 Loic,
 
 Thanks for the update.
 I ran upgrade/dumpling last week (and all 42 jobs passed in octo and
 sepia) to establish a base line.  And today running another one,
 assuming it will pick up the already merged pull requests.
 
 Let me know when you ready for next steps.
 
 Thx
 YuriW
 
 On Mon, Jan 26, 2015 at 7:37 AM, Loic Dachary l...@dachary.org wrote:
 Hi Yuri,

 Here is a short update on the progress of the upcoming dumpling v0.67.12.

 It is tracked with http://tracker.ceph.com/issues/10560. In the inventory 
 part, there is a list of all pull requests that are already merged in the 
 dumpling branch. There only is one pull request waiting to be merged and 
 three issues waiting for backports. While these last three are being worked 
 on, I started rbd, rgw and rados suites.

 I chose to display the inventory by pull request because I figured it would 
 be more convenient to read because sometimes a single pull request spans 
 multiple issues ( https://github.com/ceph/ceph/pull/2611 for instance fixes 
 two issues ).

 Cheers

 --
 Loïc Dachary, Artisan Logiciel Libre

 

-- 
Loïc Dachary, Artisan Logiciel Libre





RE: wip-auth

2015-01-26 Thread Blinick, Stephen L
Good to know; I was wondering why the spec file defaulted to libnss. The 
dpkg-build for Debian packages just uses whatever configuration you had built, 
and I believe that will use libcrypto++ if the dependency is installed on the 
build machine (last I looked).

I forgot to mention the numbers below were based on v.91.

Thanks,

Stephen

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, January 26, 2015 10:24 AM
To: Blinick, Stephen L
Cc: andreas.blue...@itxperts.de; ceph-devel@vger.kernel.org
Subject: RE: wip-auth

On Mon, 26 Jan 2015, Blinick, Stephen L wrote:
 I noticed that the spec file for building RPM's defaults to building with 
 libnss, instead of libcrypto++.  Since the measurements I'd done so far were 
 from those RPM's I rebuilt with libcrypto++.. so FWIW here is the difference 
 between those two on my system, memstore backend with a single OSD, and 
 single client.
 
 Dual socket Xeon E5 2620v3, 64GB Memory,  RHEL7
 Kernel: 3.10.0-123.13.2.el7
 
 100% 4K Writes, 1xOSD w/ Rados Bench
           libnss               | Cryptopp
 # QD      IOPS      Latency(ms)| IOPS      Latency(ms)   IOPS Improvement %
 16        14432.57  1.11       | 18896.60  0.85          30.93%
 
 100% 4K Reads, 1xOSD w/ Rados Bench
           libnss               | Cryptopp
 # QD      IOPS      Latency(ms)| IOPS      Latency(ms)   IOPS Improvement %
 16        19532.53  0.82       | 25708.70  0.62          31.62%

Yikes, 30%!  I think this is definitely worth some effort.  We switched to libnss 
because it has the weird government certifications that everyone wants and is 
more prevalent.  crypto++ is also not packaged for Red Hat distros at all 
(presumably for that reason).

I suspect that most of the overhead is in the encryption context setup and can 
be avoided with a bit of effort..

sage


 
 
 Thanks,
 
 Stephen
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Thursday, January 22, 2015 4:56 PM
 To: andreas.blue...@itxperts.de
 Cc: ceph-devel@vger.kernel.org
 Subject: wip-auth
 
 Hi Andreas,
 
 I took a look at the wip-auth I mentioned in the security call last week... 
 and the patch didn't work at all.  Sorry if you wasted any time trying it.
 
 Anyway, I fixed it up so that it actually worked and made one other 
 optimization.  It would be great to hear what latencies you measure with the 
 changes in place.
 
 Also, it might be worth trying --with-cryptopp (or --with-nss if you 
 built cryptopp by default) to see if there is a difference.  There is 
 a ton of boilerplate setting up encryption contexts and key structures 
 and so on that I suspect could be cached (perhaps stashed in the 
 CryptoKey struct?) with a bit of effort.  See
 
   https://github.com/ceph/ceph/blob/master/src/auth/Crypto.cc#L99-L213
 
 sage


RE: wip-auth

2015-01-26 Thread Sage Weil
On Mon, 26 Jan 2015, Blinick, Stephen L wrote:
 I noticed that the spec file for building RPM's defaults to building with 
 libnss, instead of libcrypto++.  Since the measurements I'd done so far were 
 from those RPM's I rebuilt with libcrypto++.. so FWIW here is the difference 
 between those two on my system, memstore backend with a single OSD, and 
 single client.
 
 Dual socket Xeon E5 2620v3, 64GB Memory,  RHEL7 
 Kernel: 3.10.0-123.13.2.el7
 
 100% 4K Writes, 1xOSD w/ Rados Bench
           libnss               | Cryptopp
 # QD      IOPS      Latency(ms)| IOPS      Latency(ms)   IOPS Improvement %
 16        14432.57  1.11       | 18896.60  0.85          30.93%
 
 100% 4K Reads, 1xOSD w/ Rados Bench
           libnss               | Cryptopp
 # QD      IOPS      Latency(ms)| IOPS      Latency(ms)   IOPS Improvement %
 16        19532.53  0.82       | 25708.70  0.62          31.62%

Yikes, 30%!  I think this is definitely worth some effort.  We switched to 
libnss because it has the weird government certifications that everyone 
wants and is more prevalent.  crypto++ is also not packaged for Red 
Hat distros at all (presumably for that reason).

I suspect that most of the overhead is in the encryption context setup and 
can be avoided with a bit of effort..

sage


 
 
 Thanks,
 
 Stephen
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Thursday, January 22, 2015 4:56 PM
 To: andreas.blue...@itxperts.de
 Cc: ceph-devel@vger.kernel.org
 Subject: wip-auth
 
 Hi Andreas,
 
 I took a look at the wip-auth I mentioned in the security call last week... 
 and the patch didn't work at all.  Sorry if you wasted any time trying it.
 
 Anyway, I fixed it up so that it actually worked and made one other 
 optimization.  It would be great to hear what latencies you measure with the 
 changes in place.
 
 Also, it might be worth trying --with-cryptopp (or --with-nss if you built 
 cryptopp by default) to see if there is a difference.  There is a ton of 
 boilerplate setting up encryption contexts and key structures and so on that 
 I suspect could be cached (perhaps stashed in the CryptoKey struct?) with a 
 bit of effort.  See
 
   https://github.com/ceph/ceph/blob/master/src/auth/Crypto.cc#L99-L213
 
 sage


Re: wip-auth

2015-01-26 Thread Mark Nelson

Hi Stephen,

Does this explain the results you were seeing earlier with the memstore 
testing?


Mark

On 01/26/2015 12:00 PM, Blinick, Stephen L wrote:

Good to know, I was wondering why the spec file defaulted to lib-nss.. the 
dpkg-build for debian packages just uses whatever configuration you had built, 
and I believe that will use libcryptopp if the dependency is installed on the 
build machine (last I looked).

I forgot to mention the numbers below were based on v.91.

Thanks,

Stephen

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, January 26, 2015 10:24 AM
To: Blinick, Stephen L
Cc: andreas.blue...@itxperts.de; ceph-devel@vger.kernel.org
Subject: RE: wip-auth

On Mon, 26 Jan 2015, Blinick, Stephen L wrote:

I noticed that the spec file for building RPM's defaults to building with 
libnss, instead of libcrypto++.  Since the measurements I'd done so far were 
from those RPM's I rebuilt with libcrypto++.. so FWIW here is the difference 
between those two on my system, memstore backend with a single OSD, and single 
client.

Dual socket Xeon E5 2620v3, 64GB Memory,  RHEL7
Kernel: 3.10.0-123.13.2.el7

100% 4K Writes, 1xOSD w/ Rados Bench
       libnss                | Cryptopp
# QD   IOPS      Latency(ms) | IOPS      Latency(ms)   IOPS Improvement %
16     14432.57  1.11        | 18896.60  0.85          30.93%

100% 4K Reads, 1xOSD w/ Rados Bench
       libnss                | Cryptopp
# QD   IOPS      Latency(ms) | IOPS      Latency(ms)   IOPS Improvement %
16     19532.53  0.82        | 25708.70  0.62          31.62%


Yikes, 30%!  I think this definitely worth some effort.  We switched to libnss 
because it has the weird government certfiications that everyone wants and is 
more prevalent.  crypto++ is also not packaged for Red Hat distros at all 
(presumably for that reason).

I suspect that most of the overhead is in the encryption context setup and can 
be avoided with a bit of effort..
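The kind of caching being suggested could look roughly like this (a minimal userspace sketch; the struct and function names are made up for illustration and are not the actual CryptoKey, NSS, or Crypto++ API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts how many times the expensive context setup runs. */
int contexts_built;

struct cipher_ctx { int dummy; };   /* stand-in for a real cipher context */

struct crypto_key {
    struct cipher_ctx *ctx;         /* lazily built once, then reused */
};

struct cipher_ctx *build_ctx(void)
{
    /* Stands in for the expensive per-call setup (key import,
     * context creation, etc.) that could instead be done once. */
    contexts_built++;
    return calloc(1, sizeof(struct cipher_ctx));
}

/* Return the cached context, building it only on first use. */
struct cipher_ctx *key_get_ctx(struct crypto_key *key)
{
    if (!key->ctx)
        key->ctx = build_ctx();
    return key->ctx;
}
```

A real version stashed in a shared CryptoKey would of course need locking around the lazy initialization; this sketch is single-threaded.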

sage





Thanks,

Stephen

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, January 22, 2015 4:56 PM
To: andreas.blue...@itxperts.de
Cc: ceph-devel@vger.kernel.org
Subject: wip-auth

Hi Andreas,

I took a look at the wip-auth I mentioned in the security call last week... and 
the patch didn't work at all.  Sorry if you wasted any time trying it.

Anyway, I fixed it up so that it actually worked and made one other 
optimization.  It would be great to hear what latencies you measure with the 
changes in place.

Also, it might be worth trying --with-cryptopp (or --with-nss if you
built cryptopp by default) to see if there is a difference.  There is
a ton of boilerplate setting up encryption contexts and key structures
and so on that I suspect could be cached (perhaps stashed in the
CryptoKey struct?) with a bit of effort.  See

https://github.com/ceph/ceph/blob/master/src/auth/Crypto.cc#L99-L213

sage





Re: [PATCH 1/3] rbd: fix rbd_dev_parent_get() when parent_overlap == 0

2015-01-26 Thread Alex Elder
On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
 The comment for rbd_dev_parent_get() said
 
 * We must get the reference before checking for the overlap to
 * coordinate properly with zeroing the parent overlap in
 * rbd_dev_v2_parent_info() when an image gets flattened.  We
 * drop it again if there is no overlap.
 
 but the "drop it again if there is no overlap" part was missing from
 the implementation.  This led to absurd parent_ref values for images
 with parent_overlap == 0, as parent_ref was incremented for each
 img_request and virtually never decremented.

You're right about this.  If the image had a parent with no
overlap this would leak a reference to the parent image.  The
code should have said:

counter = atomic_inc_return_safe(&rbd_dev->parent_ref);
if (counter > 0) {
        if (rbd_dev->parent_overlap)
                return true;
        atomic_dec(&rbd_dev->parent_ref);
} else if (counter < 0) {
        rbd_warn(rbd_dev, "parent reference overflow");
}

 Fix this by leveraging the fact that refresh path calls
 rbd_dev_v2_parent_info() under header_rwsem and use it for read in
 rbd_dev_parent_get(), instead of messing around with atomics.  Get rid
 of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
 they'd pair with now and I suspect we are in a pretty miserable
 situation as far as proper locking goes regardless.

The point of the memory barrier was to ensure that when parent_overlap
gets zeroed, this code sees the zero rather than the old non-zero
value.  The atomic_inc_return_safe() call has an implicit memory
barrier to match the smp_mb() call.  It allowed the synchronization
to occur without the use of a lock.

We're trying to atomically determine whether an image request needs
to be marked as layered, to know how to handle ENOENT on parent reads.
If it is a write to an image with a parent having a non-zero overlap,
it's layered, otherwise we can treat it as a simple request.

I think in this particular case, this is just an optimization,
trying very hard to avoid having to do layered image handling
if the parent has become flattened.  I think that even if it
got old information (suggesting non-zero overlap) things would
behave correctly, just less efficiently.

Using the semaphore adds a lock to this path and therefore
implements whatever barriers are being removed.  I'm not
sure how often this is hit--maybe the optimization isn't
buying much after all.

I am getting a little rusty on some of details of what
precisely happens when a layered image gets flattened.
But I think this looks OK.  Maybe just watch for small
(perhaps insignificant) performance regressions with
this change in place...

Reviewed-by: Alex Elder el...@linaro.org

 Cc: sta...@vger.kernel.org # 3.11+
 Signed-off-by: Ilya Dryomov idryo...@redhat.com
 ---
  drivers/block/rbd.c | 20 ++--
  1 file changed, 6 insertions(+), 14 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 31fa00f0d707..2990a1c75159 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2098,32 +2098,26 @@ static void rbd_dev_parent_put(struct rbd_device 
 *rbd_dev)
   * If an image has a non-zero parent overlap, get a reference to its
   * parent.
   *
 - * We must get the reference before checking for the overlap to
 - * coordinate properly with zeroing the parent overlap in
 - * rbd_dev_v2_parent_info() when an image gets flattened.  We
 - * drop it again if there is no overlap.
 - *
   * Returns true if the rbd device has a parent with a non-zero
   * overlap and a reference for it was successfully taken, or
   * false otherwise.
   */
  static bool rbd_dev_parent_get(struct rbd_device *rbd_dev)
  {
 - int counter;
 + int counter = 0;
  
   if (!rbd_dev->parent_spec)
   return false;
  
 - counter = atomic_inc_return_safe(&rbd_dev->parent_ref);
 - if (counter > 0 && rbd_dev->parent_overlap)
 - return true;
 -
 - /* Image was flattened, but parent is not yet torn down */
 + down_read(&rbd_dev->header_rwsem);
 + if (rbd_dev->parent_overlap)
 + counter = atomic_inc_return_safe(&rbd_dev->parent_ref);
 + up_read(&rbd_dev->header_rwsem);
  
   if (counter < 0)
   rbd_warn(rbd_dev, "parent reference overflow");
  
 - return false;
 + return counter > 0;
  }
  
  /*
 @@ -4238,7 +4232,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
*/
   if (rbd_dev->parent_overlap) {
   rbd_dev-parent_overlap = 0;
 - smp_mb();
   rbd_dev_parent_put(rbd_dev);
   pr_info("%s: clone image has been flattened\n",
   rbd_dev->disk->disk_name);
 @@ -4284,7 +4277,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
* treat it specially.
*/
   rbd_dev-parent_overlap = overlap;
 

Re: [PATCH 3/3] rbd: do not treat standalone as flatten

2015-01-26 Thread Alex Elder
On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
 If the clone is resized down to 0, it becomes standalone.  If such
 resize is carried over while an image is mapped we would detect this
 and call rbd_dev_parent_put() which means let go of all parent state,
 including the spec(s) of parent images(s).  This leads to a mismatch
 between rbd info and sysfs parent fields, so a fix is in order.
 
 # rbd create --image-format 2 --size 1 foo
 # rbd snap create foo@snap
 # rbd snap protect foo@snap
 # rbd clone foo@snap bar
 # DEV=$(rbd map bar)
 # rbd resize --allow-shrink --size 0 bar
 # rbd resize --size 1 bar
 # rbd info bar | grep parent
 parent: rbd/foo@snap
 
 Before:
 
 # cat /sys/bus/rbd/devices/0/parent
 (no parent image)
 
 After:
 
 # cat /sys/bus/rbd/devices/0/parent
 pool_id 0
 pool_name rbd
 image_id 10056b8b4567
 image_name foo
 snap_id 2
 snap_name snap
 overlap 0
 
 Signed-off-by: Ilya Dryomov idryo...@redhat.com

Hmm.  Interesting.

I think that a parent with an overlap of 0 is of no
real use.  So in the last patch I was suggesting it
should just go away.

But now, looking at it from this perspective, the fact
that an image *came from* a particular parent, but which
has no more overlap, could be useful information.  The
parent shouldn't simply go away without the user requesting
that.

I haven't completely followed through the logic of keeping
the reference around but I understand what you're doing and
it looks OK to me.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c | 30 ++
  1 file changed, 10 insertions(+), 20 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index b85d52005a21..e818c2a6ffb1 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4273,32 +4273,22 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
   }
  
   /*
 -  * We always update the parent overlap.  If it's zero we
 -  * treat it specially.
 +  * We always update the parent overlap.  If it's zero we issue
 +  * a warning, as we will proceed as if there was no parent.
*/
 - rbd_dev-parent_overlap = overlap;
   if (!overlap) {
 -
 - /* A null parent_spec indicates it's the initial probe */
 -
   if (parent_spec) {
 - /*
 -  * The overlap has become zero, so the clone
 -  * must have been resized down to 0 at some
 -  * point.  Treat this the same as a flatten.
 -  */
 - rbd_dev_parent_put(rbd_dev);
 - pr_info("%s: clone image now standalone\n",
 - rbd_dev->disk->disk_name);
 + /* refresh, careful to warn just once */
 + if (rbd_dev->parent_overlap)
 + rbd_warn(rbd_dev,
 + "clone now standalone (overlap became 0)");
   } else {
 - /*
 -  * For the initial probe, if we find the
 -  * overlap is zero we just pretend there was
 -  * no parent image.
 -  */
 - rbd_warn(rbd_dev, "ignoring parent with overlap 0");
 + /* initial probe */
 + rbd_warn(rbd_dev, "clone is standalone (overlap 0)");
   }
   }
 + rbd_dev-parent_overlap = overlap;
 +
  out:
   ret = 0;
  out_err:
 



[Questions]Can client know which OSDs are storing the data?

2015-01-26 Thread Dennis Chen
Hello Guys,

My question is very rude and direct, a little bit stupid maybe ;-)

Question a: a client writes a file to the cluster (supposing replica =
3), so the data will be stored on 3 OSDs within the cluster. Can I get
the information of which OSDs store the file data on the client side?

Question b: can the object data still be replicated if I store an
object from a client with the RADOS API?

Thank you guys !
-- 
Den


RE: idempotent op (esp delete)

2015-01-26 Thread Wang, Zhiqiang
Not sure if it is correct, but below is my understanding. Correct me if I'm 
wrong.

Yes, in the current code, a hash lookup on each IO is used to check for dup ops.
But when adding extra_reqids as a log entry in the pg_log, 2 hash lookups may
not be sufficient. The extra_reqids are organized by object id, as in your code.
We may have other ops on an object just promoted. If a dup op comes in after
these ops, then we have to search for log entries of this object in the pg_log.

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, January 27, 2015 1:21 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org; Gregory Farnum
Subject: RE: idempotent op (esp delete)

On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
 The downside of this approach is that we may need to search the pg_log 
 for a specific object in every write io?

Not quite.  IndexedLog maintains a hash_map of all of the request ids in the 
log, so it's just a hash lookup on each IO.  (Well, now 2 hash lookups, because 
I put the additional request IDs in a second auxiliary map to handle dups 
properly.  I think we can avoid that lookup if we use the request flags 
carefully, though.. the RETRY and REDIRECTED flags I think?  Need to check 
carefully.)

 Maybe we can combine this
 approach and the changes in PR 3447. For the flush case when the 
 object is deleted in the base, we search the pg_log for dup op. This 
 should be rare cases. Otherwise the object exists, we check the reqid 
 list in the object_info_t for dup op.

We could do a hybrid approach, but there is some cost to the per-object
tracking: a tiny bit more memory, and an O(n) search of the items in that list 
(~10 or 20?) for the dup check.  I suspect the hash lookup is cheaper?  And 
simpler.

sage


 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
 Sent: Monday, January 26, 2015 10:35 AM
 To: Sage Weil; Gregory Farnum
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: idempotent op (esp delete)
 
 This method puts the reqid list in the pg_log instead of the object_info_t, 
 so that it's preserved even in the delete case, which sounds more reasonable.
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Saturday, January 24, 2015 6:19 AM
 To: Gregory Farnum
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: idempotent op (esp delete)
 
 On Fri, 23 Jan 2015, Gregory Farnum wrote:
  On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:
   Background:
  
   1) Way back when we made a task that would thrash the cache modes 
   by adding and removing the cache tier while ceph_test_rados was running.
   This mostly worked, but would occasionally fail because we would
  
- delete an object from the cache tier
- a network failure injection would lose the reply
- we'd disable the cache
- the delete would resend to the base tier, not get recognized as 
   a dup (different pool, different pg log)
  - -ENOENT instead of 0
  
   2) The proxy write code hits a similar problem:
  
- delete gets proxied
- we initiate async promote
- a network failure injection loses the delete reply
- delete resends and blocks on promote (or arrives after it
   finishes)
- promote finishes
- delete is handled
 - -ENOENT instead of 0
  
   The ticket is http://tracker.ceph.com/issues/8935
  
   The problem is partially addressed by
  
   https://github.com/ceph/ceph/pull/3447
  
   by logging a few request ids on every object_info_t and preserving 
   that on promote and flush.
  
   However, it doesn't solve the problem for delete because we throw 
   out object_info_t so that reqid_t is lost.
  
   I think we have two options, not necessarily mutually exclusive:
  
   1) When promoting an object that doesn't exist (to create a 
   whiteout), pull reqids out of the base tier's pg log so that the 
   whiteout is primed with request ids.
  
   1.5) When flushing... well, that is harder because we have nowhere 
   to put the reqids.  Unless we make a way to cram a list of reqid's 
   into a single PG log entry...?  In that case, we wouldn't strictly 
   need the per-object list since we could pile the base tier's 
   reqids into the promote log entry in the cache tier.
  
   2) Make delete idempotent (0 instead of ENOENT if the object 
   doesn't exist).  This will require a delicate compat transition 
   (let's ignore that a moment) but you can preserve the old behavior 
   for callers that care by preceding the delete with an assert_exists op.
   Most callers don't care, but a handful do.  This simplifies the 
   semantics we need to support going forward.
  
   Of course, it's all a bit delicate.  The idempotent op semantics 
   have a time horizon so it's all a bit wishy-washy... :/
  
   

Re: [PATCH 2/3] rbd: drop parent_ref in rbd_dev_unprobe() unconditionally

2015-01-26 Thread Alex Elder
On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
 This effectively reverts the last hunk of 392a9dad7e77 ("rbd: detect
 when clone image is flattened").
 
 The problem with parent_overlap != 0 condition is that it's possible
 and completely valid to have an image with parent_overlap == 0 whose
 parent state needs to be cleaned up on unmap.  The next commit, which
 drops the clone image now standalone logic, opens up another window
 of opportunity to hit this, but even without it
 
 # cat parent-ref.sh
 #!/bin/bash
 rbd create --image-format 2 --size 1 foo
 rbd snap create foo@snap
 rbd snap protect foo@snap
 rbd clone foo@snap bar
 rbd resize --allow-shrink --size 0 bar
 rbd resize --size 1 bar
 DEV=$(rbd map bar)
 rbd unmap $DEV
 
 leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
 hanging around.

I'm not sure why the last reference to the parent
doesn't get dropped (and state cleaned up) as soon
as the overlap becomes 0.  I suspect it's the original
reference taken when there's a parent, we don't get
rid of it until it's torn down.  (I think we should.)

It seems to me the test here should be for a non-null
parent_spec pointer rather than non-zero parent_overlap.
And that's done inside rbd_dev_parent_put(), so your
change looks reasonable to me.

Reviewed-by: Alex Elder el...@linaro.org

 
 My thinking behind calling rbd_dev_parent_put() unconditionally is that
 there shouldn't be any requests in flight at that point in time as we
 are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
 by flatten is delayed by in-flight requests, it will have finished by
 the time we reach rbd_dev_unprobe() caused by unmap, thus turning
 unconditional rbd_dev_parent_put() into a no-op.
 
 Fixes: http://tracker.ceph.com/issues/10352
 
 Cc: sta...@vger.kernel.org # 3.11+
 Signed-off-by: Ilya Dryomov idryo...@redhat.com
 ---
  drivers/block/rbd.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 2990a1c75159..b85d52005a21 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -5075,10 +5075,7 @@ static void rbd_dev_unprobe(struct rbd_device *rbd_dev)
  {
   struct rbd_image_header *header;
  
 - /* Drop parent reference unless it's already been done (or none) */
 -
 - if (rbd_dev->parent_overlap)
 - rbd_dev_parent_put(rbd_dev);
 + rbd_dev_parent_put(rbd_dev);
  
   /* Free dynamic fields from the header, then zero it out */
  
 



RE: [Questions]Can client know which OSDs are storing the data?

2015-01-26 Thread Ma, Jianpeng
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dennis Chen
 Sent: Tuesday, January 27, 2015 3:07 PM
 To: ceph-devel@vger.kernel.org; Dennis Chen
 Subject: [Questions]Can client know which OSDs are storing the data?
 
 Hello Guys,
 
 My question is very rude and direct, a little bit stupid maybe ;-)
 
 Question a: a client write a file to the cluster (supposing replica = 3), so 
 the
 data will be stored in 3 OSDs within the cluster, can I get the information of
 which OSDs storing the file data in client side?
 
ceph osd map <poolname> <objectname>
displays the pg and the osds which store the object.

 Question b: can the object data still be replicated if I store a object from 
 client
 with RADOS API?
Yes, rados api is also a client.
 
 Thank you guys !
 --
 Den

RE: Deadline of Github pull request for Hammer release (question)

2015-01-26 Thread Miyamae, Takeshi
Hi Loic,

I have noticed that your repository ceph-erasure-code-corpus is forked for us,
so I created a new pull request.

Update non-regression.sh #1
https://github.com/t-miyamae/ceph-erasure-code-corpus/pull/1

Best regards,
Takeshi Miyamae

-Original Message-
From: Miyamae, Takeshi/宮前 剛 
Sent: Monday, January 26, 2015 2:44 PM
To: 'Loic Dachary'
Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
Subject: RE: Deadline of Github pull request for Hammer release (question)

Hi Loic,

 Note that you also need to update

We have prepared mSHEC's parameter sets which we think will be commonly used.
Because I'm not sure how to update another person's repository, we will write 
down those parameter sets in this mail.
If we are required to do something, please let us know.

while read k m c ; do
    for stripe_width in $STRIPE_WIDTHS ; do
        ceph_erasure_code_non_regression --stripe-width $stripe_width \
            --plugin shec --parameter technique=multiple \
            --parameter k=$k --parameter m=$m --parameter c=$c \
            $ACTION $VERBOSE $MYDIR
    done
done <<EOF
1 1 1
2 1 1
3 2 1
3 2 2
3 3 2
4 1 1
4 2 2
4 3 2
5 2 1
6 3 2
6 4 2
6 4 3
7 2 1
8 3 2
8 4 2
8 4 3
9 4 2
9 5 3
12 7 4
EOF

Best regards,
Takeshi Miyamae

-Original Message-
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Friday, January 23, 2015 10:47 PM
To: Miyamae, Takeshi/宮前 剛
Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
Subject: Re: Deadline of Github pull request for Hammer release (question)

Hi,

Note that you also need to update 

https://github.com/dachary/ceph-erasure-code-corpus/blob/master/v0.85-764-gf3a1532/non-regression.sh

to include non regression tests for the most common cases of the SHEC plugin 
encoding / decoding. This is run by make check (this repository is a submodule 
of Ceph). It helps make sure that content encoded / decoded with a given 
version of the plugin can be encoded / decoded exactly in the same way by all 
future versions.

Cheers

On 06/01/2015 12:49, Miyamae, Takeshi wrote:
 Dear Loic,
 
 I'm Takeshi Miyamae, one of the authors of SHEC's blueprint.
 
 Shingled Erasure Code (SHEC)
 https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code
 _(SHEC)
 
 We have revised our blueprint shown in the last CDS to extend our 
 erasure code layouts and describe the guideline for choosing SHEC among 
 various EC plugins.
 We believe the blueprint now answers all the comments given at the CDS.
 
 In addition, we would like to ask for your advice on the schedule of 
 our github pull request. More specifically, we would like to know its 
 deadline for Hammer release.
 (As we have not really completed our verification of SHEC, we are 
 wondering if we should make it open for early preview.)
 
 Thank you in advance,
 Takeshi Miyamae
 
 

--
Loïc Dachary, Artisan Logiciel Libre



Re: Deadline of Github pull request for Hammer release (question)

2015-01-26 Thread Loic Dachary
Hi,

Thanks for the snippet, I'll add it to the non regression from the pull request 
you sent :-)

Cheers

On 26/01/2015 06:43, Miyamae, Takeshi wrote:
 Hi Loic,
 
 Note that you also need to update
 
 We have prepared mSHEC's parameter sets which we think will be commonly used.
 Because I'm not sure how to update another person's repository, we will write 
 down
 those parameter sets in this mail.
 If we are required to do something, please let us know.
 
 while read k m c ; do
     for stripe_width in $STRIPE_WIDTHS ; do
         ceph_erasure_code_non_regression --stripe-width $stripe_width \
             --plugin shec --parameter technique=multiple \
             --parameter k=$k --parameter m=$m --parameter c=$c \
             $ACTION $VERBOSE $MYDIR
     done
 done <<EOF
 1 1 1
 2 1 1
 3 2 1
 3 2 2
 3 3 2
 4 1 1
 4 2 2
 4 3 2
 5 2 1
 6 3 2
 6 4 2
 6 4 3
 7 2 1
 8 3 2
 8 4 2
 8 4 3
 9 4 2
 9 5 3
 12 7 4
 EOF
 
 Best regards,
 Takeshi Miyamae
 
 -Original Message-
 From: Loic Dachary [mailto:l...@dachary.org] 
 Sent: Friday, January 23, 2015 10:47 PM
 To: Miyamae, Takeshi/宮前 剛
 Cc: Ceph Development; Shiozawa, Kensuke/塩沢 賢輔; Nakao, Takanori/中尾 鷹詔
 Subject: Re: Deadline of Github pull request for Hammer release (question)
 
 Hi,
 
 Note that you also need to update 
 
 https://github.com/dachary/ceph-erasure-code-corpus/blob/master/v0.85-764-gf3a1532/non-regression.sh
 
 to include non regression tests for the most common cases of the SHEC plugin 
 encoding / decoding. This is run by make check (this repository is a 
 submodule of Ceph). It helps make sure that content encoded / decoded with a 
 given version of the plugin can be encoded / decoded exactly in the same way 
 by all future versions.
 
 Cheers
 
 On 06/01/2015 12:49, Miyamae, Takeshi wrote:
 Dear Loic,

 I'm Takeshi Miyamae, one of the authors of SHEC's blueprint.

 Shingled Erasure Code (SHEC)
 https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code
 _(SHEC)

 We have revised our blueprint shown in the last CDS to extend our 
 erasure code layouts and describe the guideline for choosing SHEC among 
 various EC plugins.
 We believe the blueprint now answers all the comments given at the CDS.

 In addition, we would like to ask for your advice on the schedule of 
 our github pull request. More specifically, we would like to know its 
 deadline for Hammer release.
 (As we have not really completed our verification of SHEC, we are 
 wondering if we should make it open for early preview.)

 Thank you in advance,
 Takeshi Miyamae


 
 --
 Loïc Dachary, Artisan Logiciel Libre
 
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: idempotent op (esp delete)

2015-01-26 Thread Sage Weil
On Mon, 26 Jan 2015, Samuel Just wrote:
 The pg_log_t variant does seem to be cleaner.

I forgot, here is the danger: on promote and flush (copy-from) we do an 
O(n) scan of the pg log to assemble the reqids for that object.  Current 
default is 1000 entries in the log.

We could do better than that in many cases by skipping along the 
prior_version values (ignoring creates for now) if we could jump to a log 
entry by eversion_t, but it's a list, not a map.  Perhaps we 
could change it to a deque in memory to allow that sort of semi-random 
access?
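The idea of hopping along prior_version links, assuming an indexable container, can be sketched like this (toy C; array indices stand in for eversion_t, and the array stands in for the deque):

```c
#include <assert.h>
#include <stddef.h>

/* Each log entry records the position (standing in for prior_version)
 * of the object's previous entry, so with random access we visit only
 * this object's k entries instead of scanning all n log entries. */
struct log_entry {
    int  object_id;
    long reqid;
    long prior;     /* index of previous entry for this object, -1 if none */
};

/* Collect up to max reqids for the object whose newest entry is at
 * index head, newest first. */
size_t collect_reqids(const struct log_entry *log, long head,
                      long *out, size_t max)
{
    size_t n = 0;

    for (long i = head; i >= 0 && n < max; i = log[i].prior)
        out[n++] = log[i].reqid;
    return n;
}
```

Creates would terminate a chain the same way the -1 sentinel does here; the open question in the thread is only whether the memory layout (list vs deque) permits the jump.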

Or maybe it's not worth trying to optimize that at all given the 
frequency of promote/flush...?

sage



 -Sam
 
 On Mon, Jan 26, 2015 at 9:21 AM, Sage Weil sw...@redhat.com wrote:
  On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
  The downside of this approach is that we may need to search the pg_log
  for a specific object in every write io?
 
  Not quite.  IndexedLog maintains a hash_map of all of the request ids in
  the log, so it's just a hash lookup on each IO.  (Well, now 2 hash
  lookups, because I put the additional request IDs in a second auxiliary
  map to handle dups properly.  I think we can avoid that lookup if we use
  the request flags carefully, though.. the RETRY and REDIRECTED flags
  I think?  Need to check carefully.)
 
  Maybe we can combine this
  approach and the changes in PR 3447. For the flush case when the object
  is deleted in the base, we search the pg_log for dup op. This should be
  rare cases. Otherwise the object exists, we check the reqid list in the
  object_info_t for dup op.
 
  We could do a hybrid approach, but there is some cost to the per-object
  tracking: a tiny bit more memory, and an O(n) search of the items in that
  list (~10 or 20?) for the dup check.  I suspect the hash lookup is
  cheaper?  And simpler.
 
  sage
 
 
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org 
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
  Sent: Monday, January 26, 2015 10:35 AM
  To: Sage Weil; Gregory Farnum
  Cc: ceph-devel@vger.kernel.org
  Subject: RE: idempotent op (esp delete)
 
  This method puts the reqid list in the pg_log instead of the 
  object_info_t, so that it's preserved even in the delete case, which 
  sounds more reasonable.
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org 
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Saturday, January 24, 2015 6:19 AM
  To: Gregory Farnum
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: idempotent op (esp delete)
 
  On Fri, 23 Jan 2015, Gregory Farnum wrote:
   On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil sw...@redhat.com wrote:
Background:
   
1) Way back when we made a task that would thrash the cache modes by
adding and removing the cache tier while ceph_test_rados was running.
This mostly worked, but would occasionally fail because we would
   
 - delete an object from the cache tier
 - a network failure injection would lose the reply
 - we'd disable the cache
 - the delete would resend to the base tier, not get recognized as a
dup (different pool, different pg log)
   - -ENOENT instead of 0
   
2) The proxy write code hits a similar problem:
   
 - delete gets proxied
 - we initiate async promote
 - a network failure injection loses the delete reply
 - delete resends and blocks on promote (or arrives after it
finishes)
 - promote finishes
 - delete is handled
  - -ENOENT instead of 0
   
The ticket is http://tracker.ceph.com/issues/8935
   
The problem is partially addressed by
   
https://github.com/ceph/ceph/pull/3447
   
by logging a few request ids on every object_info_t and preserving
that on promote and flush.
   
However, it doesn't solve the problem for delete because we throw
out object_info_t so that reqid_t is lost.
   
I think we have two options, not necessarily mutually exclusive:
   
1) When promoting an object that doesn't exist (to create a
whiteout), pull reqids out of the base tier's pg log so that the
whiteout is primed with request ids.
   
1.5) When flushing... well, that is harder because we have nowhere
to put the reqids.  Unless we make a way to cram a list of reqid's
into a single PG log entry...?  In that case, we wouldn't strictly
need the per-object list since we could pile the base tier's reqids
into the promote log entry in the cache tier.
   
2) Make delete idempotent (0 instead of ENOENT if the object doesn't
exist).  This will require a delicate compat transition (let's
ignore that a moment) but you can preserve the old behavior for
callers that care by preceding the delete with an assert_exists op.
Most callers don't care, but a handful do.  This simplifies the
semantics we need to support going forward.
   
Of course, it's all a bit delicate.  The idempotent op semantics