RE: severe librbd performance degradation in Giant
Sage, any reason why the cache is enabled by default in Giant? Regarding profiling, I will see if I can run VTune/mutrace on this.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 17, 2014 8:53 PM
To: Somnath Roy
Cc: Haomai Wang; Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

On Thu, 18 Sep 2014, Somnath Roy wrote:
> Yes Haomai...

I would love to see what a profiler says about the matter. There is going to be some overhead on the client associated with the cache for a random io workload, but 10x is a problem!

sage

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, September 17, 2014 7:28 PM
To: Somnath Roy
Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org
Subject: Re: severe librbd performance degradation in Giant

According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will cause a 10x performance degradation for random read?

On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote:
> Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly.

-----Original Message-----
From: Somnath Roy
Sent: Wednesday, September 17, 2014 2:44 PM
To: Sage Weil
Cc: Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

Created a tracker for this: http://tracker.ceph.com/issues/9513

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, September 17, 2014 2:39 PM
To: Sage Weil
Cc: Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

Sage, it's a 4K random read.
Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 17, 2014 2:36 PM
To: Somnath Roy
Cc: Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential.

sage

On Wed, 17 Sep 2014, Somnath Roy wrote:
> I set the following in the client-side /etc/ceph/ceph.conf where I am running fio rbd:
>
>     rbd_cache_writethrough_until_flush = false
>
> But, no difference. BTW, I am doing random read, not write. Does this setting still apply?
>
> Next, I tried setting rbd_cache to false and I *got back* the old performance. Now, it is similar to Firefly throughput! So, it looks like rbd_cache=true was the culprit. Thanks Josh!
>
> Regards
> Somnath

-----Original Message-----
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 17, 2014 2:20 PM
To: Somnath Roy; ceph-devel@vger.kernel.org
Subject: Re: severe librbd performance degradation in Giant

On 09/17/2014 01:55 PM, Somnath Roy wrote:
> Hi Sage,
> We are experiencing severe librbd performance degradation in Giant over the Firefly release. Here is the experiment we did to isolate it as a librbd problem:
>
> 1. Single OSD is running latest Giant and the client is running fio rbd on top of Firefly-based librbd/librados. For one client it gives ~11-12K iops (4K RR).
> 2. Single OSD is running Giant and the client is running fio rbd on top of Giant-based librbd/librados. For one client it gives ~1.9K iops (4K RR).
> 3. Single OSD is running latest Giant and the client is running Giant-based ceph_smalliobench on top of Giant librados. For one client it gives ~11-12K iops (4K RR).
> 4. Giant RGW on top of Giant OSD is also scaling.
>
> So, it is obvious from the above that recent librbd has issues. I will raise a tracker to track this.
For Giant the default cache settings changed to:

    rbd cache = true
    rbd cache writethrough until flush = true

If fio isn't sending flushes as the test is running, the cache will stay in writethrough mode. Does the difference remain if you set rbd cache writethrough until flush = false?

Josh

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or
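Spelled out as a client-side ceph.conf fragment (the [client] section placement is an assumption; the option names are the ones quoted above), the suggested test of disabling the flush-gated writethrough behaviour would look like:

```ini
[client]
rbd cache = true
rbd cache writethrough until flush = false
```

Setting rbd cache = false instead disables the cache entirely, which is what restored the Firefly-level numbers in the test described above.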
How to use radosgw-admin to delete some or all users?
Hi all, I know radosgw-admin can delete one user with the command 'radosgw-admin user rm --uid=xxx'. Is there a command to delete multiple users, or all users at once? Thanks.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
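There is no single bulk-delete command that I know of, but a shell loop over the user list can do it. A hedged sketch, assuming 'radosgw-admin user list' emits a JSON array of uids and that jq is installed; verify both on your version before running this against real data:

```shell
#!/bin/sh
# ASSUMPTION: 'radosgw-admin user list' prints a JSON array of uids.
# --purge-data also removes each user's buckets and objects; drop it
# to keep the data and only remove the user records.
for uid in $(radosgw-admin user list | jq -r '.[]'); do
    echo "removing user: $uid"
    radosgw-admin user rm --uid="$uid" --purge-data
done
```

To delete only a subset, replace the user-list pipeline with a file of uids, one per line.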
RE: severe librbd performance degradation in Giant
Same question as Somnath. Some of our customers are not that comfortable with the cache; they still have some consistency concerns.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 18, 2014 2:25 PM
To: Sage Weil
Cc: Haomai Wang; Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

Sage, any reason why the cache is enabled by default in Giant? Regarding profiling, I will see if I can run VTune/mutrace on this.

Thanks & Regards
Somnath

[...]
Re: severe librbd performance degradation in Giant
> According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will cause a 10x performance degradation for random read?

Hi, on my side I don't see any read performance degradation (seq or rand) with or without the cache:

firefly: around 12000 iops (with or without rbd_cache)
giant: around 12000 iops (with or without rbd_cache)

(and I can reach around 2-3 iops on giant with the optracker disabled). rbd_cache only improves write performance for me (4k blocks).

----- Original Message -----
From: Haomai Wang haomaiw...@gmail.com
To: Somnath Roy somnath@sandisk.com
Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org
Sent: Thursday, 18 September 2014 04:27:56
Subject: Re: severe librbd performance degradation in Giant

[...]
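For anyone trying to reproduce these numbers, a 4K random-read job against fio's rbd engine might look like the sketch below (pool and image names are placeholders; the ioengine option names follow fio's stock rbd example job):

```ini
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
rw=randread
bs=4k
runtime=60
time_based

[rbd_iodepth32]
iodepth=32
```

The cache-related behaviour is then controlled entirely from the client's ceph.conf, so the same job file can be run with rbd_cache on and off for an A/B comparison.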
Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
Hi Kevin,

On 2014-09-16 11:25:12 -0700, Kevin Greenan wrote:
> I feel that separating the arch-specific implementations out and having a default 'generic' implementation would be a huge improvement. Note that gf-complete was in active development for some time before including the SIMD code. In hindsight, we should have done this separation back in 2012, but had some time pressure due to a paper deadline and limited time available to the contributors. Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is fine by me.

I'll rename them and start implementing NEON-optimized functions in their own files.

> Also, there should be very little SIMD work with jerasure, as gf-complete is the Galois field backend, so I would not worry too much about that.

I noticed; I have already hooked my NEON code up locally in ceph without touching jerasure.

> That covers clean-up work. We can discuss the best way to choose the underlying implementation (looks like we have a bunch of options) as this work is completed. With this in mind, what work were you planning to do? I can try to free up cycles to help, but that may not happen for a few weeks.

Primarily NEON optimisations for gf-complete/ceph. Shouldn't take more than a few days though.

> One last thing... If you do have code you want to push upstream, please submit a pull request(s) to our main bitbucket repo. Make sense?

Yes, thanks.

Janne
v2 aligned buffer changes for erasure codes
Hi,

following is an updated patchset. It now passes make check in src.

It has the following changes:

* use 32-byte alignment since the isa plugin uses AVX2 (src/erasure-code/isa/README claims it needs 16*k byte aligned buffers, but I can't see a reason why it would need more than 32 bytes)
* ErasureCode::encode_prepare() handles more than one chunk with padding

cheers

Janne
[PATCH v2 1/3] buffer: add an aligned buffer with less alignment than a page
SIMD optimized erasure code computation needs aligned memory. Buffers aligned to a page boundary are wasted on it though; the buffers used for the erasure code computation are typically smaller than a page. An alignment of 32 bytes is chosen to satisfy the needs of AVX/AVX2. Could be made arch-specific to reduce the alignment to 16 bytes for arm/aarch64 NEON.

Signed-off-by: Janne Grunau j...@jannau.net
---
 configure.ac         |   9 +
 src/common/buffer.cc | 100 +++
 src/include/buffer.h |  10 ++
 3 files changed, 119 insertions(+)

diff --git a/configure.ac b/configure.ac
index cccf2d9..1bb27c4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -793,6 +793,15 @@
 ])

 #
+# Check for functions to provide aligned memory
+#
+AC_CHECK_HEADERS([malloc.h])
+AC_CHECK_FUNCS([posix_memalign _aligned_malloc memalign aligned_malloc],
+               [found_memalign=yes; break])
+
+AS_IF([test "x$found_memalign" != "xyes"], [AC_MSG_WARN([No function for aligned memory allocation found])])
+
+#
 # Check for pthread spinlock (depends on ACX_PTHREAD)
 #
 saved_LIBS="$LIBS"

diff --git a/src/common/buffer.cc b/src/common/buffer.cc
index b141759..acc221f 100644
--- a/src/common/buffer.cc
+++ b/src/common/buffer.cc
@@ -30,6 +30,10 @@
 #include <sys/uio.h>
 #include <limits.h>

+#ifdef HAVE_MALLOC_H
+#include <malloc.h>
+#endif
+
 namespace ceph {

 #ifdef BUFFER_DEBUG
@@ -155,9 +159,15 @@
     virtual int zero_copy_to_fd(int fd, loff_t *offset) { return -ENOTSUP; }

+    virtual bool is_aligned() {
+      return ((long)data & ~CEPH_ALIGN_MASK) == 0;
+    }
     virtual bool is_page_aligned() {
       return ((long)data & ~CEPH_PAGE_MASK) == 0;
     }
+    bool is_n_align_sized() {
+      return (len & ~CEPH_ALIGN_MASK) == 0;
+    }
     bool is_n_page_sized() {
       return (len & ~CEPH_PAGE_MASK) == 0;
     }
@@ -209,6 +219,41 @@
   };

+  class buffer::raw_aligned : public buffer::raw {
+  public:
+    raw_aligned(unsigned l) : raw(l) {
+      if (len) {
+#if HAVE_POSIX_MEMALIGN
+        if (posix_memalign((void **)&data, CEPH_ALIGN, len))
+          data = 0;
+#elif HAVE__ALIGNED_MALLOC
+        data = _aligned_malloc(len, CEPH_ALIGN);
+#elif HAVE_MEMALIGN
+        data = memalign(CEPH_ALIGN, len);
+#elif HAVE_ALIGNED_MALLOC
+        data = aligned_malloc((len + CEPH_ALIGN - 1) & ~CEPH_ALIGN_MASK,
+                              CEPH_ALIGN);
+#else
+        data = malloc(len);
+#endif
+        if (!data)
+          throw bad_alloc();
+      } else {
+        data = 0;
+      }
+      inc_total_alloc(len);
+      bdout << "raw_aligned " << this << " alloc " << (void *)data << " " << l << " " << buffer::get_total_alloc() << bendl;
+    }
+    ~raw_aligned() {
+      free(data);
+      dec_total_alloc(len);
+      bdout << "raw_aligned " << this << " free " << (void *)data << " " << buffer::get_total_alloc() << bendl;
+    }
+    raw* clone_empty() {
+      return new raw_aligned(len);
+    }
+  };
+
 #ifndef __CYGWIN__
   class buffer::raw_mmap_pages : public buffer::raw {
   public:
@@ -334,6 +379,10 @@
       return true;
     }

+    bool is_aligned() {
+      return false;
+    }
+
     bool is_page_aligned() {
       return false;
     }
@@ -520,6 +569,9 @@
   buffer::raw* buffer::create_static(unsigned len, char *buf) {
     return new raw_static(buf, len);
   }
+  buffer::raw* buffer::create_aligned(unsigned len) {
+    return new raw_aligned(len);
+  }
   buffer::raw* buffer::create_page_aligned(unsigned len) {
 #ifndef __CYGWIN__
     //return new raw_mmap_pages(len);
@@ -1013,6 +1065,16 @@
     return true;
   }

+  bool buffer::list::is_aligned() const
+  {
+    for (std::list<ptr>::const_iterator it = _buffers.begin();
+         it != _buffers.end();
+         ++it)
+      if (!it->is_aligned())
+        return false;
+    return true;
+  }
+
   bool buffer::list::is_page_aligned() const
   {
     for (std::list<ptr>::const_iterator it = _buffers.begin();
@@ -1101,6 +1163,44 @@
     _buffers.push_back(nb);
   }

+  void buffer::list::rebuild_aligned()
+  {
+    std::list<ptr>::iterator p = _buffers.begin();
+    while (p != _buffers.end()) {
+      // keep anything that's already aligned and align-sized
+      if (p->is_aligned() && p->is_n_align_sized()) {
+        /* cout << " segment " << (void*)p->c_str()
+                << " offset " << ((unsigned long)p->c_str() & ~CEPH_ALIGN_MASK)
+                << " length " << p->length()
+                << " " << (p->length() & ~CEPH_ALIGN_MASK) << " ok" << std::endl;
+        */
+        ++p;
+        continue;
+      }
+
+      // consolidate unaligned items, until
[PATCH v2 2/3] ec: use 32-byte aligned buffers
Requiring page aligned buffers and realigning the input if necessary creates measurable overhead. ceph_erasure_code_benchmark is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.

Also prevents a misaligned buffer when bufferlist::c_str(bufferlist) has to allocate a new buffer to provide a contiguous one. See bug #9408.

Signed-off-by: Janne Grunau j...@jannau.net
---
 src/erasure-code/ErasureCode.cc | 57 ++++++++++++++++++++++-----------
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
index 5953f49..7aa5235 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,49 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
 }

 int ErasureCode::encode_prepare(const bufferlist &raw,
-                                bufferlist *prepared) const
+                                map<int, bufferlist> &encoded) const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length() > 0) {
-    bufferptr pad(padded_length - raw.length());
-    pad.zero();
-    prepared->push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+  unsigned padded_chunks = k - raw.length() / blocksize;
+  bufferlist prepared = raw;
+
+  if (!prepared.is_aligned()) {
+    // splice padded chunks off to make the rebuild faster
+    if (padded_chunks)
+      prepared.splice((k - padded_chunks) * blocksize,
+                      padded_chunks * blocksize - pad_len);
+    prepared.rebuild_aligned();
+  }
+
+  for (unsigned int i = 0; i < k - padded_chunks; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.substr_of(prepared, i * blocksize, blocksize);
+  }
+  if (padded_chunks) {
+    unsigned remainder = raw.length() - (k - padded_chunks) * blocksize;
+    bufferlist padded;
+    bufferptr buf(buffer::create_aligned(padded_chunks * blocksize));
+
+    raw.copy((k - padded_chunks) * blocksize, remainder, buf.c_str());
+    buf.zero(remainder, pad_len);
+    padded.push_back(buf);
+
+    for (unsigned int i = k - padded_chunks; i < k; i++) {
+      int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+      bufferlist &chunk = encoded[chunk_index];
+      chunk.substr_of(padded, (i - (k - padded_chunks)) * blocksize, blocksize);
+    }
+  }
+  for (unsigned int i = k; i < k + m; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.push_back(buffer::create_aligned(blocksize));
   }
-  unsigned coding_length = blocksize * m;
-  bufferptr coding(buffer::create_page_aligned(coding_length));
-  prepared->push_back(coding);
-  prepared->rebuild_page_aligned();
+
   return 0;
 }

@@ -80,15 +107,9 @@ int ErasureCode::encode(const set<int> &want_to_encode,
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   bufferlist out;
-  int err = encode_prepare(in, out);
+  int err = encode_prepare(in, *encoded);
   if (err)
     return err;
-  unsigned blocksize = get_chunk_size(in.length());
-  for (unsigned int i = 0; i < k + m; i++) {
-    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
-    bufferlist &chunk = (*encoded)[chunk_index];
-    chunk.substr_of(out, i * blocksize, blocksize);
-  }
   encode_chunks(want_to_encode, encoded);
   for (unsigned int i = 0; i < k + m; i++) {
     if (want_to_encode.count(i) == 0)

diff --git a/src/erasure-code/ErasureCode.h b/src/erasure-code/ErasureCode.h
index 7aaea95..62aa383 100644
--- a/src/erasure-code/ErasureCode.h
+++ b/src/erasure-code/ErasureCode.h
@@ -46,7 +46,8 @@ namespace ceph {
                const map<int, int> &available,
                set<int> *minimum);

-    int encode_prepare(const bufferlist &raw, bufferlist *prepared) const;
+    int encode_prepare(const bufferlist &raw,
+                       map<int, bufferlist> &encoded) const;

     virtual int encode(const set<int> &want_to_encode,
                        const bufferlist &in,
--
2.1.0
RE: v2 aligned buffer changes for erasure codes
Hi Janne,

> (src/erasure-code/isa/README claims it needs 16*k byte aligned buffers)

I should update the README since it is misleading. It should say 8*k or 16*k byte aligned chunk size, depending on the compiler/platform used; it is not the alignment of the allocated buffer addresses. The get_alignment function in the plug-in is used to compute the chunk size for the encoding (as I said, not the start address alignment).

If you pass k buffers for decoding, each buffer should be aligned to at least 16 or, as you pointed out, better 32 bytes.

For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k.

For the best possible performance on all platforms we should change the get_alignment function in the ISA plug-in to return 32*k, if there are no other objections?!

Cheers Andreas.

From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on behalf of Janne Grunau [j...@jannau.net]
Sent: 18 September 2014 12:33
To: ceph-devel@vger.kernel.org
Subject: v2 aligned buffer changes for erasure codes

[...]
RE: v2 aligned buffer changes for erasure codes
Hi Janne/Loic,

there is more confusion, at least on my side... I have now had a look at the jerasure plug-in and I am slightly confused why get_alignment has two ways to return a value... one is as I assumed, and another one is per_chunk_alignment... what should the function return, Loic?

Cheers Andreas.

From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on behalf of Andreas Joachim Peters [andreas.joachim.pet...@cern.ch]
Sent: 18 September 2014 14:18
To: Janne Grunau; ceph-devel@vger.kernel.org
Subject: RE: v2 aligned buffer changes for erasure codes

[...]
Re: severe librbd performance degradation in Giant
On 09/18/2014 04:49 AM, Alexandre DERUMIER wrote: According http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? Hi, on my side, I don't see any degradation performance on read (seq or rand) with or without. firefly : around 12000iops (with or without rbd_cache) giant : around 12000iops (with or without rbd_cache) (and I can reach around 2-3 iops on giant with disabling optracker). rbd_cache only improve write performance for me (4k block ) I can't do it right now since I'm in the middle of reinstalling fedora on the test nodes, but I will try to replicate this as well if we haven't figured it out before hand. Mark - Mail original - De: Haomai Wang haomaiw...@gmail.com À: Somnath Roy somnath@sandisk.com Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org Envoyé: Jeudi 18 Septembre 2014 04:27:56 Objet: Re: severe librbd performance degradation in Giant According http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Created a tracker for this. http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Sage, It's a 4K random read. 
Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequentail s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd. rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing Random read, not write. Still this setting applies ? Next, I tried to tweak the rbd_cache setting to false and I *got back* the old performance. Now, it is similar to firefly throughput ! So, loks like rbd_cache=true was the culprit. Thanks Josh ! Regards Somnath -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Wednesday, September 17, 2014 2:20 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over firefly release. Here is the experiment we did to isolate it as a librbd problem. 1. Single OSD is running latest Giant and client is running fio rbd on top of firefly based librbd/librados. For one client it is giving ~11-12K iops (4K RR). 2. Single OSD is running Giant and client is running fio rbd on top of Giant based librbd/librados. For one client it is giving ~1.9K iops (4K RR). 3. Single OSD is running latest Giant and client is running Giant based ceph_smaiobench on top of giant librados. For one client it is giving ~11-12K iops (4K RR). 4. Giant RGW on top of Giant OSD is also scaling. So, it is obvious from the above that recent librbd has issues. I will raise a tracker to track this. 
For giant the default cache settings changed to: rbd cache = true rbd cache writethrough until flush = true If fio isn't sending flushes as the test is running, the cache will stay in writethrough mode. Does the difference remain if you set rbd cache writethrough until flush = false? Josh
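For reference, the Giant defaults Josh describes correspond to a client-side ceph.conf fragment like the following; the commented overrides are the two knobs tested in this thread (this is a sketch of the settings named above, not a recommended production config):

```ini
[client]
# Giant defaults: cache on, but held in writethrough mode
# until the first flush arrives from the application.
rbd cache = true
rbd cache writethrough until flush = true

# To test without the flush gate (writeback immediately):
# rbd cache writethrough until flush = false
# To take the cache out of the picture entirely:
# rbd cache = false
```

Note that fio-style benchmarks typically never issue a flush, which is why the cache stays in writethrough mode for the whole run with the defaults.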
Re: v2 aligned buffer changes for erasure codes
Hi, On 2014-09-18 12:18:59 +, Andreas Joachim Peters wrote: = (src/erasure-code/isa/README claims it needs 16*k byte aligned buffers I should update the README since it is misleading ... it should say 8*k or 16*k byte aligned chunk size depending on the compiler/platform used; it is not the alignment of the allocated buffer addresses. The get_alignment function in the plug-in is used to compute the chunk size for the encoding (as I said, not the start address alignment). I've seen that if you pass k buffers for decoding, each buffer should be aligned at least to 16 or, as you pointed out, better 32 bytes. ok, that makes sense For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k. I don't get that. How is the buffer split? Into k (+ m) chunk-size parts? As long as the start and the length are both 16 (or 32) byte aligned, all parts are properly aligned too. I don't see where the k comes into play. cheers Janne -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
snap_trimming + backfilling is inefficient with many purged_snaps
(moving this discussion to -devel) Begin forwarded message: From: Florian Haas flor...@hastexo.com Date: 17 Sep 2014 18:02:09 CEST Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU To: Dan Van Der Ster daniel.vanders...@cern.ch Cc: Craig Lewis cle...@centraldesktop.com, ceph-us...@lists.ceph.com ceph-us...@lists.ceph.com On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: From: Florian Haas flor...@hastexo.com Sent: Sep 17, 2014 5:33 PM To: Dan Van Der Ster Cc: Craig Lewis cle...@centraldesktop.com;ceph-us...@lists.ceph.com Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Hi Florian, On 17 Sep 2014, at 17:09, Florian Haas flor...@hastexo.com wrote: Hi Craig, just dug this up in the list archives. On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote: In the interest of removing variables, I removed all snapshots on all pools, then restarted all ceph daemons at the same time. This brought up osd.8 as well. So just to summarize this: your 100% CPU problem at the time went away after you removed all snapshots, and the actual cause of the issue was never found? I am seeing a similar issue now, and have filed http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost again. Can you take a look at that issue and let me know if anything in the description sounds familiar? Could your ticket be related to the snap trimming issue I’ve finally narrowed down in the past couple days? http://tracker.ceph.com/issues/9487 Bump up debug_osd to 20 then check the log during one of your incidents. If it is busy logging the snap_trimmer messages, then it’s the same issue. (The issue is that rbd pools have many purged_snaps, but sometimes after backfilling a PG the purged_snaps list is lost and thus the snap trimmer becomes very busy whilst re-trimming thousands of snaps. 
During that time (a few minutes on my cluster) the OSD is blocked.) That sounds promising, thank you! debug_osd=10 should actually be sufficient as those snap_trim messages get logged at that level. :) Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Looking forward to any ideas someone might have. 
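Dan's reproduction recipe might translate into commands roughly like the following (pool name, snapshot count, and OSD id are my placeholders, not from the thread; this is a sketch, not a tested reproducer):

```shell
# Create many pool snapshots of a nearly empty pool, then remove them all
for i in $(seq 1 1000); do ceph osd pool mksnap rbd-test snap$i; done
for i in $(seq 1 1000); do ceph osd pool rmsnap rbd-test snap$i; done

# Move PGs around so they backfill onto other OSDs
ceph osd crush reweight osd.0 0.5

# With debug_osd=10, look for the re-trimming signature in the OSD log
grep 'adding snap .* to purged_snaps' /var/log/ceph/ceph-osd.0.log
```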
Cheers, Dan
Re: Fwd: S3 API Compatibility support
Hi, Could you please check and clarify the questions below on object lifecycle and notification S3 API support: 1. To support the bucket lifecycle, we need to support moving/deleting objects/buckets based on their lifecycle settings. For example, if an object's lifecycle is set as below: 1. Archive it after 10 days - i.e., move this object to low-cost object storage 10 days after the creation date. 2. Remove this object after 90 days - i.e., remove this object from the low-cost storage 90 days after the creation date. Q1 - Does Ceph support the above concept, moving to low-cost storage and deleting from that storage? 2. To support object notifications: first there should be a low-cost, high-availability storage type with a single replica only. An object created on this type of storage could be lost, so if such an object is lost, a notification should be sent. Q2 - Does Ceph support a low-cost, high-availability storage type? Thanks Swami On Tue, Jul 29, 2014 at 1:35 AM, Yehuda Sadeh yeh...@redhat.com wrote: Bucket lifecycle: http://tracker.ceph.com/issues/8929 Bucket notification: http://tracker.ceph.com/issues/8956 On Sun, Jul 27, 2014 at 12:54 AM, M Ranga Swami Reddy swamire...@gmail.com wrote: Good to know the details. Can you please share the issue ID for bucket lifecycle? My team could also start to help here. Regarding the notification - do we have an issue ID? Yes, the object versioning will be a backlog item - I strongly feel we should start working on this asap. Thanks Swami On Fri, Jul 25, 2014 at 11:31 PM, Yehuda Sadeh yeh...@redhat.com wrote: On Fri, Jul 25, 2014 at 10:14 AM, M Ranga Swami Reddy swamire...@gmail.com wrote: Thanks for the quick reply. Yes, versioned objects are missing in Ceph ATM. I am looking for: bucket lifecycle (get/put/delete), bucket location, put object notification and object restore (i.e. versioned objects) S3 API support. Please let me know if any of the above work is in progress or if someone has planned to work on it. I opened an issue for bucket lifecycle (we already had an issue open for object expiration, though). We do have bucket location already (part of the multi-region feature). Object versioning is definitely on our backlog and one that we'll hopefully implement sooner rather than later. With regard to object notification, it'll require having a notification service, which is a bit out of scope. Integrating the gateway with such a service wouldn't be hard, but we'll need to have that first. Yehuda Thanks Swami On Fri, Jul 25, 2014 at 9:19 PM, Sage Weil sw...@redhat.com wrote: On Fri, 25 Jul 2014, M Ranga Swami Reddy wrote: Hi Team: As per the Ceph documentation, a few S3 API compatibility items are not supported. Link: http://ceph.com/docs/master/radosgw/s3/ Is there a plan to support the unsupported items in the above table, or is anyone working on this? Yes. 
Unfortunately this table isn't particularly detailed or accurate or up to date. The main gap, I think, is versioned objects. Are there specific parts of the S3 API that are missing that you need? That sort of info is very helpful for prioritizing effort... sage
Re: v2 aligned buffer changes for erasure codes
Hi, On 2014-09-18 12:34:49 +, Andreas Joachim Peters wrote: there is more confusion, at least on my side ... I have now had a look at the jerasure plug-in and I am slightly confused why you have two ways to return in get_alignment ... one is as I assume and another one is per_chunk_alignment ... what should the function return, Loic? the per_chunk_alignment is just a bool which says that each chunk has to start at an aligned address. get_alignment() seems to be used to align the chunk size. It might come from gf-complete's strange alignment requirements. Instead of requiring aligned buffers it requires that the src and dst buffers have the same remainder when divided by 16. The best way to achieve that is to align the length to 16 and use a single buffer. I agree it's convoluted. Janne
RE: v2 aligned buffer changes for erasure codes
Hi Janne, For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k. I don't get that. How is the buffer split? Into k (+ m) chunk-size parts? As long as the start and the length are both 16 (or 32) byte aligned, all parts are properly aligned too. I don't see where the k comes into play. The original data block to encode has to be split into k equally long pieces. Each piece is given as one of the k input buffers to the erasure code algorithm producing m output buffers, and each piece has to have an aligned starting address and length. If you deal with a 128 byte data input buffer for k=4 it splits like offset=00 len=32 as chunk1 offset=32 len=32 as chunk2 offset=64 len=32 as chunk3 offset=96 len=32 as chunk4 If the desired IO size were 196 bytes, the 32 byte alignment requirement blows this buffer up to 256 bytes: offset=00 len=64 as chunk1 offset=64 len=64 as chunk2 offset=128 len=64 as chunk3 offset=192 len=64 as chunk4 For the typical 4kb only k=2,4,8,16,32,64,128 do not increase the buffer. If someone configures e.g. k=10 the buffer is increased from 4096 to 4160 bytes and it creates 1.5% storage volume overhead. Cheers Andreas.
Re: v2 aligned buffer changes for erasure codes
On 2014-09-18 13:01:03 +, Andreas Joachim Peters wrote: For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k. I don't get that. How is the buffer split? Into k (+ m) chunk-size parts? As long as the start and the length are both 16 (or 32) byte aligned, all parts are properly aligned too. I don't see where the k comes into play. The original data block to encode has to be split into k equally long pieces. Each piece is given as one of the k input buffers to the erasure code algorithm producing m output buffers, and each piece has to have an aligned starting address and length. If you deal with a 128 byte data input buffer for k=4 it splits like offset=00 len=32 as chunk1 offset=32 len=32 as chunk2 offset=64 len=32 as chunk3 offset=96 len=32 as chunk4 If the desired IO size were 196 bytes, the 32 byte alignment requirement blows this buffer up to 256 bytes: offset=00 len=64 as chunk1 offset=64 len=64 as chunk2 offset=128 len=64 as chunk3 offset=192 len=64 as chunk4 I fail to see how the 32 * k is related to alignment. It's only used to pad the total size so it becomes a multiple of k * 32. That is ok since we want k 32-byte aligned chunks. The alignment for each chunk is just 32 bytes. Janne
RE: severe librbd performance degradation in Giant
On Thu, 18 Sep 2014, Somnath Roy wrote: Sage, Any reason why the cache is by default enabled in Giant? It's recommended practice to turn it on. It improves performance in general (especially with HDD OSDs). Do you mind comparing sequential small IOs? sage Regarding profiling, I will try if I can run Vtune/mutrace on this. Thanks Regards Somnath
RE: v2 aligned buffer changes for erasure codes
I fail to see how the 32 * k is related to alignment. It's only used to pad the total size so it becomes a multiple of k * 32. That is ok since we want k 32-byte aligned chunks. The alignment for each chunk is just 32 bytes. Yes, agreed! The alignment for each chunk should be 32 bytes. And the implementation is most efficient if the given encoding buffer is already padded to k*32 bytes; it avoids an additional buffer allocation and copy. Cheers Andreas.
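The padding arithmetic the thread converges on can be sketched as follows (the function name is mine, not from the Ceph tree; alignment defaults to the 32 bytes discussed above): pad the total buffer to a multiple of k * 32 so that each of the k equal chunks starts at a 32-byte-aligned offset.

```python
def padded_chunk_layout(data_len, k, alignment=32):
    """Pad data_len up to a multiple of k * alignment and return
    (padded_size, chunk_size, chunk_offsets). Every chunk then starts
    at an offset that is a multiple of `alignment`, as the SIMD
    erasure-code kernels require."""
    unit = k * alignment
    padded = ((data_len + unit - 1) // unit) * unit  # round up
    chunk = padded // k
    return padded, chunk, [i * chunk for i in range(k)]

# Andreas's examples: 128 bytes with k=4 splits without padding...
print(padded_chunk_layout(128, 4))   # (128, 32, [0, 32, 64, 96])
# ...while a 196-byte IO is blown up to 256 bytes:
print(padded_chunk_layout(196, 4))   # (256, 64, [0, 64, 128, 192])
# k=10 pads a 4096-byte object to 4160 bytes (~1.5% overhead):
print(padded_chunk_layout(4096, 10)[0])  # 4160
```

This also shows why k=2,4,8,...,128 add no padding for 4 KiB objects: 4096 is already a multiple of k * 32 exactly when k divides 128.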
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). 
Hmmm, I'm not sure I can confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snap_trimq and/or the purged_snaps list as well. Looking forward to any ideas someone might have. So am I. :) Cheers, Florian
Re: radosgw-admin list users?
On Thu, Sep 18, 2014 at 10:27 AM, Robin H. Johnson robb...@gentoo.org wrote: Related to this thread, radosgw-admin doesn't seem to have anything to list the users. The closest I have as a hack is: rados ls --pool=.users.uid | sed 's,.buckets$,,g' | sort | uniq Try: $ radosgw-admin metadata list user Yehuda But this does require internal knowledge of how it's stored, and I don't want to rely on it. On Thu, Sep 18, 2014 at 03:53:27PM +0800, Zhao zhiming wrote: HI ALL, I know radosgw-admin can delete one user with the command ‘radosgw-admin user rm uid=xxx’; I want to know whether there are commands to delete multiple or all users? thanks. -- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
RE: severe librbd performance degradation in Giant
Alexandre, What tool are you using? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache = true set in the client conf file. FYI, firefly librbd + librados and a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the Giant librbd (if you have multiple copies around, which was the case for me) to reproduce it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without rbd_cache. firefly : around 12000 iops (with or without rbd_cache) giant : around 12000 iops (with or without rbd_cache) (and I can reach around 2-3 iops on giant with disabling optracker). rbd_cache only improves write performance for me (4k block)
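For anyone trying to reproduce the 4K random-read numbers with the same tool, a minimal fio job of the sort used in this thread might look like this (pool name, image name, and client name are assumptions, not taken from the thread):

```ini
; 4K random read against an existing RBD image via librbd
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
invalidate=0

[rand-read-4k]
rw=randread
bs=4k
iodepth=32
runtime=60
time_based
```

Since fio's rbd engine links against whichever librbd the loader finds, it is worth double-checking (e.g. with ldd on the fio binary or LD_LIBRARY_PATH) that the intended Giant vs. Firefly librbd is actually being loaded, as Somnath notes.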
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour daniel.vanders...@cern.ch wrote: Hi Florian, On Sep 18, 2014 7:03 PM, Florian Haas flor...@hastexo.com wrote: Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. True, as am I. The current sleep is useful in this case, but since we'd normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect 90s slow requests like I have with the 3 snap_trimq single PG. Possibly the sleep is useful in both places. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. 
To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why the purged_snaps are being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure if I confirm that. I see adding snap X to purged_snaps, but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT messages... again confirming those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that will block the OSD until they are all re-trimmed. That's... a mess. So what is your workaround for recovery? 
My hunch would be to - stop all access to the cluster; - set nodown and noout so that other OSDs don't mark spinning OSDs down (which would cause all sorts of primary and PG reassignments, useless backfill/recovery when mon osd down out interval expires, etc.); - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 so that at least *between* PGs, the OSD has a chance to respond to heartbeats and do whatever else it needs to do; - let the snap trim play itself out over several hours (days?). That sounds utterly awful, but if anyone has a better idea (other than wait until the patch is merged), I'd be all ears. Cheers Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
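Florian's recovery hunch would roughly translate to the following admin commands. This is a sketch, not a tested procedure; the flag names (`nodown`, `noout`) and `injectargs` are the standard ceph CLI, but the sleep value is the illustrative one from the message:

```shell
# keep other OSDs from marking the trimming OSDs down/out
ceph osd set nodown
ceph osd set noout

# throttle trimming between PGs so heartbeats still get through
ceph tell osd.* injectargs '--osd_snap_trim_sleep 10'

# once trimming has played itself out, clear the flags again
ceph osd unset nodown
ceph osd unset noout
```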
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi, September 18 2014 9:03 PM, Florian Haas flor...@hastexo.com wrote: [earlier quoted exchange trimmed; it duplicates the preceding message verbatim] So what is your workaround for recovery? 
[quoted workaround suggestion trimmed] What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said. Cheers, Dan
Re: snap_trimming + backfilling is inefficient with many purged_snaps
-- Dan van der Ster || Data Storage Services || CERN IT Department -- September 18 2014 9:12 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: [earlier quoted exchange trimmed; it duplicates the preceding messages verbatim] 
[quoted workaround discussion trimmed] Two other more risky work-arounds that I didn't try yet are: 1. lower the osd_snap_trim_thread_timeout from 3600s to something like 10 or 20s, so that these long trim operations are just killed. I have no idea
Re: radosgw-admin list users?
On Thu, Sep 18, 2014 at 10:38:19AM -0700, Yehuda Sadeh wrote: On Thu, Sep 18, 2014 at 10:27 AM, Robin H. Johnson robb...@gentoo.org wrote: Related to this thread, radosgw-admin doesn't seem to have anything to list the users. The closest I have as a hack is: rados ls --pool=.users.uid |sed 's,.buckets$,,g' |sort |uniq Try: $ radosgw-admin metadata list user Ooh nice! Nothing in the --help output says 'metadata list' takes arguments (and the manpage for radosgw-admin doesn't even have the metadata commands). -- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: [earlier quoted exchange trimmed; it duplicates the preceding messages verbatim] 
[quoted workaround discussion trimmed] So just to clarify, what you're doing is out of the OSDs that are spinning, you mark 2 out and wait for them to go empty? What I'm seeing in my environment is that the OSDs *do* go down. Marking them out seems not to help much as the problem then promptly pops up elsewhere. So, disaster is a pretty good description. Would anyone from the
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, 19 Sep 2014, Florian Haas wrote: Hi Sage, was the off-list reply intentional? Whoops! Nope :) On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil sw...@redhat.com wrote: So, disaster is a pretty good description. Would anyone from the core team like to suggest another course of action or workaround, or are Dan and I generally on the right track to make the best out of a pretty bad situation? The short term fix would probably be to just prevent backfill for the time being until the bug is fixed. As in, osd max backfills = 0? Yeah :) Just managed to reproduce the problem... sage The root of the problem seems to be that it is trying to trim snaps that aren't there. I'm trying to reproduce the issue now! Hopefully the fix is simple... http://tracker.ceph.com/issues/9487 Thanks! sage Thanks. :) Cheers, Florian
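Sage's short-term mitigation (hold off backfill until the bug is fixed) would look something like this. Illustrative sketch only; `osd_max_backfills` is the standard option name, and the original default should be restored once the fix lands:

```shell
# stop new backfills cluster-wide while the snap-trim bug is outstanding
ceph tell osd.* injectargs '--osd-max-backfills 0'

# later, restore the default (1 in this era of Ceph)
ceph tell osd.* injectargs '--osd-max-backfills 1'
```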
RE: severe librbd performance degradation in Giant
I also observed performance degradation on my full SSD setup; I got ~270K IOPS for 4KB random read with 0.80.4, but with latest master I only got ~12K IOPS. Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Friday, September 19, 2014 2:03 AM To: Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Alexandre, What tool are you using ? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache = true is set in the client conf file. FYI, firefly librbd + librados against a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the giant librbd (if you have multiple copies around, which was the case for me) to reproduce it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without the cache. firefly : around 12000 iops (with or without rbd_cache) giant : around 12000 iops (with or without rbd_cache) (and I can reach around 2-3 iops on giant with disabling optracker). 
rbd_cache only improves write performance for me (4k block). [remainder of quoted thread trimmed; it duplicates the earlier messages verbatim]
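For anyone trying to reproduce these numbers, a 4K random-read job against librbd via fio's rbd ioengine looks roughly like this. The pool and image names are placeholders, and the queue depth is illustrative, not what the posters used:

```ini
; illustrative fio job file for the rbd ioengine
[rbd-4k-randread]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
rw=randread
bs=4k
iodepth=32
runtime=60
time_based=1
```

Run with `fio rbd-4k-randread.fio`; whether the client-side rbd cache is in play is controlled by the `[client]` section of the local ceph.conf, so both configurations can be compared with the same job file.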
RE: severe librbd performance degradation in Giant
My bad, with latest master we got ~120K IOPS. Cheers, xinxin [quoted thread trimmed; it duplicates the preceding message verbatim]
Re: radosgw-admin list users?
Thanks Robin and Yehuda, but I want to know how to delete multiple users. I use 'radosgw-admin metadata list user' to list all users, and found some users have unreadable characters: radosgw-admin metadata list user [ zzm1, ?zzm1, ?zzm1] and I can't delete these unreadable users: radosgw-admin user rm --uid=?zzm1 could not remove user: unable to remove user, user does not exist So I want to know: does radosgw-admin have a command to delete multiple or all users? Thanks. On Sep 19, 2014, at 4:01 AM, Robin H. Johnson robb...@gentoo.org wrote: [quoted exchange trimmed; it duplicates the preceding message verbatim]
Re: [PATCH 1/3] libceph: reference counting pagelist
On Tue, 16 Sep 2014, Yan, Zheng wrote:

this allow pagelist to present data that may be sent multiple times.

Signed-off-by: Yan, Zheng <z...@redhat.com>

Reviewed-by: Sage Weil <s...@redhat.com>

---
 fs/ceph/mds_client.c          | 1 -
 include/linux/ceph/pagelist.h | 5 ++++-
 net/ceph/messenger.c          | 4 +---
 net/ceph/pagelist.c           | 8 ++++++--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index a17fc49..30d7338 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2796,7 +2796,6 @@ fail:
 	mutex_unlock(&session->s_mutex);
 fail_nomsg:
 	ceph_pagelist_release(pagelist);
-	kfree(pagelist);
 fail_nopagelist:
 	pr_err("error %d preparing reconnect for mds%d\n", err, mds);
 	return;
diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h
index 9660d6b..5f871d8 100644
--- a/include/linux/ceph/pagelist.h
+++ b/include/linux/ceph/pagelist.h
@@ -2,6 +2,7 @@
 #define __FS_CEPH_PAGELIST_H
 
 #include <linux/list.h>
+#include <linux/atomic.h>
 
 struct ceph_pagelist {
 	struct list_head head;
@@ -10,6 +11,7 @@ struct ceph_pagelist {
 	size_t room;
 	struct list_head free_list;
 	size_t num_pages_free;
+	atomic_t refcnt;
 };
 
 struct ceph_pagelist_cursor {
@@ -26,9 +28,10 @@ static inline void ceph_pagelist_init(struct ceph_pagelist *pl)
 	pl->room = 0;
 	INIT_LIST_HEAD(&pl->free_list);
 	pl->num_pages_free = 0;
+	atomic_set(&pl->refcnt, 1);
 }
 
-extern int ceph_pagelist_release(struct ceph_pagelist *pl);
+extern void ceph_pagelist_release(struct ceph_pagelist *pl);
 
 extern int ceph_pagelist_append(struct ceph_pagelist *pl,
 				const void *d, size_t l);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e7d9411..9764c77 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3071,10 +3071,8 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 		return;
 
 	WARN_ON(!list_empty(&data->links));
-	if (data->type == CEPH_MSG_DATA_PAGELIST) {
+	if (data->type == CEPH_MSG_DATA_PAGELIST)
 		ceph_pagelist_release(data->pagelist);
-		kfree(data->pagelist);
-	}
 	kmem_cache_free(ceph_msg_data_cache, data);
 }
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 92866be..f70b651 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -1,5 +1,6 @@
 #include <linux/module.h>
 #include <linux/gfp.h>
+#include <linux/slab.h>
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
 #include <linux/ceph/pagelist.h>
@@ -13,8 +14,10 @@ static void ceph_pagelist_unmap_tail(struct ceph_pagelist *pl)
 	}
 }
 
-int ceph_pagelist_release(struct ceph_pagelist *pl)
+void ceph_pagelist_release(struct ceph_pagelist *pl)
 {
+	if (!atomic_dec_and_test(&pl->refcnt))
+		return;
 	ceph_pagelist_unmap_tail(pl);
 	while (!list_empty(&pl->head)) {
 		struct page *page = list_first_entry(&pl->head, struct page,
@@ -23,7 +26,8 @@ int ceph_pagelist_release(struct ceph_pagelist *pl)
 		__free_page(page);
 	}
 	ceph_pagelist_free_reserve(pl);
-	return 0;
+	kfree(pl);
+	return;
 }
 EXPORT_SYMBOL(ceph_pagelist_release);
--
1.9.3
Re: [PATCH 1/3] libceph: reference counting pagelist
On Tue, 16 Sep 2014, Yan, Zheng wrote:
this allow pagelist to present data that may be sent multiple times.

Hmm, actually we probably should use the kref code for this, even though the refcounting is trivial.

sage
Re: [PATCH 2/3] ceph: use pagelist to present MDS request data
On Tue, 16 Sep 2014, Yan, Zheng wrote:
Current code uses page array to present MDS request data. Pages in the array are allocated/freed by caller of ceph_mdsc_do_request(). If request is interrupted, the pages can be freed while they are still being used by the request message. The fix is use pagelist to present MDS request data. Pagelist is reference counted.

Signed-off-by: Yan, Zheng z...@redhat.com

So much nicer!

Reviewed-by: Sage Weil s...@redhat.com

---
 fs/ceph/mds_client.c | 14 +++++++++-----
 fs/ceph/mds_client.h |  4 +---
 fs/ceph/xattr.c      | 46 ++++++++-----------------------------
 3 files changed, 26 insertions(+), 38 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 30d7338..80d9f07 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -542,6 +542,8 @@ void ceph_mdsc_release_request(struct kref *kref)
 	}
 	kfree(req->r_path1);
 	kfree(req->r_path2);
+	if (req->r_pagelist)
+		ceph_pagelist_release(req->r_pagelist);
 	put_request_session(req);
 	ceph_unreserve_caps(req->r_mdsc, &req->r_caps_reservation);
 	kfree(req);
@@ -1847,13 +1849,15 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	msg->front.iov_len = p - msg->front.iov_base;
 	msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);

-	if (req->r_data_len) {
-		/* outbound data set only by ceph_sync_setxattr() */
-		BUG_ON(!req->r_pages);
-		ceph_msg_data_add_pages(msg, req->r_pages, req->r_data_len, 0);
+	if (req->r_pagelist) {
+		struct ceph_pagelist *pagelist = req->r_pagelist;
+		atomic_inc(&pagelist->refcnt);
+		ceph_msg_data_add_pagelist(msg, pagelist);
+		msg->hdr.data_len = cpu_to_le32(pagelist->length);
+	} else {
+		msg->hdr.data_len = 0;
 	}

-	msg->hdr.data_len = cpu_to_le32(req->r_data_len);
 	msg->hdr.data_off = cpu_to_le16(0);

 out_free2:
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index e00737c..23015f7 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -202,9 +202,7 @@ struct ceph_mds_request {
 	bool r_direct_is_hash;  /* true if r_direct_hash is valid */

 	/* data payload is used for xattr ops */
-	struct page **r_pages;
-	int r_num_pages;
-	int r_data_len;
+	struct ceph_pagelist *r_pagelist;

 	/* what caps shall we drop? */
 	int r_inode_drop, r_inode_unless;
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index eab3e2f..c7b18b2 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1,4 +1,5 @@
 #include <linux/ceph/ceph_debug.h>
+#include <linux/ceph/pagelist.h>

 #include "super.h"
 #include "mds_client.h"
@@ -852,28 +853,17 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = fsc->mdsc;
 	int err;
-	int i, nr_pages;
-	struct page **pages = NULL;
-	void *kaddr;
-
-	/* copy value into some pages */
-	nr_pages = calc_pages_for(0, size);
-	if (nr_pages) {
-		pages = kmalloc(sizeof(pages[0])*nr_pages, GFP_NOFS);
-		if (!pages)
-			return -ENOMEM;
-		err = -ENOMEM;
-		for (i = 0; i < nr_pages; i++) {
-			pages[i] = __page_cache_alloc(GFP_NOFS);
-			if (!pages[i]) {
-				nr_pages = i;
-				goto out;
-			}
-			kaddr = kmap(pages[i]);
-			memcpy(kaddr, value + i*PAGE_CACHE_SIZE,
-			       min(PAGE_CACHE_SIZE, size-i*PAGE_CACHE_SIZE));
-		}
-	}
+	struct ceph_pagelist *pagelist;
+
+	/* copy value into pagelist */
+	pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
+	if (!pagelist)
+		return -ENOMEM;
+
+	ceph_pagelist_init(pagelist);
+	err = ceph_pagelist_append(pagelist, value, size);
+	if (err)
+		goto out;

 	dout("setxattr value=%.*s\n", (int)size, value);

@@ -894,9 +884,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	req->r_args.setxattr.flags = cpu_to_le32(flags);
 	req->r_path2 = kstrdup(name, GFP_NOFS);

-	req->r_pages = pages;
-	req->r_num_pages = nr_pages;
-	req->r_data_len = size;
+	req->r_pagelist = pagelist;
+	pagelist = NULL;

 	dout("xattr.ver (before): %lld\n", ci->i_xattrs.version);
 	err = ceph_mdsc_do_request(mdsc, NULL, req);
@@ -904,11 +893,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	dout("xattr.ver (after): %lld\n", ci->i_xattrs.version);

 out:
-	if (pages) {
-		for (i = 0; i < nr_pages; i++)
-			__free_page(pages[i]);
-		kfree(pages);
-	}
+	if
Re: [PATCH 3/3] ceph: include the initial ACL in create/mkdir/mknod MDS requests
On Tue, 16 Sep 2014, Yan, Zheng wrote:
Current code set new file/directory's initial ACL in a non-atomic manner. Client first sends request to MDS to create new file/directory, then set the initial ACL after the new file/directory is successfully created. The fix is include the initial ACL in create/mkdir/mknod MDS requests. So MDS can handle creating file/directory and setting the initial ACL in one request.

Signed-off-by: Yan, Zheng z...@redhat.com

Reviewed-by: Sage Weil s...@redhat.com

---
 fs/ceph/acl.c   | 125 +++++++++++++++++++++++++++++++++++++++---------
 fs/ceph/dir.c   |  41 +++++++++++++++-
 fs/ceph/file.c  |  27 ++++++-----
 fs/ceph/super.h |  24 ++++-----
 4 files changed, 170 insertions(+), 47 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index cebf2eb..5bd853b 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -169,36 +169,109 @@ out:
 	return ret;
 }

-int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode *dir)
+int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
+		       struct ceph_acls_info *info)
 {
-	struct posix_acl *default_acl, *acl;
-	umode_t new_mode = inode->i_mode;
-	int error;
-
-	error = posix_acl_create(dir, &new_mode, &default_acl, &acl);
-	if (error)
-		return error;
-
-	if (!default_acl && !acl) {
-		cache_no_acl(inode);
-		if (new_mode != inode->i_mode) {
-			struct iattr newattrs = {
-				.ia_mode = new_mode,
-				.ia_valid = ATTR_MODE,
-			};
-			error = ceph_setattr(dentry, &newattrs);
+	struct posix_acl *acl, *default_acl;
+	size_t val_size1 = 0, val_size2 = 0;
+	struct ceph_pagelist *pagelist = NULL;
+	void *tmp_buf = NULL;
+	int err;
+
+	err = posix_acl_create(dir, mode, &default_acl, &acl);
+	if (err)
+		return err;
+
+	if (acl) {
+		int ret = posix_acl_equiv_mode(acl, mode);
+		if (ret < 0)
+			goto out_err;
+		if (ret == 0) {
+			posix_acl_release(acl);
+			acl = NULL;
 		}
-		return error;
 	}

-	if (default_acl) {
-		error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
-		posix_acl_release(default_acl);
-	}
+	if (!default_acl && !acl)
+		return 0;
+
+	if (acl)
+		val_size1 = posix_acl_xattr_size(acl->a_count);
+	if (default_acl)
+		val_size2 = posix_acl_xattr_size(default_acl->a_count);
+
+	err = -ENOMEM;
+	tmp_buf = kmalloc(max(val_size1, val_size2), GFP_NOFS);
+	if (!tmp_buf)
+		goto out_err;
+	pagelist = kmalloc(sizeof(struct ceph_pagelist), GFP_NOFS);
+	if (!pagelist)
+		goto out_err;
+	ceph_pagelist_init(pagelist);
+
+	err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
+	if (err)
+		goto out_err;
+
+	ceph_pagelist_encode_32(pagelist, acl && default_acl ? 2 : 1);
+
 	if (acl) {
-		if (!error)
-			error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-		posix_acl_release(acl);
+		size_t len = strlen(POSIX_ACL_XATTR_ACCESS);
+		err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
+		if (err)
+			goto out_err;
+		ceph_pagelist_encode_string(pagelist, POSIX_ACL_XATTR_ACCESS,
+					    len);
+		err = posix_acl_to_xattr(&init_user_ns, acl,
+					 tmp_buf, val_size1);
+		if (err < 0)
+			goto out_err;
+		ceph_pagelist_encode_32(pagelist, val_size1);
+		ceph_pagelist_append(pagelist, tmp_buf, val_size1);
 	}
-	return error;
+	if (default_acl) {
+		size_t len = strlen(POSIX_ACL_XATTR_DEFAULT);
+		err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
+		if (err)
+			goto out_err;
+		err = ceph_pagelist_encode_string(pagelist,
+						  POSIX_ACL_XATTR_DEFAULT, len);
+		err = posix_acl_to_xattr(&init_user_ns, default_acl,
+					 tmp_buf, val_size2);
+		if (err < 0)
+			goto out_err;
+		ceph_pagelist_encode_32(pagelist, val_size2);
+		ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+	}
+
+	kfree(tmp_buf);
+
+	info->acl = acl;
+	info->default_acl = default_acl;
+	info->pagelist = pagelist;
+	return 0;
+
+out_err:
+	posix_acl_release(acl);
+	posix_acl_release(default_acl);
+	kfree(tmp_buf);
+	if (pagelist)
Re: [PATCH] ceph: move ceph_find_inode() outside the s_mutex
On Wed, 17 Sep 2014, Yan, Zheng wrote:
ceph_find_inode() may wait on freeing inode, using it inside the s_mutex may cause deadlock. (the freeing inode is waiting for OSD read reply, but dispatch thread is blocked by the s_mutex)

Signed-off-by: Yan, Zheng z...@redhat.com

Reviewed-by: Sage Weil s...@redhat.com

---
 fs/ceph/caps.c       | 11 ++++++-----
 fs/ceph/mds_client.c |  7 ++++---
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6d1cd45..b3b0a91 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3045,6 +3045,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		}
 	}

+	/* lookup ino */
+	inode = ceph_find_inode(sb, vino);
+	ci = ceph_inode(inode);
+	dout(" op %s ino %llx.%llx inode %p\n", ceph_cap_op_name(op), vino.ino,
+	     vino.snap, inode);
+
 	mutex_lock(&session->s_mutex);
 	session->s_seq++;
 	dout(" mds%d seq %lld cap seq %u\n", session->s_mds, session->s_seq,
@@ -3053,11 +3059,6 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	if (op == CEPH_CAP_OP_IMPORT)
 		ceph_add_cap_releases(mdsc, session);

-	/* lookup ino */
-	inode = ceph_find_inode(sb, vino);
-	ci = ceph_inode(inode);
-	dout(" op %s ino %llx.%llx inode %p\n", ceph_cap_op_name(op), vino.ino,
-	     vino.snap, inode);
-
 	if (!inode) {
 		dout(" i don't have ino %llx\n", vino.ino);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 80d9f07..c27e204 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2947,14 +2947,15 @@ static void handle_lease(struct ceph_mds_client *mdsc,
 	if (dname.len != get_unaligned_le32(h+1))
 		goto bad;

-	mutex_lock(&session->s_mutex);
-	session->s_seq++;
-
 	/* lookup inode */
 	inode = ceph_find_inode(sb, vino);
 	dout("handle_lease %s, ino %llx %p %.*s\n",
 	     ceph_lease_op_name(h->action), vino.ino, inode,
 	     dname.len, dname.name);
+
+	mutex_lock(&session->s_mutex);
+	session->s_seq++;
+
 	if (inode == NULL) {
 		dout("handle_lease no inode %llx\n", vino.ino);
 		goto release;
--
1.9.3
Re: Fwd: S3 API Compatibility support
Hi Sage,

Could you please advise whether Ceph supports low-cost object storage (like Amazon Glacier or RRS) for archiving objects such as log files?

Thanks
Swami

On Thu, Sep 18, 2014 at 6:20 PM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Hi,

Could you please check and clarify the questions below on the object lifecycle and notification S3 APIs?

1. To support the bucket lifecycle, we need to support moving/deleting objects/buckets based on lifecycle settings. For example, an object lifecycle might be set as:
   1. Archive it after 10 days - i.e. move the object to low-cost object storage 10 days after its creation date.
   2. Remove it after 90 days - i.e. delete the object from the low-cost storage 90 days after its creation date.
Q1 - Does Ceph support the above concept of moving objects to low-cost storage and deleting them from that storage?

2. To support object notifications, there should first be a low-cost, high-availability storage class with a single replica only. An object created in this type of storage could be lost; if an object of this storage type is lost, a notification should be raised.
Q2 - Does Ceph support a low-cost, high-availability storage type?

Thanks

On Fri, Sep 12, 2014 at 8:00 PM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Hi Yehuda,

Could you please check and clarify the same questions above on the object lifecycle and notification S3 APIs?

Thanks
Swami

On Tue, Jul 29, 2014 at 1:35 AM, Yehuda Sadeh yeh...@redhat.com wrote:
Bucket lifecycle: http://tracker.ceph.com/issues/8929
Bucket notification: http://tracker.ceph.com/issues/8956

On Sun, Jul 27, 2014 at 12:54 AM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Good to know the details. Can you please share the issue ID for bucket lifecycle? My team could also start helping here. Regarding the notification - do we have an issue ID? Yes, object versioning will be a backlog item - I strongly feel we should start working on this asap.

Thanks
Swami

On Fri, Jul 25, 2014 at 11:31 PM, Yehuda Sadeh yeh...@redhat.com wrote:
On Fri, Jul 25, 2014 at 10:14 AM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Thanks for the quick reply. Yes, versioned objects are missing in Ceph ATM. I am looking for S3 API support for: bucket lifecycle (get/put/delete), bucket location, put object notification, and object restore (i.e. versioned objects). Please let me know if any of the above work is in progress or if someone has planned to work on it.

I opened an issue for bucket lifecycle (we already had an issue open for object expiration, though). We do have bucket location already (part of the multi-region feature). Object versioning is definitely on our backlog and one that we'll hopefully implement sooner rather than later. With regard to object notification, it'll require having a notification service, which is a bit out of scope. Integrating the gateway with such a service wouldn't be hard, but we'll need to have that first.

Yehuda

On Fri, Jul 25, 2014 at 9:19 PM, Sage Weil sw...@redhat.com wrote:
On Fri, 25 Jul 2014, M Ranga Swami Reddy wrote:
Hi Team:
As per the Ceph documentation, a few S3 APIs are not supported. Link: http://ceph.com/docs/master/radosgw/s3/ Is there a plan to support the unsupported items in the above table, or is anyone working on this?

Yes. Unfortunately this table isn't particularly detailed or accurate or up to date. The main gap, I think, is versioned objects. Are there specific parts of the S3 API that are missing that you need? That sort of info is very helpful for prioritizing effort...

sage
Re: Fwd: S3 API Compatibility support
On Fri, 19 Sep 2014, M Ranga Swami Reddy wrote:
Hi Sage,
Could you please advise whether Ceph supports low-cost object storage (like Amazon Glacier or RRS) for archiving objects such as log files?

Ceph doesn't interact at all with AWS services like Glacier, if that's what you mean.

For RRS, though, I assume you mean the ability to create buckets with reduced redundancy with radosgw? That is supported, although not quite the way AWS does it. You can create different pools that back RGW buckets, and each bucket is stored in one of those pools. So you could make one of them 2x instead of 3x, or use an erasure code of your choice.

What isn't currently supported is the ability to reduce the redundancy of individual objects in a bucket. I don't think there is anything architecturally preventing that, but it is not implemented or supported. When we look at the S3 archival features in more detail (soon!) I'm sure this will come up! The current plan is to address object versioning first. That is, unless a developer surfaces who wants to start hacking on this right away...

sage
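For reference, the reduced-redundancy setup Sage describes can be sketched with the standard Ceph CLI. The pool names below are made up for illustration, and the RGW placement configuration step is only outlined, since its exact syntax varies by release:

```
# A 2x-replicated pool instead of the default 3x:
ceph osd pool create rgw-reduced 128 128
ceph osd pool set rgw-reduced size 2

# Or an erasure-coded pool (e.g. k=2, m=1):
ceph osd erasure-code-profile set ec21 k=2 m=1
ceph osd pool create rgw-ec 128 128 erasure ec21

# Point an RGW placement target at the pool by editing the zone's
# placement_pools, then buckets created under that placement use it:
radosgw-admin zone get > zone.json
#   ... edit placement_pools in zone.json to reference rgw-reduced or rgw-ec ...
radosgw-admin zone set < zone.json
```

The redundancy choice is per pool, so it applies to every object in buckets backed by that pool, matching Sage's point that per-object redundancy is not supported.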