RE: severe librbd performance degradation in Giant
Sage, any reason why the cache is enabled by default in Giant? Regarding profiling, I will see if I can run VTune/mutrace on this.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 17, 2014 8:53 PM
To: Somnath Roy
Cc: Haomai Wang; Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

On Thu, 18 Sep 2014, Somnath Roy wrote:
> Yes Haomai...

I would love to see what a profiler says about the matter. There is going to be some overhead on the client associated with the cache for a random io workload, but 10x is a problem!

sage

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, September 17, 2014 7:28 PM
To: Somnath Roy
Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org
Subject: Re: severe librbd performance degradation in Giant

According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will cause a 10x performance degradation for random read?

On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote:
> Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly.

-----Original Message-----
From: Somnath Roy
Sent: Wednesday, September 17, 2014 2:44 PM
To: Sage Weil
Cc: Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

Created a tracker for this: http://tracker.ceph.com/issues/9513

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, September 17, 2014 2:39 PM
To: Sage Weil
Cc: Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

Sage, it's a 4K random read.
Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 17, 2014 2:36 PM
To: Somnath Roy
Cc: Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential.

sage

On Wed, 17 Sep 2014, Somnath Roy wrote:
> I set the following in the client-side /etc/ceph/ceph.conf where I am running fio rbd:
>
>     rbd_cache_writethrough_until_flush = false
>
> But, no difference. BTW, I am doing random read, not write. Does this setting still apply?
>
> Next, I tried setting rbd_cache to false and I *got back* the old performance. Now, it is similar to Firefly throughput! So, it looks like rbd_cache=true was the culprit. Thanks Josh!
>
> Regards
> Somnath

-----Original Message-----
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 17, 2014 2:20 PM
To: Somnath Roy; ceph-devel@vger.kernel.org
Subject: Re: severe librbd performance degradation in Giant

On 09/17/2014 01:55 PM, Somnath Roy wrote:
> Hi Sage,
> We are experiencing severe librbd performance degradation in Giant over the Firefly release. Here is the experiment we did to isolate it as a librbd problem:
>
> 1. Single OSD is running latest Giant and the client is running fio rbd on top of Firefly-based librbd/librados. For one client it gives ~11-12K iops (4K RR).
> 2. Single OSD is running Giant and the client is running fio rbd on top of Giant-based librbd/librados. For one client it gives ~1.9K iops (4K RR).
> 3. Single OSD is running latest Giant and the client is running Giant-based ceph_smalliobench on top of Giant librados. For one client it gives ~11-12K iops (4K RR).
> 4. Giant RGW on top of Giant OSD is also scaling.
>
> So, it is obvious from the above that recent librbd has issues. I will raise a tracker to track this.
For Giant the default cache settings changed to:

    rbd cache = true
    rbd cache writethrough until flush = true

If fio isn't sending flushes as the test is running, the cache will stay in writethrough mode. Does the difference remain if you set rbd cache writethrough until flush = false?

Josh

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or
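Spelled out as a client-side ceph.conf fragment (the [client] section placement is an assumption; the option names are the ones quoted above), the suggested test of disabling the flush-gated writethrough behaviour would look like:

```ini
[client]
rbd cache = true
rbd cache writethrough until flush = false
```

Setting rbd cache = false instead disables the cache entirely, which is what restored the Firefly-level numbers in the test described above.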
How to use radosgw-admin to delete some or all users?
Hi all, I know radosgw-admin can delete one user with the command 'radosgw-admin user rm --uid=xxx'. Is there a command to delete multiple users, or all users at once? Thanks.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
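There is no single bulk-delete command that I know of, but a shell loop over the user list can do it. A hedged sketch, assuming 'radosgw-admin user list' emits a JSON array of uids and that jq is installed; verify both on your version before running this against real data:

```shell
#!/bin/sh
# ASSUMPTION: 'radosgw-admin user list' prints a JSON array of uids.
# --purge-data also removes each user's buckets and objects; drop it
# to keep the data and only remove the user records.
for uid in $(radosgw-admin user list | jq -r '.[]'); do
    echo "removing user: $uid"
    radosgw-admin user rm --uid="$uid" --purge-data
done
```

To delete only a subset, replace the user-list pipeline with a file of uids, one per line.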
RE: severe librbd performance degradation in Giant
Same question as Somnath. Some of our customers are not that comfortable with the cache; they still have some consistency concerns.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 18, 2014 2:25 PM
To: Sage Weil
Cc: Haomai Wang; Josh Durgin; ceph-devel@vger.kernel.org
Subject: RE: severe librbd performance degradation in Giant

Sage, any reason why the cache is enabled by default in Giant? Regarding profiling, I will see if I can run VTune/mutrace on this.

Thanks & Regards
Somnath

[...]
Re: severe librbd performance degradation in Giant
> According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will cause a 10x performance degradation for random read?

Hi, on my side I don't see any read performance degradation (seq or rand) with or without the cache:

firefly: around 12000 iops (with or without rbd_cache)
giant: around 12000 iops (with or without rbd_cache)

(and I can reach around 2-3 iops on giant with the optracker disabled). rbd_cache only improves write performance for me (4k blocks).

----- Original Message -----
From: Haomai Wang haomaiw...@gmail.com
To: Somnath Roy somnath@sandisk.com
Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org
Sent: Thursday, 18 September 2014 04:27:56
Subject: Re: severe librbd performance degradation in Giant

[...]
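For anyone trying to reproduce these numbers, a 4K random-read job against fio's rbd engine might look like the sketch below (pool and image names are placeholders; the ioengine option names follow fio's stock rbd example job):

```ini
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
rw=randread
bs=4k
runtime=60
time_based

[rbd_iodepth32]
iodepth=32
```

The cache-related behaviour is then controlled entirely from the client's ceph.conf, so the same job file can be run with rbd_cache on and off for an A/B comparison.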
Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
Hi Kevin,

On 2014-09-16 11:25:12 -0700, Kevin Greenan wrote:
> I feel that separating the arch-specific implementations out and having a default 'generic' implementation would be a huge improvement. Note that gf-complete was in active development for some time before including the SIMD code. In hindsight, we should have done this separation back in 2012, but had some time pressure due to a paper deadline and limited time available to the contributors. Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is fine by me.

I'll rename them and start implementing NEON-optimized functions in their own files.

> Also, there should be very little SIMD work with jerasure, as gf-complete is the Galois field backend, so I would not worry too much about that.

I noticed; I have already hooked my NEON code up locally in ceph without touching jerasure.

> That covers clean-up work. We can discuss the best way to choose the underlying implementation (looks like we have a bunch of options) as this work is completed. With this in mind, what work were you planning to do? I can try to free up cycles to help, but that may not happen for a few weeks.

Primarily NEON optimisations for gf-complete/ceph. Shouldn't take more than a few days though.

> One last thing... If you do have code you want to push upstream, please submit a pull request(s) to our main bitbucket repo. Make sense?

Yes, thanks.

Janne
v2 aligned buffer changes for erasure codes
Hi,

following is an updated patchset. It now passes make check in src.

It has the following changes:

* use 32-byte alignment since the isa plugin uses AVX2 (src/erasure-code/isa/README claims it needs 16*k byte aligned buffers, but I can't see a reason why it would need more than 32 bytes)
* ErasureCode::encode_prepare() handles more than one chunk with padding

cheers

Janne
[PATCH v2 1/3] buffer: add an aligned buffer with less alignment than a page
SIMD optimized erasure code computation needs aligned memory. Buffers aligned to a page boundary are wasted on it though; the buffers used for the erasure code computation are typically smaller than a page. An alignment of 32 bytes is chosen to satisfy the needs of AVX/AVX2. Could be made arch-specific to reduce the alignment to 16 bytes for arm/aarch64 NEON.

Signed-off-by: Janne Grunau j...@jannau.net
---
 configure.ac         |   9 +
 src/common/buffer.cc | 100 +++
 src/include/buffer.h |  10 ++
 3 files changed, 119 insertions(+)

diff --git a/configure.ac b/configure.ac
index cccf2d9..1bb27c4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -793,6 +793,15 @@
 ])

 #
+# Check for functions to provide aligned memory
+#
+AC_CHECK_HEADERS([malloc.h])
+AC_CHECK_FUNCS([posix_memalign _aligned_malloc memalign aligned_malloc],
+               [found_memalign=yes; break])
+
+AS_IF([test "x$found_memalign" != "xyes"], [AC_MSG_WARN([No function for aligned memory allocation found])])
+
+#
 # Check for pthread spinlock (depends on ACX_PTHREAD)
 #
 saved_LIBS="$LIBS"

diff --git a/src/common/buffer.cc b/src/common/buffer.cc
index b141759..acc221f 100644
--- a/src/common/buffer.cc
+++ b/src/common/buffer.cc
@@ -30,6 +30,10 @@
 #include <sys/uio.h>
 #include <limits.h>

+#ifdef HAVE_MALLOC_H
+#include <malloc.h>
+#endif
+
 namespace ceph {

 #ifdef BUFFER_DEBUG
@@ -155,9 +159,15 @@
     virtual int zero_copy_to_fd(int fd, loff_t *offset) { return -ENOTSUP; }

+    virtual bool is_aligned() {
+      return ((long)data & ~CEPH_ALIGN_MASK) == 0;
+    }
     virtual bool is_page_aligned() {
       return ((long)data & ~CEPH_PAGE_MASK) == 0;
     }
+    bool is_n_align_sized() {
+      return (len & ~CEPH_ALIGN_MASK) == 0;
+    }
     bool is_n_page_sized() {
       return (len & ~CEPH_PAGE_MASK) == 0;
     }
@@ -209,6 +219,41 @@
   };

+  class buffer::raw_aligned : public buffer::raw {
+  public:
+    raw_aligned(unsigned l) : raw(l) {
+      if (len) {
+#if HAVE_POSIX_MEMALIGN
+        if (posix_memalign((void **)&data, CEPH_ALIGN, len))
+          data = 0;
+#elif HAVE__ALIGNED_MALLOC
+        data = _aligned_malloc(len, CEPH_ALIGN);
+#elif HAVE_MEMALIGN
+        data = memalign(CEPH_ALIGN, len);
+#elif HAVE_ALIGNED_MALLOC
+        data = aligned_malloc((len + CEPH_ALIGN - 1) & ~CEPH_ALIGN_MASK,
+                              CEPH_ALIGN);
+#else
+        data = malloc(len);
+#endif
+        if (!data)
+          throw bad_alloc();
+      } else {
+        data = 0;
+      }
+      inc_total_alloc(len);
+      bdout << "raw_aligned " << this << " alloc " << (void *)data << " " << l << " " << buffer::get_total_alloc() << bendl;
+    }
+    ~raw_aligned() {
+      free(data);
+      dec_total_alloc(len);
+      bdout << "raw_aligned " << this << " free " << (void *)data << " " << buffer::get_total_alloc() << bendl;
+    }
+    raw* clone_empty() {
+      return new raw_aligned(len);
+    }
+  };
+
 #ifndef __CYGWIN__
   class buffer::raw_mmap_pages : public buffer::raw {
   public:
@@ -334,6 +379,10 @@
       return true;
     }

+    bool is_aligned() {
+      return false;
+    }
+
     bool is_page_aligned() {
       return false;
     }
@@ -520,6 +569,9 @@
   buffer::raw* buffer::create_static(unsigned len, char *buf) {
     return new raw_static(buf, len);
   }
+  buffer::raw* buffer::create_aligned(unsigned len) {
+    return new raw_aligned(len);
+  }
   buffer::raw* buffer::create_page_aligned(unsigned len) {
 #ifndef __CYGWIN__
     //return new raw_mmap_pages(len);
@@ -1013,6 +1065,16 @@
     return true;
   }

+  bool buffer::list::is_aligned() const
+  {
+    for (std::list<ptr>::const_iterator it = _buffers.begin();
+         it != _buffers.end();
+         ++it)
+      if (!it->is_aligned())
+        return false;
+    return true;
+  }
+
   bool buffer::list::is_page_aligned() const
   {
     for (std::list<ptr>::const_iterator it = _buffers.begin();
@@ -1101,6 +1163,44 @@
     _buffers.push_back(nb);
   }

+  void buffer::list::rebuild_aligned()
+  {
+    std::list<ptr>::iterator p = _buffers.begin();
+    while (p != _buffers.end()) {
+      // keep anything that's already aligned and align-sized
+      if (p->is_aligned() && p->is_n_align_sized()) {
+        /* cout << " segment " << (void*)p->c_str()
+                << " offset " << ((unsigned long)p->c_str() & ~CEPH_ALIGN_MASK)
+                << " length " << p->length()
+                << " " << (p->length() & ~CEPH_ALIGN_MASK) << " ok" << std::endl;
+        */
+        ++p;
+        continue;
+      }
+
+      // consolidate unaligned items, until
[PATCH v2 2/3] ec: use 32-byte aligned buffers
Requiring page aligned buffers and realigning the input if necessary creates measurable overhead. ceph_erasure_code_benchmark is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.

Also prevents a misaligned buffer when bufferlist::c_str(bufferlist) has to allocate a new buffer to provide a contiguous one. See bug #9408.

Signed-off-by: Janne Grunau j...@jannau.net
---
 src/erasure-code/ErasureCode.cc | 57 ++++++++++++++++++++++-----------
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
index 5953f49..7aa5235 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,49 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
 }

 int ErasureCode::encode_prepare(const bufferlist &raw,
-                                bufferlist *prepared) const
+                                map<int, bufferlist> &encoded) const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length() > 0) {
-    bufferptr pad(padded_length - raw.length());
-    pad.zero();
-    prepared->push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+  unsigned padded_chunks = k - raw.length() / blocksize;
+  bufferlist prepared = raw;
+
+  if (!prepared.is_aligned()) {
+    // splice padded chunks off to make the rebuild faster
+    if (padded_chunks)
+      prepared.splice((k - padded_chunks) * blocksize,
+                      padded_chunks * blocksize - pad_len);
+    prepared.rebuild_aligned();
+  }
+
+  for (unsigned int i = 0; i < k - padded_chunks; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.substr_of(prepared, i * blocksize, blocksize);
+  }
+  if (padded_chunks) {
+    unsigned remainder = raw.length() - (k - padded_chunks) * blocksize;
+    bufferlist padded;
+    bufferptr buf(buffer::create_aligned(padded_chunks * blocksize));
+
+    raw.copy((k - padded_chunks) * blocksize, remainder, buf.c_str());
+    buf.zero(remainder, pad_len);
+    padded.push_back(buf);
+
+    for (unsigned int i = k - padded_chunks; i < k; i++) {
+      int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+      bufferlist &chunk = encoded[chunk_index];
+      chunk.substr_of(padded, (i - (k - padded_chunks)) * blocksize, blocksize);
+    }
+  }
+  for (unsigned int i = k; i < k + m; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.push_back(buffer::create_aligned(blocksize));
   }
-  unsigned coding_length = blocksize * m;
-  bufferptr coding(buffer::create_page_aligned(coding_length));
-  prepared->push_back(coding);
-  prepared->rebuild_page_aligned();
+
   return 0;
 }

@@ -80,15 +107,9 @@ int ErasureCode::encode(const set<int> &want_to_encode,
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   bufferlist out;
-  int err = encode_prepare(in, out);
+  int err = encode_prepare(in, *encoded);
   if (err)
     return err;
-  unsigned blocksize = get_chunk_size(in.length());
-  for (unsigned int i = 0; i < k + m; i++) {
-    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
-    bufferlist &chunk = (*encoded)[chunk_index];
-    chunk.substr_of(out, i * blocksize, blocksize);
-  }
   encode_chunks(want_to_encode, encoded);
   for (unsigned int i = 0; i < k + m; i++) {
     if (want_to_encode.count(i) == 0)

diff --git a/src/erasure-code/ErasureCode.h b/src/erasure-code/ErasureCode.h
index 7aaea95..62aa383 100644
--- a/src/erasure-code/ErasureCode.h
+++ b/src/erasure-code/ErasureCode.h
@@ -46,7 +46,8 @@ namespace ceph {
                const map<int, int> &available,
                set<int> *minimum);

-    int encode_prepare(const bufferlist &raw, bufferlist *prepared) const;
+    int encode_prepare(const bufferlist &raw,
+                       map<int, bufferlist> &encoded) const;

     virtual int encode(const set<int> &want_to_encode,
                        const bufferlist &in,
--
2.1.0
RE: v2 aligned buffer changes for erasure codes
Hi Janne,

> (src/erasure-code/isa/README claims it needs 16*k byte aligned buffers)

I should update the README since it is misleading. It should say 8*k or 16*k byte aligned chunk size, depending on the compiler/platform used; it is not the alignment of the allocated buffer addresses. The get_alignment function in the plug-in is used to compute the chunk size for the encoding (as I said, not the start address alignment).

If you pass k buffers for decoding, each buffer should be aligned to at least 16 or, as you pointed out, better 32 bytes.

For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k.

For the best possible performance on all platforms we should change the get_alignment function in the ISA plug-in to return 32*k, if there are no other objections?!

Cheers Andreas.

From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on behalf of Janne Grunau [j...@jannau.net]
Sent: 18 September 2014 12:33
To: ceph-devel@vger.kernel.org
Subject: v2 aligned buffer changes for erasure codes

[...]
RE: v2 aligned buffer changes for erasure codes
Hi Janne/Loic,

there is more confusion, at least on my side... I have now had a look at the jerasure plug-in and I am slightly confused why get_alignment has two ways to return a value... one is as I assumed, and another one is per_chunk_alignment... what should the function return, Loic?

Cheers Andreas.

From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on behalf of Andreas Joachim Peters [andreas.joachim.pet...@cern.ch]
Sent: 18 September 2014 14:18
To: Janne Grunau; ceph-devel@vger.kernel.org
Subject: RE: v2 aligned buffer changes for erasure codes

[...]
Re: severe librbd performance degradation in Giant
On 09/18/2014 04:49 AM, Alexandre DERUMIER wrote: According http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? Hi, on my side, I don't see any degradation performance on read (seq or rand) with or without. firefly : around 12000iops (with or without rbd_cache) giant : around 12000iops (with or without rbd_cache) (and I can reach around 2-3 iops on giant with disabling optracker). rbd_cache only improve write performance for me (4k block ) I can't do it right now since I'm in the middle of reinstalling fedora on the test nodes, but I will try to replicate this as well if we haven't figured it out before hand. Mark - Mail original - De: Haomai Wang haomaiw...@gmail.com À: Somnath Roy somnath@sandisk.com Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org Envoyé: Jeudi 18 Septembre 2014 04:27:56 Objet: Re: severe librbd performance degradation in Giant According http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Created a tracker for this. http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Sage, It's a 4K random read. 
Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequentail s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd. rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing Random read, not write. Still this setting applies ? Next, I tried to tweak the rbd_cache setting to false and I *got back* the old performance. Now, it is similar to firefly throughput ! So, loks like rbd_cache=true was the culprit. Thanks Josh ! Regards Somnath -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Wednesday, September 17, 2014 2:20 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over firefly release. Here is the experiment we did to isolate it as a librbd problem. 1. Single OSD is running latest Giant and client is running fio rbd on top of firefly based librbd/librados. For one client it is giving ~11-12K iops (4K RR). 2. Single OSD is running Giant and client is running fio rbd on top of Giant based librbd/librados. For one client it is giving ~1.9K iops (4K RR). 3. Single OSD is running latest Giant and client is running Giant based ceph_smaiobench on top of giant librados. For one client it is giving ~11-12K iops (4K RR). 4. Giant RGW on top of Giant OSD is also scaling. So, it is obvious from the above that recent librbd has issues. I will raise a tracker to track this. 
For giant the default cache settings changed to: rbd cache = true rbd cache writethrough until flush = true If fio isn't sending flushes as the test is running, the cache will stay in writethrough mode. Does the difference remain if you set rbd cache writethrough until flush = false? Josh
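For reference, the Giant defaults Josh describes correspond to a client-side ceph.conf fragment like the following; the commented overrides are the two knobs tested in this thread (this is a sketch of the settings named above, not a recommended production config):

```ini
[client]
# Giant defaults: cache on, but held in writethrough mode
# until the first flush arrives from the application.
rbd cache = true
rbd cache writethrough until flush = true

# To test without the flush gate (writeback immediately):
# rbd cache writethrough until flush = false
# To take the cache out of the picture entirely:
# rbd cache = false
```

Note that fio-style benchmarks typically never issue a flush, which is why the cache stays in writethrough mode for the whole run with the defaults.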
Re: v2 aligned buffer changes for erasure codes
Hi, On 2014-09-18 12:18:59 +, Andreas Joachim Peters wrote: = (src/erasure-code/isa/README claims it needs 16*k byte aligned buffers I should update the README since it is misleading ... it should say 8*k or 16*k byte aligned chunk size depending on the compiler/platform used; it is not the alignment of the allocated buffer addresses. The get_alignment function in the plug-in is used to compute the chunk size for the encoding (as I said, not the start address alignment). I've seen that if you pass k buffers for decoding, each buffer should be aligned at least to 16 or, as you pointed out, better 32 bytes. ok, that makes sense For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k. I don't get that. How is the buffer split? Into k (+ m) chunk-size parts? As long as the start and the length are both 16 (or 32) byte aligned, all parts are properly aligned too. I don't see where the k comes into play. cheers Janne -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
snap_trimming + backfilling is inefficient with many purged_snaps
(moving this discussion to -devel) Begin forwarded message: From: Florian Haas flor...@hastexo.com Date: 17 Sep 2014 18:02:09 CEST Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU To: Dan Van Der Ster daniel.vanders...@cern.ch Cc: Craig Lewis cle...@centraldesktop.com, ceph-us...@lists.ceph.com ceph-us...@lists.ceph.com On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: From: Florian Haas flor...@hastexo.com Sent: Sep 17, 2014 5:33 PM To: Dan Van Der Ster Cc: Craig Lewis cle...@centraldesktop.com;ceph-us...@lists.ceph.com Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Hi Florian, On 17 Sep 2014, at 17:09, Florian Haas flor...@hastexo.com wrote: Hi Craig, just dug this up in the list archives. On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis cle...@centraldesktop.com wrote: In the interest of removing variables, I removed all snapshots on all pools, then restarted all ceph daemons at the same time. This brought up osd.8 as well. So just to summarize this: your 100% CPU problem at the time went away after you removed all snapshots, and the actual cause of the issue was never found? I am seeing a similar issue now, and have filed http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost again. Can you take a look at that issue and let me know if anything in the description sounds familiar? Could your ticket be related to the snap trimming issue I’ve finally narrowed down in the past couple days? http://tracker.ceph.com/issues/9487 Bump up debug_osd to 20 then check the log during one of your incidents. If it is busy logging the snap_trimmer messages, then it’s the same issue. (The issue is that rbd pools have many purged_snaps, but sometimes after backfilling a PG the purged_snaps list is lost and thus the snap trimmer becomes very busy whilst re-trimming thousands of snaps. 
During that time (a few minutes on my cluster) the OSD is blocked.) That sounds promising, thank you! debug_osd=10 should actually be sufficient as those snap_trim messages get logged at that level. :) Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Looking forward to any ideas someone might have. 
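Dan's reproduction recipe might translate into commands roughly like the following (pool name, snapshot count, and OSD id are my placeholders, not from the thread; this is a sketch, not a tested reproducer):

```shell
# Create many pool snapshots of a nearly empty pool, then remove them all
for i in $(seq 1 1000); do ceph osd pool mksnap rbd-test snap$i; done
for i in $(seq 1 1000); do ceph osd pool rmsnap rbd-test snap$i; done

# Move PGs around so they backfill onto other OSDs
ceph osd crush reweight osd.0 0.5

# With debug_osd=10, look for the re-trimming signature in the OSD log
grep 'adding snap .* to purged_snaps' /var/log/ceph/ceph-osd.0.log
```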
Cheers, Dan
Re: Fwd: S3 API Compatibility support
Hi, Could you please check and clarify the questions below on object lifecycle and notification S3 API support: 1. To support the bucket lifecycle, we need to support moving/deleting objects/buckets based on their lifecycle settings. For example, if an object's lifecycle is set as below: 1. Archive it after 10 days - i.e., move this object to low-cost object storage 10 days after the creation date. 2. Remove this object after 90 days - i.e., remove this object from the low-cost storage 90 days after the creation date. Q1 - Does Ceph support the above concept, moving to low-cost storage and deleting from that storage? 2. To support object notifications: first there should be a low-cost, high-availability storage type with a single replica only. An object created on this type of storage could be lost, so if such an object is lost, a notification should be sent. Q2 - Does Ceph support a low-cost, high-availability storage type? Thanks Swami On Tue, Jul 29, 2014 at 1:35 AM, Yehuda Sadeh yeh...@redhat.com wrote: Bucket lifecycle: http://tracker.ceph.com/issues/8929 Bucket notification: http://tracker.ceph.com/issues/8956 On Sun, Jul 27, 2014 at 12:54 AM, M Ranga Swami Reddy swamire...@gmail.com wrote: Good to know the details. Can you please share the issue ID for bucket lifecycle? My team could also start to help here. Regarding the notification - do we have an issue ID? Yes, the object versioning will be a backlog item - I strongly feel we should start working on this asap. Thanks Swami On Fri, Jul 25, 2014 at 11:31 PM, Yehuda Sadeh yeh...@redhat.com wrote: On Fri, Jul 25, 2014 at 10:14 AM, M Ranga Swami Reddy swamire...@gmail.com wrote: Thanks for the quick reply. Yes, versioned objects are missing in Ceph ATM. I am looking for: bucket lifecycle (get/put/delete), bucket location, put object notification and object restore (i.e. versioned objects) S3 API support. Please let me know if any of the above work is in progress or if someone has planned to work on it. I opened an issue for bucket lifecycle (we already had an issue open for object expiration, though). We do have bucket location already (part of the multi-region feature). Object versioning is definitely on our backlog and one that we'll hopefully implement sooner rather than later. With regard to object notification, it'll require having a notification service, which is a bit out of scope. Integrating the gateway with such a service wouldn't be hard, but we'll need to have that first. Yehuda Thanks Swami On Fri, Jul 25, 2014 at 9:19 PM, Sage Weil sw...@redhat.com wrote: On Fri, 25 Jul 2014, M Ranga Swami Reddy wrote: Hi Team: As per the Ceph documentation, a few S3 API compatibility items are not supported. Link: http://ceph.com/docs/master/radosgw/s3/ Is there a plan to support the unsupported items in the above table, or is anyone working on this? Yes. 
Unfortunately this table isn't particularly detailed or accurate or up to date. The main gap, I think, is versioned objects. Are there specific parts of the S3 API that are missing that you need? That sort of info is very helpful for prioritizing effort... sage
Re: v2 aligned buffer changes for erasure codes
Hi, On 2014-09-18 12:34:49 +, Andreas Joachim Peters wrote: there is more confusion, at least on my side ... I have now had a look at the jerasure plug-in and I am slightly confused why you have two ways to return in get_alignment ... one is as I assume and another one is per_chunk_alignment ... what should the function return, Loic? the per_chunk_alignment is just a bool which says that each chunk has to start at an aligned address. get_alignment() seems to be used to align the chunk size. It might come from gf-complete's strange alignment requirements. Instead of requiring aligned buffers it requires that the src and dst buffers have the same remainder when divided by 16. The best way to achieve that is to align the length to 16 and use a single buffer. I agree it's convoluted. Janne
RE: v2 aligned buffer changes for erasure codes
Hi Janne, For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k. I don't get that. How is the buffer split? Into k (+ m) chunk-size parts? As long as the start and the length are both 16 (or 32) byte aligned, all parts are properly aligned too. I don't see where the k comes into play. The original data block to encode has to be split into k equally long pieces. Each piece is given as one of the k input buffers to the erasure code algorithm producing m output buffers, and each piece has to have an aligned starting address and length. If you deal with a 128 byte data input buffer for k=4 it splits like offset=00 len=32 as chunk1 offset=32 len=32 as chunk2 offset=64 len=32 as chunk3 offset=96 len=32 as chunk4 If the desired IO size were 196 bytes, the 32 byte alignment requirement blows this buffer up to 256 bytes: offset=00 len=64 as chunk1 offset=64 len=64 as chunk2 offset=128 len=64 as chunk3 offset=192 len=64 as chunk4 For the typical 4kb only k=2,4,8,16,32,64,128 do not increase the buffer. If someone configures e.g. k=10 the buffer is increased from 4096 to 4160 bytes and it creates 1.5% storage volume overhead. Cheers Andreas.
Re: v2 aligned buffer changes for erasure codes
On 2014-09-18 13:01:03 +, Andreas Joachim Peters wrote: For encoding there is normally a single buffer split 'virtually' into k pieces. To make all pieces start at an aligned address one needs to align the chunk size to e.g. 16*k. I don't get that. How is the buffer split? Into k (+ m) chunk-size parts? As long as the start and the length are both 16 (or 32) byte aligned, all parts are properly aligned too. I don't see where the k comes into play. The original data block to encode has to be split into k equally long pieces. Each piece is given as one of the k input buffers to the erasure code algorithm producing m output buffers, and each piece has to have an aligned starting address and length. If you deal with a 128 byte data input buffer for k=4 it splits like offset=00 len=32 as chunk1 offset=32 len=32 as chunk2 offset=64 len=32 as chunk3 offset=96 len=32 as chunk4 If the desired IO size were 196 bytes, the 32 byte alignment requirement blows this buffer up to 256 bytes: offset=00 len=64 as chunk1 offset=64 len=64 as chunk2 offset=128 len=64 as chunk3 offset=192 len=64 as chunk4 I fail to see how the 32 * k is related to alignment. It's only used to pad the total size so it becomes a multiple of k * 32. That is ok since we want k 32-byte aligned chunks. The alignment for each chunk is just 32 bytes. Janne
RE: severe librbd performance degradation in Giant
On Thu, 18 Sep 2014, Somnath Roy wrote: Sage, Any reason why the cache is by default enabled in Giant? It's recommended practice to turn it on. It improves performance in general (especially with HDD OSDs). Do you mind comparing sequential small IOs? sage Regarding profiling, I will try if I can run Vtune/mutrace on this. Thanks Regards Somnath
RE: v2 aligned buffer changes for erasure codes
I fail to see how the 32 * k is related to alignment. It's only used to pad the total size so it becomes a multiple of k * 32. That is ok since we want k 32-byte aligned chunks. The alignment for each chunk is just 32 bytes. Yes, agreed! The alignment for each chunk should be 32 bytes. And the implementation is most efficient if the given encoding buffer is already padded to k*32 bytes; it avoids an additional buffer allocation and copy. Cheers Andreas.
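The padding arithmetic the thread converges on can be sketched as follows (the function name is mine, not from the Ceph tree; alignment defaults to the 32 bytes discussed above): pad the total buffer to a multiple of k * 32 so that each of the k equal chunks starts at a 32-byte-aligned offset.

```python
def padded_chunk_layout(data_len, k, alignment=32):
    """Pad data_len up to a multiple of k * alignment and return
    (padded_size, chunk_size, chunk_offsets). Every chunk then starts
    at an offset that is a multiple of `alignment`, as the SIMD
    erasure-code kernels require."""
    unit = k * alignment
    padded = ((data_len + unit - 1) // unit) * unit  # round up
    chunk = padded // k
    return padded, chunk, [i * chunk for i in range(k)]

# Andreas's examples: 128 bytes with k=4 splits without padding...
print(padded_chunk_layout(128, 4))   # (128, 32, [0, 32, 64, 96])
# ...while a 196-byte IO is blown up to 256 bytes:
print(padded_chunk_layout(196, 4))   # (256, 64, [0, 64, 128, 192])
# k=10 pads a 4096-byte object to 4160 bytes (~1.5% overhead):
print(padded_chunk_layout(4096, 10)[0])  # 4160
```

This also shows why k=2,4,8,...,128 add no padding for 4 KiB objects: 4096 is already a multiple of k * 32 exactly when k divides 128.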
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). 
Hmmm, I'm not sure I can confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snap_trimq and/or the purged_snaps list as well. Looking forward to any ideas someone might have. So am I. :) Cheers, Florian
Re: radosgw-admin list users?
On Thu, Sep 18, 2014 at 10:27 AM, Robin H. Johnson robb...@gentoo.org wrote: Related to this thread, radosgw-admin doesn't seem to have anything to list the users. The closest I have as a hack is: rados ls --pool=.users.uid | sed 's,.buckets$,,g' | sort | uniq Try: $ radosgw-admin metadata list user Yehuda But this does require internal knowledge of how it's stored, and I don't want to rely on it. On Thu, Sep 18, 2014 at 03:53:27PM +0800, Zhao zhiming wrote: HI ALL, I know radosgw-admin can delete one user with the command ‘radosgw-admin user rm uid=xxx’; I want to know whether there are commands to delete multiple or all users? thanks. -- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
RE: severe librbd performance degradation in Giant
Alexandre, What tool are you using? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache = true set in the client conf file. FYI, firefly librbd + librados and a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the Giant librbd (if you have multiple copies around, which was the case for me) to reproduce it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without rbd_cache. firefly : around 12000 iops (with or without rbd_cache) giant : around 12000 iops (with or without rbd_cache) (and I can reach around 2-3 iops on giant with disabling optracker). rbd_cache only improves write performance for me (4k block)
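For anyone trying to reproduce the 4K random-read numbers with the same tool, a minimal fio job of the sort used in this thread might look like this (pool name, image name, and client name are assumptions, not taken from the thread):

```ini
; 4K random read against an existing RBD image via librbd
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
invalidate=0

[rand-read-4k]
rw=randread
bs=4k
iodepth=32
runtime=60
time_based
```

Since fio's rbd engine links against whichever librbd the loader finds, it is worth double-checking (e.g. with ldd on the fio binary or LD_LIBRARY_PATH) that the intended Giant vs. Firefly librbd is actually being loaded, as Somnath notes.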
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour daniel.vanders...@cern.ch wrote: Hi Florian, On Sep 18, 2014 7:03 PM, Florian Haas flor...@hastexo.com wrote: Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. True, as am I. The current sleep is useful in this case, but since we'd normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect 90s slow requests like I have with the 3 snap_trimq single PG. Possibly the sleep is useful in both places. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. 
To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why the purged_snaps are being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure if I confirm that. I see adding snap X to purged_snaps, but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT messages... again confirming those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that will block the OSD until they are all re-trimmed. That's... a mess. So what is your workaround for recovery? 
My hunch would be to - stop all access to the cluster; - set nodown and noout so that other OSDs don't mark spinning OSDs down (which would cause all sorts of primary and PG reassignments, useless backfill/recovery when mon osd down out interval expires, etc.); - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 so that at least *between* PGs, the OSD has a chance to respond to heartbeats and do whatever else it needs to do; - let the snap trim play itself out over several hours (days?). That sounds utterly awful, but if anyone has a better idea (other than wait until the patch is merged), I'd be all ears. Cheers Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
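Florian's recovery hunch would roughly translate to the following admin commands. This is a sketch, not a tested procedure; the flag names (`nodown`, `noout`) and `injectargs` are the standard ceph CLI, but the sleep value is the illustrative one from the message:

```shell
# keep other OSDs from marking the trimming OSDs down/out
ceph osd set nodown
ceph osd set noout

# throttle trimming between PGs so heartbeats still get through
ceph tell osd.* injectargs '--osd_snap_trim_sleep 10'

# once trimming has played itself out, clear the flags again
ceph osd unset nodown
ceph osd unset noout
```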
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi, September 18 2014 9:03 PM, Florian Haas flor...@hastexo.com wrote: [earlier quoted exchange trimmed; it duplicates the preceding message verbatim] So what is your workaround for recovery? 
[quoted workaround suggestion trimmed] What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said. Cheers, Dan
Re: snap_trimming + backfilling is inefficient with many purged_snaps
-- Dan van der Ster || Data Storage Services || CERN IT Department -- September 18 2014 9:12 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: [earlier quoted exchange trimmed; it duplicates the preceding messages verbatim] 
[quoted workaround discussion trimmed] Two other more risky work-arounds that I didn't try yet are: 1. lower the osd_snap_trim_thread_timeout from 3600s to something like 10 or 20s, so that these long trim operations are just killed. I have no idea
Re: radosgw-admin list users?
On Thu, Sep 18, 2014 at 10:38:19AM -0700, Yehuda Sadeh wrote: On Thu, Sep 18, 2014 at 10:27 AM, Robin H. Johnson robb...@gentoo.org wrote: Related to this thread, radosgw-admin doesn't seem to have anything to list the users. The closest I have as a hack is: rados ls --pool=.users.uid |sed 's,.buckets$,,g' |sort |uniq Try: $ radosgw-admin metadata list user Ooh nice! Nothing in the --help output says 'metadata list' takes arguments (and the manpage for radosgw-admin doesn't even have the metadata commands). -- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead E-Mail : robb...@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: [earlier quoted exchange trimmed; it duplicates the preceding messages verbatim] 
[quoted workaround discussion trimmed] So just to clarify, what you're doing is out of the OSDs that are spinning, you mark 2 out and wait for them to go empty? What I'm seeing in my environment is that the OSDs *do* go down. Marking them out seems not to help much as the problem then promptly pops up elsewhere. So, disaster is a pretty good description. Would anyone from the
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, 19 Sep 2014, Florian Haas wrote: Hi Sage, was the off-list reply intentional? Whoops! Nope :) On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil sw...@redhat.com wrote: So, disaster is a pretty good description. Would anyone from the core team like to suggest another course of action or workaround, or are Dan and I generally on the right track to make the best out of a pretty bad situation? The short term fix would probably be to just prevent backfill for the time being until the bug is fixed. As in, osd max backfills = 0? Yeah :) Just managed to reproduce the problem... sage The root of the problem seems to be that it is trying to trim snaps that aren't there. I'm trying to reproduce the issue now! Hopefully the fix is simple... http://tracker.ceph.com/issues/9487 Thanks! sage Thanks. :) Cheers, Florian
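Sage's short-term mitigation (hold off backfill until the bug is fixed) would look something like this. Illustrative sketch only; `osd_max_backfills` is the standard option name, and the original default should be restored once the fix lands:

```shell
# stop new backfills cluster-wide while the snap-trim bug is outstanding
ceph tell osd.* injectargs '--osd-max-backfills 0'

# later, restore the default (1 in this era of Ceph)
ceph tell osd.* injectargs '--osd-max-backfills 1'
```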
RE: severe librbd performance degradation in Giant
I also observed performance degradation on my full SSD setup; I got ~270K IOPS for 4KB random read with 0.80.4, but with latest master I only got ~12K IOPS. Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Friday, September 19, 2014 2:03 AM To: Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Alexandre, What tool are you using ? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache = true is set in the client conf file. FYI, firefly librbd + librados against a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the giant librbd (if you have multiple copies around, which was the case for me) to reproduce it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without the cache. firefly : around 12000 iops (with or without rbd_cache) giant : around 12000 iops (with or without rbd_cache) (and I can reach around 2-3 iops on giant with disabling optracker). 
rbd_cache only improves write performance for me (4k block). [remainder of quoted thread trimmed; it duplicates the earlier messages verbatim]
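For anyone trying to reproduce these numbers, a 4K random-read job against librbd via fio's rbd ioengine looks roughly like this. The pool and image names are placeholders, and the queue depth is illustrative, not what the posters used:

```ini
; illustrative fio job file for the rbd ioengine
[rbd-4k-randread]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
rw=randread
bs=4k
iodepth=32
runtime=60
time_based=1
```

Run with `fio rbd-4k-randread.fio`; whether the client-side rbd cache is in play is controlled by the `[client]` section of the local ceph.conf, so both configurations can be compared with the same job file.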
RE: severe librbd performance degradation in Giant
My bad, with latest master we got ~120K IOPS. Cheers, xinxin [quoted thread trimmed; it duplicates the preceding message verbatim]
Re: radosgw-admin list users?
Thanks Robin and Yehuda, but I want to know how to delete multiple users. I use 'radosgw-admin metadata list user' to list all users, and found some users have unreadable characters: radosgw-admin metadata list user [ zzm1, ?zzm1, ?zzm1] and I can't delete these unreadable users: radosgw-admin user rm --uid=?zzm1 could not remove user: unable to remove user, user does not exist So I want to know: does radosgw-admin have a command to delete multiple or all users? Thanks. On Sep 19, 2014, at 4:01 AM, Robin H. Johnson robb...@gentoo.org wrote: [quoted exchange trimmed; it duplicates the preceding message verbatim]
Re: [PATCH 1/3] libceph: reference counting pagelist
On Tue, 16 Sep 2014, Yan, Zheng wrote:

this allow pagelist to present data that may be sent multiple times.

Signed-off-by: Yan, Zheng <z...@redhat.com>

Reviewed-by: Sage Weil <s...@redhat.com>

---
 fs/ceph/mds_client.c          | 1 -
 include/linux/ceph/pagelist.h | 5 ++++-
 net/ceph/messenger.c          | 4 +---
 net/ceph/pagelist.c           | 8 ++++++--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index a17fc49..30d7338 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2796,7 +2796,6 @@ fail:
 	mutex_unlock(&session->s_mutex);
 fail_nomsg:
 	ceph_pagelist_release(pagelist);
-	kfree(pagelist);
 fail_nopagelist:
 	pr_err("error %d preparing reconnect for mds%d\n", err, mds);
 	return;
diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h
index 9660d6b..5f871d8 100644
--- a/include/linux/ceph/pagelist.h
+++ b/include/linux/ceph/pagelist.h
@@ -2,6 +2,7 @@
 #define __FS_CEPH_PAGELIST_H
 
 #include <linux/list.h>
+#include <linux/atomic.h>
 
 struct ceph_pagelist {
 	struct list_head head;
@@ -10,6 +11,7 @@ struct ceph_pagelist {
 	size_t room;
 	struct list_head free_list;
 	size_t num_pages_free;
+	atomic_t refcnt;
 };
 
 struct ceph_pagelist_cursor {
@@ -26,9 +28,10 @@ static inline void ceph_pagelist_init(struct ceph_pagelist *pl)
 	pl->room = 0;
 	INIT_LIST_HEAD(&pl->free_list);
 	pl->num_pages_free = 0;
+	atomic_set(&pl->refcnt, 1);
 }
 
-extern int ceph_pagelist_release(struct ceph_pagelist *pl);
+extern void ceph_pagelist_release(struct ceph_pagelist *pl);
 
 extern int ceph_pagelist_append(struct ceph_pagelist *pl,
 				const void *d, size_t l);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e7d9411..9764c77 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3071,10 +3071,8 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 		return;
 
 	WARN_ON(!list_empty(&data->links));
-	if (data->type == CEPH_MSG_DATA_PAGELIST) {
+	if (data->type == CEPH_MSG_DATA_PAGELIST)
 		ceph_pagelist_release(data->pagelist);
-		kfree(data->pagelist);
-	}
 	kmem_cache_free(ceph_msg_data_cache, data);
 }
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 92866be..f70b651 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -1,5 +1,6 @@
 #include <linux/module.h>
 #include <linux/gfp.h>
+#include <linux/slab.h>
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
 #include <linux/ceph/pagelist.h>
@@ -13,8 +14,10 @@ static void ceph_pagelist_unmap_tail(struct ceph_pagelist *pl)
 	}
 }
 
-int ceph_pagelist_release(struct ceph_pagelist *pl)
+void ceph_pagelist_release(struct ceph_pagelist *pl)
 {
+	if (!atomic_dec_and_test(&pl->refcnt))
+		return;
 	ceph_pagelist_unmap_tail(pl);
 	while (!list_empty(&pl->head)) {
 		struct page *page = list_first_entry(&pl->head, struct page,
@@ -23,7 +26,8 @@ int ceph_pagelist_release(struct ceph_pagelist *pl)
 		__free_page(page);
 	}
 	ceph_pagelist_free_reserve(pl);
-	return 0;
+	kfree(pl);
+	return;
 }
 EXPORT_SYMBOL(ceph_pagelist_release);
--
1.9.3
Re: [PATCH 1/3] libceph: reference counting pagelist
On Tue, 16 Sep 2014, Yan, Zheng wrote:
this allow pagelist to present data that may be sent multiple times.

Hmm, actually we probably should use the kref code for this, even though the refcounting is trivial.

sage
Re: [PATCH 2/3] ceph: use pagelist to present MDS request data
On Tue, 16 Sep 2014, Yan, Zheng wrote:
Current code uses page array to present MDS request data. Pages in the array are allocated/freed by caller of ceph_mdsc_do_request(). If request is interrupted, the pages can be freed while they are still being used by the request message. The fix is use pagelist to present MDS request data. Pagelist is reference counted.

Signed-off-by: Yan, Zheng z...@redhat.com

So much nicer!

Reviewed-by: Sage Weil s...@redhat.com

---
 fs/ceph/mds_client.c | 14 +++++++++-----
 fs/ceph/mds_client.h |  4 +---
 fs/ceph/xattr.c      | 46 ++++++++-----------------------------
 3 files changed, 26 insertions(+), 38 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 30d7338..80d9f07 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -542,6 +542,8 @@ void ceph_mdsc_release_request(struct kref *kref)
 	}
 	kfree(req->r_path1);
 	kfree(req->r_path2);
+	if (req->r_pagelist)
+		ceph_pagelist_release(req->r_pagelist);
 	put_request_session(req);
 	ceph_unreserve_caps(req->r_mdsc, &req->r_caps_reservation);
 	kfree(req);
@@ -1847,13 +1849,15 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	msg->front.iov_len = p - msg->front.iov_base;
 	msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);

-	if (req->r_data_len) {
-		/* outbound data set only by ceph_sync_setxattr() */
-		BUG_ON(!req->r_pages);
-		ceph_msg_data_add_pages(msg, req->r_pages, req->r_data_len, 0);
+	if (req->r_pagelist) {
+		struct ceph_pagelist *pagelist = req->r_pagelist;
+		atomic_inc(&pagelist->refcnt);
+		ceph_msg_data_add_pagelist(msg, pagelist);
+		msg->hdr.data_len = cpu_to_le32(pagelist->length);
+	} else {
+		msg->hdr.data_len = 0;
 	}

-	msg->hdr.data_len = cpu_to_le32(req->r_data_len);
 	msg->hdr.data_off = cpu_to_le16(0);

 out_free2:
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index e00737c..23015f7 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -202,9 +202,7 @@ struct ceph_mds_request {
 	bool r_direct_is_hash;  /* true if r_direct_hash is valid */

 	/* data payload is used for xattr ops */
-	struct page **r_pages;
-	int r_num_pages;
-	int r_data_len;
+	struct ceph_pagelist *r_pagelist;

 	/* what caps shall we drop? */
 	int r_inode_drop, r_inode_unless;
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index eab3e2f..c7b18b2 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1,4 +1,5 @@
 #include <linux/ceph/ceph_debug.h>
+#include <linux/ceph/pagelist.h>

 #include "super.h"
 #include "mds_client.h"
@@ -852,28 +853,17 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = fsc->mdsc;
 	int err;
-	int i, nr_pages;
-	struct page **pages = NULL;
-	void *kaddr;
-
-	/* copy value into some pages */
-	nr_pages = calc_pages_for(0, size);
-	if (nr_pages) {
-		pages = kmalloc(sizeof(pages[0])*nr_pages, GFP_NOFS);
-		if (!pages)
-			return -ENOMEM;
-		err = -ENOMEM;
-		for (i = 0; i < nr_pages; i++) {
-			pages[i] = __page_cache_alloc(GFP_NOFS);
-			if (!pages[i]) {
-				nr_pages = i;
-				goto out;
-			}
-			kaddr = kmap(pages[i]);
-			memcpy(kaddr, value + i*PAGE_CACHE_SIZE,
-			       min(PAGE_CACHE_SIZE, size-i*PAGE_CACHE_SIZE));
-		}
-	}
+	struct ceph_pagelist *pagelist;
+
+	/* copy value into pagelist */
+	pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
+	if (!pagelist)
+		return -ENOMEM;
+
+	ceph_pagelist_init(pagelist);
+	err = ceph_pagelist_append(pagelist, value, size);
+	if (err)
+		goto out;

 	dout("setxattr value=%.*s\n", (int)size, value);

@@ -894,9 +884,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	req->r_args.setxattr.flags = cpu_to_le32(flags);
 	req->r_path2 = kstrdup(name, GFP_NOFS);

-	req->r_pages = pages;
-	req->r_num_pages = nr_pages;
-	req->r_data_len = size;
+	req->r_pagelist = pagelist;
+	pagelist = NULL;

 	dout("xattr.ver (before): %lld\n", ci->i_xattrs.version);
 	err = ceph_mdsc_do_request(mdsc, NULL, req);
@@ -904,11 +893,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	dout("xattr.ver (after): %lld\n", ci->i_xattrs.version);

 out:
-	if (pages) {
-		for (i = 0; i < nr_pages; i++)
-			__free_page(pages[i]);
-		kfree(pages);
-	}
+	if
Re: [PATCH 3/3] ceph: include the initial ACL in create/mkdir/mknod MDS requests
On Tue, 16 Sep 2014, Yan, Zheng wrote:
Current code set new file/directory's initial ACL in a non-atomic manner. Client first sends request to MDS to create new file/directory, then set the initial ACL after the new file/directory is successfully created. The fix is include the initial ACL in create/mkdir/mknod MDS requests. So MDS can handle creating file/directory and setting the initial ACL in one request.

Signed-off-by: Yan, Zheng z...@redhat.com

Reviewed-by: Sage Weil s...@redhat.com

---
 fs/ceph/acl.c   | 125 +++++++++++++++++++++++++++++++++++++++---------
 fs/ceph/dir.c   |  41 +++++++++++++++-
 fs/ceph/file.c  |  27 ++++++-----
 fs/ceph/super.h |  24 ++++-----
 4 files changed, 170 insertions(+), 47 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index cebf2eb..5bd853b 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -169,36 +169,109 @@ out:
 	return ret;
 }

-int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode *dir)
+int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
+		       struct ceph_acls_info *info)
 {
-	struct posix_acl *default_acl, *acl;
-	umode_t new_mode = inode->i_mode;
-	int error;
-
-	error = posix_acl_create(dir, &new_mode, &default_acl, &acl);
-	if (error)
-		return error;
-
-	if (!default_acl && !acl) {
-		cache_no_acl(inode);
-		if (new_mode != inode->i_mode) {
-			struct iattr newattrs = {
-				.ia_mode = new_mode,
-				.ia_valid = ATTR_MODE,
-			};
-			error = ceph_setattr(dentry, &newattrs);
+	struct posix_acl *acl, *default_acl;
+	size_t val_size1 = 0, val_size2 = 0;
+	struct ceph_pagelist *pagelist = NULL;
+	void *tmp_buf = NULL;
+	int err;
+
+	err = posix_acl_create(dir, mode, &default_acl, &acl);
+	if (err)
+		return err;
+
+	if (acl) {
+		int ret = posix_acl_equiv_mode(acl, mode);
+		if (ret < 0)
+			goto out_err;
+		if (ret == 0) {
+			posix_acl_release(acl);
+			acl = NULL;
 		}
-		return error;
 	}

-	if (default_acl) {
-		error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
-		posix_acl_release(default_acl);
-	}
+	if (!default_acl && !acl)
+		return 0;
+
+	if (acl)
+		val_size1 = posix_acl_xattr_size(acl->a_count);
+	if (default_acl)
+		val_size2 = posix_acl_xattr_size(default_acl->a_count);
+
+	err = -ENOMEM;
+	tmp_buf = kmalloc(max(val_size1, val_size2), GFP_NOFS);
+	if (!tmp_buf)
+		goto out_err;
+	pagelist = kmalloc(sizeof(struct ceph_pagelist), GFP_NOFS);
+	if (!pagelist)
+		goto out_err;
+	ceph_pagelist_init(pagelist);
+
+	err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
+	if (err)
+		goto out_err;
+
+	ceph_pagelist_encode_32(pagelist, acl && default_acl ? 2 : 1);
+
 	if (acl) {
-		if (!error)
-			error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-		posix_acl_release(acl);
+		size_t len = strlen(POSIX_ACL_XATTR_ACCESS);
+		err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
+		if (err)
+			goto out_err;
+		ceph_pagelist_encode_string(pagelist, POSIX_ACL_XATTR_ACCESS,
+					    len);
+		err = posix_acl_to_xattr(&init_user_ns, acl,
+					 tmp_buf, val_size1);
+		if (err < 0)
+			goto out_err;
+		ceph_pagelist_encode_32(pagelist, val_size1);
+		ceph_pagelist_append(pagelist, tmp_buf, val_size1);
 	}
-	return error;
+	if (default_acl) {
+		size_t len = strlen(POSIX_ACL_XATTR_DEFAULT);
+		err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
+		if (err)
+			goto out_err;
+		err = ceph_pagelist_encode_string(pagelist,
+						  POSIX_ACL_XATTR_DEFAULT, len);
+		err = posix_acl_to_xattr(&init_user_ns, default_acl,
+					 tmp_buf, val_size2);
+		if (err < 0)
+			goto out_err;
+		ceph_pagelist_encode_32(pagelist, val_size2);
+		ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+	}
+
+	kfree(tmp_buf);
+
+	info->acl = acl;
+	info->default_acl = default_acl;
+	info->pagelist = pagelist;
+	return 0;
+
+out_err:
+	posix_acl_release(acl);
+	posix_acl_release(default_acl);
+	kfree(tmp_buf);
+	if (pagelist)
Re: [PATCH] ceph: move ceph_find_inode() outside the s_mutex
On Wed, 17 Sep 2014, Yan, Zheng wrote:
ceph_find_inode() may wait on freeing inode, using it inside the s_mutex may cause deadlock. (the freeing inode is waiting for OSD read reply, but dispatch thread is blocked by the s_mutex)

Signed-off-by: Yan, Zheng z...@redhat.com

Reviewed-by: Sage Weil s...@redhat.com

---
 fs/ceph/caps.c       | 11 ++++++-----
 fs/ceph/mds_client.c |  7 ++++---
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6d1cd45..b3b0a91 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3045,6 +3045,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		}
 	}

+	/* lookup ino */
+	inode = ceph_find_inode(sb, vino);
+	ci = ceph_inode(inode);
+	dout(" op %s ino %llx.%llx inode %p\n", ceph_cap_op_name(op), vino.ino,
+	     vino.snap, inode);
+
 	mutex_lock(&session->s_mutex);
 	session->s_seq++;
 	dout(" mds%d seq %lld cap seq %u\n", session->s_mds, session->s_seq,
@@ -3053,11 +3059,6 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	if (op == CEPH_CAP_OP_IMPORT)
 		ceph_add_cap_releases(mdsc, session);

-	/* lookup ino */
-	inode = ceph_find_inode(sb, vino);
-	ci = ceph_inode(inode);
-	dout(" op %s ino %llx.%llx inode %p\n", ceph_cap_op_name(op), vino.ino,
-	     vino.snap, inode);
-
 	if (!inode) {
 		dout(" i don't have ino %llx\n", vino.ino);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 80d9f07..c27e204 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2947,14 +2947,15 @@ static void handle_lease(struct ceph_mds_client *mdsc,
 	if (dname.len != get_unaligned_le32(h+1))
 		goto bad;

-	mutex_lock(&session->s_mutex);
-	session->s_seq++;
-
 	/* lookup inode */
 	inode = ceph_find_inode(sb, vino);
 	dout("handle_lease %s, ino %llx %p %.*s\n",
 	     ceph_lease_op_name(h->action), vino.ino, inode,
 	     dname.len, dname.name);
+
+	mutex_lock(&session->s_mutex);
+	session->s_seq++;
+
 	if (inode == NULL) {
 		dout("handle_lease no inode %llx\n", vino.ino);
 		goto release;
--
1.9.3
Re: Fwd: S3 API Compatibility support
Hi Sage,

Could you please advise whether Ceph supports low-cost object storage (like Amazon Glacier or RRS) for archiving objects such as log files?

Thanks
Swami

On Thu, Sep 18, 2014 at 6:20 PM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Hi,

Could you please check and clarify the questions below on the object lifecycle and notification S3 APIs?

1. To support the bucket lifecycle, we need to support moving/deleting objects/buckets based on lifecycle settings. For example, an object lifecycle might be set as:
   1. Archive it after 10 days - i.e. move the object to low-cost object storage 10 days after its creation date.
   2. Remove it after 90 days - i.e. delete the object from the low-cost storage 90 days after its creation date.
Q1 - Does Ceph support the above concept of moving objects to low-cost storage and deleting them from that storage?

2. To support object notifications, there should first be a low-cost, high-availability storage class with a single replica only. An object created in this type of storage could be lost; if an object of this storage type is lost, a notification should be raised.
Q2 - Does Ceph support a low-cost, high-availability storage type?

Thanks

On Fri, Sep 12, 2014 at 8:00 PM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Hi Yehuda,

Could you please check and clarify the same questions above on the object lifecycle and notification S3 APIs?

Thanks
Swami

On Tue, Jul 29, 2014 at 1:35 AM, Yehuda Sadeh yeh...@redhat.com wrote:
Bucket lifecycle: http://tracker.ceph.com/issues/8929
Bucket notification: http://tracker.ceph.com/issues/8956

On Sun, Jul 27, 2014 at 12:54 AM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Good to know the details. Can you please share the issue ID for bucket lifecycle? My team could also start helping here. Regarding the notification - do we have an issue ID? Yes, object versioning will be a backlog item - I strongly feel we should start working on this asap.

Thanks
Swami

On Fri, Jul 25, 2014 at 11:31 PM, Yehuda Sadeh yeh...@redhat.com wrote:
On Fri, Jul 25, 2014 at 10:14 AM, M Ranga Swami Reddy swamire...@gmail.com wrote:
Thanks for the quick reply. Yes, versioned objects are missing in Ceph ATM. I am looking for S3 API support for: bucket lifecycle (get/put/delete), bucket location, put object notification, and object restore (i.e. versioned objects). Please let me know if any of the above work is in progress or if someone has planned to work on it.

I opened an issue for bucket lifecycle (we already had an issue open for object expiration, though). We do have bucket location already (part of the multi-region feature). Object versioning is definitely on our backlog and one that we'll hopefully implement sooner rather than later. With regard to object notification, it'll require having a notification service, which is a bit out of scope. Integrating the gateway with such a service wouldn't be hard, but we'll need to have that first.

Yehuda

On Fri, Jul 25, 2014 at 9:19 PM, Sage Weil sw...@redhat.com wrote:
On Fri, 25 Jul 2014, M Ranga Swami Reddy wrote:
Hi Team:
As per the Ceph documentation, a few S3 APIs are not supported. Link: http://ceph.com/docs/master/radosgw/s3/ Is there a plan to support the unsupported items in the above table, or is anyone working on this?

Yes. Unfortunately this table isn't particularly detailed or accurate or up to date. The main gap, I think, is versioned objects. Are there specific parts of the S3 API that are missing that you need? That sort of info is very helpful for prioritizing effort...

sage
Re: Fwd: S3 API Compatibility support
On Fri, 19 Sep 2014, M Ranga Swami Reddy wrote:
Hi Sage,
Could you please advise whether Ceph supports low-cost object storage (like Amazon Glacier or RRS) for archiving objects such as log files?

Ceph doesn't interact at all with AWS services like Glacier, if that's what you mean.

For RRS, though, I assume you mean the ability to create buckets with reduced redundancy with radosgw? That is supported, although not quite the way AWS does it. You can create different pools that back RGW buckets, and each bucket is stored in one of those pools. So you could make one of them 2x instead of 3x, or use an erasure code of your choice.

What isn't currently supported is the ability to reduce the redundancy of individual objects in a bucket. I don't think there is anything architecturally preventing that, but it is not implemented or supported. When we look at the S3 archival features in more detail (soon!) I'm sure this will come up! The current plan is to address object versioning first. That is, unless a developer surfaces who wants to start hacking on this right away...

sage
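For reference, the reduced-redundancy setup Sage describes can be sketched with the standard Ceph CLI. The pool names below are made up for illustration, and the RGW placement configuration step is only outlined, since its exact syntax varies by release:

```
# A 2x-replicated pool instead of the default 3x:
ceph osd pool create rgw-reduced 128 128
ceph osd pool set rgw-reduced size 2

# Or an erasure-coded pool (e.g. k=2, m=1):
ceph osd erasure-code-profile set ec21 k=2 m=1
ceph osd pool create rgw-ec 128 128 erasure ec21

# Point an RGW placement target at the pool by editing the zone's
# placement_pools, then buckets created under that placement use it:
radosgw-admin zone get > zone.json
#   ... edit placement_pools in zone.json to reference rgw-reduced or rgw-ec ...
radosgw-admin zone set < zone.json
```

The redundancy choice is per pool, so it applies to every object in buckets backed by that pool, matching Sage's point that per-object redundancy is not supported.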