Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
> ----- Original Message -----
>> From: "Alexandre DERUMIER"
>> To: "ceph-devel"
>> Cc: "qemu-devel", jdur...@redhat.com
>> Sent: Monday, November 9, 2015 5:48:45 AM
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
>>
>> Adding to ceph.conf
>>
>> [client]
>> rbd_non_blocking_aio = false
>>
>> fixes the problem for me (with rbd_cache=false).
>>
>> (@cc jdur...@redhat.com)

+1, same for me.

Stefan

>> ----- Original Message -----
>> From: "Denis V. Lunev"
>> To: "aderumier", "ceph-devel", "qemu-devel"
>> Sent: Monday, November 9, 2015 08:22:34
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
>>
>> On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
>>> On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
>>>> Hi,
>>>>
>>>> with qemu (2.4.1), if I do an internal snapshot of an rbd device and
>>>> then pause the vm with vm_stop, the qemu process hangs forever.
>>>>
>>>> Monitor commands to reproduce:
>>>>
>>>> # snapshot_blkdev_internal drive-virtio0 yoursnapname
>>>> # stop
>>>>
>>>> I don't see this with the qcow2 or sheepdog block drivers, for example.
>>>>
>>>> Regards,
>>>> Alexandre
>>>
>>> This looks like the problem I have recently been trying to
>>> fix with dataplane enabled. The patch series is named
>>>
>>> [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
>>>
>>> Den
>>
>> Anyway, even if the above does not help, can you collect gdb
>> traces from all threads in the QEMU process? Maybe I'll be
>> able to give a hint.
>>
>> Den
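For the gdb traces requested above, a minimal non-interactive way to dump a
backtrace of every thread in a running QEMU process (assuming gdb is
installed and the binary is named qemu-system-x86_64) would be:

# gdb -p $(pidof qemu-system-x86_64) -batch -ex 'thread apply all bt' > qemu-threads.txt

This attaches, prints one backtrace per thread, detaches, and leaves the
output in qemu-threads.txt; having qemu and librbd debug symbols installed
makes the traces far more useful.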
Re: Ceph Hackathon: More Memory Allocator Testing
On 19.08.2015 at 22:34, Somnath Roy wrote:
> But, you said you need to remove libcmalloc *not* libtcmalloc...
> I saw librbd/librados is built with libcmalloc, not with libtcmalloc..
> So, are you saying to remove libtcmalloc (not libcmalloc) to enable jemalloc?

Ouch, my mistake. I read libtcmalloc - too late here.

My build (Hammer) says:

# ldd /usr/lib/librados.so.2.0.0
        linux-vdso.so.1 => (0x7fff4f71d000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fafdb26c000)
        libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fafdb24f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fafdb032000)
        libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fafda924000)
        libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fafda71f000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fafda516000)
        libboost_system.so.1.49.0 => /usr/lib/libboost_system.so.1.49.0 (0x7fafda512000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fafda20b000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fafd9f88000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fafd9bfd000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fafd99e7000)
        /lib64/ld-linux-x86-64.so.2 (0x56358ecfe000)

Only ceph-osd is linked against libjemalloc for me.

Stefan

-----Original Message-----
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, August 19, 2015 1:31 PM
To: Somnath Roy; Alexandre DERUMIER; Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

On 19.08.2015 at 22:29, Somnath Roy wrote:
> Hmm... We need to fix that as part of configure/Makefile I guess (?)..
> Since we have done this jemalloc integration originally, we can take that
> ownership unless anybody sees a problem with enabling tcmalloc/jemalloc
> for librbd/librados.
>
>> You have to remove libcmalloc out of your build environment to get this done

How do I do that? I am using Ubuntu and can't afford to remove libc* packages.

I always use a chroot to build packages where only a minimal bootstrap + the
build deps are installed. googleperftools, where libtcmalloc comes from, is
not Ubuntu core/minimal.

Stefan

Thanks & Regards
Somnath

-----Original Message-----
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, August 19, 2015 1:18 PM
To: Somnath Roy; Alexandre DERUMIER; Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

On 19.08.2015 at 22:16, Somnath Roy wrote:
> Alexandre,
> I am not able to build librados/librbd by using the following config option:
>
> ./configure --without-tcmalloc --with-jemalloc

Same issue for me. You have to remove libcmalloc out of your build
environment to get this done.

Stefan

> It seems it is building osd/mon/mds/rgw with jemalloc enabled..
>
> root@emsnode10:~/ceph-latest/src# ldd ./ceph-osd
>         linux-vdso.so.1 => (0x7ffd0eb43000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f5f92d70000)
>         ...
>
> root@emsnode10:~/ceph-latest/src/.libs# ldd ./librados.so.2.0.0
>         linux-vdso.so.1 => (0x7ffed46f2000)
>         libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x7ff687887000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7ff68763d000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7ff687438000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7ff68721a000)
>         libnss3.so => /usr/lib/x86_64-linux-gnu/libnss3.so (0x7ff686ee0000)
>         libsmime3.so => /usr/lib/x86_64-linux-gnu/libsmime3.so (0x7ff686cb3000)
>         libnspr4.so => /usr/lib/x86_64-linux-gnu/libnspr4.so (0x7ff686a76000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7ff686871000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7ff686668000)
>         libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x7ff686464000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7ff686160000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7ff685e59000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7ff685a94000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7ff68587e000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7ff685663000)
>         liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7ff68545c000)
>         liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7ff685255000)
>         /lib64/ld-linux-x86-64.so.2 (0x7ff68a0f6000)
>         libnssutil3.so => /usr/lib/x86_64-linux-gnu/libnssutil3.so (0x7ff685029000)
>         libplc4.so => /usr/lib/x86_64-linux-gnu/libplc4.so (0x7ff684e24000)
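For reference, the minimal-chroot build Stefan describes could look roughly
like this on Debian/Ubuntu (the suite name and target path are illustrative,
and the exact build-dependency list is left out):

# debootstrap wheezy /srv/ceph-build http://deb.debian.org/debian
# chroot /srv/ceph-build
# apt-get install build-essential libjemalloc-dev   (plus the ceph build deps, but no libgoogle-perftools-dev)
# ./configure --without-tcmalloc --with-jemalloc

The point is simply that tcmalloc never exists inside the build environment,
so configure cannot silently pick it up.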
Re: Ceph Hackathon: More Memory Allocator Testing
On 19.08.2015 at 22:29, Somnath Roy wrote:
> Hmm... We need to fix that as part of configure/Makefile I guess (?)..
> Since we have done this jemalloc integration originally, we can take that
> ownership unless anybody sees a problem with enabling tcmalloc/jemalloc
> for librbd/librados.
>
>> You have to remove libcmalloc out of your build environment to get this done

How do I do that? I am using Ubuntu and can't afford to remove libc* packages.

I always use a chroot to build packages where only a minimal bootstrap + the
build deps are installed. googleperftools, where libtcmalloc comes from, is
not Ubuntu core/minimal.

Stefan

Thanks & Regards
Somnath

-----Original Message-----
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, August 19, 2015 1:18 PM
To: Somnath Roy; Alexandre DERUMIER; Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

On 19.08.2015 at 22:16, Somnath Roy wrote:
> Alexandre,
> I am not able to build librados/librbd by using the following config option:
>
> ./configure --without-tcmalloc --with-jemalloc

Same issue for me. You have to remove libcmalloc out of your build
environment to get this done.

Stefan

> It seems it is building osd/mon/mds/rgw with jemalloc enabled..
>
> root@emsnode10:~/ceph-latest/src# ldd ./ceph-osd
>         linux-vdso.so.1 => (0x7ffd0eb43000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f5f92d70000)
>         ...
>
> root@emsnode10:~/ceph-latest/src/.libs# ldd ./librados.so.2.0.0
>         linux-vdso.so.1 => (0x7ffed46f2000)
>         libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x7ff687887000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7ff68763d000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7ff687438000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7ff68721a000)
>         libnss3.so => /usr/lib/x86_64-linux-gnu/libnss3.so (0x7ff686ee0000)
>         libsmime3.so => /usr/lib/x86_64-linux-gnu/libsmime3.so (0x7ff686cb3000)
>         libnspr4.so => /usr/lib/x86_64-linux-gnu/libnspr4.so (0x7ff686a76000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7ff686871000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7ff686668000)
>         libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x7ff686464000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7ff686160000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7ff685e59000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7ff685a94000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7ff68587e000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7ff685663000)
>         liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7ff68545c000)
>         liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7ff685255000)
>         /lib64/ld-linux-x86-64.so.2 (0x7ff68a0f6000)
>         libnssutil3.so => /usr/lib/x86_64-linux-gnu/libnssutil3.so (0x7ff685029000)
>         libplc4.so => /usr/lib/x86_64-linux-gnu/libplc4.so (0x7ff684e24000)
>         libplds4.so => /usr/lib/x86_64-linux-gnu/libplds4.so (0x7ff684c20000)
>
> It is building with libcmalloc always...
> Did you change the ceph makefiles to build librbd/librados with jemalloc?
>
> Thanks & Regards
> Somnath

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER
Sent: Wednesday, August 19, 2015 7:01 AM
To: Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

Thanks Mark,

Results match exactly what I have seen with tcmalloc 2.1 vs 2.4 vs jemalloc.
And indeed tcmalloc, even with a bigger cache, seems to decrease over time.

What is funny is that I see exactly the same behaviour on the client librbd
side, with qemu and multiple iothreads. Switching both server and client to
jemalloc gives me the best performance on small reads currently.

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: ceph-devel ceph-devel@vger.kernel.org
Sent: Wednesday, August 19, 2015 06:45:36
Subject: Ceph Hackathon: More Memory Allocator Testing

Hi Everyone,

One of the goals at the Ceph Hackathon last week was to examine how to
improve Ceph small IO performance. Jian Zhang presented findings showing a
dramatic improvement in small random IO performance when Ceph is used with
jemalloc. His results build upon Sandisk's original findings that the
default thread cache values are a major bottleneck in TCMalloc 2.1. To
further verify these results, we sat down at the Hackathon and configured
the new performance test cluster that Intel generously donated to the Ceph
community laboratory to run through a variety of tests with different
memory allocator configurations.
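A quick way to check which allocator a freshly built binary or library
actually pulls in (the same ldd approach used above, just filtered) is:

# ldd src/.libs/librados.so.2.0.0 | grep -E 'tcmalloc|jemalloc'

No output means neither allocator is linked and plain glibc malloc is used.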
Re: Ceph Hackathon: More Memory Allocator Testing
On 19.08.2015 at 22:16, Somnath Roy wrote:
> Alexandre,
> I am not able to build librados/librbd by using the following config option:
>
> ./configure --without-tcmalloc --with-jemalloc

Same issue for me. You have to remove libcmalloc out of your build
environment to get this done.

Stefan

> It seems it is building osd/mon/mds/rgw with jemalloc enabled..
>
> root@emsnode10:~/ceph-latest/src# ldd ./ceph-osd
>         linux-vdso.so.1 => (0x7ffd0eb43000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f5f92d70000)
>         ...
>
> root@emsnode10:~/ceph-latest/src/.libs# ldd ./librados.so.2.0.0
>         linux-vdso.so.1 => (0x7ffed46f2000)
>         libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x7ff687887000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7ff68763d000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7ff687438000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7ff68721a000)
>         libnss3.so => /usr/lib/x86_64-linux-gnu/libnss3.so (0x7ff686ee0000)
>         libsmime3.so => /usr/lib/x86_64-linux-gnu/libsmime3.so (0x7ff686cb3000)
>         libnspr4.so => /usr/lib/x86_64-linux-gnu/libnspr4.so (0x7ff686a76000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7ff686871000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7ff686668000)
>         libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x7ff686464000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7ff686160000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7ff685e59000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7ff685a94000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7ff68587e000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7ff685663000)
>         liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7ff68545c000)
>         liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7ff685255000)
>         /lib64/ld-linux-x86-64.so.2 (0x7ff68a0f6000)
>         libnssutil3.so => /usr/lib/x86_64-linux-gnu/libnssutil3.so (0x7ff685029000)
>         libplc4.so => /usr/lib/x86_64-linux-gnu/libplc4.so (0x7ff684e24000)
>         libplds4.so => /usr/lib/x86_64-linux-gnu/libplds4.so (0x7ff684c20000)
>
> It is building with libcmalloc always...
> Did you change the ceph makefiles to build librbd/librados with jemalloc?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER
> Sent: Wednesday, August 19, 2015 7:01 AM
> To: Mark Nelson
> Cc: ceph-devel
> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>
> Thanks Mark,
>
> Results match exactly what I have seen with tcmalloc 2.1 vs 2.4 vs jemalloc.
> And indeed tcmalloc, even with a bigger cache, seems to decrease over time.
>
> What is funny is that I see exactly the same behaviour on the client librbd
> side, with qemu and multiple iothreads. Switching both server and client to
> jemalloc gives me the best performance on small reads currently.
>
> ----- Original Message -----
> From: Mark Nelson mnel...@redhat.com
> To: ceph-devel ceph-devel@vger.kernel.org
> Sent: Wednesday, August 19, 2015 06:45:36
> Subject: Ceph Hackathon: More Memory Allocator Testing
>
> Hi Everyone,
>
> One of the goals at the Ceph Hackathon last week was to examine how to
> improve Ceph small IO performance. Jian Zhang presented findings showing a
> dramatic improvement in small random IO performance when Ceph is used with
> jemalloc. His results build upon Sandisk's original findings that the
> default thread cache values are a major bottleneck in TCMalloc 2.1.
>
> To further verify these results, we sat down at the Hackathon and
> configured the new performance test cluster that Intel generously donated
> to the Ceph community laboratory to run through a variety of tests with
> different memory allocator configurations. I've since written the results
> of those tests up in pdf form for folks who are interested. The results
> are located here:
>
> http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf
>
> I want to be clear that many other folks have done the heavy lifting here.
> These results are simply a validation of the many tests that other folks
> have already done. Many thanks to Sandisk and others for figuring this out
> as it's a pretty big deal!
>
> Side note: Very little tuning other than swapping the memory allocator and
> a couple of quick and dirty ceph tunables were set during these tests.
> It's quite possible that higher IOPS will be achieved as we really start
> digging into the cluster and learning what the bottlenecks are.
>
> Thanks,
> Mark
Re: Ceph Hackathon: More Memory Allocator Testing
Thanks for sharing. Do those tests use jemalloc for fio too? Otherwise librbd
on the client side is running with tcmalloc again.

Stefan

On 19.08.2015 at 06:45, Mark Nelson wrote:
> Hi Everyone,
>
> One of the goals at the Ceph Hackathon last week was to examine how to
> improve Ceph small IO performance. Jian Zhang presented findings showing a
> dramatic improvement in small random IO performance when Ceph is used with
> jemalloc. His results build upon Sandisk's original findings that the
> default thread cache values are a major bottleneck in TCMalloc 2.1.
>
> To further verify these results, we sat down at the Hackathon and
> configured the new performance test cluster that Intel generously donated
> to the Ceph community laboratory to run through a variety of tests with
> different memory allocator configurations. I've since written the results
> of those tests up in pdf form for folks who are interested. The results
> are located here:
>
> http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf
>
> I want to be clear that many other folks have done the heavy lifting here.
> These results are simply a validation of the many tests that other folks
> have already done. Many thanks to Sandisk and others for figuring this out
> as it's a pretty big deal!
>
> Side note: Very little tuning other than swapping the memory allocator and
> a couple of quick and dirty ceph tunables were set during these tests.
> It's quite possible that higher IOPS will be achieved as we really start
> digging into the cluster and learning what the bottlenecks are.
>
> Thanks,
> Mark
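One way to answer Stefan's question on the client side - without rebuilding
fio - is to preload jemalloc for the benchmark process (library path taken
from the ldd output earlier in this thread; the job file name is illustrative):

# LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 fio rbd-test.fio

With the preload in place, librbd inside fio allocates through jemalloc even
though the fio binary itself was linked without it.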
Re: [ceph-users] Is it safe to increase pg number in a production environment
We've done the splitting several times. The most important thing is to run a
ceph version which does not have the linger ops bug. That is: latest dumpling
release, giant and hammer. The latest firefly release still has this bug,
which results in wrong watchers and no working snapshots.

Stefan

On 04.08.2015 at 18:46, Samuel Just wrote:
> It will cause a large amount of data movement. Each new pg after the split
> will relocate. It might be ok if you do it slowly. Experiment on a test
> cluster.
> -Sam
>
> On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote:
>> Hi Cephers,
>>
>> This is a greeting from Jevon. Currently, I'm experiencing an issue which
>> troubles me a lot, so I'm writing to ask for your comments/help/suggestions.
>> More details are provided below.
>>
>> Issue:
>> I set up a cluster having 24 OSDs and created one pool with 1024 placement
>> groups on it for a small startup company. The number 1024 was calculated
>> per the equation 'OSDs * 100 / pool size'. The cluster has been running
>> quite well for a long time. But recently, our monitoring system always
>> complains that some disks' usage exceeds 85%. I log into the system and
>> find out that some disks' usage really is very high, but some are not
>> (less than 60%). Each time the issue happens, I have to manually
>> re-balance the distribution. This is a short-term solution; I'm not
>> willing to do it all the time.
>>
>> Two long-term solutions come to my mind:
>> 1) Ask the customers to expand their clusters by adding more OSDs. But I
>> think they will ask me to explain the reason for the imbalanced data
>> distribution. We've already done some analysis on the environment; we
>> learned that the most imbalanced part in the CRUSH is the mapping between
>> object and pg. The biggest pg has 613 objects, while the smallest pg only
>> has 226 objects.
>> 2) Increase the number of placement groups. It can be of great help for
>> statistically uniform data distribution, but it can also incur significant
>> data movement as PGs are effectively being split. I just cannot do it in
>> our customers' environment before we 100% understand the consequences.
>>
>> So has anyone done this in a production environment? How much does this
>> operation affect the performance of clients? Any comments/help/suggestions
>> will be highly appreciated.
>>
>> --
>> Best Regards
>> Jevon
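For completeness, the split itself is just two pool settings (pool name and
target count here are placeholders); pg_num creates the new PGs, and data
only starts moving once pgp_num is raised to match, so stepping both up in
small increments is one way to spread the load out:

# ceph osd pool set rbd pg_num 2048
# ceph osd pool set rbd pgp_num 2048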
Re: [ceph-users] Is it safe to increase pg number in a production environment
Hi,

On 04.08.2015 at 21:16, Ketor D wrote:
> Hi Stefan,
> Could you describe more about the linger ops bug?
> I'm running Firefly, which as you say still has this bug.

It will be fixed in the next firefly release. It's this one:
http://tracker.ceph.com/issues/9806

Stefan

> Thanks!
>
> On Wed, Aug 5, 2015 at 12:51 AM, Stefan Priebe s.pri...@profihost.ag wrote:
>> We've done the splitting several times. The most important thing is to
>> run a ceph version which does not have the linger ops bug. That is:
>> latest dumpling release, giant and hammer. The latest firefly release
>> still has this bug, which results in wrong watchers and no working
>> snapshots.
>>
>> Stefan
>>
>> On 04.08.2015 at 18:46, Samuel Just wrote:
>>> It will cause a large amount of data movement. Each new pg after the
>>> split will relocate. It might be ok if you do it slowly. Experiment on
>>> a test cluster.
>>> -Sam
>>>
>>> On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote:
>>>> Hi Cephers,
>>>>
>>>> This is a greeting from Jevon. Currently, I'm experiencing an issue
>>>> which troubles me a lot, so I'm writing to ask for your
>>>> comments/help/suggestions. More details are provided below.
>>>>
>>>> Issue:
>>>> I set up a cluster having 24 OSDs and created one pool with 1024
>>>> placement groups on it for a small startup company. The number 1024
>>>> was calculated per the equation 'OSDs * 100 / pool size'. The cluster
>>>> has been running quite well for a long time. But recently, our
>>>> monitoring system always complains that some disks' usage exceeds 85%.
>>>> I log into the system and find out that some disks' usage really is
>>>> very high, but some are not (less than 60%). Each time the issue
>>>> happens, I have to manually re-balance the distribution. This is a
>>>> short-term solution; I'm not willing to do it all the time.
>>>>
>>>> Two long-term solutions come to my mind:
>>>> 1) Ask the customers to expand their clusters by adding more OSDs. But
>>>> I think they will ask me to explain the reason for the imbalanced data
>>>> distribution. We've already done some analysis on the environment; we
>>>> learned that the most imbalanced part in the CRUSH is the mapping
>>>> between object and pg. The biggest pg has 613 objects, while the
>>>> smallest pg only has 226 objects.
>>>> 2) Increase the number of placement groups. It can be of great help
>>>> for statistically uniform data distribution, but it can also incur
>>>> significant data movement as PGs are effectively being split. I just
>>>> cannot do it in our customers' environment before we 100% understand
>>>> the consequences.
>>>>
>>>> So has anyone done this in a production environment? How much does
>>>> this operation affect the performance of clients? Any
>>>> comments/help/suggestions will be highly appreciated.
>>>>
>>>> --
>>>> Best Regards
>>>> Jevon
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 22:50, Josh Durgin wrote:
> Yes, I'm afraid it sounds like it is. You can double check whether the
> watch exists on an image by getting the id of the image from
> 'rbd info $pool/$image | grep block_name_prefix':
>
>     block_name_prefix: rbd_data.105674b0dc51
>
> The id is the hex number there. Append that to 'rbd_header.' and you have
> the header object name. Check whether it has watchers with:
>
>     rados listwatchers -p $pool rbd_header.105674b0dc51
>
> If that doesn't show any watchers while the image is in use by a vm,
> it's #9806.

Yes, it does not show any watchers.

> I just merged the backport for firefly, so it'll be in 0.80.11. Sorry it
> took so long to get to firefly :(. We'll need to be more vigilant about
> checking non-trivial backports when we're going through all the bugs
> periodically.

That would be really important.

I've seen that this one was already in upstream/firefly-backports. What's
the purpose of that branch?

Greets,
Stefan

> Josh
>
> On 07/21/2015 12:52 PM, Stefan Priebe wrote:
>> So this is really this old bug? http://tracker.ceph.com/issues/9806
>>
>> Stefan
>>
>> On 21.07.2015 at 21:46, Josh Durgin wrote:
>>> On 07/21/2015 12:22 PM, Stefan Priebe wrote:
>>>> On 21.07.2015 at 19:19, Jason Dillaman wrote:
>>>>> Does this still occur if you export the images to the console (i.e.
>>>>> rbd export cephstor/disk-116@snap - > dump_file)?
>>>>>
>>>>> Would it be possible for you to provide logs from the two rbd export
>>>>> runs on your smallest VM image? If so, please add the following to
>>>>> the [client] section of your ceph.conf:
>>>>>
>>>>> log file = /valid/path/to/logs/$name.$pid.log
>>>>> debug rbd = 20
>>>>>
>>>>> I opened a ticket [1] where you can attach the logs (if they aren't
>>>>> too large).
>>>>>
>>>>> [1] http://tracker.ceph.com/issues/12422
>>>>
>>>> Will post some more details to the tracker in a few hours. It seems it
>>>> is related to using discard inside the guest, but not on the FS the
>>>> osd is on.
>>>
>>> That sounds very odd. Could you verify via 'rados listwatchers' on an
>>> in-use rbd image's header object that there's still a watch established?
>>>
>>> Have you increased pgs in all those clusters recently?
>>>
>>> Josh
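Condensed, the whole watcher check is just two commands (pool and image
names are placeholders, and the header id comes from the first command's
output); an empty listwatchers result for an in-use image indicates the bug:

# rbd info rbd/disk-116 | grep block_name_prefix
# rados listwatchers -p rbd rbd_header.<id from above>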
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 16:32, Jason Dillaman wrote:
> Any chance that the snapshot was just created prior to the first export
> and you have a process actively writing to the image?

Sadly not. I executed those commands exactly as I posted them, manually at a
bash prompt. I can reproduce this on 5 different ceph clusters with 500 VMs
each.

Stefan
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 21:46, Josh Durgin wrote:
> On 07/21/2015 12:22 PM, Stefan Priebe wrote:
>> On 21.07.2015 at 19:19, Jason Dillaman wrote:
>>> Does this still occur if you export the images to the console (i.e.
>>> rbd export cephstor/disk-116@snap - > dump_file)?
>>>
>>> Would it be possible for you to provide logs from the two rbd export
>>> runs on your smallest VM image? If so, please add the following to the
>>> [client] section of your ceph.conf:
>>>
>>> log file = /valid/path/to/logs/$name.$pid.log
>>> debug rbd = 20
>>>
>>> I opened a ticket [1] where you can attach the logs (if they aren't too
>>> large).
>>>
>>> [1] http://tracker.ceph.com/issues/12422
>>
>> Will post some more details to the tracker in a few hours. It seems it
>> is related to using discard inside the guest, but not on the FS the osd
>> is on.
>
> That sounds very odd. Could you verify via 'rados listwatchers' on an
> in-use rbd image's header object that there's still a watch established?

How can I do this exactly?

> Have you increased pgs in all those clusters recently?

Yes, I bumped from 2048 to 4096 as I doubled the osds.

Stefan

> Josh
Re: upstream/firefly exporting the same snap 2 times results in different exports
So this is really this old bug? http://tracker.ceph.com/issues/9806

Stefan

On 21.07.2015 at 21:46, Josh Durgin wrote:
> On 07/21/2015 12:22 PM, Stefan Priebe wrote:
>> On 21.07.2015 at 19:19, Jason Dillaman wrote:
>>> Does this still occur if you export the images to the console (i.e.
>>> rbd export cephstor/disk-116@snap - > dump_file)?
>>>
>>> Would it be possible for you to provide logs from the two rbd export
>>> runs on your smallest VM image? If so, please add the following to the
>>> [client] section of your ceph.conf:
>>>
>>> log file = /valid/path/to/logs/$name.$pid.log
>>> debug rbd = 20
>>>
>>> I opened a ticket [1] where you can attach the logs (if they aren't too
>>> large).
>>>
>>> [1] http://tracker.ceph.com/issues/12422
>>
>> Will post some more details to the tracker in a few hours. It seems it
>> is related to using discard inside the guest, but not on the FS the osd
>> is on.
>
> That sounds very odd. Could you verify via 'rados listwatchers' on an
> in-use rbd image's header object that there's still a watch established?
>
> Have you increased pgs in all those clusters recently?
>
> Josh
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 19:19, Jason Dillaman wrote:
> Does this still occur if you export the images to the console (i.e.
> rbd export cephstor/disk-116@snap - > dump_file)?
>
> Would it be possible for you to provide logs from the two rbd export runs
> on your smallest VM image? If so, please add the following to the [client]
> section of your ceph.conf:
>
> log file = /valid/path/to/logs/$name.$pid.log
> debug rbd = 20
>
> I opened a ticket [1] where you can attach the logs (if they aren't too
> large).
>
> [1] http://tracker.ceph.com/issues/12422

Will post some more details to the tracker in a few hours. It seems it is
related to using discard inside the guest, but not on the FS the osd is on.

Stefan
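To rule out anything in the file-writing path, the console-export variant
Jason suggests can be hashed directly (image and snapshot names as in the
original report):

# rbd export cephstor/disk-116@snap - | md5sum

Running this twice against an unchanged snapshot should print the same
checksum both times.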
upstream/firefly exporting the same snap 2 times results in different exports
Hi,

I remember there was a bug in ceph before - not sure in which release -
where exporting the same rbd snap multiple times results in different raw
images. Currently running upstream/firefly, I'm seeing the same again:

# rbd export cephstor/disk-116@snap dump1
# sleep 10
# rbd export cephstor/disk-116@snap dump2
# md5sum -b dump*
b89198f118de59b3aa832db1bfddaf8f *dump1
f63ed9345ac2d5898483531e473772b1 *dump2

Can anybody help?

Greets,
Stefan
Re: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
On 07.07.2015 at 12:55, Shishir Gowda wrote:
> Hi Stefan,
> I tried with hammer, and with the google perf devel tools installed, and
> it still worked as expected.
> You can check in the .../ceph/src/ceph-osd directory to confirm you are
> checking the right binaries.

Strange - under Debian Wheezy it always links to tcmalloc. I've now removed
googleperftools and it works fine.

Stefan

> With regards,
> Shishir
>
> -----Original Message-----
> From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
> Sent: Tuesday, July 07, 2015 2:48 PM
> To: Shishir Gowda; ceph-devel@vger.kernel.org
> Subject: Re: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
>
> On 07.07.2015 at 09:56, Shishir Gowda wrote:
>> Hi Stefan,
>> I built it with ./configure --without-tcmalloc and --with-jemalloc, and
>> the resulting binaries are not linked with tcmalloc.
>
> It works for me if I remove the google perftools dev package. But if it
> is installed, hammer always builds against tcmalloc.
>
>> ldd src/ceph-osd
>>         linux-vdso.so.1 => (0x7fff2a5fe000)
>>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f99d1c7b000)
>>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7f99d1a79000)
>>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7f99d182b000)
>>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7f99d15dc000)
>>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f99d13be000)
>>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7f99d0cc1000)
>>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7f99d0abc000)
>>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f99d08b8000)
>>         libboost_thread.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0 (0x7f99d06a1000)
>>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f99d0499000)
>>         libboost_system.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x7f99d0295000)
>>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f99cff90000)
>>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f99cfc8a000)
>>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f99cfa74000)
>>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f99cf6ae000)
>>         /lib64/ld-linux-x86-64.so.2 (0x7f99d1ec2000)
>>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7f99cf4a8000)
>>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7f99cf28e000)
>>         liburcu-bp.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-bp.so.2 (0x7f99cf086000)
>>         liburcu-cds.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-cds.so.2 (0x7f99cee7f000)
>>
>> I tried it with upstream master; what branch are you using?
>>
>> With regards,
>> Shishir
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Stefan Priebe
> Sent: Friday, July 03, 2015 2:45 PM
> To: ceph-devel@vger.kernel.org
> Subject: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
>
> Hi,
>
> I'm trying to compile current hammer with jemalloc:
>
> configure .. --without-tcmalloc --with-jemalloc
>
> but the resulting ceph-osd is still linked against tcmalloc:
>
> ldd /usr/bin/ceph-osd
>         linux-vdso.so.1 => (0x7fffbf3b9000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7fc44bc25000)
>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7fc44ba23000)
>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7fc44b7d2000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc44b5b6000)
>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fc44aea8000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fc44aca2000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fc44aa9e000)
>         libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fc44aa81000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fc44a878000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fc44a571000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fc44a2ef000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fc44a0d8000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fc449d4d000)
>         /lib64/ld-linux-x86-64.so.2 (0x7fc44be65000)
>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7fc449b47000)
>         libtcmalloc.so.4 => /usr/lib/libtcmalloc.so.4 (0x7fc4498d4000)
>         libunwind.so.7 => /usr/lib/libunwind.so.7 (0x7fc4496bb000)
>
> Stefan
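If the goal is just to get tcmalloc out of the way on Debian/Ubuntu, the dev
package that drags it into the configure checks should be the google
perftools one - most likely named libgoogle-perftools-dev (an assumption
about the packaging; verify with 'apt-cache search perftools' first):

# apt-get remove --purge libgoogle-perftools-dev

After that, re-running ./configure --without-tcmalloc --with-jemalloc should
no longer find tcmalloc.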
Re: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
On 07.07.2015 at 09:56, Shishir Gowda wrote:
> Hi Stefan,
> I built it with ./configure --without-tcmalloc and --with-jemalloc, and
> the resulting binaries are not linked with tcmalloc.

It works for me if I remove the google perftools dev package. But if it is
installed, hammer always builds against tcmalloc.

Stefan

> ldd src/ceph-osd
>         linux-vdso.so.1 => (0x7fff2a5fe000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f99d1c7b000)
>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7f99d1a79000)
>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7f99d182b000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7f99d15dc000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f99d13be000)
>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7f99d0cc1000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7f99d0abc000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f99d08b8000)
>         libboost_thread.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0 (0x7f99d06a1000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f99d0499000)
>         libboost_system.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x7f99d0295000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f99cff90000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f99cfc8a000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f99cfa74000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f99cf6ae000)
>         /lib64/ld-linux-x86-64.so.2 (0x7f99d1ec2000)
>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7f99cf4a8000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7f99cf28e000)
>         liburcu-bp.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-bp.so.2 (0x7f99cf086000)
>         liburcu-cds.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-cds.so.2 (0x7f99cee7f000)
>
> I tried it with upstream master; what branch are you using?
>
> With regards,
> Shishir
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Stefan Priebe
> Sent: Friday, July 03, 2015 2:45 PM
> To: ceph-devel@vger.kernel.org
> Subject: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
>
> Hi,
>
> I'm trying to compile current hammer with jemalloc:
>
> configure .. --without-tcmalloc --with-jemalloc
>
> but the resulting ceph-osd is still linked against tcmalloc:
>
> ldd /usr/bin/ceph-osd
>         linux-vdso.so.1 => (0x7fffbf3b9000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7fc44bc25000)
>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7fc44ba23000)
>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7fc44b7d2000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc44b5b6000)
>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fc44aea8000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fc44aca2000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fc44aa9e000)
>         libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fc44aa81000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fc44a878000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fc44a571000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fc44a2ef000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fc44a0d8000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fc449d4d000)
>         /lib64/ld-linux-x86-64.so.2 (0x7fc44be65000)
>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7fc449b47000)
>         libtcmalloc.so.4 => /usr/lib/libtcmalloc.so.4 (0x7fc4498d4000)
>         libunwind.so.7 => /usr/lib/libunwind.so.7 (0x7fc4496bb000)
>
> Stefan
trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
Hi,

I'm trying to compile current hammer with jemalloc:

configure .. --without-tcmalloc --with-jemalloc

but the resulting ceph-osd is still linked against tcmalloc:

ldd /usr/bin/ceph-osd
        linux-vdso.so.1 => (0x7fffbf3b9000)
        libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7fc44bc25000)
        libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7fc44ba23000)
        libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7fc44b7d2000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc44b5b6000)
        libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fc44aea8000)
        libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fc44aca2000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fc44aa9e000)
        libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fc44aa81000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fc44a878000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fc44a571000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fc44a2ef000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fc44a0d8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fc449d4d000)
        /lib64/ld-linux-x86-64.so.2 (0x7fc44be65000)
        libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7fc449b47000)
        libtcmalloc.so.4 => /usr/lib/libtcmalloc.so.4 (0x7fc4498d4000)
        libunwind.so.7 => /usr/lib/libunwind.so.7 (0x7fc4496bb000)

Stefan
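When this happens, the autoconf log usually shows where tcmalloc sneaked
back in; a quick way to inspect it (a standard config.log from autoconf,
nothing ceph-specific) is:

# grep -i -B1 -A2 tcmalloc config.log

Together with the filtered ldd check, this helps tell whether configure
linked tcmalloc directly or whether it may have come in indirectly through
another library (note libleveldb is also in the dependency list above).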
Re: rbd_cache, limiting read on high iops around 40k
On 22.06.2015 at 09:08, Alexandre DERUMIER aderum...@odiso.com wrote:
>> Just an update, there seems to be no proper way to pass the iothread
>> parameter from openstack-nova (at least not in the Juno release). So a
>> default single iothread per VM is all we have. So in conclusion, a nova
>> instance's max iops on ceph rbd will be limited to 30-40K.
>
> Thanks for the update.
>
> For proxmox users, I have added the iothread option to the gui for
> proxmox 4.0

Can we make iothread the default? Does it also help for single disks or only
for multiple disks?

> and added jemalloc as the default memory allocator
>
> I have also sent a jemalloc patch to the qemu-devel mailing list:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html
> (Help is welcome to push it into qemu upstream!)

----- Original Message -----
From: pushpesh sharma pushpesh@gmail.com
To: aderumier aderum...@odiso.com
Cc: Somnath Roy somnath@sandisk.com, Irek Fasikhov malm...@gmail.com, ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com
Sent: Monday, June 22, 2015 07:58:47
Subject: Re: rbd_cache, limiting read on high iops around 40k

Just an update, there seems to be no proper way to pass the iothread
parameter from openstack-nova (at least not in the Juno release). So a
default single iothread per VM is all we have. So in conclusion, a nova
instance's max iops on ceph rbd will be limited to 30-40K.

On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
> Hi,
>
> some news about qemu with tcmalloc vs jemalloc.
>
> I'm testing with multiple disks (with iothreads) in 1 qemu guest. And
> while tcmalloc is a little faster than jemalloc, I have hit the
> tcmalloc::ThreadCache::ReleaseToCentralCache bug a lot of times.
> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
>
> With multiple disks, I'm around 200k iops with tcmalloc (before hitting
> the bug) and 350k iops with jemalloc. The problem is that when I hit the
> malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to
> restart qemu...

----- Original Message -----
From: pushpesh sharma pushpesh@gmail.com
To: aderumier aderum...@odiso.com
Cc: Somnath Roy somnath@sandisk.com, Irek Fasikhov malm...@gmail.com, ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com
Sent: Friday, June 12, 2015 08:58:21
Subject: Re: rbd_cache, limiting read on high iops around 40k

Thanks, posted the question in the openstack list. Hopefully I will get some
expert opinion.

On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
> Hi,
>
> here is a libvirt xml sample from the libvirt src (you need to define the
> iothreads number, then assign them in the disks). I don't use openstack,
> so I really don't know how it works with it.
>
> <domain type='qemu'>
>   <name>QEMUGuest1</name>
>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>   <memory unit='KiB'>219136</memory>
>   <currentMemory unit='KiB'>219136</currentMemory>
>   <vcpu placement='static'>2</vcpu>
>   <iothreads>2</iothreads>
>   <os>
>     <type arch='i686' machine='pc'>hvm</type>
>     <boot dev='hd'/>
>   </os>
>   <clock offset='utc'/>
>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>restart</on_reboot>
>   <on_crash>destroy</on_crash>
>   <devices>
>     <emulator>/usr/bin/qemu</emulator>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' iothread='1'/>
>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>       <target dev='vdb' bus='virtio'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>     </disk>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' iothread='2'/>
>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>       <target dev='vdc' bus='virtio'/>
>     </disk>
>     <controller type='usb' index='0'/>
>     <controller type='ide' index='0'/>
>     <controller type='pci' index='0' model='pci-root'/>
>     <memballoon model='none'/>
>   </devices>
> </domain>

----- Original Message -----
From: pushpesh sharma pushpesh@gmail.com
To: aderumier aderum...@odiso.com
Cc: Somnath Roy somnath@sandisk.com, Irek Fasikhov malm...@gmail.com, ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com
Sent: Friday, June 12, 2015 07:52:41
Subject: Re: rbd_cache, limiting read on high iops around 40k

Hi Alexandre,

I agree with your rationale of one iothread per disk. CPU consumed in IOwait
is pretty high in each VM. But I am not finding a way to set the same on a
nova instance. I am using openstack Juno with QEMU+KVM. As per the libvirt
documentation for setting iothreads, I can edit the domain xml directly and
achieve the same effect. However, in an openstack env the domain xml is
created by nova with some additional metadata, so editing the domain xml
using 'virsh edit' does not seem to work (I agree it is not a very cloud way
of doing things, but a hack). Changes made there vanish after saving them,
because libvirt validation fails on the same.

# virsh dumpxml instance-000000c5 > vm.xml
# virt-xml-validate vm.xml
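As an aside, the tcmalloc thread-cache increase Alexandre mentions is done
through an environment variable that libtcmalloc reads at process start;
the value shown here (128 MB) is purely illustrative:

# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 qemu-system-x86_64 ...

This is the same knob behind the "default thread cache values" discussion
in the memory allocator threads above.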
Re: Memstore performance improvements v0.90 vs v0.87
On 20.02.2015 at 17:03, Alexandre DERUMIER wrote:
>> http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/
>> It's possible that this could be having an effect on the results.
>
> Isn't auto numa balancing enabled by default since kernel 3.8?
>
> It can be checked with
>
> cat /proc/sys/kernel/numa_balancing

I have it disabled in the kernel due to many libc memory allocation failures
when enabled.

Stefan

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: Blair Bethwaite blair.bethwa...@gmail.com, James Page james.p...@ubuntu.com
Cc: ceph-devel ceph-devel@vger.kernel.org, Stephen L Blinick stephen.l.blin...@intel.com, Jay Vosburgh jay.vosbu...@canonical.com, Colin Ian King colin.k...@canonical.com, Patricia Gaughen patricia.gaug...@canonical.com, Leann Ogasawara leann.ogasaw...@canonical.com
Sent: Friday, February 20, 2015 16:38:02
Subject: Re: Memstore performance improvements v0.90 vs v0.87

I think paying attention to NUMA is good advice. One of the things that
apparently changed in RHEL7 is that they are now doing automatic NUMA tuning:

http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/

It's possible that this could be having an effect on the results.

Mark

On 02/20/2015 03:49 AM, Blair Bethwaite wrote:
> Hi James,
>
> Interesting results, but did you do any tests with a NUMA system? IIUC the
> original report was from a dual socket setup, and that'd presumably be the
> standard setup for most folks (both OSD server and client side).
>
> Cheers,
>
> On 20 February 2015 at 20:07, James Page james.p...@ubuntu.com wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Hi All
>>
>> The Ubuntu Kernel team have spent the last few weeks investigating the
>> apparent performance disparity between RHEL 7 and Ubuntu 14.04; we've
>> focussed efforts in a few ways (see below). All testing has been done
>> using the latest Firefly release.
>>
>> 1) Base network latency
>>
>> Jay Vosburgh looked at the base network latencies between RHEL 7 and
>> Ubuntu 14.04; under a default install, RHEL actually had slightly worse
>> latency than Ubuntu due to the default enablement of a firewall;
>> disabling this brought latency back inline between the two distributions:
>>
>> OS                    rtt min/avg/max/mdev
>> Ubuntu 14.04 (3.13)   0.013/0.016/0.018/0.005 ms
>> RHEL7 (3.10)          0.010/0.018/0.025/0.005 ms
>>
>> ...base network latency is pretty much the same. This testing was
>> performed on a matched pair of Dell Poweredge R610's, configured with a
>> single 4 core CPU and 8G of RAM.
>>
>> 2) Latency and performance in Ceph using Rados bench
>>
>> Colin King spent a number of days testing and analysing results using
>> rados bench against a single node ceph deployment, configured with a
>> single memory backed OSD, to see if we could reproduce the disparities
>> reported. He ran 120 second OSD benchmarks on RHEL 7 as well as Ubuntu
>> 14.04 LTS with a selection of kernels including 3.10 vanilla, 3.13.0-44
>> (release kernel), 3.16.0-30 (utopic HWE kernel), 3.18.0-12 (vivid HWE
>> kernel) and 3.19-rc6, with 1, 16 and 128 client threads. The data
>> collected is available at [0]. Each round of tests consisted of 15 runs,
>> from which we computed average latency, latency deviation and latency
>> distribution:
>>
>> 120 second x 1 thread
>>
>> Results all seem to cluster around 0.04-0.05ms, with RHEL 7 averaging
>> 0.044ms and recent Ubuntu kernels 0.036-0.037ms. The older 3.10 kernel
>> in RHEL 7 does have slightly higher average latency.
>>
>> 120 second x 16 threads
>>
>> Results all seem to cluster around 0.6-0.7ms. 3.19.0-rc6 had a couple of
>> 1.4ms outliers which pushed it out to be worse than RHEL 7.
>> On the whole, Ubuntu 3.10-3.18 kernels are better than RHEL 7 by ~0.1ms.
>> RHEL shows a far higher standard deviation, due to the bimodal latency
>> distribution, which to the casual observer may appear to be more jittery.
>>
>> 120 second x 128 threads
>>
>> Later kernels show somewhat less standard deviation than RHEL 7, so
>> perhaps less jitter in the stats than RHEL 7's 3.10 kernel. With this
>> many threads pounding the test, we get a wider spread of latencies and
>> it is hard to tell any kind of latency distribution pattern with just 15
>> rounds because of the large amount of latency jitter. All systems show a
>> latency of ~5ms. Taking into consideration the amount of jitter, we
>> think these results do not make much sense unless we repeat the tests
>> with, say, 100 samples.
>>
>> 3) Conclusion
>>
>> We have not been able to show any major anomalies in Ceph on Ubuntu
>> compared to RHEL 7 when using memstore. Our current hypothesis is that
>> one needs to run the OSD bench stressor many times to get a fair capture
>> of system latency stats. The reasons for this are:
>>
>> * Latencies are very low with memstore, so any small jitter in
>> scheduling etc will show up as a large distortion (as shown by the large
>> standard deviations in the samples).
>>
>> * When memstore is heavily utilized, memory pressure causes the system
>> to page heavily and so we are subject to the nature of perhaps delays on
>> paging that cause some
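For anyone wanting to rule automatic NUMA balancing in or out on their own
test box, the runtime switch is a one-liner (the standard mainline sysctl
path Alexandre quotes above):

# cat /proc/sys/kernel/numa_balancing
1
# echo 0 > /proc/sys/kernel/numa_balancing

or persistently via kernel.numa_balancing=0 in /etc/sysctl.conf.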
Re: speed decrease since firefly,giant,hammer the 2nd try
         [.] __pthread_mutex_unlock_usercnt
  0,56%  ceph-osd            [.] ceph::buffer::list::iterator::advance(int)
  0,44%  ceph-osd            [.] ceph::buffer::ptr::append(char const*, unsigned int)

Stefan

----- Original Message -----
From: Stefan Priebe s.pri...@profihost.ag
To: aderumier aderum...@odiso.com
Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, February 16, 2015 23:08:37
Subject: Re: speed decrease since firefly,giant,hammer the 2nd try

On 16.02.2015 at 23:02, Alexandre DERUMIER aderum...@odiso.com wrote:
>> This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s
>> while running dumpling...
>
> Is it for write only? Or do you see the same decrease for reads too?

Just tested write. This might be the result of the higher CPU load of the
ceph-osd processes under firefly: dumpling ~180% per process vs. firefly
~220%.

Stefan

----- Original Message -----
From: Stefan Priebe s.pri...@profihost.ag
To: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, February 16, 2015 22:22:01
Subject: Re: speed decrease since firefly,giant,hammer the 2nd try

I've now upgraded server side and client side to latest upstream/firefly.
This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while
running dumpling...

Greets,
Stefan

On 15.02.2015 at 19:40, Stefan Priebe wrote:
> Hi Mark,
>
> what's next? I have this test cluster only for 2 more days.
>
> Here are some perf details:
>
> dumpling:
>
>  12,65%  libc-2.13.so        [.] 0x79000
>   2,86%  libc-2.13.so        [.] malloc
>   2,80%  kvm                 [.] 0xb59c5
>   2,59%  libc-2.13.so        [.] free
>   2,35%  [kernel]            [k] __schedule
>   2,16%  [kernel]            [k] _raw_spin_lock
>   1,92%  [kernel]            [k] __switch_to
>   1,58%  [kernel]            [k] lapic_next_deadline
>   1,09%  [kernel]            [k] update_sd_lb_stats
>   1,08%  [kernel]            [k] _raw_spin_lock_irqsave
>   0,91%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
>   0,91%  libpthread-2.13.so  [.] pthread_mutex_trylock
>   0,87%  [kernel]            [k] resched_task
>   0,72%  [kernel]            [k] cpu_startup_entry
>   0,71%  librados.so.2.0.0   [.] crush_hash32_3
>   0,66%  [kernel]            [k] leave_mm
>   0,65%  librados.so.2.0.0   [.] Mutex::Lock(bool)
>   0,64%  [kernel]            [k] idle_cpu
>   0,62%  libpthread-2.13.so  [.] __pthread_mutex_unlock_usercnt
>   0,59%  [kernel]            [k] try_to_wake_up
>   0,56%  [kernel]            [k] wake_futex
>   0,50%  librados.so.2.0.0   [.] ceph::buffer::ptr::release()
>
> firefly:
>
>  12,56%  libc-2.13.so        [.] 0x7905d
>   2,82%  libc-2.13.so        [.] malloc
>   2,64%  libc-2.13.so        [.] free
>   2,61%  kvm                 [.] 0x34322f
>   2,33%  [kernel]            [k] __schedule
>   2,14%  [kernel]            [k] _raw_spin_lock
>   1,83%  [kernel]            [k] __switch_to
>   1,62%  [kernel]            [k] lapic_next_deadline
>   1,17%  [kernel]            [k] _raw_spin_lock_irqsave
>   1,09%  [kernel]            [k] update_sd_lb_stats
>   1,08%  libpthread-2.13.so  [.] pthread_mutex_trylock
>   0,85%  libpthread-2.13.so  [.] __pthread_mutex_unlock_usercnt
>   0,77%  [kernel]            [k] resched_task
>   0,74%  librbd.so.1.0.0     [.] 0x71b73
>   0,72%  librados.so.2.0.0   [.] Mutex::Lock(bool)
>   0,68%  librados.so.2.0.0   [.] crush_hash32_3
>   0,67%  [kernel]            [k] idle_cpu
>   0,65%  [kernel]            [k] leave_mm
>   0,65%  [kernel]            [k] cpu_startup_entry
>   0,59%  [kernel]            [k] try_to_wake_up
>   0,51%  librados.so.2.0.0   [.] ceph::buffer::ptr::release()
>   0,51%  [kernel]            [k] wake_futex
>
> Stefan
>
> On 11.02.2015 at 06:42, Stefan Priebe wrote:
>> On 11.02.2015 at 05:45, Mark Nelson wrote:
>>> On 02/10/2015 04:18 PM, Stefan Priebe wrote:
>>>> On 10.02.2015 at 22:38, Mark Nelson wrote:
>>>>> On 02/10/2015 03:11 PM, Stefan Priebe wrote:
>>>>>> mhm, I installed librbd1-dbg and librados2-dbg - but the output
>>>>>> still looks useless to me. Should I upload it somewhere?
>>>>>
>>>>> Meh, if it's all just symbols it's probably not that helpful.
>>>>>
>>>>> I've summarized your results here:
>>>>>
>>>>> 1 concurrent 4k write (libaio, direct=1, iodepth=1)
>>>>>
>>>>>             IOPS            Latency
>>>>>           wb on   wb off   wb on    wb off
>>>>> dumpling  10870   536      ~100us   ~2ms
>>>>> firefly   10350   525      ~100us   ~2ms
>>>>>
>>>>> So in single op tests dumpling and firefly are far closer. Now let's
>>>>> see each of these cases with iodepth=32 (still 1 thread for now).
>>>>>
>>>>> dumpling:
>>>>>
>>>>> file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>>>> 2.0.8
>>>>> Starting 1 thread
>>>>> Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s]
>>>>> file1: (groupid=0, jobs=1): err= 0: pid=3011
>>>>>   write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec
>>>>>     slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30
>>>>>     clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43
>>>>>      lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52
>>>>>     clat percentiles (usec):
>>>>>      |  1.00th=[ 1480],  5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672],
>>>>>      | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832],
>>>>>      | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128],
>>>>>      | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344],
>>>>>      | 99.99th=[ 7072]
>>>>>     bw (KB/s)  : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25
>>>>>     lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01
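For anyone wanting to reproduce these runs, a fio job file matching the
parameters quoted above (libaio, direct=1, 4k randwrite, iodepth=32; the
target device name is illustrative) would look like:

[global]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
runtime=30
time_based

[file1]
filename=/dev/vdb
iodepth=32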
firefly: librbd: reads contending for cache space can cause livelock
Hi,

is there any reason why this one is not merged into firefly yet?

http://tracker.ceph.com/issues/9854
librbd: reads contending for cache space can cause livelock

Stefan
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 16.02.2015 um 17:45 schrieb Alexandre DERUMIER: I also thinked about 1 thing fio-lirbd use the rbd_cache value from ceph.conf. and qemu change the value if cache=none or cache=writeback in qemu conf. So, verify that too. I'm thinked of this old bug with cache http://tracker.ceph.com/issues/9513 It was a bug in giant, but tracker said also dumpling and firefly (but no commit for them) But the original bug was http://tracker.ceph.com/issues/9854 and I'm not sure it's already released No it's not in latest firefly nor in latest dumpling. But it's in latest git for both. But it looks read related not write related - isn't it? Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: aderumier aderum...@odiso.com Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Lundi 16 Février 2015 15:50:56 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, Hi Alexandre, Am 16.02.2015 um 10:11 schrieb Alexandre DERUMIER: Hi Stefan, I could be interesting to see if you have the same speed decrease with fio-librbd on the host, without the qemu layer. the perf reports don't seem to be too much different. do you have the same cpu usage ? (check qemu process usage) the idea to use fio-librbd was very good. I cannot reproduce the behaviour using fio-rbd. I can just reproduce it with qemu. Very strange. So please ignore me for the moment. I'll try to dig deeper into it. Greets, Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Dimanche 15 Février 2015 19:40:45 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] 
ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704
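For reference, the cache interaction described above can be pinned down explicitly. A minimal sketch, assuming an image rbd/testvm (pool and image name are illustrative only): fio-librbd reads rbd_cache from the [client] section of ceph.conf, while qemu sets the value itself according to the drive's cache mode, overriding the config:

  # ceph.conf - only honoured by clients that read the config, e.g. fio-librbd
  [client]
  rbd cache = true        # equivalent of qemu cache=writeback
  # rbd cache = false     # equivalent of qemu cache=none

  # qemu decides on its own from the -drive cache= setting:
  # -drive file=rbd:rbd/testvm,format=raw,if=virtio,cache=writeback

So when comparing fio-librbd on the host against fio inside qemu, both places have to be checked to be sure the same cache mode is actually in effect.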
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 16.02.2015 um 16:36 schrieb Alexandre DERUMIER: What is you fio command line ? fio rbd or fio under qemu? do you test with numjobs 1 ? Both. (I think under qemu, you can use any numjobs value, as it's use only 1 thread, is equal to numjobs=1) numjobs 1 gives also under qemu better results. Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: aderumier aderum...@odiso.com Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Lundi 16 Février 2015 15:50:56 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, Hi Alexandre, Am 16.02.2015 um 10:11 schrieb Alexandre DERUMIER: Hi Stefan, I could be interesting to see if you have the same speed decrease with fio-librbd on the host, without the qemu layer. the perf reports don't seem to be too much different. do you have the same cpu usage ? (check qemu process usage) the idea to use fio-librbd was very good. I cannot reproduce the behaviour using fio-rbd. I can just reproduce it with qemu. Very strange. So please ignore me for the moment. I'll try to dig deeper into it. Greets, Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Dimanche 15 Février 2015 19:40:45 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. 
I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0
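To make these runs reproducible, a fio job file along these lines should match the parameters used throughout this thread (the device name is just an example for the rbd-backed disk inside the guest):

  [global]
  ioengine=libaio
  direct=1
  bs=4k
  rw=randwrite
  runtime=30
  time_based

  [file1]
  filename=/dev/sdb
  iodepth=32      # use iodepth=1 for the single-op latency case
  numjobs=1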
Re: speed decrease since firefly,giant,hammer the 2nd try
I've now upgraded server side and client side to latest upstream/firefly. This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while running dumpling... Greets, Stefan Am 15.02.2015 um 19:40 schrieb Stefan Priebe: Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0[.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0[.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0[.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0[.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0[.] Mutex::Lock(bool) 0,68% librados.so.2.0.0[.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0[.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 16.02.2015 um 23:02 schrieb Alexandre DERUMIER aderum...@odiso.com: This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while running dumpling... Is it for write only ? or do you see same decrease for read too Just tested write. This might be the result of higher CPU load of the ceph-osd processes under firefly. Dumpling 180% per process vs. firefly 220% Stefan ? - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Lundi 16 Février 2015 22:22:01 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try I've now upgraded server side and client side to latest upstream/firefly. This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while running dumpling... Greets, Stefan Am 15.02.2015 um 19:40 schrieb Stefan Priebe: Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued : total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s
Re: speed decrease since firefly,giant,hammer the 2nd try
Hi Mark, Hi Alexandre, Am 16.02.2015 um 10:11 schrieb Alexandre DERUMIER: Hi Stefan, I could be interesting to see if you have the same speed decrease with fio-librbd on the host, without the qemu layer. the perf reports don't seem to be too much different. do you have the same cpu usage ? (check qemu process usage) the idea to use fio-librbd was very good. I cannot reproduce the behaviour using fio-rbd. I can just reproduce it with qemu. Very strange. So please ignore me for the moment. I'll try to dig deeper into it. Greets, Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Dimanche 15 Février 2015 19:40:45 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued : total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079
Re: speed decrease since firefly,giant,hammer the 2nd try
Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0[.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0[.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0[.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0[.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0[.] Mutex::Lock(bool) 0,68% librados.so.2.0.0[.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0[.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079, merge=0/0, ticks=24/890120, in_queue=890064, util=98.73% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio
Re: speed decrease since firefly,giant,hammer the 2nd try
On 11.02.2015 at 08:44, Alexandre DERUMIER wrote:
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes. What's wrong with librbd / librados2 since firefly?
Maybe we could bisect this? Maybe testing intermediate librbd releases between dumpling and firefly, http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/ could give us a hint.
Yes, maybe. Sadly I currently have another problem on my newest cluster: strange kworker load I've never noticed before on any ceph system. All writes are hanging - while using the same kernel as everywhere.
Stefan
- Original message -
From: Stefan Priebe s.pri...@profihost.ag
To: ceph-devel ceph-devel@vger.kernel.org
Sent: Tuesday, February 10, 2015 19:55:26
Subject: speed decrease since firefly,giant,hammer the 2nd try
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Greets, Stefan
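A rough sketch of how such a bisect could look, reusing the library-swap procedure from the quoted mail (the ref and package file names are placeholders; the exact gitbuilder directory layout would need to be checked):

  # fetch an intermediate librbd/librados build between dumpling and firefly
  wget http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/SOME_REF/librbd1.deb
  wget http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/SOME_REF/librados2.deb
  dpkg-deb -x librbd1.deb tmp/ && dpkg-deb -x librados2.deb tmp/

  # stop qemu, swap the libraries, start qemu, rerun the same fio job
  cp -ra tmp/usr/lib/librados.so.2.0.0 /usr/lib/
  cp -ra tmp/usr/lib/librbd.so.1.0.0 /usr/lib/

Repeating this while halving the range of refs each time should isolate the commit range where the 23k -> 16k iop/s drop first appears.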
Re: speed decrease since firefly,giant,hammer the 2nd try
On 10.02.2015 at 20:05, Gregory Farnum wrote:
On Tue, Feb 10, 2015 at 10:55 AM, Stefan Priebe s.pri...@profihost.ag wrote:
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
We're all going to have the same questions now as we did last time, about what the cluster looks like, what the perfcounters are reporting on both versions of librados, etc.
I'll try to answer all your questions - not sure how easy this is.
6 nodes, each with:
- Single Intel E5-1650 v3
- 48GB RAM
- 4x 800GB Samsung SSD
- 2x 10Gbit/s bonded storage network
Client side:
- Dual Xeon E5
- 256GB RAM
- 2x 10Gbit/s bonded storage network
Regarding perf counters - I'm willing to run tests. Just tell me how.
Also, please give us the results from Giant rather than Firefly, for the reasons I mentioned previously.
As giant is not a long-term release and we have a support contract, it's not an option for me. Even though I tried hammer git master three days ago - same results.
Stefan
speed decrease since firefly,giant,hammer the 2nd try
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Greets, Stefan
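When swapping libraries like this, it is worth sanity-checking that qemu really picked up the copied version; two simple checks (a sketch - the binary may be named kvm or qemu-system-x86_64 depending on the distribution, and this assumes qemu was built with rbd support linked in):

  # confirm qemu resolves librbd/librados from /usr/lib at all
  ldd $(which qemu-system-x86_64) | grep -E 'librbd|librados'

  # confirm the installed file is really the one that was copied
  md5sum /usr/lib/librbd.so.1.0.0 firefly_0.80.8/usr/lib/librbd.so.1.0.0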
Re: speed decrease since firefly,giant,hammer the 2nd try
On 10.02.2015 at 20:10, Mark Nelson wrote:
On 02/10/2015 12:55 PM, Stefan Priebe wrote:
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Hi Stefan,
Just off the top of my head, some questions to investigate:
What happens to single op latencies?
How to test this?
Does enabling/disabling RBD cache have any effect?
I have it enabled on both through the qemu writeback setting.
How's CPU usage? (Does perf report show anything useful?) Can you get trace data?
I'm not familiar with trace or perf - what should I do exactly?
Stefan
Mark
Greets, Stefan
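As a concrete answer to the "how to test this" question above, a command-line version of the single-op test might look like this (run inside the guest against the rbd-backed device; /dev/sdb is an assumed name):

  fio --name=file1 --filename=/dev/sdb --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based

The clat/lat lines in the output then give the per-operation latency directly, which is what makes a dumpling-vs-firefly comparison meaningful at this depth.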
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 10.02.2015 um 21:36 schrieb Mark Nelson: On 02/10/2015 02:24 PM, Stefan Priebe wrote: Am 10.02.2015 um 20:40 schrieb Mark Nelson: On 02/10/2015 01:13 PM, Stefan Priebe wrote: Am 10.02.2015 um 20:10 schrieb Mark Nelson: On 02/10/2015 12:55 PM, Stefan Priebe wrote: Hello, last year in june i already reported this but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this will be fixed itself when hammer is released. Now i tried hammer an the results are bad as before. Since firefly librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why i still stick to dumpling. I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling. librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes - stopped qemu - cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/ - cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/ - start qemu same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes What's wrong with librbd / librados2 since firefly? Hi Stephen, Just off the top of my head, some questions to investigate: What happens to single op latencies? How to test this? try your random 4k write test using libaio, direct IO, and iodepth=1. Actually it would be interesting to know how it is with higher IO depths as well (I assume this is what you are doing now?) Basically I want to know if single-op latency changes and whether or not it gets hidden or exaggerated with lots of concurrent IO. dumpling: ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/85224K /s] [0 /21.4K iops] [eta 00m:00s] ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/79064K /s] [0 /19.8K iops] [eta 00m:00s] firefly: ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/55781K /s] [0 /15.4K iops] [eta 00m:00s] ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46055K /s] [0 /11.6K iops] [eta 00m:00s] Sorry, please do this with only 1 thread. If you can include the latency results too that would be great. Sorry here again. 
Cache on: dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/42892K /s] [0 /10.8K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3203 write: io=1273.1MB, bw=43483KB/s, iops=10870 , runt= 30001msec slat (usec): min=5 , max=183 , avg= 8.99, stdev= 1.78 clat (usec): min=0 , max=6378 , avg=81.15, stdev=44.09 lat (usec): min=59 , max=6390 , avg=90.35, stdev=44.22 clat percentiles (usec): | 1.00th=[ 59], 5.00th=[ 62], 10.00th=[ 64], 20.00th=[ 66], | 30.00th=[ 69], 40.00th=[ 71], 50.00th=[ 74], 60.00th=[ 80], | 70.00th=[ 87], 80.00th=[ 95], 90.00th=[ 105], 95.00th=[ 114], | 99.00th=[ 135], 99.50th=[ 145], 99.90th=[ 179], 99.95th=[ 237], | 99.99th=[ 2320] bw (KB/s) : min=36176, max=46816, per=99.96%, avg=43465.49, stdev=2169.33 lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.01%, 100=85.24% lat (usec) : 250=14.71%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% cpu : usr=2.95%, sys=12.29%, ctx=329519, majf=0, minf=133 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% issued: total=r=0/w=326130/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=1273.1MB, aggrb=43482KB/s, minb=43482KB/s, maxb=43482KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/325241, merge=0/0, ticks=8/24624, in_queue=24492, util=81.64% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/44588K /s] [0 /11.2K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=2904 write: io=1212.1MB, bw=41401KB/s, iops=10350 , runt= 30001msec slat (usec): min=5 , max=464 , avg= 8.95, stdev= 2.34 clat (usec): min=0 , max=4410 , avg=85.81, stdev=41.82 lat (usec): min=59 , max=4418 , avg=94.96, stdev=41.97 clat percentiles (usec): | 1.00th=[ 59], 5.00th=[ 63], 10.00th=[ 65], 20.00th=[ 68], | 30.00th=[ 72], 40.00th=[ 76], 50.00th=[ 80], 60.00th=[ 85], | 70.00th=[ 94], 80.00th=[ 102], 90.00th=[ 112], 95.00th=[ 122], | 99.00th=[ 145], 99.50th=[ 155], 99.90th=[ 189], 99.95th=[ 239], | 99.99th=[ 2192
Re: speed decrease since firefly,giant,hammer the 2nd try
On 10.02.2015 at 20:40, Mark Nelson wrote:
On 02/10/2015 01:13 PM, Stefan Priebe wrote:
On 10.02.2015 at 20:10, Mark Nelson wrote:
On 02/10/2015 12:55 PM, Stefan Priebe wrote:
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Hi Stefan,
Just off the top of my head, some questions to investigate:
What happens to single op latencies?
How to test this?
Try your random 4k write test using libaio, direct IO, and iodepth=1. Actually it would be interesting to know how it is with higher IO depths as well (I assume this is what you are doing now?) Basically I want to know if single-op latency changes and whether or not it gets hidden or exaggerated with lots of concurrent IO.
dumpling:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/85224K /s] [0 /21.4K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/79064K /s] [0 /19.8K iops] [eta 00m:00s]
firefly:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/55781K /s] [0 /15.4K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46055K /s] [0 /11.6K iops] [eta 00m:00s]
Does enabling/disabling RBD cache have any effect?
I have it enabled on both through the qemu writeback setting.
It'd be great if you could do the above test both with WB RBD cache and with it turned off.
Test with cache off:
dumpling:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/85111K /s] [0 /21.3K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/88984K /s] [0 /22.3K iops] [eta 00m:00s]
firefly:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46479K /s] [0 /11.7K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46019K /s] [0 /11.6K iops] [eta 00m:00s]
How's CPU usage? (Does perf report show anything useful?) Can you get trace data?
I'm not familiar with trace or perf - what should I do exactly?
You may need extra packages. Basically on the VM host, during the test with each library you'd do:
sudo perf record -a -g dwarf -F 99 (ctrl+c after a while)
sudo perf report --stdio > foo.txt
If you are on a kernel that doesn't have libunwind support:
sudo perf record -a -g (ctrl+c after a while)
sudo perf report --stdio > foo.txt
Then look and see what's different. This may not catch anything though.
Don't have unwind. Output is only full of hex values.
Stefan
You should also try Greg's suggestion of looking at the performance counters to see if any interesting differences show up between the runs.
Where / how do I check?
Stefan
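On the "where / how to check" question: the counters Greg refers to are exposed through the ceph admin socket. A sketch, with socket paths that depend on the local setup:

  # on an OSD node: dump the counters of osd.0
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

  # client side (librbd/librados inside qemu): enable an admin socket
  # first in ceph.conf, then run the same "perf dump" against it
  [client]
  admin socket = /var/run/ceph/$cluster-$name.$pid.asok

Dumping the counters once per run under each library version would show whether the extra latency accumulates client side or on the OSDs.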
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079, merge=0/0, ticks=24/890120, in_queue=890064, util=98.73% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/69096K /s] [0 /17.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=2982 write: io=1784.9MB, bw=60918KB/s, iops=15229 , runt= 30002msec slat (usec): min=1 , max=1389 , avg= 3.43, stdev= 5.32 clat (usec): min=117 , max=8235 , avg=2096.88, stdev=396.30 lat (usec): min=540 , max=8258 , avg=2100.43, stdev=396.61 clat percentiles (usec): | 1.00th=[ 1608], 5.00th=[ 1720], 10.00th=[ 1768], 20.00th=[ 1832], | 30.00th=[ 1896], 40.00th=[ 1944], 50.00th=[ 2008], 60.00th=[ 2064], | 70.00th=[ 2160], 80.00th=[ 2256], 90.00th=[ 2512], 95.00th=[ 2896], | 99.00th=[ 3600], 99.50th=[ 3792], 99.90th=[ 5088], 99.95th=[ 6304], | 99.99th=[ 6752] bw (KB/s) : min=36717, max=73712, per=99.94%, avg=60879.92, stdev=8302.27 lat (usec) : 250=0.01%, 750=0.01% lat (msec) : 2=48.56%, 4=51.18%, 10=0.26% cpu : usr=2.03%, sys=5.48%, ctx=20440, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=456918/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=1784.9MB, aggrb=60918KB/s, minb=60918KB/s, maxb=60918KB/s, 
mint=30002msec, maxt=30002msec Disk stats (read/write): sdb: ios=166/455574, merge=0/0, ticks=12/897748, in_queue=897696, util=98.96% Ok, so it looks like as you increase concurrency the effect increases (ie contention?). Does the same thing happen without cache enabled? here again without rbd cache: dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/83488K /s] [0 /20.9K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3000 write: io=2449.2MB, bw=83583KB/s, iops=20895 , runt= 30005msec slat (usec): min=1 , max=975 , avg= 4.50, stdev= 5.25 clat (usec): min=364 , max=80566 , avg=1525.87, stdev=1194.57 lat (usec): min=519 , max=80568 , avg=1530.51, stdev=1194.44 clat percentiles (usec): | 1.00th=[ 660], 5.00th=[ 780], 10.00th=[ 876], 20.00th=[ 1032], | 30.00th=[ 1144], 40.00th=[ 1240], 50.00th=[ 1304], 60.00th=[ 1384], | 70.00th=[ 1480], 80.00th=[ 1640], 90.00th=[ 2096], 95.00th=[ 2960], | 99.00th=[ 6816], 99.50th=[ 7840
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079, merge=0/0, ticks=24/890120, in_queue=890064, util=98.73% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/69096K /s] [0 /17.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=2982 write: io=1784.9MB, bw=60918KB/s, iops=15229 , runt= 30002msec slat (usec): min=1 , max=1389 , avg= 3.43, stdev= 5.32 clat (usec): min=117 , max=8235 , avg=2096.88, stdev=396.30 lat (usec): min=540 , max=8258 , avg=2100.43, stdev=396.61 clat percentiles (usec): | 1.00th=[ 1608], 5.00th=[ 1720], 10.00th=[ 1768], 20.00th=[ 1832], | 30.00th=[ 1896], 40.00th=[ 1944], 50.00th=[ 2008], 60.00th=[ 2064], | 70.00th=[ 2160], 80.00th=[ 2256], 90.00th=[ 2512], 95.00th=[ 2896], | 99.00th=[ 3600], 99.50th=[ 3792], 99.90th=[ 5088], 99.95th=[ 6304], | 99.99th=[ 6752] bw (KB/s) : min=36717, max=73712, per=99.94%, avg=60879.92, stdev=8302.27 lat (usec) : 250=0.01%, 750=0.01% lat (msec) : 2=48.56%, 4=51.18%, 10=0.26% cpu : usr=2.03%, sys=5.48%, ctx=20440, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=456918/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=1784.9MB, aggrb=60918KB/s, minb=60918KB/s, maxb=60918KB/s, mint=30002msec, maxt=30002msec Disk stats (read/write): sdb: ios=166/455574, merge=0/0, 
ticks=12/897748, in_queue=897696, util=98.96% Stefan Mark Stefan
Re: new dev cluster - using giant or hammer git?
On 06.02.2015 at 15:06, Sage Weil s...@newdream.net wrote:
On Fri, 6 Feb 2015, Stefan Priebe - Profihost AG wrote:
Hi, for deploying a new ceph dev cluster, can anybody recommend which git branch to use? hammer or giant-backport?
Hi Stefan! If it's dev I'd recommend hammer. If all of your clients will be new I'd also recommend 'ceph osd crush tunables hammer' as there is a new and improved crush bucket type.
Hi Sage, thanks - is that good for a test cluster too? What would you recommend for something which can crash but where I don't want to lose data?
Stefan
sage
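For completeness, a short sketch of applying and verifying what Sage suggests (only safe once all clients are new enough to understand the profile):

  ceph osd crush tunables hammer    # switch to the new profile
  ceph osd crush show-tunables      # check what the cluster currently uses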
new dev cluster - using giant or hammer git?
Hi, for deploying a new ceph dev cluster, can anybody recommend which git branch to use? hammer or giant-backport?
--
Kind regards
Stefan Priebe
Bachelor of Science in Computer Science (BSCS)
Vorstand (CTO)
---
Profihost AG
Expo Plaza 1
30539 Hannover
Deutschland
Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com
Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
Re: 10 times higher disk load with btrfs
Hi,
On 06.01.2015 at 04:44, Alexandre DERUMIER wrote:
Hi Stefan, do you see a difference if you force filestore journal writeahead for btrfs instead of parallel?
filestore journal writeahead = 1
filestore journal parallel = 0
I already tested filestore btrfs snap = false, which automatically disabled the parallel write.
Stefan
- Original message -
From: Stefan Priebe s.pri...@profihost.ag
To: Mark Nelson mnel...@redhat.com, Sage Weil s...@newdream.net
Cc: ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, January 5, 2015 21:33:22
Subject: Re: 10 times higher disk load with btrfs
On 05.01.2015 at 21:29, Mark Nelson wrote:
On 01/05/2015 02:20 PM, Stefan Priebe wrote:
Hi Sage,
On 05.01.2015 at 20:25, Sage Weil wrote:
On Mon, 5 Jan 2015, Stefan Priebe wrote:
On 05.01.2015 at 19:36, Stefan Priebe wrote:
Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 osds. So if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10?
OK, this seems to happen because ceph is creating a new subvolume / snapshot every 5s. Is this really expected / needed?
You can disable it with filestore btrfs snap = false
I'm curious how much this drops the load down; originally the snaps were no more expensive than a regular sync, but perhaps this has changed...
- with XFS the average write is at 9MB/s
- with btrfs (filestore_btrfs_snap=true) write is at 40MB/s
- with btrfs (filestore_btrfs_snap=false) write is at 20MB/s
Is that the average and not the spikes? It looks like before the spikes were 20MB/s and 190MB/s?
Yes, these are average values. Spikes:
- with XFS the spike write is at 20MB/s
- with btrfs (filestore_btrfs_snap=true) the spike write is 200MB/s
- with btrfs (filestore_btrfs_snap=false) the spike is still 185MB/s but the avg is halved (20MB/s), see above
Stefan
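Put together as a ceph.conf fragment, the two variants discussed here would look like this (a sketch; the OSDs need a restart for the change to take effect):

  [osd]
  # Alexandre's suggestion: force write-ahead journaling on btrfs
  filestore journal writeahead = 1
  filestore journal parallel = 0

  # what Stefan tested: disable the 5s snapshots, which on its own
  # already falls back from parallel to write-ahead journaling
  # filestore btrfs snap = false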
Re: 10 times higher disk load with btrfs
Hi Sage,
On 05.01.2015 at 20:25, Sage Weil wrote:
On Mon, 5 Jan 2015, Stefan Priebe wrote:
On 05.01.2015 at 19:36, Stefan Priebe wrote:
Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 osds. So if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10?
OK, this seems to happen because ceph is creating a new subvolume / snapshot every 5s. Is this really expected / needed?
You can disable it with filestore btrfs snap = false
I'm curious how much this drops the load down; originally the snaps were no more expensive than a regular sync, but perhaps this has changed...
- with XFS the average write is at 9MB/s
- with btrfs (filestore_btrfs_snap=true) write is at 40MB/s
- with btrfs (filestore_btrfs_snap=false) write is at 20MB/s
Stefan
Re: 10 times higher disk load with btrfs
Am 05.01.2015 um 21:29 schrieb Mark Nelson: On 01/05/2015 02:20 PM, Stefan Priebe wrote: Hi Sage, Am 05.01.2015 um 20:25 schrieb Sage Weil: On Mon, 5 Jan 2015, Stefan Priebe wrote: Am 05.01.2015 um 19:36 schrieb Stefan Priebe: Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 OSDs, so if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10? OK, this seems to happen because ceph creates a new subvolume / snap every 5s. Is this really expected / needed? You can disable it with filestore btrfs snap = false I'm curious how much this drops the load down; originally the snaps were no more expensive than a regular sync, but perhaps this has changed... - with XFS the average write is at 9MB/s - with btrfs (filestore_btrfs_snap=true) write is at 40MB/s - with btrfs (filestore_btrfs_snap=false) write is at 20MB/s Is that the average and not the spikes? It looks like before the spikes were 20MB/s and 190MB/s? Yes, these are average values. Spikes: - with XFS the spike write is at 20MB/s - with btrfs (filestore_btrfs_snap=true) spike write is 200MB/s - with btrfs (filestore_btrfs_snap=false) spike is still 185MB/s but avg is 1/2 (20MB/s), see above Stefan
10 times higher disk load with btrfs
Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 OSDs, so if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10? I'm running dumpling. Greets Stefan
Re: 10 times higher disk load with btrfs
Am 05.01.2015 um 19:36 schrieb Stefan Priebe: Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 OSDs, so if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10? OK, this seems to happen because ceph creates a new subvolume / snap every 5s. Is this really expected / needed? Stefan I'm running dumpling. Greets Stefan
Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
Am 02.01.2015 um 17:49 schrieb Samuel Just: Odd, sounds like it might be rbd client side? -Sam That one was already on the list: https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg19091.html Sadly there was no result, as it went unseen for 2 weeks and I didn't have the test equipment anymore. Greets, Stefan On Thu, Jan 1, 2015 at 1:30 AM, Stefan Priebe s.pri...@profihost.ag wrote: hi, Am 31.12.2014 um 17:21 schrieb Wido den Hollander: Hi, Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly 0.80.7, and after the upgrade there was a severe performance drop on the cluster. It started raining slow requests after the upgrade, and most of them included a 'snapc' in the request. That led me to investigate the RBD snapshots, and I found that a rogue process had created ~1800 snapshots spread out over 200 volumes. One image even had 181 snapshots! As the snapshots weren't used I removed them all, and after the snapshots were removed the performance of the cluster came back to a normal level again. I'm wondering what changed between Dumpling and Firefly which caused this? I saw OSDs spiking to 100% disk util constantly under Firefly where this didn't happen with Dumpling. Did something change in the way OSDs handle RBD snapshots which causes them to create more disk I/O? I saw the same, and additionally a slowdown in librbd too; that's why I'm still on dumpling and won't upgrade until hammer. Stefan
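For anyone hitting the same problem: rogue snapshots like the ones described above can be enumerated and removed with the standard rbd CLI; pool and image names below are placeholders:

rbd snap ls mypool/myimage     # list all snapshots of one image
rbd snap purge mypool/myimage  # delete all snapshots of one image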
Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
hi, Am 31.12.2014 um 17:21 schrieb Wido den Hollander: Hi, Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly 0.80.7, and after the upgrade there was a severe performance drop on the cluster. It started raining slow requests after the upgrade, and most of them included a 'snapc' in the request. That led me to investigate the RBD snapshots, and I found that a rogue process had created ~1800 snapshots spread out over 200 volumes. One image even had 181 snapshots! As the snapshots weren't used I removed them all, and after the snapshots were removed the performance of the cluster came back to a normal level again. I'm wondering what changed between Dumpling and Firefly which caused this? I saw OSDs spiking to 100% disk util constantly under Firefly where this didn't happen with Dumpling. Did something change in the way OSDs handle RBD snapshots which causes them to create more disk I/O? I saw the same, and additionally a slowdown in librbd too; that's why I'm still on dumpling and won't upgrade until hammer. Stefan
Re: inode64 mount option for XFS
Am 03.11.2014 um 13:28 schrieb Wido den Hollander: Hi, While looking at init-ceph and ceph-disk I noticed a discrepancy between them. init-ceph mounts XFS filesystems with rw,noatime,inode64, but ceph-disk(-activate) with rw,noatime. As inode64 gives the best performance, shouldn't ceph-disk do the same? Any implications if we add inode64 on running deployments? Isn't inode64 the XFS default anyway? Stefan
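For comparison, this is what the mount in question would look like; device and mount point are placeholders:

mount -o rw,noatime,inode64 /dev/sdX1 /var/lib/ceph/osd/ceph-N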
Re: 10/7/2014 Weekly Ceph Performance Meeting: kernel boot params
Hi, as mentioned during today's meeting, here are the kernel boot parameters which I found to provide the basis for good performance results: processor.max_cstate=0 intel_idle.max_cstate=0 I understand these to basically turn off any power saving modes of the CPU; the CPUs we are using are like Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz and Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz. At the BIOS level, we - turn off Hyperthreading - turn off Turbo mode (in order not to leave the specifications) - turn on frequency floor override We also assert that /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor is set to performance. Using the above we see a constant frequency at the maximum level allowed by the CPU (except Turbo mode). How much performance do we gain by this? Till now I thought it's just 1-3%, so I'm still running the ondemand governor plus power savings. Greets, Stefan Best Regards Andreas Bluemle On Wed, 8 Oct 2014 02:51:21 +0200 Mark Nelson mark.nel...@inktank.com wrote: Hi All, Just a reminder that the weekly performance meeting is on Wednesdays at 8AM PST. Same bat time, same bat channel! Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560 (US Toll Free) +1 408 317 9253 (Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- Andreas Bluemle mailto:andreas.blue...@itxperts.de ITXperts GmbH http://www.itxperts.de Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917 D-81541 Muenchen (Germany) Fax: (+49) 89 89044910 Company details: http://www.itxperts.de/imprint.htm
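As a concrete sketch of how the settings above are typically applied (paths assume a GRUB-based Debian-style distro; adapt as needed):

# /etc/default/grub: append to GRUB_CMDLINE_LINUX, then run update-grub and reboot
processor.max_cstate=0 intel_idle.max_cstate=0

# force the performance governor on all cores
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done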
Re: severe librbd performance degradation in Giant
Am 19.09.2014 03:08, schrieb Shu, Xinxin: I also observed performance degradation on my full SSD setup; I can get ~270K IOPS for 4KB random read with 0.80.4, but with latest master I only get ~12K IOPS. These are impressive numbers. Can you tell me how many OSDs you have and which SSDs you use? Thanks, Stefan Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Friday, September 19, 2014 2:03 AM To: Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Alexandre, What tool are you using? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache=true is set in the client conf file. FYI, firefly librbd + librados and a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the giant librbd (if you have multiple copies around, which was the case for me) for reproducing it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without it. firefly: around 12000 iops (with or without rbd_cache) giant: around 12000 iops (with or without rbd_cache) (and I can reach around 20000-30000 iops on giant with the optracker disabled). rbd_cache only improves write performance for me (4k block) - Mail original - De: Haomai Wang haomaiw...@gmail.com À: Somnath Roy somnath@sandisk.com Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org Envoyé: Jeudi 18 Septembre 2014 04:27:56 Objet: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Created a tracker for this: http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Sage, It's a 4K random read. Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential. s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd.
rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing random read, not write. Does this setting still apply? Next, I tried to tweak the rbd_cache setting to false and I *got back* the old performance. Now, it is similar to firefly throughput! So, it looks like rbd_cache=true was the culprit. Thanks Josh! Regards Somnath -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Wednesday, September 17, 2014 2:20 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over the firefly release. Here is the experiment we did to isolate it as a librbd problem.
1. Single OSD is running latest Giant and client is running fio rbd on top of firefly based librbd/librados. For one client it is giving ~11-12K iops (4K RR).
2. Single OSD is running Giant and client is running fio rbd on top of Giant based librbd/librados. For one client it is giving ~1.9K iops (4K RR).
3. Single OSD is running latest Giant and client is running Giant based ceph_smalliobench on top of giant librados. For one client it is giving ~11-12K iops (4K RR).
4. Giant RGW on top of Giant OSD is also scaling.
So, it is obvious from the above that recent
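For reference, the client-side switches being toggled in this thread sit in ceph.conf; a sketch with the values Somnath ended up testing:

[client]
rbd cache = false
rbd cache writethrough until flush = false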
Re: severe librbd performance degradation in Giant
Am 19.09.2014 um 15:02 schrieb Shu, Xinxin: 12 x Intel DC 3700 200GB, every SSD has two OSDs. Crazy, I've 56 SSDs and can't go above 20,000 iops. Regards, Stefan Cheers, xinxin -Original Message- From: Stefan Priebe [mailto:s.pri...@profihost.ag] Sent: Friday, September 19, 2014 2:54 PM To: Shu, Xinxin; Somnath Roy; Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant Am 19.09.2014 03:08, schrieb Shu, Xinxin: I also observed performance degradation on my full SSD setup; I can get ~270K IOPS for 4KB random read with 0.80.4, but with latest master I only get ~12K IOPS. These are impressive numbers. Can you tell me how many OSDs you have and which SSDs you use? Thanks, Stefan Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Friday, September 19, 2014 2:03 AM To: Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Alexandre, What tool are you using? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache=true is set in the client conf file. FYI, firefly librbd + librados and a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the giant librbd (if you have multiple copies around, which was the case for me) for reproducing it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without it. firefly: around 12000 iops (with or without rbd_cache) giant: around 12000 iops (with or without rbd_cache) (and I can reach around 20000-30000 iops on giant with the optracker disabled). rbd_cache only improves write performance for me (4k block) - Mail original - De: Haomai Wang haomaiw...@gmail.com À: Somnath Roy somnath@sandisk.com Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org Envoyé: Jeudi 18 Septembre 2014 04:27:56 Objet: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Created a tracker for this: http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Sage, It's a 4K random read.
Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential. s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd. rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing random read, not write. Does this setting still apply? Next, I tried to tweak the rbd_cache setting to false and I *got back* the old performance. Now, it is similar to firefly throughput! So, it looks like rbd_cache=true was the culprit. Thanks Josh! Regards Somnath -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Wednesday, September 17, 2014 2:20 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over the firefly release. Here is the experiment we did to isolate
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Am 02.07.2014 00:51, schrieb Gregory Farnum: On Thu, Jun 26, 2014 at 11:49 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi Greg, Am 26.06.2014 02:17, schrieb Gregory Farnum: Sorry we let this drop; we've all been busy traveling and things. There have been a lot of changes to librados between Dumpling and Firefly, but we have no idea what would have made it slower. Can you provide more details about how you were running these tests? It's just a normal fio run: fio --ioengine=rbd --bs=4k --name=foo --invalidate=0 --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor --runtime=90 --numjobs=32 --direct=1 --group Running one time with firefly libs and one time with dumpling libs. Target is always the same pool on a firefly ceph storage. What's the backing cluster you're running against? What kind of CPU usage do you see with both? 25k IOPS is definitely getting up there, but I'd like some guidance about whether we're looking for a reduction in parallelism, or an increase in per-op costs, or something else. Hi Greg, I don't have that test cluster anymore. It had to go into production with dumpling. So I can't tell you. Sorry. Stefan -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Am 02.07.2014 15:07, schrieb Haomai Wang: Could you give some perf counters from the rbd client side? Such as op latency? Sorry, I don't have any counters. As this mail went unseen for some days, I thought nobody had an idea or could help. Stefan On Wed, Jul 2, 2014 at 9:01 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 02.07.2014 00:51, schrieb Gregory Farnum: On Thu, Jun 26, 2014 at 11:49 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi Greg, Am 26.06.2014 02:17, schrieb Gregory Farnum: Sorry we let this drop; we've all been busy traveling and things. There have been a lot of changes to librados between Dumpling and Firefly, but we have no idea what would have made it slower. Can you provide more details about how you were running these tests? It's just a normal fio run: fio --ioengine=rbd --bs=4k --name=foo --invalidate=0 --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor --runtime=90 --numjobs=32 --direct=1 --group Running one time with firefly libs and one time with dumpling libs. Target is always the same pool on a firefly ceph storage. What's the backing cluster you're running against? What kind of CPU usage do you see with both? 25k IOPS is definitely getting up there, but I'd like some guidance about whether we're looking for a reduction in parallelism, or an increase in per-op costs, or something else. Hi Greg, I don't have that test cluster anymore. It had to go into production with dumpling. So I can't tell you. Sorry. Stefan -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Hi Greg, Am 02.07.2014 21:36, schrieb Gregory Farnum: On Wed, Jul 2, 2014 at 12:00 PM, Stefan Priebe s.pri...@profihost.ag wrote: Am 02.07.2014 16:00, schrieb Gregory Farnum: Yeah, it's fighting for attention with a lot of other urgent stuff. :( Anyway, even if you can't look up any details or reproduce at this time, I'm sure you know what shape the cluster was (number of OSDs, running on SSDs or hard drives, etc), and that would be useful guidance. :) Sure. Number of OSDs: 24. Each OSD has an SSD; tested with fio before installing ceph, each was capable of 70,000 iops 4k write and 580MB/s seq. write with 1MB blocks. Single Xeon E5-1620 v2 @ 3.70GHz, 48GB RAM. Awesome, thanks. I went through the changelogs on the librados/, osdc/, and msg/ directories to see if I could find any likely change candidates between Dumpling and Firefly and couldn't see any issues. :( But I suspect that the sharding changes coming will more than make up the difference, so you might want to plan on checking that out when it arrives, even if you don't want to deploy it to production. To which changes do you refer? Will they be part of, or backported to, firefly? -Greg
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Hi Greg, Am 26.06.2014 02:17, schrieb Gregory Farnum: Sorry we let this drop; we've all been busy traveling and things. There have been a lot of changes to librados between Dumpling and Firefly, but we have no idea what would have made it slower. Can you provide more details about how you were running these tests? It's just a normal fio run: fio --ioengine=rbd --bs=4k --name=foo --invalidate=0 --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor --runtime=90 --numjobs=32 --direct=1 --group Running one time with firefly libs and one time with dumpling libs. Target is always the same pool on a firefly ceph storage. Stefan -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Jun 13, 2014 at 7:59 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, while testing firefly I came into the situation where I had a client with the latest dumpling packages installed (0.67.9). As my pool has hashpspool false and the tunables are set to default, it can talk to my firefly ceph storage. For random 4k writes using fio with librbd, 32 jobs and an iodepth of 32, I get these results:
librbd / librados2 from dumpling:
write: io=3020.9MB, bw=103083KB/s, iops=25770, runt= 30008msec
WRITE: io=3020.9MB, aggrb=103082KB/s, minb=103082KB/s, maxb=103082KB/s, mint=30008msec, maxt=30008msec
librbd / librados2 from firefly:
write: io=7344.3MB, bw=83537KB/s, iops=20884, runt= 90026msec
WRITE: io=7344.3MB, aggrb=83537KB/s, minb=83537KB/s, maxb=83537KB/s, mint=90026msec, maxt=90026msec
Stefan
Re: [Share]Performance tunning on Ceph FileStore with SSD backend
Am 27.05.2014 06:42, schrieb Haomai Wang: On Tue, May 27, 2014 at 4:29 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Haomai, regarding the FDCache problems you're seeing. Isn't this branch interesting for you? Have you ever tested it? http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-January/007399.html Yes, I noticed it. But my main job is improving performance on the 0.67.5 version. Before this branch, my improvement for this problem was avoiding lfn_find in the omap* methods of the FileStore class (https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg18505.html). Does avoid mean you just removed them? Are they not needed? Do you have a branch for this? Greets, Stefan Am 09.04.2014 12:05, schrieb Haomai Wang: Hi all, I would like to share some ideas about how to improve performance on ceph with SSD. It's not very precise. Our SSD is 500GB and each OSD owns an SSD (the journal is on the same SSD). The ceph version is 0.67.5 (Dumpling). At first, we found three bottlenecks in filestore: 1. fdcache_lock (changed in the Firefly release) 2. lfn_find in omap_* methods 3. DBObjectMap header According to my understanding and the docs in ObjectStore.h (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h), I simply removed lfn_find in omap_* and fdcache_lock. I'm not fully sure of the correctness of this change, but it works well up to now. The DBObjectMap header patch is in the pull request queue and may be merged in the next feature merge window. With the things above done, we get a big performance improvement in disk util and benchmark results (3x-4x). Next, we found the fdcache size becomes the main bottleneck. For example, if the hot data range is 100GB, we need 25000 (100GB/4MB) fds to cache. If the hot data range is 1TB, we need 250000 (1000GB/4MB) fds to cache. When increasing filestore_fd_cache_size, the cost of an FDCache lookup and of a cache miss is expensive and can't be afforded. The implementation of FDCache isn't O(1). So we can only get high performance within the fdcache hit range (maybe 100GB with a 10240 fdcache size), and data exceeding the size of the fdcache will be a disaster. If you want to cache more fds (102400 fdcache size), the implementation of FDCache will bring extra CPU cost (which can't be ignored) for each op. Because of the capacity of SSDs (several hundred GB), we try to increase the size of rbd objects (16MB) so less fd cache is needed. As for the FDCache implementation, we simply discard SimpleLRU and introduce RandomCache. Now we can set a much larger fdcache size (caching nearly all fds) with little overhead. With these, we achieve 3x-4x performance improvements on filestore with SSD. Maybe I missed something or got something wrong; I hope you can correct me. I hope it can help to improve FileStore on SSD and be pushed into the master branch.
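For reference, the FDCache size knob discussed above is exposed via ceph.conf; the value is only an example, not a recommendation:

[osd]
filestore fd cache size = 10240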
Re: [Share]Performance tunning on Ceph FileStore with SSD backend
Am 27.05.2014 08:37, schrieb Haomai Wang: I'm not fully sure of the correctness of the changes, although they seemed ok to me. And I applied these changes to a production env with no problems. Do you have a branch in your yuyuyu github account for this? On Tue, May 27, 2014 at 2:05 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 27.05.2014 06:42, schrieb Haomai Wang: On Tue, May 27, 2014 at 4:29 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Haomai, regarding the FDCache problems you're seeing. Isn't this branch interesting for you? Have you ever tested it? http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-January/007399.html Yes, I noticed it. But my main job is improving performance on the 0.67.5 version. Before this branch, my improvement for this problem was avoiding lfn_find in the omap* methods of the FileStore class (https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg18505.html). Does avoid mean you just removed them? Are they not needed? Do you have a branch for this? Greets, Stefan Am 09.04.2014 12:05, schrieb Haomai Wang: Hi all, I would like to share some ideas about how to improve performance on ceph with SSD. It's not very precise. Our SSD is 500GB and each OSD owns an SSD (the journal is on the same SSD). The ceph version is 0.67.5 (Dumpling). At first, we found three bottlenecks in filestore: 1. fdcache_lock (changed in the Firefly release) 2. lfn_find in omap_* methods 3. DBObjectMap header According to my understanding and the docs in ObjectStore.h (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h), I simply removed lfn_find in omap_* and fdcache_lock. I'm not fully sure of the correctness of this change, but it works well up to now. The DBObjectMap header patch is in the pull request queue and may be merged in the next feature merge window. With the things above done, we get a big performance improvement in disk util and benchmark results (3x-4x). Next, we found the fdcache size becomes the main bottleneck. For example, if the hot data range is 100GB, we need 25000 (100GB/4MB) fds to cache. If the hot data range is 1TB, we need 250000 (1000GB/4MB) fds to cache. When increasing filestore_fd_cache_size, the cost of an FDCache lookup and of a cache miss is expensive and can't be afforded. The implementation of FDCache isn't O(1). So we can only get high performance within the fdcache hit range (maybe 100GB with a 10240 fdcache size), and data exceeding the size of the fdcache will be a disaster. If you want to cache more fds (102400 fdcache size), the implementation of FDCache will bring extra CPU cost (which can't be ignored) for each op. Because of the capacity of SSDs (several hundred GB), we try to increase the size of rbd objects (16MB) so less fd cache is needed. As for the FDCache implementation, we simply discard SimpleLRU and introduce RandomCache. Now we can set a much larger fdcache size (caching nearly all fds) with little overhead. With these, we achieve 3x-4x performance improvements on filestore with SSD. Maybe I missed something or got something wrong; I hope you can correct me. I hope it can help to improve FileStore on SSD and be pushed into the master branch.
Re: [Share]Performance tunning on Ceph FileStore with SSD backend
Hi Haomai, regarding the FDCache problems you're seeing. Isn't this branch interesting for you? Have you ever tested it? http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-January/007399.html Greets, Stefan Am 09.04.2014 12:05, schrieb Haomai Wang: Hi all, I would like to share some ideas about how to improve performance on ceph with SSD. It's not very precise. Our SSD is 500GB and each OSD owns an SSD (the journal is on the same SSD). The ceph version is 0.67.5 (Dumpling). At first, we found three bottlenecks in filestore: 1. fdcache_lock (changed in the Firefly release) 2. lfn_find in omap_* methods 3. DBObjectMap header According to my understanding and the docs in ObjectStore.h (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h), I simply removed lfn_find in omap_* and fdcache_lock. I'm not fully sure of the correctness of this change, but it works well up to now. The DBObjectMap header patch is in the pull request queue and may be merged in the next feature merge window. With the things above done, we get a big performance improvement in disk util and benchmark results (3x-4x). Next, we found the fdcache size becomes the main bottleneck. For example, if the hot data range is 100GB, we need 25000 (100GB/4MB) fds to cache. If the hot data range is 1TB, we need 250000 (1000GB/4MB) fds to cache. When increasing filestore_fd_cache_size, the cost of an FDCache lookup and of a cache miss is expensive and can't be afforded. The implementation of FDCache isn't O(1). So we can only get high performance within the fdcache hit range (maybe 100GB with a 10240 fdcache size), and data exceeding the size of the fdcache will be a disaster. If you want to cache more fds (102400 fdcache size), the implementation of FDCache will bring extra CPU cost (which can't be ignored) for each op. Because of the capacity of SSDs (several hundred GB), we try to increase the size of rbd objects (16MB) so less fd cache is needed. As for the FDCache implementation, we simply discard SimpleLRU and introduce RandomCache. Now we can set a much larger fdcache size (caching nearly all fds) with little overhead. With these, we achieve 3x-4x performance improvements on filestore with SSD. Maybe I missed something or got something wrong; I hope you can correct me. I hope it can help to improve FileStore on SSD and be pushed into the master branch.
Re: [Performance] Improvement on DB Performance
Am 21.05.2014 um 20:41 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Stefan Priebe - Profihost AG wrote: Hi sage, what about cuttlefish customers? We stopped backporting fixes to cuttlefish a while ago. Please upgrade to dumpling! Did I miss some information from inktank about updating to dumpling? I thought we should stay on cuttlefish and then upgrade to firefly. That said, this patch should apply cleanly to cuttlefish. sage Greets, Stefan Excuse my typos, sent from my mobile phone. Am 21.05.2014 um 18:15 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Mike Dawson wrote: Haomai, Thanks for finding this! Sage, We have a client that runs an io intensive, closed-source software package that seems to issue overzealous flushes which may benefit from this patch (or the other methods you mention). If you were to spin a wip build based on Dumpling, I'll be a willing tester. Pushed wip-librbd-flush-dumpling, should be built shortly. sage Thanks, Mike Dawson On 5/21/2014 11:23 AM, Sage Weil wrote: On Wed, 21 May 2014, Haomai Wang wrote: I pushed the commit to fix this problem (https://github.com/ceph/ceph/pull/1848). With a test program (each sync request is issued with ten write requests), a significant improvement is noticed.
aio_flush sum: 914750 avg: 1239 count: 738 max: 4714 min: 1011
flush_set sum: 904200 avg: 1225 count: 738 max: 4698 min: 999
flush sum: 641648 avg: 173 count: 3690 max: 1340 min: 128
Compared to the last mail, it reduces each aio_flush request to 1239 ns instead of 24145 ns. Good catch! That's a great improvement. The patch looks clearly correct. We can probably do even better by putting the Objects on a list when they get their first dirty buffer so that we only cycle through the dirty ones. Or, have a global list of dirty buffers (instead of dirty objects - dirty buffers). sage I hope it's the root cause for db-on-rbd performance. On Wed, May 21, 2014 at 6:15 PM, Haomai Wang haomaiw...@gmail.com wrote: Hi all, I remember there was a discussion about DB (mysql) performance on rbd. Recently I tested mysql-bench with rbd and found awful performance. So I dove into it and found that the main cause is flush requests from the guest. As we know, applications such as mysql and ceph have their own journal for durability, and the journal usually issues sync/direct io. If the fs barrier is on, each sync io operation makes the kernel issue a sync (barrier) request to the block device. Here, qemu will call rbd_aio_flush to apply it. Via systemtap, I found an amazing thing:
aio_flush sum: 4177085 avg: 24145 count: 173 max: 28172 min: 22747
flush_set sum: 4172116 avg: 24116 count: 173 max: 28034 min: 22733
flush sum: 3029910 avg: 4 count: 670477 max: 1893 min: 3
This statistic info is gathered in 5s. Most
Re: [Performance] Improvement on DB Performance
*arg* sorry, I mixed up emperor with dumpling.. sorry. Stefan Am 21.05.2014 20:51, schrieb Stefan Priebe - Profihost AG: Am 21.05.2014 um 20:41 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Stefan Priebe - Profihost AG wrote: Hi sage, what about cuttlefish customers? We stopped backporting fixes to cuttlefish a while ago. Please upgrade to dumpling! Did I miss some information from inktank about updating to dumpling? I thought we should stay on cuttlefish and then upgrade to firefly. That said, this patch should apply cleanly to cuttlefish. sage Greets, Stefan Excuse my typos, sent from my mobile phone. Am 21.05.2014 um 18:15 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Mike Dawson wrote: Haomai, Thanks for finding this! Sage, We have a client that runs an io intensive, closed-source software package that seems to issue overzealous flushes which may benefit from this patch (or the other methods you mention). If you were to spin a wip build based on Dumpling, I'll be a willing tester. Pushed wip-librbd-flush-dumpling, should be built shortly. sage Thanks, Mike Dawson On 5/21/2014 11:23 AM, Sage Weil wrote: On Wed, 21 May 2014, Haomai Wang wrote: I pushed the commit to fix this problem (https://github.com/ceph/ceph/pull/1848). With a test program (each sync request is issued with ten write requests), a significant improvement is noticed.
aio_flush sum: 914750 avg: 1239 count: 738 max: 4714 min: 1011
flush_set sum: 904200 avg: 1225 count: 738 max: 4698 min: 999
flush sum: 641648 avg: 173 count: 3690 max: 1340 min: 128
Compared to the last mail, it reduces each aio_flush request to 1239 ns instead of 24145 ns. Good catch! That's a great improvement. The patch looks clearly correct. We can probably do even better by putting the Objects on a list when they get their first dirty buffer so that we only cycle through the dirty ones. Or, have a global list of dirty buffers (instead of dirty objects - dirty buffers). sage I hope it's the root cause for db-on-rbd performance. On Wed, May 21, 2014 at 6:15 PM, Haomai Wang haomaiw...@gmail.com wrote: Hi all, I remember there was a discussion about DB (mysql) performance on rbd. Recently I tested mysql-bench with rbd and found awful performance. So I dove into it and found that the main cause is flush requests from the guest. As we know, applications such as mysql and ceph have their own journal for durability, and the journal usually issues sync/direct io. If the fs barrier is on, each sync io operation makes the kernel issue a sync (barrier) request to the block device. Here, qemu will call rbd_aio_flush to apply it. Via systemtap, I found an amazing thing:
aio_flush sum: 4177085 avg: 24145 count: 173 max: 28172 min: 22747
flush_set sum: 4172116 avg: 24116 count: 173 max: 28034 min: 22733
flush sum: 3029910 avg: 4 count: 670477 max: 1893 min: 3
This statistic info is gathered
Re: default filestore max sync interval
Hi Greg, Am 29.04.2014 22:23, schrieb Gregory Farnum: On Tue, Apr 29, 2014 at 1:10 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Hi all, Why is the default max sync interval only 5 seconds? Today we realized what a huge difference increasing this to 30 or 60s can make for small write latency. Basically, with a 5s interval our 4k write latency is above 30-35ms, and once we increase it to 30s we can get under 10ms (using spinning disks for journal and data). See the attached plot for the effect of this on a running cluster (the plot shows the max, avg, min write latency from a short rados bench every 10 mins). The change from 5s to 60s was applied at noon today. (And our journals are large enough, don't worry.) In the interest of having sensible defaults, is there any reason not to increase this to 30s? If you've got reasonable confidence in the quality of your measurements across the workloads you serve, you should bump it up. Part of what might be happening here is simply that fewer of your small-io writes are running into a sync interval. I suspect that most users will see improvement by bumping up the limits and occasionally agitate to change the defaults, but Sam has always pushed back against doing so for reasons I don't entirely recall. :) (The potential for a burstier throughput profile?) -Greg What about those?
filestore queue max ops = 500
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 419430400
filestore_queue_committing_max_bytes = 419430400
filestore_wbthrottle_xfs_bytes_start_flusher = 125829120
filestore_wbthrottle_xfs_bytes_hard_limit = 419430400
filestore_wbthrottle_xfs_ios_start_flusher = 5000
filestore_wbthrottle_xfs_ios_hard_limit = 50000
filestore_wbthrottle_xfs_inodes_start_flusher = 1000
filestore_wbthrottle_xfs_inodes_hard_limit = 10000
They should be adjusted too, right? Stefan
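For reference, the sync interval knob from this thread goes into ceph.conf; 30 is the value Dan tested, not a general recommendation:

[osd]
filestore max sync interval = 30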
Re: default filestore max sync interval
Hi Dan, Am 29.04.2014 22:10, schrieb Dan Van Der Ster: Hi all, Why is the default max sync interval only 5 seconds? Today we realized what a huge difference increasing this to 30 or 60s can make for small write latency. Basically, with a 5s interval our 4k write latency is above 30-35ms, and once we increase it to 30s we can get under 10ms (using spinning disks for journal and data). See the attached plot for the effect of this on a running cluster (the plot shows the max, avg, min write latency from a short rados bench every 10 mins). The change from 5s to 60s was applied at noon today. (And our journals are large enough, don't worry.) In the interest of having sensible defaults, is there any reason not to increase this to 30s? I was playing with them too but didn't get any noticeable results. How do you get / graph the ceph latency? Greets, Stefan
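Dan's latency numbers come from short rados bench runs; a minimal example of such a probe (pool name is a placeholder):

rados bench -p testpool 30 write -b 4096 -t 16
# the summary includes avg/min/max latency, which can be scraped and graphed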
Re: firefly timing
Hi Sage, I really would like to test the tiering. Is there any detailed documentation about it and how it works? Greets, Stefan Am 18.03.2014 05:45, schrieb Sage Weil: Hi everyone, It's taken longer than expected, but the tests for v0.78 are calming down and it looks like we'll be able to get the release out this week. However, we've decided NOT to make this release firefly. It will be a normal development release. This will be the first release that includes some key new functionality (erasure coding and cache tiering), and although it is passing our tests we'd like to have some operational experience with it in more users' hands before we commit to supporting it long term. The tentative plan is to freeze and then release v0.79 after a normal two week cycle. This will serve as a 'release candidate' that shaves off a few rough edges from the pending release (including some improvements to the API for setting up erasure coded pools). It is possible that 0.79 will turn into firefly, but it is more likely that we will opt for another two weeks of hardening and make 0.80 the release we name firefly and maintain for the long term. Long story short: 0.78 will be out soon, and you should test it! It will vary from the final firefly in a few subtle ways, but any feedback or usability and bug reports at this point will be very helpful in shaping things. Thanks! sage
Re: firefly timing
Am 18.03.2014 um 17:06 schrieb Sage Weil s...@inktank.com: On Tue, 18 Mar 2014, Stefan Priebe - Profihost AG wrote: Hi Sage, I really would like to test the tiering. Is there any detailed documentation about it and how it works? Great! Here is a quick synopsis on how to set it up: http://ceph.com/docs/master/dev/cache-pool/ What I'm missing is documentation about the cache settings. sage Greets, Stefan Am 18.03.2014 05:45, schrieb Sage Weil: Hi everyone, It's taken longer than expected, but the tests for v0.78 are calming down and it looks like we'll be able to get the release out this week. However, we've decided NOT to make this release firefly. It will be a normal development release. This will be the first release that includes some key new functionality (erasure coding and cache tiering), and although it is passing our tests we'd like to have some operational experience with it in more users' hands before we commit to supporting it long term. The tentative plan is to freeze and then release v0.79 after a normal two week cycle. This will serve as a 'release candidate' that shaves off a few rough edges from the pending release (including some improvements to the API for setting up erasure coded pools). It is possible that 0.79 will turn into firefly, but it is more likely that we will opt for another two weeks of hardening and make 0.80 the release we name firefly and maintain for the long term. Long story short: 0.78 will be out soon, and you should test it! It will vary from the final firefly in a few subtle ways, but any feedback or usability and bug reports at this point will be very helpful in shaping things. Thanks! sage
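Roughly, per the linked doc, wiring up a cache tier looks like this (pool names are placeholders; both pools must already exist):

ceph osd tier add basepool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay basepool cachepool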
Re: ceph cli delay when one mon is down
Am 15.01.2014 um 08:33 schrieb Dietmar Maurer diet...@proxmox.com: You can avoid this, and speed things up in general, by using the interactive mode:

#!/bin/sh
ceph <<EOM
do something
do something else
EOM

The above is a bit clumsy. In particular, you don't know which command failed, in which way, and which exit code or output it produced. To be honest, I want to do things with perl, so I guess it is better to use perl bindings for librados. Are perl bindings already available?
Re: Proposal for adding disable FileJournal option
I had the same question in the past, but there seems to be no way to get the ceph team to change it. Stefan This mail was sent with my iPhone. Am 09.01.2014 um 18:28 schrieb Gregory Farnum g...@inktank.com: The FileJournal is also for data safety whenever we're using write-ahead. To disable it we need a backing store that we know can provide us consistent checkpoints (i.e., we can use parallel journaling mode, so for the FileJournal we're using btrfs, or maybe zfs someday). But for those systems you can already configure the system not to use a journal. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Jan 9, 2014 at 12:13 AM, Haomai Wang haomaiw...@gmail.com wrote: Hi all, We know FileJournal plays an important role in the FileStore backend; it can hugely reduce write latency and improve small write operations. But in practice there exist exceptions, such as when we already use FlashCache or a cachepool (although it's not ready). If a cachepool is enabled, we may want to use a journal in the cache_pool but not in the base_pool. The main reason to drop the journal in the base_pool is that the journal takes over a single physical device and wastes too much in the base_pool. Likewise, if I enable FlashCache or another cache, I wouldn't like to enable the journal at the OSD layer. So is it necessary to be able to disable the journal in special (not really special) cases? Best regards, Wheats
rocksdb Seen today - replacement for leveldb?
Hi, while Google's leveldb was too slow for Facebook, they created rocksdb (http://rocksdb.org/). Maybe interesting for Ceph? It's already production quality. Greets, Stefan
Re: [ceph-users] rocksdb Seen today - replacement for leveldb?
The performance comparisons are very impressive: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks Stefan Am 27.11.2013 11:55, schrieb Stefan Priebe - Profihost AG: Hi, while Google's leveldb was too slow for Facebook, they created rocksdb (http://rocksdb.org/). Maybe interesting for Ceph? It's already production quality. Greets, Stefan
Re: Intel 520/530 SSD for ceph
Hi, Am 21.11.2013 01:29, schrieb m...@linuxbox.com: On Tue, Nov 19, 2013 at 09:02:41AM +0100, Stefan Priebe wrote: ... You might be able to vary this behavior by experimenting with sdparm, smartctl or other tools, or possibly with different microcode in the drive. Which values or which settings do you think of? ... Off-hand, I don't know. Probably the first thing would be to compare the configuration of your 520 and 530; anything that's different is certainly worth investigating. This should display all pages:
sdparm --all --long /dev/sdX
The 520 only appears to have 3 pages, which can be fetched directly with:
sdparm --page=ca --long /dev/sdX
sdparm --page=co --long /dev/sdX
sdparm --page=rw --long /dev/sdX
The sample machine I'm looking at has an intel 520, and on ours, most options show as 0 except for:
AWRE 1 [cha: n, def: 1] Automatic write reallocation enabled
WCE 1 [cha: y, def: 1] Write cache enable
DRA 1 [cha: n, def: 1] Disable read ahead
GLTSD 1 [cha: n, def: 1] Global logging target save disable
BTP -1 [cha: n, def: -1] Busy timeout period (100us)
ESTCT 30 [cha: n, def: 30] Extended self test completion time (sec)
Perhaps that's an interesting data point to compare with yours. Figuring out if you have up-to-date intel firmware appears to require burning and running an iso image from https://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=18455 The results of sdparm --page=whatever --long /dev/sdc show the intel firmware, but this labels it better: smartctl -i /dev/sdc Our 520 has firmware 400i loaded. Firmware is up to date and all values are the same. I expect that the 520 firmware just ignores CMD_FLUSH commands and the 530 does not. Greets, Stefan
Re: Intel 520/530 SSD for ceph
Hi Marcus, Am 18.11.2013 23:51, schrieb m...@linuxbox.com: On Mon, Nov 18, 2013 at 02:38:42PM +0100, Stefan Priebe - Profihost AG wrote: You may actually be doing O_SYNC - recent kernels implement O_DSYNC, but glibc maps O_DSYNC into O_SYNC. But since you're writing to the block device this won't matter much. No difference regarding O_DSYNC or O_SYNC; the values are the same. Also, I'm using 3.10.19 as a kernel, so it is recent enough. I believe the effect of O_DIRECT by itself is just to bypass the buffer cache, which is not going to make much difference for your dd case. (It will mainly affect other applications that are also using the buffer cache...) O_SYNC should be causing the writes to block until a response is received from the disk. Without O_SYNC, the writes will just queue operations and return - potentially very fast. Your dd is probably writing enough data that there is some throttling by the system as it runs out of disk buffers and has to wait for some previous data to be written to the drive, but the delay for any individual block is not likely to matter. With O_SYNC, you are measuring the delay for each block directly, and you have absolutely removed the ability for the disk to perform any sort of parallelism. That's correct, but ceph uses O_DSYNC for its journal and maybe other stuff, so it is important to have devices performing well with O_DSYNC. Sounds like the intel 530 has a much larger block write latency, but can make up for it by performing more overlapped operations. You might be able to vary this behavior by experimenting with sdparm, smartctl or other tools, or possibly with different microcode in the drive. Which values or which settings do you think of? Greets Stefan
Re: [ANN] ceph-deploy 1.3 released!
Hi, I didn't find anything in the changelog, so I just would like to ask if this is planned. Right now you can already create a new cluster using hostA:IPA HostB:IPB ..., but it does not use these IPs as mon addr. Also, the hostA/hostB names need to match the hostname. This is pretty bad, as you cannot easily change the IPs or hosts of mons later, so I tend to use special names and IPs which I can move to different machines later. The normal ceph config supports:

[mon.a]
host = abc
mon addr = 85.58.34.12

Thanks, Stefan

Am 01.11.2013 13:54, schrieb Alfredo Deza: Hi all, A new version (1.3) of ceph-deploy is now out. A lot of fixes went into this release, including the addition of a more robust library to connect to remote hosts, and it removed the one extra dependency we used. Installation should be simpler. The complete changelog can be found at: https://github.com/ceph/ceph-deploy/blob/master/docs/source/changelog.rst The highlights for this release are:

* We now allow `--username` to be used to connect to remote hosts, specifying something different than the current user or the SSH config.
* Global timeouts for remote commands, to be able to disconnect if there is no input received (defaults to 5 minutes), while still allowing other, more granular timeouts for commands that just need to run without expecting output.

Please make sure you update (install instructions: http://github.com/ceph/ceph-deploy/#installation) and use the latest version! Thanks, Alfredo
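A minimal sketch of the static mon layout being asked for here (mon names, hostnames and addresses are invented for illustration, not from the thread):

[mon.a]
host = mon-a
mon addr = 10.0.0.11:6789

[mon.b]
host = mon-b
mon addr = 10.0.0.12:6789

[mon.c]
host = mon-c
mon addr = 10.0.0.13:6789

With floating names and addresses like these, a mon can be moved to a different machine as long as its address moves with it, which is the flexibility requested above.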
Re: still recovery issues with cuttlefish
Am 22.08.2013 05:34, schrieb Samuel Just: It's not really possible at this time to control that limit because changing the primary is actually fairly expensive and doing it unnecessarily would probably make the situation much worse I'm sorry but remapping or backfilling is far less expensive on all of my machines than recovering. While backfilling i've around 8-10% I/O waits while under recovery i have 40%-50% (it's mostly necessary for backfilling, which is expensive anyway). It seems like forwarding IO on an object which needs to be recovered to a replica with the object would be the next step. Certainly something to consider for the future. Yes this would be the solution. Stefan -Sam On Wed, Aug 21, 2013 at 12:37 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Sam, Am 21.08.2013 21:13, schrieb Samuel Just: As long as the request is for an object which is up to date on the primary, the request will be served without waiting for recovery. Sure but remember if you have VM random 4K workload a lot of objects go out of date pretty soon. A request only waits on recovery if the particular object being read or written must be recovered. Yes but on 4k load this can be a lot. Your issue was that recovering the particular object being requested was unreasonably slow due to silliness in the recovery code which you disabled by disabling osd_recover_clone_overlap. Yes and no. It's better now but far away from being good or perfect. My VMs do not crash anymore but i still have a bunch of slow requests (just around 10 messages) and still a VERY high I/O load on the disks during recovery. In cases where the primary osd is significantly behind, we do make one of the other osds primary during recovery in order to expedite requests (pgs in this state are shown as remapped). oh never seen that but at least in my case even 60s are a very long timeframe and the OSD is very stressed during recovery. Is it possible for me to set this value? Stefan -Sam On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 21.08.2013 17:32, schrieb Samuel Just: Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue. This might sound a bug harsh but maybe due to my limited english skills ;-) I still think that Cephs recovery system is broken by design. If an OSD comes back (was offline) all write requests regarding PGs where this one is primary are targeted immediatly to this OSD. If this one is not up2date for an PG it tries to recover that one immediatly which costs 4MB / block. If you have a lot of small write all over your OSDs and PGs you're sucked as your OSD has to recover ALL it's PGs immediatly or at least lots of them WHICH can't work. This is totally crazy. I think the right way would be: 1.) if an OSD goes down the replicas got primaries or 2.) an OSD which does not have an up2date PG should redirect to the OSD holding the secondary or third replica. Both results in being able to have a really smooth and slow recovery without any stress even under heavy 4K workloads like rbd backed VMs. Thanks for reading! Greets Stefan -Sam On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson mike.daw...@cloudapt.com wrote: Sam/Josh, We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change. One node in our cluster fsck'ed after a reboot and got a bit behind. 
Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation. I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage. During the entire period of recovery/backfill, the network looked fine...no where close to saturation. iowait on all drives look fine as well. Any ideas? Thanks, Mike Dawson On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote: the same problem still occours. Will need to check when i've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j
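The remapped state Sam mentions can be observed from the CLI (a sketch; 2.1f is a placeholder pg id):

# pgs whose acting set currently differs from the up set
ceph pg dump | grep remapped

# up and acting sets for one pg; the first osd in the acting set is the primary
ceph pg map 2.1f

During recovery, a pg whose primary is far behind should show up here with a different acting primary.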
Re: still recovery issues with cuttlefish
Am 21.08.2013 17:32, schrieb Samuel Just: Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue. This might sound a bug harsh but maybe due to my limited english skills ;-) I still think that Cephs recovery system is broken by design. If an OSD comes back (was offline) all write requests regarding PGs where this one is primary are targeted immediatly to this OSD. If this one is not up2date for an PG it tries to recover that one immediatly which costs 4MB / block. If you have a lot of small write all over your OSDs and PGs you're sucked as your OSD has to recover ALL it's PGs immediatly or at least lots of them WHICH can't work. This is totally crazy. I think the right way would be: 1.) if an OSD goes down the replicas got primaries or 2.) an OSD which does not have an up2date PG should redirect to the OSD holding the secondary or third replica. Both results in being able to have a really smooth and slow recovery without any stress even under heavy 4K workloads like rbd backed VMs. Thanks for reading! Greets Stefan -Sam On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson mike.daw...@cloudapt.com wrote: Sam/Josh, We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change. One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation. I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage. During the entire period of recovery/backfill, the network looked fine...no where close to saturation. iowait on all drives look fine as well. Any ideas? Thanks, Mike Dawson On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote: the same problem still occours. Will need to check when i've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe that but i'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1 but the issue didn't went away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! 
-Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samual, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20 debug filestore = 20 debug ms = 1 debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs at cephdrop folder: slow_requests_recovering_cuttlefish osd.52 was the one recovering Thanks! Greets, Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send
Re: still recovery issues with cuttlefish
Hi Sam, Am 21.08.2013 21:13, schrieb Samuel Just: As long as the request is for an object which is up to date on the primary, the request will be served without waiting for recovery. Sure but remember if you have VM random 4K workload a lot of objects go out of date pretty soon. A request only waits on recovery if the particular object being read or written must be recovered. Yes but on 4k load this can be a lot. Your issue was that recovering the particular object being requested was unreasonably slow due to silliness in the recovery code which you disabled by disabling osd_recover_clone_overlap. Yes and no. It's better now but far away from being good or perfect. My VMs do not crash anymore but i still have a bunch of slow requests (just around 10 messages) and still a VERY high I/O load on the disks during recovery. In cases where the primary osd is significantly behind, we do make one of the other osds primary during recovery in order to expedite requests (pgs in this state are shown as remapped). oh never seen that but at least in my case even 60s are a very long timeframe and the OSD is very stressed during recovery. Is it possible for me to set this value? Stefan -Sam On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 21.08.2013 17:32, schrieb Samuel Just: Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue. This might sound a bug harsh but maybe due to my limited english skills ;-) I still think that Cephs recovery system is broken by design. If an OSD comes back (was offline) all write requests regarding PGs where this one is primary are targeted immediatly to this OSD. If this one is not up2date for an PG it tries to recover that one immediatly which costs 4MB / block. If you have a lot of small write all over your OSDs and PGs you're sucked as your OSD has to recover ALL it's PGs immediatly or at least lots of them WHICH can't work. This is totally crazy. I think the right way would be: 1.) if an OSD goes down the replicas got primaries or 2.) an OSD which does not have an up2date PG should redirect to the OSD holding the secondary or third replica. Both results in being able to have a really smooth and slow recovery without any stress even under heavy 4K workloads like rbd backed VMs. Thanks for reading! Greets Stefan -Sam On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson mike.daw...@cloudapt.com wrote: Sam/Josh, We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change. One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation. I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage. During the entire period of recovery/backfill, the network looked fine...no where close to saturation. iowait on all drives look fine as well. Any ideas? Thanks, Mike Dawson On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote: the same problem still occours. 
Will need to check when i've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe that but i'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1 but the issue didn't went away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! -Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j
[PATCH] debian/control libgoogle-perftools-dev (>= 2.0-2)
---
 debian/control | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/debian/control b/debian/control
index 5c14ebb..b39579f 100644
--- a/debian/control
+++ b/debian/control
@@ -25,7 +25,7 @@ Build-Depends: autoconf,
  libexpat1-dev,
  libfcgi-dev,
  libfuse-dev,
- libgoogle-perftools-dev [i386 amd64],
+ libgoogle-perftools-dev (>= 2.0-2) [i386 amd64],
  libkeyutils-dev,
  libleveldb-dev,
  libnss3-dev,
--
1.7.10.4
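To confirm a build box actually carries a new enough tcmalloc before building (a sketch):

apt-cache policy libgoogle-perftools-dev
# from the unpacked ceph source tree, with debian/ present:
dpkg-checkbuilddeps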
[PATCH] also allow the curl openssl binding
---
 debian/control | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/debian/control b/debian/control
index b39579f..957727d 100644
--- a/debian/control
+++ b/debian/control
@@ -20,7 +20,7 @@ Build-Depends: autoconf,
  libboost-program-options-dev (>= 1.42),
  libboost-thread-dev (>= 1.42),
  libboost-system-dev (>= 1.42),
- libcurl4-gnutls-dev,
+ libcurl4-gnutls-dev | libcurl4-openssl-dev,
  libedit-dev,
  libexpat1-dev,
  libfcgi-dev,
--
1.7.10.4
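With the alternative in place, a box carrying only the openssl flavor should satisfy the build dependency; this can be checked directly (a sketch):

apt-get install libcurl4-openssl-dev
dpkg-checkbuilddeps   # should no longer complain about libcurl4-gnutls-dev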
Re: still recovery issues with cuttlefish
the same problem still occurs. Will need to check when I've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs and I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1, but the issue didn't go away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! -Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
Re: still recovery issues with cuttlefish
Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1, but the issue didn't go away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! -Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
Re: still recovery issues with cuttlefish
Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
Re: still recovery issues with cuttlefish
Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
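For reference, the requested debug levels can be injected into running daemons instead of editing ceph.conf and restarting (a sketch; osd.52 stands in for whichever osds are involved):

ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

Resetting the levels afterwards keeps the log volume manageable.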
Re: still recovery issues with cuttlefish
Hi Mike, Am 08.08.2013 16:05, schrieb Mike Dawson: Stefan, I see the same behavior and I theorize it is linked to an issue detailed in another thread [0]. Do your VM guests ever hang while your cluster is HEALTH_OK like described in that other thread? [0] http://comments.gmane.org/gmane.comp.file-systems.ceph.user/2982 mhm no can't see that. All our VMs are working fine even under high load while ceph is OK. A few observations: - The VMs that hang do lots of writes (video surveillance). - I use rbd and qemu. The problem exists in both qemu 1.4.x and 1.5.2. - The problem exists with or without joshd's qemu async flush patch. - Windows VMs seem to be more vulnerable than Linux VMs. - If I restart the qemu-system-x86_64 process, the guest will come back to life. - A partial workaround seems to be console input (NoVNC or 'virsh screenshot'), but restarting qemu-system-x86_64 works better. - The issue of VMs hanging seems worse with RBD writeback cache enabled - I typically run virtio, but I believe I've seen it with e1000, too. - VM guests hang at different times, not all at once on a host (or across all hosts). - I co-mingle VM guests on servers that host ceph OSDs. Oliver, If your cluster has to recover/backfill, do your guest VMs hang with more frequency than under normal HEALTH_OK conditions, even if you prioritize client IO as Sam wrote below? Sam, Turning down all the settings you mentioned certainly does slow the recover/backfill process, but it doesn't prevent the VM guests backed by RBD volumes from hanging. In fact, I often try to prioritize recovery/backfill because my guests tend to hang until I get back to HEALTH_OK. Given this apparent bug, completing recovery/backfill quicker leads to less total outage, it seems. Josh, How can I help you investigate if RBD is the common source of both of these issues? Thanks, Mike Dawson On 8/2/2013 2:46 PM, Stefan Priebe wrote: Hi, osd recovery max active = 1 osd max backfills = 1 osd recovery op priority = 5 still no difference... Stefan Am 02.08.2013 20:21, schrieb Samuel Just: Also, you have osd_recovery_op_priority at 50. That is close to the priority of client IO. You want it below 10 (defaults to 10), perhaps at 1. You can also adjust down osd_recovery_max_active. -Sam On Fri, Aug 2, 2013 at 11:16 AM, Stefan Priebe s.pri...@profihost.ag wrote: I already tried both values this makes no difference. The drives are not the bottleneck. Am 02.08.2013 19:35, schrieb Samuel Just: You might try turning osd_max_backfills to 2 or 1. -Sam On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 01.08.2013 23:23, schrieb Samuel Just: Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd.osdid.asok config show Sure. 
{ name: osd.0, cluster: ceph, none: 0\/5, lockdep: 0\/0, context: 0\/0, crush: 0\/0, mds: 0\/0, mds_balancer: 0\/0, mds_locker: 0\/0, mds_log: 0\/0, mds_log_expire: 0\/0, mds_migrator: 0\/0, buffer: 0\/0, timer: 0\/0, filer: 0\/0, striper: 0\/1, objecter: 0\/0, rados: 0\/0, rbd: 0\/0, journaler: 0\/0, objectcacher: 0\/0, client: 0\/0, osd: 0\/0, optracker: 0\/0, objclass: 0\/0, filestore: 0\/0, journal: 0\/0, ms: 0\/0, mon: 0\/0, monc: 0\/0, paxos: 0\/0, tp: 0\/0, auth: 0\/0, crypto: 1\/5, finisher: 0\/0, heartbeatmap: 0\/0, perfcounter: 0\/0, rgw: 0\/0, hadoop: 0\/0, javaclient: 1\/5, asok: 0\/0, throttle: 0\/0, host: cloud1-1268, fsid: ----, public_addr: 10.255.0.90:0\/0, cluster_addr: 10.255.0.90:0\/0, public_network: 10.255.0.1\/24, cluster_network: 10.255.0.1\/24, num_client: 1, monmap: , mon_host: , lockdep: false, run_dir: \/var\/run\/ceph, admin_socket: \/var\/run\/ceph\/ceph-osd.0.asok, daemonize: true, pid_file: \/var\/run\/ceph\/osd.0.pid, chdir: \/, max_open_files: 0, fatal_signal_handlers: true, log_file: \/var\/log\/ceph\/ceph-osd.0.log, log_max_new: 1000, log_max_recent: 1, log_to_stderr: false, err_to_stderr: true, log_to_syslog: false, err_to_syslog: false, log_flush_on_exit: true, log_stop_at_utilization: 0.97, clog_to_monitors: true, clog_to_syslog: false, clog_to_syslog_level: info, clog_to_syslog_facility: daemon, mon_cluster_log_to_syslog: false, mon_cluster_log_to_syslog_level: info, mon_cluster_log_to_syslog_facility: daemon, mon_cluster_log_file: \/var\/log\/ceph\/ceph.log, key: , keyfile: , keyring: \/etc\/ceph\/osd.0.keyring, heartbeat_interval: 5, heartbeat_file: , heartbeat_inject_failure: 0, perf: true, ms_tcp_nodelay: true, ms_tcp_rcvbuf: 0, ms_initial_backoff: 0.2, ms_max_backoff: 15, ms_nocrc: false, ms_die_on_bad_msg: false
, rgw_socket_path: , rgw_host: , rgw_port: , rgw_dns_name: , rgw_script_uri: , rgw_request_uri: , rgw_swift_url: , rgw_swift_url_prefix: swift, rgw_swift_auth_url: , rgw_swift_auth_entry: auth, rgw_keystone_url: , rgw_keystone_admin_token: , rgw_keystone_accepted_roles: Member, admin, rgw_keystone_token_cache_size: 1, rgw_keystone_revocation_interval: 900, rgw_admin_entry: admin, rgw_enforce_swift_acls: true, rgw_swift_token_expiration: 86400, rgw_print_continue: true, rgw_remote_addr_param: REMOTE_ADDR, rgw_op_thread_timeout: 600, rgw_op_thread_suicide_timeout: 0, rgw_thread_pool_size: 100, rgw_num_control_oids: 8, rgw_zone_root_pool: .rgw.root, rgw_log_nonexistent_bucket: false, rgw_log_object_name: %Y-%m-%d-%H-%i-%n, rgw_log_object_name_utc: false, rgw_usage_max_shards: 32, rgw_usage_max_user_shards: 1, rgw_enable_ops_log: false, rgw_enable_usage_log: false, rgw_ops_log_rados: true, rgw_ops_log_socket_path: , rgw_ops_log_data_backlog: 5242880, rgw_usage_log_flush_threshold: 1024, rgw_usage_log_tick_interval: 30, rgw_intent_log_object_name: %Y-%m-%d-%i-%n, rgw_intent_log_object_name_utc: false, rgw_init_timeout: 300, rgw_mime_types_file: \/etc\/mime.types, rgw_gc_max_objs: 32, rgw_gc_obj_min_wait: 7200, rgw_gc_processor_max_time: 3600, rgw_gc_processor_period: 3600, rgw_s3_success_create_obj_status: 0, rgw_resolve_cname: false, rgw_obj_stripe_size: 4194304, rgw_extended_http_attrs: , rgw_exit_timeout_secs: 120, rgw_get_obj_window_size: 16777216, rgw_get_obj_max_req_size: 4194304, rgw_relaxed_s3_bucket_names: false, rgw_list_buckets_max_chunk: 1000, mutex_perf_counter: false, internal_safe_to_start_threads: true} Stefan -Sam On Thu, Aug 1, 2013 at 12:07 PM, Stefan Priebe s.pri...@profihost.ag wrote: Mike we already have the async patch running. Yes it helps but only helps it does not solve. It just hides the issue ... Am 01.08.2013 20:54, schrieb Mike Dawson: I am also seeing recovery issues with 0.61.7. Here's the process: - ceph osd set noout - Reboot one of the nodes hosting OSDs - VMs mounted from RBD volumes work properly - I see the OSD's boot messages as they re-join the cluster - Start seeing active+recovery_wait, peering, and active+recovering - VMs mounted from RBD volumes become unresponsive. - Recovery completes - VMs mounted from RBD volumes regain responsiveness - ceph osd unset noout Would joshd's async patch for qemu help here, or is there something else going on? Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 2:34 PM, Samuel Just wrote: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and re backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? 
Greets, Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
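When comparing dumps like the one above, filtering the admin socket output down to the recovery knobs is usually enough (a sketch; the socket path follows the default naming, adjust the osd id):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'recovery|backfill'

This shows the values the running daemon actually uses, which can differ from ceph.conf if anything was injected at runtime.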
Re: still recovery issues with cuttlefish
I already tried both values this makes no difference. The drives are not the bottleneck. Am 02.08.2013 19:35, schrieb Samuel Just: You might try turning osd_max_backfills to 2 or 1. -Sam On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 01.08.2013 23:23, schrieb Samuel Just: Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd.osdid.asok config show Sure. { name: osd.0, cluster: ceph, none: 0\/5, lockdep: 0\/0, context: 0\/0, crush: 0\/0, mds: 0\/0, mds_balancer: 0\/0, mds_locker: 0\/0, mds_log: 0\/0, mds_log_expire: 0\/0, mds_migrator: 0\/0, buffer: 0\/0, timer: 0\/0, filer: 0\/0, striper: 0\/1, objecter: 0\/0, rados: 0\/0, rbd: 0\/0, journaler: 0\/0, objectcacher: 0\/0, client: 0\/0, osd: 0\/0, optracker: 0\/0, objclass: 0\/0, filestore: 0\/0, journal: 0\/0, ms: 0\/0, mon: 0\/0, monc: 0\/0, paxos: 0\/0, tp: 0\/0, auth: 0\/0, crypto: 1\/5, finisher: 0\/0, heartbeatmap: 0\/0, perfcounter: 0\/0, rgw: 0\/0, hadoop: 0\/0, javaclient: 1\/5, asok: 0\/0, throttle: 0\/0, host: cloud1-1268, fsid: ----, public_addr: 10.255.0.90:0\/0, cluster_addr: 10.255.0.90:0\/0, public_network: 10.255.0.1\/24, cluster_network: 10.255.0.1\/24, num_client: 1, monmap: , mon_host: , lockdep: false, run_dir: \/var\/run\/ceph, admin_socket: \/var\/run\/ceph\/ceph-osd.0.asok, daemonize: true, pid_file: \/var\/run\/ceph\/osd.0.pid, chdir: \/, max_open_files: 0, fatal_signal_handlers: true, log_file: \/var\/log\/ceph\/ceph-osd.0.log, log_max_new: 1000, log_max_recent: 1, log_to_stderr: false, err_to_stderr: true, log_to_syslog: false, err_to_syslog: false, log_flush_on_exit: true, log_stop_at_utilization: 0.97, clog_to_monitors: true, clog_to_syslog: false, clog_to_syslog_level: info, clog_to_syslog_facility: daemon, mon_cluster_log_to_syslog: false, mon_cluster_log_to_syslog_level: info, mon_cluster_log_to_syslog_facility: daemon, mon_cluster_log_file: \/var\/log\/ceph\/ceph.log, key: , keyfile: , keyring: \/etc\/ceph\/osd.0.keyring, heartbeat_interval: 5, heartbeat_file: , heartbeat_inject_failure: 0, perf: true, ms_tcp_nodelay: true, ms_tcp_rcvbuf: 0, ms_initial_backoff: 0.2, ms_max_backoff: 15, ms_nocrc: false, ms_die_on_bad_msg: false, ms_die_on_unhandled_msg: false, ms_dispatch_throttle_bytes: 104857600, ms_bind_ipv6: false, ms_bind_port_min: 6800, ms_bind_port_max: 7100, ms_rwthread_stack_bytes: 1048576, ms_tcp_read_timeout: 900, ms_pq_max_tokens_per_priority: 4194304, ms_pq_min_cost: 65536, ms_inject_socket_failures: 0, ms_inject_delay_type: , ms_inject_delay_max: 1, ms_inject_delay_probability: 0, ms_inject_internal_delays: 0, mon_data: \/var\/lib\/ceph\/mon\/ceph-0, mon_initial_members: , mon_sync_fs_threshold: 5, mon_compact_on_start: false, mon_compact_on_bootstrap: false, mon_compact_on_trim: true, mon_tick_interval: 5, mon_subscribe_interval: 300, mon_osd_laggy_halflife: 3600, mon_osd_laggy_weight: 0.3, mon_osd_adjust_heartbeat_grace: true, mon_osd_adjust_down_out_interval: true, mon_osd_auto_mark_in: false, mon_osd_auto_mark_auto_out_in: true, mon_osd_auto_mark_new_in: true, mon_osd_down_out_interval: 300, mon_osd_down_out_subtree_limit: rack, mon_osd_min_up_ratio: 0.3, mon_osd_min_in_ratio: 0.3, mon_stat_smooth_intervals: 2, mon_lease: 5, mon_lease_renew_interval: 3, mon_lease_ack_timeout: 10, mon_clock_drift_allowed: 0.05, mon_clock_drift_warn_backoff: 5, mon_timecheck_interval: 300, mon_accept_timeout: 10, mon_pg_create_interval: 30, mon_pg_stuck_threshold: 300, mon_osd_full_ratio: 0.95, mon_osd_nearfull_ratio: 0.85, mon_globalid_prealloc: 100, 
mon_osd_report_timeout: 900, mon_force_standby_active: true, mon_min_osdmap_epochs: 500, mon_max_pgmap_epochs: 500, mon_max_log_epochs: 500, mon_max_osd: 1, mon_probe_timeout: 2, mon_slurp_timeout: 10, mon_slurp_bytes: 262144, mon_client_bytes: 104857600, mon_daemon_bytes: 419430400, mon_max_log_entries_per_event: 4096, mon_health_data_update_interval: 60, mon_data_avail_crit: 5, mon_data_avail_warn: 30, mon_config_key_max_entry_size: 4096, mon_sync_trim_timeout: 30, mon_sync_heartbeat_timeout: 30, mon_sync_heartbeat_interval: 5, mon_sync_backoff_timeout: 30, mon_sync_timeout: 30, mon_sync_max_retries: 5, mon_sync_max_payload_size: 1048576, mon_sync_debug: false, mon_sync_debug_leader: -1, mon_sync_debug_provider: -1, mon_sync_debug_provider_fallback: -1, mon_debug_dump_transactions: false, mon_debug_dump_location: \/var\/log\/ceph\/ceph-osd.0.tdump, mon_sync_leader_kill_at: 0
Re: still recovery issues with cuttlefish
Hi, osd recovery max active = 1 osd max backfills = 1 osd recovery op priority = 5 still no difference... Stefan Am 02.08.2013 20:21, schrieb Samuel Just: Also, you have osd_recovery_op_priority at 50. That is close to the priority of client IO. You want it below 10 (defaults to 10), perhaps at 1. You can also adjust down osd_recovery_max_active. -Sam On Fri, Aug 2, 2013 at 11:16 AM, Stefan Priebe s.pri...@profihost.ag wrote: I already tried both values this makes no difference. The drives are not the bottleneck. Am 02.08.2013 19:35, schrieb Samuel Just: You might try turning osd_max_backfills to 2 or 1. -Sam On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 01.08.2013 23:23, schrieb Samuel Just: Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd.osdid.asok config show Sure. { name: osd.0, cluster: ceph, none: 0\/5, lockdep: 0\/0, context: 0\/0, crush: 0\/0, mds: 0\/0, mds_balancer: 0\/0, mds_locker: 0\/0, mds_log: 0\/0, mds_log_expire: 0\/0, mds_migrator: 0\/0, buffer: 0\/0, timer: 0\/0, filer: 0\/0, striper: 0\/1, objecter: 0\/0, rados: 0\/0, rbd: 0\/0, journaler: 0\/0, objectcacher: 0\/0, client: 0\/0, osd: 0\/0, optracker: 0\/0, objclass: 0\/0, filestore: 0\/0, journal: 0\/0, ms: 0\/0, mon: 0\/0, monc: 0\/0, paxos: 0\/0, tp: 0\/0, auth: 0\/0, crypto: 1\/5, finisher: 0\/0, heartbeatmap: 0\/0, perfcounter: 0\/0, rgw: 0\/0, hadoop: 0\/0, javaclient: 1\/5, asok: 0\/0, throttle: 0\/0, host: cloud1-1268, fsid: ----, public_addr: 10.255.0.90:0\/0, cluster_addr: 10.255.0.90:0\/0, public_network: 10.255.0.1\/24, cluster_network: 10.255.0.1\/24, num_client: 1, monmap: , mon_host: , lockdep: false, run_dir: \/var\/run\/ceph, admin_socket: \/var\/run\/ceph\/ceph-osd.0.asok, daemonize: true, pid_file: \/var\/run\/ceph\/osd.0.pid, chdir: \/, max_open_files: 0, fatal_signal_handlers: true, log_file: \/var\/log\/ceph\/ceph-osd.0.log, log_max_new: 1000, log_max_recent: 1, log_to_stderr: false, err_to_stderr: true, log_to_syslog: false, err_to_syslog: false, log_flush_on_exit: true, log_stop_at_utilization: 0.97, clog_to_monitors: true, clog_to_syslog: false, clog_to_syslog_level: info, clog_to_syslog_facility: daemon, mon_cluster_log_to_syslog: false, mon_cluster_log_to_syslog_level: info, mon_cluster_log_to_syslog_facility: daemon, mon_cluster_log_file: \/var\/log\/ceph\/ceph.log, key: , keyfile: , keyring: \/etc\/ceph\/osd.0.keyring, heartbeat_interval: 5, heartbeat_file: , heartbeat_inject_failure: 0, perf: true, ms_tcp_nodelay: true, ms_tcp_rcvbuf: 0, ms_initial_backoff: 0.2, ms_max_backoff: 15, ms_nocrc: false, ms_die_on_bad_msg: false, ms_die_on_unhandled_msg: false, ms_dispatch_throttle_bytes: 104857600, ms_bind_ipv6: false, ms_bind_port_min: 6800, ms_bind_port_max: 7100, ms_rwthread_stack_bytes: 1048576, ms_tcp_read_timeout: 900, ms_pq_max_tokens_per_priority: 4194304, ms_pq_min_cost: 65536, ms_inject_socket_failures: 0, ms_inject_delay_type: , ms_inject_delay_max: 1, ms_inject_delay_probability: 0, ms_inject_internal_delays: 0, mon_data: \/var\/lib\/ceph\/mon\/ceph-0, mon_initial_members: , mon_sync_fs_threshold: 5, mon_compact_on_start: false, mon_compact_on_bootstrap: false, mon_compact_on_trim: true, mon_tick_interval: 5, mon_subscribe_interval: 300, mon_osd_laggy_halflife: 3600, mon_osd_laggy_weight: 0.3, mon_osd_adjust_heartbeat_grace: true, mon_osd_adjust_down_out_interval: true, mon_osd_auto_mark_in: false, mon_osd_auto_mark_auto_out_in: true, mon_osd_auto_mark_new_in: true, mon_osd_down_out_interval: 300, mon_osd_down_out_subtree_limit: rack, 
mon_osd_min_up_ratio: 0.3, mon_osd_min_in_ratio: 0.3, mon_stat_smooth_intervals: 2, mon_lease: 5, mon_lease_renew_interval: 3, mon_lease_ack_timeout: 10, mon_clock_drift_allowed: 0.05, mon_clock_drift_warn_backoff: 5, mon_timecheck_interval: 300, mon_accept_timeout: 10, mon_pg_create_interval: 30, mon_pg_stuck_threshold: 300, mon_osd_full_ratio: 0.95, mon_osd_nearfull_ratio: 0.85, mon_globalid_prealloc: 100, mon_osd_report_timeout: 900, mon_force_standby_active: true, mon_min_osdmap_epochs: 500, mon_max_pgmap_epochs: 500, mon_max_log_epochs: 500, mon_max_osd: 1, mon_probe_timeout: 2, mon_slurp_timeout: 10, mon_slurp_bytes: 262144, mon_client_bytes: 104857600, mon_daemon_bytes: 419430400, mon_max_log_entries_per_event: 4096, mon_health_data_update_interval
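The three settings above can also be applied to a live cluster without restarting the osds (a sketch; repeat per osd id, or use the osd.* wildcard where the installed version supports it):

ceph tell osd.0 injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1 --osd-recovery-op-priority 1'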
still recovery issues with cuttlefish
Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off until ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
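For anyone reproducing this, the effect can be quantified from the cluster log (a sketch; paths are the defaults):

# watch recovery state transitions live while the osd rejoins
ceph -w

# afterwards, count the slow ops seen during the window
grep 'slow request' /var/log/ceph/ceph.log | wc -l

Comparing the count for an immediate restart against the leave-it-out-until-backfill case would put numbers on the difference described here.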
Re: still recovery issues with cuttlefish
Am 01.08.2013 20:34, schrieb Samuel Just: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam Sure, which log levels? On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off until ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
Re: still recovery issues with cuttlefish
Mike, we already have the async patch running. Yes, it helps, but it only helps; it does not solve the problem. It just hides the issue ... Am 01.08.2013 20:54, schrieb Mike Dawson: I am also seeing recovery issues with 0.61.7. Here's the process:

- ceph osd set noout
- Reboot one of the nodes hosting OSDs
- VMs mounted from RBD volumes work properly
- I see the OSD's boot messages as they re-join the cluster
- Start seeing active+recovery_wait, peering, and active+recovering
- VMs mounted from RBD volumes become unresponsive.
- Recovery completes
- VMs mounted from RBD volumes regain responsiveness
- ceph osd unset noout

Would joshd's async patch for qemu help here, or is there something else going on? Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 2:34 PM, Samuel Just wrote: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off until ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
Upgrading from 0.61.5 to 0.61.6 ended in disaster
Hi, today i wanted to upgrade from 0.61.5 to 0.61.6 to get rid of the mon bug. But this ended in a complete desaster. What i've done: 1.) recompiled ceph tagged with 0.61.6 2.) installed new ceph version on all machines 3.) JUST tried to restart ONE mon this failed with: [1774]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf ' 2013-07-24 08:41:43.086951 7f53c185d700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2013-07-24 08:41:43.088090 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:43.088094 7f53c185d700 0 mon.a@0(???).health(3840) HealthMonitor::service_shutdown 1 services 2013-07-24 08:41:43.088097 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: (Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- begin dump of recent events --- -13 2013-07-24 08:41:44.222821 7fae6384a780 5 asok(0x2698000) register_command perfcounters_dump hook 0x2682010 -12 2013-07-24 08:41:44.222835 7fae6384a780 5 asok(0x2698000) register_command 1 hook 0x2682010 -11 2013-07-24 08:41:44.222837 7fae6384a780 5 asok(0x2698000) register_command perf dump hook 0x2682010 -10 2013-07-24 08:41:44.222842 7fae6384a780 5 asok(0x2698000) register_command perfcounters_schema hook 0x2682010 -9 2013-07-24 08:41:44.222845 7fae6384a780 5 asok(0x2698000) register_command 2 hook 0x2682010 -8 2013-07-24 08:41:44.222847 7fae6384a780 5 asok(0x2698000) register_command perf schema hook 0x2682010 -7 2013-07-24 08:41:44.222849 7fae6384a780 5 asok(0x2698000) register_command config show hook 0x2682010 -6 2013-07-24 08:41:44.222852 7fae6384a780 5 asok(0x2698000) register_command config set hook 0x2682010 -5 2013-07-24 08:41:44.222854 7fae6384a780 5 asok(0x2698000) register_command log flush hook 0x2682010 -4 2013-07-24 08:41:44.222856 7fae6384a780 5 asok(0x2698000) register_command log dump hook 0x2682010 -3 2013-07-24 08:41:44.222859 7fae6384a780 5 asok(0x2698000) register_command log reopen hook 0x2682010 -2 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 -1 2013-07-24 08:41:44.224397 7fae6384a780 1 finished global_init_daemonize 0 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: 
(Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 4.) i thought no problem mon.b and mon.c are still running. BUT all OSDs were still trying to reach mon.a 2013-07-24 08:41:43.088997 7f011268f700 0 monclient: hunting for new mon 2013-07-24 08:41:56.792449 7f0109e7e700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x489e000 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:02.792990 7f0116b6c700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x3c02780 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:11.793525 7f0109d7d700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x84ec280 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:23.794315 7f0109e7e700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x44c7b80 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:27.621336 7f0122d2e700 0 log [WRN] : 5 slow requests, 5 included below; oldest blocked for 30.378391 secs 2013-07-24 08:42:27.621344 7f0122d2e700 0 log [WRN] : slow request 30.378391 seconds old, received at 2013-07-24 08:41:57.242902: osd_op(client.14727601.0:3839848
Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
Hi, i uploaded my ceph mon store to cephdrop /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz. So hopefully someone can find the culprit soon. It fails in OSDMonitor.cc here: // if we trigger this, then there's something else going with the store // state, and we shouldn't want to work around it without knowing what // exactly happened. assert(latest_full 0); Stefan Am 24.07.2013 09:05, schrieb Stefan Priebe - Profihost AG: Hi, today i wanted to upgrade from 0.61.5 to 0.61.6 to get rid of the mon bug. But this ended in a complete desaster. What i've done: 1.) recompiled ceph tagged with 0.61.6 2.) installed new ceph version on all machines 3.) JUST tried to restart ONE mon this failed with: [1774]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf ' 2013-07-24 08:41:43.086951 7f53c185d700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2013-07-24 08:41:43.088090 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:43.088094 7f53c185d700 0 mon.a@0(???).health(3840) HealthMonitor::service_shutdown 1 services 2013-07-24 08:41:43.088097 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: (Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 
--- begin dump of recent events --- -13 2013-07-24 08:41:44.222821 7fae6384a780 5 asok(0x2698000) register_command perfcounters_dump hook 0x2682010 -12 2013-07-24 08:41:44.222835 7fae6384a780 5 asok(0x2698000) register_command 1 hook 0x2682010 -11 2013-07-24 08:41:44.222837 7fae6384a780 5 asok(0x2698000) register_command perf dump hook 0x2682010 -10 2013-07-24 08:41:44.222842 7fae6384a780 5 asok(0x2698000) register_command perfcounters_schema hook 0x2682010 -9 2013-07-24 08:41:44.222845 7fae6384a780 5 asok(0x2698000) register_command 2 hook 0x2682010 -8 2013-07-24 08:41:44.222847 7fae6384a780 5 asok(0x2698000) register_command perf schema hook 0x2682010 -7 2013-07-24 08:41:44.222849 7fae6384a780 5 asok(0x2698000) register_command config show hook 0x2682010 -6 2013-07-24 08:41:44.222852 7fae6384a780 5 asok(0x2698000) register_command config set hook 0x2682010 -5 2013-07-24 08:41:44.222854 7fae6384a780 5 asok(0x2698000) register_command log flush hook 0x2682010 -4 2013-07-24 08:41:44.222856 7fae6384a780 5 asok(0x2698000) register_command log dump hook 0x2682010 -3 2013-07-24 08:41:44.222859 7fae6384a780 5 asok(0x2698000) register_command log reopen hook 0x2682010 -2 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 -1 2013-07-24 08:41:44.224397 7fae6384a780 1 finished global_init_daemonize 0 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: (Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 4.) i thought no problem mon.b and mon.c are still running. BUT all OSDs were still trying to reach mon.a 2013-07-24 08:41:43.088997 7f011268f700 0 monclient: hunting for new mon 2013-07-24 08:41:56.792449 7f0109e7e700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x489e000 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:02.792990 7f0116b6c700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x3c02780 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:11.793525 7f0109d7d700 0
Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
Am 24.07.2013 13:11, schrieb Joao Eduardo Luis: On 07/24/2013 08:37 AM, Stefan Priebe - Profihost AG wrote: Hi, I uploaded my ceph mon store to cephdrop /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz. So hopefully someone can find the culprit soon. It fails in OSDMonitor.cc here:

// if we trigger this, then there's something else going on with the store
// state, and we shouldn't want to work around it without knowing what
// exactly happened.
assert(latest_full > 0);

Wrong variable being used in a loop as part of a workaround for 5704. Opened a bug for this on http://tracker.ceph.com/issues/5737 A fix is available on wip-5737 (next) and wip-5737-cuttlefish. Tested the mon against your store and it worked flawlessly. Also tested it against the same stores used during the original fix, and they also worked just fine. My question now is how the hell those stores worked fine although the original fix was grabbing what should have been a non-existent version, or how did they not trigger that assert. Which is what I'm going to investigate next. What I don't understand is why the hell the OSDs didn't use the 2nd or 3rd monitor, which weren't restarted? Greets, Stefan
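Whether the surviving mons actually held quorum while mon.a was down can be checked directly (a sketch; the socket path follows the default naming):

ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok mon_status
# or through any reachable mon:
ceph quorum_status

If b and c report a quorum without a, the osds should eventually settle on one of them; persistent "hunting for new mon" messages suggest they kept retrying the dead address instead.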
Re: [ceph-users] v0.61.5 Cuttlefish update released
All mons do not work anymore:

=== mon.a ===
Starting Ceph mon.a on ccad...
[21207]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '

Stefan

On 19.07.2013 07:59, Sage Weil wrote: A note on upgrading: One of the fixes in 0.61.5 is for a 32bit vs 64bit bug with the feature bits. We did not realize it before, but the fix will prevent 0.61.4 (or earlier) from forming a quorum with 0.61.5. This is similar to the upgrade from bobtail (and the future upgrade to dumpling). As such, we recommend you upgrade all monitors at once to avoid the potential for disruption in service. I'm adding a note to the release notes. Thanks! sage

On Thu, 18 Jul 2013, Sage Weil wrote: We've prepared another update for the Cuttlefish v0.61.x series. This release primarily contains monitor stability improvements, although there are also some important fixes for ceph-osd for large clusters and a few important CephFS fixes. We recommend that all v0.61.x users upgrade.

* mon: misc sync improvements (faster, more reliable, better tuning)
* mon: enable leveldb cache by default (big performance improvement)
* mon: new scrub feature (primarily for diagnostic, testing purposes)
* mon: fix occasional leveldb assertion on startup
* mon: prevent reads until initial state is committed
* mon: improved logic for trimming old osdmaps
* mon: fix pick_addresses bug when expanding mon cluster
* mon: several small paxos fixes, improvements
* mon: fix bug in osdmap trim behavior
* osd: fix several bugs with PG stat reporting
* osd: limit number of maps shared with peers (which could cause domino failures)
* rgw: fix radosgw-admin buckets list (for all buckets)
* mds: fix occasional client failure to reconnect
* mds: fix bad list traversal after unlink
* mds: fix underwater dentry cleanup (occasional crash after mds restart)
* libcephfs, ceph-fuse: fix occasional hangs on umount
* libcephfs, ceph-fuse: fix old bug with O_LAZY vs O_NOATIME confusion
* ceph-disk: more robust journal device detection on RHEL/CentOS
* ceph-disk: better, simpler locking
* ceph-disk: do not inadvertently mount over existing osd mounts
* ceph-disk: better handling for unusual device names
* sysvinit, upstart: handle symlinks in /var/lib/ceph/*

Please also refer to the complete release notes: http://ceph.com/docs/master/release-notes/#v0-61-5-cuttlefish

You can get v0.61.5 from the usual locations:
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.61.5.tar.gz
* For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm
Re: [ceph-users] v0.61.5 Cuttlefish update released
crash is this one:

2013-07-19 08:59:32.137646 7f484a872780  0 ceph version 0.61.5-17-g83f8b88 (83f8b88e5be41371cb77b39c0966e79cad92087b), process ceph-mon, pid 22172
2013-07-19 08:59:32.173975 7f484a872780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f484a872780 time 2013-07-19 08:59:32.173506
mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0)
 ceph version 0.61.5-17-g83f8b88 (83f8b88e5be41371cb77b39c0966e79cad92087b)
 1: (OSDMonitor::update_from_paxos(bool*)+0x16e1) [0x51d341]
 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
 4: (Monitor::init_paxos()+0xe5) [0x48f955]
 5: (Monitor::preinit()+0x679) [0x4bba79]
 6: (main()+0x36b0) [0x484bb0]
 7: (__libc_start_main()+0xfd) [0x7f48489cec8d]
 8: /usr/bin/ceph-mon() [0x4801e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -13> 2013-07-19 08:59:32.136172 7f484a872780  5 asok(0x131a000) register_command perfcounters_dump hook 0x1304010
   -12> 2013-07-19 08:59:32.136191 7f484a872780  5 asok(0x131a000) register_command 1 hook 0x1304010
   -11> 2013-07-19 08:59:32.136194 7f484a872780  5 asok(0x131a000) register_command perf dump hook 0x1304010
   -10> 2013-07-19 08:59:32.136200 7f484a872780  5 asok(0x131a000) register_command perfcounters_schema hook 0x1304010
    -9> 2013-07-19 08:59:32.136204 7f484a872780  5 asok(0x131a000) register_command 2 hook 0x1304010
    -8> 2013-07-19 08:59:32.136206 7f484a872780  5 asok(0x131a000) register_command perf schema hook 0x1304010
    -7> 2013-07-19 08:59:32.136208 7f484a872780  5 asok(0x131a000) register_command config show hook 0x1304010
    -6> 2013-07-19 08:59:32.136211 7f484a872780  5 asok(0x131a000) register_command config set hook 0x1304010
    -5> 2013-07-19 08:59:32.136214 7f484a872780  5 asok(0x131a000) register_command log flush hook 0x1304010
    -4> 2013-07-19 08:59:32.136216 7f484a872780  5 asok(0x131a000) register_command log dump hook 0x1304010
    -3> 2013-07-19 08:59:32.136219 7f484a872780  5 asok(0x131a000) register_command log reopen hook 0x1304010
    -2> 2013-07-19 08:59:32.137646 7f484a872780  0 ceph version 0.61.5-17-g83f8b88 (83f8b88e5be41371cb77b39c0966e79cad92087b), process ceph-mon, pid 22172
    -1> 2013-07-19 08:59:32.137967 7f484a872780  1 finished global_init_daemonize
     0> 2013-07-19 08:59:32.173975 7f484a872780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f484a872780 time 2013-07-19 08:59:32.173506
mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0)
[snip - same backtrace as above]

On 19.07.2013 08:58, Stefan Priebe - Profihost AG wrote: All mons do not work anymore: === mon.a === Starting Ceph mon.a on ccad... [21207]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf ' Stefan

On 19.07.2013 07:59, Sage Weil wrote: A note on upgrading: [snip - release announcement quoted in full above]
Re: [ceph-users] v0.61.5 Cuttlefish update released
Complete output / log with debug mon 20 here: http://pastebin.com/raw.php?i=HzegqkFz

Stefan

On 19.07.2013 09:00, Stefan Priebe - Profihost AG wrote: crash is this one: mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0) [snip - crash dump and earlier quotes trimmed; see previous message]
Re: [ceph-users] v0.61.5 Cuttlefish update released
On 19.07.2013 09:56, Dan van der Ster wrote: Was that 0.61.4 -> 0.61.5? Our upgrade of all mons and osds on SL6.4 went without incident. -- Dan van der Ster, CERN IT-DSS

It was from a git version in between 0.61.4 / 0.61.5 to 0.61.5.

Stefan

On Friday, July 19, 2013 at 9:00 AM, Stefan Priebe - Profihost AG wrote: crash is this one: mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0) [snip - crash dump and earlier quotes trimmed]
[PATCH] mon: use first_committed instead of latest_full map if latest_bl.length() == 0
this fixes a failure like:

     0> 2013-07-19 09:29:16.803918 7f7fb5f31780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f7fb5f31780 time 2013-07-19 09:29:16.803439
mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0)
 ceph version 0.61.5-15-g72c7c74 (72c7c74e1f160e6be39b6edf30bce09b770fa777)
 1: (OSDMonitor::update_from_paxos(bool*)+0x16e1) [0x51d121]
 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2a46]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
 4: (Monitor::init_paxos()+0xe5) [0x48f955]
 5: (Monitor::preinit()+0x679) [0x4b1cf9]
 6: (main()+0x36b0) [0x484bb0]
 7: (__libc_start_main()+0xfd) [0x7f7fb408dc8d]
 8: /usr/bin/ceph-mon() [0x4801e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
---
 src/mon/OSDMonitor.cc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 9c854cd..ab3b8ec 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -129,6 +129,12 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
   if ((latest_full > 0) && (latest_full > osdmap.epoch)) {
     bufferlist latest_bl;
     get_version_full(latest_full, latest_bl);
+
+    if (latest_bl.length() == 0 && latest_full != 0 && get_first_committed() > 1) {
+      dout(0) << __func__ << " latest_bl.length() == 0, use first_committed instead of latest_full" << dendl;
+      latest_full = get_first_committed();
+      get_version_full(latest_full, latest_bl);
+    }
     assert(latest_bl.length() != 0);
     dout(7) << __func__ << " loading latest full map e" << latest_full << dendl;
     osdmap.decode(latest_bl);
--
1.7.10.4
Re: [PATCH] mon: use first_committed instead of latest_full map if latest_bl.length() == 0
Hi,

sorry, as all my mons were down with the same error, I was in a hurry, sadly made no copy of the mons, and worked around it with a hack ;-( But I posted a log to pastebin with debug mon 20 (see last email).

Stefan

Kind regards

Stefan Priebe
Bachelor of Science in Computer Science (BSCS)
Member of the board (CTO)
---
Profihost AG
Am Mittelfelde 29
30519 Hannover
Germany
Tel.: +49 (511) 5151 8181 | Fax: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com
Registered office: Hannover, VAT ID DE813460827
Commercial register: Amtsgericht Hannover, HRB 202350
Board: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Supervisory board: Prof. Dr. iur. Winfried Huck (Chairman)

On 19.07.2013 14:54, Joao Eduardo Luis wrote: On 07/19/2013 09:31 AM, Stefan Priebe wrote: this fixes a failure like: [snip - patch quoted in full above]

+    if (latest_bl.length() == 0 && latest_full != 0 && get_first_committed() > 1) {

latest_full is always > 0 here, following the previous if check.

+      dout(0) << __func__ << " latest_bl.length() == 0, use first_committed instead of latest_full" << dendl;
+      latest_full = get_first_committed();
+      get_version_full(latest_full, latest_bl);
+    }

Although appreciated, this patch fixes the symptom leading to the crash. The bug itself seems to be that there is a latest_full version that is empty. Until we know for sure what is happening and what is leading to such a state, fixing the symptom is not advisable, as it is not only masking the real issue but it may also have unforeseen long-term effects.

Stefan, do you still have the store state on which this was triggered? If so, can you share it with us (or dig a bit into it yourself if you can't share the store, in which case I'll let you know what to look for).

-Joao
Re: slow request problem
Hello list,

might this be a problem due to having too many PGs? I have 370 per OSD instead of 33 per OSD (OSDs * 100 / 3). Is there any plan for PG merging?

Stefan

Hello list, anyone else here who always has problems bringing an offline OSD back online? [snip]
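As a back-of-the-envelope check of the sizing rule quoted above, here is a small sketch in Python; the OSD count is an assumption for illustration, as the actual size of this cluster is not stated in the thread:

# Heuristic from the thread: total PGs ~= OSDs * 100 / replicas.
osds = 24                           # assumed cluster size, illustration only
replicas = 3
total_pgs = osds * 100 // replicas  # 800 PGs in total for the pool
pgs_per_osd = total_pgs / osds      # ~33 per OSD, vs. the 370 reported above
# Counting replicas, each OSD actually hosts copies of roughly
# pgs_per_osd * replicas PGs (~100 here).
print(total_pgs, pgs_per_osd)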
Re: [ceph-users] slow request problem
Hi Sage,

On 14.07.2013 at 17:01, Sage Weil s...@inktank.com wrote: On Sun, 14 Jul 2013, Stefan Priebe wrote: Hello list, might this be a problem due to having too many PGs? I have 370 per OSD instead of 33 per OSD (OSDs * 100 / 3).

That might exacerbate it. Can you try setting

osd min pg log entries = 50
osd max pg log entries = 100

What does that do, exactly? And why is a restart of all OSDs needed? Thanks!

across your cluster, restarting your osds, and see if that makes a difference? I'm wondering if this is a problem with pg log rewrites after peering. Note that adding that option and restarting isn't enough to trigger the trim; you have to hit the cluster with some IO too, and (if this is the source of your problem) the trim itself might be expensive. So add it, restart, do a bunch of IO (to all pools/PGs if you can), and then see if the problem is still present?

Will try. I can't produce a write to every PG; it's a production cluster with KVM RBD. But it has 800-1200 IOPS.

Also note that the lower osd min pg log entries means that the osd cannot be down as long without requiring a backfill (50 IOs per PG). These probably aren't the values that we want, but I'd like to find out whether the pg log rewrites after peering in cuttlefish are the culprit here. Thanks!

Is there any plan for PG merging?

Not right now. :( I'll talk to Sam, though, to see how difficult it would be given the split approach we settled on. Thanks! sage

Stefan
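Spelled out as a config fragment, the suggestion above would land in the [osd] section of ceph.conf on each OSD host. A minimal sketch; note these are the deliberately aggressive test values from this thread, not recommended defaults:

[osd]
osd min pg log entries = 50
osd max pg log entries = 100

As noted above, the new limits only take effect after an OSD restart, and the actual trim happens only once fresh IO hits each PG.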
Re: [ceph-users] slow request problem
On 14.07.2013 18:19, Sage Weil wrote: On Sun, 14 Jul 2013, Stefan Priebe - Profihost AG wrote: What does that do, exactly? And why is a restart of all OSDs needed? Thanks!

This limits the size of the pg log.

[snip]

Hmm, if this is a production cluster, I would be careful, then! Setting the pg logs too short can lead to backfill, which is very expensive (as you know). The defaults are 3000 / 10000, so maybe try something less aggressive like changing min to 500?

I've lowered the values to 500 / 1500 and it seems to lower the impact, but it does not seem to solve that one.

Stefan

Also, I think

ceph osd tell \* injectargs '--osd-min-pg-log-entries 500'

should work as well. But again, be aware that lowering the value will incur a trim that may in itself be a bit expensive (if this is the source of the problem). It is probably worth watching ceph pg dump | grep $some_random_pg and watching the 'v' column over time (say, a minute or two) to see how quickly pg events are being generated on your cluster. This will give you a sense of how much time 500 (or however many) pg log entries covers!

sage
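To make that last point concrete, here is a rough sketch of how much per-PG history a given log length buys, using the write rate mentioned earlier in the thread; the total PG count and the assumption of an even write distribution are hypothetical:

# How long does a pg log of N entries last, roughly?
cluster_write_iops = 1000       # 800-1200 IOPS reported above
total_pgs = 2000                # assumed; not stated in the thread
writes_per_pg = cluster_write_iops / total_pgs     # ~0.5 writes/s per PG
min_log_entries = 500
seconds_covered = min_log_entries / writes_per_pg  # ~1000 s
print(f"~{seconds_covered / 60:.0f} minutes of per-PG history")  # ~17 minutes

In other words, under those assumptions an OSD could be down for roughly a quarter of an hour before its PGs fall off the log and require a full backfill.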
Re: [ceph-users] slow request problem
On 14.07.2013 21:05, Sage Weil wrote: On Sun, 14 Jul 2013, Stefan Priebe wrote: [snip] I've lowered the values to 500 / 1500 and it seems to lower the impact, but it does not seem to solve that one.

This suggests that the problem is the pg log rewrites that are an inherent part of cuttlefish. This is replaced with improved rewrite logic in 0.66 or so, so dumpling will be better. I suspect that having a large number of pgs is exacerbating the issue for you. We think there is still a different peering performance problem that Sam and paravoid have been trying to track down, but I believe in that case reducing the pg log sizes didn't have much effect. (Maybe one of them can chime in here.) This was unfortunately something we failed to catch before cuttlefish was released. One of the main focuses right now is on creating large clusters and observing peering and recovery to make sure we don't repeat the same sort of mistake for dumpling!

Thanks, Sage, for this information. I had some OSD restarts which went better with the new settings, but others which didn't. And it's hard to measure and compare a restart of OSD.X with a restart of OSD.Y. Do you have any recommendations for me? Wait for dumpling and hope that nothing fails until then? Or upgrade to 0.66? Or try to move all data to a new pool with fewer PGs? Thanks!

Greets,
Stefan
slow request problem
Hello list,

anyone else here who always has problems bringing an offline OSD back online? Since cuttlefish I'm seeing slow requests for the first 2-5 minutes after bringing an OSD online again, and that's so long that the VMs crash as they think their disk is offline... Under bobtail I never had any problems with that.

Please HELP!

Greets,
Stefan
still cuttlefish recovery problems
Hello,

while the peering problems are gone with 0.61.4 (bug 5232), I'm still having heavy problems with recovery after an OSD or host restart. I'm seeing a lot of slow requests and stuck I/O from clients. I've opened a bug report here: http://tracker.ceph.com/issues/5401

I really would like to know if I'm the only one. Should I update to 0.66?

Greets,
Stefan
flatten rbd export / export-diff ?
Hi,

is there a way to flatten a chain of rbd export-diffs into a new image file? Or do I always have to:

rbd import <OLD BASE IMAGE>
rbd import-diff diff1
rbd import-diff diff1-2
rbd import-diff diff2-3
rbd import-diff diff3-4
rbd import-diff diff4-5

... and so on? I would like to apply the diffs on local disk and then import the new file.

Stefan
Re: flatten rbd export / export-diff ?
On 04.06.2013 17:23, Sage Weil wrote: On Tue, 4 Jun 2013, Stefan Priebe - Profihost AG wrote: Hi, is there a way to flatten a chain of rbd export-diffs into a new image file? [snip]

Not currently. The format is very simple, though; it should be pretty simple to implement a subcommand in the rbd tool to do it.

Oh, my C skills are more than limited ;-( I could do it in Perl ;-) Is there a format description?

Stefan
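For what it's worth, the stream that rbd export-diff writes is described in the ceph tree under doc/dev/rbd-diff.rst. Below is a rough Python sketch of applying one such diff to a local raw image file, assuming the v1 layout from that document (a "rbd diff v1\n" header, then one-byte-tagged records). This is untested illustration code, not part of any ceph tool:

import struct

def apply_diff(diff_path, image_path):
    """Apply one `rbd export-diff` stream (assumed v1 format) to a raw image.

    Assumed record tags per doc/dev/rbd-diff.rst:
      'f'/'t' = from/to snapshot name (le32 length + bytes),
      's' = image size (le64), 'w' = written extent (le64 offset,
      le64 length, data), 'z' = zeroed extent, 'e' = end.
    """
    with open(diff_path, 'rb') as diff, open(image_path, 'r+b') as img:
        assert diff.read(12) == b'rbd diff v1\n', 'not a v1 diff stream'
        while True:
            tag = diff.read(1)
            if not tag:
                raise ValueError('unexpected end of stream')
            if tag == b'e':                      # end record
                break
            elif tag in (b'f', b't'):            # snapshot names; skip
                (n,) = struct.unpack('<I', diff.read(4))
                diff.read(n)
            elif tag == b's':                    # (new) image size
                (size,) = struct.unpack('<Q', diff.read(8))
                img.truncate(size)
            elif tag == b'w':                    # updated data extent
                off, length = struct.unpack('<QQ', diff.read(16))
                img.seek(off)
                img.write(diff.read(length))
            elif tag == b'z':                    # zeroed extent
                off, length = struct.unpack('<QQ', diff.read(16))
                img.seek(off)
                img.write(b'\x00' * length)
            else:
                raise ValueError('unknown record tag %r' % tag)

With something along these lines one could rbd export the base image once, fold each export-diff into the local file, and rbd import the final result in a single step, which is the flattening workflow asked about above.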