Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
> ----- Original Message -----
>> From: "Alexandre DERUMIER"
>> To: "ceph-devel"
>> Cc: "qemu-devel", jdur...@redhat.com
>> Sent: Monday, November 9, 2015 5:48:45 AM
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
>>
>> Adding to ceph.conf
>>
>> [client]
>> rbd_non_blocking_aio = false
>>
>> fixes the problem for me (with rbd_cache=false).
>>
>> (@cc jdur...@redhat.com)

+1, same for me.

Stefan

>> ----- Original Message -----
>> From: "Denis V. Lunev"
>> To: "aderumier", "ceph-devel", "qemu-devel"
>> Sent: Monday, November 9, 2015 08:22:34
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
>>
>> On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
>>> On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
>>>> Hi,
>>>>
>>>> with qemu (2.4.1), if I do an internal snapshot of an rbd device and
>>>> then pause the vm with vm_stop, the qemu process hangs forever.
>>>>
>>>> Monitor commands to reproduce:
>>>>
>>>> # snapshot_blkdev_internal drive-virtio0 yoursnapname
>>>> # stop
>>>>
>>>> I don't see this with the qcow2 or sheepdog block drivers, for example.
>>>>
>>>> Regards,
>>>> Alexandre
>>>
>>> This looks like the problem I have recently been trying to
>>> fix with dataplane enabled. The patch series is named
>>>
>>> [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
>>>
>>> Den
>>
>> Anyway, even if the above does not help, can you collect gdb
>> traces from all threads in the QEMU process? Maybe I'll be
>> able to give a hint.
>>
>> Den
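For the gdb traces requested above, a minimal non-interactive way to dump a
backtrace of every thread in a running QEMU process (assuming gdb is
installed and the binary is named qemu-system-x86_64) would be:

# gdb -p $(pidof qemu-system-x86_64) -batch -ex 'thread apply all bt' > qemu-threads.txt

This attaches, prints one backtrace per thread, detaches, and leaves the
output in qemu-threads.txt; having qemu and librbd debug symbols installed
makes the traces far more useful.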
Re: Ceph Hackathon: More Memory Allocator Testing
On 19.08.2015 at 22:34, Somnath Roy wrote:
> But, you said you need to remove libcmalloc *not* libtcmalloc...
> I saw librbd/librados is built with libcmalloc, not with libtcmalloc..
> So, are you saying to remove libtcmalloc (not libcmalloc) to enable jemalloc?

Ouch, my mistake. I read libtcmalloc - too late here.

My build (Hammer) says:

# ldd /usr/lib/librados.so.2.0.0
        linux-vdso.so.1 => (0x7fff4f71d000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fafdb26c000)
        libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fafdb24f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fafdb032000)
        libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fafda924000)
        libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fafda71f000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fafda516000)
        libboost_system.so.1.49.0 => /usr/lib/libboost_system.so.1.49.0 (0x7fafda512000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fafda20b000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fafd9f88000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fafd9bfd000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fafd99e7000)
        /lib64/ld-linux-x86-64.so.2 (0x56358ecfe000)

Only ceph-osd is linked against libjemalloc for me.

Stefan

-----Original Message-----
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, August 19, 2015 1:31 PM
To: Somnath Roy; Alexandre DERUMIER; Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

On 19.08.2015 at 22:29, Somnath Roy wrote:
> Hmm... We need to fix that as part of configure/Makefile I guess (?)..
> Since we have done this jemalloc integration originally, we can take that
> ownership unless anybody sees a problem with enabling tcmalloc/jemalloc
> for librbd/librados.
>
>> You have to remove libcmalloc out of your build environment to get this done

How do I do that? I am using Ubuntu and can't afford to remove libc* packages.

I always use a chroot to build packages where only a minimal bootstrap + the
build deps are installed. googleperftools, where libtcmalloc comes from, is
not Ubuntu core/minimal.

Stefan

Thanks & Regards
Somnath

-----Original Message-----
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, August 19, 2015 1:18 PM
To: Somnath Roy; Alexandre DERUMIER; Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

On 19.08.2015 at 22:16, Somnath Roy wrote:
> Alexandre,
> I am not able to build librados/librbd by using the following config option:
>
> ./configure --without-tcmalloc --with-jemalloc

Same issue for me. You have to remove libcmalloc out of your build
environment to get this done.

Stefan

> It seems it is building osd/mon/mds/rgw with jemalloc enabled..
>
> root@emsnode10:~/ceph-latest/src# ldd ./ceph-osd
>         linux-vdso.so.1 => (0x7ffd0eb43000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f5f92d70000)
>         ...
>
> root@emsnode10:~/ceph-latest/src/.libs# ldd ./librados.so.2.0.0
>         linux-vdso.so.1 => (0x7ffed46f2000)
>         libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x7ff687887000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7ff68763d000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7ff687438000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7ff68721a000)
>         libnss3.so => /usr/lib/x86_64-linux-gnu/libnss3.so (0x7ff686ee0000)
>         libsmime3.so => /usr/lib/x86_64-linux-gnu/libsmime3.so (0x7ff686cb3000)
>         libnspr4.so => /usr/lib/x86_64-linux-gnu/libnspr4.so (0x7ff686a76000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7ff686871000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7ff686668000)
>         libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x7ff686464000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7ff686160000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7ff685e59000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7ff685a94000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7ff68587e000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7ff685663000)
>         liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7ff68545c000)
>         liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7ff685255000)
>         /lib64/ld-linux-x86-64.so.2 (0x7ff68a0f6000)
>         libnssutil3.so => /usr/lib/x86_64-linux-gnu/libnssutil3.so (0x7ff685029000)
>         libplc4.so => /usr/lib/x86_64-linux-gnu/libplc4.so (0x7ff684e24000)
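For reference, the minimal-chroot build Stefan describes could look roughly
like this on Debian/Ubuntu (the suite name and target path are illustrative,
and the exact build-dependency list is left out):

# debootstrap wheezy /srv/ceph-build http://deb.debian.org/debian
# chroot /srv/ceph-build
# apt-get install build-essential libjemalloc-dev   (plus the ceph build deps, but no libgoogle-perftools-dev)
# ./configure --without-tcmalloc --with-jemalloc

The point is simply that tcmalloc never exists inside the build environment,
so configure cannot silently pick it up.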
Re: Ceph Hackathon: More Memory Allocator Testing
On 19.08.2015 at 22:29, Somnath Roy wrote:
> Hmm... We need to fix that as part of configure/Makefile I guess (?)..
> Since we have done this jemalloc integration originally, we can take that
> ownership unless anybody sees a problem with enabling tcmalloc/jemalloc
> for librbd/librados.
>
>> You have to remove libcmalloc out of your build environment to get this done

How do I do that? I am using Ubuntu and can't afford to remove libc* packages.

I always use a chroot to build packages where only a minimal bootstrap + the
build deps are installed. googleperftools, where libtcmalloc comes from, is
not Ubuntu core/minimal.

Stefan

Thanks & Regards
Somnath

-----Original Message-----
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, August 19, 2015 1:18 PM
To: Somnath Roy; Alexandre DERUMIER; Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

On 19.08.2015 at 22:16, Somnath Roy wrote:
> Alexandre,
> I am not able to build librados/librbd by using the following config option:
>
> ./configure --without-tcmalloc --with-jemalloc

Same issue for me. You have to remove libcmalloc out of your build
environment to get this done.

Stefan

> It seems it is building osd/mon/mds/rgw with jemalloc enabled..
>
> root@emsnode10:~/ceph-latest/src# ldd ./ceph-osd
>         linux-vdso.so.1 => (0x7ffd0eb43000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f5f92d70000)
>         ...
>
> root@emsnode10:~/ceph-latest/src/.libs# ldd ./librados.so.2.0.0
>         linux-vdso.so.1 => (0x7ffed46f2000)
>         libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x7ff687887000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7ff68763d000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7ff687438000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7ff68721a000)
>         libnss3.so => /usr/lib/x86_64-linux-gnu/libnss3.so (0x7ff686ee0000)
>         libsmime3.so => /usr/lib/x86_64-linux-gnu/libsmime3.so (0x7ff686cb3000)
>         libnspr4.so => /usr/lib/x86_64-linux-gnu/libnspr4.so (0x7ff686a76000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7ff686871000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7ff686668000)
>         libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x7ff686464000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7ff686160000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7ff685e59000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7ff685a94000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7ff68587e000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7ff685663000)
>         liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7ff68545c000)
>         liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7ff685255000)
>         /lib64/ld-linux-x86-64.so.2 (0x7ff68a0f6000)
>         libnssutil3.so => /usr/lib/x86_64-linux-gnu/libnssutil3.so (0x7ff685029000)
>         libplc4.so => /usr/lib/x86_64-linux-gnu/libplc4.so (0x7ff684e24000)
>         libplds4.so => /usr/lib/x86_64-linux-gnu/libplds4.so (0x7ff684c20000)
>
> It is building with libcmalloc always...
> Did you change the ceph makefiles to build librbd/librados with jemalloc?
>
> Thanks & Regards
> Somnath

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER
Sent: Wednesday, August 19, 2015 7:01 AM
To: Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

Thanks Mark,

Results match exactly what I have seen with tcmalloc 2.1 vs 2.4 vs jemalloc.
And indeed tcmalloc, even with a bigger cache, seems to decrease over time.

What is funny is that I see exactly the same behaviour on the client librbd
side, with qemu and multiple iothreads. Switching both server and client to
jemalloc gives me the best performance on small reads currently.

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: ceph-devel ceph-devel@vger.kernel.org
Sent: Wednesday, August 19, 2015 06:45:36
Subject: Ceph Hackathon: More Memory Allocator Testing

Hi Everyone,

One of the goals at the Ceph Hackathon last week was to examine how to
improve Ceph small IO performance. Jian Zhang presented findings showing a
dramatic improvement in small random IO performance when Ceph is used with
jemalloc. His results build upon Sandisk's original findings that the
default thread cache values are a major bottleneck in TCMalloc 2.1. To
further verify these results, we sat down at the Hackathon and configured
the new performance test cluster that Intel generously donated to the Ceph
community laboratory to run through a variety of tests with different
memory allocator configurations.
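A quick way to check which allocator a freshly built binary or library
actually pulls in (the same ldd approach used above, just filtered) is:

# ldd src/.libs/librados.so.2.0.0 | grep -E 'tcmalloc|jemalloc'

No output means neither allocator is linked and plain glibc malloc is used.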
Re: Ceph Hackathon: More Memory Allocator Testing
On 19.08.2015 at 22:16, Somnath Roy wrote:
> Alexandre,
> I am not able to build librados/librbd by using the following config option:
>
> ./configure --without-tcmalloc --with-jemalloc

Same issue for me. You have to remove libcmalloc out of your build
environment to get this done.

Stefan

> It seems it is building osd/mon/mds/rgw with jemalloc enabled..
>
> root@emsnode10:~/ceph-latest/src# ldd ./ceph-osd
>         linux-vdso.so.1 => (0x7ffd0eb43000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f5f92d70000)
>         ...
>
> root@emsnode10:~/ceph-latest/src/.libs# ldd ./librados.so.2.0.0
>         linux-vdso.so.1 => (0x7ffed46f2000)
>         libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x7ff687887000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7ff68763d000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7ff687438000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7ff68721a000)
>         libnss3.so => /usr/lib/x86_64-linux-gnu/libnss3.so (0x7ff686ee0000)
>         libsmime3.so => /usr/lib/x86_64-linux-gnu/libsmime3.so (0x7ff686cb3000)
>         libnspr4.so => /usr/lib/x86_64-linux-gnu/libnspr4.so (0x7ff686a76000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7ff686871000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7ff686668000)
>         libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x7ff686464000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7ff686160000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7ff685e59000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7ff685a94000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7ff68587e000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7ff685663000)
>         liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7ff68545c000)
>         liburcu-cds.so.1 => /usr/lib/liburcu-cds.so.1 (0x7ff685255000)
>         /lib64/ld-linux-x86-64.so.2 (0x7ff68a0f6000)
>         libnssutil3.so => /usr/lib/x86_64-linux-gnu/libnssutil3.so (0x7ff685029000)
>         libplc4.so => /usr/lib/x86_64-linux-gnu/libplc4.so (0x7ff684e24000)
>         libplds4.so => /usr/lib/x86_64-linux-gnu/libplds4.so (0x7ff684c20000)
>
> It is building with libcmalloc always...
> Did you change the ceph makefiles to build librbd/librados with jemalloc?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER
> Sent: Wednesday, August 19, 2015 7:01 AM
> To: Mark Nelson
> Cc: ceph-devel
> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
>
> Thanks Mark,
>
> Results match exactly what I have seen with tcmalloc 2.1 vs 2.4 vs jemalloc.
> And indeed tcmalloc, even with a bigger cache, seems to decrease over time.
>
> What is funny is that I see exactly the same behaviour on the client librbd
> side, with qemu and multiple iothreads. Switching both server and client to
> jemalloc gives me the best performance on small reads currently.
>
> ----- Original Message -----
> From: Mark Nelson mnel...@redhat.com
> To: ceph-devel ceph-devel@vger.kernel.org
> Sent: Wednesday, August 19, 2015 06:45:36
> Subject: Ceph Hackathon: More Memory Allocator Testing
>
> Hi Everyone,
>
> One of the goals at the Ceph Hackathon last week was to examine how to
> improve Ceph small IO performance. Jian Zhang presented findings showing a
> dramatic improvement in small random IO performance when Ceph is used with
> jemalloc. His results build upon Sandisk's original findings that the
> default thread cache values are a major bottleneck in TCMalloc 2.1.
>
> To further verify these results, we sat down at the Hackathon and
> configured the new performance test cluster that Intel generously donated
> to the Ceph community laboratory to run through a variety of tests with
> different memory allocator configurations. I've since written the results
> of those tests up in pdf form for folks who are interested. The results
> are located here:
>
> http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf
>
> I want to be clear that many other folks have done the heavy lifting here.
> These results are simply a validation of the many tests that other folks
> have already done. Many thanks to Sandisk and others for figuring this out
> as it's a pretty big deal!
>
> Side note: Very little tuning other than swapping the memory allocator and
> a couple of quick and dirty ceph tunables were set during these tests.
> It's quite possible that higher IOPS will be achieved as we really start
> digging into the cluster and learning what the bottlenecks are.
>
> Thanks,
> Mark
Re: Ceph Hackathon: More Memory Allocator Testing
Thanks for sharing. Do those tests use jemalloc for fio too? Otherwise librbd
on the client side is running with tcmalloc again.

Stefan

On 19.08.2015 at 06:45, Mark Nelson wrote:
> Hi Everyone,
>
> One of the goals at the Ceph Hackathon last week was to examine how to
> improve Ceph small IO performance. Jian Zhang presented findings showing a
> dramatic improvement in small random IO performance when Ceph is used with
> jemalloc. His results build upon Sandisk's original findings that the
> default thread cache values are a major bottleneck in TCMalloc 2.1.
>
> To further verify these results, we sat down at the Hackathon and
> configured the new performance test cluster that Intel generously donated
> to the Ceph community laboratory to run through a variety of tests with
> different memory allocator configurations. I've since written the results
> of those tests up in pdf form for folks who are interested. The results
> are located here:
>
> http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf
>
> I want to be clear that many other folks have done the heavy lifting here.
> These results are simply a validation of the many tests that other folks
> have already done. Many thanks to Sandisk and others for figuring this out
> as it's a pretty big deal!
>
> Side note: Very little tuning other than swapping the memory allocator and
> a couple of quick and dirty ceph tunables were set during these tests.
> It's quite possible that higher IOPS will be achieved as we really start
> digging into the cluster and learning what the bottlenecks are.
>
> Thanks,
> Mark
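One way to answer Stefan's question on the client side - without rebuilding
fio - is to preload jemalloc for the benchmark process (library path taken
from the ldd output earlier in this thread; the job file name is illustrative):

# LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 fio rbd-test.fio

With the preload in place, librbd inside fio allocates through jemalloc even
though the fio binary itself was linked without it.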
Re: [ceph-users] Is it safe to increase pg number in a production environment
We've done the splitting several times. The most important thing is to run a
ceph version which does not have the linger ops bug. That is: latest dumpling
release, giant and hammer. The latest firefly release still has this bug,
which results in wrong watchers and no working snapshots.

Stefan

On 04.08.2015 at 18:46, Samuel Just wrote:
> It will cause a large amount of data movement. Each new pg after the split
> will relocate. It might be ok if you do it slowly. Experiment on a test
> cluster.
> -Sam
>
> On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote:
>> Hi Cephers,
>>
>> This is a greeting from Jevon. Currently, I'm experiencing an issue which
>> troubles me a lot, so I'm writing to ask for your comments/help/suggestions.
>> More details are provided below.
>>
>> Issue:
>> I set up a cluster having 24 OSDs and created one pool with 1024 placement
>> groups on it for a small startup company. The number 1024 was calculated
>> per the equation 'OSDs * 100 / pool size'. The cluster has been running
>> quite well for a long time. But recently, our monitoring system always
>> complains that some disks' usage exceeds 85%. I log into the system and
>> find out that some disks' usage really is very high, but some are not
>> (less than 60%). Each time the issue happens, I have to manually
>> re-balance the distribution. This is a short-term solution; I'm not
>> willing to do it all the time.
>>
>> Two long-term solutions come to my mind:
>> 1) Ask the customers to expand their clusters by adding more OSDs. But I
>> think they will ask me to explain the reason for the imbalanced data
>> distribution. We've already done some analysis on the environment; we
>> learned that the most imbalanced part in the CRUSH is the mapping between
>> object and pg. The biggest pg has 613 objects, while the smallest pg only
>> has 226 objects.
>> 2) Increase the number of placement groups. It can be of great help for
>> statistically uniform data distribution, but it can also incur significant
>> data movement as PGs are effectively being split. I just cannot do it in
>> our customers' environment before we 100% understand the consequences.
>>
>> So has anyone done this in a production environment? How much does this
>> operation affect the performance of clients? Any comments/help/suggestions
>> will be highly appreciated.
>>
>> --
>> Best Regards
>> Jevon
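For completeness, the split itself is just two pool settings (pool name and
target count here are placeholders); pg_num creates the new PGs, and data
only starts moving once pgp_num is raised to match, so stepping both up in
small increments is one way to spread the load out:

# ceph osd pool set rbd pg_num 2048
# ceph osd pool set rbd pgp_num 2048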
Re: [ceph-users] Is it safe to increase pg number in a production environment
Hi,

On 04.08.2015 at 21:16, Ketor D wrote:
> Hi Stefan,
> Could you describe more about the linger ops bug?
> I'm running Firefly, which as you say still has this bug.

It will be fixed in the next firefly release. It's this one:
http://tracker.ceph.com/issues/9806

Stefan

> Thanks!
>
> On Wed, Aug 5, 2015 at 12:51 AM, Stefan Priebe s.pri...@profihost.ag wrote:
>> We've done the splitting several times. The most important thing is to
>> run a ceph version which does not have the linger ops bug. That is:
>> latest dumpling release, giant and hammer. The latest firefly release
>> still has this bug, which results in wrong watchers and no working
>> snapshots.
>>
>> Stefan
>>
>> On 04.08.2015 at 18:46, Samuel Just wrote:
>>> It will cause a large amount of data movement. Each new pg after the
>>> split will relocate. It might be ok if you do it slowly. Experiment on
>>> a test cluster.
>>> -Sam
>>>
>>> On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 scaleq...@gmail.com wrote:
>>>> Hi Cephers,
>>>>
>>>> This is a greeting from Jevon. Currently, I'm experiencing an issue
>>>> which troubles me a lot, so I'm writing to ask for your
>>>> comments/help/suggestions. More details are provided below.
>>>>
>>>> Issue:
>>>> I set up a cluster having 24 OSDs and created one pool with 1024
>>>> placement groups on it for a small startup company. The number 1024
>>>> was calculated per the equation 'OSDs * 100 / pool size'. The cluster
>>>> has been running quite well for a long time. But recently, our
>>>> monitoring system always complains that some disks' usage exceeds 85%.
>>>> I log into the system and find out that some disks' usage really is
>>>> very high, but some are not (less than 60%). Each time the issue
>>>> happens, I have to manually re-balance the distribution. This is a
>>>> short-term solution; I'm not willing to do it all the time.
>>>>
>>>> Two long-term solutions come to my mind:
>>>> 1) Ask the customers to expand their clusters by adding more OSDs. But
>>>> I think they will ask me to explain the reason for the imbalanced data
>>>> distribution. We've already done some analysis on the environment; we
>>>> learned that the most imbalanced part in the CRUSH is the mapping
>>>> between object and pg. The biggest pg has 613 objects, while the
>>>> smallest pg only has 226 objects.
>>>> 2) Increase the number of placement groups. It can be of great help
>>>> for statistically uniform data distribution, but it can also incur
>>>> significant data movement as PGs are effectively being split. I just
>>>> cannot do it in our customers' environment before we 100% understand
>>>> the consequences.
>>>>
>>>> So has anyone done this in a production environment? How much does
>>>> this operation affect the performance of clients? Any
>>>> comments/help/suggestions will be highly appreciated.
>>>>
>>>> --
>>>> Best Regards
>>>> Jevon
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 22:50, Josh Durgin wrote:
> Yes, I'm afraid it sounds like it is. You can double check whether the
> watch exists on an image by getting the id of the image from
> 'rbd info $pool/$image | grep block_name_prefix':
>
>     block_name_prefix: rbd_data.105674b0dc51
>
> The id is the hex number there. Append that to 'rbd_header.' and you have
> the header object name. Check whether it has watchers with:
>
>     rados listwatchers -p $pool rbd_header.105674b0dc51
>
> If that doesn't show any watchers while the image is in use by a vm,
> it's #9806.

Yes, it does not show any watchers.

> I just merged the backport for firefly, so it'll be in 0.80.11. Sorry it
> took so long to get to firefly :(. We'll need to be more vigilant about
> checking non-trivial backports when we're going through all the bugs
> periodically.

That would be really important.

I've seen that this one was already in upstream/firefly-backports. What's
the purpose of that branch?

Greets,
Stefan

> Josh
>
> On 07/21/2015 12:52 PM, Stefan Priebe wrote:
>> So this is really this old bug? http://tracker.ceph.com/issues/9806
>>
>> Stefan
>>
>> On 21.07.2015 at 21:46, Josh Durgin wrote:
>>> On 07/21/2015 12:22 PM, Stefan Priebe wrote:
>>>> On 21.07.2015 at 19:19, Jason Dillaman wrote:
>>>>> Does this still occur if you export the images to the console (i.e.
>>>>> rbd export cephstor/disk-116@snap - > dump_file)?
>>>>>
>>>>> Would it be possible for you to provide logs from the two rbd export
>>>>> runs on your smallest VM image? If so, please add the following to
>>>>> the [client] section of your ceph.conf:
>>>>>
>>>>> log file = /valid/path/to/logs/$name.$pid.log
>>>>> debug rbd = 20
>>>>>
>>>>> I opened a ticket [1] where you can attach the logs (if they aren't
>>>>> too large).
>>>>>
>>>>> [1] http://tracker.ceph.com/issues/12422
>>>>
>>>> Will post some more details to the tracker in a few hours. It seems it
>>>> is related to using discard inside the guest, but not on the FS the
>>>> osd is on.
>>>
>>> That sounds very odd. Could you verify via 'rados listwatchers' on an
>>> in-use rbd image's header object that there's still a watch established?
>>>
>>> Have you increased pgs in all those clusters recently?
>>>
>>> Josh
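Condensed, the whole watcher check is just two commands (pool and image
names are placeholders, and the header id comes from the first command's
output); an empty listwatchers result for an in-use image indicates the bug:

# rbd info rbd/disk-116 | grep block_name_prefix
# rados listwatchers -p rbd rbd_header.<id from above>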
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 16:32, Jason Dillaman wrote:
> Any chance that the snapshot was just created prior to the first export
> and you have a process actively writing to the image?

Sadly not. I executed those commands exactly as I posted them, manually at a
bash prompt. I can reproduce this on 5 different ceph clusters with 500 VMs
each.

Stefan
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 21:46, Josh Durgin wrote:
> On 07/21/2015 12:22 PM, Stefan Priebe wrote:
>> On 21.07.2015 at 19:19, Jason Dillaman wrote:
>>> Does this still occur if you export the images to the console (i.e.
>>> rbd export cephstor/disk-116@snap - > dump_file)?
>>>
>>> Would it be possible for you to provide logs from the two rbd export
>>> runs on your smallest VM image? If so, please add the following to the
>>> [client] section of your ceph.conf:
>>>
>>> log file = /valid/path/to/logs/$name.$pid.log
>>> debug rbd = 20
>>>
>>> I opened a ticket [1] where you can attach the logs (if they aren't too
>>> large).
>>>
>>> [1] http://tracker.ceph.com/issues/12422
>>
>> Will post some more details to the tracker in a few hours. It seems it
>> is related to using discard inside the guest, but not on the FS the osd
>> is on.
>
> That sounds very odd. Could you verify via 'rados listwatchers' on an
> in-use rbd image's header object that there's still a watch established?

How can I do this exactly?

> Have you increased pgs in all those clusters recently?

Yes, I bumped from 2048 to 4096 as I doubled the osds.

Stefan

> Josh
Re: upstream/firefly exporting the same snap 2 times results in different exports
So this is really this old bug? http://tracker.ceph.com/issues/9806

Stefan

On 21.07.2015 at 21:46, Josh Durgin wrote:
> On 07/21/2015 12:22 PM, Stefan Priebe wrote:
>> On 21.07.2015 at 19:19, Jason Dillaman wrote:
>>> Does this still occur if you export the images to the console (i.e.
>>> rbd export cephstor/disk-116@snap - > dump_file)?
>>>
>>> Would it be possible for you to provide logs from the two rbd export
>>> runs on your smallest VM image? If so, please add the following to the
>>> [client] section of your ceph.conf:
>>>
>>> log file = /valid/path/to/logs/$name.$pid.log
>>> debug rbd = 20
>>>
>>> I opened a ticket [1] where you can attach the logs (if they aren't too
>>> large).
>>>
>>> [1] http://tracker.ceph.com/issues/12422
>>
>> Will post some more details to the tracker in a few hours. It seems it
>> is related to using discard inside the guest, but not on the FS the osd
>> is on.
>
> That sounds very odd. Could you verify via 'rados listwatchers' on an
> in-use rbd image's header object that there's still a watch established?
>
> Have you increased pgs in all those clusters recently?
>
> Josh
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 19:19, Jason Dillaman wrote:
> Does this still occur if you export the images to the console (i.e.
> rbd export cephstor/disk-116@snap - > dump_file)?
>
> Would it be possible for you to provide logs from the two rbd export runs
> on your smallest VM image? If so, please add the following to the [client]
> section of your ceph.conf:
>
> log file = /valid/path/to/logs/$name.$pid.log
> debug rbd = 20
>
> I opened a ticket [1] where you can attach the logs (if they aren't too
> large).
>
> [1] http://tracker.ceph.com/issues/12422

Will post some more details to the tracker in a few hours. It seems it is
related to using discard inside the guest, but not on the FS the osd is on.

Stefan
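To rule out anything in the file-writing path, the console-export variant
Jason suggests can be hashed directly (image and snapshot names as in the
original report):

# rbd export cephstor/disk-116@snap - | md5sum

Running this twice against an unchanged snapshot should print the same
checksum both times.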
upstream/firefly exporting the same snap 2 times results in different exports
Hi,

I remember there was a bug in ceph before - not sure in which release -
where exporting the same rbd snap multiple times results in different raw
images. Currently running upstream/firefly, I'm seeing the same again:

# rbd export cephstor/disk-116@snap dump1
# sleep 10
# rbd export cephstor/disk-116@snap dump2
# md5sum -b dump*
b89198f118de59b3aa832db1bfddaf8f *dump1
f63ed9345ac2d5898483531e473772b1 *dump2

Can anybody help?

Greets,
Stefan
Re: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
On 07.07.2015 at 12:55, Shishir Gowda wrote:
> Hi Stefan,
> I tried with hammer, and with the google perf devel tools installed, and
> it still worked as expected.
> You can check in the .../ceph/src/ceph-osd directory to confirm you are
> checking the right binaries.

Strange - under Debian Wheezy it always links to tcmalloc. I've now removed
googleperftools and it works fine.

Stefan

> With regards,
> Shishir
>
> -----Original Message-----
> From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
> Sent: Tuesday, July 07, 2015 2:48 PM
> To: Shishir Gowda; ceph-devel@vger.kernel.org
> Subject: Re: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
>
> On 07.07.2015 at 09:56, Shishir Gowda wrote:
>> Hi Stefan,
>> I built it with ./configure --without-tcmalloc and --with-jemalloc, and
>> the resulting binaries are not linked with tcmalloc.
>
> It works for me if I remove the google perftools dev package. But if it
> is installed, hammer always builds against tcmalloc.
>
>> ldd src/ceph-osd
>>         linux-vdso.so.1 => (0x7fff2a5fe000)
>>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f99d1c7b000)
>>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7f99d1a79000)
>>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7f99d182b000)
>>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7f99d15dc000)
>>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f99d13be000)
>>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7f99d0cc1000)
>>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7f99d0abc000)
>>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f99d08b8000)
>>         libboost_thread.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0 (0x7f99d06a1000)
>>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f99d0499000)
>>         libboost_system.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x7f99d0295000)
>>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f99cff90000)
>>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f99cfc8a000)
>>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f99cfa74000)
>>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f99cf6ae000)
>>         /lib64/ld-linux-x86-64.so.2 (0x7f99d1ec2000)
>>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7f99cf4a8000)
>>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7f99cf28e000)
>>         liburcu-bp.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-bp.so.2 (0x7f99cf086000)
>>         liburcu-cds.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-cds.so.2 (0x7f99cee7f000)
>>
>> I tried it with upstream master; what branch are you using?
>>
>> With regards,
>> Shishir
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Stefan Priebe
> Sent: Friday, July 03, 2015 2:45 PM
> To: ceph-devel@vger.kernel.org
> Subject: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
>
> Hi,
>
> I'm trying to compile current hammer with jemalloc:
>
> configure .. --without-tcmalloc --with-jemalloc
>
> but the resulting ceph-osd is still linked against tcmalloc:
>
> ldd /usr/bin/ceph-osd
>         linux-vdso.so.1 => (0x7fffbf3b9000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7fc44bc25000)
>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7fc44ba23000)
>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7fc44b7d2000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc44b5b6000)
>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fc44aea8000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fc44aca2000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fc44aa9e000)
>         libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fc44aa81000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fc44a878000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fc44a571000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fc44a2ef000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fc44a0d8000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fc449d4d000)
>         /lib64/ld-linux-x86-64.so.2 (0x7fc44be65000)
>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7fc449b47000)
>         libtcmalloc.so.4 => /usr/lib/libtcmalloc.so.4 (0x7fc4498d4000)
>         libunwind.so.7 => /usr/lib/libunwind.so.7 (0x7fc4496bb000)
>
> Stefan
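If the goal is just to get tcmalloc out of the way on Debian/Ubuntu, the dev
package that drags it into the configure checks should be the google
perftools one - most likely named libgoogle-perftools-dev (an assumption
about the packaging; verify with 'apt-cache search perftools' first):

# apt-get remove --purge libgoogle-perftools-dev

After that, re-running ./configure --without-tcmalloc --with-jemalloc should
no longer find tcmalloc.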
Re: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
On 07.07.2015 at 09:56, Shishir Gowda wrote:
> Hi Stefan,
> I built it with ./configure --without-tcmalloc and --with-jemalloc, and
> the resulting binaries are not linked with tcmalloc.

It works for me if I remove the google perftools dev package. But if it is
installed, hammer always builds against tcmalloc.

Stefan

> ldd src/ceph-osd
>         linux-vdso.so.1 => (0x7fff2a5fe000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7f99d1c7b000)
>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7f99d1a79000)
>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7f99d182b000)
>         liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 (0x7f99d15dc000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f99d13be000)
>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7f99d0cc1000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7f99d0abc000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f99d08b8000)
>         libboost_thread.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0 (0x7f99d06a1000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f99d0499000)
>         libboost_system.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x7f99d0295000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f99cff90000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f99cfc8a000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f99cfa74000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f99cf6ae000)
>         /lib64/ld-linux-x86-64.so.2 (0x7f99d1ec2000)
>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7f99cf4a8000)
>         liblttng-ust-tracepoint.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0 (0x7f99cf28e000)
>         liburcu-bp.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-bp.so.2 (0x7f99cf086000)
>         liburcu-cds.so.2 => /usr/lib/x86_64-linux-gnu/liburcu-cds.so.2 (0x7f99cee7f000)
>
> I tried it with upstream master; what branch are you using?
>
> With regards,
> Shishir
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Stefan Priebe
> Sent: Friday, July 03, 2015 2:45 PM
> To: ceph-devel@vger.kernel.org
> Subject: trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
>
> Hi,
>
> I'm trying to compile current hammer with jemalloc:
>
> configure .. --without-tcmalloc --with-jemalloc
>
> but the resulting ceph-osd is still linked against tcmalloc:
>
> ldd /usr/bin/ceph-osd
>         linux-vdso.so.1 => (0x7fffbf3b9000)
>         libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7fc44bc25000)
>         libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7fc44ba23000)
>         libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7fc44b7d2000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc44b5b6000)
>         libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fc44aea8000)
>         libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fc44aca2000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fc44aa9e000)
>         libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fc44aa81000)
>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fc44a878000)
>         libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fc44a571000)
>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fc44a2ef000)
>         libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fc44a0d8000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fc449d4d000)
>         /lib64/ld-linux-x86-64.so.2 (0x7fc44be65000)
>         libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7fc449b47000)
>         libtcmalloc.so.4 => /usr/lib/libtcmalloc.so.4 (0x7fc4498d4000)
>         libunwind.so.7 => /usr/lib/libunwind.so.7 (0x7fc4496bb000)
>
> Stefan
trying to compile with-jemalloc but ceph-osd is still linked to libtcmalloc
Hi,

I'm trying to compile current hammer with jemalloc:

configure .. --without-tcmalloc --with-jemalloc

but the resulting ceph-osd is still linked against tcmalloc:

ldd /usr/bin/ceph-osd
        linux-vdso.so.1 => (0x7fffbf3b9000)
        libjemalloc.so.1 => /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (0x7fc44bc25000)
        libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x7fc44ba23000)
        libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x7fc44b7d2000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fc44b5b6000)
        libcrypto++.so.9 => /usr/lib/libcrypto++.so.9 (0x7fc44aea8000)
        libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x7fc44aca2000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fc44aa9e000)
        libboost_thread.so.1.49.0 => /usr/lib/libboost_thread.so.1.49.0 (0x7fc44aa81000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fc44a878000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7fc44a571000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fc44a2ef000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7fc44a0d8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fc449d4d000)
        /lib64/ld-linux-x86-64.so.2 (0x7fc44be65000)
        libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x7fc449b47000)
        libtcmalloc.so.4 => /usr/lib/libtcmalloc.so.4 (0x7fc4498d4000)
        libunwind.so.7 => /usr/lib/libunwind.so.7 (0x7fc4496bb000)

Stefan
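When this happens, the autoconf log usually shows where tcmalloc sneaked
back in; a quick way to inspect it (a standard config.log from autoconf,
nothing ceph-specific) is:

# grep -i -B1 -A2 tcmalloc config.log

Together with the filtered ldd check, this helps tell whether configure
linked tcmalloc directly or whether it may have come in indirectly through
another library (note libleveldb is also in the dependency list above).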
Re: rbd_cache, limiting read on high iops around 40k
On 22.06.2015 at 09:08, Alexandre DERUMIER aderum...@odiso.com wrote:
>> Just an update, there seems to be no proper way to pass the iothread
>> parameter from openstack-nova (at least not in the Juno release). So a
>> default single iothread per VM is all we have. So in conclusion, a nova
>> instance's max iops on ceph rbd will be limited to 30-40K.
>
> Thanks for the update.
>
> For proxmox users, I have added the iothread option to the gui for
> proxmox 4.0

Can we make iothread the default? Does it also help for single disks or only
for multiple disks?

> and added jemalloc as the default memory allocator
>
> I have also sent a jemalloc patch to the qemu-devel mailing list:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05265.html
> (Help is welcome to push it into qemu upstream!)

----- Original Message -----
From: pushpesh sharma pushpesh@gmail.com
To: aderumier aderum...@odiso.com
Cc: Somnath Roy somnath@sandisk.com, Irek Fasikhov malm...@gmail.com, ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com
Sent: Monday, June 22, 2015 07:58:47
Subject: Re: rbd_cache, limiting read on high iops around 40k

Just an update, there seems to be no proper way to pass the iothread
parameter from openstack-nova (at least not in the Juno release). So a
default single iothread per VM is all we have. So in conclusion, a nova
instance's max iops on ceph rbd will be limited to 30-40K.

On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
> Hi,
>
> some news about qemu with tcmalloc vs jemalloc.
>
> I'm testing with multiple disks (with iothreads) in 1 qemu guest. And
> while tcmalloc is a little faster than jemalloc, I have hit the
> tcmalloc::ThreadCache::ReleaseToCentralCache bug a lot of times.
> Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help.
>
> With multiple disks, I'm around 200k iops with tcmalloc (before hitting
> the bug) and 350k iops with jemalloc. The problem is that when I hit the
> malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to
> restart qemu...

----- Original Message -----
From: pushpesh sharma pushpesh@gmail.com
To: aderumier aderum...@odiso.com
Cc: Somnath Roy somnath@sandisk.com, Irek Fasikhov malm...@gmail.com, ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com
Sent: Friday, June 12, 2015 08:58:21
Subject: Re: rbd_cache, limiting read on high iops around 40k

Thanks, posted the question in the openstack list. Hopefully I will get some
expert opinion.

On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER aderum...@odiso.com wrote:
> Hi,
>
> here is a libvirt xml sample from the libvirt src (you need to define the
> iothreads number, then assign them in the disks). I don't use openstack,
> so I really don't know how it works with it.
>
> <domain type='qemu'>
>   <name>QEMUGuest1</name>
>   <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>   <memory unit='KiB'>219136</memory>
>   <currentMemory unit='KiB'>219136</currentMemory>
>   <vcpu placement='static'>2</vcpu>
>   <iothreads>2</iothreads>
>   <os>
>     <type arch='i686' machine='pc'>hvm</type>
>     <boot dev='hd'/>
>   </os>
>   <clock offset='utc'/>
>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>restart</on_reboot>
>   <on_crash>destroy</on_crash>
>   <devices>
>     <emulator>/usr/bin/qemu</emulator>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' iothread='1'/>
>       <source file='/var/lib/libvirt/images/iothrtest1.img'/>
>       <target dev='vdb' bus='virtio'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>     </disk>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' iothread='2'/>
>       <source file='/var/lib/libvirt/images/iothrtest2.img'/>
>       <target dev='vdc' bus='virtio'/>
>     </disk>
>     <controller type='usb' index='0'/>
>     <controller type='ide' index='0'/>
>     <controller type='pci' index='0' model='pci-root'/>
>     <memballoon model='none'/>
>   </devices>
> </domain>

----- Original Message -----
From: pushpesh sharma pushpesh@gmail.com
To: aderumier aderum...@odiso.com
Cc: Somnath Roy somnath@sandisk.com, Irek Fasikhov malm...@gmail.com, ceph-devel ceph-devel@vger.kernel.org, ceph-users ceph-us...@lists.ceph.com
Sent: Friday, June 12, 2015 07:52:41
Subject: Re: rbd_cache, limiting read on high iops around 40k

Hi Alexandre,

I agree with your rationale of one iothread per disk. CPU consumed in IOwait
is pretty high in each VM. But I am not finding a way to set the same on a
nova instance. I am using openstack Juno with QEMU+KVM. As per the libvirt
documentation for setting iothreads, I can edit the domain xml directly and
achieve the same effect. However, in an openstack env the domain xml is
created by nova with some additional metadata, so editing the domain xml
using 'virsh edit' does not seem to work (I agree it is not a very cloud way
of doing things, but a hack). Changes made there vanish after saving them,
because libvirt validation fails on the same.

# virsh dumpxml instance-000000c5 > vm.xml
# virt-xml-validate vm.xml
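As an aside, the tcmalloc thread-cache increase Alexandre mentions is done
through an environment variable that libtcmalloc reads at process start;
the value shown here (128 MB) is purely illustrative:

# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 qemu-system-x86_64 ...

This is the same knob behind the "default thread cache values" discussion
in the memory allocator threads above.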
Re: Memstore performance improvements v0.90 vs v0.87
On 20.02.2015 at 17:03, Alexandre DERUMIER wrote:
>> http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/
>> It's possible that this could be having an effect on the results.
>
> Isn't auto numa balancing enabled by default since kernel 3.8?
>
> It can be checked with
>
> cat /proc/sys/kernel/numa_balancing

I have it disabled in the kernel due to many libc memory allocation failures
when enabled.

Stefan

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: Blair Bethwaite blair.bethwa...@gmail.com, James Page james.p...@ubuntu.com
Cc: ceph-devel ceph-devel@vger.kernel.org, Stephen L Blinick stephen.l.blin...@intel.com, Jay Vosburgh jay.vosbu...@canonical.com, Colin Ian King colin.k...@canonical.com, Patricia Gaughen patricia.gaug...@canonical.com, Leann Ogasawara leann.ogasaw...@canonical.com
Sent: Friday, February 20, 2015 16:38:02
Subject: Re: Memstore performance improvements v0.90 vs v0.87

I think paying attention to NUMA is good advice. One of the things that
apparently changed in RHEL7 is that they are now doing automatic NUMA tuning:

http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/

It's possible that this could be having an effect on the results.

Mark

On 02/20/2015 03:49 AM, Blair Bethwaite wrote:
> Hi James,
>
> Interesting results, but did you do any tests with a NUMA system? IIUC the
> original report was from a dual socket setup, and that'd presumably be the
> standard setup for most folks (both OSD server and client side).
>
> Cheers,
>
> On 20 February 2015 at 20:07, James Page james.p...@ubuntu.com wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Hi All
>>
>> The Ubuntu Kernel team have spent the last few weeks investigating the
>> apparent performance disparity between RHEL 7 and Ubuntu 14.04; we've
>> focussed efforts in a few ways (see below). All testing has been done
>> using the latest Firefly release.
>>
>> 1) Base network latency
>>
>> Jay Vosburgh looked at the base network latencies between RHEL 7 and
>> Ubuntu 14.04; under a default install, RHEL actually had slightly worse
>> latency than Ubuntu due to the default enablement of a firewall;
>> disabling this brought latency back inline between the two distributions:
>>
>> OS                    rtt min/avg/max/mdev
>> Ubuntu 14.04 (3.13)   0.013/0.016/0.018/0.005 ms
>> RHEL7 (3.10)          0.010/0.018/0.025/0.005 ms
>>
>> ...base network latency is pretty much the same. This testing was
>> performed on a matched pair of Dell Poweredge R610's, configured with a
>> single 4 core CPU and 8G of RAM.
>>
>> 2) Latency and performance in Ceph using Rados bench
>>
>> Colin King spent a number of days testing and analysing results using
>> rados bench against a single node ceph deployment, configured with a
>> single memory backed OSD, to see if we could reproduce the disparities
>> reported. He ran 120 second OSD benchmarks on RHEL 7 as well as Ubuntu
>> 14.04 LTS with a selection of kernels including 3.10 vanilla, 3.13.0-44
>> (release kernel), 3.16.0-30 (utopic HWE kernel), 3.18.0-12 (vivid HWE
>> kernel) and 3.19-rc6, with 1, 16 and 128 client threads. The data
>> collected is available at [0]. Each round of tests consisted of 15 runs,
>> from which we computed average latency, latency deviation and latency
>> distribution:
>>
>> 120 second x 1 thread
>>
>> Results all seem to cluster around 0.04-0.05ms, with RHEL 7 averaging
>> 0.044ms and recent Ubuntu kernels 0.036-0.037ms. The older 3.10 kernel
>> in RHEL 7 does have slightly higher average latency.
>>
>> 120 second x 16 threads
>>
>> Results all seem to cluster around 0.6-0.7ms. 3.19.0-rc6 had a couple of
>> 1.4ms outliers which pushed it out to be worse than RHEL 7.
>> On the whole, Ubuntu 3.10-3.18 kernels are better than RHEL 7 by ~0.1ms.
>> RHEL shows a far higher standard deviation, due to the bimodal latency
>> distribution, which to the casual observer may appear to be more jittery.
>>
>> 120 second x 128 threads
>>
>> Later kernels show somewhat less standard deviation than RHEL 7, so
>> perhaps less jitter in the stats than RHEL 7's 3.10 kernel. With this
>> many threads pounding the test, we get a wider spread of latencies and
>> it is hard to tell any kind of latency distribution pattern with just 15
>> rounds because of the large amount of latency jitter. All systems show a
>> latency of ~5ms. Taking into consideration the amount of jitter, we
>> think these results do not make much sense unless we repeat the tests
>> with, say, 100 samples.
>>
>> 3) Conclusion
>>
>> We have not been able to show any major anomalies in Ceph on Ubuntu
>> compared to RHEL 7 when using memstore. Our current hypothesis is that
>> one needs to run the OSD bench stressor many times to get a fair capture
>> of system latency stats. The reasons for this are:
>>
>> * Latencies are very low with memstore, so any small jitter in
>> scheduling etc will show up as a large distortion (as shown by the large
>> standard deviations in the samples).
>>
>> * When memstore is heavily utilized, memory pressure causes the system
>> to page heavily and so we are subject to the nature of perhaps delays on
>> paging that cause some
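For anyone wanting to rule automatic NUMA balancing in or out on their own
test box, the runtime switch is a one-liner (the standard mainline sysctl
path Alexandre quotes above):

# cat /proc/sys/kernel/numa_balancing
1
# echo 0 > /proc/sys/kernel/numa_balancing

or persistently via kernel.numa_balancing=0 in /etc/sysctl.conf.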
Re: speed decrease since firefly,giant,hammer the 2nd try
         [.] __pthread_mutex_unlock_usercnt
  0,56%  ceph-osd            [.] ceph::buffer::list::iterator::advance(int)
  0,44%  ceph-osd            [.] ceph::buffer::ptr::append(char const*, unsigned int)

Stefan

----- Original Message -----
From: Stefan Priebe s.pri...@profihost.ag
To: aderumier aderum...@odiso.com
Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, February 16, 2015 23:08:37
Subject: Re: speed decrease since firefly,giant,hammer the 2nd try

On 16.02.2015 at 23:02, Alexandre DERUMIER aderum...@odiso.com wrote:
>> This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s
>> while running dumpling...
>
> Is it for write only? Or do you see the same decrease for reads too?

Just tested write. This might be the result of the higher CPU load of the
ceph-osd processes under firefly: dumpling ~180% per process vs. firefly
~220%.

Stefan

----- Original Message -----
From: Stefan Priebe s.pri...@profihost.ag
To: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, February 16, 2015 22:22:01
Subject: Re: speed decrease since firefly,giant,hammer the 2nd try

I've now upgraded server side and client side to latest upstream/firefly.
This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while
running dumpling...

Greets,
Stefan

On 15.02.2015 at 19:40, Stefan Priebe wrote:
> Hi Mark,
>
> what's next? I have this test cluster only for 2 more days.
>
> Here are some perf details:
>
> dumpling:
>
>  12,65%  libc-2.13.so        [.] 0x79000
>   2,86%  libc-2.13.so        [.] malloc
>   2,80%  kvm                 [.] 0xb59c5
>   2,59%  libc-2.13.so        [.] free
>   2,35%  [kernel]            [k] __schedule
>   2,16%  [kernel]            [k] _raw_spin_lock
>   1,92%  [kernel]            [k] __switch_to
>   1,58%  [kernel]            [k] lapic_next_deadline
>   1,09%  [kernel]            [k] update_sd_lb_stats
>   1,08%  [kernel]            [k] _raw_spin_lock_irqsave
>   0,91%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
>   0,91%  libpthread-2.13.so  [.] pthread_mutex_trylock
>   0,87%  [kernel]            [k] resched_task
>   0,72%  [kernel]            [k] cpu_startup_entry
>   0,71%  librados.so.2.0.0   [.] crush_hash32_3
>   0,66%  [kernel]            [k] leave_mm
>   0,65%  librados.so.2.0.0   [.] Mutex::Lock(bool)
>   0,64%  [kernel]            [k] idle_cpu
>   0,62%  libpthread-2.13.so  [.] __pthread_mutex_unlock_usercnt
>   0,59%  [kernel]            [k] try_to_wake_up
>   0,56%  [kernel]            [k] wake_futex
>   0,50%  librados.so.2.0.0   [.] ceph::buffer::ptr::release()
>
> firefly:
>
>  12,56%  libc-2.13.so        [.] 0x7905d
>   2,82%  libc-2.13.so        [.] malloc
>   2,64%  libc-2.13.so        [.] free
>   2,61%  kvm                 [.] 0x34322f
>   2,33%  [kernel]            [k] __schedule
>   2,14%  [kernel]            [k] _raw_spin_lock
>   1,83%  [kernel]            [k] __switch_to
>   1,62%  [kernel]            [k] lapic_next_deadline
>   1,17%  [kernel]            [k] _raw_spin_lock_irqsave
>   1,09%  [kernel]            [k] update_sd_lb_stats
>   1,08%  libpthread-2.13.so  [.] pthread_mutex_trylock
>   0,85%  libpthread-2.13.so  [.] __pthread_mutex_unlock_usercnt
>   0,77%  [kernel]            [k] resched_task
>   0,74%  librbd.so.1.0.0     [.] 0x71b73
>   0,72%  librados.so.2.0.0   [.] Mutex::Lock(bool)
>   0,68%  librados.so.2.0.0   [.] crush_hash32_3
>   0,67%  [kernel]            [k] idle_cpu
>   0,65%  [kernel]            [k] leave_mm
>   0,65%  [kernel]            [k] cpu_startup_entry
>   0,59%  [kernel]            [k] try_to_wake_up
>   0,51%  librados.so.2.0.0   [.] ceph::buffer::ptr::release()
>   0,51%  [kernel]            [k] wake_futex
>
> Stefan
>
> On 11.02.2015 at 06:42, Stefan Priebe wrote:
>> On 11.02.2015 at 05:45, Mark Nelson wrote:
>>> On 02/10/2015 04:18 PM, Stefan Priebe wrote:
>>>> On 10.02.2015 at 22:38, Mark Nelson wrote:
>>>>> On 02/10/2015 03:11 PM, Stefan Priebe wrote:
>>>>>> mhm, I installed librbd1-dbg and librados2-dbg - but the output
>>>>>> still looks useless to me. Should I upload it somewhere?
>>>>>
>>>>> Meh, if it's all just symbols it's probably not that helpful.
>>>>>
>>>>> I've summarized your results here:
>>>>>
>>>>> 1 concurrent 4k write (libaio, direct=1, iodepth=1)
>>>>>
>>>>>             IOPS            Latency
>>>>>           wb on   wb off   wb on    wb off
>>>>> dumpling  10870   536      ~100us   ~2ms
>>>>> firefly   10350   525      ~100us   ~2ms
>>>>>
>>>>> So in single op tests dumpling and firefly are far closer. Now let's
>>>>> see each of these cases with iodepth=32 (still 1 thread for now).
>>>>>
>>>>> dumpling:
>>>>>
>>>>> file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
>>>>> 2.0.8
>>>>> Starting 1 thread
>>>>> Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s]
>>>>> file1: (groupid=0, jobs=1): err= 0: pid=3011
>>>>>   write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec
>>>>>     slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30
>>>>>     clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43
>>>>>      lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52
>>>>>     clat percentiles (usec):
>>>>>      |  1.00th=[ 1480],  5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672],
>>>>>      | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832],
>>>>>      | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128],
>>>>>      | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344],
>>>>>      | 99.99th=[ 7072]
>>>>>     bw (KB/s)  : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25
>>>>>     lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01
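For anyone wanting to reproduce these runs, a fio job file matching the
parameters quoted above (libaio, direct=1, 4k randwrite, iodepth=32; the
target device name is illustrative) would look like:

[global]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
runtime=30
time_based

[file1]
filename=/dev/vdb
iodepth=32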
firefly: librbd: reads contending for cache space can cause livelock
Hi,

is there any reason why this one is not merged into firefly yet?

http://tracker.ceph.com/issues/9854
librbd: reads contending for cache space can cause livelock

Stefan
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 16.02.2015 um 17:45 schrieb Alexandre DERUMIER: I also thinked about 1 thing fio-lirbd use the rbd_cache value from ceph.conf. and qemu change the value if cache=none or cache=writeback in qemu conf. So, verify that too. I'm thinked of this old bug with cache http://tracker.ceph.com/issues/9513 It was a bug in giant, but tracker said also dumpling and firefly (but no commit for them) But the original bug was http://tracker.ceph.com/issues/9854 and I'm not sure it's already released No it's not in latest firefly nor in latest dumpling. But it's in latest git for both. But it looks read related not write related - isn't it? Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: aderumier aderum...@odiso.com Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Lundi 16 Février 2015 15:50:56 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, Hi Alexandre, Am 16.02.2015 um 10:11 schrieb Alexandre DERUMIER: Hi Stefan, I could be interesting to see if you have the same speed decrease with fio-librbd on the host, without the qemu layer. the perf reports don't seem to be too much different. do you have the same cpu usage ? (check qemu process usage) the idea to use fio-librbd was very good. I cannot reproduce the behaviour using fio-rbd. I can just reproduce it with qemu. Very strange. So please ignore me for the moment. I'll try to dig deeper into it. Greets, Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Dimanche 15 Février 2015 19:40:45 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] 
ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704
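For reference, the cache interaction described above can be pinned down explicitly. A minimal sketch, assuming an image rbd/testvm (pool and image name are illustrative only): fio-librbd reads rbd_cache from the [client] section of ceph.conf, while qemu sets the value itself according to the drive's cache mode, overriding the config:

  # ceph.conf - only honoured by clients that read the config, e.g. fio-librbd
  [client]
  rbd cache = true        # equivalent of qemu cache=writeback
  # rbd cache = false     # equivalent of qemu cache=none

  # qemu decides on its own from the -drive cache= setting:
  # -drive file=rbd:rbd/testvm,format=raw,if=virtio,cache=writeback

So when comparing fio-librbd on the host against fio inside qemu, both places have to be checked to be sure the same cache mode is actually in effect.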
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 16.02.2015 um 16:36 schrieb Alexandre DERUMIER: What is you fio command line ? fio rbd or fio under qemu? do you test with numjobs 1 ? Both. (I think under qemu, you can use any numjobs value, as it's use only 1 thread, is equal to numjobs=1) numjobs 1 gives also under qemu better results. Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: aderumier aderum...@odiso.com Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Lundi 16 Février 2015 15:50:56 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, Hi Alexandre, Am 16.02.2015 um 10:11 schrieb Alexandre DERUMIER: Hi Stefan, I could be interesting to see if you have the same speed decrease with fio-librbd on the host, without the qemu layer. the perf reports don't seem to be too much different. do you have the same cpu usage ? (check qemu process usage) the idea to use fio-librbd was very good. I cannot reproduce the behaviour using fio-rbd. I can just reproduce it with qemu. Very strange. So please ignore me for the moment. I'll try to dig deeper into it. Greets, Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Dimanche 15 Février 2015 19:40:45 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. 
I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0
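To make these runs reproducible, a fio job file along these lines should match the parameters used throughout this thread (the device name is just an example for the rbd-backed disk inside the guest):

  [global]
  ioengine=libaio
  direct=1
  bs=4k
  rw=randwrite
  runtime=30
  time_based

  [file1]
  filename=/dev/sdb
  iodepth=32      # use iodepth=1 for the single-op latency case
  numjobs=1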
Re: speed decrease since firefly,giant,hammer the 2nd try
I've now upgraded server side and client side to latest upstream/firefly. This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while running dumpling... Greets, Stefan Am 15.02.2015 um 19:40 schrieb Stefan Priebe: Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0[.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0[.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0[.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0[.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0[.] Mutex::Lock(bool) 0,68% librados.so.2.0.0[.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0[.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 16.02.2015 um 23:02 schrieb Alexandre DERUMIER aderum...@odiso.com: This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while running dumpling... Is it for write only ? or do you see same decrease for read too Just tested write. This might be the result of higher CPU load of the ceph-osd processes under firefly. Dumpling 180% per process vs. firefly 220% Stefan ? - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Lundi 16 Février 2015 22:22:01 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try I've now upgraded server side and client side to latest upstream/firefly. This results in fio-rbd showing avg 26000 iop/s instead of 30500 iop/s while running dumpling... Greets, Stefan Am 15.02.2015 um 19:40 schrieb Stefan Priebe: Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued : total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s
Re: speed decrease since firefly,giant,hammer the 2nd try
Hi Mark, Hi Alexandre, Am 16.02.2015 um 10:11 schrieb Alexandre DERUMIER: Hi Stefan, I could be interesting to see if you have the same speed decrease with fio-librbd on the host, without the qemu layer. the perf reports don't seem to be too much different. do you have the same cpu usage ? (check qemu process usage) the idea to use fio-librbd was very good. I cannot reproduce the behaviour using fio-rbd. I can just reproduce it with qemu. Very strange. So please ignore me for the moment. I'll try to dig deeper into it. Greets, Stefan - Mail original - De: Stefan Priebe s.pri...@profihost.ag À: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Dimanche 15 Février 2015 19:40:45 Objet: Re: speed decrease since firefly,giant,hammer the 2nd try Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0 [.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0 [.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0 [.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0 [.] Mutex::Lock(bool) 0,68% librados.so.2.0.0 [.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0 [.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPS Latency wb on wb off wb on wb off dumpling 10870 536 ~100us ~2ms firefly 10350 525 ~100us ~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued : total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079
Re: speed decrease since firefly,giant,hammer the 2nd try
Hi Mark, what's next? I've this test cluster only for 2 more days. Here some perf Details: dumpling: 12,65% libc-2.13.so [.] 0x79000 2,86% libc-2.13.so [.] malloc 2,80% kvm [.] 0xb59c5 2,59% libc-2.13.so [.] free 2,35% [kernel] [k] __schedule 2,16% [kernel] [k] _raw_spin_lock 1,92% [kernel] [k] __switch_to 1,58% [kernel] [k] lapic_next_deadline 1,09% [kernel] [k] update_sd_lb_stats 1,08% [kernel] [k] _raw_spin_lock_irqsave 0,91% librados.so.2.0.0[.] ceph_crc32c_le_intel 0,91% libpthread-2.13.so [.] pthread_mutex_trylock 0,87% [kernel] [k] resched_task 0,72% [kernel] [k] cpu_startup_entry 0,71% librados.so.2.0.0[.] crush_hash32_3 0,66% [kernel] [k] leave_mm 0,65% librados.so.2.0.0[.] Mutex::Lock(bool) 0,64% [kernel] [k] idle_cpu 0,62% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,59% [kernel] [k] try_to_wake_up 0,56% [kernel] [k] wake_futex 0,50% librados.so.2.0.0[.] ceph::buffer::ptr::release() firefly: 12,56% libc-2.13.so [.] 0x7905d 2,82% libc-2.13.so [.] malloc 2,64% libc-2.13.so [.] free 2,61% kvm [.] 0x34322f 2,33% [kernel] [k] __schedule 2,14% [kernel] [k] _raw_spin_lock 1,83% [kernel] [k] __switch_to 1,62% [kernel] [k] lapic_next_deadline 1,17% [kernel] [k] _raw_spin_lock_irqsave 1,09% [kernel] [k] update_sd_lb_stats 1,08% libpthread-2.13.so [.] pthread_mutex_trylock 0,85% libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,77% [kernel] [k] resched_task 0,74% librbd.so.1.0.0 [.] 0x71b73 0,72% librados.so.2.0.0[.] Mutex::Lock(bool) 0,68% librados.so.2.0.0[.] crush_hash32_3 0,67% [kernel] [k] idle_cpu 0,65% [kernel] [k] leave_mm 0,65% [kernel] [k] cpu_startup_entry 0,59% [kernel] [k] try_to_wake_up 0,51% librados.so.2.0.0[.] ceph::buffer::ptr::release() 0,51% [kernel] [k] wake_futex Stefan Am 11.02.2015 um 06:42 schrieb Stefan Priebe: Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). 
dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079, merge=0/0, ticks=24/890120, in_queue=890064, util=98.73% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio
Re: speed decrease since firefly,giant,hammer the 2nd try
On 11.02.2015 at 08:44, Alexandre DERUMIER wrote:
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes. What's wrong with librbd / librados2 since firefly?
Maybe we could bisect this? Maybe testing intermediate librbd releases between dumpling and firefly, http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/ could give us a hint.
Yes, maybe. Sadly I currently have another problem on my newest cluster: strange kworker load I've never noticed before on any ceph system. All writes are hanging - while using the same kernel as everywhere.
Stefan
- Original message -
From: Stefan Priebe s.pri...@profihost.ag
To: ceph-devel ceph-devel@vger.kernel.org
Sent: Tuesday, February 10, 2015 19:55:26
Subject: speed decrease since firefly,giant,hammer the 2nd try
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Greets, Stefan
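A rough sketch of how such a bisect could look, reusing the library-swap procedure from the quoted mail (the ref and package file names are placeholders; the exact gitbuilder directory layout would need to be checked):

  # fetch an intermediate librbd/librados build between dumpling and firefly
  wget http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/SOME_REF/librbd1.deb
  wget http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/SOME_REF/librados2.deb
  dpkg-deb -x librbd1.deb tmp/ && dpkg-deb -x librados2.deb tmp/

  # stop qemu, swap the libraries, start qemu, rerun the same fio job
  cp -ra tmp/usr/lib/librados.so.2.0.0 /usr/lib/
  cp -ra tmp/usr/lib/librbd.so.1.0.0 /usr/lib/

Repeating this while halving the range of refs each time should isolate the commit range where the 23k -> 16k iop/s drop first appears.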
Re: speed decrease since firefly,giant,hammer the 2nd try
On 10.02.2015 at 20:05, Gregory Farnum wrote:
On Tue, Feb 10, 2015 at 10:55 AM, Stefan Priebe s.pri...@profihost.ag wrote:
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
We're all going to have the same questions now as we did last time, about what the cluster looks like, what the perfcounters are reporting on both versions of librados, etc.
I'll try to answer all your questions - not sure how easy this is.
6 nodes, each with:
- Single Intel E5-1650 v3
- 48GB RAM
- 4x 800GB Samsung SSD
- 2x 10Gbit/s bonded storage network
Client side:
- Dual Xeon E5
- 256GB RAM
- 2x 10Gbit/s bonded storage network
Regarding perf counters - I'm willing to run tests. Just tell me how.
Also, please give us the results from Giant rather than Firefly, for the reasons I mentioned previously.
As giant is not a long-term release and we have a support contract, it's not an option for me. Even though I tried hammer git master three days ago - same results.
Stefan
speed decrease since firefly,giant,hammer the 2nd try
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Greets, Stefan
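When swapping libraries like this, it is worth sanity-checking that qemu really picked up the copied version; two simple checks (a sketch - the binary may be named kvm or qemu-system-x86_64 depending on the distribution, and this assumes qemu was built with rbd support linked in):

  # confirm qemu resolves librbd/librados from /usr/lib at all
  ldd $(which qemu-system-x86_64) | grep -E 'librbd|librados'

  # confirm the installed file is really the one that was copied
  md5sum /usr/lib/librbd.so.1.0.0 firefly_0.80.8/usr/lib/librbd.so.1.0.0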
Re: speed decrease since firefly,giant,hammer the 2nd try
On 10.02.2015 at 20:10, Mark Nelson wrote:
On 02/10/2015 12:55 PM, Stefan Priebe wrote:
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Hi Stefan,
Just off the top of my head, some questions to investigate:
What happens to single op latencies?
How to test this?
Does enabling/disabling RBD cache have any effect?
I have it enabled on both through the qemu writeback setting.
How's CPU usage? (Does perf report show anything useful?) Can you get trace data?
I'm not familiar with trace or perf - what should I do exactly?
Stefan
Mark
Greets, Stefan
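As a concrete answer to the "how to test this" question above, a command-line version of the single-op test might look like this (run inside the guest against the rbd-backed device; /dev/sdb is an assumed name):

  fio --name=file1 --filename=/dev/sdb --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based

The clat/lat lines in the output then give the per-operation latency directly, which is what makes a dumpling-vs-firefly comparison meaningful at this depth.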
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 10.02.2015 um 21:36 schrieb Mark Nelson: On 02/10/2015 02:24 PM, Stefan Priebe wrote: Am 10.02.2015 um 20:40 schrieb Mark Nelson: On 02/10/2015 01:13 PM, Stefan Priebe wrote: Am 10.02.2015 um 20:10 schrieb Mark Nelson: On 02/10/2015 12:55 PM, Stefan Priebe wrote: Hello, last year in june i already reported this but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this will be fixed itself when hammer is released. Now i tried hammer an the results are bad as before. Since firefly librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why i still stick to dumpling. I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling. librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes - stopped qemu - cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/ - cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/ - start qemu same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes What's wrong with librbd / librados2 since firefly? Hi Stephen, Just off the top of my head, some questions to investigate: What happens to single op latencies? How to test this? try your random 4k write test using libaio, direct IO, and iodepth=1. Actually it would be interesting to know how it is with higher IO depths as well (I assume this is what you are doing now?) Basically I want to know if single-op latency changes and whether or not it gets hidden or exaggerated with lots of concurrent IO. dumpling: ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/85224K /s] [0 /21.4K iops] [eta 00m:00s] ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/79064K /s] [0 /19.8K iops] [eta 00m:00s] firefly: ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/55781K /s] [0 /15.4K iops] [eta 00m:00s] ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46055K /s] [0 /11.6K iops] [eta 00m:00s] Sorry, please do this with only 1 thread. If you can include the latency results too that would be great. Sorry here again. 
Cache on: dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/42892K /s] [0 /10.8K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3203 write: io=1273.1MB, bw=43483KB/s, iops=10870 , runt= 30001msec slat (usec): min=5 , max=183 , avg= 8.99, stdev= 1.78 clat (usec): min=0 , max=6378 , avg=81.15, stdev=44.09 lat (usec): min=59 , max=6390 , avg=90.35, stdev=44.22 clat percentiles (usec): | 1.00th=[ 59], 5.00th=[ 62], 10.00th=[ 64], 20.00th=[ 66], | 30.00th=[ 69], 40.00th=[ 71], 50.00th=[ 74], 60.00th=[ 80], | 70.00th=[ 87], 80.00th=[ 95], 90.00th=[ 105], 95.00th=[ 114], | 99.00th=[ 135], 99.50th=[ 145], 99.90th=[ 179], 99.95th=[ 237], | 99.99th=[ 2320] bw (KB/s) : min=36176, max=46816, per=99.96%, avg=43465.49, stdev=2169.33 lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.01%, 100=85.24% lat (usec) : 250=14.71%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% cpu : usr=2.95%, sys=12.29%, ctx=329519, majf=0, minf=133 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% issued: total=r=0/w=326130/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=1273.1MB, aggrb=43482KB/s, minb=43482KB/s, maxb=43482KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/325241, merge=0/0, ticks=8/24624, in_queue=24492, util=81.64% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/44588K /s] [0 /11.2K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=2904 write: io=1212.1MB, bw=41401KB/s, iops=10350 , runt= 30001msec slat (usec): min=5 , max=464 , avg= 8.95, stdev= 2.34 clat (usec): min=0 , max=4410 , avg=85.81, stdev=41.82 lat (usec): min=59 , max=4418 , avg=94.96, stdev=41.97 clat percentiles (usec): | 1.00th=[ 59], 5.00th=[ 63], 10.00th=[ 65], 20.00th=[ 68], | 30.00th=[ 72], 40.00th=[ 76], 50.00th=[ 80], 60.00th=[ 85], | 70.00th=[ 94], 80.00th=[ 102], 90.00th=[ 112], 95.00th=[ 122], | 99.00th=[ 145], 99.50th=[ 155], 99.90th=[ 189], 99.95th=[ 239], | 99.99th=[ 2192
Re: speed decrease since firefly,giant,hammer the 2nd try
On 10.02.2015 at 20:40, Mark Nelson wrote:
On 02/10/2015 01:13 PM, Stefan Priebe wrote:
On 10.02.2015 at 20:10, Mark Nelson wrote:
On 02/10/2015 12:55 PM, Stefan Priebe wrote:
Hello,
last year in June I already reported this, but there was no real result. (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041070.html) I then had the hope that this would fix itself once hammer was released. Now I tried hammer and the results are as bad as before. Since firefly, librbd1 / librados2 are 20% slower for 4k random iop/s than dumpling - this is also the reason why I still stick to dumpling.
I've now modified my test again to be a bit more clear. Ceph cluster itself completely dumpling.
librbd1 / librados from dumpling (fio inside qemu): 23k iop/s for random 4k writes
- stopped qemu
- cp -ra firefly_0.80.8/usr/lib/librados.so.2.0.0 /usr/lib/
- cp -ra firefly_0.80.8/usr/lib/librbd.so.1.0.0 /usr/lib/
- start qemu
same fio, same qemu, same vm, same host, same ceph dumpling storage, different librados / librbd: 16k iop/s for random 4k writes
What's wrong with librbd / librados2 since firefly?
Hi Stefan,
Just off the top of my head, some questions to investigate:
What happens to single op latencies?
How to test this?
Try your random 4k write test using libaio, direct IO, and iodepth=1. Actually it would be interesting to know how it is with higher IO depths as well (I assume this is what you are doing now?) Basically I want to know if single-op latency changes and whether or not it gets hidden or exaggerated with lots of concurrent IO.
dumpling:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/85224K /s] [0 /21.4K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/79064K /s] [0 /19.8K iops] [eta 00m:00s]
firefly:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/55781K /s] [0 /15.4K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46055K /s] [0 /11.6K iops] [eta 00m:00s]
Does enabling/disabling RBD cache have any effect?
I have it enabled on both through the qemu writeback setting.
It'd be great if you could do the above test both with WB RBD cache and with it turned off.
Test with cache off:
dumpling:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/85111K /s] [0 /21.3K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/88984K /s] [0 /22.3K iops] [eta 00m:00s]
firefly:
ioengine=libaio and iodepth=32 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46479K /s] [0 /11.7K iops] [eta 00m:00s]
ioengine=libaio and iodepth=1 with 32 threads: Jobs: 32 (f=32): [] [100.0% done] [0K/46019K /s] [0 /11.6K iops] [eta 00m:00s]
How's CPU usage? (Does perf report show anything useful?) Can you get trace data?
I'm not familiar with trace or perf - what should I do exactly?
You may need extra packages. Basically on the VM host, during the test with each library you'd do:
sudo perf record -a -g dwarf -F 99 (ctrl+c after a while)
sudo perf report --stdio > foo.txt
If you are on a kernel that doesn't have libunwind support:
sudo perf record -a -g (ctrl+c after a while)
sudo perf report --stdio > foo.txt
Then look and see what's different. This may not catch anything though.
Don't have unwind. Output is only full of hex values.
Stefan
You should also try Greg's suggestion of looking at the performance counters to see if any interesting differences show up between the runs.
Where / how do I check?
Stefan
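On the "where / how to check" question: the counters Greg refers to are exposed through the ceph admin socket. A sketch, with socket paths that depend on the local setup:

  # on an OSD node: dump the counters of osd.0
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

  # client side (librbd/librados inside qemu): enable an admin socket
  # first in ceph.conf, then run the same "perf dump" against it
  [client]
  admin socket = /var/run/ceph/$cluster-$name.$pid.asok

Dumping the counters once per run under each library version would show whether the extra latency accumulates client side or on the OSDs.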
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 11.02.2015 um 05:45 schrieb Mark Nelson: On 02/10/2015 04:18 PM, Stefan Priebe wrote: Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079, merge=0/0, ticks=24/890120, in_queue=890064, util=98.73% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/69096K /s] [0 /17.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=2982 write: io=1784.9MB, bw=60918KB/s, iops=15229 , runt= 30002msec slat (usec): min=1 , max=1389 , avg= 3.43, stdev= 5.32 clat (usec): min=117 , max=8235 , avg=2096.88, stdev=396.30 lat (usec): min=540 , max=8258 , avg=2100.43, stdev=396.61 clat percentiles (usec): | 1.00th=[ 1608], 5.00th=[ 1720], 10.00th=[ 1768], 20.00th=[ 1832], | 30.00th=[ 1896], 40.00th=[ 1944], 50.00th=[ 2008], 60.00th=[ 2064], | 70.00th=[ 2160], 80.00th=[ 2256], 90.00th=[ 2512], 95.00th=[ 2896], | 99.00th=[ 3600], 99.50th=[ 3792], 99.90th=[ 5088], 99.95th=[ 6304], | 99.99th=[ 6752] bw (KB/s) : min=36717, max=73712, per=99.94%, avg=60879.92, stdev=8302.27 lat (usec) : 250=0.01%, 750=0.01% lat (msec) : 2=48.56%, 4=51.18%, 10=0.26% cpu : usr=2.03%, sys=5.48%, ctx=20440, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=456918/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=1784.9MB, aggrb=60918KB/s, minb=60918KB/s, maxb=60918KB/s, 
mint=30002msec, maxt=30002msec Disk stats (read/write): sdb: ios=166/455574, merge=0/0, ticks=12/897748, in_queue=897696, util=98.96% Ok, so it looks like as you increase concurrency the effect increases (ie contention?). Does the same thing happen without cache enabled? here again without rbd cache: dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/83488K /s] [0 /20.9K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3000 write: io=2449.2MB, bw=83583KB/s, iops=20895 , runt= 30005msec slat (usec): min=1 , max=975 , avg= 4.50, stdev= 5.25 clat (usec): min=364 , max=80566 , avg=1525.87, stdev=1194.57 lat (usec): min=519 , max=80568 , avg=1530.51, stdev=1194.44 clat percentiles (usec): | 1.00th=[ 660], 5.00th=[ 780], 10.00th=[ 876], 20.00th=[ 1032], | 30.00th=[ 1144], 40.00th=[ 1240], 50.00th=[ 1304], 60.00th=[ 1384], | 70.00th=[ 1480], 80.00th=[ 1640], 90.00th=[ 2096], 95.00th=[ 2960], | 99.00th=[ 6816], 99.50th=[ 7840
Re: speed decrease since firefly,giant,hammer the 2nd try
Am 10.02.2015 um 22:38 schrieb Mark Nelson: On 02/10/2015 03:11 PM, Stefan Priebe wrote: mhm i installed librbd1-dbg and librados2-dbg - but the output still looks useless to me. Should i upload it somewhere? Meh, if it's all just symbols it's probably not that helpful. I've summarized your results here: 1 concurrent 4k write (libaio, direct=1, iodepth=1) IOPSLatency wb onwb offwb onwb off dumpling10870536~100us~2ms firefly10350525~100us~2ms So in single op tests dumpling and firefly are far closer. Now let's see each of these cases with iodepth=32 (still 1 thread for now). dumpling: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/72812K /s] [0 /18.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=3011 write: io=2060.6MB, bw=70329KB/s, iops=17582 , runt= 30001msec slat (usec): min=1 , max=3517 , avg= 3.42, stdev= 7.30 clat (usec): min=93 , max=7475 , avg=1815.72, stdev=233.43 lat (usec): min=219 , max=7477 , avg=1819.27, stdev=233.52 clat percentiles (usec): | 1.00th=[ 1480], 5.00th=[ 1576], 10.00th=[ 1608], 20.00th=[ 1672], | 30.00th=[ 1704], 40.00th=[ 1752], 50.00th=[ 1800], 60.00th=[ 1832], | 70.00th=[ 1896], 80.00th=[ 1960], 90.00th=[ 2064], 95.00th=[ 2128], | 99.00th=[ 2352], 99.50th=[ 2448], 99.90th=[ 4704], 99.95th=[ 5344], | 99.99th=[ 7072] bw (KB/s) : min=59696, max=77840, per=100.00%, avg=70351.27, stdev=4783.25 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.53% lat (msec) : 2=85.02%, 4=14.31%, 10=0.13% cpu : usr=1.96%, sys=6.71%, ctx=22791, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=527487/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=2060.6MB, aggrb=70329KB/s, minb=70329KB/s, maxb=70329KB/s, mint=30001msec, maxt=30001msec Disk stats (read/write): sdb: ios=166/526079, merge=0/0, ticks=24/890120, in_queue=890064, util=98.73% firefly: file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 2.0.8 Starting 1 thread Jobs: 1 (f=1): [w] [100.0% done] [0K/69096K /s] [0 /17.3K iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=2982 write: io=1784.9MB, bw=60918KB/s, iops=15229 , runt= 30002msec slat (usec): min=1 , max=1389 , avg= 3.43, stdev= 5.32 clat (usec): min=117 , max=8235 , avg=2096.88, stdev=396.30 lat (usec): min=540 , max=8258 , avg=2100.43, stdev=396.61 clat percentiles (usec): | 1.00th=[ 1608], 5.00th=[ 1720], 10.00th=[ 1768], 20.00th=[ 1832], | 30.00th=[ 1896], 40.00th=[ 1944], 50.00th=[ 2008], 60.00th=[ 2064], | 70.00th=[ 2160], 80.00th=[ 2256], 90.00th=[ 2512], 95.00th=[ 2896], | 99.00th=[ 3600], 99.50th=[ 3792], 99.90th=[ 5088], 99.95th=[ 6304], | 99.99th=[ 6752] bw (KB/s) : min=36717, max=73712, per=99.94%, avg=60879.92, stdev=8302.27 lat (usec) : 250=0.01%, 750=0.01% lat (msec) : 2=48.56%, 4=51.18%, 10=0.26% cpu : usr=2.03%, sys=5.48%, ctx=20440, majf=0, minf=133 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, =64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, =64=0.0% issued: total=r=0/w=456918/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=1784.9MB, aggrb=60918KB/s, minb=60918KB/s, maxb=60918KB/s, mint=30002msec, maxt=30002msec Disk stats (read/write): sdb: ios=166/455574, merge=0/0, 
ticks=12/897748, in_queue=897696, util=98.96% Stefan Mark Stefan
Re: new dev cluster - using giant or hammer git?
On 06.02.2015 at 15:06, Sage Weil s...@newdream.net wrote:
On Fri, 6 Feb 2015, Stefan Priebe - Profihost AG wrote:
Hi, for deploying a new ceph dev cluster, can anybody recommend which git branch to use? hammer or giant-backport?
Hi Stefan! If it's dev I'd recommend hammer. If all of your clients will be new I'd also recommend 'ceph osd crush tunables hammer' as there is a new and improved crush bucket type.
Hi Sage, thanks - is that good for a test cluster too? What would you recommend for something which can crash but where I don't want to lose data?
Stefan
sage
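For completeness, a short sketch of applying and verifying what Sage suggests (only safe once all clients are new enough to understand the profile):

  ceph osd crush tunables hammer    # switch to the new profile
  ceph osd crush show-tunables      # check what the cluster currently uses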
new dev cluster - using giant or hammer git?
Hi, for deploying a new ceph dev cluster, can anybody recommend which git branch to use? hammer or giant-backport?
--
Kind regards
Stefan Priebe
Bachelor of Science in Computer Science (BSCS)
Vorstand (CTO)
---
Profihost AG
Expo Plaza 1
30539 Hannover
Deutschland
Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com
Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
Re: 10 times higher disk load with btrfs
Hi,
On 06.01.2015 at 04:44, Alexandre DERUMIER wrote:
Hi Stefan, do you see a difference if you force filestore journal writeahead for btrfs instead of parallel?
filestore journal writeahead = 1
filestore journal parallel = 0
I already tested filestore btrfs snap = false, which automatically disabled the parallel write.
Stefan
- Original message -
From: Stefan Priebe s.pri...@profihost.ag
To: Mark Nelson mnel...@redhat.com, Sage Weil s...@newdream.net
Cc: ceph-devel ceph-devel@vger.kernel.org
Sent: Monday, January 5, 2015 21:33:22
Subject: Re: 10 times higher disk load with btrfs
On 05.01.2015 at 21:29, Mark Nelson wrote:
On 01/05/2015 02:20 PM, Stefan Priebe wrote:
Hi Sage,
On 05.01.2015 at 20:25, Sage Weil wrote:
On Mon, 5 Jan 2015, Stefan Priebe wrote:
On 05.01.2015 at 19:36, Stefan Priebe wrote:
Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 osds. So if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10?
OK, this seems to happen because ceph is creating a new subvolume / snapshot every 5s. Is this really expected / needed?
You can disable it with filestore btrfs snap = false
I'm curious how much this drops the load down; originally the snaps were no more expensive than a regular sync, but perhaps this has changed...
- with XFS the average write is at 9MB/s
- with btrfs (filestore_btrfs_snap=true) write is at 40MB/s
- with btrfs (filestore_btrfs_snap=false) write is at 20MB/s
Is that the average and not the spikes? It looks like before the spikes were 20MB/s and 190MB/s?
Yes, these are average values. Spikes:
- with XFS the spike write is at 20MB/s
- with btrfs (filestore_btrfs_snap=true) the spike write is 200MB/s
- with btrfs (filestore_btrfs_snap=false) the spike is still 185MB/s but the avg is halved (20MB/s), see above
Stefan
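Put together as a ceph.conf fragment, the two variants discussed here would look like this (a sketch; the OSDs need a restart for the change to take effect):

  [osd]
  # Alexandre's suggestion: force write-ahead journaling on btrfs
  filestore journal writeahead = 1
  filestore journal parallel = 0

  # what Stefan tested: disable the 5s snapshots, which on its own
  # already falls back from parallel to write-ahead journaling
  # filestore btrfs snap = false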
Re: 10 times higher disk load with btrfs
Hi Sage,
On 05.01.2015 at 20:25, Sage Weil wrote:
On Mon, 5 Jan 2015, Stefan Priebe wrote:
On 05.01.2015 at 19:36, Stefan Priebe wrote:
Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 osds. So if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10?
OK, this seems to happen because ceph is creating a new subvolume / snapshot every 5s. Is this really expected / needed?
You can disable it with filestore btrfs snap = false
I'm curious how much this drops the load down; originally the snaps were no more expensive than a regular sync, but perhaps this has changed...
- with XFS the average write is at 9MB/s
- with btrfs (filestore_btrfs_snap=true) write is at 40MB/s
- with btrfs (filestore_btrfs_snap=false) write is at 20MB/s
Stefan
Re: 10 times higher disk load with btrfs
Am 05.01.2015 um 21:29 schrieb Mark Nelson: On 01/05/2015 02:20 PM, Stefan Priebe wrote: Hi Sage, Am 05.01.2015 um 20:25 schrieb Sage Weil: On Mon, 5 Jan 2015, Stefan Priebe wrote: Am 05.01.2015 um 19:36 schrieb Stefan Priebe: Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 OSDs, so if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10? OK, this seems to happen because ceph creates a new subvolume / snap every 5s. Is this really expected / needed? You can disable it with filestore btrfs snap = false I'm curious how much this drops the load down; originally the snaps were no more expensive than a regular sync, but perhaps this has changed... - with XFS the average write is at 9MB/s - with btrfs (filestore_btrfs_snap=true) write is at 40MB/s - with btrfs (filestore_btrfs_snap=false) write is at 20MB/s Is that the average and not the spikes? It looks like before the spikes were 20MB/s and 190MB/s? Yes, these are average values. Spikes: - with XFS the spike write is at 20MB/s - with btrfs (filestore_btrfs_snap=true) spike write is 200MB/s - with btrfs (filestore_btrfs_snap=false) spike is still 185MB/s but avg is 1/2 (20MB/s), see above Stefan
10 times higher disk load with btrfs
Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 OSDs, so if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10? I'm running dumpling. Greets Stefan
Re: 10 times higher disk load with btrfs
Am 05.01.2015 um 19:36 schrieb Stefan Priebe: Hi devs, while btrfs is now declared as stable ;-) I wanted to retest btrfs on our production cluster on 2 out of 54 OSDs, so if they crash it doesn't hurt. While those OSDs run XFS, they have spikes of 20MB/s every 4-7s. The same OSDs after formatting them with btrfs have spikes of 190MB/s every 4-7s. Why does just another filesystem raise the disk load by a factor of 10? OK, this seems to happen because ceph creates a new subvolume / snap every 5s. Is this really expected / needed? Stefan I'm running dumpling. Greets Stefan
Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
Am 02.01.2015 um 17:49 schrieb Samuel Just: Odd, sounds like it might be rbd client side? -Sam That one was already on the list: https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg19091.html Sadly there was no result, as it went unseen for 2 weeks and I didn't have the test equipment anymore. Greets, Stefan On Thu, Jan 1, 2015 at 1:30 AM, Stefan Priebe s.pri...@profihost.ag wrote: hi, Am 31.12.2014 um 17:21 schrieb Wido den Hollander: Hi, Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly 0.80.7, and after the upgrade there was a severe performance drop on the cluster. It started raining slow requests after the upgrade, and most of them included a 'snapc' in the request. That led me to investigate the RBD snapshots, and I found that a rogue process had created ~1800 snapshots spread out over 200 volumes. One image even had 181 snapshots! As the snapshots weren't used I removed them all, and after the snapshots were removed the performance of the cluster came back to a normal level again. I'm wondering what changed between Dumpling and Firefly which caused this? I saw OSDs spiking to 100% disk util constantly under Firefly where this didn't happen with Dumpling. Did something change in the way OSDs handle RBD snapshots which causes them to create more disk I/O? I saw the same, and additionally a slowdown in librbd too; that's why I'm still on dumpling and won't upgrade until hammer. Stefan
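For anyone hitting the same problem: rogue snapshots like the ones described above can be enumerated and removed with the standard rbd CLI; pool and image names below are placeholders:

rbd snap ls mypool/myimage     # list all snapshots of one image
rbd snap purge mypool/myimage  # delete all snapshots of one image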
Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
hi, Am 31.12.2014 um 17:21 schrieb Wido den Hollander: Hi, Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly 0.80.7, and after the upgrade there was a severe performance drop on the cluster. It started raining slow requests after the upgrade, and most of them included a 'snapc' in the request. That led me to investigate the RBD snapshots, and I found that a rogue process had created ~1800 snapshots spread out over 200 volumes. One image even had 181 snapshots! As the snapshots weren't used I removed them all, and after the snapshots were removed the performance of the cluster came back to a normal level again. I'm wondering what changed between Dumpling and Firefly which caused this? I saw OSDs spiking to 100% disk util constantly under Firefly where this didn't happen with Dumpling. Did something change in the way OSDs handle RBD snapshots which causes them to create more disk I/O? I saw the same, and additionally a slowdown in librbd too; that's why I'm still on dumpling and won't upgrade until hammer. Stefan
Re: inode64 mount option for XFS
Am 03.11.2014 um 13:28 schrieb Wido den Hollander: Hi, While looking at init-ceph and ceph-disk I noticed a discrepancy between them. init-ceph mounts XFS filesystems with rw,noatime,inode64, but ceph-disk(-activate) with rw,noatime. As inode64 gives the best performance, shouldn't ceph-disk do the same? Any implications if we add inode64 on running deployments? Isn't inode64 the XFS default anyway? Stefan
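For comparison, this is what the mount in question would look like; device and mount point are placeholders:

mount -o rw,noatime,inode64 /dev/sdX1 /var/lib/ceph/osd/ceph-N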
Re: 10/7/2014 Weekly Ceph Performance Meeting: kernel boot params
Hi, as mentioned during today's meeting, here are the kernel boot parameters which I found to provide the basis for good performance results: processor.max_cstate=0 intel_idle.max_cstate=0 I understand these to basically turn off any power saving modes of the CPU; the CPUs we are using are like Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz and Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz. At the BIOS level, we - turn off Hyperthreading - turn off Turbo mode (in order not to leave the specifications) - turn on frequency floor override We also assert that /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor is set to performance. Using the above we see a constant frequency at the maximum level allowed by the CPU (except Turbo mode). How much performance do we gain by this? Till now I thought it's just 1-3%, so I'm still running the ondemand governor plus power savings. Greets, Stefan Best Regards Andreas Bluemle On Wed, 8 Oct 2014 02:51:21 +0200 Mark Nelson mark.nel...@inktank.com wrote: Hi All, Just a reminder that the weekly performance meeting is on Wednesdays at 8AM PST. Same bat time, same bat channel! Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560 (US Toll Free) +1 408 317 9253 (Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- Andreas Bluemle mailto:andreas.blue...@itxperts.de ITXperts GmbH http://www.itxperts.de Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917 D-81541 Muenchen (Germany) Fax: (+49) 89 89044910 Company details: http://www.itxperts.de/imprint.htm
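As a concrete sketch of how the settings above are typically applied (paths assume a GRUB-based Debian-style distro; adapt as needed):

# /etc/default/grub: append to GRUB_CMDLINE_LINUX, then run update-grub and reboot
processor.max_cstate=0 intel_idle.max_cstate=0

# force the performance governor on all cores
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done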
Re: severe librbd performance degradation in Giant
Am 19.09.2014 03:08, schrieb Shu, Xinxin: I also observed performance degradation on my full SSD setup; I can get ~270K IOPS for 4KB random read with 0.80.4, but with latest master I only get ~12K IOPS. These are impressive numbers. Can you tell me how many OSDs you have and which SSDs you use? Thanks, Stefan Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Friday, September 19, 2014 2:03 AM To: Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Alexandre, What tool are you using? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache=true is set in the client conf file. FYI, firefly librbd + librados and a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the giant librbd (if you have multiple copies around, which was the case for me) for reproducing it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without it. firefly: around 12000 iops (with or without rbd_cache) giant: around 12000 iops (with or without rbd_cache) (and I can reach around 20000-30000 iops on giant with the optracker disabled). rbd_cache only improves write performance for me (4k block) - Mail original - De: Haomai Wang haomaiw...@gmail.com À: Somnath Roy somnath@sandisk.com Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org Envoyé: Jeudi 18 Septembre 2014 04:27:56 Objet: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Created a tracker for this: http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Sage, It's a 4K random read. Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential. s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd.
rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing random read, not write. Does this setting still apply? Next, I tried to tweak the rbd_cache setting to false and I *got back* the old performance. Now, it is similar to firefly throughput! So, it looks like rbd_cache=true was the culprit. Thanks Josh! Regards Somnath -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Wednesday, September 17, 2014 2:20 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over the firefly release. Here is the experiment we did to isolate it as a librbd problem.
1. Single OSD is running latest Giant and client is running fio rbd on top of firefly based librbd/librados. For one client it is giving ~11-12K iops (4K RR).
2. Single OSD is running Giant and client is running fio rbd on top of Giant based librbd/librados. For one client it is giving ~1.9K iops (4K RR).
3. Single OSD is running latest Giant and client is running Giant based ceph_smalliobench on top of giant librados. For one client it is giving ~11-12K iops (4K RR).
4. Giant RGW on top of Giant OSD is also scaling.
So, it is obvious from the above that recent
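For reference, the client-side switches being toggled in this thread sit in ceph.conf; a sketch with the values Somnath ended up testing:

[client]
rbd cache = false
rbd cache writethrough until flush = false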
Re: severe librbd performance degradation in Giant
Am 19.09.2014 um 15:02 schrieb Shu, Xinxin: 12 x Intel DC 3700 200GB, every SSD has two OSDs. Crazy, I've 56 SSDs and can't go above 20,000 iops. Regards, Stefan Cheers, xinxin -Original Message- From: Stefan Priebe [mailto:s.pri...@profihost.ag] Sent: Friday, September 19, 2014 2:54 PM To: Shu, Xinxin; Somnath Roy; Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant Am 19.09.2014 03:08, schrieb Shu, Xinxin: I also observed performance degradation on my full SSD setup; I can get ~270K IOPS for 4KB random read with 0.80.4, but with latest master I only get ~12K IOPS. These are impressive numbers. Can you tell me how many OSDs you have and which SSDs you use? Thanks, Stefan Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Friday, September 19, 2014 2:03 AM To: Alexandre DERUMIER; Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Alexandre, What tool are you using? I used fio rbd. Also, I hope you have the Giant package installed on the client side as well and rbd_cache=true is set in the client conf file. FYI, firefly librbd + librados and a Giant cluster will work seamlessly, and I had to make sure fio rbd was really loading the giant librbd (if you have multiple copies around, which was the case for me) for reproducing it. Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Thursday, September 18, 2014 2:49 AM To: Haomai Wang Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org; Somnath Roy Subject: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? Hi, on my side, I don't see any performance degradation on read (seq or rand) with or without it. firefly: around 12000 iops (with or without rbd_cache) giant: around 12000 iops (with or without rbd_cache) (and I can reach around 20000-30000 iops on giant with the optracker disabled). rbd_cache only improves write performance for me (4k block) - Mail original - De: Haomai Wang haomaiw...@gmail.com À: Somnath Roy somnath@sandisk.com Cc: Sage Weil sw...@redhat.com, Josh Durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org Envoyé: Jeudi 18 Septembre 2014 04:27:56 Objet: Re: severe librbd performance degradation in Giant According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache will make a 10x performance degradation for random read? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Created a tracker for this: http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant Sage, It's a 4K random read.
Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential. s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd. rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing random read, not write. Does this setting still apply? Next, I tried to tweak the rbd_cache setting to false and I *got back* the old performance. Now, it is similar to firefly throughput! So, it looks like rbd_cache=true was the culprit. Thanks Josh! Regards Somnath -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Wednesday, September 17, 2014 2:20 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over the firefly release. Here is the experiment we did to isolate
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Am 02.07.2014 00:51, schrieb Gregory Farnum: On Thu, Jun 26, 2014 at 11:49 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi Greg, Am 26.06.2014 02:17, schrieb Gregory Farnum: Sorry we let this drop; we've all been busy traveling and things. There have been a lot of changes to librados between Dumpling and Firefly, but we have no idea what would have made it slower. Can you provide more details about how you were running these tests? It's just a normal fio run: fio --ioengine=rbd --bs=4k --name=foo --invalidate=0 --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor --runtime=90 --numjobs=32 --direct=1 --group Running one time with firefly libs and one time with dumpling libs. Target is always the same pool on a firefly ceph storage. What's the backing cluster you're running against? What kind of CPU usage do you see with both? 25k IOPS is definitely getting up there, but I'd like some guidance about whether we're looking for a reduction in parallelism, or an increase in per-op costs, or something else. Hi Greg, I don't have that test cluster anymore. It had to go into production with dumpling. So I can't tell you. Sorry. Stefan -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Am 02.07.2014 15:07, schrieb Haomai Wang: Could you give some perf counters from the rbd client side? Such as op latency? Sorry, I don't have any counters. As this mail went unseen for some days, I thought nobody had an idea or could help. Stefan On Wed, Jul 2, 2014 at 9:01 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 02.07.2014 00:51, schrieb Gregory Farnum: On Thu, Jun 26, 2014 at 11:49 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi Greg, Am 26.06.2014 02:17, schrieb Gregory Farnum: Sorry we let this drop; we've all been busy traveling and things. There have been a lot of changes to librados between Dumpling and Firefly, but we have no idea what would have made it slower. Can you provide more details about how you were running these tests? It's just a normal fio run: fio --ioengine=rbd --bs=4k --name=foo --invalidate=0 --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor --runtime=90 --numjobs=32 --direct=1 --group Running one time with firefly libs and one time with dumpling libs. Target is always the same pool on a firefly ceph storage. What's the backing cluster you're running against? What kind of CPU usage do you see with both? 25k IOPS is definitely getting up there, but I'd like some guidance about whether we're looking for a reduction in parallelism, or an increase in per-op costs, or something else. Hi Greg, I don't have that test cluster anymore. It had to go into production with dumpling. So I can't tell you. Sorry. Stefan -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Hi Greg, Am 02.07.2014 21:36, schrieb Gregory Farnum: On Wed, Jul 2, 2014 at 12:00 PM, Stefan Priebe s.pri...@profihost.ag wrote: Am 02.07.2014 16:00, schrieb Gregory Farnum: Yeah, it's fighting for attention with a lot of other urgent stuff. :( Anyway, even if you can't look up any details or reproduce at this time, I'm sure you know what shape the cluster was (number of OSDs, running on SSDs or hard drives, etc), and that would be useful guidance. :) Sure. Number of OSDs: 24. Each OSD has an SSD; tested with fio before installing ceph, each was capable of 70,000 iops 4k write and 580MB/s seq. write with 1MB blocks. Single Xeon E5-1620 v2 @ 3.70GHz, 48GB RAM. Awesome, thanks. I went through the changelogs on the librados/, osdc/, and msg/ directories to see if I could find any likely change candidates between Dumpling and Firefly and couldn't see any issues. :( But I suspect that the sharding changes coming will more than make up the difference, so you might want to plan on checking that out when it arrives, even if you don't want to deploy it to production. To which changes do you refer? Will they be part of, or backported to, firefly? -Greg
Re: [ceph-users] Why is librbd1 / librados2 from Firefly 20% slower than the one from dumpling?
Hi Greg, Am 26.06.2014 02:17, schrieb Gregory Farnum: Sorry we let this drop; we've all been busy traveling and things. There have been a lot of changes to librados between Dumpling and Firefly, but we have no idea what would have made it slower. Can you provide more details about how you were running these tests? It's just a normal fio run: fio --ioengine=rbd --bs=4k --name=foo --invalidate=0 --readwrite=randwrite --iodepth=32 --rbdname=fio_test2 --pool=teststor --runtime=90 --numjobs=32 --direct=1 --group Running one time with firefly libs and one time with dumpling libs. Target is always the same pool on a firefly ceph storage. Stefan -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Jun 13, 2014 at 7:59 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, while testing firefly I came into the situation where I had a client with the latest dumpling packages installed (0.67.9). As my pool has hashpspool false and the tunables are set to default, it can talk to my firefly ceph storage. For random 4k writes using fio with librbd, 32 jobs and an iodepth of 32, I get these results:
librbd / librados2 from dumpling:
write: io=3020.9MB, bw=103083KB/s, iops=25770, runt= 30008msec
WRITE: io=3020.9MB, aggrb=103082KB/s, minb=103082KB/s, maxb=103082KB/s, mint=30008msec, maxt=30008msec
librbd / librados2 from firefly:
write: io=7344.3MB, bw=83537KB/s, iops=20884, runt= 90026msec
WRITE: io=7344.3MB, aggrb=83537KB/s, minb=83537KB/s, maxb=83537KB/s, mint=90026msec, maxt=90026msec
Stefan
Re: [Share]Performance tunning on Ceph FileStore with SSD backend
Am 27.05.2014 06:42, schrieb Haomai Wang: On Tue, May 27, 2014 at 4:29 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Haomai, regarding the FDCache problems you're seeing. Isn't this branch interesting for you? Have you ever tested it? http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-January/007399.html Yes, I noticed it. But my main job is improving performance on the 0.67.5 version. Before this branch, my improvement for this problem was avoiding lfn_find in the omap* methods of the FileStore class (https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg18505.html). Does avoid mean you just removed them? Are they not needed? Do you have a branch for this? Greets, Stefan Am 09.04.2014 12:05, schrieb Haomai Wang: Hi all, I would like to share some ideas about how to improve performance on ceph with SSD. It's not very precise. Our SSD is 500GB and each OSD owns an SSD (the journal is on the same SSD). The ceph version is 0.67.5 (Dumpling). At first, we found three bottlenecks in filestore: 1. fdcache_lock (changed in the Firefly release) 2. lfn_find in omap_* methods 3. DBObjectMap header According to my understanding and the docs in ObjectStore.h (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h), I simply removed lfn_find in omap_* and fdcache_lock. I'm not fully sure of the correctness of this change, but it works well up to now. The DBObjectMap header patch is in the pull request queue and may be merged in the next feature merge window. With the things above done, we get a big performance improvement in disk util and benchmark results (3x-4x). Next, we found the fdcache size becomes the main bottleneck. For example, if the hot data range is 100GB, we need 25000 (100GB/4MB) fds to cache. If the hot data range is 1TB, we need 250000 (1000GB/4MB) fds to cache. When increasing filestore_fd_cache_size, the cost of an FDCache lookup and of a cache miss is expensive and can't be afforded. The implementation of FDCache isn't O(1). So we can only get high performance within the fdcache hit range (maybe 100GB with a 10240 fdcache size), and data exceeding the size of the fdcache will be a disaster. If you want to cache more fds (102400 fdcache size), the implementation of FDCache will bring extra CPU cost (which can't be ignored) for each op. Because of the capacity of SSDs (several hundred GB), we try to increase the size of rbd objects (16MB) so less fd cache is needed. As for the FDCache implementation, we simply discard SimpleLRU and introduce RandomCache. Now we can set a much larger fdcache size (caching nearly all fds) with little overhead. With these, we achieve 3x-4x performance improvements on filestore with SSD. Maybe I missed something or got something wrong; I hope you can correct me. I hope it can help to improve FileStore on SSD and be pushed into the master branch.
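For reference, the FDCache size knob discussed above is exposed via ceph.conf; the value is only an example, not a recommendation:

[osd]
filestore fd cache size = 10240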
Re: [Share]Performance tunning on Ceph FileStore with SSD backend
Am 27.05.2014 08:37, schrieb Haomai Wang: I'm not fully sure of the correctness of the changes, although they seemed ok to me. And I applied these changes to a production env with no problems. Do you have a branch in your yuyuyu github account for this? On Tue, May 27, 2014 at 2:05 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 27.05.2014 06:42, schrieb Haomai Wang: On Tue, May 27, 2014 at 4:29 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Haomai, regarding the FDCache problems you're seeing. Isn't this branch interesting for you? Have you ever tested it? http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-January/007399.html Yes, I noticed it. But my main job is improving performance on the 0.67.5 version. Before this branch, my improvement for this problem was avoiding lfn_find in the omap* methods of the FileStore class (https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg18505.html). Does avoid mean you just removed them? Are they not needed? Do you have a branch for this? Greets, Stefan Am 09.04.2014 12:05, schrieb Haomai Wang: Hi all, I would like to share some ideas about how to improve performance on ceph with SSD. It's not very precise. Our SSD is 500GB and each OSD owns an SSD (the journal is on the same SSD). The ceph version is 0.67.5 (Dumpling). At first, we found three bottlenecks in filestore: 1. fdcache_lock (changed in the Firefly release) 2. lfn_find in omap_* methods 3. DBObjectMap header According to my understanding and the docs in ObjectStore.h (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h), I simply removed lfn_find in omap_* and fdcache_lock. I'm not fully sure of the correctness of this change, but it works well up to now. The DBObjectMap header patch is in the pull request queue and may be merged in the next feature merge window. With the things above done, we get a big performance improvement in disk util and benchmark results (3x-4x). Next, we found the fdcache size becomes the main bottleneck. For example, if the hot data range is 100GB, we need 25000 (100GB/4MB) fds to cache. If the hot data range is 1TB, we need 250000 (1000GB/4MB) fds to cache. When increasing filestore_fd_cache_size, the cost of an FDCache lookup and of a cache miss is expensive and can't be afforded. The implementation of FDCache isn't O(1). So we can only get high performance within the fdcache hit range (maybe 100GB with a 10240 fdcache size), and data exceeding the size of the fdcache will be a disaster. If you want to cache more fds (102400 fdcache size), the implementation of FDCache will bring extra CPU cost (which can't be ignored) for each op. Because of the capacity of SSDs (several hundred GB), we try to increase the size of rbd objects (16MB) so less fd cache is needed. As for the FDCache implementation, we simply discard SimpleLRU and introduce RandomCache. Now we can set a much larger fdcache size (caching nearly all fds) with little overhead. With these, we achieve 3x-4x performance improvements on filestore with SSD. Maybe I missed something or got something wrong; I hope you can correct me. I hope it can help to improve FileStore on SSD and be pushed into the master branch.
Re: [Share]Performance tunning on Ceph FileStore with SSD backend
Hi Haomai, regarding the FDCache problems you're seeing. Isn't this branch interesting for you? Have you ever tested it? http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-January/007399.html Greets, Stefan Am 09.04.2014 12:05, schrieb Haomai Wang: Hi all, I would like to share some ideas about how to improve performance on ceph with SSD. It's not very precise. Our SSD is 500GB and each OSD owns an SSD (the journal is on the same SSD). The ceph version is 0.67.5 (Dumpling). At first, we found three bottlenecks in filestore: 1. fdcache_lock (changed in the Firefly release) 2. lfn_find in omap_* methods 3. DBObjectMap header According to my understanding and the docs in ObjectStore.h (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h), I simply removed lfn_find in omap_* and fdcache_lock. I'm not fully sure of the correctness of this change, but it works well up to now. The DBObjectMap header patch is in the pull request queue and may be merged in the next feature merge window. With the things above done, we get a big performance improvement in disk util and benchmark results (3x-4x). Next, we found the fdcache size becomes the main bottleneck. For example, if the hot data range is 100GB, we need 25000 (100GB/4MB) fds to cache. If the hot data range is 1TB, we need 250000 (1000GB/4MB) fds to cache. When increasing filestore_fd_cache_size, the cost of an FDCache lookup and of a cache miss is expensive and can't be afforded. The implementation of FDCache isn't O(1). So we can only get high performance within the fdcache hit range (maybe 100GB with a 10240 fdcache size), and data exceeding the size of the fdcache will be a disaster. If you want to cache more fds (102400 fdcache size), the implementation of FDCache will bring extra CPU cost (which can't be ignored) for each op. Because of the capacity of SSDs (several hundred GB), we try to increase the size of rbd objects (16MB) so less fd cache is needed. As for the FDCache implementation, we simply discard SimpleLRU and introduce RandomCache. Now we can set a much larger fdcache size (caching nearly all fds) with little overhead. With these, we achieve 3x-4x performance improvements on filestore with SSD. Maybe I missed something or got something wrong; I hope you can correct me. I hope it can help to improve FileStore on SSD and be pushed into the master branch.
Re: [Performance] Improvement on DB Performance
Am 21.05.2014 um 20:41 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Stefan Priebe - Profihost AG wrote: Hi sage, what about cuttlefish customers? We stopped backporting fixes to cuttlefish a while ago. Please upgrade to dumpling! Did I miss some information from inktank about updating to dumpling? I thought we should stay on cuttlefish and then upgrade to firefly. That said, this patch should apply cleanly to cuttlefish. sage Greets, Stefan Excuse my typos, sent from my mobile phone. Am 21.05.2014 um 18:15 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Mike Dawson wrote: Haomai, Thanks for finding this! Sage, We have a client that runs an io intensive, closed-source software package that seems to issue overzealous flushes which may benefit from this patch (or the other methods you mention). If you were to spin a wip build based on Dumpling, I'll be a willing tester. Pushed wip-librbd-flush-dumpling, should be built shortly. sage Thanks, Mike Dawson On 5/21/2014 11:23 AM, Sage Weil wrote: On Wed, 21 May 2014, Haomai Wang wrote: I pushed the commit to fix this problem (https://github.com/ceph/ceph/pull/1848). With a test program (each sync request is issued with ten write requests), a significant improvement is noticed.
aio_flush sum: 914750 avg: 1239 count: 738 max: 4714 min: 1011
flush_set sum: 904200 avg: 1225 count: 738 max: 4698 min: 999
flush sum: 641648 avg: 173 count: 3690 max: 1340 min: 128
Compared to the last mail, it reduces each aio_flush request to 1239 ns instead of 24145 ns. Good catch! That's a great improvement. The patch looks clearly correct. We can probably do even better by putting the Objects on a list when they get their first dirty buffer so that we only cycle through the dirty ones. Or, have a global list of dirty buffers (instead of dirty objects - dirty buffers). sage I hope it's the root cause for db-on-rbd performance. On Wed, May 21, 2014 at 6:15 PM, Haomai Wang haomaiw...@gmail.com wrote: Hi all, I remember there was a discussion about DB (mysql) performance on rbd. Recently I tested mysql-bench with rbd and found awful performance. So I dove into it and found that the main cause is flush requests from the guest. As we know, applications such as mysql and ceph have their own journal for durability, and the journal usually issues sync/direct io. If the fs barrier is on, each sync io operation makes the kernel issue a sync (barrier) request to the block device. Here, qemu will call rbd_aio_flush to apply it. Via systemtap, I found an amazing thing:
aio_flush sum: 4177085 avg: 24145 count: 173 max: 28172 min: 22747
flush_set sum: 4172116 avg: 24116 count: 173 max: 28034 min: 22733
flush sum: 3029910 avg: 4 count: 670477 max: 1893 min: 3
This statistic info is gathered in 5s. Most
Re: [Performance] Improvement on DB Performance
*arg* sorry, I mixed up emperor with dumpling.. sorry. Stefan Am 21.05.2014 20:51, schrieb Stefan Priebe - Profihost AG: Am 21.05.2014 um 20:41 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Stefan Priebe - Profihost AG wrote: Hi sage, what about cuttlefish customers? We stopped backporting fixes to cuttlefish a while ago. Please upgrade to dumpling! Did I miss some information from inktank about updating to dumpling? I thought we should stay on cuttlefish and then upgrade to firefly. That said, this patch should apply cleanly to cuttlefish. sage Greets, Stefan Excuse my typos, sent from my mobile phone. Am 21.05.2014 um 18:15 schrieb Sage Weil s...@inktank.com: On Wed, 21 May 2014, Mike Dawson wrote: Haomai, Thanks for finding this! Sage, We have a client that runs an io intensive, closed-source software package that seems to issue overzealous flushes which may benefit from this patch (or the other methods you mention). If you were to spin a wip build based on Dumpling, I'll be a willing tester. Pushed wip-librbd-flush-dumpling, should be built shortly. sage Thanks, Mike Dawson On 5/21/2014 11:23 AM, Sage Weil wrote: On Wed, 21 May 2014, Haomai Wang wrote: I pushed the commit to fix this problem (https://github.com/ceph/ceph/pull/1848). With a test program (each sync request is issued with ten write requests), a significant improvement is noticed.
aio_flush sum: 914750 avg: 1239 count: 738 max: 4714 min: 1011
flush_set sum: 904200 avg: 1225 count: 738 max: 4698 min: 999
flush sum: 641648 avg: 173 count: 3690 max: 1340 min: 128
Compared to the last mail, it reduces each aio_flush request to 1239 ns instead of 24145 ns. Good catch! That's a great improvement. The patch looks clearly correct. We can probably do even better by putting the Objects on a list when they get their first dirty buffer so that we only cycle through the dirty ones. Or, have a global list of dirty buffers (instead of dirty objects - dirty buffers). sage I hope it's the root cause for db-on-rbd performance. On Wed, May 21, 2014 at 6:15 PM, Haomai Wang haomaiw...@gmail.com wrote: Hi all, I remember there was a discussion about DB (mysql) performance on rbd. Recently I tested mysql-bench with rbd and found awful performance. So I dove into it and found that the main cause is flush requests from the guest. As we know, applications such as mysql and ceph have their own journal for durability, and the journal usually issues sync/direct io. If the fs barrier is on, each sync io operation makes the kernel issue a sync (barrier) request to the block device. Here, qemu will call rbd_aio_flush to apply it. Via systemtap, I found an amazing thing:
aio_flush sum: 4177085 avg: 24145 count: 173 max: 28172 min: 22747
flush_set sum: 4172116 avg: 24116 count: 173 max: 28034 min: 22733
flush sum: 3029910 avg: 4 count: 670477 max: 1893 min: 3
This statistic info is gathered
Re: default filestore max sync interval
Hi Greg, Am 29.04.2014 22:23, schrieb Gregory Farnum: On Tue, Apr 29, 2014 at 1:10 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Hi all, Why is the default max sync interval only 5 seconds? Today we realized what a huge difference increasing this to 30 or 60s can make for small write latency. Basically, with a 5s interval our 4k write latency is above 30-35ms, and once we increase it to 30s we can get under 10ms (using spinning disks for journal and data). See the attached plot for the effect of this on a running cluster (the plot shows the max, avg, min write latency from a short rados bench every 10 mins). The change from 5s to 60s was applied at noon today. (And our journals are large enough, don't worry.) In the interest of having sensible defaults, is there any reason not to increase this to 30s? If you've got reasonable confidence in the quality of your measurements across the workloads you serve, you should bump it up. Part of what might be happening here is simply that fewer of your small-io writes are running into a sync interval. I suspect that most users will see improvement by bumping up the limits and occasionally agitate to change the defaults, but Sam has always pushed back against doing so for reasons I don't entirely recall. :) (The potential for a burstier throughput profile?) -Greg What about those?
filestore queue max ops = 500
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 419430400
filestore_queue_committing_max_bytes = 419430400
filestore_wbthrottle_xfs_bytes_start_flusher = 125829120
filestore_wbthrottle_xfs_bytes_hard_limit = 419430400
filestore_wbthrottle_xfs_ios_start_flusher = 5000
filestore_wbthrottle_xfs_ios_hard_limit = 50000
filestore_wbthrottle_xfs_inodes_start_flusher = 1000
filestore_wbthrottle_xfs_inodes_hard_limit = 10000
They should be adjusted too, right? Stefan
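For reference, the sync interval knob from this thread goes into ceph.conf; 30 is the value Dan tested, not a general recommendation:

[osd]
filestore max sync interval = 30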
Re: default filestore max sync interval
Hi Dan, Am 29.04.2014 22:10, schrieb Dan Van Der Ster: Hi all, Why is the default max sync interval only 5 seconds? Today we realized what a huge difference increasing this to 30 or 60s can make for small write latency. Basically, with a 5s interval our 4k write latency is above 30-35ms, and once we increase it to 30s we can get under 10ms (using spinning disks for journal and data). See the attached plot for the effect of this on a running cluster (the plot shows the max, avg, min write latency from a short rados bench every 10 mins). The change from 5s to 60s was applied at noon today. (And our journals are large enough, don't worry.) In the interest of having sensible defaults, is there any reason not to increase this to 30s? I was playing with them too but didn't get any noticeable results. How do you get / graph the ceph latency? Greets, Stefan
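Dan's latency numbers come from short rados bench runs; a minimal example of such a probe (pool name is a placeholder):

rados bench -p testpool 30 write -b 4096 -t 16
# the summary includes avg/min/max latency, which can be scraped and graphed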
Re: firefly timing
Hi Sage, I really would like to test the tiering. Is there any detailed documentation about it and how it works? Greets, Stefan Am 18.03.2014 05:45, schrieb Sage Weil: Hi everyone, It's taken longer than expected, but the tests for v0.78 are calming down and it looks like we'll be able to get the release out this week. However, we've decided NOT to make this release firefly. It will be a normal development release. This will be the first release that includes some key new functionality (erasure coding and cache tiering), and although it is passing our tests we'd like to have some operational experience with it in more users' hands before we commit to supporting it long term. The tentative plan is to freeze and then release v0.79 after a normal two week cycle. This will serve as a 'release candidate' that shaves off a few rough edges from the pending release (including some improvements to the API for setting up erasure coded pools). It is possible that 0.79 will turn into firefly, but it is more likely that we will opt for another two weeks of hardening and make 0.80 the release we name firefly and maintain for the long term. Long story short: 0.78 will be out soon, and you should test it! It will vary from the final firefly in a few subtle ways, but any feedback or usability and bug reports at this point will be very helpful in shaping things. Thanks! sage
Re: firefly timing
Am 18.03.2014 um 17:06 schrieb Sage Weil s...@inktank.com: On Tue, 18 Mar 2014, Stefan Priebe - Profihost AG wrote: Hi Sage, I really would like to test the tiering. Is there any detailed documentation about it and how it works? Great! Here is a quick synopsis on how to set it up: http://ceph.com/docs/master/dev/cache-pool/ What I'm missing is documentation about the cache settings. sage Greets, Stefan Am 18.03.2014 05:45, schrieb Sage Weil: Hi everyone, It's taken longer than expected, but the tests for v0.78 are calming down and it looks like we'll be able to get the release out this week. However, we've decided NOT to make this release firefly. It will be a normal development release. This will be the first release that includes some key new functionality (erasure coding and cache tiering), and although it is passing our tests we'd like to have some operational experience with it in more users' hands before we commit to supporting it long term. The tentative plan is to freeze and then release v0.79 after a normal two week cycle. This will serve as a 'release candidate' that shaves off a few rough edges from the pending release (including some improvements to the API for setting up erasure coded pools). It is possible that 0.79 will turn into firefly, but it is more likely that we will opt for another two weeks of hardening and make 0.80 the release we name firefly and maintain for the long term. Long story short: 0.78 will be out soon, and you should test it! It will vary from the final firefly in a few subtle ways, but any feedback or usability and bug reports at this point will be very helpful in shaping things. Thanks! sage
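Roughly, per the linked doc, wiring up a cache tier looks like this (pool names are placeholders; both pools must already exist):

ceph osd tier add basepool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay basepool cachepool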
Re: ceph cli delay when one mon is down
Am 15.01.2014 um 08:33 schrieb Dietmar Maurer diet...@proxmox.com: You can avoid this, and speed things up in general, by using the interactive mode:

#!/bin/sh
ceph <<EOM
do something
do something else
EOM

The above is a bit clumsy. In particular, you don't know which command failed, in which way, and which exit code or output it produced. To be honest, I want to do things with perl, so I guess it is better to use perl bindings for librados. Are perl bindings already available?
Re: Proposal for adding disable FileJournal option
I had the same question in the past, but there seems to be no way to get the ceph team to change it. Stefan This mail was sent with my iPhone. Am 09.01.2014 um 18:28 schrieb Gregory Farnum g...@inktank.com: The FileJournal is also for data safety whenever we're using write-ahead. To disable it we need a backing store that we know can provide us consistent checkpoints (i.e., we can use parallel journaling mode, so for the FileJournal we're using btrfs, or maybe zfs someday). But for those systems you can already configure the system not to use a journal. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Jan 9, 2014 at 12:13 AM, Haomai Wang haomaiw...@gmail.com wrote: Hi all, We know FileJournal plays an important role in the FileStore backend; it can hugely reduce write latency and improve small write operations. But in practice there exist exceptions, such as when we already use FlashCache or a cachepool (although it's not ready). If a cachepool is enabled, we may want to use a journal in the cache_pool but not in the base_pool. The main reason to drop the journal in the base_pool is that the journal takes over a single physical device and wastes too much in the base_pool. Likewise, if I enable FlashCache or another cache, I wouldn't like to enable the journal at the OSD layer. So is it necessary to be able to disable the journal in special (not really special) cases? Best regards, Wheats
rocksdb Seen today - replacement for leveldb?
Hi, while Google's leveldb was too slow for Facebook, they created rocksdb (http://rocksdb.org/). Maybe interesting for Ceph? It's already production quality. Greets, Stefan
Re: [ceph-users] rocksdb Seen today - replacement for leveldb?
The performance comparisons are very impressive: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks Stefan Am 27.11.2013 11:55, schrieb Stefan Priebe - Profihost AG: Hi, while Google's leveldb was too slow for Facebook, they created rocksdb (http://rocksdb.org/). Maybe interesting for Ceph? It's already production quality. Greets, Stefan
Re: Intel 520/530 SSD for ceph
Hi, Am 21.11.2013 01:29, schrieb m...@linuxbox.com: On Tue, Nov 19, 2013 at 09:02:41AM +0100, Stefan Priebe wrote: ... You might be able to vary this behavior by experimenting with sdparm, smartctl or other tools, or possibly with different microcode in the drive. Which values or which settings do you think of? ... Off-hand, I don't know. Probably the first thing would be to compare the configuration of your 520 and 530; anything that's different is certainly worth investigating. This should display all pages:
sdparm --all --long /dev/sdX
The 520 only appears to have 3 pages, which can be fetched directly with:
sdparm --page=ca --long /dev/sdX
sdparm --page=co --long /dev/sdX
sdparm --page=rw --long /dev/sdX
The sample machine I'm looking at has an intel 520, and on ours, most options show as 0 except for:
AWRE 1 [cha: n, def: 1] Automatic write reallocation enabled
WCE 1 [cha: y, def: 1] Write cache enable
DRA 1 [cha: n, def: 1] Disable read ahead
GLTSD 1 [cha: n, def: 1] Global logging target save disable
BTP -1 [cha: n, def: -1] Busy timeout period (100us)
ESTCT 30 [cha: n, def: 30] Extended self test completion time (sec)
Perhaps that's an interesting data point to compare with yours. Figuring out if you have up-to-date intel firmware appears to require burning and running an iso image from https://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=18455 The results of sdparm --page=whatever --long /dev/sdc show the intel firmware, but this labels it better: smartctl -i /dev/sdc Our 520 has firmware 400i loaded. Firmware is up to date and all values are the same. I expect that the 520 firmware just ignores CMD_FLUSH commands and the 530 does not. Greets, Stefan
Re: Intel 520/530 SSD for ceph
Hi Marcus, Am 18.11.2013 23:51, schrieb m...@linuxbox.com: On Mon, Nov 18, 2013 at 02:38:42PM +0100, Stefan Priebe - Profihost AG wrote: You may actually be doing O_SYNC - recent kernels implement O_DSYNC, but glibc maps O_DSYNC into O_SYNC. But since you're writing to the block device this won't matter much. No difference regarding O_DSYNC or O_SYNC; the values are the same. Also, I'm using 3.10.19 as a kernel, so it is recent enough. I believe the effect of O_DIRECT by itself is just to bypass the buffer cache, which is not going to make much difference for your dd case. (It will mainly affect other applications that are also using the buffer cache...) O_SYNC should be causing the writes to block until a response is received from the disk. Without O_SYNC, the writes will just queue operations and return - potentially very fast. Your dd is probably writing enough data that there is some throttling by the system as it runs out of disk buffers and has to wait for some previous data to be written to the drive, but the delay for any individual block is not likely to matter. With O_SYNC, you are measuring the delay for each block directly, and you have absolutely removed the ability for the disk to perform any sort of parallelism. That's correct, but ceph uses O_DSYNC for its journal and maybe other stuff, so it is important to have devices performing well with O_DSYNC. Sounds like the intel 530 has a much larger block write latency, but can make up for it by performing more overlapped operations. You might be able to vary this behavior by experimenting with sdparm, smartctl or other tools, or possibly with different microcode in the drive. Which values or which settings do you think of? Greets Stefan
Re: [ANN] ceph-deploy 1.3 released!
Hi, I didn't find anything in the changelog, so I just would like to ask if this is planned. Right now you can already create a new cluster using hostA:IPA HostB:IPB ..., but it does not use these IPs as mon addr. Also, the hostA/hostB names need to match the hostname. This is pretty bad, as you cannot easily change the IPs or hosts of mons later, so I tend to use special names and IPs which I can move to different machines later. The normal ceph config supports:

[mon.a]
host = abc
mon addr = 85.58.34.12

Thanks, Stefan

Am 01.11.2013 13:54, schrieb Alfredo Deza: Hi all, A new version (1.3) of ceph-deploy is now out. A lot of fixes went into this release, including the addition of a more robust library to connect to remote hosts, and it removed the one extra dependency we used. Installation should be simpler. The complete changelog can be found at: https://github.com/ceph/ceph-deploy/blob/master/docs/source/changelog.rst The highlights for this release are:

* We now allow `--username` to be used to connect to remote hosts, specifying something different than the current user or the SSH config.
* Global timeouts for remote commands, to be able to disconnect if there is no input received (defaults to 5 minutes), while still allowing other, more granular timeouts for commands that just need to run without expecting output.

Please make sure you update (install instructions: http://github.com/ceph/ceph-deploy/#installation) and use the latest version! Thanks, Alfredo
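A minimal sketch of the static mon layout being asked for here (mon names, hostnames and addresses are invented for illustration, not from the thread):

[mon.a]
host = mon-a
mon addr = 10.0.0.11:6789

[mon.b]
host = mon-b
mon addr = 10.0.0.12:6789

[mon.c]
host = mon-c
mon addr = 10.0.0.13:6789

With floating names and addresses like these, a mon can be moved to a different machine as long as its address moves with it, which is the flexibility requested above.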
Re: still recovery issues with cuttlefish
Am 22.08.2013 05:34, schrieb Samuel Just: It's not really possible at this time to control that limit because changing the primary is actually fairly expensive and doing it unnecessarily would probably make the situation much worse I'm sorry but remapping or backfilling is far less expensive on all of my machines than recovering. While backfilling i've around 8-10% I/O waits while under recovery i have 40%-50% (it's mostly necessary for backfilling, which is expensive anyway). It seems like forwarding IO on an object which needs to be recovered to a replica with the object would be the next step. Certainly something to consider for the future. Yes this would be the solution. Stefan -Sam On Wed, Aug 21, 2013 at 12:37 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Sam, Am 21.08.2013 21:13, schrieb Samuel Just: As long as the request is for an object which is up to date on the primary, the request will be served without waiting for recovery. Sure but remember if you have VM random 4K workload a lot of objects go out of date pretty soon. A request only waits on recovery if the particular object being read or written must be recovered. Yes but on 4k load this can be a lot. Your issue was that recovering the particular object being requested was unreasonably slow due to silliness in the recovery code which you disabled by disabling osd_recover_clone_overlap. Yes and no. It's better now but far away from being good or perfect. My VMs do not crash anymore but i still have a bunch of slow requests (just around 10 messages) and still a VERY high I/O load on the disks during recovery. In cases where the primary osd is significantly behind, we do make one of the other osds primary during recovery in order to expedite requests (pgs in this state are shown as remapped). oh never seen that but at least in my case even 60s are a very long timeframe and the OSD is very stressed during recovery. Is it possible for me to set this value? Stefan -Sam On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 21.08.2013 17:32, schrieb Samuel Just: Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue. This might sound a bug harsh but maybe due to my limited english skills ;-) I still think that Cephs recovery system is broken by design. If an OSD comes back (was offline) all write requests regarding PGs where this one is primary are targeted immediatly to this OSD. If this one is not up2date for an PG it tries to recover that one immediatly which costs 4MB / block. If you have a lot of small write all over your OSDs and PGs you're sucked as your OSD has to recover ALL it's PGs immediatly or at least lots of them WHICH can't work. This is totally crazy. I think the right way would be: 1.) if an OSD goes down the replicas got primaries or 2.) an OSD which does not have an up2date PG should redirect to the OSD holding the secondary or third replica. Both results in being able to have a really smooth and slow recovery without any stress even under heavy 4K workloads like rbd backed VMs. Thanks for reading! Greets Stefan -Sam On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson mike.daw...@cloudapt.com wrote: Sam/Josh, We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change. One node in our cluster fsck'ed after a reboot and got a bit behind. 
Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation. I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage. During the entire period of recovery/backfill, the network looked fine...no where close to saturation. iowait on all drives look fine as well. Any ideas? Thanks, Mike Dawson On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote: the same problem still occours. Will need to check when i've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j
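The remapped state Sam mentions can be observed from the CLI (a sketch; 2.1f is a placeholder pg id):

# pgs whose acting set currently differs from the up set
ceph pg dump | grep remapped

# up and acting sets for one pg; the first osd in the acting set is the primary
ceph pg map 2.1f

During recovery, a pg whose primary is far behind should show up here with a different acting primary.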
Re: still recovery issues with cuttlefish
Am 21.08.2013 17:32, schrieb Samuel Just: Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue. This might sound a bug harsh but maybe due to my limited english skills ;-) I still think that Cephs recovery system is broken by design. If an OSD comes back (was offline) all write requests regarding PGs where this one is primary are targeted immediatly to this OSD. If this one is not up2date for an PG it tries to recover that one immediatly which costs 4MB / block. If you have a lot of small write all over your OSDs and PGs you're sucked as your OSD has to recover ALL it's PGs immediatly or at least lots of them WHICH can't work. This is totally crazy. I think the right way would be: 1.) if an OSD goes down the replicas got primaries or 2.) an OSD which does not have an up2date PG should redirect to the OSD holding the secondary or third replica. Both results in being able to have a really smooth and slow recovery without any stress even under heavy 4K workloads like rbd backed VMs. Thanks for reading! Greets Stefan -Sam On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson mike.daw...@cloudapt.com wrote: Sam/Josh, We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change. One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation. I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage. During the entire period of recovery/backfill, the network looked fine...no where close to saturation. iowait on all drives look fine as well. Any ideas? Thanks, Mike Dawson On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote: the same problem still occours. Will need to check when i've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe that but i'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1 but the issue didn't went away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! 
-Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samual, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20 debug filestore = 20 debug ms = 1 debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs at cephdrop folder: slow_requests_recovering_cuttlefish osd.52 was the one recovering Thanks! Greets, Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send
Re: still recovery issues with cuttlefish
Hi Sam, Am 21.08.2013 21:13, schrieb Samuel Just: As long as the request is for an object which is up to date on the primary, the request will be served without waiting for recovery. Sure but remember if you have VM random 4K workload a lot of objects go out of date pretty soon. A request only waits on recovery if the particular object being read or written must be recovered. Yes but on 4k load this can be a lot. Your issue was that recovering the particular object being requested was unreasonably slow due to silliness in the recovery code which you disabled by disabling osd_recover_clone_overlap. Yes and no. It's better now but far away from being good or perfect. My VMs do not crash anymore but i still have a bunch of slow requests (just around 10 messages) and still a VERY high I/O load on the disks during recovery. In cases where the primary osd is significantly behind, we do make one of the other osds primary during recovery in order to expedite requests (pgs in this state are shown as remapped). oh never seen that but at least in my case even 60s are a very long timeframe and the OSD is very stressed during recovery. Is it possible for me to set this value? Stefan -Sam On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 21.08.2013 17:32, schrieb Samuel Just: Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue. This might sound a bug harsh but maybe due to my limited english skills ;-) I still think that Cephs recovery system is broken by design. If an OSD comes back (was offline) all write requests regarding PGs where this one is primary are targeted immediatly to this OSD. If this one is not up2date for an PG it tries to recover that one immediatly which costs 4MB / block. If you have a lot of small write all over your OSDs and PGs you're sucked as your OSD has to recover ALL it's PGs immediatly or at least lots of them WHICH can't work. This is totally crazy. I think the right way would be: 1.) if an OSD goes down the replicas got primaries or 2.) an OSD which does not have an up2date PG should redirect to the OSD holding the secondary or third replica. Both results in being able to have a really smooth and slow recovery without any stress even under heavy 4K workloads like rbd backed VMs. Thanks for reading! Greets Stefan -Sam On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson mike.daw...@cloudapt.com wrote: Sam/Josh, We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change. One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation. I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage. During the entire period of recovery/backfill, the network looked fine...no where close to saturation. iowait on all drives look fine as well. Any ideas? Thanks, Mike Dawson On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote: the same problem still occours. 
Will need to check when i've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe that but i'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1 but the issue didn't went away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! -Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j
[PATCH] debian/control libgoogle-perftools-dev (>= 2.0-2)
---
 debian/control | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/debian/control b/debian/control
index 5c14ebb..b39579f 100644
--- a/debian/control
+++ b/debian/control
@@ -25,7 +25,7 @@ Build-Depends: autoconf,
  libexpat1-dev,
  libfcgi-dev,
  libfuse-dev,
- libgoogle-perftools-dev [i386 amd64],
+ libgoogle-perftools-dev (>= 2.0-2) [i386 amd64],
  libkeyutils-dev,
  libleveldb-dev,
  libnss3-dev,
--
1.7.10.4
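To confirm a build box actually carries a new enough tcmalloc before building (a sketch):

apt-cache policy libgoogle-perftools-dev
# from the unpacked ceph source tree, with debian/ present:
dpkg-checkbuilddeps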
[PATCH] also allow the curl openssl binding
---
 debian/control | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/debian/control b/debian/control
index b39579f..957727d 100644
--- a/debian/control
+++ b/debian/control
@@ -20,7 +20,7 @@ Build-Depends: autoconf,
  libboost-program-options-dev (>= 1.42),
  libboost-thread-dev (>= 1.42),
  libboost-system-dev (>= 1.42),
- libcurl4-gnutls-dev,
+ libcurl4-gnutls-dev | libcurl4-openssl-dev,
  libedit-dev,
  libexpat1-dev,
  libfcgi-dev,
--
1.7.10.4
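With the alternative in place, a box carrying only the openssl flavor should satisfy the build dependency; this can be checked directly (a sketch):

apt-get install libcurl4-openssl-dev
dpkg-checkbuilddeps   # should no longer complain about libcurl4-gnutls-dev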
Re: still recovery issues with cuttlefish
the same problem still occurs. Will need to check when I've time to gather logs again. Am 14.08.2013 01:11, schrieb Samuel Just: I'm not sure, but your logs did show that you had 16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs and I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help. -Sam On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1, but the issue didn't go away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! -Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
Re: still recovery issues with cuttlefish
Am 13.08.2013 um 22:43 schrieb Samuel Just sam.j...@inktank.com: I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release. Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from sage where he fixed a race regarding max active some weeks ago. So active recovering was max 1, but the issue didn't go away. Stefan -Sam On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just sam.j...@inktank.com wrote: I got swamped today. I should be able to look tomorrow. Sorry! -Sam On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
Re: still recovery issues with cuttlefish
Did you take a look? Stefan Am 11.08.2013 um 05:50 schrieb Samuel Just sam.j...@inktank.com: Great! I'll take a look on Monday. -Sam On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
Re: still recovery issues with cuttlefish
Hi Samuel, Am 09.08.2013 23:44, schrieb Samuel Just: I think Stefan's problem is probably distinct from Mike's. Stefan: Can you reproduce the problem with debug osd = 20, debug filestore = 20, debug ms = 1, debug optracker = 20 on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again? done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish. osd.52 was the one recovering. Thanks! Greets, Stefan
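For reference, the requested debug levels can be injected into running daemons instead of editing ceph.conf and restarting (a sketch; osd.52 stands in for whichever osds are involved):

ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

Resetting the levels afterwards keeps the log volume manageable.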
Re: still recovery issues with cuttlefish
Hi Mike, Am 08.08.2013 16:05, schrieb Mike Dawson: Stefan, I see the same behavior and I theorize it is linked to an issue detailed in another thread [0]. Do your VM guests ever hang while your cluster is HEALTH_OK like described in that other thread? [0] http://comments.gmane.org/gmane.comp.file-systems.ceph.user/2982 mhm no can't see that. All our VMs are working fine even under high load while ceph is OK. A few observations: - The VMs that hang do lots of writes (video surveillance). - I use rbd and qemu. The problem exists in both qemu 1.4.x and 1.5.2. - The problem exists with or without joshd's qemu async flush patch. - Windows VMs seem to be more vulnerable than Linux VMs. - If I restart the qemu-system-x86_64 process, the guest will come back to life. - A partial workaround seems to be console input (NoVNC or 'virsh screenshot'), but restarting qemu-system-x86_64 works better. - The issue of VMs hanging seems worse with RBD writeback cache enabled - I typically run virtio, but I believe I've seen it with e1000, too. - VM guests hang at different times, not all at once on a host (or across all hosts). - I co-mingle VM guests on servers that host ceph OSDs. Oliver, If your cluster has to recover/backfill, do your guest VMs hang with more frequency than under normal HEALTH_OK conditions, even if you prioritize client IO as Sam wrote below? Sam, Turning down all the settings you mentioned certainly does slow the recover/backfill process, but it doesn't prevent the VM guests backed by RBD volumes from hanging. In fact, I often try to prioritize recovery/backfill because my guests tend to hang until I get back to HEALTH_OK. Given this apparent bug, completing recovery/backfill quicker leads to less total outage, it seems. Josh, How can I help you investigate if RBD is the common source of both of these issues? Thanks, Mike Dawson On 8/2/2013 2:46 PM, Stefan Priebe wrote: Hi, osd recovery max active = 1 osd max backfills = 1 osd recovery op priority = 5 still no difference... Stefan Am 02.08.2013 20:21, schrieb Samuel Just: Also, you have osd_recovery_op_priority at 50. That is close to the priority of client IO. You want it below 10 (defaults to 10), perhaps at 1. You can also adjust down osd_recovery_max_active. -Sam On Fri, Aug 2, 2013 at 11:16 AM, Stefan Priebe s.pri...@profihost.ag wrote: I already tried both values this makes no difference. The drives are not the bottleneck. Am 02.08.2013 19:35, schrieb Samuel Just: You might try turning osd_max_backfills to 2 or 1. -Sam On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 01.08.2013 23:23, schrieb Samuel Just: Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd.osdid.asok config show Sure. 
{ name: osd.0, cluster: ceph, none: 0\/5, lockdep: 0\/0, context: 0\/0, crush: 0\/0, mds: 0\/0, mds_balancer: 0\/0, mds_locker: 0\/0, mds_log: 0\/0, mds_log_expire: 0\/0, mds_migrator: 0\/0, buffer: 0\/0, timer: 0\/0, filer: 0\/0, striper: 0\/1, objecter: 0\/0, rados: 0\/0, rbd: 0\/0, journaler: 0\/0, objectcacher: 0\/0, client: 0\/0, osd: 0\/0, optracker: 0\/0, objclass: 0\/0, filestore: 0\/0, journal: 0\/0, ms: 0\/0, mon: 0\/0, monc: 0\/0, paxos: 0\/0, tp: 0\/0, auth: 0\/0, crypto: 1\/5, finisher: 0\/0, heartbeatmap: 0\/0, perfcounter: 0\/0, rgw: 0\/0, hadoop: 0\/0, javaclient: 1\/5, asok: 0\/0, throttle: 0\/0, host: cloud1-1268, fsid: ----, public_addr: 10.255.0.90:0\/0, cluster_addr: 10.255.0.90:0\/0, public_network: 10.255.0.1\/24, cluster_network: 10.255.0.1\/24, num_client: 1, monmap: , mon_host: , lockdep: false, run_dir: \/var\/run\/ceph, admin_socket: \/var\/run\/ceph\/ceph-osd.0.asok, daemonize: true, pid_file: \/var\/run\/ceph\/osd.0.pid, chdir: \/, max_open_files: 0, fatal_signal_handlers: true, log_file: \/var\/log\/ceph\/ceph-osd.0.log, log_max_new: 1000, log_max_recent: 1, log_to_stderr: false, err_to_stderr: true, log_to_syslog: false, err_to_syslog: false, log_flush_on_exit: true, log_stop_at_utilization: 0.97, clog_to_monitors: true, clog_to_syslog: false, clog_to_syslog_level: info, clog_to_syslog_facility: daemon, mon_cluster_log_to_syslog: false, mon_cluster_log_to_syslog_level: info, mon_cluster_log_to_syslog_facility: daemon, mon_cluster_log_file: \/var\/log\/ceph\/ceph.log, key: , keyfile: , keyring: \/etc\/ceph\/osd.0.keyring, heartbeat_interval: 5, heartbeat_file: , heartbeat_inject_failure: 0, perf: true, ms_tcp_nodelay: true, ms_tcp_rcvbuf: 0, ms_initial_backoff: 0.2, ms_max_backoff: 15, ms_nocrc: false, ms_die_on_bad_msg: false
, rgw_socket_path: , rgw_host: , rgw_port: , rgw_dns_name: , rgw_script_uri: , rgw_request_uri: , rgw_swift_url: , rgw_swift_url_prefix: swift, rgw_swift_auth_url: , rgw_swift_auth_entry: auth, rgw_keystone_url: , rgw_keystone_admin_token: , rgw_keystone_accepted_roles: Member, admin, rgw_keystone_token_cache_size: 1, rgw_keystone_revocation_interval: 900, rgw_admin_entry: admin, rgw_enforce_swift_acls: true, rgw_swift_token_expiration: 86400, rgw_print_continue: true, rgw_remote_addr_param: REMOTE_ADDR, rgw_op_thread_timeout: 600, rgw_op_thread_suicide_timeout: 0, rgw_thread_pool_size: 100, rgw_num_control_oids: 8, rgw_zone_root_pool: .rgw.root, rgw_log_nonexistent_bucket: false, rgw_log_object_name: %Y-%m-%d-%H-%i-%n, rgw_log_object_name_utc: false, rgw_usage_max_shards: 32, rgw_usage_max_user_shards: 1, rgw_enable_ops_log: false, rgw_enable_usage_log: false, rgw_ops_log_rados: true, rgw_ops_log_socket_path: , rgw_ops_log_data_backlog: 5242880, rgw_usage_log_flush_threshold: 1024, rgw_usage_log_tick_interval: 30, rgw_intent_log_object_name: %Y-%m-%d-%i-%n, rgw_intent_log_object_name_utc: false, rgw_init_timeout: 300, rgw_mime_types_file: \/etc\/mime.types, rgw_gc_max_objs: 32, rgw_gc_obj_min_wait: 7200, rgw_gc_processor_max_time: 3600, rgw_gc_processor_period: 3600, rgw_s3_success_create_obj_status: 0, rgw_resolve_cname: false, rgw_obj_stripe_size: 4194304, rgw_extended_http_attrs: , rgw_exit_timeout_secs: 120, rgw_get_obj_window_size: 16777216, rgw_get_obj_max_req_size: 4194304, rgw_relaxed_s3_bucket_names: false, rgw_list_buckets_max_chunk: 1000, mutex_perf_counter: false, internal_safe_to_start_threads: true} Stefan -Sam On Thu, Aug 1, 2013 at 12:07 PM, Stefan Priebe s.pri...@profihost.ag wrote: Mike we already have the async patch running. Yes it helps but only helps it does not solve. It just hides the issue ... Am 01.08.2013 20:54, schrieb Mike Dawson: I am also seeing recovery issues with 0.61.7. Here's the process: - ceph osd set noout - Reboot one of the nodes hosting OSDs - VMs mounted from RBD volumes work properly - I see the OSD's boot messages as they re-join the cluster - Start seeing active+recovery_wait, peering, and active+recovering - VMs mounted from RBD volumes become unresponsive. - Recovery completes - VMs mounted from RBD volumes regain responsiveness - ceph osd unset noout Would joshd's async patch for qemu help here, or is there something else going on? Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 2:34 PM, Samuel Just wrote: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and re backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? 
Greets, Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
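When comparing dumps like the one above, filtering the admin socket output down to the recovery knobs is usually enough (a sketch; the socket path follows the default naming, adjust the osd id):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'recovery|backfill'

This shows the values the running daemon actually uses, which can differ from ceph.conf if anything was injected at runtime.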
Re: still recovery issues with cuttlefish
I already tried both values this makes no difference. The drives are not the bottleneck. Am 02.08.2013 19:35, schrieb Samuel Just: You might try turning osd_max_backfills to 2 or 1. -Sam On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 01.08.2013 23:23, schrieb Samuel Just: Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd.osdid.asok config show Sure. { name: osd.0, cluster: ceph, none: 0\/5, lockdep: 0\/0, context: 0\/0, crush: 0\/0, mds: 0\/0, mds_balancer: 0\/0, mds_locker: 0\/0, mds_log: 0\/0, mds_log_expire: 0\/0, mds_migrator: 0\/0, buffer: 0\/0, timer: 0\/0, filer: 0\/0, striper: 0\/1, objecter: 0\/0, rados: 0\/0, rbd: 0\/0, journaler: 0\/0, objectcacher: 0\/0, client: 0\/0, osd: 0\/0, optracker: 0\/0, objclass: 0\/0, filestore: 0\/0, journal: 0\/0, ms: 0\/0, mon: 0\/0, monc: 0\/0, paxos: 0\/0, tp: 0\/0, auth: 0\/0, crypto: 1\/5, finisher: 0\/0, heartbeatmap: 0\/0, perfcounter: 0\/0, rgw: 0\/0, hadoop: 0\/0, javaclient: 1\/5, asok: 0\/0, throttle: 0\/0, host: cloud1-1268, fsid: ----, public_addr: 10.255.0.90:0\/0, cluster_addr: 10.255.0.90:0\/0, public_network: 10.255.0.1\/24, cluster_network: 10.255.0.1\/24, num_client: 1, monmap: , mon_host: , lockdep: false, run_dir: \/var\/run\/ceph, admin_socket: \/var\/run\/ceph\/ceph-osd.0.asok, daemonize: true, pid_file: \/var\/run\/ceph\/osd.0.pid, chdir: \/, max_open_files: 0, fatal_signal_handlers: true, log_file: \/var\/log\/ceph\/ceph-osd.0.log, log_max_new: 1000, log_max_recent: 1, log_to_stderr: false, err_to_stderr: true, log_to_syslog: false, err_to_syslog: false, log_flush_on_exit: true, log_stop_at_utilization: 0.97, clog_to_monitors: true, clog_to_syslog: false, clog_to_syslog_level: info, clog_to_syslog_facility: daemon, mon_cluster_log_to_syslog: false, mon_cluster_log_to_syslog_level: info, mon_cluster_log_to_syslog_facility: daemon, mon_cluster_log_file: \/var\/log\/ceph\/ceph.log, key: , keyfile: , keyring: \/etc\/ceph\/osd.0.keyring, heartbeat_interval: 5, heartbeat_file: , heartbeat_inject_failure: 0, perf: true, ms_tcp_nodelay: true, ms_tcp_rcvbuf: 0, ms_initial_backoff: 0.2, ms_max_backoff: 15, ms_nocrc: false, ms_die_on_bad_msg: false, ms_die_on_unhandled_msg: false, ms_dispatch_throttle_bytes: 104857600, ms_bind_ipv6: false, ms_bind_port_min: 6800, ms_bind_port_max: 7100, ms_rwthread_stack_bytes: 1048576, ms_tcp_read_timeout: 900, ms_pq_max_tokens_per_priority: 4194304, ms_pq_min_cost: 65536, ms_inject_socket_failures: 0, ms_inject_delay_type: , ms_inject_delay_max: 1, ms_inject_delay_probability: 0, ms_inject_internal_delays: 0, mon_data: \/var\/lib\/ceph\/mon\/ceph-0, mon_initial_members: , mon_sync_fs_threshold: 5, mon_compact_on_start: false, mon_compact_on_bootstrap: false, mon_compact_on_trim: true, mon_tick_interval: 5, mon_subscribe_interval: 300, mon_osd_laggy_halflife: 3600, mon_osd_laggy_weight: 0.3, mon_osd_adjust_heartbeat_grace: true, mon_osd_adjust_down_out_interval: true, mon_osd_auto_mark_in: false, mon_osd_auto_mark_auto_out_in: true, mon_osd_auto_mark_new_in: true, mon_osd_down_out_interval: 300, mon_osd_down_out_subtree_limit: rack, mon_osd_min_up_ratio: 0.3, mon_osd_min_in_ratio: 0.3, mon_stat_smooth_intervals: 2, mon_lease: 5, mon_lease_renew_interval: 3, mon_lease_ack_timeout: 10, mon_clock_drift_allowed: 0.05, mon_clock_drift_warn_backoff: 5, mon_timecheck_interval: 300, mon_accept_timeout: 10, mon_pg_create_interval: 30, mon_pg_stuck_threshold: 300, mon_osd_full_ratio: 0.95, mon_osd_nearfull_ratio: 0.85, mon_globalid_prealloc: 100, 
mon_osd_report_timeout: 900, mon_force_standby_active: true, mon_min_osdmap_epochs: 500, mon_max_pgmap_epochs: 500, mon_max_log_epochs: 500, mon_max_osd: 1, mon_probe_timeout: 2, mon_slurp_timeout: 10, mon_slurp_bytes: 262144, mon_client_bytes: 104857600, mon_daemon_bytes: 419430400, mon_max_log_entries_per_event: 4096, mon_health_data_update_interval: 60, mon_data_avail_crit: 5, mon_data_avail_warn: 30, mon_config_key_max_entry_size: 4096, mon_sync_trim_timeout: 30, mon_sync_heartbeat_timeout: 30, mon_sync_heartbeat_interval: 5, mon_sync_backoff_timeout: 30, mon_sync_timeout: 30, mon_sync_max_retries: 5, mon_sync_max_payload_size: 1048576, mon_sync_debug: false, mon_sync_debug_leader: -1, mon_sync_debug_provider: -1, mon_sync_debug_provider_fallback: -1, mon_debug_dump_transactions: false, mon_debug_dump_location: \/var\/log\/ceph\/ceph-osd.0.tdump, mon_sync_leader_kill_at: 0
Re: still recovery issues with cuttlefish
Hi, osd recovery max active = 1 osd max backfills = 1 osd recovery op priority = 5 still no difference... Stefan Am 02.08.2013 20:21, schrieb Samuel Just: Also, you have osd_recovery_op_priority at 50. That is close to the priority of client IO. You want it below 10 (defaults to 10), perhaps at 1. You can also adjust down osd_recovery_max_active. -Sam On Fri, Aug 2, 2013 at 11:16 AM, Stefan Priebe s.pri...@profihost.ag wrote: I already tried both values this makes no difference. The drives are not the bottleneck. Am 02.08.2013 19:35, schrieb Samuel Just: You might try turning osd_max_backfills to 2 or 1. -Sam On Fri, Aug 2, 2013 at 12:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Am 01.08.2013 23:23, schrieb Samuel Just: Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd.osdid.asok config show Sure. { name: osd.0, cluster: ceph, none: 0\/5, lockdep: 0\/0, context: 0\/0, crush: 0\/0, mds: 0\/0, mds_balancer: 0\/0, mds_locker: 0\/0, mds_log: 0\/0, mds_log_expire: 0\/0, mds_migrator: 0\/0, buffer: 0\/0, timer: 0\/0, filer: 0\/0, striper: 0\/1, objecter: 0\/0, rados: 0\/0, rbd: 0\/0, journaler: 0\/0, objectcacher: 0\/0, client: 0\/0, osd: 0\/0, optracker: 0\/0, objclass: 0\/0, filestore: 0\/0, journal: 0\/0, ms: 0\/0, mon: 0\/0, monc: 0\/0, paxos: 0\/0, tp: 0\/0, auth: 0\/0, crypto: 1\/5, finisher: 0\/0, heartbeatmap: 0\/0, perfcounter: 0\/0, rgw: 0\/0, hadoop: 0\/0, javaclient: 1\/5, asok: 0\/0, throttle: 0\/0, host: cloud1-1268, fsid: ----, public_addr: 10.255.0.90:0\/0, cluster_addr: 10.255.0.90:0\/0, public_network: 10.255.0.1\/24, cluster_network: 10.255.0.1\/24, num_client: 1, monmap: , mon_host: , lockdep: false, run_dir: \/var\/run\/ceph, admin_socket: \/var\/run\/ceph\/ceph-osd.0.asok, daemonize: true, pid_file: \/var\/run\/ceph\/osd.0.pid, chdir: \/, max_open_files: 0, fatal_signal_handlers: true, log_file: \/var\/log\/ceph\/ceph-osd.0.log, log_max_new: 1000, log_max_recent: 1, log_to_stderr: false, err_to_stderr: true, log_to_syslog: false, err_to_syslog: false, log_flush_on_exit: true, log_stop_at_utilization: 0.97, clog_to_monitors: true, clog_to_syslog: false, clog_to_syslog_level: info, clog_to_syslog_facility: daemon, mon_cluster_log_to_syslog: false, mon_cluster_log_to_syslog_level: info, mon_cluster_log_to_syslog_facility: daemon, mon_cluster_log_file: \/var\/log\/ceph\/ceph.log, key: , keyfile: , keyring: \/etc\/ceph\/osd.0.keyring, heartbeat_interval: 5, heartbeat_file: , heartbeat_inject_failure: 0, perf: true, ms_tcp_nodelay: true, ms_tcp_rcvbuf: 0, ms_initial_backoff: 0.2, ms_max_backoff: 15, ms_nocrc: false, ms_die_on_bad_msg: false, ms_die_on_unhandled_msg: false, ms_dispatch_throttle_bytes: 104857600, ms_bind_ipv6: false, ms_bind_port_min: 6800, ms_bind_port_max: 7100, ms_rwthread_stack_bytes: 1048576, ms_tcp_read_timeout: 900, ms_pq_max_tokens_per_priority: 4194304, ms_pq_min_cost: 65536, ms_inject_socket_failures: 0, ms_inject_delay_type: , ms_inject_delay_max: 1, ms_inject_delay_probability: 0, ms_inject_internal_delays: 0, mon_data: \/var\/lib\/ceph\/mon\/ceph-0, mon_initial_members: , mon_sync_fs_threshold: 5, mon_compact_on_start: false, mon_compact_on_bootstrap: false, mon_compact_on_trim: true, mon_tick_interval: 5, mon_subscribe_interval: 300, mon_osd_laggy_halflife: 3600, mon_osd_laggy_weight: 0.3, mon_osd_adjust_heartbeat_grace: true, mon_osd_adjust_down_out_interval: true, mon_osd_auto_mark_in: false, mon_osd_auto_mark_auto_out_in: true, mon_osd_auto_mark_new_in: true, mon_osd_down_out_interval: 300, mon_osd_down_out_subtree_limit: rack, 
mon_osd_min_up_ratio: 0.3, mon_osd_min_in_ratio: 0.3, mon_stat_smooth_intervals: 2, mon_lease: 5, mon_lease_renew_interval: 3, mon_lease_ack_timeout: 10, mon_clock_drift_allowed: 0.05, mon_clock_drift_warn_backoff: 5, mon_timecheck_interval: 300, mon_accept_timeout: 10, mon_pg_create_interval: 30, mon_pg_stuck_threshold: 300, mon_osd_full_ratio: 0.95, mon_osd_nearfull_ratio: 0.85, mon_globalid_prealloc: 100, mon_osd_report_timeout: 900, mon_force_standby_active: true, mon_min_osdmap_epochs: 500, mon_max_pgmap_epochs: 500, mon_max_log_epochs: 500, mon_max_osd: 1, mon_probe_timeout: 2, mon_slurp_timeout: 10, mon_slurp_bytes: 262144, mon_client_bytes: 104857600, mon_daemon_bytes: 419430400, mon_max_log_entries_per_event: 4096, mon_health_data_update_interval
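The three settings above can also be applied to a live cluster without restarting the osds (a sketch; repeat per osd id, or use the osd.* wildcard where the installed version supports it):

ceph tell osd.0 injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1 --osd-recovery-op-priority 1'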
still recovery issues with cuttlefish
Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off until ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
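For anyone reproducing this, the effect can be quantified from the cluster log (a sketch; paths are the defaults):

# watch recovery state transitions live while the osd rejoins
ceph -w

# afterwards, count the slow ops seen during the window
grep 'slow request' /var/log/ceph/ceph.log | wc -l

Comparing the count for an immediate restart against the leave-it-out-until-backfill case would put numbers on the difference described here.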
Re: still recovery issues with cuttlefish
Am 01.08.2013 20:34, schrieb Samuel Just: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam Sure, which log levels? On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off until ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
Re: still recovery issues with cuttlefish
Mike, we already have the async patch running. Yes, it helps, but it only helps; it does not solve the problem. It just hides the issue ... Am 01.08.2013 20:54, schrieb Mike Dawson: I am also seeing recovery issues with 0.61.7. Here's the process:

- ceph osd set noout
- Reboot one of the nodes hosting OSDs
- VMs mounted from RBD volumes work properly
- I see the OSD's boot messages as they re-join the cluster
- Start seeing active+recovery_wait, peering, and active+recovering
- VMs mounted from RBD volumes become unresponsive.
- Recovery completes
- VMs mounted from RBD volumes regain responsiveness
- ceph osd unset noout

Would joshd's async patch for qemu help here, or is there something else going on? Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 2:34 PM, Samuel Just wrote: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off until ceph starts to backfill, the recovery and re-backfilling go absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
Upgrading from 0.61.5 to 0.61.6 ended in disaster
Hi, today i wanted to upgrade from 0.61.5 to 0.61.6 to get rid of the mon bug. But this ended in a complete desaster. What i've done: 1.) recompiled ceph tagged with 0.61.6 2.) installed new ceph version on all machines 3.) JUST tried to restart ONE mon this failed with: [1774]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf ' 2013-07-24 08:41:43.086951 7f53c185d700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2013-07-24 08:41:43.088090 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:43.088094 7f53c185d700 0 mon.a@0(???).health(3840) HealthMonitor::service_shutdown 1 services 2013-07-24 08:41:43.088097 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: (Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- begin dump of recent events --- -13 2013-07-24 08:41:44.222821 7fae6384a780 5 asok(0x2698000) register_command perfcounters_dump hook 0x2682010 -12 2013-07-24 08:41:44.222835 7fae6384a780 5 asok(0x2698000) register_command 1 hook 0x2682010 -11 2013-07-24 08:41:44.222837 7fae6384a780 5 asok(0x2698000) register_command perf dump hook 0x2682010 -10 2013-07-24 08:41:44.222842 7fae6384a780 5 asok(0x2698000) register_command perfcounters_schema hook 0x2682010 -9 2013-07-24 08:41:44.222845 7fae6384a780 5 asok(0x2698000) register_command 2 hook 0x2682010 -8 2013-07-24 08:41:44.222847 7fae6384a780 5 asok(0x2698000) register_command perf schema hook 0x2682010 -7 2013-07-24 08:41:44.222849 7fae6384a780 5 asok(0x2698000) register_command config show hook 0x2682010 -6 2013-07-24 08:41:44.222852 7fae6384a780 5 asok(0x2698000) register_command config set hook 0x2682010 -5 2013-07-24 08:41:44.222854 7fae6384a780 5 asok(0x2698000) register_command log flush hook 0x2682010 -4 2013-07-24 08:41:44.222856 7fae6384a780 5 asok(0x2698000) register_command log dump hook 0x2682010 -3 2013-07-24 08:41:44.222859 7fae6384a780 5 asok(0x2698000) register_command log reopen hook 0x2682010 -2 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 -1 2013-07-24 08:41:44.224397 7fae6384a780 1 finished global_init_daemonize 0 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: 
(Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 4.) i thought no problem mon.b and mon.c are still running. BUT all OSDs were still trying to reach mon.a 2013-07-24 08:41:43.088997 7f011268f700 0 monclient: hunting for new mon 2013-07-24 08:41:56.792449 7f0109e7e700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x489e000 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:02.792990 7f0116b6c700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x3c02780 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:11.793525 7f0109d7d700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x84ec280 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:23.794315 7f0109e7e700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x44c7b80 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:27.621336 7f0122d2e700 0 log [WRN] : 5 slow requests, 5 included below; oldest blocked for 30.378391 secs 2013-07-24 08:42:27.621344 7f0122d2e700 0 log [WRN] : slow request 30.378391 seconds old, received at 2013-07-24 08:41:57.242902: osd_op(client.14727601.0:3839848
Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
Hi, i uploaded my ceph mon store to cephdrop /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz. So hopefully someone can find the culprit soon. It fails in OSDMonitor.cc here: // if we trigger this, then there's something else going with the store // state, and we shouldn't want to work around it without knowing what // exactly happened. assert(latest_full 0); Stefan Am 24.07.2013 09:05, schrieb Stefan Priebe - Profihost AG: Hi, today i wanted to upgrade from 0.61.5 to 0.61.6 to get rid of the mon bug. But this ended in a complete desaster. What i've done: 1.) recompiled ceph tagged with 0.61.6 2.) installed new ceph version on all machines 3.) JUST tried to restart ONE mon this failed with: [1774]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf ' 2013-07-24 08:41:43.086951 7f53c185d700 -1 mon.a@0(leader) e1 *** Got Signal Terminated *** 2013-07-24 08:41:43.088090 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:43.088094 7f53c185d700 0 mon.a@0(???).health(3840) HealthMonitor::service_shutdown 1 services 2013-07-24 08:41:43.088097 7f53c185d700 0 quorum service shutdown 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: (Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 
--- begin dump of recent events --- -13 2013-07-24 08:41:44.222821 7fae6384a780 5 asok(0x2698000) register_command perfcounters_dump hook 0x2682010 -12 2013-07-24 08:41:44.222835 7fae6384a780 5 asok(0x2698000) register_command 1 hook 0x2682010 -11 2013-07-24 08:41:44.222837 7fae6384a780 5 asok(0x2698000) register_command perf dump hook 0x2682010 -10 2013-07-24 08:41:44.222842 7fae6384a780 5 asok(0x2698000) register_command perfcounters_schema hook 0x2682010 -9 2013-07-24 08:41:44.222845 7fae6384a780 5 asok(0x2698000) register_command 2 hook 0x2682010 -8 2013-07-24 08:41:44.222847 7fae6384a780 5 asok(0x2698000) register_command perf schema hook 0x2682010 -7 2013-07-24 08:41:44.222849 7fae6384a780 5 asok(0x2698000) register_command config show hook 0x2682010 -6 2013-07-24 08:41:44.222852 7fae6384a780 5 asok(0x2698000) register_command config set hook 0x2682010 -5 2013-07-24 08:41:44.222854 7fae6384a780 5 asok(0x2698000) register_command log flush hook 0x2682010 -4 2013-07-24 08:41:44.222856 7fae6384a780 5 asok(0x2698000) register_command log dump hook 0x2682010 -3 2013-07-24 08:41:44.222859 7fae6384a780 5 asok(0x2698000) register_command log reopen hook 0x2682010 -2 2013-07-24 08:41:44.224104 7fae6384a780 0 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process ceph-mon, pid 29871 -1 2013-07-24 08:41:44.224397 7fae6384a780 1 finished global_init_daemonize 0 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7fae6384a780 time 2013-07-24 08:41:56.096683 mon/OSDMonitor.cc: 156: FAILED assert(latest_full 0) ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3) 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3] 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66] 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7] 4: (Monitor::init_paxos()+0xe5) [0x48f955] 5: (Monitor::preinit()+0x679) [0x4bba79] 6: (main()+0x36b0) [0x484bb0] 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d] 8: /usr/bin/ceph-mon() [0x4801e9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 4.) i thought no problem mon.b and mon.c are still running. BUT all OSDs were still trying to reach mon.a 2013-07-24 08:41:43.088997 7f011268f700 0 monclient: hunting for new mon 2013-07-24 08:41:56.792449 7f0109e7e700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x489e000 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:02.792990 7f0116b6c700 0 -- 10.255.0.82:6802/29397 10.255.0.100:6789/0 pipe(0x3c02780 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault 2013-07-24 08:42:11.793525 7f0109d7d700 0
Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
Am 24.07.2013 13:11, schrieb Joao Eduardo Luis: On 07/24/2013 08:37 AM, Stefan Priebe - Profihost AG wrote: Hi, I uploaded my ceph mon store to cephdrop /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz. So hopefully someone can find the culprit soon. It fails in OSDMonitor.cc here:

// if we trigger this, then there's something else going on with the store
// state, and we shouldn't want to work around it without knowing what
// exactly happened.
assert(latest_full > 0);

Wrong variable being used in a loop as part of a workaround for 5704. Opened a bug for this on http://tracker.ceph.com/issues/5737 A fix is available on wip-5737 (next) and wip-5737-cuttlefish. Tested the mon against your store and it worked flawlessly. Also tested it against the same stores used during the original fix, and they also worked just fine. My question now is how the hell those stores worked fine although the original fix was grabbing what should have been a non-existent version, or how did they not trigger that assert. Which is what I'm going to investigate next. What I don't understand is why the hell the OSDs didn't use the 2nd or 3rd monitor, which weren't restarted? Greets, Stefan
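Whether the surviving mons actually held quorum while mon.a was down can be checked directly (a sketch; the socket path follows the default naming):

ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok mon_status
# or through any reachable mon:
ceph quorum_status

If b and c report a quorum without a, the osds should eventually settle on one of them; persistent "hunting for new mon" messages suggest they kept retrying the dead address instead.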
Re: [ceph-users] v0.61.5 Cuttlefish update released
All mons do not work anymore:

=== mon.a ===
Starting Ceph mon.a on ccad...
[21207]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '

Stefan

On 19.07.2013 07:59, Sage Weil wrote: A note on upgrading: One of the fixes in 0.61.5 is for a 32bit vs 64bit bug with the feature bits. We did not realize it before, but the fix will prevent 0.61.4 (or earlier) from forming a quorum with 0.61.5. This is similar to the upgrade from bobtail (and the future upgrade to dumpling). As such, we recommend you upgrade all monitors at once to avoid the potential for disruption in service. I'm adding a note to the release notes. Thanks! sage

On Thu, 18 Jul 2013, Sage Weil wrote: We've prepared another update for the Cuttlefish v0.61.x series. This release primarily contains monitor stability improvements, although there are also some important fixes for ceph-osd for large clusters and a few important CephFS fixes. We recommend that all v0.61.x users upgrade.

* mon: misc sync improvements (faster, more reliable, better tuning)
* mon: enable leveldb cache by default (big performance improvement)
* mon: new scrub feature (primarily for diagnostic, testing purposes)
* mon: fix occasional leveldb assertion on startup
* mon: prevent reads until initial state is committed
* mon: improved logic for trimming old osdmaps
* mon: fix pick_addresses bug when expanding mon cluster
* mon: several small paxos fixes, improvements
* mon: fix bug in osdmap trim behavior
* osd: fix several bugs with PG stat reporting
* osd: limit number of maps shared with peers (which could cause domino failures)
* rgw: fix radosgw-admin buckets list (for all buckets)
* mds: fix occasional client failure to reconnect
* mds: fix bad list traversal after unlink
* mds: fix underwater dentry cleanup (occasional crash after mds restart)
* libcephfs, ceph-fuse: fix occasional hangs on umount
* libcephfs, ceph-fuse: fix old bug with O_LAZY vs O_NOATIME confusion
* ceph-disk: more robust journal device detection on RHEL/CentOS
* ceph-disk: better, simpler locking
* ceph-disk: do not inadvertently mount over existing osd mounts
* ceph-disk: better handling for unusual device names
* sysvinit, upstart: handle symlinks in /var/lib/ceph/*

Please also refer to the complete release notes: http://ceph.com/docs/master/release-notes/#v0-61-5-cuttlefish

You can get v0.61.5 from the usual locations:
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.61.5.tar.gz
* For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm
Re: [ceph-users] v0.61.5 Cuttlefish update released
crash is this one:

2013-07-19 08:59:32.137646 7f484a872780  0 ceph version 0.61.5-17-g83f8b88 (83f8b88e5be41371cb77b39c0966e79cad92087b), process ceph-mon, pid 22172
2013-07-19 08:59:32.173975 7f484a872780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f484a872780 time 2013-07-19 08:59:32.173506
mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0)
 ceph version 0.61.5-17-g83f8b88 (83f8b88e5be41371cb77b39c0966e79cad92087b)
 1: (OSDMonitor::update_from_paxos(bool*)+0x16e1) [0x51d341]
 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
 4: (Monitor::init_paxos()+0xe5) [0x48f955]
 5: (Monitor::preinit()+0x679) [0x4bba79]
 6: (main()+0x36b0) [0x484bb0]
 7: (__libc_start_main()+0xfd) [0x7f48489cec8d]
 8: /usr/bin/ceph-mon() [0x4801e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -13> 2013-07-19 08:59:32.136172 7f484a872780  5 asok(0x131a000) register_command perfcounters_dump hook 0x1304010
   -12> 2013-07-19 08:59:32.136191 7f484a872780  5 asok(0x131a000) register_command 1 hook 0x1304010
   -11> 2013-07-19 08:59:32.136194 7f484a872780  5 asok(0x131a000) register_command perf dump hook 0x1304010
   -10> 2013-07-19 08:59:32.136200 7f484a872780  5 asok(0x131a000) register_command perfcounters_schema hook 0x1304010
    -9> 2013-07-19 08:59:32.136204 7f484a872780  5 asok(0x131a000) register_command 2 hook 0x1304010
    -8> 2013-07-19 08:59:32.136206 7f484a872780  5 asok(0x131a000) register_command perf schema hook 0x1304010
    -7> 2013-07-19 08:59:32.136208 7f484a872780  5 asok(0x131a000) register_command config show hook 0x1304010
    -6> 2013-07-19 08:59:32.136211 7f484a872780  5 asok(0x131a000) register_command config set hook 0x1304010
    -5> 2013-07-19 08:59:32.136214 7f484a872780  5 asok(0x131a000) register_command log flush hook 0x1304010
    -4> 2013-07-19 08:59:32.136216 7f484a872780  5 asok(0x131a000) register_command log dump hook 0x1304010
    -3> 2013-07-19 08:59:32.136219 7f484a872780  5 asok(0x131a000) register_command log reopen hook 0x1304010
    -2> 2013-07-19 08:59:32.137646 7f484a872780  0 ceph version 0.61.5-17-g83f8b88 (83f8b88e5be41371cb77b39c0966e79cad92087b), process ceph-mon, pid 22172
    -1> 2013-07-19 08:59:32.137967 7f484a872780  1 finished global_init_daemonize
     0> 2013-07-19 08:59:32.173975 7f484a872780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f484a872780 time 2013-07-19 08:59:32.173506
mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0)
[snip - same backtrace as above]

On 19.07.2013 08:58, Stefan Priebe - Profihost AG wrote: All mons do not work anymore: === mon.a === Starting Ceph mon.a on ccad... [21207]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf ' Stefan

On 19.07.2013 07:59, Sage Weil wrote: A note on upgrading: [snip - release announcement quoted in full above]
Re: [ceph-users] v0.61.5 Cuttlefish update released
Complete output / log with debug mon 20 here: http://pastebin.com/raw.php?i=HzegqkFz

Stefan

On 19.07.2013 09:00, Stefan Priebe - Profihost AG wrote: crash is this one: mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0) [snip - crash dump and earlier quotes trimmed; see previous message]
Re: [ceph-users] v0.61.5 Cuttlefish update released
On 19.07.2013 09:56, Dan van der Ster wrote: Was that 0.61.4 -> 0.61.5? Our upgrade of all mons and osds on SL6.4 went without incident. -- Dan van der Ster, CERN IT-DSS

It was from a git version in between 0.61.4 / 0.61.5 to 0.61.5.

Stefan

On Friday, July 19, 2013 at 9:00 AM, Stefan Priebe - Profihost AG wrote: crash is this one: mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0) [snip - crash dump and earlier quotes trimmed]
[PATCH] mon: use first_committed instead of latest_full map if latest_bl.length() == 0
this fixes a failure like:

     0> 2013-07-19 09:29:16.803918 7f7fb5f31780 -1 mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f7fb5f31780 time 2013-07-19 09:29:16.803439
mon/OSDMonitor.cc: 132: FAILED assert(latest_bl.length() != 0)
 ceph version 0.61.5-15-g72c7c74 (72c7c74e1f160e6be39b6edf30bce09b770fa777)
 1: (OSDMonitor::update_from_paxos(bool*)+0x16e1) [0x51d121]
 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2a46]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
 4: (Monitor::init_paxos()+0xe5) [0x48f955]
 5: (Monitor::preinit()+0x679) [0x4b1cf9]
 6: (main()+0x36b0) [0x484bb0]
 7: (__libc_start_main()+0xfd) [0x7f7fb408dc8d]
 8: /usr/bin/ceph-mon() [0x4801e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
---
 src/mon/OSDMonitor.cc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 9c854cd..ab3b8ec 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -129,6 +129,12 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
   if ((latest_full > 0) && (latest_full > osdmap.epoch)) {
     bufferlist latest_bl;
     get_version_full(latest_full, latest_bl);
+
+    if (latest_bl.length() == 0 && latest_full != 0 && get_first_committed() > 1) {
+      dout(0) << __func__ << " latest_bl.length() == 0, use first_committed instead of latest_full" << dendl;
+      latest_full = get_first_committed();
+      get_version_full(latest_full, latest_bl);
+    }
     assert(latest_bl.length() != 0);
     dout(7) << __func__ << " loading latest full map e" << latest_full << dendl;
     osdmap.decode(latest_bl);
--
1.7.10.4
Re: [PATCH] mon: use first_committed instead of latest_full map if latest_bl.length() == 0
Hi,

sorry, as all my mons were down with the same error, I was in a hurry, sadly made no copy of the mons, and worked around it with a hack ;-( But I posted a log to pastebin with debug mon 20 (see last email).

Stefan

Kind regards

Stefan Priebe
Bachelor of Science in Computer Science (BSCS)
Member of the board (CTO)
---
Profihost AG
Am Mittelfelde 29
30519 Hannover
Germany
Tel.: +49 (511) 5151 8181 | Fax: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com
Registered office: Hannover, VAT ID DE813460827
Commercial register: Amtsgericht Hannover, HRB 202350
Board: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Supervisory board: Prof. Dr. iur. Winfried Huck (Chairman)

On 19.07.2013 14:54, Joao Eduardo Luis wrote: On 07/19/2013 09:31 AM, Stefan Priebe wrote: this fixes a failure like: [snip - patch quoted in full above]

+    if (latest_bl.length() == 0 && latest_full != 0 && get_first_committed() > 1) {

latest_full is always > 0 here, following the previous if check.

+      dout(0) << __func__ << " latest_bl.length() == 0, use first_committed instead of latest_full" << dendl;
+      latest_full = get_first_committed();
+      get_version_full(latest_full, latest_bl);
+    }

Although appreciated, this patch fixes the symptom leading to the crash. The bug itself seems to be that there is a latest_full version that is empty. Until we know for sure what is happening and what is leading to such a state, fixing the symptom is not advisable, as it is not only masking the real issue but it may also have unforeseen long-term effects.

Stefan, do you still have the store state on which this was triggered? If so, can you share it with us (or dig a bit into it yourself if you can't share the store, in which case I'll let you know what to look for).

-Joao
Re: slow request problem
Hello list,

might this be a problem due to having too many PGs? I have 370 per OSD instead of 33 per OSD (OSDs * 100 / 3). Is there any plan for PG merging?

Stefan

Hello list, anyone else here who always has problems bringing an offline OSD back online? [snip]
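As a back-of-the-envelope check of the sizing rule quoted above, here is a small sketch in Python; the OSD count is an assumption for illustration, as the actual size of this cluster is not stated in the thread:

# Heuristic from the thread: total PGs ~= OSDs * 100 / replicas.
osds = 24                           # assumed cluster size, illustration only
replicas = 3
total_pgs = osds * 100 // replicas  # 800 PGs in total for the pool
pgs_per_osd = total_pgs / osds      # ~33 per OSD, vs. the 370 reported above
# Counting replicas, each OSD actually hosts copies of roughly
# pgs_per_osd * replicas PGs (~100 here).
print(total_pgs, pgs_per_osd)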
Re: [ceph-users] slow request problem
Hi Sage,

On 14.07.2013 at 17:01, Sage Weil s...@inktank.com wrote: On Sun, 14 Jul 2013, Stefan Priebe wrote: Hello list, might this be a problem due to having too many PGs? I have 370 per OSD instead of 33 per OSD (OSDs * 100 / 3).

That might exacerbate it. Can you try setting

osd min pg log entries = 50
osd max pg log entries = 100

What does that do, exactly? And why is a restart of all OSDs needed? Thanks!

across your cluster, restarting your osds, and see if that makes a difference? I'm wondering if this is a problem with pg log rewrites after peering. Note that adding that option and restarting isn't enough to trigger the trim; you have to hit the cluster with some IO too, and (if this is the source of your problem) the trim itself might be expensive. So add it, restart, do a bunch of IO (to all pools/PGs if you can), and then see if the problem is still present?

Will try. I can't produce a write to every PG; it's a production cluster with KVM RBD. But it has 800-1200 IOPS.

Also note that the lower osd min pg log entries means that the osd cannot be down as long without requiring a backfill (50 IOs per PG). These probably aren't the values that we want, but I'd like to find out whether the pg log rewrites after peering in cuttlefish are the culprit here. Thanks!

Is there any plan for PG merging?

Not right now. :( I'll talk to Sam, though, to see how difficult it would be given the split approach we settled on. Thanks! sage

Stefan
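Spelled out as a config fragment, the suggestion above would land in the [osd] section of ceph.conf on each OSD host. A minimal sketch; note these are the deliberately aggressive test values from this thread, not recommended defaults:

[osd]
osd min pg log entries = 50
osd max pg log entries = 100

As noted above, the new limits only take effect after an OSD restart, and the actual trim happens only once fresh IO hits each PG.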
Re: [ceph-users] slow request problem
On 14.07.2013 18:19, Sage Weil wrote: On Sun, 14 Jul 2013, Stefan Priebe - Profihost AG wrote: What does that do, exactly? And why is a restart of all OSDs needed? Thanks!

This limits the size of the pg log.

[snip]

Hmm, if this is a production cluster, I would be careful, then! Setting the pg logs too short can lead to backfill, which is very expensive (as you know). The defaults are 3000 / 10000, so maybe try something less aggressive like changing min to 500?

I've lowered the values to 500 / 1500 and it seems to lower the impact, but it does not seem to solve that one.

Stefan

Also, I think

ceph osd tell \* injectargs '--osd-min-pg-log-entries 500'

should work as well. But again, be aware that lowering the value will incur a trim that may in itself be a bit expensive (if this is the source of the problem). It is probably worth watching ceph pg dump | grep $some_random_pg and watching the 'v' column over time (say, a minute or two) to see how quickly pg events are being generated on your cluster. This will give you a sense of how much time 500 (or however many) pg log entries covers!

sage
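To make that last point concrete, here is a rough sketch of how much per-PG history a given log length buys, using the write rate mentioned earlier in the thread; the total PG count and the assumption of an even write distribution are hypothetical:

# How long does a pg log of N entries last, roughly?
cluster_write_iops = 1000       # 800-1200 IOPS reported above
total_pgs = 2000                # assumed; not stated in the thread
writes_per_pg = cluster_write_iops / total_pgs     # ~0.5 writes/s per PG
min_log_entries = 500
seconds_covered = min_log_entries / writes_per_pg  # ~1000 s
print(f"~{seconds_covered / 60:.0f} minutes of per-PG history")  # ~17 minutes

In other words, under those assumptions an OSD could be down for roughly a quarter of an hour before its PGs fall off the log and require a full backfill.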
Re: [ceph-users] slow request problem
On 14.07.2013 21:05, Sage Weil wrote: On Sun, 14 Jul 2013, Stefan Priebe wrote: [snip] I've lowered the values to 500 / 1500 and it seems to lower the impact, but it does not seem to solve that one.

This suggests that the problem is the pg log rewrites that are an inherent part of cuttlefish. This is replaced with improved rewrite logic in 0.66 or so, so dumpling will be better. I suspect that having a large number of pgs is exacerbating the issue for you. We think there is still a different peering performance problem that Sam and paravoid have been trying to track down, but I believe in that case reducing the pg log sizes didn't have much effect. (Maybe one of them can chime in here.) This was unfortunately something we failed to catch before cuttlefish was released. One of the main focuses right now is on creating large clusters and observing peering and recovery to make sure we don't repeat the same sort of mistake for dumpling!

Thanks, Sage, for this information. I had some OSD restarts which went better with the new settings, but others which didn't. And it's hard to measure and compare a restart of OSD.X with a restart of OSD.Y. Do you have any recommendations for me? Wait for dumpling and hope that nothing fails until then? Or upgrade to 0.66? Or try to move all data to a new pool with fewer PGs? Thanks!

Greets,
Stefan
slow request problem
Hello list,

anyone else here who always has problems bringing an offline OSD back online? Since cuttlefish I'm seeing slow requests for the first 2-5 minutes after bringing an OSD online again, and that's so long that the VMs crash as they think their disk is offline... Under bobtail I never had any problems with that.

Please HELP!

Greets,
Stefan
still cuttlefish recovery problems
Hello,

while the peering problems are gone with 0.61.4 (bug 5232), I'm still having heavy problems with recovery after an OSD or host restart. I'm seeing a lot of slow requests and stuck I/O from clients. I've opened a bug report here: http://tracker.ceph.com/issues/5401

I really would like to know if I'm the only one. Should I update to 0.66?

Greets,
Stefan
flatten rbd export / export-diff ?
Hi,

is there a way to flatten a chain of rbd export-diffs into a new image file? Or do I always have to:

rbd import <OLD BASE IMAGE>
rbd import-diff diff1
rbd import-diff diff1-2
rbd import-diff diff2-3
rbd import-diff diff3-4
rbd import-diff diff4-5

... and so on? I would like to apply the diffs on local disk and then import the new file.

Stefan
Re: flatten rbd export / export-diff ?
On 04.06.2013 17:23, Sage Weil wrote: On Tue, 4 Jun 2013, Stefan Priebe - Profihost AG wrote: Hi, is there a way to flatten a chain of rbd export-diffs into a new image file? [snip]

Not currently. The format is very simple, though; it should be pretty simple to implement a subcommand in the rbd tool to do it.

Oh, my C skills are more than limited ;-( I could do it in Perl ;-) Is there a format description?

Stefan
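For what it's worth, the stream that rbd export-diff writes is described in the ceph tree under doc/dev/rbd-diff.rst. Below is a rough Python sketch of applying one such diff to a local raw image file, assuming the v1 layout from that document (a "rbd diff v1\n" header, then one-byte-tagged records). This is untested illustration code, not part of any ceph tool:

import struct

def apply_diff(diff_path, image_path):
    """Apply one `rbd export-diff` stream (assumed v1 format) to a raw image.

    Assumed record tags per doc/dev/rbd-diff.rst:
      'f'/'t' = from/to snapshot name (le32 length + bytes),
      's' = image size (le64), 'w' = written extent (le64 offset,
      le64 length, data), 'z' = zeroed extent, 'e' = end.
    """
    with open(diff_path, 'rb') as diff, open(image_path, 'r+b') as img:
        assert diff.read(12) == b'rbd diff v1\n', 'not a v1 diff stream'
        while True:
            tag = diff.read(1)
            if not tag:
                raise ValueError('unexpected end of stream')
            if tag == b'e':                      # end record
                break
            elif tag in (b'f', b't'):            # snapshot names; skip
                (n,) = struct.unpack('<I', diff.read(4))
                diff.read(n)
            elif tag == b's':                    # (new) image size
                (size,) = struct.unpack('<Q', diff.read(8))
                img.truncate(size)
            elif tag == b'w':                    # updated data extent
                off, length = struct.unpack('<QQ', diff.read(16))
                img.seek(off)
                img.write(diff.read(length))
            elif tag == b'z':                    # zeroed extent
                off, length = struct.unpack('<QQ', diff.read(16))
                img.seek(off)
                img.write(b'\x00' * length)
            else:
                raise ValueError('unknown record tag %r' % tag)

With something along these lines one could rbd export the base image once, fold each export-diff into the local file, and rbd import the final result in a single step, which is the flattening workflow asked about above.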