Re: [ceph-users] RBD performance slowly degrades :-(
Hi Irek,

Thanks for the link. I have removed the SSDs for now and performance is up to 30MB/s on a benchmark now. To be honest, I knew the Samsung SSDs weren't great, but I did not expect them to be worse than plain hard disks.

Pieter

On Aug 12, 2015, at 01:09 PM, Irek Fasikhov malm...@gmail.com wrote:

Hi. Read this thread here: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17360.html

Best regards, Irek Nurgayazovich Fasikhov
Mob.: +79229045757

2015-08-12 14:52 GMT+03:00 Pieter Koorts pieter.koo...@me.com:

Hi

Something that's been bugging me for a while is that I am trying to diagnose iowait time within KVM guests. Guests doing reads or writes tend to run at about 50% to 90% iowait, but the host itself is only doing about 1% to 2% iowait. The result is that the guests are extremely slow.

I currently run 3x hosts, each with a single SSD OSD and a single HDD OSD in cache-tier writeback mode. Although the SSD (Samsung 850 EVO 120GB) is not a great one, it should at least perform reasonably compared to a hard disk, and doing some direct SSD tests I get approximately 100MB/s write and 200MB/s read on each SSD. When I run rados bench, though, the benchmark starts at a not-great-but-okay speed, and as the benchmark progresses it just gets slower and slower until it's worse than a USB hard drive. The SSD cache pool is 120GB in size (360GB RAW) and in use at about 90GB. I have tried tuning the XFS mount options as well, but it has had little effect. Understandably the server spec is not great, but I don't expect performance to be that bad.

OSD config:
[osd]
osd crush update on start = false
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M

Servers spec:
Dual Quad Core XEON E5410 and 32GB RAM in each server
10GBE @ 10G speed with 8000byte Jumbo Frames.
Rados bench result: (starts at 50MB/s average and plummets down to 11MB/s)

sudo rados bench -p rbd 50 write --no-cleanup -t 1
Maintaining 1 concurrent writes of 4194304 bytes for up to 50 seconds or 0 objects
Object prefix: benchmark_data_osc-mgmt-1_10007
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 1 14 13 51.9906 52 0.0671911 0.074661
2 1 27 26 51.9908 52 0.0631836 0.0751152
3 1 37 36 47.9921 40 0.0691167 0.0802425
4 1 51 50 49.9922 56 0.0816432 0.0795869
5 1 56 55 43.9934 20 0.208393 0.088523
6 1 61 60 39.994 20 0.241164 0.0999179
7 1 64 63 35.9934 12 0.239001 0.106577
8 1 66 65 32.4942 8 0.214354 0.122767
9 1 72 71 31.55 24 0.132588 0.125438
10 1 77 76 30.3948 20 0.256474 0.128548
11 1 79 78 28.3589 8 0.183564 0.138354
12 1 82 81 26.9956 12 0.345809 0.145523
13 1 85 84 25.842 12 0.373247 0.151291
14 1 86 85 24.2819 4 0.950586 0.160694
15 1 86 85 22.6632 0 - 0.160694
16 1 90 89 22.2466 8 0.204714 0.178352
17 1 94 93 21.8791 16 0.282236 0.180571
18 1 98 97 21.5524 16 0.262566 0.183742
19 1 101 100 21.0495 12 0.357659 0.187477
20 1 104 103 20.597 12 0.369327 0.192479
21 1 105 104 19.8066 4 0.373233 0.194217
22 1 105 104 18.9064 0 - 0.194217
23 1 106 105 18.2582 2 2.35078 0.214756
24 1 107 106 17.6642 4 0.680246 0.219147
25 1 109 108 17.2776 8 0.677688 0.229222
26 1 113 112 17.2283 16 0.29171 0.230487
27 1 117 116 17.1828 16 0.255915 0.231101
28 1 120 119 16.9976 12 0.412411 0.235122
29 1 120 119 16.4115 0 - 0.235122
30 1 120 119 15.8645 0 - 0.235122
31 1 120 119 15.3527 0 - 0.235122
32 1 122 121 15.1229 2 0.319309 0.262822
33 1 124 123 14.9071 8 0.344094 0.266201
34 1 127 126 14.8215 12 0.33534 0.267913
35 1 129 128 14.6266 8 0.355403 0.269241
36 1 132 131 14.5536 12 0.581528 0.274327
37 1 132 131 14.1603 0 - 0.274327
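(Not from the original thread, but a quick, low-risk check when a bench degrades like this is to watch per-OSD latency while the benchmark runs; the commands below are a sketch against a Firefly/Hammer-era cluster, not something Pieter posted:)

# show commit/apply latency per OSD while rados bench is running
ceph osd perf
# overall client and recovery throughput
ceph -w

If one OSD shows commit latencies far above the rest, that usually points at that SSD (or its journal) stalling on sync writes rather than a cluster-wide problem.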
Re: [ceph-users] Is there a limit for object size in CephFS?
4.0.6-300.fc22.x86_64

On Tue, Aug 11, 2015 at 10:24 PM, Yan, Zheng uker...@gmail.com wrote:

On Wed, Aug 12, 2015 at 5:33 AM, Hadi Montakhabi h...@cs.uh.edu wrote:

[sequential read]
readwrite=read
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1 # causes the kernel buffer and page cache to be invalidated
#nrfiles=1

[sequential write]
readwrite=write # randread randwrite
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1

[random read]
readwrite=randread
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1

[random write]
readwrite=randwrite
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1

I just tried the 4.2-rc kernel and everything went well. Which version of kernel were you using?

On Sun, Aug 9, 2015 at 9:27 PM, Yan, Zheng uker...@gmail.com wrote:
On Sun, Aug 9, 2015 at 8:57 AM, Hadi Montakhabi h...@cs.uh.edu wrote:
I am using fio. I use the kernel module to mount CephFS.
Please send the fio job file to us.

On Aug 8, 2015 10:52 AM, Ketor D d.ke...@gmail.com wrote:
Hi Hadi, which bench tool do you use? And how do you mount CephFS, ceph-fuse or kernel-cephfs?

On Fri, Aug 7, 2015 at 11:50 PM, Hadi Montakhabi h...@cs.uh.edu wrote:

Hello Cephers,

I am benchmarking CephFS. In one of my experiments, I change the object size. I start from 64kb. Every time I do reads and writes with different block sizes. After increasing the object size to 64MB and increasing the block size to 64MB, CephFS crashes (shown in the chart below). What I mean by crash is that when I do ceph -s or ceph -w it constantly reports reads, but the operation never finishes (even after a few days!). I have repeated this experiment for different underlying file systems (xfs and btrfs), and the same thing happens in both cases. What could be the reason for CephFS crashing? Is there a limit for object size in CephFS?

Thank you,
Hadi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
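(As an aside, since the job file above parameterizes the block size through ${BLOCKSIZE}: fio expands environment variables in job files, so each run of the experiment presumably looked roughly like the line below — the job file name is a guess, not taken from the thread.)

BLOCKSIZE=64m fio cephfs-bench.fio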
Re: [ceph-users] RBD performance slowly degrades :-(
Hello!

On Wed, Aug 12, 2015 at 02:30:59PM +, pieter.koorts wrote:

Hi Irek, Thanks for the link. I have removed the SSDs for now and performance is up to 30MB/s on a benchmark now. To be honest, I knew the Samsung SSDs weren't great but did not expect them to be worse than plain hard disks.

I had the same trouble with Samsung 840 EVO 1TB. 15 of 16 disks were terribly slow (about 3000 iops and up to 200 MBps per drive). All the drives were replaced by 850 EVO 250 GB and the problem was fixed. My SSDs had the latest firmware and were brand new at the moment of the test.

Pieter

Something that's been bugging me for a while is that I am trying to diagnose iowait time within KVM guests. Guests doing reads or writes tend to run at about 50% to 90% iowait, but the host itself is only doing about 1% to 2% iowait. So the result is the guests are extremely slow. I currently run 3x hosts, each with a single SSD and single HDD OSD in cache-tier writeback mode. Although the SSD (Samsung 850 EVO 120GB) is not a great one, it should at least perform reasonably compared to a hard disk, and doing some direct SSD tests I get approximately 100MB/s write and 200MB/s read on each SSD. When I run rados bench, though, the benchmark starts with a not great but okay speed and as the benchmark progresses it just gets slower and slower till it's worse than a USB hard drive. The SSD cache pool is 120GB in size (360GB RAW) and in use at about 90GB. I have tried tuning the XFS mount options as well but it has had little effect.

--
WBR, Max A. Krasilnikov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd rename snaps?
Hi, for mds there is the ability to rename snapshots. But for rbd i can't see one. Is there a way to rename a snapshot? Greets, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CEPH cache layer. Very slow
Hi all,

We have set up a CEPH cluster with 60 OSDs of 2 different types (5 nodes, 12 disks on each: 10 HDD, 2 SSD). We also cover this with a custom crushmap with 2 root leaves:

ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-100 5.0 root ssd
-102 1.0 host ix-s2-ssd
2 1.0 osd.2 up 1.0 1.0
9 1.0 osd.9 up 1.0 1.0
-103 1.0 host ix-s3-ssd
3 1.0 osd.3 up 1.0 1.0
7 1.0 osd.7 up 1.0 1.0
-104 1.0 host ix-s5-ssd
1 1.0 osd.1 up 1.0 1.0
6 1.0 osd.6 up 1.0 1.0
-105 1.0 host ix-s6-ssd
4 1.0 osd.4 up 1.0 1.0
8 1.0 osd.8 up 1.0 1.0
-106 1.0 host ix-s7-ssd
0 1.0 osd.0 up 1.0 1.0
5 1.0 osd.5 up 1.0 1.0
-1 5.0 root platter
-2 1.0 host ix-s2-platter
13 1.0 osd.13 up 1.0 1.0
17 1.0 osd.17 up 1.0 1.0
21 1.0 osd.21 up 1.0 1.0
27 1.0 osd.27 up 1.0 1.0
32 1.0 osd.32 up 1.0 1.0
37 1.0 osd.37 up 1.0 1.0
44 1.0 osd.44 up 1.0 1.0
48 1.0 osd.48 up 1.0 1.0
55 1.0 osd.55 up 1.0 1.0
59 1.0 osd.59 up 1.0 1.0
-3 1.0 host ix-s3-platter
14 1.0 osd.14 up 1.0 1.0
18 1.0 osd.18 up 1.0 1.0
23 1.0 osd.23 up 1.0 1.0
28 1.0 osd.28 up 1.0 1.0
33 1.0 osd.33 up 1.0 1.0
39 1.0 osd.39 up 1.0 1.0
43 1.0 osd.43 up 1.0 1.0
47 1.0 osd.47 up 1.0 1.0
54 1.0 osd.54 up 1.0 1.0
58 1.0 osd.58 up 1.0 1.0
-4 1.0 host ix-s5-platter
11 1.0 osd.11 up 1.0 1.0
16 1.0 osd.16 up 1.0 1.0
22 1.0 osd.22 up 1.0 1.0
26 1.0 osd.26 up 1.0 1.0
31 1.0 osd.31 up 1.0 1.0
36 1.0 osd.36 up 1.0 1.0
41 1.0 osd.41 up 1.0 1.0
46 1.0 osd.46 up 1.0 1.0
51 1.0 osd.51 up 1.0 1.0
56 1.0 osd.56 up 1.0 1.0
-5 1.0 host ix-s6-platter
12 1.0 osd.12 up 1.0 1.0
19 1.0 osd.19 up 1.0 1.0
24 1.0 osd.24 up 1.0 1.0
29 1.0 osd.29 up 1.0 1.0
34 1.0 osd.34 up 1.0 1.0
38 1.0 osd.38 up 1.0 1.0
42 1.0 osd.42 up 1.0 1.0
50 1.0 osd.50 up 1.0 1.0
53 1.0 osd.53 up 1.0 1.0
57 1.0 osd.57 up 1.0 1.0
-6 1.0 host ix-s7-platter
10 1.0 osd.10 up 1.0 1.0
15 1.0 osd.15 up 1.0 1.0
20 1.0 osd.20 up 1.0 1.0
25 1.0 osd.25 up 1.0 1.0
30 1.0 osd.30 up 1.0 1.0
35 1.0 osd.35 up 1.0 1.0
40 1.0 osd.40 up 1.0 1.0
45 1.0 osd.45 up 1.0 1.0
49 1.0 osd.49 up 1.0 1.0
52 1.0 osd.52 up 1.0 1.0

Then we created 2 pools, 1 on HDD (platters) and 1 on SSD, and put the SSD pool in front of the HDD pool (cache tier). Now we get very bad performance results from the cluster. Even with rados
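(The message is cut off above. For context — and not taken from the post — a writeback cache tier on Hammer is wired up roughly as sketched below; pool names are placeholders. One thing worth checking on a slow tier like this is whether target_max_bytes and the dirty/full ratios were ever set: without them the tiering agent has no target to flush or evict against, and the hot pool simply fills up.)

ceph osd tier add platter-pool ssd-pool
ceph osd tier cache-mode ssd-pool writeback
ceph osd tier set-overlay platter-pool ssd-pool
ceph osd pool set ssd-pool hit_set_type bloom
ceph osd pool set ssd-pool target_max_bytes 300000000000
ceph osd pool set ssd-pool cache_target_dirty_ratio 0.4
ceph osd pool set ssd-pool cache_target_full_ratio 0.8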
[ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
I ran a ceph osd reweight-by-utilization yesterday and partway through had a network interruption. After the network was restored the cluster continued to rebalance, but this morning the cluster has stopped rebalancing and the status will not change from:

# ceph status
cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
health HEALTH_WARN
1 pgs degraded
1 pgs stuck degraded
2 pgs stuck unclean
1 pgs stuck undersized
1 pgs undersized
recovery 8163/66089054 objects degraded (0.012%)
recovery 8194/66089054 objects misplaced (0.012%)
monmap e24: 3 mons at {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
election epoch 250, quorum 0,1,2 mon1,mon2,mon3
osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
251 TB used, 111 TB / 363 TB avail
8163/66089054 objects degraded (0.012%)
8194/66089054 objects misplaced (0.012%)
4142 active+clean
1 active+undersized+degraded
1 active+remapped

# ceph health detail
HEALTH_WARN 1 pgs degraded; 1 pgs stuck degraded; 2 pgs stuck unclean; 1 pgs stuck undersized; 1 pgs undersized; recovery 8163/66089054 objects degraded (0.012%); recovery 8194/66089054 objects misplaced (0.012%)
pg 2.e7f is stuck unclean for 65125.554509, current state active+remapped, last acting [58,5]
pg 2.782 is stuck unclean for 65140.681540, current state active+undersized+degraded, last acting [76]
pg 2.782 is stuck undersized for 60568.221461, current state active+undersized+degraded, last acting [76]
pg 2.782 is stuck degraded for 60568.221549, current state active+undersized+degraded, last acting [76]
pg 2.782 is active+undersized+degraded, acting [76]
recovery 8163/66089054 objects degraded (0.012%)
recovery 8194/66089054 objects misplaced (0.012%)

# ceph pg 2.e7f query
recovery_state: [ { name: Started\/Primary\/Active, enter_time: 2015-08-11 15:43:09.190269, might_have_unfound: [], recovery_progress: { backfill_targets: [], waiting_on_backfill: [], last_backfill_started: 0\/\/0\/\/-1, backfill_info: { begin: 0\/\/0\/\/-1, end: 0\/\/0\/\/-1, objects: [] }, peer_backfill_info: [], backfills_in_flight: [], recovering: [], pg_backend: { pull_from_peer: [], pushing: [] } }, scrub: { scrubber.epoch_start: 0, scrubber.active: 0, scrubber.waiting_on: 0, scrubber.waiting_on_whom: [] } }, { name: Started, enter_time: 2015-08-11 15:43:04.955796 } ],

# ceph pg 2.782 query
recovery_state: [ { name: Started\/Primary\/Active, enter_time: 2015-08-11 15:42:42.178042, might_have_unfound: [ { osd: 5, status: not queried } ], recovery_progress: { backfill_targets: [], waiting_on_backfill: [], last_backfill_started: 0\/\/0\/\/-1, backfill_info: { begin: 0\/\/0\/\/-1, end: 0\/\/0\/\/-1, objects: [] }, peer_backfill_info: [], backfills_in_flight: [], recovering: [], pg_backend: { pull_from_peer: [], pushing: [] } }, scrub: { scrubber.epoch_start: 0, scrubber.active: 0, scrubber.waiting_on: 0, scrubber.waiting_on_whom: [] } }, { name: Started, enter_time: 2015-08-11 15:42:41.139709 } ], agent_state: {}

I tried restarting osd.5/58/76 but no change. Any suggestions?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
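(Not from the original thread — just a sketch of the usual next steps on a Hammer-era cluster, under the assumption that the acting OSD never fully re-peered after the network blip:)

# list everything that is wedged, then look at the primary's view of pg 2.782
ceph pg dump_stuck unclean
ceph pg 2.782 query
# marking the acting OSD down (not out) forces it to re-peer, which sometimes
# clears a PG stuck like this after a network interruption
ceph osd down 76

The pg 2.782 query output above also shows might_have_unfound pointing at osd.5 with status "not queried", so confirming that osd.5 is reachable from osd.76 would be the other thing to check.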
Re: [ceph-users] mds server(s) crashed
If I am using a more recent client (kernel OR ceph-fuse), should I still be worried about the MDSs crashing? I have added RAM to my MDS hosts and it's my understanding this will also help mitigate any issues, in addition to setting mds_bal_frag = true. Not having used cephfs before, do I always need to worry about my MDS servers crashing all the time, thus the need for setting mds_reconnect_timeout to 0? This is not ideal for us, nor is the idea of clients not being able to access their mounts after an MDS recovery.

I am actually looking for the most stable way to implement cephfs at this point. My cephfs cluster contains millions of small files, so many inodes, if that needs to be taken into account. Perhaps I should only be using one MDS node for stability at this point? Is this the best way forward to get a handle on stability? I'm also curious whether I should set my mds cache size to a number greater than the number of files I have in the cephfs cluster?

If you can give some key points on configuring cephfs to get the best stability and, if possible, availability, this would be helpful to me. Thanks again for the help.

thanks,
Bob

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds server(s) crashed
I have also encountered a problem: a standby MDS cannot be promoted to active when the active MDS service is stopped, which has bothered me for several days. Maybe a multi-MDS cluster can solve these problems, but the Ceph team hasn't released that feature yet.

yangyongp...@bwstor.com.cn

From: Yan, Zheng
Date: 2015-08-13 10:21
To: Bob Ababurko
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mds server(s) crashed

On Thu, Aug 13, 2015 at 7:05 AM, Bob Ababurko b...@ababurko.net wrote: If I am using a more recent client(kernel OR ceph-fuse), should I still be worried about the MDS's crashing? I have added RAM to my MDS hosts and its my understanding this will also help mitigate any issues, in addition to setting mds_bal_frag = true. Not having used cephfs before, do I always need to worry about my MDS servers crashing all the time, thus the need for setting mds_reconnect_timeout to 0? This is not ideal for us nor is the idea of clients not able to access their mounts after a MDS recovery.

It's unlikely this issue will happen again. But I can't guarantee no other issue. no need to set mds_reconnect_timeout to 0.

I am actually looking for the most stable way to implement cephfs at this point. My cephfs cluster contains millions of small files, so many inodes if that needs to be taken into account. Perhaps I should only be using one MDS node for stability at this point? Is this the best way forward to get a handle on stability? I'm also curious if I should I set my mds cache size to a number greater than files I have in the cephfs cluster? If you can give some key points to configure cephfs to get the best stability and if possible, availability.this would be helpful to me.

One active MDS is the most stable setup. Adding a few standby MDS should not hurt stability. You can't set mds cache size to a number greater than files in the fs, it requires lots of memory.

Yan, Zheng

thanks again for the help. thanks, Bob

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds server(s) crashed
On Wed, Aug 12, 2015 at 7:21 PM, Yan, Zheng uker...@gmail.com wrote: On Thu, Aug 13, 2015 at 7:05 AM, Bob Ababurko b...@ababurko.net wrote: If I am using a more recent client(kernel OR ceph-fuse), should I still be worried about the MDS's crashing? I have added RAM to my MDS hosts and its my understanding this will also help mitigate any issues, in addition to setting mds_bal_frag = true. Not having used cephfs before, do I always need to worry about my MDS servers crashing all the time, thus the need for setting mds_reconnect_timeout to 0? This is not ideal for us nor is the idea of clients not able to access their mounts after a MDS recovery. It's unlikely this issue will happen again. But I can't guarantee no other issue. no need to set mds_reconnect_timeout to 0. ok, Good to know. I am actually looking for the most stable way to implement cephfs at this point. My cephfs cluster contains millions of small files, so many inodes if that needs to be taken into account. Perhaps I should only be using one MDS node for stability at this point? Is this the best way forward to get a handle on stability? I'm also curious if I should I set my mds cache size to a number greater than files I have in the cephfs cluster? If you can give some key points to configure cephfs to get the best stability and if possible, availability.this would be helpful to me. One active MDS is the most stable setup. Adding a few standby MDS should not hurt stability. You can't set mds cache size to a number greater than files in the fs, it requires lots of memory. I'm not sure what amount of RAM you consider to be 'lots' but I would really like to understand a bit more about this. Perhaps a rule of thumb? It there an advantage to more RAM large mds cache size? We plan on putting close to a billion small files in this pool via cephfs so what should we be considering when sizing our MDS hosts OR change to the MDS config? Basically, what should we OR should not be doing when we have a cluster with this many files? Thanks! Yan, Zheng thanks again for the help. thanks, Bob ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
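(Not an authoritative answer to Bob's rule-of-thumb question — just a rough sketch of the usual sizing approach for this era of Ceph, where the MDS cache is counted in inodes rather than bytes. A commonly quoted ballpark is a few kilobytes of MDS memory per cached inode/dentry, so the cache is sized to the active working set, not the total file count. The number below is purely illustrative:)

[mds]
# roughly the number of inodes the active clients touch concurrently;
# at a few KB each, 2M entries implies several GB of RAM headroom on the MDS host
mds cache size = 2000000

For close to a billion small files the point is that the MDS never needs to cache them all; it needs enough RAM for the hot set plus headroom for recovery/rejoin, which is where an undersized MDS host usually hurts.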
Re: [ceph-users] mds server(s) crashed
On Thu, Aug 13, 2015 at 7:05 AM, Bob Ababurko b...@ababurko.net wrote: If I am using a more recent client(kernel OR ceph-fuse), should I still be worried about the MDS's crashing? I have added RAM to my MDS hosts and its my understanding this will also help mitigate any issues, in addition to setting mds_bal_frag = true. Not having used cephfs before, do I always need to worry about my MDS servers crashing all the time, thus the need for setting mds_reconnect_timeout to 0? This is not ideal for us nor is the idea of clients not able to access their mounts after a MDS recovery. It's unlikely this issue will happen again. But I can't guarantee no other issue. no need to set mds_reconnect_timeout to 0. I am actually looking for the most stable way to implement cephfs at this point. My cephfs cluster contains millions of small files, so many inodes if that needs to be taken into account. Perhaps I should only be using one MDS node for stability at this point? Is this the best way forward to get a handle on stability? I'm also curious if I should I set my mds cache size to a number greater than files I have in the cephfs cluster? If you can give some key points to configure cephfs to get the best stability and if possible, availability.this would be helpful to me. One active MDS is the most stable setup. Adding a few standby MDS should not hurt stability. You can't set mds cache size to a number greater than files in the fs, it requires lots of memory. Yan, Zheng thanks again for the help. thanks, Bob ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds server(s) crashed
On Wed, Aug 12, 2015 at 5:08 AM, Bob Ababurko b...@ababurko.net wrote: What is risky about enabling mds_bal_frag on a cluster with data and will there be any performance degradation if enabled? No specific gotchas, just that it is not something that has especially good coverage in our automated tests. We recently enabled it for our general purpose tests (i.e. not specifically exercising fragmentation, just normal fs workloads) and nothing's blown up horribly, but that's about as far as the assurance goes. Performance wise, there's a cost associated with splitting things up and merging them, but a benefit associated with having smaller fragments in general. Probably doesn't make a huge difference outside of the initial part where your existing large directories would all be getting fragmented all of a sudden. John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
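(For completeness — this snippet is not from John's message — enabling directory fragmentation on this era of Ceph is just a config switch applied to all MDS daemons, followed by an MDS restart; a minimal sketch:)

[mds]
mds bal frag = true

Existing oversized directories are then split as they are touched, which lines up with the initial burst of splitting John describes above.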
Re: [ceph-users] Semi-reproducible crash of ceph-fuse
Jörg Henne hennejg@... writes:

we are running ceph version 0.94.2 with a cephfs mounted using ceph-fuse on Ubuntu 14.04 LTS. I think we have found a bug that lets us semi-reproducibly crash the ceph-fuse process.

Reported as http://tracker.ceph.com/issues/12674

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] osd out
Hello. Could you please help me to remove an OSD from the cluster?

# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.02998 root default
-2 0.00999 host ceph1
0 0.00999 osd.0 up 1.0 1.0
-3 0.00999 host ceph2
1 0.00999 osd.1 up 1.0 1.0
-4 0.00999 host ceph3
2 0.00999 osd.2 up 1.0 1.0

# ceph -s
cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa
health HEALTH_OK
monmap e1: 3 mons at {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0}
election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3
osdmap e76: 3 osds: 3 up, 3 in
pgmap v328: 128 pgs, 1 pools, 10 bytes data, 1 objects
120 MB used, 45926 MB / 46046 MB avail
128 active+clean

# ceph osd out 0
marked out osd.0.

# ceph -w
cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa
health HEALTH_WARN
128 pgs stuck unclean
recovery 1/3 objects misplaced (33.333%)
monmap e1: 3 mons at {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0}
election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3
osdmap e79: 3 osds: 3 up, 2 in; 128 remapped pgs
pgmap v332: 128 pgs, 1 pools, 10 bytes data, 1 objects
89120 kB used, 30610 MB / 30697 MB avail
1/3 objects misplaced (33.333%)
128 active+remapped
2015-08-12 18:43:12.412286 mon.0 [INF] pgmap v332: 128 pgs: 128 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 objects misplaced (33.333%)
2015-08-12 18:43:20.362337 mon.0 [INF] HEALTH_WARN; 128 pgs stuck unclean; recovery 1/3 objects misplaced (33.333%)
2015-08-12 18:44:15.055825 mon.0 [INF] pgmap v333: 128 pgs: 128 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 objects misplaced (33.333%)

It never becomes active+clean. What am I doing wrong?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10
An update: it seems that I am running into a memory shortage. Even with 32 GB for 20 OSDs and 2 GB of swap, ceph-osd uses all available memory. I created another swap device with 10 GB, and I managed to get the failed OSD running without a crash, but it consumed an extra 5 GB. Are there known memory issues with ceph-osd? And I still have the problem of the incomplete+inactive PG.

Regards.
Gerd

On 12-08-2015 10:11, Gerd Jakobovitsch wrote:

I tried it; the error propagates to whichever OSD gets the errored PG. For the moment, this is my worst problem. I have one PG incomplete+inactive, and the OSD with the highest priority in it gets 100 blocked requests (I guess that is the maximum), and, although running, doesn't get other requests - for example, ceph tell osd.21 injectargs '--osd-max-backfills 1'. After some time, it crashes, and the blocked requests go to the second OSD for the errored PG. I can't get rid of these slow requests.

I guessed a problem with leveldb. I checked, and had the default version for debian wheezy (0+20120530.gitdd0d562-1). I updated it to wheezy-backports (1.17-1~bpo70+1), but the error was the same. I use the regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:

it seems like a leveldb problem. could you just kick it out and add a new osd to make cluster healthy firstly?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Dear all,

I run a ceph system with 4 nodes and ~80 OSDs using xfs, with currently 75% usage, running firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and since then I have had several OSDs crashing and never recovering: trying to run one ends up crashing as follows. Is this problem known? Is there any configuration that should be checked? Any way to try to recover these OSDs without losing all data?

After that, setting the OSD to lost, I got one incomplete, inactive PG. Is there any way to recover it? Data still exists in the crashed OSDs.

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700 1 monclient(hunting): found mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700 5 monclient: authenticate success, global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location {host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780 0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]#
2015-08-11 12:58:21.348094 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal /var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is 54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780 0 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780 1 filestore(/var/lib/ceph/osd/ceph-7) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
in thread 7f200fa8f780
ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
1: /usr/bin/ceph-osd() [0xab7562]
2: (()+0xf0a0) [0x7f200efcd0a0]
3: (gsignal()+0x35) [0x7f200db3f165]
4: (abort()+0x180) [0x7f200db423e0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
6: (()+0x63996) [0x7f200e393996]
7: (()+0x639c3) [0x7f200e3939c3]
8: (()+0x63bee) [0x7f200e393bee]
9: (tc_new()+0x48e) [0x7f200f213aee]
10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocatorchar const)+0x59) [0x7f200e3ef999]
11: (std::string::_Rep::_M_clone(std::allocatorchar const, unsigned long)+0x28) [0x7f200e3f0708]
12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7f200f46ffa2]
15:
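(Not part of Gerd's message — the trace above dies inside tc_new(), i.e. a tcmalloc allocation made while replaying the leveldb log, which would fit the memory-exhaustion picture described earlier in the thread. A quick, non-invasive way to see whether tcmalloc is simply holding on to freed memory, and to hand it back, is the heap admin command; the osd id is taken from the post:)

ceph tell osd.7 heap stats
ceph tell osd.7 heap release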
Re: [ceph-users] osd out
If you are using the default configuration to create the pool (3 replicas), after losing 1 OSD and having 2 left, CRUSH would not be able to find enough OSDs (at least 3) to map the PG, so it would stay stuck at unclean.

Thanks,
Guang

From: chm...@yandex.ru
Date: Wed, 12 Aug 2015 19:46:01 +0300
To: ceph-users@lists.ceph.com
Subject: [ceph-users] osd out

Hello. Could you please help me to remove osd from cluster; # ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -1 0.02998 root default -2 0.00999 host ceph1 0 0.00999 osd.0 up 1.0 1.0 -3 0.00999 host ceph2 1 0.00999 osd.1 up 1.0 1.0 -4 0.00999 host ceph3 2 0.00999 osd.2 up 1.0 1.0 # ceph -s cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa health HEALTH_OK monmap e1: 3 mons at {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0} election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3 osdmap e76: 3 osds: 3 up, 3 in pgmap v328: 128 pgs, 1 pools, 10 bytes data, 1 objects 120 MB used, 45926 MB / 46046 MB avail 128 active+clean # ceph osd out 0 marked out osd.0. # ceph -w cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa health HEALTH_WARN 128 pgs stuck unclean recovery 1/3 objects misplaced (33.333%) monmap e1: 3 mons at {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0} election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3 osdmap e79: 3 osds: 3 up, 2 in; 128 remapped pgs pgmap v332: 128 pgs, 1 pools, 10 bytes data, 1 objects 89120 kB used, 30610 MB / 30697 MB avail 1/3 objects misplaced (33.333%) 128 active+remapped 2015-08-12 18:43:12.412286 mon.0 [INF] pgmap v332: 128 pgs: 128 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 objects misplaced (33.333%) 2015-08-12 18:43:20.362337 mon.0 [INF] HEALTH_WARN; 128 pgs stuck unclean; recovery 1/3 objects misplaced (33.333%) 2015-08-12 18:44:15.055825 mon.0 [INF] pgmap v333: 128 pgs: 128 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 objects misplaced (33.333%) and it never become active+clean . What I’m doing wrong ?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
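(Guang's point in practice — not from the thread itself, just a sketch assuming the default 'rbd' pool and osd.0 from the original post. Either add a fourth OSD before evacuating one, or temporarily drop the replica count so two OSDs can satisfy the CRUSH rule, then do the usual removal sequence:)

ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
ceph osd out 0
# on host ceph1, stop the daemon:
sudo service ceph stop osd.0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0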
Re: [ceph-users] CEPH cache layer. Very slow
Hi Igor I suspect you have very much the same problem as me. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html Basically Samsung drives (like many SATA SSD's) are very much hit and miss so you will need to test them like described here to see if they are any good. http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ To give you an idea my average performance went from 11MB/s (with Samsung SSD) to 30MB/s (without any SSD) on write performance. This is a very small cluster. Pieter On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, we have setup CEPH cluster with 60 OSD (2 diff types) (5 nodes, 12 disks on each, 10 HDD, 2 SSD) Also we cover this with custom crushmap with 2 root leaf ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -100 5.0 root ssd -102 1.0 host ix-s2-ssd 2 1.0 osd.2 up 1.0 1.0 9 1.0 osd.9 up 1.0 1.0 -103 1.0 host ix-s3-ssd 3 1.0 osd.3 up 1.0 1.0 7 1.0 osd.7 up 1.0 1.0 -104 1.0 host ix-s5-ssd 1 1.0 osd.1 up 1.0 1.0 6 1.0 osd.6 up 1.0 1.0 -105 1.0 host ix-s6-ssd 4 1.0 osd.4 up 1.0 1.0 8 1.0 osd.8 up 1.0 1.0 -106 1.0 host ix-s7-ssd 0 1.0 osd.0 up 1.0 1.0 5 1.0 osd.5 up 1.0 1.0 -1 5.0 root platter -2 1.0 host ix-s2-platter 13 1.0 osd.13 up 1.0 1.0 17 1.0 osd.17 up 1.0 1.0 21 1.0 osd.21 up 1.0 1.0 27 1.0 osd.27 up 1.0 1.0 32 1.0 osd.32 up 1.0 1.0 37 1.0 osd.37 up 1.0 1.0 44 1.0 osd.44 up 1.0 1.0 48 1.0 osd.48 up 1.0 1.0 55 1.0 osd.55 up 1.0 1.0 59 1.0 osd.59 up 1.0 1.0 -3 1.0 host ix-s3-platter 14 1.0 osd.14 up 1.0 1.0 18 1.0 osd.18 up 1.0 1.0 23 1.0 osd.23 up 1.0 1.0 28 1.0 osd.28 up 1.0 1.0 33 1.0 osd.33 up 1.0 1.0 39 1.0 osd.39 up 1.0 1.0 43 1.0 osd.43 up 1.0 1.0 47 1.0 osd.47 up 1.0 1.0 54 1.0 osd.54 up 1.0 1.0 58 1.0 osd.58 up 1.0 1.0 -4 1.0 host ix-s5-platter 11 1.0 osd.11 up 1.0 1.0 16 1.0 osd.16 up 1.0 1.0 22 1.0 osd.22 up 1.0 1.0 26 1.0 osd.26 up 1.0 1.0 31 1.0 osd.31 up 1.0 1.0 36 1.0 osd.36 up 1.0 1.0 41 1.0 osd.41 up 1.0 1.0 46 1.0 osd.46 up 1.0 1.0 51 1.0 osd.51 up 1.0 1.0 56 1.0 osd.56 up 1.0 1.0 -5 1.0 host ix-s6-platter 12 1.0 osd.12 up 1.0 1.0 19 1.0 osd.19 up 1.0 1.0 24 1.0 osd.24 up 1.0 1.0 29 1.0 osd.29 up 1.0 1.0 34 1.0 osd.34 up 1.0 1.0 38 1.0 osd.38 up 1.0 1.0 42 1.0 osd.42 up 1.0 1.0 50 1.0 osd.50 up 1.0 1.0 53 1.0 osd.53 up 1.0 1.0 57 1.0 osd.57 up 1.0 1.0 -6 1.0 host ix-s7-platter 10 1.0 osd.10 up 1.0 1.0 15 1.0 osd.15 up 1.0 1.0 20 1.0 osd.20 up 1.0 1.0 25 1.0
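(For reference, the test in Sebastien's post boils down to measuring small O_DSYNC writes directly against the device - the same pattern Ceph journals use, and the one consumer Samsungs tend to fall over on. It runs roughly along these lines; the device name is a placeholder and the run destroys data on the device:)

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

A good journal/cache SSD sustains thousands of 4k sync IOPS in this test, while drives without proper power-loss protection often collapse to a few hundred, which would match the rados bench curves seen in this thread.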
Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10
I tried it, the error propagates to whichever OSD gets the errorred PG. For the moment, this is my worst problem. I have one PG incomplete+inactive, and the OSD with the highest priority in it gets 100 blocked requests (I guess that is the maximum), and, although running, doesn't get other requests - for example, ceph tell osd.21 injectargs '--osd-max-backfills 1'. After some time, it crashes, and the blocked requests go to the second OSD for the errorred PG. I can't get rid of these slow requests. I guessed a problem with leveldb, I checked, and had the default version for debian wheezy (0+20120530.gitdd0d562-1). I updated it for wheezy-backports (1.17-1~bpo70+1), but the error was the same. I use regular wheezy kernel (3.2+46). On 11-08-2015 23:52, Haomai Wang wrote: it seems like a leveldb problem. could you just kick it out and add a new osd to make cluster healthy firstly? On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote: Dear all, I run a ceph system with 4 nodes and ~80 OSDs using xfs, with currently 75% usage, running firefly. On friday I upgraded it from 0.80.8 to 0.80.10, and since then I got several OSDs crashing and never recovering: trying to run it, ends up crashing as follows. Is this problem known? Is there any configuration that should be checked? Any way to try to recover these OSDs without losing all data? After that, setting the OSD to lost, I got one incomplete, inactive PG. Is there any way to recover it? Data still exists in crashed OSDs. Regards. [(12:58:13) root@spcsnp3 ~]# service ceph start osd.7 === osd.7 === 2015-08-11 12:58:21.003876 7f17ed52b700 1 monclient(hunting): found mon.spcsmp2 2015-08-11 12:58:21.003915 7f17ef493700 5 monclient: authenticate success, global_id 206010466 create-or-move updated item name 'osd.7' weight 3.64 at location {host=spcsnp3,root=default} to crush map Starting Ceph osd.7 on spcsnp3... 
2015-08-11 12:58:21.279878 7f200fa8f780 0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918 starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7/journal [(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) dump_stop 2015-08-11 12:58:21.348291 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal /var/lib/ceph/osd/ceph-7/journal 2015-08-11 12:58:21.348326 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is 54c136da-c51c-4799-b2dc-b7988982ee00 2015-08-11 12:58:21.349010 7f200fa8f780 0 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs) 2015-08-11 12:58:21.349026 7f200fa8f780 1 filestore(/var/lib/ceph/osd/ceph-7) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs 2015-08-11 12:58:21.353277 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is supported and appears to work 2015-08-11 12:58:21.353302 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2015-08-11 12:58:21.362106 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syscall(SYS_syncfs, fd) fully supported 2015-08-11 12:58:21.362195 7f200fa8f780 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf 2015-08-11 12:58:21.362701 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995 2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) ** in thread 7f200fa8f780 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70) 1: /usr/bin/ceph-osd() [0xab7562] 2: (()+0xf0a0) [0x7f200efcd0a0] 3: (gsignal()+0x35) [0x7f200db3f165] 4: (abort()+0x180) [0x7f200db423e0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d] 6: (()+0x63996) [0x7f200e393996] 7: (()+0x639c3) [0x7f200e3939c3] 8: (()+0x63bee) [0x7f200e393bee] 9: (tc_new()+0x48e) [0x7f200f213aee] 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocatorchar const)+0x59) [0x7f200e3ef999] 11: (std::string::_Rep::_M_clone(std::allocatorchar const, unsigned long)+0x28) [0x7f200e3f0708] 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0] 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5] 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7f200f46ffa2] 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*, unsigned long*)+0x180) [0x7f200f468360] 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2) [0x7f200f46adf2] 17: (leveldb::DB::Open(leveldb::Options const, std::string const, leveldb::DB**)+0xff) [0x7f200f46b11f] 18: (LevelDBStore::do_open(std::ostream, bool)+0xd8) [0xa123a8] 19: (FileStore::mount()+0x18e0) [0x9b7080] 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
Re: [ceph-users] RBD performance slowly degrades :-(
Hi. Read this thread here: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17360.html С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2015-08-12 14:52 GMT+03:00 Pieter Koorts pieter.koo...@me.com: Hi Something that's been bugging me for a while is I am trying to diagnose iowait time within KVM guests. Guests doing reads or writes tend do about 50% to 90% iowait but the host itself is only doing about 1% to 2% iowait. So the result is the guests are extremely slow. I currently run 3x hosts each with a single SSD and single HDD OSD in cache-teir writeback mode. Although the SSD (Samsung 850 EVO 120GB) is not a great one it should at least perform reasonably compared to a hard disk and doing some direct SSD tests I get approximately 100MB/s write and 200MB/s read on each SSD. When I run rados bench though, the benchmark starts with a not great but okay speed and as the benchmark progresses it just gets slower and slower till it's worse than a USB hard drive. The SSD cache pool is 120GB in size (360GB RAW) and in use at about 90GB. I have tried tuning the XFS mount options as well but it has had little effect. Understandably the server spec is not great but I don't expect performance to be that bad. *OSD config:* [osd] osd crush update on start = false osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M *Servers spec:* Dual Quad Core XEON E5410 and 32GB RAM in each server 10GBE @ 10G speed with 8000byte Jumbo Frames. *Rados bench result:* (starts at 50MB/s average and plummets down to 11MB/s) sudo rados bench -p rbd 50 write --no-cleanup -t 1 Maintaining 1 concurrent writes of 4194304 bytes for up to 50 seconds or 0 objects Object prefix: benchmark_data_osc-mgmt-1_10007 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 11413 51.990652 0.0671911 0.074661 2 12726 51.990852 0.0631836 0.0751152 3 13736 47.992140 0.0691167 0.0802425 4 15150 49.992256 0.0816432 0.0795869 5 15655 43.993420 0.208393 0.088523 6 1616039.99420 0.241164 0.0999179 7 16463 35.993412 0.239001 0.106577 8 16665 32.4942 8 0.214354 0.122767 9 17271 31.5524 0.132588 0.125438 10 17776 30.394820 0.256474 0.128548 11 17978 28.3589 8 0.183564 0.138354 12 18281 26.995612 0.345809 0.145523 13 1858425.84212 0.373247 0.151291 14 18685 24.2819 4 0.950586 0.160694 15 18685 22.6632 0 - 0.160694 16 19089 22.2466 8 0.204714 0.178352 17 19493 21.879116 0.282236 0.180571 18 19897 21.552416 0.262566 0.183742 19 1 101 100 21.049512 0.357659 0.187477 20 1 104 10320.59712 0.369327 0.192479 21 1 105 104 19.8066 4 0.373233 0.194217 22 1 105 104 18.9064 0 - 0.194217 23 1 106 105 18.2582 2 2.35078 0.214756 24 1 107 106 17.6642 4 0.680246 0.219147 25 1 109 108 17.2776 8 0.677688 0.229222 26 1 113 112 17.228316 0.29171 0.230487 27 1 117 116 17.182816 0.255915 0.231101 28 1 120 119 16.997612 0.412411 0.235122 29 1 120 119 16.4115 0 - 0.235122 30 1 120 119 15.8645 0 - 0.235122 31 1 120 119 15.3527 0 - 0.235122 32 1 122 121 15.1229 2 0.319309 0.262822 33 1 124 123 14.9071 8 0.344094 0.266201 34 1 127 126 14.821512 0.33534 0.267913 35 1 129 128 14.6266 8 0.355403 0.269241 36 1 132 131 14.553612 0.581528 0.274327 37 1 132 131 14.1603 0 - 0.274327 38 1 133 132 13.8929 2 1.43621 0.28313 39 1 134 133 13.6392 4 0.894817 0.287729 40 1 134 133 13.2982 0 - 0.287729 41 1
[ceph-users] RBD performance slowly degrades :-(
Hi Something that's been bugging me for a while is I am trying to diagnose iowait time within KVM guests. Guests doing reads or writes tend do about 50% to 90% iowait but the host itself is only doing about 1% to 2% iowait. So the result is the guests are extremely slow. I currently run 3x hosts each with a single SSD and single HDD OSD in cache-teir writeback mode. Although the SSD (Samsung 850 EVO 120GB) is not a great one it should at least perform reasonably compared to a hard disk and doing some direct SSD tests I get approximately 100MB/s write and 200MB/s read on each SSD. When I run rados bench though, the benchmark starts with a not great but okay speed and as the benchmark progresses it just gets slower and slower till it's worse than a USB hard drive. The SSD cache pool is 120GB in size (360GB RAW) and in use at about 90GB. I have tried tuning the XFS mount options as well but it has had little effect. Understandably the server spec is not great but I don't expect performance to be that bad. OSD config: [osd] osd crush update on start = false osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M Servers spec: Dual Quad Core XEON E5410 and 32GB RAM in each server 10GBE @ 10G speed with 8000byte Jumbo Frames. Rados bench result: (starts at 50MB/s average and plummets down to 11MB/s) sudo rados bench -p rbd 50 write --no-cleanup -t 1 Maintaining 1 concurrent writes of 4194304 bytes for up to 50 seconds or 0 objects Object prefix: benchmark_data_osc-mgmt-1_10007 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 1 14 13 51.9906 52 0.0671911 0.074661 2 1 27 26 51.9908 52 0.0631836 0.0751152 3 1 37 36 47.9921 40 0.0691167 0.0802425 4 1 51 50 49.9922 56 0.0816432 0.0795869 5 1 56 55 43.9934 20 0.208393 0.088523 6 1 61 60 39.994 20 0.241164 0.0999179 7 1 64 63 35.9934 12 0.239001 0.106577 8 1 66 65 32.4942 8 0.214354 0.122767 9 1 72 71 31.55 24 0.132588 0.125438 10 1 77 76 30.3948 20 0.256474 0.128548 11 1 79 78 28.3589 8 0.183564 0.138354 12 1 82 81 26.9956 12 0.345809 0.145523 13 1 85 84 25.842 12 0.373247 0.151291 14 1 86 85 24.2819 4 0.950586 0.160694 15 1 86 85 22.6632 0 - 0.160694 16 1 90 89 22.2466 8 0.204714 0.178352 17 1 94 93 21.8791 16 0.282236 0.180571 18 1 98 97 21.5524 16 0.262566 0.183742 19 1 101 100 21.0495 12 0.357659 0.187477 20 1 104 103 20.597 12 0.369327 0.192479 21 1 105 104 19.8066 4 0.373233 0.194217 22 1 105 104 18.9064 0 - 0.194217 23 1 106 105 18.2582 2 2.35078 0.214756 24 1 107 106 17.6642 4 0.680246 0.219147 25 1 109 108 17.2776 8 0.677688 0.229222 26 1 113 112 17.2283 16 0.29171 0.230487 27 1 117 116 17.1828 16 0.255915 0.231101 28 1 120 119 16.9976 12 0.412411 0.235122 29 1 120 119 16.4115 0 - 0.235122 30 1 120 119 15.8645 0 - 0.235122 31 1 120 119 15.3527 0 - 0.235122 32 1 122 121 15.1229 2 0.319309 0.262822 33 1 124 123 14.9071 8 0.344094 0.266201 34 1 127 126 14.8215 12 0.33534 0.267913 35 1 129 128 14.6266 8 0.355403 0.269241 36 1 132 131 14.5536 12 0.581528 0.274327 37 1 132 131 14.1603 0 - 0.274327 38 1 133 132 13.8929 2 1.43621 0.28313 39 1 134 133 13.6392 4 0.894817 0.287729 40 1 134 133 13.2982 0 - 0.287729 41 1 135 134 13.0714 2 1.87878 0.299602 42 1 138 137 13.0459 12 0.309637 0.304601 43 1 140 139 12.9285 8 0.302935 0.304491 44 1 141 140 12.7256 4 1.5538 0.313415 45
[ceph-users] Cache tier best practices
Hi, I would like to hear from people who use cache tier in Ceph about best practices and things I should avoid. I remember hearing that it wasn't that stable back then. Has it changed in Hammer release? Any tips and tricks are much appreciated! Thanks Dominik ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache tier best practices
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dominik Zalewski Sent: 12 August 2015 14:40 To: ceph-us...@ceph.com Subject: [ceph-users] Cache tier best practices Hi, I would like to hear from people who use cache tier in Ceph about best practices and things I should avoid. I remember hearing that it wasn't that stable back then. Has it changed in Hammer release? It's not so much the stability, but the performance. If your working set will sit mostly in the cache tier and won't tend to change then you might be alright. Otherwise you will find that performance is very poor. Only tip I can really give is that I have found dropping the RBD block size down to 512kb-1MB helps quite a bit as it makes the cache more effective and also minimises the amount of data transferred on each promotion/flush. Any tips and tricks are much appreciated! Thanks Dominik ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
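(To make Nick's tip concrete - on Hammer the object size is fixed at image creation time via --order, where the object size is 2^order bytes; the image name and size below are only illustrative:)

# 2^19 = 512 KB objects (the default is --order 22 = 4 MB)
rbd create rbd/vm-disk-1 --size 102400 --order 19

Smaller objects mean each cache-tier promotion or flush moves 512 KB instead of 4 MB, at the cost of more objects per image.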
Re: [ceph-users] rbd rename snaps?
There currently is no mechanism to rename snapshots without hex editing the RBD image header data structure. I created a new Ceph feature request [1] to add this ability in the future. [1] http://tracker.ceph.com/issues/12678 -- Jason Dillaman Red Hat Ceph Storage Engineering dilla...@redhat.com http://www.redhat.com - Original Message - From: Stefan Priebe - Profihost AG s.pri...@profihost.ag To: ceph-users@lists.ceph.com Sent: Wednesday, August 12, 2015 11:10:07 AM Subject: [ceph-users] rbd rename snaps? Hi, for mds there is the ability to rename snapshots. But for rbd i can't see one. Is there a way to rename a snapshot? Greets, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
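(A follow-up note for anyone finding this thread later: a rename subcommand did eventually land in the rbd CLI - to the best of my recollection around the Jewel release - so on newer clusters something along these lines should work; pool, image and snapshot names are examples only:)

rbd snap rename rbd/myimage@oldsnap rbd/myimage@newsnap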
Re: [ceph-users] osd out
Yeah. You are right. Thank you. On Aug 12, 2015, at 19:53, GuangYang yguan...@outlook.com wrote: If you are using the default configuration to create the pool (3 replicas), after losing 1 OSD and having 2 left, CRUSH would not be able to find enough OSDs (at least 3) to map the PG thus it would stuck at unclean. Thanks, Guang From: chm...@yandex.ru Date: Wed, 12 Aug 2015 19:46:01 +0300 To: ceph-users@lists.ceph.com Subject: [ceph-users] osd out Hello. Could you please help me to remove osd from cluster; # ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -1 0.02998 root default -2 0.00999 host ceph1 0 0.00999 osd.0 up 1.0 1.0 -3 0.00999 host ceph2 1 0.00999 osd.1 up 1.0 1.0 -4 0.00999 host ceph3 2 0.00999 osd.2 up 1.0 1.0 # ceph -s cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa health HEALTH_OK monmap e1: 3 mons at {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0} election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3 osdmap e76: 3 osds: 3 up, 3 in pgmap v328: 128 pgs, 1 pools, 10 bytes data, 1 objects 120 MB used, 45926 MB / 46046 MB avail 128 active+clean # ceph osd out 0 marked out osd.0. # ceph -w cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa health HEALTH_WARN 128 pgs stuck unclean recovery 1/3 objects misplaced (33.333%) monmap e1: 3 mons at {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0} election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3 osdmap e79: 3 osds: 3 up, 2 in; 128 remapped pgs pgmap v332: 128 pgs, 1 pools, 10 bytes data, 1 objects 89120 kB used, 30610 MB / 30697 MB avail 1/3 objects misplaced (33.333%) 128 active+remapped 2015-08-12 18:43:12.412286 mon.0 [INF] pgmap v332: 128 pgs: 128 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 objects misplaced (33.333%) 2015-08-12 18:43:20.362337 mon.0 [INF] HEALTH_WARN; 128 pgs stuck unclean; recovery 1/3 objects misplaced (33.333%) 2015-08-12 18:44:15.055825 mon.0 [INF] pgmap v333: 128 pgs: 128 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 objects misplaced (33.333%) and it never become active+clean . What I’m doing wrong ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com