[ceph-users] Samsung DC SV843 SSD
Hi Everyone, I'm looking for some SSDs for our cluster and came across the Samsung DC SV843. I noticed in the mailing lists from a while back that some people were talking about them. Just wondering if anyone ended up using them and how they are going? Thanks in advance, Regards, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Single OSD down
r_entry done
2015-04-16 00:43:10.198391 7f963a08c780 20 -- 10.100.128.13:6800/43939 wait: stopped reaper thread
2015-04-16 00:43:10.198399 7f963a08c780 10 -- 10.100.128.13:6800/43939 wait: closing pipes
2015-04-16 00:43:10.198401 7f963a08c780 10 -- 10.100.128.13:6800/43939 reaper
2015-04-16 00:43:10.198406 7f963a08c780 10 -- 10.100.128.13:6800/43939 reaper done
2015-04-16 00:43:10.198409 7f963a08c780 10 -- 10.100.128.13:6800/43939 wait: waiting for pipes to close
2015-04-16 00:43:10.198411 7f963a08c780 10 -- 10.100.128.13:6800/43939 wait: done.
2015-04-16 00:43:10.198413 7f963a08c780 1 -- 10.100.128.13:6800/43939 shutdown complete.
2015-04-16 00:43:10.198416 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: waiting for dispatch queue
2015-04-16 00:43:10.198429 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: dispatch queue is stopped
2015-04-16 00:43:10.198433 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: stopping accepter thread
2015-04-16 00:43:10.198436 7f963a08c780 10 accepter.stop accepter
2015-04-16 00:43:10.198450 7f962558d700 20 accepter.accepter poll got 1
2015-04-16 00:43:10.198457 7f962558d700 20 accepter.accepter closing
2015-04-16 00:43:10.198465 7f962558d700 10 accepter.accepter stopping
2015-04-16 00:43:10.198495 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: stopped accepter thread
2015-04-16 00:43:10.198500 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: stopping reaper thread
2015-04-16 00:43:10.198517 7f96347d6700 10 -- 10.100.96.13:6830/43939 reaper_entry done
2015-04-16 00:43:10.198565 7f963a08c780 20 -- 10.100.96.13:6830/43939 wait: stopped reaper thread
2015-04-16 00:43:10.198578 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: closing pipes
2015-04-16 00:43:10.198581 7f963a08c780 10 -- 10.100.96.13:6830/43939 reaper
2015-04-16 00:43:10.198583 7f963a08c780 10 -- 10.100.96.13:6830/43939 reaper done
2015-04-16 00:43:10.198586 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: waiting for pipes to close
2015-04-16 00:43:10.198588 7f963a08c780 10 -- 10.100.96.13:6830/43939 wait: done.
2015-04-16 00:43:10.198590 7f963a08c780 1 -- 10.100.96.13:6830/43939 shutdown complete.

Full OSD log below https://drive.google.com/file/d/0B578d6cBmDPYQ1lCMUR2Y0tLNTA/view?usp=sharing Regards, Quenten Grasso ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] use ZFS for OSDs
Hi Michal, Really nice work on the ZFS testing. I've been thinking about this myself from time to time, However I wasn't sure if ZoL was ready to use in production with Ceph. I would like to see instead of using multiple osd's in zfs/ceph but running say a z+2 for say 8-12 3-4TB spinners and leverage some nice SSD's maybe a P3700 400GB for the zil/l2arc with compression and going back to 2x replicas which then this could give us some pretty fast/safe/efficient storage. Now to find that money tree. Regards, Quenten Grasso -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal Kozanecki Sent: Friday, 10 April 2015 5:15 AM To: Christian Balzer; ceph-users Subject: Re: [ceph-users] use ZFS for OSDs I had surgery and have been off for a while. Had to rebuild test ceph+openstack cluster with whatever spare parts I had. I apologize for the delay for anyone who's been interested. Here are the results; == Hardware/Software 3 node CEPH cluster, 3 OSDs (one OSD per node) -- CPU = 1x E5-2670 v1 RAM = 8GB OS Disk = 500GB SATA OSD = 900GB 10k SAS (sdc - whole device) Journal = Shared Intel SSD DC3500 80GB (sdb1 - 10GB partition) ZFS log = Shared Intel SSD DC3500 80GB (sdb2 - 4GB partition) ZFS L2ARC = Intel SSD 320 40GB (sdd - whole device) - ceph 0.87 ZoL 0.63 CentOS 7.0 2 node KVM/Openstack cluster CPU = 2x Xeon X5650 RAM = 24 GB OS Disk = 500GB SATA - Ubuntu 14.04 OpenStack Juno the rough performance of this oddball sized test ceph cluster is 8k 1000-1500 IOPS == Compression; (cut out unneeded details) Various Debian and CentOS images, with lots of test SVN and GIT data KVM/OpenStack [root@ceph03 ~]# zfs get all SAS1 NAME PROPERTY VALUE SOURCE SAS1 used 586G - SAS1 compressratio 1.50x - SAS1 recordsize32Klocal SAS1 checksum on default SAS1 compression lz4local SAS1 refcompressratio 1.50x - SAS1 written 586G - SAS1 logicalused 877G - == Dedupe; (dedupe is enabled on a dataset level but can dedupe space savings only be viewed at a pool level - bit odd I know) Various Debian and CentOS images, with lots of test SVN and GIT data KVM/OpenStack [root@ceph01 ~]# zpool get all SAS1 NAME PROPERTY VALUE SOURCE SAS1 size 836G - SAS1 capacity 70%- SAS1 dedupratio 1.02x - SAS1 free 250G - SAS1 allocated 586G - == Bitrot/Corruption; Injected random data to random locations (changed seek to random value) of sdc with; dd if=/dev/urandom of=/dev/sdc seek=54356 bs=4k count=1 Results; 1. ZFS detects error on disk affecting PG files, being as this is a single vdev (no zraid or mirror) it cannot automatically fix. It blocks all(but delete) access to the entire files(inaccessible). *note: I ran this after status after already repairing 2 PGs (5.15 and 5.25), ZFS status will no longer list filename after it has been repaired/deleted/cleared* [root@ceph01 ~]# zpool status -v pool: SAS1 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: scrub in progress since Thu Apr 9 13:04:54 2015 153G scanned out of 586G at 40.3M/s, 3h3m to go 0 repaired, 26.05% done config: NAME STATE READ WRITE CKSUM SAS1 ONLINE 0 035 sdc ONLINE 0 070 logs sdb2ONLINE 0 0 0 cache sdd ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /SAS1/current/5.e_head/DIR_E/DIR_0/DIR_6/rbd\udata.2ba762ae8944a.24cc__head_6153260E__5 2. CEPH-OSD cannot read PG file. 
Kicks off scrub/deep-scrub /var/log/ceph/ceph-osd.2.log 2015-04-09 13:10:18.319312 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757 2015-04-09 13:11:38.587014 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.1
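For anyone wanting to try a similar layout, a minimal sketch of the pool/dataset setup described above (device names and the xattr setting are assumptions on my part, not taken from Michal's config):

  # one whole-disk vdev per OSD, SSD partitions for ZIL/L2ARC
  zpool create SAS1 /dev/sdc log /dev/sdb2 cache /dev/sdd
  zfs set compression=lz4 SAS1
  zfs set recordsize=32K SAS1
  zfs set dedup=on SAS1      # only if you have RAM to spare for the dedup table
  zfs set xattr=sa SAS1      # often suggested for ZoL to keep ceph-osd's xattr traffic cheap

Note that with a single vdev ZFS can detect but not repair corruption, exactly as the scrub output above shows.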
Re: [ceph-users] Consumer Grade SSD Clusters
Hi Nick,

Agreed, I see your point - basically once you're past the 150TBW or whatever that number may be, you're effectively just waiting for failure, but aren't we anyway? I guess it depends on your use case at the end of the day. I wonder what the likes of Amazon, Rackspace etc are doing in the way of SSDs; either they are buying them so cheap per GB due to the "volume", or they are possibly using "consumer grade" SSDs. Hmm.. using consumer grade SSDs may be an interesting option if you have decent monitoring and alerting - using SMART you should still be able to see how much spare flash you have available. As suggested by Wido, using multiple brands would help remove the possible cascading failure effect, which I guess we all should be doing anyway on our spinners. I guess we have to decide whether it's worth the extra effort in the long run vs running enterprise SSDs.

Regards, Quenten Grasso

From: Nick Fisk [mailto:n...@fisk.me.uk] Sent: Saturday, 24 January 2015 7:33 PM To: Quenten Grasso; ceph-users@lists.ceph.com Subject: RE: Consumer Grade SSD Clusters

Hi Quenten, There is no real answer to your question. It really depends on how busy your storage will be and particularly if it is mainly reads or writes. I wouldn't pay too much attention to that SSD endurance test, whilst it's great to know that they have a lot more headroom than their official spec's, you run the risk of having a spectacular multiple disk failure if you intend to run them all that high. You can probably guarantee that as 1 SSD starts to fail the increase in workload to re-balance the cluster will cause failures on the rest. I guess it really comes down to how important is the availability of your data. Whilst an average pc user might balk at the price of paying 4 times per GB more for a S3700 SSD, in the enterprise world they are still comparatively cheap. The other thing you need to be aware of is that most consumer SSD's don't have power loss protection, again if you are mainly doing reads and cost is more important than availability, there may be an argument to use them. Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso Sent: 24 January 2015 09:13 To: ceph-users@lists.ceph.com Subject: [ceph-users] Consumer Grade SSD Clusters

Hi Everyone, Just wondering if anyone has had any experience in using consumer grade SSD's for a Ceph cluster? I came across this article http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3 They have been testing different SSD's write endurance and they have been able to write up to 1PB+ to a Samsung 840 Pro 256GB which is only "rated" at 150TBW and of course other SSD's have failed well before 1PBW, So defiantly worth a read. So I've been thinking about using consumer grade SSD's for OSD's and Enterprise SSD's for journals. Reasoning is enterprise SSD's are a lot faster at journaling then consumer grade drives plus this would effectively half the overall write requirements on the consumer grade disks. This also could be a cost effective alternative to using enterprise SSD's as OSD's however it seems if your happy to use 2 x replication it's a pretty good cost saving however 3x replication not so much. 
Cheers, Quenten Grasso ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
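On the monitoring point above, remaining flash endurance is visible through SMART; a rough sketch (attribute names vary by vendor, e.g. Wear_Leveling_Count on Samsung and Media_Wearout_Indicator on Intel, so check your model's documentation):

  smartctl -A /dev/sda | egrep -i 'wear|media_wearout|total_lbas_written'
  # Wear_Leveling_Count / Media_Wearout_Indicator count down from 100 as the flash wears,
  # and Total_LBAs_Written lets you estimate how much of the rated TBW has been consumed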
Re: [ceph-users] OSD removal rebalancing again
Hi Christian, Ahh yes, The overall host weight changed when removing the OSD as all OSD's make up the host weight in turn removal of the OSD then decreased the host weight which then triggered the rebalancing. I guess it would have made more sense if setting the osd as "out" caused the same affect earlier instead of after removing the already emptied disk. *frustrating* So would it be possible/recommended to "statically" set the host weight as 11 in this case and once removal from crush happens it shouldn't cause a rebalance because its already been rebalanced anyway? Regards, Quenten Grasso -Original Message- From: Christian Balzer [mailto:ch...@gol.com] Sent: Tuesday, 27 January 2015 11:53 AM To: ceph-users@lists.ceph.com Cc: Quenten Grasso Subject: Re: [ceph-users] OSD removal rebalancing again On Tue, 27 Jan 2015 01:37:52 + Quenten Grasso wrote: > Hi Christian, > > As you'll probably notice we have 11,22,33,44 marked as out as well. > but here's our tree. > > all of the OSD's in question had already been rebalanced/emptied from > the hosts. osd.0 existed on pbnerbd01 > Ah, lemme re-phrase that then, I was assuming a simpler scenario. Same reasoning, by removing the ODS the weight (not reweight) of the host changed (from 11 to 10) and that then triggered the re-balancing. Clear as mud? ^.^ Christian > > # ceph osd tree > # idweight type name up/down reweight > -1 54 root default > -3 54 rack unknownrack > -2 10 host pbnerbd01 > 1 1 osd.1 up 1 > 10 1 osd.10 up 1 > 2 1 osd.2 up 1 > 3 1 osd.3 up 1 > 4 1 osd.4 up 1 > 5 1 osd.5 up 1 > 6 1 osd.6 up 1 > 7 1 osd.7 up 1 > 8 1 osd.8 up 1 > 9 1 osd.9 up 1 > -4 11 host pbnerbd02 > 11 1 osd.11 up 0 > 12 1 osd.12 up 1 > 13 1 osd.13 up 1 > 14 1 osd.14 up 1 > 15 1 osd.15 up 1 > 16 1 osd.16 up 1 > 17 1 osd.17 up 1 > 18 1 osd.18 up 1 > 19 1 osd.19 up 1 > 20 1 osd.20 up 1 > 21 1 osd.21 up 1 > -5 11 host pbnerbd03 > 22 1 osd.22 up 0 > 23 1 osd.23 up 1 > 24 1 osd.24 up 1 > 25 1 osd.25 up 1 > 26 1 osd.26 up 1 > 27 1 osd.27 up 1 > 28 1 osd.28 up 1 > 29 1 osd.29 up 1 > 30 1 osd.30 up 1 > 31 1 osd.31 up 1 > 32 1 osd.32 up 1 > -6 11 host pbnerbd04 > 33 1 osd.33 up 0 > 34 1 osd.34 up 1 > 35 1 osd.35 up 1 > 36 1 osd.36 up 1 > 37 1 osd.37 up 1 > 38 1 osd.38 up 1 > 39 1 osd.39 up 1 > 40 1 osd.40 up 1 > 41 1 osd.41 up 1 > 42 1 osd.42 up 1 > 43 1 osd.43 up 1 > -7 11 host pbnerbd05 > 44 1 osd.44 up 0 > 45 1 osd.45 up 1 > 46 1 osd.46 up 1 > 47 1 osd.47 up 1 > 48 1 osd.48 up 1 > 49 1 osd.49 up 1 > 50 1 osd.50
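If you did want to pin the host weight before removing the emptied OSD, the usual mechanics are to edit the CRUSH map by hand; a sketch below (bucket names taken from the tree in this thread, and whether it actually avoids the extra data movement is worth verifying on a test cluster first):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit crushmap.txt as desired, e.g. adjust the "item pbnerbd02 weight 11.000"
  # entry under the rack bucket, then recompile and inject:
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new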
Re: [ceph-users] OSD removal rebalancing again
Hi Christian, As you'll probably notice we have 11,22,33,44 marked as out as well. but here's our tree. all of the OSD's in question had already been rebalanced/emptied from the hosts. osd.0 existed on pbnerbd01 # ceph osd tree # idweight type name up/down reweight -1 54 root default -3 54 rack unknownrack -2 10 host pbnerbd01 1 1 osd.1 up 1 10 1 osd.10 up 1 2 1 osd.2 up 1 3 1 osd.3 up 1 4 1 osd.4 up 1 5 1 osd.5 up 1 6 1 osd.6 up 1 7 1 osd.7 up 1 8 1 osd.8 up 1 9 1 osd.9 up 1 -4 11 host pbnerbd02 11 1 osd.11 up 0 12 1 osd.12 up 1 13 1 osd.13 up 1 14 1 osd.14 up 1 15 1 osd.15 up 1 16 1 osd.16 up 1 17 1 osd.17 up 1 18 1 osd.18 up 1 19 1 osd.19 up 1 20 1 osd.20 up 1 21 1 osd.21 up 1 -5 11 host pbnerbd03 22 1 osd.22 up 0 23 1 osd.23 up 1 24 1 osd.24 up 1 25 1 osd.25 up 1 26 1 osd.26 up 1 27 1 osd.27 up 1 28 1 osd.28 up 1 29 1 osd.29 up 1 30 1 osd.30 up 1 31 1 osd.31 up 1 32 1 osd.32 up 1 -6 11 host pbnerbd04 33 1 osd.33 up 0 34 1 osd.34 up 1 35 1 osd.35 up 1 36 1 osd.36 up 1 37 1 osd.37 up 1 38 1 osd.38 up 1 39 1 osd.39 up 1 40 1 osd.40 up 1 41 1 osd.41 up 1 42 1 osd.42 up 1 43 1 osd.43 up 1 -7 11 host pbnerbd05 44 1 osd.44 up 0 45 1 osd.45 up 1 46 1 osd.46 up 1 47 1 osd.47 up 1 48 1 osd.48 up 1 49 1 osd.49 up 1 50 1 osd.50 up 1 51 1 osd.51 up 1 52 1 osd.52 up 1 53 1 osd.53 up 1 54 1 osd.54 up 1 Regards, Quenten Grasso -Original Message- From: Christian Balzer [mailto:ch...@gol.com] Sent: Tuesday, 27 January 2015 11:33 AM To: ceph-users@lists.ceph.com Cc: Quenten Grasso Subject: Re: [ceph-users] OSD removal rebalancing again Hello, A "ceph -s" and "ceph osd tree" would have been nice, but my guess is that osd.0 was the only osd on that particular storage server? In that case the removal of the bucket (host) by removing the last OSD in it also triggered a re-balancing. Not really/well documented AFAIK and annoying, but OTOH both expected (from a CRUSH perspective) and harmless. Christian On Tue, 27 Jan 2015 01:21:28 + Quenten Grasso wrote: > Hi All, > > I just removed an OSD from our cluster following the steps on > http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ > > First I set the OSD as out, > > ceph osd out osd.0 > > This emptied the OSD and eventually health of the cluster came back to > normal/ok. and OSD was up and out. (took about 2-3 hours) (OSD.0 used > space before setting as OUT was 900~ GB after rebalance took place OSD > Usage was ~150MB) > > Once this was all ok I then proceeded to STOP the OSD. > > se
[ceph-users] OSD removal rebalancing again
Hi All, I just removed an OSD from our cluster following the steps on http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

First I set the OSD as out:

ceph osd out osd.0

This emptied the OSD and eventually the health of the cluster came back to normal/OK, and the OSD was up and out (took about 2-3 hours). (osd.0 used space before setting it out was ~900 GB; after the rebalance took place OSD usage was ~150MB.)

Once this was all OK I then proceeded to stop the OSD:

service ceph stop osd.0

I checked cluster health and all looked OK, then I decided to remove the OSD using the following commands:

ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

Now our cluster says health HEALTH_WARN 414 pgs backfill; 12 pgs backfilling; 19 pgs recovering; 344 pgs recovery_wait; 789 pgs stuck unclean; recovery 390967/10986568 objects degraded (3.559%)

Before using the removal procedure everything was "ok" and osd.0 had been emptied and seemingly rebalanced. Any ideas why it's rebalancing again?

We're using Ubuntu 12.04 w/ Ceph 0.80.8 & Kernel 3.13.0-43-generic #72~precise1-Ubuntu SMP Tue Dec 9 12:14:18 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Regards, Quenten Grasso ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
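One way others avoid this second rebalance is to drain the CRUSH weight first, so the later crush remove no longer changes the host weight; a sketch of the alternative order (not from the docs page above):

  ceph osd crush reweight osd.0 0    # this triggers the one and only data migration
  # wait for HEALTH_OK, then:
  ceph osd out osd.0
  service ceph stop osd.0
  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0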
[ceph-users] Consumer Grade SSD Clusters
Hi Everyone, Just wondering if anyone has had any experience in using consumer grade SSDs for a Ceph cluster? I came across this article http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3 They have been testing different SSDs' write endurance and have been able to write 1PB+ to a Samsung 840 Pro 256GB, which is only "rated" at 150TBW, and of course other SSDs have failed well before 1PBW, so definitely worth a read. So I've been thinking about using consumer grade SSDs for OSDs and enterprise SSDs for journals. The reasoning is that enterprise SSDs are a lot faster at journaling than consumer grade drives, plus this would effectively halve the overall write requirements on the consumer grade disks. This could also be a cost effective alternative to using enterprise SSDs as OSDs; however it seems that if you're happy to use 2x replication it's a pretty good cost saving, but with 3x replication not so much. Cheers, Quenten Grasso ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] IO wait spike in VM
Hi Alexandre,

No problem, I hope this saves you some pain. It's probably worth going for a larger journal, around 20GB, if you wish to play with tuning "filestore max sync interval" - that can have some interesting results. Also, you probably already know this, however most of us when starting with ceph use an xfs file for the journal instead of a raw partition; using a raw partition removes the file system overhead on the journal.

I highly recommend looking into dedicated journals for your systems, as your spinning disks are going to work very hard trying to keep up with all the read/write seeking on these disks, particularly if you're going to be using them for VMs. Also, you'll get about 1/3 of the write performance as a "best case scenario" using journals on the same disk, and this comes down to the disk's IOPS.

Depending on your hardware & budget you could look into one of these options for dedicated journals: Intel DC P3700 400GB PCIe - these are good for about ~1000MB/s write (haven't tested these myself, however we are looking to use these in our additional nodes); Intel DC S3700 200GB - these are good for about ~360MB/s write. At the time we used the Intel DC S3700 100GB; these drives don't have enough throughput, so I'd recommend you stay away from this particular 100GB model. So if you have spare hard disk slots in your servers the 200GB DC S3700 is the best bang for buck. Usually I run 6 spinning disks to 1 SSD; in an ideal world I'd like to cut this back to 4 instead of 6 though when using the 200GB disks. Both of these SSD options would do nicely and have on-board capacitors and very high write/wear rates as well.

Cheers, Quenten Grasso

-Original Message- From: Bécholey Alexandre [mailto:alexandre.becho...@nagra.com] Sent: Monday, 29 September 2014 4:15 PM To: Quenten Grasso; ceph-users@lists.ceph.com Cc: Aviolat Romain Subject: RE: [ceph-users] IO wait spike in VM

Hello Quenten, Thanks for your reply. We have a 5GB journal for each OSD on the same disk. Right now, we are migrating our OSD to XFS and we'll add a 5th monitor. We will perform the benchmarks afterwards. Cheers, Alexandre

-Original Message- From: Quenten Grasso [mailto:qgra...@onq.com.au] Sent: lundi 29 septembre 2014 01:56 To: Bécholey Alexandre; ceph-users@lists.ceph.com Cc: Aviolat Romain Subject: RE: [ceph-users] IO wait spike in VM

G'day Alexandre I'm not sure if this is causing your issues, however it could be contributing to them. I noticed you have 4 Mon's, this could contributing to your problems as its recommended due to paxos algorithm which ceph uses for achieving quorum of mon's, you should be running an odd number of mon's 1, 3, 5, 7, etc Also worth it's mentioning running 4 mon's would still only give you a possible failure of 1 mon without an outage. Spec wise the machines look pretty good, only thing I can see is the lack of journals and using btrfs at this stage. You could try some iperf testing between the machines to make sure the networking is working as expected. If you do rados benches for extended time what kind of stats do you see? For example, Write) ceph osd pool create benchmark1 ceph osd pool set benchmark1 size 3 rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32 * I suggest you create a 2nd benchmark pool and write for another 180 seconds or so to ensure nothing is cached then do a read test. 
Read) rados bench -p benchmark1 180 seq --concurrent-ios=32 You can also try the same using 4k blocks rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32 rados bench -p benchmark1 180 seq -b 4096 As you may know increasing the concurrent io's will increase cpu/disk load. = Total PG = OSD * 100 / Replicas Ie: 50 OSD System with 3 replicas would be around 1600 Hope this helps a little, Cheers, Quenten Grasso -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bécholey Alexandre Sent: Thursday, 25 September 2014 1:27 AM To: ceph-users@lists.ceph.com Cc: Aviolat Romain Subject: [ceph-users] IO wait spike in VM Dear Ceph guru, We have a Ceph cluster (version 0.80.5 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MON and 16 OSDs (4 per host) used as a backend storage for libvirt. Hosts: Ubuntu 14.04 CPU: 2 Xeon X5650 RAM: 48 GB (no swap) No SSD for journals HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition for the journal, the rest for the OSD) FS: btrfs (I know it's not recommended in the doc and I hope it's not the culprit) Network: dedicated 10GbE As we added some VMs to the cluster, we saw some sporadic huge IO wait on the VM. The hosts running the OSDs seem fine. I followed a similar discussion here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/04062
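For reference, Quenten's journal suggestions as a ceph.conf sketch (the 20GB size and 30 second interval are just the ballpark figures from the discussion above, not tested values; a larger sync interval needs a journal big enough to absorb the burst):

  [osd]
  osd journal size = 20480               # MB, i.e. a 20GB raw partition per OSD
  filestore max sync interval = 30       # default is 5 seconds
  filestore min sync interval = 0.01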
Re: [ceph-users] IO wait spike in VM
G'day Alexandre,

I'm not sure if this is causing your issues, however it could be contributing to them. I noticed you have 4 mons; this could be contributing to your problems, as it's recommended, due to the paxos algorithm which ceph uses for achieving quorum of mons, that you run an odd number of mons: 1, 3, 5, 7, etc. Also worth mentioning, running 4 mons would still only give you a possible failure of 1 mon without an outage.

Spec wise the machines look pretty good, the only thing I can see is the lack of journals and the use of btrfs at this stage. You could try some iperf testing between the machines to make sure the networking is working as expected.

If you do rados benches for an extended time, what kind of stats do you see? For example,

Write)
ceph osd pool create benchmark1
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32
* I suggest you create a 2nd benchmark pool and write for another 180 seconds or so to ensure nothing is cached, then do a read test.

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks:
rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
rados bench -p benchmark1 180 seq -b 4096

As you may know, increasing the concurrent IOs will increase cpu/disk load.

Total PG = OSD * 100 / Replicas
Ie: a 50 OSD system with 3 replicas would be around 1600

Hope this helps a little, Cheers, Quenten Grasso

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bécholey Alexandre Sent: Thursday, 25 September 2014 1:27 AM To: ceph-users@lists.ceph.com Cc: Aviolat Romain Subject: [ceph-users] IO wait spike in VM

Dear Ceph guru, We have a Ceph cluster (version 0.80.5 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MON and 16 OSDs (4 per host) used as a backend storage for libvirt. Hosts: Ubuntu 14.04 CPU: 2 Xeon X5650 RAM: 48 GB (no swap) No SSD for journals HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition for the journal, the rest for the OSD) FS: btrfs (I know it's not recommended in the doc and I hope it's not the culprit) Network: dedicated 10GbE As we added some VMs to the cluster, we saw some sporadic huge IO wait on the VM. The hosts running the OSDs seem fine. 
I followed a similar discussion here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html Here is an example of a transaction that took some time:

{ "description": "osd_op(client.5275.0:262936 rbd_data.22e42ae8944a.0807 [] 3.c9699248 ack+ondisk+write e3158)",
  "received_at": "2014-09-23 15:23:30.820958",
  "age": "108.329989",
  "duration": "5.814286",
  "type_data": [
        "commit sent; apply or cleanup",
        { "client": "client.5275",
          "tid": 262936},
        [ { "time": "2014-09-23 15:23:30.821097",
            "event": "waiting_for_osdmap"},
          { "time": "2014-09-23 15:23:30.821282",
            "event": "reached_pg"},
          { "time": "2014-09-23 15:23:30.821384",
            "event": "started"},
          { "time": "2014-09-23 15:23:30.821401",
            "event": "started"},
          { "time": "2014-09-23 15:23:30.821459",
            "event": "waiting for subops from 14"},
          { "time": "2014-09-23 15:23:30.821561",
            "event": "commit_queued_for_journal_write"},
          { "time": "2014-09-23 15:23:30.821666",
            "event": "write_thread_in_journal_buffer"},
          { "time": "2014-09-23 15:23:30.822591",
            "event": "op_applied"},
          { "time": "2014-09-23 15:23:30.824707",
            "event": "sub_op_applied_rec"},
          { "time": "2014-09-23 15:23:31.225157",
            "event": "journaled_completion_queued"},
          { "time": "2014-09-23 15:23:31.225297",
            "event": "op_commit"},
          { "time": "2014-09-23 15:23:36.635085",
            "event": "sub_op_commit_rec"},
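Putting numbers to the PG rule of thumb above for Alexandre's cluster (16 OSDs; 3 replicas is an assumption based on the benchmark pool suggestion), with most people then rounding to the nearest power of two:

  # total PGs ~ OSDs * 100 / replicas
  echo $(( 16 * 100 / 3 ))    # -> 533, so roughly 512-1024 PGs in total across all pools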
Re: [ceph-users] NAS on RBD
We have been using the NFS/Pacemaker/RBD Method for a while explains it a bit better here, http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/ PS: Thanks Sebastien, Our use case is vmware storage, So as I mentioned we've been running it for some time and we've had pretty mixed results. Pros are when it works it works really well! Cons When it doesn't, I've had a couple of instances where the XFS volumes needed fsck and this took about 3 hours on a 4TB Volume. (Lesson learnt use smaller volumes) ZFS RaidZ Option could be interesting but expensive if using say 3 Pools with 2x replicas with a RBD volume from each and a RaidZ on top of that. (I assume you would use 3 Pools here so we don't end up with data in the same PG which may be corrupted.) Currently we also use FreeNAS VM's which are backed via RBD w/ 3 replicas and ZFS Striped Volumes and iSCSI/NFS out of these. While not really HA seems mostly work be it FreeNAS iSCSI can get a bit cranky at times. We are moving towards another KVM Hypervisor such as proxmox for these vm's which don't quite fit into our Openstack environment instead of having to use "RBD Proxys" Regards, Quenten Grasso -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan Van Der Ster Sent: Wednesday, 10 September 2014 12:54 AM To: Michal Kozanecki Cc: ceph-users@lists.ceph.com; Blair Bethwaite Subject: Re: [ceph-users] NAS on RBD > On 09 Sep 2014, at 16:39, Michal Kozanecki wrote: > On 9 September 2014 08:47, Blair Bethwaite wrote: >> On 9 September 2014 20:12, Dan Van Der Ster >> wrote: >>> One thing I’m not comfortable with is the idea of ZFS checking the data in >>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but >>> without any redundancy at the ZFS layer there will be no way to correct >>> that error. Of course, the hope is that RADOS will ensure 100% data >>> consistency, but what happens if not?... >> >> The ZFS checksumming would tell us if there has been any corruption, which >> as you've pointed out shouldn't happen anyway on top of Ceph. > > Just want to quickly address this, someone correct me if I'm wrong, but IIRC > even with replica value of 3 or more, ceph does not(currently) have any > intelligence when it detects a corrupted/"incorrect" PG, it will always > replace/repair the PG with whatever data is in the primary, meaning that if > the primary PG is the one that’s corrupted/bit-rotted/"incorrect", it will > replace the good replicas with the bad. According to the the "scrub error on firefly” thread, repair "tends to choose the copy with the lowest osd number which is not obviously corrupted. Even with three replicas, it does not do any kind of voting at this time.” Cheers, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
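For anyone new to this thread, the basic plumbing under the NFS/Pacemaker/RBD approach (leaving the pacemaker resource agents to Sebastien's post) looks roughly like the sketch below; the pool and image names are invented, and per the fsck lesson above, smaller images are kinder:

  rbd create nfs/share01 --size 1048576      # 1TB image in a pool called "nfs"
  rbd map nfs/share01                        # kernel RBD client, appears as /dev/rbd0
  mkfs.xfs /dev/rbd0
  mount /dev/rbd0 /export/share01
  echo "/export/share01 10.0.0.0/24(rw,no_root_squash,sync)" >> /etc/exports
  exportfs -ra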
Re: [ceph-users] SSD journal deployment experiences
This reminds me of something I was trying to find out awhile back. If we have 2000 "Random" IOPS of which are 4K Blocks our cluster (assuming 3 x Replicas) will generate 6000 IOPS @ 4K onto the journals. Does this mean our Journals will absorb 6000 IOPS and turn these into X IOPS onto our spindles? If this is the case Is it possible to calculate how many IOPS a journal would "absorb" and how this would translate to x IOPS on spindle disk? Regards, Quenten Grasso -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: Sunday, 7 September 2014 1:38 AM To: ceph-users Subject: Re: [ceph-users] SSD journal deployment experiences On Sat, 6 Sep 2014 14:50:20 + Dan van der Ster wrote: > September 6 2014 4:01 PM, "Christian Balzer" wrote: > > On Sat, 6 Sep 2014 13:07:27 + Dan van der Ster wrote: > > > >> Hi Christian, > >> > >> Let's keep debating until a dev corrects us ;) > > > > For the time being, I give the recent: > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html > > > > And not so recent: > > http://www.spinics.net/lists/ceph-users/msg04152.html > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021 > > > > And I'm not going to use BTRFS for mainly RBD backed VM images > > (fragmentation city), never mind the other stability issues that > > crop up here ever so often. > > > Thanks for the links... So until I learn otherwise, I better assume > the OSD is lost when the journal fails. Even though I haven't > understood exactly why :( I'm going to UTSL to understand the consistency > better. > An op state diagram would help, but I didn't find one yet. > Using the source as an option of last resort is always nice, having to actually do so for something like this feels a bit lacking in the documentation department (that or my google foo being weak). ^o^ > BTW, do you happen to know, _if_ we re-use an OSD after the journal > has failed, are any object inconsistencies going to be found by a > scrub/deep-scrub? > No idea. And really a scenario I hope to never encounter. ^^;; > >> > >> We have 4 servers in a 3U rack, then each of those servers is > >> connected to one of these enclosures with a single SAS cable. > >> > >>>> With the current config, when I dd to all drives in parallel I > >>>> can write at 24*74MB/s = 1776MB/s. > >>> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s. > >>> And given your storage pod I assume it is connected with 2 > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s > >>> SATA bandwidth. > >> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected. > > > > Alright, that explains that then. Any reason for not using both ports? > > > > Probably to minimize costs, and since the single 10Gig-E is a > bottleneck anyway. The whole thing is suboptimal anyway, since this > hardware was not purchased for Ceph to begin with. Hence retrofitting SSDs, > etc... > The single 10Gb/s link is the bottleneck for sustained stuff, but when looking at spikes... Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port might also get some loving. ^o^ The cluster I'm currently building is based on storage nodes with 4 SSDs (100GB DC 3700s, so 800MB/s would be the absolute write speed limit) and 8 HDDs. Connected with 40Gb/s Infiniband. Dual port, dual switch for redundancy, not speed. ^^ > >>> Impressive, even given your huge cluster with 1128 OSDs. 
> >>> However that's not really answering my question, how much data is > >>> on an average OSD and thus gets backfilled in that hour? > >> > >> That's true -- our drives have around 300TB on them. So I guess it > >> will take longer - 3x longer - when the drives are 1TB full. > > > > On your slides, when the crazy user filled the cluster with 250 > > million objects and thus 1PB of data, I recall seeing a 7 hour backfill > > time? > > > > Yeah that was fun :) It was 250 million (mostly) 4k objects, so not > close to 1PB. The point was that to fill the cluster with RBD, we'd > need > 250 million (4MB) objects. So, object-count-wise this was a full > cluster, but for the real volume it was more like 70TB IIRC (there > were some other larger objects too). > Ah, I see. ^^ > In that case, the backfilling was CPU-boun
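A back-of-envelope version of that question, assuming 3x replication and ignoring journal coalescing (the real spindle load depends on how much the filestore merges over its sync interval, so treat this as an upper bound on journal writes, not a spindle prediction):

  echo $(( 2000 * 3 ))     # 6000 journal writes/s cluster-wide for 2000 client IOPS
  echo $(( 6000 / 24 ))    # ~250 writes/s per journal if those land on, say, 24 OSDs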
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Sage & List, I understand this is probably a hard question to answer. I mentioned previously that our cluster has MONs co-located on the OSD servers, which are R515's w/ 1 x AMD 6-core processor, 11 x 3TB OSDs and dual 10GbE. When our cluster is doing these busy operations - setting tunables to optimal, or heavy recovery, as in my case - and IO has stopped, is there a way to ensure our IO doesn't get completely blocked/stopped/frozen in our VMs? Could it be as simple as putting all 3 of our mon servers on bare metal w/ SSDs? (I recall reading somewhere that a mon disk was doing several thousand IOPS during a recovery operation.) I assume putting just one on bare metal won't help, because our mons will only ever be as fast as our slowest mon server? Thanks, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
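Independent of where the mons live, one thing that helps keep client IO alive during a big data movement is winding the recovery/backfill knobs right down first; a sketch using runtime injection (the values are conservative suggestions, not something tested on this cluster):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
  # put the same settings in the [osd] section of ceph.conf so they survive restarts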
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Sage, Andrija & List I have seen the tuneables issue on our cluster when I upgraded to firefly. I ended up going back to legacy settings after about an hour as my cluster is of 55 3TB OSD’s over 5 nodes and it decided it needed to move around 32% of our data, which after an hour all of our vm’s were frozen and I had to revert the change back to legacy settings and wait about the same time again until our cluster had recovered and reboot our vms. (wasn’t really expecting that one from the patch notes) Also our CPU usage went through the roof as well on our nodes, do you per chance have your metadata servers co-located on your osd nodes as we do? I’ve been thinking about trying to move these to dedicated nodes as it may resolve our issues. Regards, Quenten From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrija Panic Sent: Tuesday, 15 July 2014 8:38 PM To: Sage Weil Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time Hi Sage, since this problem is tunables-related, do we need to expect same behavior or not when we do regular data rebalancing caused by adding new/removing OSD? I guess not, but would like your confirmation. I'm already on optimal tunables, but I'm afraid to test this by i.e. shuting down 1 OSD. Thanks, Andrija On 14 July 2014 18:18, Sage Weil mailto:sw...@redhat.com>> wrote: I've added some additional notes/warnings to the upgrade and release notes: https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451 If there is somewhere else where you think a warning flag would be useful, let me know! Generally speaking, we want to be able to cope with huge data rebalances without interrupting service. It's an ongoing process of improving the recovery vs client prioritization, though, and removing sources of overhead related to rebalancing... and it's clearly not perfect yet. :/ sage On Sun, 13 Jul 2014, Andrija Panic wrote: > Hi, > after seting ceph upgrade (0.72.2 to 0.80.3) I have issued "ceph osd crush > tunables optimal" and after only few minutes I have added 2 more OSDs to the > CEPH cluster... > > So these 2 changes were more or a less done at the same time - rebalancing > because of tunables optimal, and rebalancing because of adding new OSD... > > Result - all VMs living on CEPH storage have gone mad, no disk access > efectively, blocked so to speak. > > Since this rebalancing took 5h-6h, I had bunch of VMs down for that long... > > Did I do wrong by causing "2 rebalancing" to happen at the same time ? > Is this behaviour normal, to cause great load on all VMs because they are > unable to access CEPH storage efectively ? > > Thanks for any input... > -- > > Andrija Pani? > > -- Andrija Panić -- http://admintweets.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
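For the record, the revert described above is a one-liner each way, and ceph -s shows how far the resulting data movement has to go:

  ceph osd crush tunables legacy     # back to the pre-firefly behaviour
  ceph osd crush tunables optimal    # re-apply later, e.g. in a maintenance window
  ceph -s                            # watch the degraded percentage fall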
[ceph-users] Firefly Upgrade
Hi All, Just a quick question for the list, has anyone seen a significant increase in ram usage since firefly? I upgraded from 0.72.2 to 80.3 now all of my Ceph servers are using about double the ram they used to. Only other significant change to our setup was a upgrade to Kernel 3.13.0-30-generic #55~precise1-Ubuntu SMP Any ideas? Regards, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
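If anyone wants to compare numbers, per-daemon memory is easy to pull; the heap commands need the OSDs to be built with tcmalloc (the stock packages are), and the commands below are just where to look, not expected values:

  ps -o rss,cmd -C ceph-osd       # resident set size per OSD, in KB
  ceph tell osd.0 heap stats      # tcmalloc view: bytes in use vs freed-but-not-released
  ceph tell osd.0 heap release    # hand unused pages back to the kernel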
Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"
Thanks Greg, Looking forward to the new release! Regards, Quenten Grasso -Original Message- From: Gregory Farnum [mailto:g...@inktank.com] Sent: Tuesday, 1 April 2014 3:08 AM To: Quenten Grasso Cc: Kyle Bader; ceph-users@lists.ceph.com Subject: Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec" Yep, that looks like http://tracker.ceph.com/issues/7093, which is fixed in dumpling and most of the dev releases since emperor. ;) I also cherry-picked the fix to the emperor branch and it will be included whenever we do another point release of that. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Mar 25, 2014 at 6:39 PM, Quenten Grasso wrote: > Hi Greg, > > Restarting the actual service ie: service ceph restart osd.50, only takes a > few seconds. > > Attached is a ceph -w of just running a service ceph restart osd.50, > > You can see it marks itself down pretty much straight away. Takes a little > while to mark itself as up and finish "recovery" > > If I do this to all 12 osd's the node goes crazy, It's almost like the > node is cpu bound but it has 6 cores, and load average goes to 300+ > > http://pastie.org/pastes/8968950/text?key=0e0bs1ojbm2arnexn52iwq > > Regards, > Quenten > > -Original Message- > From: Gregory Farnum [mailto:g...@inktank.com] > Sent: Wednesday, 26 March 2014 2:02 AM > To: Quenten Grasso > Cc: Kyle Bader; ceph-users@lists.ceph.com > Subject: Re: [ceph-users] OSD Restarts cause excessively high load average > and "requests are blocked > 32 sec" > > How long does it take for the OSDs to restart? Are you just issuing a restart > command via upstart/sysvinit/whatever? How many OSDMaps are generated from > the time you issue that command to the time the cluster is healthy again? > > This sounds like an issue we had for a while where OSDs would start peering > before they had processed the maps they needed to look at; the fix might not > have been backported to Emperor. But I'd like to be sure this isn't some > other issue you're seeing. > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > > On Sat, Mar 22, 2014 at 8:16 PM, Quenten Grasso wrote: >> Hi Kyle, >> >> Thanks, I turned on debug ms = 1 and debug osd = 10 and restarted osd.54 >> heres here's log for that one. >> >> ceph-osd.54.log.bz2 >> http://www67.zippyshare.com/v/99704627/file.html >> >> >> Strace osd 53, >> strace.zip >> http://www43.zippyshare.com/v/17581165/file.html >> >> >> Thanks, >> Quenten >> -Original Message- >> From: Kyle Bader [mailto:kyle.ba...@gmail.com] >> Sent: Sunday, 23 March 2014 12:10 PM >> To: Quenten Grasso >> Subject: Re: [ceph-users] OSD Restarts cause excessively high load average >> and "requests are blocked > 32 sec" >> >>> Any ideas on why the load average goes so crazy & starts to block IO? >> >> Could you turn on "debug ms = 1" and "debug osd = 10" prior to restarting >> the OSDs on one of your hosts and sharing the logs so we can take a look? >> >> It also might be worth while to strace one of the OSDs to try to determine >> what it's working so hard on, maybe: >> >> strace -fc -p > strace.osd1.log >> >> Thanks! >> >> -- >> >> Kyle >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"
Hi Greg, Restarting the actual service ie: service ceph restart osd.50, only takes a few seconds. Attached is a ceph -w of just running a service ceph restart osd.50, You can see it marks itself down pretty much straight away. Takes a little while to mark itself as up and finish "recovery" If I do this to all 12 osd's the node goes crazy, It's almost like the node is cpu bound but it has 6 cores, and load average goes to 300+ http://pastie.org/pastes/8968950/text?key=0e0bs1ojbm2arnexn52iwq Regards, Quenten -Original Message- From: Gregory Farnum [mailto:g...@inktank.com] Sent: Wednesday, 26 March 2014 2:02 AM To: Quenten Grasso Cc: Kyle Bader; ceph-users@lists.ceph.com Subject: Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec" How long does it take for the OSDs to restart? Are you just issuing a restart command via upstart/sysvinit/whatever? How many OSDMaps are generated from the time you issue that command to the time the cluster is healthy again? This sounds like an issue we had for a while where OSDs would start peering before they had processed the maps they needed to look at; the fix might not have been backported to Emperor. But I'd like to be sure this isn't some other issue you're seeing. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Sat, Mar 22, 2014 at 8:16 PM, Quenten Grasso wrote: > Hi Kyle, > > Thanks, I turned on debug ms = 1 and debug osd = 10 and restarted osd.54 > heres here's log for that one. > > ceph-osd.54.log.bz2 > http://www67.zippyshare.com/v/99704627/file.html > > > Strace osd 53, > strace.zip > http://www43.zippyshare.com/v/17581165/file.html > > > Thanks, > Quenten > -Original Message----- > From: Kyle Bader [mailto:kyle.ba...@gmail.com] > Sent: Sunday, 23 March 2014 12:10 PM > To: Quenten Grasso > Subject: Re: [ceph-users] OSD Restarts cause excessively high load average > and "requests are blocked > 32 sec" > >> Any ideas on why the load average goes so crazy & starts to block IO? > > Could you turn on "debug ms = 1" and "debug osd = 10" prior to restarting the > OSDs on one of your hosts and sharing the logs so we can take a look? > > It also might be worth while to strace one of the OSDs to try to determine > what it's working so hard on, maybe: > > strace -fc -p > strace.osd1.log > > Thanks! > > -- > > Kyle > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"
Hi Kyle, Thanks, I turned on debug ms = 1 and debug osd = 10 and restarted osd.54 heres here's log for that one. ceph-osd.54.log.bz2 http://www67.zippyshare.com/v/99704627/file.html Strace osd 53, strace.zip http://www43.zippyshare.com/v/17581165/file.html Thanks, Quenten -Original Message- From: Kyle Bader [mailto:kyle.ba...@gmail.com] Sent: Sunday, 23 March 2014 12:10 PM To: Quenten Grasso Subject: Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec" > Any ideas on why the load average goes so crazy & starts to block IO? Could you turn on "debug ms = 1" and "debug osd = 10" prior to restarting the OSDs on one of your hosts and sharing the logs so we can take a look? It also might be worth while to strace one of the OSDs to try to determine what it's working so hard on, maybe: strace -fc -p > strace.osd1.log Thanks! -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
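As a side note, debug levels can also be raised on a running OSD without touching ceph.conf, which is handy for watching how the other OSDs react while one node restarts (injected settings don't survive a restart, so the daemon being restarted itself still needs them in the config file):

  ceph tell osd.54 injectargs '--debug-ms 1 --debug-osd 10'
  # and back down once the logs are captured
  ceph tell osd.54 injectargs '--debug-ms 0 --debug-osd 0'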
Re: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"
Hi All, I left out my OS/kernel version, Ubuntu 12.04.4 LTS w/ Kernel 3.10.33-031033-generic (We upgrade our kernels to 3.10 due to Dell Drivers). Here's an example of starting all the OSD's after a reboot. top - 09:10:51 up 2 min, 1 user, load average: 332.93, 112.28, 39.96 Tasks: 310 total, 1 running, 309 sleeping, 0 stopped, 0 zombie Cpu(s): 50.3%us, 32.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 17.2%hi, 0.0%si, 0.0%st Mem: 32917276k total, 6331224k used, 26586052k free, 1332k buffers Swap: 33496060k total,0k used, 33496060k free, 1474084k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 15875 root 20 0 910m 381m 50m S 60 1.2 0:50.57 ceph-osd 2996 root 20 0 867m 330m 44m S 59 1.0 0:58.32 ceph-osd 4502 root 20 0 907m 372m 47m S 58 1.2 0:55.14 ceph-osd 12465 root 20 0 949m 418m 55m S 58 1.3 0:51.79 ceph-osd 4171 root 20 0 886m 348m 45m S 57 1.1 0:56.17 ceph-osd 3707 root 20 0 941m 405m 50m S 57 1.3 0:59.68 ceph-osd 3560 root 20 0 924m 394m 51m S 56 1.2 0:59.37 ceph-osd 4318 root 20 0 965m 435m 55m S 56 1.4 0:54.80 ceph-osd 3337 root 20 0 935m 407m 51m S 56 1.3 1:01.96 ceph-osd 3854 root 20 0 897m 366m 48m S 55 1.1 1:00.55 ceph-osd 3143 root 20 0 1364m 424m 24m S 16 1.3 1:08.72 ceph-osd 2509 root 20 0 652m 261m 62m S2 0.8 0:26.42 ceph-mon 4 root 20 0 000 S0 0.0 0:00.08 kworker/0:0 Regards, Quenten Grasso From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso Sent: Tuesday, 18 March 2014 10:19 PM To: 'ceph-users@lists.ceph.com' Subject: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec" Hi All, I'm trying to troubleshoot a strange issue with my Ceph cluster. We're Running Ceph Version 0.72.2 All Nodes are Dell R515's w/ 6C AMD CPU w/ 32GB Ram, 12 x 3TB NearlineSAS Drives and 2 x 100GB Intel DC S3700 SSD's for Journals. All Pools have a replica of 2 or better. I.e. metadata replica of 3. I have 55 OSD's in the cluster across 5 nodes. When I restart the OSD's on a single node (any node) the load average of that node shoots up to 230+ and the whole cluster starts blocking IO requests until it settles down and its fine again. Any ideas on why the load average goes so crazy & starts to block IO? [osd] osd data = /var/ceph/osd.$id osd journal size = 15000 osd mkfs type = xfs osd mkfs options xfs = "-i size=2048 -f" osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k" osd max backfills = 5 osd recovery max active = 3 [osd.0] host = pbnerbd01 public addr = 10.100.96.10 cluster addr = 10.100.128.10 osd journal = /dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1 devs = /dev/sda4 Thanks, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"
Hi All, I'm trying to troubleshoot a strange issue with my Ceph cluster. We're running Ceph version 0.72.2. All nodes are Dell R515's w/ 6C AMD CPU, 32GB RAM, 12 x 3TB NearlineSAS drives and 2 x 100GB Intel DC S3700 SSD's for journals. All pools have a replica of 2 or better, i.e. metadata replica of 3. I have 55 OSD's in the cluster across 5 nodes.

When I restart the OSD's on a single node (any node) the load average of that node shoots up to 230+ and the whole cluster starts blocking IO requests until it settles down, and then it's fine again. Any ideas on why the load average goes so crazy & starts to block IO?

[osd]
osd data = /var/ceph/osd.$id
osd journal size = 15000
osd mkfs type = xfs
osd mkfs options xfs = "-i size=2048 -f"
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
osd max backfills = 5
osd recovery max active = 3

[osd.0]
host = pbnerbd01
public addr = 10.100.96.10
cluster addr = 10.100.128.10
osd journal = /dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
devs = /dev/sda4

Thanks, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
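One mitigation worth trying while chasing the root cause is to stop the cluster marking restarting OSDs out, and to restart them one at a time rather than a whole node at once; a sketch (the OSD ids in the loop are just an example for one node):

  ceph osd set noout
  for i in 0 1 2 3 4 5 6 7 8 9 10; do service ceph restart osd.$i; sleep 60; done
  ceph osd unset noout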
[ceph-users] radosgw public url
Hi All, Does Radosgw support a "public URL" for static content? I wish to share a file publicly without giving out usernames/passwords etc. I noticed http://ceph.com/docs/master/radosgw/swift/ says Static Websites aren't supported, which I assume is talking about this feature, but I'm not 100% sure. Cheers, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
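If the Swift API side of radosgw is enough for your use case, the standard Swift container read ACL is worth trying - it makes objects in that container fetchable without a token. A sketch with invented hostnames/credentials (worth testing on your release, as radosgw's Swift ACL support has had gaps in some versions):

  swift -A http://rgw.example.com/auth -U account:user -K secretkey post -r '.r:*' public-stuff
  curl http://rgw.example.com/swift/v1/public-stuff/somefile.iso   # no credentials needed if the ACL took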
[ceph-users] issues with 'https://ceph.com/git/?p=ceph.git; a=blob_plain; f=keys/release.asc'
Hey Guys, Looks like 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' is down. Regards, Quenten Grasso ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph write performance and my Dell R515's
G'day Mark, I stumbled across an older thread it looks like you were involved with the centos and poor seq write performance on the R515's. Were you using centos or Ubuntu on your server at the time? (I'm wondering if this could be related to Ubuntu) http://marc.info/?t=13481911702&r=1&w=2 Also I tried as you suggested to put the raid controller into JBOD mode but no joy. I also tried cross flashing the card as its apparently a 9260 but we don't have any spare slots outside of the storage slot of which the raid controller cables can reach so that was a non-event :( If you want to give it a try, if you have access to longer cables and or other servers you can put the perc h700 into. I downloaded this flashing kit from here, (has all of the tools) grabbed a freedos usb and copied it all onto that. http://forums.laptopvideo2go.com/topic/29166-sas2108-lsi-9260-based-firmware-files/ Then grabbed the latest 9260 firmware from, http://www.lsi.com/downloads/Public/MegaRAID%20Common%20Files/12.13.0-0154_SAS_2108_Fw_Image_APP2.130.383-2315.zip *** Steps to Cross Flash *** Disclaimer you do this at your own risk, I take no responsibility if you brick your card, Warranty, etc In a Dell R515 If you write the SBR of a LSI card i.e. the 9260 and reboot the system, The system will be halted as it's now a non-dell card in the storage slot. However if you attempt to flash the LSI firmware onto the perch700 without the correct SBR it won't flash correctly it seems. So if you have longer cables and or another server to try the h700 in that's not a dell. You can try and cross flash the card. (FYI if you're trying to do this in a dell and you fudge up you can recover your system/raid card by plugging it into another pci-e slot and reapplying the Dell H700 SBR/Firmware) Now I'll assume you have one raid controller in your system so you only have adapter 0 1) Backup your SBR in case you need to restore it ie: Megarec -readsbr 0 prch700.sbr 2) Write the SBR of the card you want to flash ie: megarec -writesbr 0 sbr9260.bin 3) Erase the raid controller bios/firmware Megarec -cleanflash 0 4) Reboot 5) flash new firmware Megarec -m0flash 0 mr2108fw.rom 6) Reboot & Done. Also if your command errors out half way through flashing/erasing run it again. Regards, Quenten Grasso -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Sunday, 22 September 2013 10:40 PM Cc: ceph-de...@vger.kernel.org Subject: Re: [ceph-users] Ceph write performance and my Dell R515's On 09/22/2013 03:12 AM, Quenten Grasso wrote: > > Hi All, > > I'm finding my write performance is less than I would have expected. > After spending some considerable amount of time testing several > different configurations I can never seems to break over ~360mb/s > write even when using tmpfs for journaling. > > So I've purchased 3x Dell R515's with 1 x AMD 6C CPU with 12 x 3TB SAS > & 2 x 100GB Intel DC S3700 SSD's & 32GB Ram with the Perc H710p Raid > controller and Dual Port 10GBE Network Cards. > > So first up I realise the SSD's were a mistake, I should have bought > the 200GB Ones as they have considerably better write though put of > ~375 Mb/s vs 200 Mb/s > > So to our Nodes Configuration, > > 2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, 12 Disks in a > Single each in a Raid0 (like a JBOD Fashion) with a 1MB Stripe size, > > (Stripe size this part was particularly important because I found the > stripe size matters considerably even on a single disk raid0. 
contrary > to what you might read on the internet) > > Also each disk is configured with (write back cache) is enabled and > (read head) disabled. > > For Networking, All nodes are connected via LACP bond with L3 hashing > and using iperf I can get up to 16gbit/s tx and rx between the nodes. > > OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to > upgrade kernel due to 10Gbit Intel NIC's driver issues) > > So this gives me 11 OSD's & 2 SSD's Per Node. > I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you definitely will want to do some investigation to make sure that OSD isn't holding the other ones back. iostat or collectl might be useful, along with the ceph osd admin socket and the dump_ops_in_flight and dump_historic_ops commands. > Next I've tried several different configurations which I'll briefly > describe 2 of which below, > > 1)Cluster Configuration 1, > > 33 OSD's with 6x SSD's as Journals, w/ 15GB Journals on SSD. > > # ceph osd pool create benchmark1 1800 1800 > > # rados bench -p benchmark1 180 write --no-cleanup > >
[ceph-users] Ceph write performance and my Dell R515's
Hi All, I'm finding my write performance is less than I would have expected. After spending some considerable amount of time testing several different configurations I can never seems to break over ~360mb/s write even when using tmpfs for journaling. So I've purchased 3x Dell R515's with 1 x AMD 6C CPU with 12 x 3TB SAS & 2 x 100GB Intel DC S3700 SSD's & 32GB Ram with the Perc H710p Raid controller and Dual Port 10GBE Network Cards. So first up I realise the SSD's were a mistake, I should have bought the 200GB Ones as they have considerably better write though put of ~375 Mb/s vs 200 Mb/s So to our Nodes Configuration, 2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, 12 Disks in a Single each in a Raid0 (like a JBOD Fashion) with a 1MB Stripe size, (Stripe size this part was particularly important because I found the stripe size matters considerably even on a single disk raid0. contrary to what you might read on the internet) Also each disk is configured with (write back cache) is enabled and (read head) disabled. For Networking, All nodes are connected via LACP bond with L3 hashing and using iperf I can get up to 16gbit/s tx and rx between the nodes. OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to upgrade kernel due to 10Gbit Intel NIC's driver issues) So this gives me 11 OSD's & 2 SSD's Per Node. Next I've tried several different configurations which I'll briefly describe 2 of which below, 1) Cluster Configuration 1, 33 OSD's with 6x SSD's as Journals, w/ 15GB Journals on SSD. # ceph osd pool create benchmark1 1800 1800 # rados bench -p benchmark1 180 write --no-cleanup -- Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 objects Total time run: 180.250417 Total writes made: 10152 Write size: 4194304 Bandwidth (MB/sec): 225.287 Stddev Bandwidth: 35.0897 Max bandwidth (MB/sec): 312 Min bandwidth (MB/sec): 0 Average Latency:0.284054 Stddev Latency: 0.199075 Max latency:1.46791 Min latency:0.038512 -- # rados bench -p benchmark1 180 seq - Total time run:43.782554 Total reads made: 10120 Read size:4194304 Bandwidth (MB/sec):924.569 Average Latency: 0.0691903 Max latency: 0.262542 Min latency: 0.015756 - In this configuration I found my write performance suffers a lot to the SSD's seem to be a bottleneck and my write performance using rados bench was around 224-230mb/s 2) Cluster Configuration 2, 33 OSD's with 1Gbyte Journals on tmpfs. # ceph osd pool create benchmark1 1800 1800 # rados bench -p benchmark1 180 write --no-cleanup -- Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 objects Total time run: 180.044669 Total writes made: 15328 Write size: 4194304 Bandwidth (MB/sec): 340.538 Stddev Bandwidth: 26.6096 Max bandwidth (MB/sec): 380 Min bandwidth (MB/sec): 0 Average Latency:0.187916 Stddev Latency: 0.0102989 Max latency:0.336581 Min latency:0.034475 -- # rados bench -p benchmark1 180 seq - Total time run:76.481303 Total reads made: 15328 Read size:4194304 Bandwidth (MB/sec):801.660 Average Latency: 0.079814 Max latency: 0.317827 Min latency: 0.016857 - Now it seems there is no bottleneck for journaling as we are using tmpfs, however still less then what I would expect write speed the sas disks are barely busy via iostat.. So I thought it might be a disk bus throughput issue. Next I completed some dd tests... This below is in a script dd-x.sh which executes the 11 readers or writers at once. 
dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct & dd if=/dev/zero of=/srv/
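On the journal SSDs specifically: plain sequential dd flatters them, because the filestore journal writes with O_DIRECT/O_DSYNC-style semantics. A closer (destructive!) test against an unused device or partition, device name being an example only:

  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
  dd if=/dev/zero of=/dev/sdX bs=4M count=1000 oflag=direct,dsync
  # enterprise drives with capacitors (e.g. the DC S3700s) tend to hold their rated speed here,
  # while many consumer drives collapse to a fraction of it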