[ceph-users] Failed to read JournalPointer - MDS error (mds rank 0 is damaged)
Hi,

We're using ceph 10.2.5 and cephfs. We had a weird incident where one monitor node (mon0r0), which was also the current active MDS, had some sort of meltdown. The monitor called elections on and off over ~1 hour, sometimes with 5-10 min in between. On every occasion the MDS also went through replay, reconnect, rejoin => active (it never switched to a standby MDS). Then, after about an hour of mostly working, it gave up with:

[ ... ]
2017-04-29 07:30:24.444980 7fe6d7e9c700 0 mds.beacon.mon0r0 handle_mds_beacon no longer laggy
2017-04-29 07:30:46.783817 7fe6d7e9c700 0 monclient: hunting for new mon
*< bunch of errors like this >*
2017-04-29 07:31:11.782049 7fe6d7e9c700 1 mds.mon0r0 handle_mds_map i (172.16.130.10:6811/8235) dne in the mdsmap, respawning myself
2017-04-29 07:31:11.782054 7fe6d7e9c700 1 mds.mon0r0 respawn
2017-04-29 07:31:11.782056 7fe6d7e9c700 1 mds.mon0r0 e: '/usr/bin/ceph-mds'
2017-04-29 07:31:11.782058 7fe6d7e9c700 1 mds.mon0r0 0: '/usr/bin/ceph-mds'
2017-04-29 07:31:11.782060 7fe6d7e9c700 1 mds.mon0r0 1: '--cluster=ceph'
2017-04-29 07:31:11.782071 7fe6d7e9c700 1 mds.mon0r0 2: '-i'
2017-04-29 07:31:11.782072 7fe6d7e9c700 1 mds.mon0r0 3: 'mon0r0'
2017-04-29 07:31:11.782073 7fe6d7e9c700 1 mds.mon0r0 4: '-f'
2017-04-29 07:31:11.782074 7fe6d7e9c700 1 mds.mon0r0 5: '--setuser'
2017-04-29 07:31:11.782075 7fe6d7e9c700 1 mds.mon0r0 6: 'ceph'
2017-04-29 07:31:11.782076 7fe6d7e9c700 1 mds.mon0r0 7: '--setgroup'
2017-04-29 07:31:11.782077 7fe6d7e9c700 1 mds.mon0r0 8: 'ceph'
2017-04-29 07:31:11.782106 7fe6d7e9c700 1 mds.mon0r0 exe_path /usr/bin/ceph-mds
2017-04-29 07:31:11.799625 7f5487a92180 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-mds, pid 8235
2017-04-29 07:31:11.800097 7f5487a92180 0 pidfile_write: ignore empty --pid-file
2017-04-29 07:31:12.746033 7f5481a40700 1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:32:01.941948 7f5481a40700 0 monclient: hunting for new mon
2017-04-29 07:32:48.186313 7f5481a40700 1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:33:04.539413 7f5481a40700 0 monclient: hunting for new mon
2017-04-29 07:33:09.560848 7f5481a40700 1 mds.0.764 handle_mds_map i am now mds.0.764
2017-04-29 07:33:09.560857 7f5481a40700 1 mds.0.764 handle_mds_map state change up:boot --> up:replay
2017-04-29 07:33:09.560879 7f5481a40700 1 mds.0.764 replay_start
2017-04-29 07:33:09.560882 7f5481a40700 1 mds.0.764 recovery set is
2017-04-29 07:33:09.560890 7f5481a40700 1 mds.0.764 waiting for osdmap 17134 (which blacklists prior instance)
2017-04-29 07:33:09.571120 7f547c733700 -1 log_channel(cluster) log [ERR] : failed to read JournalPointer: -108 ((108) Cannot send after transport endpoint shutdown)
2017-04-29 07:33:09.575176 7f547c733700 1 mds.mon0r0 respawn
2017-04-29 07:33:09.575185 7f547c733700 1 mds.mon0r0 e: '/usr/bin/ceph-mds'
2017-04-29 07:33:09.575187 7f547c733700 1 mds.mon0r0 0: '/usr/bin/ceph-mds'
2017-04-29 07:33:09.575189 7f547c733700 1 mds.mon0r0 1: '--cluster=ceph'
2017-04-29 07:33:09.575191 7f547c733700 1 mds.mon0r0 2: '-i'
2017-04-29 07:33:09.575192 7f547c733700 1 mds.mon0r0 3: 'mon0r0'
2017-04-29 07:33:09.575193 7f547c733700 1 mds.mon0r0 4: '-f'
2017-04-29 07:33:09.575194 7f547c733700 1 mds.mon0r0 5: '--setuser'
2017-04-29 07:33:09.575195 7f547c733700 1 mds.mon0r0 6: 'ceph'
2017-04-29 07:33:09.575196 7f547c733700 1 mds.mon0r0 7: '--setgroup'
2017-04-29 07:33:09.575197 7f547c733700 1 mds.mon0r0 8: 'ceph'
2017-04-29 07:33:09.575221 7f547c733700 1 mds.mon0r0 exe_path /usr/bin/ceph-mds
2017-04-29 07:33:09.589993 7f9a9d0d1180 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-mds, pid 8235
2017-04-29 07:33:09.590461 7f9a9d0d1180 0 pidfile_write: ignore empty --pid-file
2017-04-29 07:33:10.567466 7f9a9707f700 1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:34:46.972551 7f9a9707f700 0 monclient: hunting for new mon
2017-04-29 07:34:50.583321 7f9a9707f700 1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:35:24.575818 7f9a9707f700 0 monclient: hunting for new mon
2017-04-29 07:36:31.988193 7f9a9707f700 0 monclient: hunting for new mon
2017-04-29 07:38:06.999197 7f9a9707f700 0 monclient: hunting for new mon
2017-04-29 07:39:12.009821 7f9a9707f700 0 monclient: hunting for new mon
2017-04-29 07:39:21.855605 7f9a9707f700 1 mds.mon0r0 handle_mds_map standby
2017-04-29 07:41:39.994418 7f9a9707f700 0 monclient: hunting for new mon
*< Continues like the above until mds was restarted ~1 h later >*
[ ... ]
2017-04-29 08:49:22.204803 7f9a9300 -1 mds.mon0r0 *** got signal Terminated ***
2017-04-29 08:49:22.204821 7f9a9300 1 mds.mon0r0 suicide. wanted state up:standby
2017-04-29 09:00:31.510392 7ff9acd5e180 0 set uid:gid to 64045:64045 (ceph:ceph)
2017-04-29 09:00:31.510412 7ff9acd5e180 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-mds, pid 23804
2017-04-29 09:00:31.510853 7ff9acd5e180 0 pidfile_write: ignore empty --pid-file
2017-04-29
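A side note for anyone grepping this later: the "-108" in "failed to read JournalPointer: -108" is just a negative Linux errno, and decoding it fits the "waiting for osdmap 17134 (which blacklists prior instance)" line - the MDS's connection to the OSDs had been shut down. A quick lookup (assumes a Linux errno table; python3 is only used here as a convenient table):

```shell
# Decode the -108 from "failed to read JournalPointer: -108".
python3 -c 'import errno, os; print(errno.errorcode[108], "-", os.strerror(108))'
```

On Linux this prints ESHUTDOWN - Cannot send after transport endpoint shutdown, which matches the log text.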
Re: [ceph-users] Find out the location of OSD Journal
Hi,

Inside your mounted osd dir there is a symlink - journal - pointing to the file or disk/partition used for it.

Cheers,
Martin

On Thu, May 7, 2015 at 11:06 AM, Patrik Plank pat...@plank.me wrote:
Hi, i cant remember on which drive I install which OSD journal :-|| Is there any command to show this? thanks regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
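To turn the tip above into a one-off command (assuming the default /var/lib/ceph/osd/<cluster>-<id> mount points - adjust the path if your setup differs):

```shell
# Resolve each OSD's journal symlink to the device/file behind it.
# Harmless if nothing matches the glob.
for osd in /var/lib/ceph/osd/*; do
    if [ -L "$osd/journal" ]; then
        echo "$osd -> $(readlink -f "$osd/journal")"
    fi
done
```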
Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
Hi Andrei,

If there is one thing I've come to understand by now, it is that ceph configs, performance, hardware and - well - everything seems to vary on an almost per-user basis. I do not recognize that latency issue either; this is from one of our nodes (4x 500GB Samsung 840 Pro - sd[c-f]) which has been running for 600+ days (so the iostat -x is an avg of that):

# uptime
16:24:57 up 611 days, 4:03, 1 user, load average: 1.18, 1.55, 1.72
# iostat -x
[ ... ]
Device: rrqm/s wrqm/s  r/s   w/s    rkB/s   wkB/s  avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc     0.00   0.16   4.87  22.62  344.18  458.65  58.41    0.05     1.92  0.45    2.24    0.76  2.10
sdd     0.00   0.12   4.37  20.02  317.98  437.95  61.98    0.05     1.90  0.44    2.21    0.78  1.91
sde     0.00   0.12   4.17  19.33  302.45  403.02  60.02    0.04     1.87  0.43    2.18    0.77  1.80
sdf     0.00   0.12   4.51  20.84  322.84  439.70  60.17    0.05     1.84  0.43    2.15    0.76  1.93
[ ... ]

Granted, we do not have very high usage on this cluster on an ssd-basis and it might change as we put more load on it, but we will deal with that then. I do not think ~2ms access time is either good or bad. This is from another cluster we operate - this one has an Intel DC S3700 800GB ssd (sdb):

# uptime
09:37:26 up 654 days, 8:40, 1 user, load average: 0.33, 0.40, 0.54
# iostat -x
[ ... ]
sdb     0.01   1.49  39.76  86.79 1252.80 2096.98  52.94    0.02     0.76  1.22    0.54    0.41  5.21
[ ... ]

The comparison is a bit misleading, as the latter has just 3 disks + a hardware raid controller with a 1GB backed cache, whereas the first is a 'cheap' dumb 12-disk JBOD IT-mode setup. All the ssds in both clusters have 3 partitions - 1 ceph-data and 2 journal partitions (1 journal for the ssd itself and 1 journal for 1 platter disk). The Intel ssd is very sturdy though - it has had a 2.1MB/sec avg. write over 654 days - that is somewhere around 120TB so far. But ultimately it boils down to what you need - in our use case the latter cluster has to be rock-stable and performing - and we chose the Intel ones based on that.
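(For the curious, that ~120TB figure is just the sustained rate times the uptime - a quick back-of-the-envelope, in decimal TB as drive vendors count:)

```shell
# 2.1 MB/s * 86400 s/day * 654 days, in decimal TB.
awk 'BEGIN { printf "%.1f TB\n", 2.1 * 86400 * 654 / 1e6 }'
```

which comes out at ~118.7 TB, i.e. "somewhere around 120TB".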
The first one we don't really care if we lose a node or two, and we replace disks every month or whenever it fits into our going-to-datacenter schedule - we wanted an ok'ish performing cluster and focused more on total space / price than high-performing hardware. The fantastic thing is we are not locked into any specific hardware, and we can replace any of it if we need to and/or find it is suddenly starting to have issues.

Cheers,
Martin

On Sat, Feb 28, 2015 at 2:55 PM, Andrei Mikhailovsky and...@arhont.com wrote:
Martin,

I have been using Samsung 840 Pro for journals for about 2 years now and have just replaced all my Samsung drives with Intel. We have found a lot of performance issues with the 840 Pro (we are using the 128GB model). In particular, a very strange behaviour with using 4 partitions (with 50% underprovisioning left as empty unpartitioned space on the drive) where the drive would grind to almost a halt after a few weeks of use. I was getting 100% utilisation on the drives doing just 3-4MB/s writes. This was not the case when I installed the new drives. Manual trimming helps for a few weeks until the same happens again. This has been happening with all the 840 Pro ssds that we have, and contacting Samsung support has proven to be utterly useless. They do not want to speak with you until you install Windows and run their monkey utility ((.

Also, I've noticed the latencies of the Samsung 840 Pro ssd drives to be about 15-20x higher compared with consumer grade Intel drives, like the Intel 520. According to ceph osd perf, I would consistently get higher figures on the osds with the Samsung journal drive compared with the Intel drive on the same server. Something like 2-3ms for Intel vs 40-50ms for the Samsungs. At some point we had enough with the Samsungs and scrapped them.
Andrei

--
*From: *Martin B Nielsen mar...@unity3d.com
*To: *Philippe Schwarz p...@schwarz-fr.net
*Cc: *ceph-users@lists.ceph.com
*Sent: *Saturday, 28 February, 2015 11:51:57 AM
*Subject: *Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

Hi,

I cannot recognize that picture; we've been using Samsung 840 Pro in production for almost 2 years now - and have had 1 fail. We run an 8-node mixed ssd/platter cluster with 4x Samsung 840 Pro (500GB) in each, so that is 32x ssd. They've written ~25TB data in avg each.

Using the dd you had inside an existing semi-busy mysql-guest I get:
102400000 bytes (102 MB) copied, 5.58218 s, 18.3 MB/s

Which is still not a lot, but I think it is more a limitation of our setup/load. We are using dumpling.

All that aside, I would prob. go with something tried and tested if I was to redo it today - we haven't had any issues, but it is still nice to use something
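(The exact dd command from that thread isn't quoted here, so the flags and sizes below are illustrative only - but a common form of this kind of sync-write throughput test looks like this; oflag=dsync makes every block wait for the write to hit the device, which is roughly what journal writes look like:)

```shell
# Illustrative sync-write test; bs/count/path are made up for this sketch.
dd if=/dev/zero of=/tmp/dd-sync-test bs=1M count=64 oflag=dsync
rm -f /tmp/dd-sync-test
```

Run it against the filesystem (or raw partition) you intend to put journals on, not /tmp, for a meaningful number.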
Re: [ceph-users] error adding OSD to crushmap
Hi Luis,

I might remember wrong, but don't you need to actually create the osd first (ceph osd create)? Then you can assign it a position using the cli crush rules. Like Jason said, can you send the ceph osd tree output?

Cheers,
Martin

On Mon, Jan 12, 2015 at 1:45 PM, Luis Periquito periqu...@gmail.com wrote:
Hi all,

I've been trying to add a few new OSDs, and as I manage everything with puppet, I was manually adding them via the CLI. At one point it adds the OSD to the crush map using:
# ceph osd crush add 6 0.0 root=default
but I get
Error ENOENT: osd.6 does not exist. create it before updating the crush map
If I read correctly this command should be the correct one to add the OSD to the crush map... is this a bug? I'm running the latest firefly 0.80.7. thanks

PS: I just edited the crushmap, but it would make it a lot easier to do it by the CLI commands...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD MTBF
A bit late getting back on this one.

On Wed, Oct 1, 2014 at 5:05 PM, Christian Balzer ch...@gol.com wrote:
smartctl states something like Wear = 092%, Hours = 12883, Data written = 15321.83 TB avg on those. I think that is ~30TB/day if I'm doing the calc right.

Something very much does not add up there. Either you've written 15321.83 GB on those drives, making it about 30GB/day and well within the Samsung specs, or you've written 10-20 times the expected TBW level of those drives...

My bad, I forgot to say the Wear indicator here (92%) is sorta backwards - so it means it still has 92% to go before reaching the expected TBW limit. I agree with what Massimiliano Cuttini wrote later as well - if your io boundaries are well within the expected TBW over the lifetime, I see no reason to go for more expensive disks. Just monitor for wear and have a few in stock ready for replacement.

Regarding the table of ssds and vendors:

Brand    Model            TBW      €    €/TBW
Intel    S3500 120Go        70    122   1,74
Intel    S3500 240Go       140    225   1,60
Intel    S3700 100Go      1873    220   0,11
Intel    S3700 200Go      3737    400   0,10
Samsung  840 Pro 120Go      70    120   1,71

I don't disagree with the above - but the table assumes you'll wear out your SSD. Adjust the wear level and the price will change proportionally - if you're only writing 50-100TB/year per ssd then the value will heavily swing in the cheaper consumer grade ssd's favor. It is all about your estimated usage pattern and whether they're 'good enough' for your scenario or not (and/or you trust that vendor). In my experience ceph seldom (if ever) maxes out the io of an ssd - it is much more likely to hit cpu or network limits before coming to that.

Cheers,
Martin

In the article I mentioned previously:
http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size
The author clearly comes with a relationship of durability versus SSD size, as one would expect. But the Samsung homepage just stated 150TBW for all those models...
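(The price-per-TB-written column is easy to recompute from the listed prices and TBW figures - note plain rounding differs by a cent from the quoted table in a couple of spots, which suggests the original was truncated rather than rounded:)

```shell
# Price / TBW for each drive in the table above.
awk 'BEGIN {
    printf "%-15s %6.2f\n", "S3500 120Go",   122 / 70
    printf "%-15s %6.2f\n", "S3500 240Go",   225 / 140
    printf "%-15s %6.2f\n", "S3700 100Go",   220 / 1873
    printf "%-15s %6.2f\n", "S3700 200Go",   400 / 3737
    printf "%-15s %6.2f\n", "840 Pro 120Go", 120 / 70
}'
```

Either way, the ordering is the same: the S3700s cost an order of magnitude less per TB actually written.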
Christian

Not to advertise or say every samsung 840 ssd is like this:
http://www.vojcik.net/samsung-ssd-840-endurance-destruct-test/

Seen it before, but I have a feeling that this test doesn't quite put the same strain on the poor NANDs as Emmanuel's environment.

Christian

Cheers,
Martin

On Wed, Oct 1, 2014 at 10:18 AM, Christian Balzer ch...@gol.com wrote:
On Wed, 1 Oct 2014 09:28:12 +0200 Kasper Dieter wrote:
On Tue, Sep 30, 2014 at 04:38:41PM +0200, Mark Nelson wrote:
On 09/29/2014 03:58 AM, Dan Van Der Ster wrote:

Hi Emmanuel, This is interesting, because we've had sales guys telling us that those Samsung drives are definitely the best for a Ceph journal O_o !

Our sales guys or Samsung sales guys? :) If it was ours, let me know. The conventional wisdom has been to use the Intel DC S3700 because of its massive durability.

The S3700 is definitely one of the better drives on the market for Ceph journals. Some of the higher end PCIe SSDs have pretty high durability (and performance) as well, but cost more (though you can save SAS bay space, so it's a trade-off).

Intel P3700 could be an alternative with 10 Drive-Writes/Day for 5 years (see attachment)

They're certainly nice and competitively priced (TBW/$ wise at least). However as I said in another thread, once your SSDs start to outlive your planned server deployment time (in our case 5 years) that's probably good enough. It's all about finding the balance between cost, speed (BW and IOPS), durability and space.

For example I'm currently building a cluster based on 2U, 12 hotswap bay servers (because I already had 2 floating around) and am using 4 100GB DC S3700s (at US$200 each) and 8 HDDs in them. Putting in a 400GB DC P3700 (US$1200) instead and 4 more HDDs would have pushed me over the budget and left me with a less than 30% used SSD 5 years later, at a time when we clearly can expect these things to be massively faster and cheaper.

Now if you're actually having a cluster that would wear out a P3700 in 5 years (or you're planning to run your machines until they burst into flames), then that's another story. ^.^

Christian

-Dieter

Anyway, I'm curious what do the SMART counters say on your SSDs? Are they really failing due to worn out P/E cycles or is it something else?

Cheers, Dan

On 29 Sep 2014, at 10:31, Emmanuel Lacour elac...@easter-eggs.com wrote:
Dear ceph users,

we are managing ceph clusters since 1 year now. Our setup is typically made of Supermicro servers with OSD sata drives and journal on SSD. Those SSDs are all failing one after the other after one year :( We used Samsung 850 pro (120Go) with two setups (small
Re: [ceph-users] SSD MTBF
Hi,

We settled on Samsung 840 Pro 240GB drives 1½ years ago and we've been happy so far. We've over-provisioned them a lot (left 120GB unpartitioned). We have 16x 240GB and 32x 500GB - we've lost 1x 500GB so far.

smartctl states something like Wear = 092%, Hours = 12883, Data written = 15321.83 TB avg on those. I think that is ~30TB/day if I'm doing the calc right.

Not to advertise or say every samsung 840 ssd is like this:
http://www.vojcik.net/samsung-ssd-840-endurance-destruct-test/

Cheers,
Martin

On Wed, Oct 1, 2014 at 10:18 AM, Christian Balzer ch...@gol.com wrote:
On Wed, 1 Oct 2014 09:28:12 +0200 Kasper Dieter wrote:
On Tue, Sep 30, 2014 at 04:38:41PM +0200, Mark Nelson wrote:
On 09/29/2014 03:58 AM, Dan Van Der Ster wrote:

Hi Emmanuel, This is interesting, because we've had sales guys telling us that those Samsung drives are definitely the best for a Ceph journal O_o !

Our sales guys or Samsung sales guys? :) If it was ours, let me know. The conventional wisdom has been to use the Intel DC S3700 because of its massive durability.

The S3700 is definitely one of the better drives on the market for Ceph journals. Some of the higher end PCIe SSDs have pretty high durability (and performance) as well, but cost more (though you can save SAS bay space, so it's a trade-off).

Intel P3700 could be an alternative with 10 Drive-Writes/Day for 5 years (see attachment)

They're certainly nice and competitively priced (TBW/$ wise at least). However as I said in another thread, once your SSDs start to outlive your planned server deployment time (in our case 5 years) that's probably good enough. It's all about finding the balance between cost, speed (BW and IOPS), durability and space.

For example I'm currently building a cluster based on 2U, 12 hotswap bay servers (because I already had 2 floating around) and am using 4 100GB DC S3700s (at US$200 each) and 8 HDDs in them. Putting in a 400GB DC P3700 (US$1200) instead and 4 more HDDs would have pushed me over the budget and left me with a less than 30% used SSD 5 years later, at a time when we clearly can expect these things to be massively faster and cheaper.

Now if you're actually having a cluster that would wear out a P3700 in 5 years (or you're planning to run your machines until they burst into flames), then that's another story. ^.^

Christian

-Dieter

Anyway, I'm curious what do the SMART counters say on your SSDs? Are they really failing due to worn out P/E cycles or is it something else?

Cheers, Dan

On 29 Sep 2014, at 10:31, Emmanuel Lacour elac...@easter-eggs.com wrote:
Dear ceph users,

we are managing ceph clusters since 1 year now. Our setup is typically made of Supermicro servers with OSD sata drives and journal on SSD. Those SSDs are all failing one after the other after one year :(

We used Samsung 850 pro (120Go) with two setups (small nodes with 2 ssd, 2 HD in 1U):
1) raid 1 :( (bad idea, each SSD supports all the OSD journal writes :()
2) raid 1 for OS (nearly no writes) and dedicated partitions for journals (one per OSD)

I'm convinced that the second setup is better and we are migrating the old setup to this one. Though, statistics give 60GB (option 2) to 100GB (option 1) of writes per day on the SSDs on a not really overloaded cluster. Samsung claims to give 5 years warranty if under 40GB/day. Those numbers seem very low to me.

What are your experiences on this? What write volumes do you encounter, on which SSD models, which setup and what MTBF?
--
Easter-eggs Spécialiste GNU/Linux
44-46 rue de l'Ouest - 75014 Paris - France - Métro Gaité
Phone: +33 (0) 1 43 35 00 37 - Fax: +33 (0) 1 43 35 00 76
mailto:elac...@easter-eggs.com - http://www.easter-eggs.com

--
Christian Balzer    Network/Systems Engineer
ch...@gol.com    Global OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
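(On the ~30TB/day estimate in this thread: whatever the raw unit of that smartctl counter really is - TB or GB, as debated above - the per-day arithmetic is the same:)

```shell
# 15321.83 (TB?) written over 12883 power-on hours, per day.
awk 'BEGIN { printf "%.1f per day\n", 15321.83 / (12883 / 24) }'
```

which gives ~28.5, i.e. the quoted "~30TB/day" if the counter really is in TB.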
Re: [ceph-users] resizing the OSD
Hi,

Or did you mean some OSDs are near full while others are under-utilized?

On Sat, Sep 6, 2014 at 5:04 PM, Christian Balzer ch...@gol.com wrote:
Hello,

On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
Hello Cephers, We created a ceph cluster with 100 OSD, 5 MON and 1 MDS and most of the stuff seems to be working fine but we are seeing some degrading on the osd's due to lack of space on the osd's.

Please elaborate on that degradation.

Is there a way to resize the OSD without bringing the cluster down?

Define both resize and cluster down. As in, resizing how? Are your current OSDs on disks/LVMs that are not fully used and thus could be grown? What is the size of your current OSDs?

The normal way of growing a cluster is to add more OSDs. Preferably of the same size and same performance disks. This will not only simplify things immensely but also make them a lot more predictable. This of course depends on your use case and usage patterns, but often when running out of space you're also running out of other resources like CPU, memory or IOPS of the disks involved. So adding more instead of growing them is most likely the way forward.

If you were to replace actual disks with larger ones, take them (the OSDs) out one at a time and re-add them. If you're using ceph-deploy, it will use the disk size as the basic weight; if you're doing things manually, make sure to specify that size/weight accordingly. Again, you do want to do this for all disks to keep things uniform.

Just want to emphasize this - if your disks already have high utilization and you add a [much] larger drive and it auto-weights to say 2 or 3x the other disks, that disk will have that much higher utilization and will most likely max out and bottleneck your cluster. So keep that in mind :).

Cheers,
Martin

If your cluster (pools really) are set to a replica size of at least 2 (risky!) or 3 (as per the Firefly default), taking a single OSD out would of course never bring the cluster down.
However taking an OSD out and/or adding a new one will cause data movement that might impact your cluster's performance.

Regards,
Christian

--
Christian Balzer    Network/Systems Engineer
ch...@gol.com    Global OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
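(To make the auto-weighting point concrete: my understanding - an assumption worth checking against your own `ceph osd tree` - is that the size-based default crush weight is the disk size in TiB, so a 4TB disk lands at roughly 4x the weight, data, and IO of a 1TB disk:)

```shell
# Size-based default crush weight = size in TiB (assumed convention).
awk 'BEGIN {
    printf "4TB disk: %.2f\n", 4e12 / 2^40
    printf "1TB disk: %.2f\n", 1e12 / 2^40
}'
```

If the rest of your cluster is 1TB disks weighted ~0.91, adding the new disk at 3.64 means ~4x the PGs on one spindle - often better to `ceph osd crush reweight` it down to match until you can even things out.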
Re: [ceph-users] One stuck PG
Hi Erwin,

Did you try to restart the primary osd for that pg (osd.24)? Sometimes it needs a little ..nudge that way. Otherwise, what does ceph pg dump say about that pg?

Cheers,
Martin

On Thu, Sep 4, 2014 at 9:00 AM, Erwin Lubbers c...@erwin.lubbers.org wrote:
Hi,

My cluster is giving one stuck pg which seems to be backfilling for days now. Any suggestions on how to solve it?

HEALTH_WARN 1 pgs backfilling; 1 pgs stuck unclean; recovery 32/6000626 degraded (0.001%)
pg 206.3f is stuck unclean for 557655.601540, current state active+remapped+backfilling, last acting [24,28,3,44]
pg 206.3f is active+remapped+backfilling, acting [24,28,3,44]
recovery 32/6000626 degraded (0.001%)

Regards,
Erwin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
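(A quick way to pull just the unhealthy pgs out of `ceph pg dump` - this assumes the plain-text layout with the pg id in column 1 and the state in column 2; check the header line on your version, it varies:)

```shell
# List any pg whose state is not plain active+clean.
ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && $2 != "active+clean" { print $1, $2 }'
```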
Re: [ceph-users] SSD journal deployment experiences
Hi Dan,

We took a different approach (and our cluster is tiny compared to many others) - we have two pools; normal and ssd. We use 14 disks in each osd-server; 8 platter and 4 ssd for ceph, and 2 ssd for OS/journals. We partitioned the two OS ssds as raid1, using about half the space for the OS and leaving the rest on each for 2x journals and unprovisioned space; so the OS disks each hold 2x platter journals. On top of that, our ssd-pool disks also hold 2x journals; their own + an additional one from a platter disk. We have 8 osd-nodes, so whenever an ssd fails we lose 2 osds (but never more).

We've had this system in production for ~1½ years now and so far we've had 1 ssd and 2 platter disks fail. We run a couple of hundred vm-guests on it and use ~60TB. On a daily basis we avg. 30MB/sec r/w and ~600 iops, so not very high usage. The times we lost disks we hardly noticed. All SSDs (OS included) have a general utilization of 5%, platter disks near 10%.

We did a lot of initial testing about putting journals on the OS-ssd as well as extra on the ssd-osd, but we didn't find much difference or the high latencies others have experienced. When/if we notice otherwise we'll prob. switch to pure ssds as journal holders.

We originally deployed using saltstack and even though we have automated replacing disks we still do it manually 'just to be sure'. It takes 5-10 min to replace an old disk and get it backfilling, so I don't expect us to spend any time automating this. Recovering 2 disks at once takes a long time for us, but we've intentionally set backfilling low and it is not noticeable on the cluster when it happens.

Anyways, we have pretty low cluster usage, but in our experience ssds seem to handle the constant load very well.
Cheers,
Martin

On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
Dear Cephalopods,

In a few weeks we will receive a batch of 200GB Intel DC S3700's to augment our cluster, and I'd like to hear your practical experience and discuss options how best to deploy these. We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so they will become 20 OSDs + 4 SSDs per server.

Until recently I've been planning to use the traditional deployment: 5 journal partitions per SSD. But as SSD-day approaches, I'm growing less comfortable with the idea of 5 OSDs going down every time an SSD fails, so perhaps there are better options out there.

Before getting into options, I'm curious about the real reliability of these drives:
1) How often are DC S3700's failing in your deployments?
2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the backfilling which results from an SSD failure? Have you considered tricks like increasing the down out interval so backfilling doesn't happen in this case (leaving time for the SSD to be replaced)?

Beyond the usual 5-partition deployment, is anyone running a RAID1 or RAID10 for the journals? If so, are you using the raw block devices or formatting them and storing the journals as files on the SSD array(s)? Recent discussions seem to indicate that XFS is just as fast as the block dev, since these drives are so fast.

Next, I wonder how people with puppet/chef/... are handling the creation/re-creation of the SSD devices. Are you just wiping and rebuilding all the dependent OSDs completely when the journal dev fails? I'm not keen on puppetizing the re-creation of journals for OSDs...

We also have this crazy idea of failing over to a local journal file in case an SSD fails. In this model, when an SSD fails we'd quickly create a new journal either on another SSD or on the local OSD filesystem, then restart the OSDs before backfilling started. Thoughts?
Lastly, I would also consider using 2 of the SSDs in a data pool (with the other 2 SSDs to hold 20 journals — probably in a RAID1 to avoid backfilling 10 OSDs when an SSD fails). If the 10-1 ratio of SSDs would perform adequately, that'd give us quite a few SSDs to build a dedicated high-IOPS pool.

I'd also appreciate any other suggestions/experiences which might be relevant.

Thanks!
Dan

--
Dan van der Ster || Data Storage Services || CERN IT Department
--

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
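(On the down-out-interval trick Dan asks about: that is a cluster-wide mon setting; a ceph.conf sketch with an illustrative value:)

```ini
[mon]
# Give yourself e.g. 10 minutes to swap a failed journal SSD before its
# OSDs are marked out and backfilling starts (the default is 300 seconds).
mon osd down out interval = 600
```

The trade-off being that during that window the affected PGs serve IO degraded, so don't set it longer than you can actually react.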
Re: [ceph-users] Huge issues with slow requests
Just echoing what Christian said.

Also, iirc the "currently waiting for subops from [...]" could also mean a problem on those osds, as it waits for an ack from them (I might remember wrong). If that is the case you might want to check in on osd 13 and 37 as well.

With the cluster load and size you should not have this problem; I'm pretty sure you're dealing with a rogue/faulty osd/node somewhere.

Cheers,
Martin

On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer ch...@gol.com wrote:
On Thu, 4 Sep 2014 12:02:13 +0200 David wrote:
Hi, We're running a ceph cluster with version: 0.67.7-1~bpo70+1. All of a sudden we're having issues with the cluster (running RBD images for kvm) with slow requests on all of the OSD servers. Any idea why and how to fix it?

You give us a Ceph version at least, but for anybody to make guesses we need much more information than a log spew. How many nodes/OSDs, OS, hardware, OSD details (FS, journals on SSDs), etc.

Run atop (in a sufficiently large terminal) on all your nodes and see if you can spot a bottleneck, like a disk being at 100% all the time with a much higher avio than the others. Looking at your logs, I'd pay particular attention to the disk holding osd.22. A single slow disk can bring a whole large cluster to a crawl.

If you're using a hardware controller with a battery backed up cache, check if that is fine; loss of the battery would switch from writeback to writethrough and massively slow down IOPS.
Regards,
Christian

2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451: 6860 pgs: 6860 active+clean; 12163 GB data, 36308 GB used, 142 TB / 178 TB avail; 634KB/s rd, 487KB/s wr, 90op/s
2014-09-04 11:56:29.510270 osd.22 [WRN] 15 slow requests, 1 included below; oldest blocked for 44.745754 secs
2014-09-04 11:56:29.510276 osd.22 [WRN] slow request 30.999821 seconds old, received at 2014-09-04 11:55:58.510424: osd_op(client.10731617.0:81868956 rbd_data.967e022eb141f2.0e72 [write 0~4194304] 3.c585cebe e13901) v4 currently waiting for subops from [37,13]
2014-09-04 11:56:30.510528 osd.22 [WRN] 21 slow requests, 6 included below; oldest blocked for 45.745989 secs
2014-09-04 11:56:30.510534 osd.22 [WRN] slow request 30.122555 seconds old, received at 2014-09-04 11:56:00.387925: osd_op(client.13425082.0:11962345 rbd_data.54f24c3d1b58ba.3753 [stat,write 1114112~8192] 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
2014-09-04 11:56:30.510537 osd.22 [WRN] slow request 30.122362 seconds old, received at 2014-09-04 11:56:00.388118: osd_op(client.13425082.0:11962352 rbd_data.54f24c3d1b58ba.3753 [stat,write 1134592~4096] 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
2014-09-04 11:56:30.510541 osd.22 [WRN] slow request 30.122298 seconds old, received at 2014-09-04 11:56:00.388182: osd_op(client.13425082.0:11962353 rbd_data.54f24c3d1b58ba.3753 [stat,write 4046848~8192] 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
2014-09-04 11:56:30.510544 osd.22 [WRN] slow request 30.121577 seconds old, received at 2014-09-04 11:56:00.388903: osd_op(client.13425082.0:11962374 rbd_data.54f24c3d1b58ba.47f2 [stat,write 2527232~4096] 3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]
2014-09-04 11:56:30.510548 osd.22 [WRN] slow request 30.121518 seconds old, received at 2014-09-04 11:56:00.388962: osd_op(client.13425082.0:11962375 rbd_data.54f24c3d1b58ba.47f2 [stat,write 3133440~4096] 3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]
2014-09-04 11:56:31.510706 osd.22 [WRN] 26 slow requests, 6 included below; oldest blocked for 46.746163 secs
2014-09-04 11:56:31.510711 osd.22 [WRN] slow request 31.035418 seconds old, received at 2014-09-04 11:56:00.475236: osd_op(client.9266625.0:135900595 rbd_data.42d6792eb141f2.bc00 [stat,write 2097152~4096] 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
2014-09-04 11:56:31.510715 osd.22 [WRN] slow request 31.035335 seconds old, received at 2014-09-04 11:56:00.475319: osd_op(client.9266625.0:135900596 rbd_data.42d6792eb141f2.bc00 [stat,write 2162688~4096] 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
2014-09-04 11:56:31.510718 osd.22 [WRN] slow request 31.035270 seconds old, received at 2014-09-04 11:56:00.475384: osd_op(client.9266625.0:135900597 rbd_data.42d6792eb141f2.bc00 [stat,write 2400256~16384] 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
2014-09-04 11:56:31.510721 osd.22 [WRN] slow request 31.035093 seconds old, received at 2014-09-04 11:56:00.475561: osd_op(client.9266625.0:135900598 rbd_data.42d6792eb141f2.bc00 [stat,write 2420736~4096] 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
2014-09-04 11:56:31.510724 osd.22 [WRN] slow request 31.034990 seconds old,
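(A quick way to tally which replica OSDs the slow requests are waiting on - a repeat offender here is a good candidate for the slow disk/node Christian mentions. Demo input below; on a real node pipe in /var/log/ceph/ceph-osd.<id>.log instead:)

```shell
# Count OSD ids appearing in "waiting for subops from [N,M]" lines.
grep -o 'subops from \[[0-9,]*\]' <<'EOF' | grep -o '[0-9,]\+' | tr ',' '\n' | sort -n | uniq -c | sort -rn
... v4 currently waiting for subops from [37,13]
... v4 currently waiting for subops from [13,42]
... v4 currently waiting for subops from [13,42]
EOF
```

With the demo input, osd 13 tops the list with 3 hits; in the logs quoted above, 13 and 37 show up repeatedly, which is why they are worth a look.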
Re: [ceph-users] Ceph Not getting into a clean state
Hi, I experienced exactly the same with 14.04 and the 0.79 release. It was a fresh clean install with the default crushmap and a ceph-deploy install as per the quick-start guide. Oddly enough, changing the replica size (incl. min_size) from 3 to 2 (and 2 to 1) and back again made it work. I didn't have time to look into replicating the issue. Cheers, Martin

On Thu, May 8, 2014 at 4:30 PM, Georg Höllrigl georg.hoellr...@xidras.com wrote: Hello, We've a fresh cluster setup - with Ubuntu 14.04 and ceph firefly. By now I've tried this multiple times - but the result stays the same and shows me lots of trouble (the cluster is empty, no client has accessed it):

#ceph -s
cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
health HEALTH_WARN 470 pgs stale; 470 pgs stuck stale; 18 pgs stuck unclean; 26 requests are blocked 32 sec
monmap e2: 3 mons at {ceph-m-01=10.0.0.100:6789/0,ceph-m-02=10.0.1.101:6789/0,ceph-m-03=10.0.1.102:6789/0}, election epoch 8, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03
osdmap e409: 9 osds: 9 up, 9 in
pgmap v1231: 480 pgs, 9 pools, 822 bytes data, 43 objects
9373 MB used, 78317 GB / 78326 GB avail
451 stale+active+clean
1 stale+active+clean+scrubbing
10 active+clean
18 stale+active+remapped

Anyone an idea what happens here? Should an empty cluster not show only active+clean pgs? Regards, Georg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Red Hat to acquire Inktank
First off, congrats to Inktank! I'm sure that with Red Hat backing the project it will see even quicker development. My only worry is support for future non-RHEL platforms; like many others we've built our ceph stack around Ubuntu, and I'm just hoping it won't deteriorate into something that is only built/tested around CentOS/Red Hat (i.e. moving the I+C from Ubuntu to CentOS/Red Hat only - http://ceph.com/docs/master/start/os-recommendations/ - and just keeping a basic build test around all other distros). I fear a political decision to only have those extra tests on CentOS/Red Hat would 'force' people to run it on CentOS/Red Hat eventually. Cheers, Martin

On Wed, Apr 30, 2014 at 2:18 PM, Sage Weil s...@inktank.com wrote: Today we are announcing some very big news: Red Hat is acquiring Inktank. We are very excited about what this means for Ceph, the community, the team, our partners, and our customers. Ceph has come a long way in the ten years since the first line of code was written, particularly over the last two years that Inktank has been focused on its development. The fifty members of the Inktank team, our partners, and the hundreds of other contributors have done amazing work in bringing us to where we are today. We believe that, as part of Red Hat, the Inktank team will be able to build a better quality Ceph storage platform that will benefit the entire ecosystem. Red Hat brings a broad base of expertise in building and delivering hardened software stacks as well as a wealth of resources that will help Ceph become the transformative and ubiquitous storage platform that we always believed it could be. For existing Inktank customers, this is going to mean turning a reliable and robust storage system into something that delivers even more value. In particular, joining forces with the Red Hat team will improve our ability to address problems at all layers of the storage stack, including in the kernel. 
We naturally recognize that many customers and users have built platforms based on other Linux distributions. We will continue to support these installations while we determine how to provide the best customer experience moving forward and how the next iteration of the enterprise Ceph product will be structured. In the meantime, our team remains committed to keeping Ceph an open, multiplatform project that works in any environment where it makes sense, including other Linux distributions and non-Linux operating systems. Red Hat is one of only a handful of companies that I trust to steward the Ceph project. When we started Inktank two years ago, our goal was to build the business by making Ceph successful as a broad-based, collaborative open source project with a vibrant user, developer, and commercial community. Red Hat shares this vision. They are passionate about open source, and have demonstrated that they are strong and fair stewards with other critical projects (like KVM). Red Hat intends to administer the Ceph trademark in a manner that protects the ecosystem as a whole and creates a level playing field where everyone is held to the same standards of use. Similarly, policies like upstream first ensure that bug fixes and improvements that go into Ceph-derived products are always shared with the community to streamline development and benefit all members of the ecosystem. One important change that will take place involves Inktank's product strategy, in which some add-on software we have developed is proprietary. In contrast, Red Hat favors a pure open source model. That means that Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced. This is a big step forward for the Ceph community. Very little will change on day one as it will take some time to integrate the Inktank business and for any significant changes to happen with our engineering activities. 
However, we are very excited about what is coming next for Ceph and are looking forward to this new chapter. I'd like to thank everyone who has helped Ceph get to where we are today: the amazing research group at UCSC where it began, DreamHost for supporting us for so many years, the incredible Inktank team, and the many contributors and users that have helped shape the system. We continue to believe that robust, scalable, and completely open storage platforms like Ceph will transform a storage industry that is still dominated by proprietary systems. Let's make it happen! sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Live database files on Ceph
Hi, We're running mysql in a multi-master cluster (galera), mysql standalones, postgresql, mssql and oracle DBs on ceph RBD via QEMU/KVM. As someone else pointed out, it is usually faster with ceph, but sometimes you'll get some odd slow reads. Latency is our biggest enemy. Oracle comes with an awesome way to capture an insane amount of perf stats, and through that we can see that our avg. latency for writes is ~12ms, with reads slightly higher at around 15ms. In our use case that is acceptable. If we used local [SSD] disks this would be much lower (1-2ms). We've also experienced once that our galera cluster went out of sync due to a very stressed cluster/network (this particular cluster is saturated every now and then - both disk and network). We had to change the scheduler from cfq to deadline on most db-servers to get acceptable speeds; otherwise we encountered writes taking up to 2 sec whenever lots of seq. data had to be written. I wouldn't run super high precision/performance databases on it though. Your db performance will always reflect the status of your entire cluster system. I'd say that for anything not requiring extremely fine-tuned, always-consistent access times it runs very well. At the very least, if you plan to do that, I'd suggest finding some way to isolate and guarantee performance for your guests no matter how busy your cluster is (which I don't think you can do). We run with SSD journals and SSD backends for most of our db-stuff, as we found that using normal platter disks as backend could cause some issues if we hit a spikey period of cluster activity (even with ssd journals). Cheers, Martin

On Thu, Apr 3, 2014 at 8:04 PM, Brian Beverage bbever...@americandatanetwork.com wrote: I am looking at setting up a Ganeti cluster using KVM and CentOS. While looking at storage I first looked at Gluster but noticed in the documentation that it does not allow live database files to be saved to it. Does Ceph allow the use of LIVE database files being saved to it? 
If so, does the database perform well? We have a couple of database servers that will be virtualized. I would like to know what other Ceph users are doing with their virtual environments that contain databases. I do not want to be locked into a SAN. I also would like to do this without being locked into proprietary VM software. That is why Ganeti and KVM were the preferred choice. Thanks, Brian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDS debugging
Hi, I can see you're running mon, mds and osd on the same server. Also, from a quick glance you're using around 13GB resident memory. If you only have 16GB in your system I'm guessing you'll be swapping about now (or close). How much mem does the system hold? Also, how busy are the disks? Or is it primarily cpu-bound? Do you have many processes waiting for run time or high interrupt count? /Martin On Mon, Mar 31, 2014 at 1:49 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, Before the weekend we started some copying tests over ceph-fuse. Initially, this went ok. But then the performance started dropping gradually. Things are going very slow now: 2014-03-31 13:36:37.047423 mon.0 [INF] pgmap v265871: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 44747 kB/s rd, 216 kB/s wr, 10 op/s 2014-03-31 13:36:38.049286 mon.0 [INF] pgmap v265872: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 4069 B/s rd, 363 kB/s wr, 24 op/s 2014-03-31 13:36:39.057680 mon.0 [INF] pgmap v265873: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 5092 B/s rd, 151 kB/s wr, 22 op/s 2014-03-31 13:36:40.075718 mon.0 [INF] pgmap v265874: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 25961 B/s rd, 1527 B/s wr, 10 op/s 2014-03-31 13:36:41.087764 mon.0 [INF] pgmap v265875: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 71574 kB/s rd, 4564 B/s wr, 17 op/s 2014-03-31 13:36:42.109200 mon.0 [INF] pgmap v265876: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 71238 kB/s rd, 3534 B/s wr, 9 op/s 2014-03-31 13:36:43.128113 mon.0 [INF] pgmap v265877: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 4022 B/s rd, 116 kB/s wr, 24 op/s 2014-03-31 13:36:44.143382 mon.0 [INF] pgmap v265878: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB 
/ 130 TB avail; 8030 B/s rd, 117 kB/s wr, 29 op/s 2014-03-31 13:36:45.160405 mon.0 [INF] pgmap v265879: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 7049 B/s rd, 4531 B/s wr, 9 op/s

ceph-mds seems very busy, and also only one osd!

PID   USER PR NI VIRT  RES  SHR  S %CPU %MEM TIME+     COMMAND
54279 root 20 0  8561m 7.5g 4408 S 105.6 23.8 3202:05  ceph-mds
50242 root 20 0  1378m 373m 6452 S 0.7  1.2  523:38.77 ceph-osd
49446 root 18 -2 10644 356  320  S 0.0  0.0  0:00.00   udevd
49444 root 18 -2 10644 428  320  S 0.0  0.0  0:00.00   udevd
49319 root 20 0  1444m 405m 5684 S 0.0  1.3  513:41.13 ceph-osd
48452 root 20 0  1365m 364m 5636 S 0.0  1.1  551:52.31 ceph-osd
47641 root 20 0  1567m 388m 5880 S 0.0  1.2  754:50.60 ceph-osd
46811 root 20 0  1441m 393m 8256 S 0.0  1.2  603:11.26 ceph-osd
46028 root 20 0  1594m 398m 6156 S 0.0  1.2  657:22.16 ceph-osd
45275 root 20 0  1545m 510m 9920 S 18.9 1.6  943:11.99 ceph-osd
44532 root 20 0  1509m 395m 7380 S 0.0  1.2  665:30.66 ceph-osd
43835 root 20 0  1397m 384m 8292 S 0.0  1.2  466:35.47 ceph-osd
43146 root 20 0  1412m 393m 5884 S 0.0  1.2  506:42.07 ceph-osd
42496 root 20 0  1389m 364m 5292 S 0.0  1.1  522:37.70 ceph-osd
41863 root 20 0  1504m 393m 5864 S 0.0  1.2  462:58.11 ceph-osd
39035 root 20 0  918m  694m 3396 S 3.3  2.2  55:53.59  ceph-mon

Does this look familiar to someone? How can we debug this further? I have already set the debug level of mds to 5. There are a lot of 'lookup' entries, but I can't see any reported warnings or errors. Thanks! Kind regards, Kenneth ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] help, add mon failed lead to cluster failure
Hi, I experienced this from time to time with older releases of ceph, but haven't stumbled upon it for some time. Often I had to revert to the older state using http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors-from-an-unhealthy-cluster: dump the mon list, find the original monitors, kill the newest addition, inject the trimmed map and restart - then it should come online again. Cheers, Martin

On Wed, Mar 26, 2014 at 11:40 AM, duan.xuf...@zte.com.cn wrote: Hi, I just added a new mon to a healthy cluster by following the website manual (http://ceph.com/docs/master/rados/operations/add-or-rm-mons/, ADDING MONITORS) step by step, but when I executed step 6: ceph mon add mon-id ip[:port] the command didn't return; then I executed ceph -s on a healthy mon node and this command didn't return either. So I tried to restart the mon to recover the whole cluster, but it seems it never recovers. Please, can anyone tell me how to deal with it?

=== mon.storage1 ===
Starting Ceph mon.storage1 on storage1...
Starting ceph-create-keys on storage1... 
[root@storage1 ~]# ceph -s   //after restarting the mon, ceph -s still gives no output
[root@storage1 ceph]# tail ceph-mon.storage1.log
2014-03-26 18:20:33.338554 7f60dbb967a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 24214
2014-03-26 18:20:33.460282 7f60dbb967a0 1 mon.storage1@-1(probing) e2 preinit fsid 3429fd17-4a92-4d3b-a7fa-04adedb0da82
2014-03-26 18:20:33.460694 7f60dbb967a0 1 mon.storage1@-1(probing).pg v0 on_upgrade discarding in-core PGMap
2014-03-26 18:20:33.487899 7f60dbb967a0 0 mon.storage1@-1(probing) e2 my rank is now 0 (was -1)
2014-03-26 18:20:33.488575 7f60d6854700 0 -- 193.168.1.100:6789/0 193.168.1.133:6789/0 pipe(0x3f38280 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x3f19600).fault
2014-03-26 18:21:33.487686 7f60d8657700 0 mon.storage1@0(probing).data_health(0) update_stats avail 86% total 51606140 used 4324004 avail 44660696
2014-03-26 18:22:33.488091 7f60d8657700 0 mon.storage1@0(probing).data_health(0) update_stats avail 86% total 51606140 used 4324004 avail 44660696
2014-03-26 18:23:33.488500 7f60d8657700 0 mon.storage1@0(probing).data_health(0) update_stats avail 86% total 51606140 used 4324004 avail 44660696

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
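The "unhealthy cluster" recovery Martin links to boils down to monmap surgery. A hedged sketch of those steps, using this thread's mon names as examples (storage1 as the surviving mon, and a hypothetical storage2 as the stuck addition); stop the monitors before running any of it:

```shell
# Sketch of the removing-monitors-from-an-unhealthy-cluster procedure.
# Mon IDs and paths are illustrative, not verified against this cluster.
ceph-mon -i storage1 --extract-monmap /tmp/monmap   # dump the monmap from a good mon
monmaptool /tmp/monmap --print                      # inspect it; identify the broken addition
monmaptool /tmp/monmap --rm storage2                # drop the mon that never joined
ceph-mon -i storage1 --inject-monmap /tmp/monmap    # push the trimmed map back
service ceph start mon.storage1                     # restart; the old quorum should reform
```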
Re: [ceph-users] OSD Restarts cause excessively high load average and requests are blocked 32 sec
Hi, I can see ~17% hardware interrupts, which I find a little high - can you make sure the load is spread over all your cores (/proc/interrupts)? What about disk util once you restart them? Are they all 100% utilized, or is it 'only' mostly cpu-bound? Also, you're running a monitor on this node - how is the load on the nodes where you run a monitor compared to those where you don't? Cheers, Martin

On Thu, Mar 20, 2014 at 10:18 AM, Quenten Grasso qgra...@onq.com.au wrote: Hi All, I left out my OS/kernel version: Ubuntu 12.04.4 LTS w/ kernel 3.10.33-031033-generic (we upgrade our kernels to 3.10 due to Dell drivers). Here's an example of starting all the OSDs after a reboot:

top - 09:10:51 up 2 min, 1 user, load average: 332.93, 112.28, 39.96
Tasks: 310 total, 1 running, 309 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.3%us, 32.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 17.2%hi, 0.0%si, 0.0%st
Mem: 32917276k total, 6331224k used, 26586052k free, 1332k buffers
Swap: 33496060k total, 0k used, 33496060k free, 1474084k cached

PID   USER PR NI VIRT  RES  SHR S %CPU %MEM TIME+   COMMAND
15875 root 20 0  910m  381m 50m S 60   1.2  0:50.57 ceph-osd
2996  root 20 0  867m  330m 44m S 59   1.0  0:58.32 ceph-osd
4502  root 20 0  907m  372m 47m S 58   1.2  0:55.14 ceph-osd
12465 root 20 0  949m  418m 55m S 58   1.3  0:51.79 ceph-osd
4171  root 20 0  886m  348m 45m S 57   1.1  0:56.17 ceph-osd
3707  root 20 0  941m  405m 50m S 57   1.3  0:59.68 ceph-osd
3560  root 20 0  924m  394m 51m S 56   1.2  0:59.37 ceph-osd
4318  root 20 0  965m  435m 55m S 56   1.4  0:54.80 ceph-osd
3337  root 20 0  935m  407m 51m S 56   1.3  1:01.96 ceph-osd
3854  root 20 0  897m  366m 48m S 55   1.1  1:00.55 ceph-osd
3143  root 20 0  1364m 424m 24m S 16   1.3  1:08.72 ceph-osd
2509  root 20 0  652m  261m 62m S 2    0.8  0:26.42 ceph-mon
4     root 20 0  0     0    0   S 0    0.0  0:00.08 kworker/0:0

Regards, Quenten Grasso

*From:* ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of* Quenten Grasso *Sent:* Tuesday, 18 March 2014 10:19 PM *To:* 'ceph-users@lists.ceph.com' *Subject:* 
[ceph-users] OSD Restarts cause excessively high load average and requests are blocked 32 sec

Hi All, I'm trying to troubleshoot a strange issue with my Ceph cluster. We're running Ceph version 0.72.2. All nodes are Dell R515s w/ 6C AMD CPU, 32GB RAM, 12 x 3TB NearlineSAS drives and 2 x 100GB Intel DC S3700 SSDs for journals. All pools have a replica of 2 or better, i.e. metadata replica of 3. I have 55 OSDs in the cluster across 5 nodes. When I restart the OSDs on a single node (any node), the load average of that node shoots up to 230+ and the whole cluster starts blocking IO requests until it settles down, and then it's fine again. Any ideas on why the load average goes so crazy and starts to block IO?

Snips from my ceph.conf:

[osd]
osd data = /var/ceph/osd.$id
osd journal size = 15000
osd mkfs type = xfs
osd mkfs options xfs = -i size=2048 -f
osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k
osd max backfills = 5
osd recovery max active = 3

[osd.0]
host = pbnerbd01
public addr = 10.100.96.10
cluster addr = 10.100.128.10
osd journal = /dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
devs = /dev/sda4

/end

Thanks, Quenten ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
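Martin's /proc/interrupts suggestion can be checked with a one-liner. The excerpt below is hypothetical sample data standing in for the real file (on an affected node you would read /proc/interrupts directly); a heavy skew toward one CPU column means IRQs are pinned to a single core:

```shell
# Hypothetical two-CPU /proc/interrupts excerpt.
sample='           CPU0       CPU1
  51:    9000000        100   PCI-MSI-edge   eth0
  59:    4500000        200   PCI-MSI-edge   megasas'
# Sum interrupts per CPU column; a huge imbalance points at bad IRQ affinity.
totals=$(echo "$sample" | awk 'NR > 1 { cpu0 += $2; cpu1 += $3 }
                               END { printf "CPU0=%d CPU1=%d", cpu0, cpu1 }')
echo "$totals"
```

On a real node, `irqbalance` or manual writes to /proc/irq/*/smp_affinity are the usual fixes for a skew like this.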
Re: [ceph-users] Fluctuating I/O speed degrading over time
Hi Indra,

I'd probably start by looking at your nodes and check if the SSDs are saturated or if they have high write access times.

Any recommended benchmark tool to do this? Especially those specific to Ceph OSDs that will not cause any impact on overall performance?

A simple way would be using iostat. I think something like 'iostat 1 -x' will show both total disk utilization (rightmost column), and the 3 columns just before should depict read/write io-access time in realtime. For ssd they should be low (6-8 or below). Columns 3 and 4 (r/s, w/s - iops) might be of interest as well. Just 'iostat -x' should show the avg since uptime, iirc. A better way is probably collecting and graphing it - I think something like collectd or munin can do that for you. We use munin to keep historical data about all nodes' performance - we can see if performance drops over time, iops increase, or wait/latency times for disks explode. I think the default munin-node on ubuntu will capture this data automatically.

Maybe test them individually directly on the cluster.

Is it possible to test I/O speed from a client directly to a certain OSD on the cluster? From what I understand, the PGs are being randomly mapped to any of the OSDs (based on the crush map?).

I was thinking more low-level - like fio or bonnie++ or the like :). I think with fio you can get the most detailed picture of how your ssds perform in terms of throughput and iops.

At some point in time we accidentally had a node being reinstalled with a non-LTS image (13.04 I think) - and the kernel (3.5.something) had a bug/'feature' which caused lots of tcp segments to be retransmitted (approx. 1/100).

Do you have any information about this bug? We are using Ubuntu 13.04 for all our Ceph nodes. If you can refer me to any documentation on this bug and how to resolve this issue, I will appreciate it very much. 
To be honest, I can't recall - the server in question was hosted @SoftLayer in Dallas, and it was their techs asking us to upgrade the kernel after we found the issue with the high retransmit count and reported it. It was easy to just upgrade the kernel and test - and the issue went away. I didn't dig any deeper; if I remember, I'll try accessing the ticket Monday to get all the details if it is still there. Cheers, Martin

Looking forward to your reply, thank you. Cheers.

On Fri, Mar 7, 2014 at 6:10 PM, Martin B Nielsen mar...@unity3d.com wrote: Hi, I'd probably start by looking at your nodes and check if the SSDs are saturated or if they have high write access times. If any of that is true, does that account for all SSDs or just some of them? Maybe some of the disks need a trim. Maybe test them individually directly on the cluster. If you can't find anything with the disks, then try and look further up the stack: network, interrupts etc. At some point in time we accidentally had a node being reinstalled with a non-LTS image (13.04 I think) - and the kernel (3.5.something) had a bug/'feature' which caused lots of tcp segments to be retransmitted (approx. 1/100). This one node slowed down our entire cluster and caused high access times across the board. 'Upgrading' to LTS fixed it. As you say, it may just be that the increased utilization of the cluster causes it and that you'll 'just' have to add more nodes. Cheers, Martin

On Fri, Mar 7, 2014 at 10:50 AM, Indra Pramana in...@sg.or.id wrote: Hi, I have a Ceph cluster, currently with 5 osd servers and around 22 OSDs with SSD drives, and I have noted that the I/O speed, especially write access to the cluster, is degrading over time. When we first started the cluster we could get up to 250-300 MB/s write speed to the SSD cluster, but now we can only get up to half that mark. Furthermore, it now fluctuates, so sometimes I get slightly better speed but at other times very bad results. 
We started with 3 osd servers and 12 OSDs and gradually added more servers. We are using KVM hypervisors as the Ceph clients, and the connections between clients and servers and between the servers go through a 10 Gbps switch with jumbo frames enabled on all interfaces. Any advice on how I can start to troubleshoot what might have caused the degradation of the I/O speed? Does utilisation contribute to it (since we now have more users compared to when we started)? Any optimisation we can do to improve the I/O performance? Appreciate any advice, thank you. Cheers. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
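Martin's iostat advice above can be turned into a small filter. The columns below are a trimmed, hypothetical excerpt (a real check is simply `iostat -x 1` and reading the await and %util columns):

```shell
# Hypothetical trimmed `iostat -x` output: device, r/s, w/s, await (ms), %util.
sample='Device r/s w/s await %util
sda 120 300 2.1 45.0
sdb 110 290 14.8 99.2'
# Flag devices whose access time is above the healthy-SSD range (~6-8 ms)
# or that sit near 100% utilization.
busy=$(echo "$sample" | awk 'NR > 1 && ($4 > 8 || $5 > 90) { print $1 }')
echo "$busy"
```

A disk that shows up here repeatedly is the one worth benchmarking individually (fio) or checking for a missing trim.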
Re: [ceph-users] Fluctuating I/O speed degrading over time
Hi, I'd probably start by looking at your nodes and check if the SSDs are saturated or if they have high write access times. If any of that is true, does that account for all SSDs or just some of them? Maybe some of the disks need a trim. Maybe test them individually directly on the cluster. If you can't find anything with the disks, then try and look further up the stack: network, interrupts etc. At some point in time we accidentally had a node being reinstalled with a non-LTS image (13.04 I think) - and the kernel (3.5.something) had a bug/'feature' which caused lots of tcp segments to be retransmitted (approx. 1/100). This one node slowed down our entire cluster and caused high access times across the board. 'Upgrading' to LTS fixed it. As you say, it may just be that the increased utilization of the cluster causes it and that you'll 'just' have to add more nodes. Cheers, Martin

On Fri, Mar 7, 2014 at 10:50 AM, Indra Pramana in...@sg.or.id wrote: Hi, I have a Ceph cluster, currently with 5 osd servers and around 22 OSDs with SSD drives, and I have noted that the I/O speed, especially write access to the cluster, is degrading over time. When we first started the cluster we could get up to 250-300 MB/s write speed to the SSD cluster, but now we can only get up to half that mark. Furthermore, it now fluctuates, so sometimes I get slightly better speed but at other times very bad results. We started with 3 osd servers and 12 OSDs and gradually added more servers. We are using KVM hypervisors as the Ceph clients, and the connections between clients and servers and between the servers go through a 10 Gbps switch with jumbo frames enabled on all interfaces. Any advice on how I can start to troubleshoot what might have caused the degradation of the I/O speed? Does utilisation contribute to it (since we now have more users compared to when we started)? Any optimisation we can do to improve the I/O performance? Appreciate any advice, thank you. Cheers. 
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Trying to rescue a lost quorum
Hi, You can't form quorum with your monitors on cuttlefish if you're mixing pre-0.61.5 with anything 0.61.5+ (https://ceph.com/docs/master/release-notes/, see the section about 0.61.5). I'd advise installing pre-0.61.5, forming quorum, and then upgrading to 0.61.9 (if need be) - and then the latest dumpling on top. Cheers, Martin

On Fri, Feb 28, 2014 at 2:09 AM, Marc m...@shoowin.de wrote: Hi, thanks for the reply. I updated one of the new mons. And after a reasonably long init phase (inconsistent state), I am now seeing these:

2014-02-28 01:05:12.344648 7fe9d05cb700 0 cephx: verify_reply coudln't decrypt with error: error decoding block for decryption
2014-02-28 01:05:12.345599 7fe9d05cb700 0 -- X.Y.Z.207:6789/0 X.Y.Z.201:6789/0 pipe(0x14e1400 sd=21 :49082 s=1 pgs=5421935 cs=12 l=0).failed verifying authorize reply

with .207 being the updated mon and .201 being one of the old alive mons. I guess they don't understand each other? I would rather not try to update the mons running on servers that also host OSDs, especially since there seem to be communication issues between those versions... or am I reading this wrong? KR, Marc

On 28/02/2014 01:32, Gregory Farnum wrote: On Thu, Feb 27, 2014 at 4:25 PM, Marc m...@shoowin.de wrote: Hi, I was handed a Ceph cluster that had just lost quorum due to 2/3 mons (b,c) running out of disk space (using up 15GB each). We were trying to rescue this cluster without service downtime. As such we freed up some space to keep mon b running a while longer, which succeeded; quorum restored (a,b), mon c remained offline. Even though we have freed up some space on mon c's disk also, that mon just won't start. Its log file only says "ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 27846" and that's all she wrote - even when starting ceph-mon with -d, mind you. So we had a cluster with 2/3 mons up and wanted to add another mon, since it was only a matter of time until mon b failed again due to disk space. 
As such I added mon.g to the cluster, which took a long while to sync but now reports running. Then mon.h got added for the same reason. mon.h fails to start much the same as mon.c does. Still, that should leave us with 3/5 mons up. However, running ceph daemon mon.{g,h} mon_status on the respective node also blocks. The only output we get from those are fault messages. Ok, so now mon.g apparently crashed:

2014-02-28 00:11:48.861263 7f4728042700 -1 mon/Monitor.cc: In function 'void Monitor::sync_timeout(entity_inst_t)' thread 7f4728042700 time 2014-02-28 00:11:48.782305 mon/Monitor.cc: 1099: FAILED assert(sync_state == SYNC_STATE_CHUNKS)

... and now blocks trying to start, much like c and h. Long story short: is it possible to add .61.9 mons to a cluster running .61.2 on the 2 alive mons and all the osds? I'm guessing this is the last shot at trying to rescue the cluster without downtime.

That should be fine, and is likely (though not guaranteed) to resolve your sync issues -- although it's pretty unfortunate that you're that far behind on the point releases; they fixed a whole lot of sync issues and related things, and you might need to upgrade the existing monitors too in order to get the fixes you need... :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] questions about monitor data and ceph recovery
Hi Pavel, I will try to answer some of your questions:

My first question will be about the monitor data directory. How much space do I need to reserve for it? Can the monitor fs be corrupted if the monitor runs out of storage space?

We have about 20GB partitions for monitors - they really don't use much space, but in case you need to do some extra logging it is nice to have (and ceph doing max debug consumes scary amounts of space). Also, if you look in the monitor log, they constantly monitor for free space. I don't know what will happen if a monitor runs full (or close to full), but I'm guessing that monitor will simply be marked as down or stopped somehow. You can change some of the values for a mon regarding how much data to keep before trimming etc.

I also have questions about the ceph auto-recovery process. For example, I have two nodes with 8 drives each, each drive presented as a separate osd. The number of replicas = 2. I have written a crush ruleset which picks two nodes and one osd on each to store replicas. What happens in the following scenarios: 1. One drive in one node fails. Will ceph automatically re-replicate the affected objects? Where will the replicas be stored?

Yes - as long as you have available space on the node that lost one OSD, the data that was on that disk will be distributed across the remaining 7 OSDs on that node (according to your CRUSH rules).

1.1 The failed osd appears online again with all of its data. How will the ceph cluster deal with it?

This is just how I _think_ it works; please correct me if I'm wrong. All OSDs have an internal map (pg map) which is constantly updated throughout the cluster. When any OSD goes offline/down and is started back up, the latest pgmap of that OSD is 'diffed' against the latest map from the cluster, and the cluster can then generate a new map based on what it has/had and what is missing/updated, listing the objects the newly started OSD should have. 
Then it will start to replicate and only fetch the changed/new objects. Bottom line: this really just works, and works very well.

2. One node (with 8 osds) goes offline. Will ceph automatically replicate all objects on the remaining node to maintain number of replicas = 2?

No, because it can no longer satisfy your CRUSH rules. Your crush rule states 1x copy per node and it will keep it that way. The cluster will go into a degraded state until you can bring up another node (i.e. all your data is now very vulnerable). It is often suggested to run with a 3x replica if possible - or at the very least nr_nodes = replicas + 1. If you had to make it replicate on the remaining node, you'd have to change your CRUSH rule to replicate based on OSD and not node. But then you'll most likely have problems when 1 node dies, because objects could easily be on 2x OSDs on the failed node.

2.1 The failed node goes online again with all data. How will the ceph cluster deal with it?

Same as the above with the OSD. Cheers, Martin

Thanks in advance, Pavel. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
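For reference, a CRUSH rule of the kind Pavel describes - one OSD on each of two hosts - might look roughly like this (rule name, ruleset number and root are illustrative, not taken from his cluster):

```
rule replicated_two_hosts {
    ruleset 1
    type replicated
    min_size 2
    max_size 2
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

With `chooseleaf ... type host`, CRUSH picks at most one OSD per host, which is exactly why losing a whole node in scenario 2 leaves the second replica unplaceable and the cluster degraded.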
Re: [ceph-users] pages stuck unclean (but remapped)
Hi, I would prob. start by figuring out exactly which pgs are stuck unclean. You can do 'ceph pg dump | grep unclean' to get that info - then, if your theory holds, you should be able to verify the disk(s) in question. I cannot see any _too_full, so I am curious what the cause could be. You can also always adjust the weights manually if needed ( http://ceph.com/docs/master/rados/operations/control/#osd-subsystem ) with the (re)weight cmd. Cheers, Martin

On Mon, Feb 24, 2014 at 2:09 AM, Gautam Saxena gsax...@i-a-inc.com wrote:

> I have 19 pgs that are stuck unclean (see the ceph -s output below). This occurred after I executed 'ceph osd reweight-by-utilization 108' to resolve backfill_too_full messages, which I believe occurred because my OSDs vary significantly in size (from a low of 600GB to a high of 3TB). How can I get ceph to move these pgs out of stuck-unclean? (And why is this occurring anyway?) My best guess at a fix (though I don't know why) is that I need to run 'ceph osd crush tunables optimal'. However, my kernel version (on a fully up-to-date CentOS 6.5) is 2.6.32, which is well below the minimum required version of 3.6 stated in the documentation (http://ceph.com/docs/master/rados/operations/crush-map/) - so if I must run 'ceph osd crush tunables optimal' to fix this, I presume I must upgrade my kernel first, right? Any thoughts, or am I chasing the wrong solution? I want to avoid a kernel upgrade unless it's needed.
=====
[root@ia2 ceph4]# ceph -s
  cluster 14f78538-6085-43f9-ac80-e886ca4de119
   health HEALTH_WARN 19 pgs backfilling; 19 pgs stuck unclean; recovery 42959/5511127 objects degraded (0.779%)
   monmap e9: 3 mons at {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0}, election epoch 496, quorum 0,1,2 ia1,ia2,ia3
   osdmap e7931: 23 osds: 23 up, 23 in
    pgmap v1904820: 1500 pgs, 1 pools, 10531 GB data, 2670 kobjects
          18708 GB used, 26758 GB / 45467 GB avail
          42959/5511127 objects degraded (0.779%)
               1481 active+clean
                 19 active+remapped+backfilling
  client io 1457 B/s wr, 0 op/s
[root@ia2 ceph4]# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
[root@ia2 ceph4]# uname -r
2.6.32-431.3.1.el6.x86_64
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
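Martin's 'ceph pg dump | grep unclean' suggestion can be narrowed with awk to print just the pg id and state. The sample lines below are made-up stand-ins for real 'ceph pg dump' rows (column positions can differ between releases, so verify field numbers against your output):

```shell
# Made-up sample lines standing in for `ceph pg dump` output; in real use
# you would pipe the command itself:  ceph pg dump | awk '...'
sample='3.1f 120 0 0 0 500 3001 3001 active+remapped+backfilling [4,7] [4,2]
3.2a 118 0 0 0 480 3001 3001 active+clean [1,5] [1,5]'
# print pg id and state for anything that is not plain active+clean
printf '%s\n' "$sample" | awk '$9 !~ /^active\+clean$/ {print $1, $9}'
```

Once you have the pg ids, 'ceph pg map <pgid>' shows which OSDs they sit on, which lets you test the over-full-OSD theory directly.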
Re: [ceph-users] can one slow hardisk slow whole cluster down?
Hi, At least it used to be like that - I'm not sure if that has changed. I believe this is also part of why it is advised to go with the same kind of hw and setup if possible. Since rbd images at least are spread as objects throughout the cluster, you'll prob. have to wait for a slow disk when reading. Writes still go journal -> disk, so if you have SSD journals in front of SATA you prob. won't notice much unless you're doing lots of and/or heavy writes. You can peek into it via the admin socket and get perf stats for each osd (iirc it is 'perf dump' you want). You could set something up to poll at given intervals, graph it, and prob. spot trends/slow disks that way. I think it is a manual process to locate a slow drive and either drop it from the cluster or give it a lower weight. If possible, I'd suggest toying with something like fio/bonnie++ in a guest and running some tests with and without the osd/node in question - you'll know for certain then. Cheers, Martin

On Tue, Jan 28, 2014 at 4:22 PM, Gautam Saxena gsax...@i-a-inc.com wrote:

> If one node, which happens to have a single RAID 0 hard disk, is slow, would that impact the whole ceph cluster? That is, when VMs interact with the rbd pool to read and write data, would the kvm client wait for that slow disk/node to return the requested data, making the slow disk/node the ultimate bottleneck? Or would kvm/ceph be smart enough to get the needed data from whichever node is ready to serve it up? That is, kvm/ceph would request all possible osds to return data, and if one osd is done with its request, it could choose to return more data that the slow disk/node still hasn't returned. I'm trying to decide whether to remove the slow disk/node from the ceph cluster (depending on how ceph works).

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
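The admin-socket poll Martin mentions looks something like the snippet below. The socket path is the usual packaged default, and the tiny JSON fragment is a made-up stand-in for real 'perf dump' output - treat the key names as assumptions and check what your version actually emits:

```shell
# Real use would be, per OSD socket:
#   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
# Here a made-up JSON fragment stands in for that output.
sample='{"filestore":{"journal_latency":{"avgcount":100,"sum":2.5}}}'
# crude extraction of one latency counter; a real JSON parser is nicer
printf '%s' "$sample" | grep -o '"sum":[0-9.]*'
```

Polling that from cron and graphing sum/avgcount per OSD over time is one cheap way to spot the slow-disk trend Martin describes.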
Re: [ceph-users] many blocked requests when recovery
Hi, You didn't state what version of ceph or kvm/qemu you're using. I think it wasn't until qemu 1.5.0 (1.4.2+?) that an async patch from Inktank was accepted into mainline, which significantly helps in situations like this. If you're not using that, on top of not limiting recovery threads, you'll prob. see issues like you describe. Also, more nodes make recovery easier on the entire cluster, so it might make sense to add smaller ones if/when you expand. Cheers, Martin

On Tue, Dec 3, 2013 at 7:09 AM, 飞 duron...@qq.com wrote:

> Hello, I'm testing ceph as storage for KVM virtual machine images. My cluster has 3 mons and 3 data nodes; every data node has 8x 2TB SATA HDDs and 1 SSD for journals. When I shut down one data node to simulate a server fault, the cluster begins to recover, and during recovery I see many blocked requests and the KVM VMs crash (they think their disk is offline). How can I solve this issue? Any ideas? Thank you.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
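Limiting recovery impact, as suggested above, is usually done with settings along these lines (values here are deliberately conservative examples, not recommendations - defaults and exact behaviour vary by release):

```
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```

They can also be applied to a running cluster with 'ceph tell osd.* injectargs' rather than a restart, which makes it easy to experiment while watching blocked-request counts.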
Re: [ceph-users] Big or small node ?
Hi, I'd almost always go with more, less beefy nodes rather than bigger ones. You're much more vulnerable if the big one(s) die, and replication will not impact your cluster as much. I also find it easier to extend a cluster with smaller nodes - at least it feels like you can grow in smoother increments at your preferred rate instead of in big chunks of extra storage. But I guess it depends on the intended cluster usage. In your example you can lose one of the smaller nodes (depending on replication level and total space usage, of course), but losing the big one means nothing works. With only one node I would prob. not go ceph at all and just opt for zfs or raid6 instead (and drop the extra SSDs and get 12x SATA) - it will prob. perform better, and you'll have more total space, assuming you'd go with 2x replication under ceph. Cheers, Martin

On Wed, Nov 20, 2013 at 8:47 AM, Daniel Schwager daniel.schwa...@dtnet.de wrote:

> Hallo, we are going to set up a 36-40TB (gross) test setup in our company for disk2disk2tape backup. Now we are faced with deciding whether to go the high- or low-density ceph way.
>
> The big node (only one necessary): 1x Supermicro System 6027R-E1R12T with 2x CPU E5-2620v2 (6 cores (hyper-threading) per CPU), 32 GB, with - 2x SSD 80GB (RAID1, OS only) - 2x SSD 100GB, Intel S3700 (bandwidth R/W: 500/200 MB/s) for 10 journals (5 per SSD) - 10x 4TB Seagate Enterprise Capacity ST4000NM0033 - 2x embedded 10GBit for public/storage network. I would also install 1 monitor on this node, which would then contain 10 OSDs.
> The price is about 20 US cents / GB.
>
> The small node (ok - we would have to buy 3 of them) could be like: 3x Supermicro SuperServer 6017R-TDF+ with 1x E5-2603 (4 cores without hyper-threading), 16GB, with 1x 120GB SSD Intel Series S3500 SATA for OS and 3 OSD journals (bandwidth R/W: 340/100 MB/s), 2x Intel PRO/1000 PT Dual Port Server Adapter (LACP link aggregation, 2 ports for public, 2 ports for storage network), 3x 4TB Seagate Enterprise Capacity ST4000NM0033. I would also install a monitor on each node. The price is about 23 US cents / GB.
>
> I think the performance is much better on the big node (because of the better components like 10GBit, SSD, CPU). Because we may not add more HDDs to the cluster later, I'm not sure how to decide - big or small node. Is there a recommendation? Maybe also in regards to my chosen hardware components? best regards Danny

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] HDD bad sector, pg inconsistent, no object remapping
Probably common sense, but I was bitten by this once in a similar situation: if you run 3x replica and distribute them over 3 hosts (is that the default now?), make sure that the disks on the host with the failed disk have space for its data - the remaining two disks will have to hold the content of the failed disk, and if they can't, your cluster will run full and halt. Cheers, Martin

On Wed, Nov 13, 2013 at 12:59 AM, David Zafman david.zaf...@inktank.com wrote:

> Since the disk is failing and you have 2 other copies, I would take osd.0 down. This means that ceph will not attempt to read the bad disk, either for clients or to make another copy of the data (not sure about the exact syntax for the version of ceph you are running):
>
> ceph osd down 0
>
> Mark it "out", which will immediately trigger recovery to create more copies of the data on the remaining OSDs:
>
> ceph osd out 0
>
> You can now finish removing the osd by following these instructions: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
>
> David Zafman Senior Developer http://www.inktank.com
>
> On Nov 12, 2013, at 3:16 AM, Mihály Árva-Tóth mihaly.arva-t...@virtual-call-center.eu wrote:
>
> Hello, I have 3 nodes with 3 OSDs in each node. I'm using the .rgw.buckets pool with 3 replicas. One of my HDDs (osd.0) has bad sectors; when I try to read an object from the OSD directly, I get an Input/output error. dmesg:
>
> [1214525.670065] mpt2sas0: log_info(0x3108): originator(PL), code(0x08), sub_code(0x)
> [1214525.670072] mpt2sas0: log_info(0x3108): originator(PL), code(0x08), sub_code(0x)
> [1214525.670100] sd 0:0:2:0: [sdc] Unhandled sense code
> [1214525.670104] sd 0:0:2:0: [sdc]
> [1214525.670107] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [1214525.670110] sd 0:0:2:0: [sdc]
> [1214525.670112] Sense Key : Medium Error [current]
> [1214525.670117] Info fld=0x60c8f21
> [1214525.670120] sd 0:0:2:0: [sdc]
> [1214525.670123] Add. Sense: Unrecovered read error
> [1214525.670126] sd 0:0:2:0: [sdc] CDB:
> [1214525.670128] Read(16): 88 00 00 00 00 00 06 0c 8f 20 00 00 00 08 00 00
>
> Okay, I know I need to replace the HDD. Fragment of ceph -s output:
>
> pgmap v922039: 856 pgs: 855 active+clean, 1 active+clean+inconsistent
>
> ceph pg dump | grep inconsistent
> 11.15d 25443 0 0 0 6185091790 30013001 active+clean+inconsistent 2013-11-06 02:30:45.23416.
>
> ceph pg map 11.15d
> osdmap e1600 pg 11.15d (11.15d) -> up [0,8,3] acting [0,8,3]
>
> 'pg repair' and deep-scrub cannot fix this issue. But if I understand correctly, the osd has to know it cannot retrieve the object from osd.0, and the object needs to be replicated to another osd, because there are no longer 3 working replicas. Thank you, Mihaly

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
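Martin's point about the surviving disks having to absorb the failed one can be sanity-checked with rough shell arithmetic before marking the OSD out. All numbers below are invented for illustration (3 disks per host, 4000 GB each, ~2500 GB used on each):

```shell
# Invented example figures - substitute real numbers from `ceph osd df`
# (or `df` on the OSD mounts on older releases).
used_per_disk=2500   # GB used on the failed disk, to be re-homed
disk_size=4000       # GB per surviving disk
surviving=2          # disks left on the same host

free=$(( surviving * (disk_size - used_per_disk) ))
if [ "$free" -ge "$used_per_disk" ]; then
  echo "recovery fits: ${free} GB free for ${used_per_disk} GB"
else
  echo "cluster would run full"
fi
```

If the answer is "would run full", add capacity (or lower the failed OSD's weight gradually) before triggering the full recovery.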
Re: [ceph-users] SSD question
Hi, Plus reads will still come from your non-SSD disks unless you're using something like flashcache in front, and as Greg said, having much more IOPS available for your db often makes a difference (depending on load, usage etc., of course). We're using the Samsung 840 Pro 256GB pretty much as Martin describes and we haven't had any issues (yet). We've set up our environment so that losing a node, or taking one offline for maintenance, won't impact the cluster much; with only 3 nodes I would prob. go with more durable hw specs. Cheers, Martin

On Tue, Oct 22, 2013 at 6:38 AM, Gregory Farnum g...@inktank.com wrote:

> On Mon, Oct 21, 2013 at 7:05 PM, Martin Catudal mcatu...@metanor.ca wrote:
>
> Hi, I have purchased my hardware for my ceph storage cluster but have not opened any of my 960GB SSD drive boxes, since I need my question answered first. Here's my hardware: THREE servers, dual 6-core Xeon, 2U, with 8 hot-swap trays plus 2 internal SSD mounts. In each server I will have: 2x SSD Samsung 840 Pro 128GB in RAID 1 for the OS, 2x SSD Samsung 840 Pro for journals, 4x 4TB Hitachi 7K4000 7200RPM, 1x 960GB Crucial M500 for one fast OSD pool. Configuration: one SSD journal for two 4TB drives, so if I lose one SSD journal I will only lose two OSDs instead of all the storage on that particular node. I have also bought 3x 960GB M500 SSDs from Crucial to create a fast pool of OSDs made from SSDs - one 960GB per server, for database applications. Is it advisable to do that, or is it better to return them and for the same price buy 6 more 4TB Hitachis? Since the write acknowledgment comes from the SSD journal, do I get a huge improvement by using SSDs as OSDs? My goal is solid, fast performance for a database ERP and 3D modeling of mining galleries running in VMs.
>
> The specifics depend on a lot of factors, but for database applications you are likely to have better performance with an SSD pool.
> This is because even though the journal can do fast acknowledgements, that's for evening out write bursts - on average it will restrict itself to the speed of the backing store. A good SSD can generally do much more than 6x a HDD's random IOPS. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

> Thanks, Martin -- Martin Catudal Responsable TIC Ressources Metanor Inc Ligne directe: (819) 218-2708

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph with high disk densities?
Hi Scott, Just some observations from here. We run 8 nodes, 2U units with 12 OSDs each (4x 500GB SSD, 8x 4TB platter) attached to 2x LSI 2308 cards. Each node uses an Intel E5-2620 with 32GB mem. Granted, we only have about 25 VMs (some fairly IO-hungry, both iops- and throughput-wise) on that cluster, but we hardly see any CPU usage at all. We have ~6k PGs, and according to munin our avg. CPU time is ~9% (that is out of all cores, so 9% out of 1200% (6 cores, 6 HT)). Sadly I didn't record CPU usage while stress testing or breaking it. We're using cuttlefish and XFS. And again, this cluster is still pretty underused, so the CPU usage does not reflect a more active system. Cheers, Martin

On Mon, Oct 7, 2013 at 6:15 PM, Scott Devoid dev...@anl.gov wrote:

> I brought this up within the context of the RAID discussion, but it did not garner any responses. [1] In our small test deployments (160 HDDs and OSDs across 20 machines) our performance is quickly bounded by CPU and memory overhead. These are 2U machines with 2x 6-core Nehalem; running 8 OSDs consumed 25% of the total CPU time. This was a cuttlefish deployment. This seems like rather high CPU overhead, particularly when we are looking to hit a density target of 10-15 4TB drives / U within 1.5 years. Does anyone have suggestions for hitting this requirement? Are there ways to reduce CPU and memory overhead per OSD? My one suggestion was to do some form of RAID to join multiple drives and present them to a single OSD. A 2-drive RAID-0 would halve the OSD overhead while doubling the failure rate and doubling the rebalance overhead. It is not clear to me whether that is better or not. [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/004833.html

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Hardware recommendations
Hi Shain, Those R515s seem to mimic our servers (2U Supermicro with 12x 3.5" bays and 2x 2.5" in the rear for the OS). Since we need a mix of SSD and platter, we have 8x 4TB drives and 4x 500GB SSDs, plus 2x 250GB SSDs for the OS, in each node (2x 8-port LSI 2308 in IT mode). We've partitioned 10GB from each of the 4x 500GB SSDs to use as journals for the 4x 4TB drives, and the two OS disks each hold 2 journals for the remaining 4 platter disks. We tested journal placement a lot, and this setup seemed to fit our use best (pure VM block storage, 3x replica). Everything is connected via 10GbE (one network for cluster, one for public), plus 3 standalone monitor servers. For storage nodes we use E5-2620/32GB RAM, and for monitor nodes E3-1260L/16GB RAM. We've tested with both 1 and 2 nodes going down and starting to redistribute data, and they seem to cope more than fine. Overall I find these nodes a good compromise between capacity, price and performance - we looked into getting 2U servers with 8x 3.5" bays and buying more of them, but ultimately went with this. We also have some boxes from Coraid (SR SRX with and without flashcache/etherflash), so we've been able to do some direct comparisons, and so far ceph is looking good - especially on price/storage ratio. At any rate, back to your mail: I think the most important factor is looking at all the pieces and making sure you're not being [hard] bottlenecked somewhere - we found 24GB RAM to be a little on the low side when all 12 disks started to redistribute, but 32GB is fine. Also, not having journals on SSD in front of the platters really hurt a lot when we tested - this can prob. be mitigated somewhat with better raid controllers. CPU-wise the E5-2620 hardly breaks a sweat, even when it has a bit more to do with a node going down. Good luck with your HW adventure :).
Cheers, Martin

On Mon, Aug 26, 2013 at 3:56 PM, Shain Miley smi...@npr.org wrote:

> Good morning, I am in the process of deciding what hardware we are going to purchase for our new ceph-based storage cluster. I have been informed that I must submit my purchase needs by the end of this week in order to meet our FY13 budget requirements (which does not leave me much time). We are planning to build multiple clusters (one primarily for radosgw at location 1; the other for VM block storage at location 2). We will be building our radosgw storage out first, so that is the main focus of this email thread. I have read all the docs and white papers etc. on hardware suggestions, and we have an existing relationship with Dell, so I have been planning on buying a bunch of Dell R515s with 4TB drives and using 10GigE networking for this radosgw setup (although this will be primarily used for radosgw purposes, I will also be testing a limited number of VMs on this infrastructure to see what kind of performance we can achieve). I am just wondering if anyone else has any quick thoughts on these hardware choices, or any alternative suggestions I might look at as I seek to finalize our purchasing this week. Thanks in advance, Shain. Sent from my iPhone

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] performance questions
Hi Jeff, I would be surprised as well - we initially tested on a 2-replica cluster with 8 nodes of 12 OSDs each, and went to 3-replica as we rebuilt the cluster. The performance seems to be where I'd expect it (consistent writes in an rbd VM at ~400MB/s on 10GbE, which I'd expect is limited by either disks, network, qemu/kvm or the 3-replica setup kicking in). Just curious: anything in dmesg about the disk mounted as osd.4? Cheers, Martin

On Tue, Aug 20, 2013 at 4:02 PM, Mark Nelson mark.nel...@inktank.com wrote:

> On 08/20/2013 08:42 AM, Jeff Moskow wrote:
>
> Hi, More information. If I look in /var/log/ceph/ceph.log, I see 7893 slow requests in the last 3 hours, of which 7890 are from osd.4. Should I assume a bad drive, even though SMART says the drive is healthy? A bad osd?
>
> Definitely sounds suspicious! Might be worth taking that OSD out and doing some testing on the drive.
>
> Thanks, Jeff

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
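Counting slow requests per OSD, as Jeff did, is easy to script. The log lines below are made-up stand-ins for real ceph.log entries (real lines carry more fields, so adjust the awk column if yours differ):

```shell
# Made-up sample; in real use:
#   grep 'slow request' /var/log/ceph/ceph.log | awk '{print $3}' | sort | uniq -c | sort -rn
sample='2013-08-20 10:00:01 osd.4 [WRN] slow request 30.5 seconds old
2013-08-20 10:00:05 osd.4 [WRN] slow request 31.2 seconds old
2013-08-20 10:00:09 osd.2 [WRN] slow request 30.1 seconds old'
# count occurrences per reporting daemon, busiest first
printf '%s\n' "$sample" | awk '/slow request/ {print $3}' | sort | uniq -c | sort -rn
```

One daemon dominating the count, as with osd.4 here, is the usual signature of a single sick drive.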
Re: [ceph-users] Ceph instead of RAID
Hi, I'd just like to echo what Wolfgang said about ceph being a complex system. I initially started out testing ceph with a setup much like yours, and while it performed OK overall, it was not as good as sw raid on the same machine. Also, as Mark said, you'll at very best get half the write speed on larger continuous writes because of how the journaling works. Ceph really shines with multiple servers and a lot of concurrency. My test machine ran for half a year plus (going from argonaut to cuttlefish), and in that time I came to realize that mixing types (and sizes) of disks was a bad idea (some enterprise SATA, some fast desktop and some green disks) - speed will be determined by the slowest drive in your setup (I guess that's why similar hw is advocated whenever possible). I also ran into all the challenging issues of dealing with a very young technology: osds suddenly refusing to start, pgs going into various incomplete/down/inconsistent states, the monitor leveldb running full, monitors dying at weird times. I think it is good for a learning experience, but like Wolfgang said, it is too much hassle for too little gain when you have something like raid10/zfs around. But by all means, don't let us discourage you if you want to go this route - ceph's unique self-healing ability was what drew me into running a single machine in the first place. Cheers, Martin

On Tue, Aug 13, 2013 at 9:32 AM, Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at wrote:

> On 08/13/2013 09:23 AM, Jeffrey 'jf' Lim wrote:
>
> Anyway, I thought: what if instead of RAID-10 I use ceph? All 6 disks will be local, so I could simply create 6 local OSDs + a monitor, right? Is there anything I need to watch out for in such a configuration?
>
> You can do that. Although it's nice to play with and everything, I wouldn't recommend doing it. It will give you more pain than pleasure.
>
> How so? Care to elaborate?
>
> Ceph is a complex system, built for clusters.
> It does some stuff in software that is otherwise done in hardware (raid controllers). The nature of the complexity of a cluster system is a lot of overhead compared to a local raid [whatever] system, and disk I/O latency will naturally suffer a bit. An OSD needs about 300 MB of RAM (may vary with your PGs); times 6, that is nearly 2 GB of RAM spent (compared to a local RAID). Also, ceph is young and it does indeed have some bugs; RAID is old and very mature. Although I rely on ceph on a production cluster too, it is much harder to maintain than a simple local raid. When a disk fails in ceph you don't have to worry about your data, which is a good thing, but you do have to worry about the rebuilding (which isn't too hard, but at least you need to know SOMETHING about ceph); with (hardware) RAID you simply replace the disk and it gets rebuilt. Others will find more reasons why this is not the best idea for a production system. Don't get me wrong, I'm a big supporter of ceph, but only for clusters, not for single systems. wogri

> -jf -- He who settles on the idea of the intelligent man as a static entity only shows himself to be a fool. Every nonfree program has a lord, a master -- and if you use the program, he is your master. --Richard Stallman

-- DI (FH) Wolfgang Hennerbichler Software Development Unit Advanced Computing Technologies RISC Software GmbH A company of the Johannes Kepler University Linz IT-Center Softwarepark 35 4232 Hagenberg Austria Phone: +43 7236 3343 245 Fax: +43 7236 3343 250 wolfgang.hennerbich...@risc-software.at http://www.risc-software.at

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebuild the monitor infrastructure
Hi Bryan, I asked the same question a few months ago: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-February/000221.html Basically, losing all monitor data is pretty bad; you'd be stuck on your own and would need to get in contact with Inktank - they might be able to help rebuild a monitor for you. Cheers, Martin

On Wed, Apr 24, 2013 at 8:19 AM, Bryan Stansell br...@stansell.org wrote:

> Sorry for a possibly silly new-user question, but I was wondering if there is any way to rebuild the monitor infrastructure in case of catastrophic failure. My simple case is a single monitor: if the data is lost because of hardware failure etc., can it be recreated from scratch? The same could be said for the suggested 3-node monitor setup - you'd probably need a user error for that kind of destruction, but it could happen. I've been searching for anything that explains how to recreate this. I found the docs that describe how to recreate a single monitor from scratch if one out of many is misbehaving (treat it like adding a new instance). Losing this data is REALLY bad, I understand that. I'm hoping that dumping out some critical set of data while things are working would provide enough to recreate things from scratch. Is this a possibility? Is there a documented procedure anywhere? Thanks much. Bryan

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
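One cheap piece of insurance against exactly this is periodically cold-copying the mon data directory while the daemon is stopped. This is a hedged sketch, not an officially supported recovery procedure: whether a restored copy is usable depends on your version, and the default path /var/lib/ceph/mon/ceph-a is an assumption (a scratch directory stands in for it here so the snippet is self-contained):

```shell
# Stop ceph-mon first so the store is quiescent! A scratch dir stands in
# for the real mon data dir (e.g. /var/lib/ceph/mon/ceph-a) in this sketch.
mon_dir=$(mktemp -d)
echo demo > "$mon_dir/store.db"        # placeholder for the real store files
backup="$mon_dir.tar.gz"
tar czf "$backup" -C "$mon_dir" .
tar tzf "$backup"                      # verify the archive contents
```

With more than one monitor the better safety net is simply running 3 (or 5) of them on separate hardware, so a surviving quorum member can reseed a rebuilt peer.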
Re: [ceph-users] Ceph error: active+clean+scrubbing+deep
Hi Kakito, You def. _want_ scrubbing to happen! http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing If you feel it is killing your system you can tweak some of the values, like: osd scrub load threshold, osd scrub max interval, osd deep scrub interval. I have no experience changing those values, so I can't say how they will influence your system. Also, not that it is any of my business, but it seems you're running with replication set to 1. Cheers, Martin

On Tue, Apr 16, 2013 at 3:11 AM, kakito tientienminh080...@gmail.com wrote:

> Dear all, I use Ceph Storage. Recently I often get a message: mon.0 [INF] pgmap v277690: 640 pgs: 639 active+clean, 1 active+clean+scrubbing+deep; 14384 GB data, 14409 GB used, 90007 GB / 107 TB avail. It seems that something is not correct. I tried to restart, but that did not help, and it slows my system. I use ceph 0.56.4, kernel 3.8.6-1.el6.elrepo.x86_64. How can I fix it?

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
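For reference, the scrub knobs Martin lists live in the [osd] section of ceph.conf. The values below are illustrative only (the intervals are in seconds; defaults differ between releases, so consult the osd config reference for yours):

```
[osd]
; skip new scrubs while system load is above this
osd scrub load threshold = 0.5
; force a scrub at least this often regardless of load (1 week)
osd scrub max interval = 604800
; deep scrub (full data read) at most once per week
osd deep scrub interval = 604800
```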
Re: [ceph-users] How to calculate the capacity of a ceph cluster
Hi Ashish, Yep, that would be the correct way to do it. If you already have a cluster running, 'ceph -s' will also show usage, e.g.:

ceph -s
  pgmap v1842777: 8064 pgs: 8064 active+clean; 1069 GB data, 2144 GB used, 7930 GB / 10074 GB avail; 3569B/s wr, 0op/s

This is a small test cluster with 2x replica: about 1TB of data, and roughly twice that amount used. Also, 'rados df' will show usage per pool :) Cheers, Martin

On Wed, Mar 13, 2013 at 11:15 AM, Ashish Kumar ku...@weclapp.com wrote:

> Hi Guys, Just want to know how I can calculate the capacity of my ceph cluster. I don't know whether a simple RAID system calculation will work or not. I have 5 servers, each with 2TB of storage, and there are three copies of data. Will it be OK to calculate the capacity in the following way: 5 (servers) * 2TB / 3?
>
> Ashish Kumar | Software Development | weclapp GmbH Frauenbergstraße 31-33 | D-35039 Marburg + 49 6421 999 1805 office | + 49 6421 999 1899 fax weclapp GmbH | Sitz der Gesellschaft: Marburg | Handelsregister: Amtsgericht Marburg HRB 5438 Geschäftsführer: Michael Schmidt, Ertan Özdil, Uwe Knoke http://www.weclapp.com/

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
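Ashish's formula, with his numbers, in shell arithmetic (integer TB for simplicity; real usable space is further reduced by filesystem, journal and near-full-ratio overhead):

```shell
servers=5
per_server_tb=2
replicas=3
raw=$(( servers * per_server_tb ))
# integer division: raw capacity divided by the replica count
echo "raw: ${raw} TB, usable: $(( raw / replicas )) TB"
```

So roughly 10 TB raw and a bit over 3 TB usable with 3x replication, before overhead.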