Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?
Full disclosure - I have not created an erasure code pool yet! I have been wanting to do the same thing that you are attempting and have these links saved. I believe this is what you are looking for. This link is for decompiling the CRUSH rules and recompiling: https://docs.ceph.com/docs/luminous/rados/operations/crush-map-edits/ This link is for creating the EC rules for 4+2 with only 3 hosts: https://ceph.io/planet/erasure-code-on-small-clusters/ I hope that helps! Chris On 2019-10-18 2:55 pm, Salsa wrote: Ok, I'm lost here. How am I supposed to write a crush rule? So far I managed to run: #ceph osd crush rule dump test -o test.txt So I can edit the rule. Now I have two problems: 1. Whats the functions and operations to use here? Is there documentation anywhere abuot this? 2. How may I create a crush rule using this file? 'ceph osd crush rule create ... -i test.txt' does not work. Am I taking the wrong approach here? -- Salsa Sent with ProtonMail Secure Email. ‐‐‐ Original Message ‐‐‐ On Friday, October 18, 2019 3:56 PM, Paul Emmerich wrote: Default failure domain in Ceph is "host" (see ec profile), i.e., you need at least k+m hosts (but at least k+m+1 is better for production setups). You can change that to OSD, but that's not a good idea for a production setup for obvious reasons. It's slightly better to write a crush rule that explicitly picks two disks on 3 different hosts Paul Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Fri, Oct 18, 2019 at 8:45 PM Salsa sa...@protonmail.com wrote: > I have probably misunterstood how to create erasure coded pools so I may be in need of some theory and appreciate if you can point me to documentation that may clarify my doubts. > I have so far 1 cluster with 3 hosts and 30 OSDs (10 each host). > I tried to create an erasure code profile like so: > " > > ceph osd erasure-code-profile get ec4x2rs > > == > > crush-device-class= > crush-failure-domain=host > crush-root=default > jerasure-per-chunk-alignment=false > k=4 > m=2 > plugin=jerasure > technique=reed_sol_van > w=8 > " > If I create a pool using this profile or any profile where K+M > hosts , then the pool gets stuck. > " > > ceph -s > > > > cluster: > id: eb4aea44-0c63-4202-b826-e16ea60ed54d > health: HEALTH_WARN > Reduced data availability: 16 pgs inactive, 16 pgs incomplete > 2 pools have too many placement groups > too few PGs per OSD (4 < min 30) > services: > mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 11d) > mgr: ceph01(active, since 74m), standbys: ceph03, ceph02 > osd: 30 osds: 30 up (since 2w), 30 in (since 2w) > data: > pools: 11 pools, 32 pgs > objects: 0 objects, 0 B > usage: 32 GiB used, 109 TiB / 109 TiB avail > pgs: 50.000% pgs not active > 16 active+clean > 16 creating+incomplete > > ceph osd pool ls > > = > > test_ec > test_ec2 > " > The pool will never leave this "creating+incomplete" state. > The pools were created like this: > " > > ceph osd pool create test_ec2 16 16 erasure ec4x2rs > > > > ceph osd pool create test_ec 16 16 erasure > > === > > " > The default profile pool is created correctly. 
> My profiles are like this: > " > > ceph osd erasure-code-profile get default > > == > > k=2 > m=1 > plugin=jerasure > technique=reed_sol_van > > ceph osd erasure-code-profile get ec4x2rs > > == > > crush-device-class= > crush-failure-domain=host > crush-root=default > jerasure-per-chunk-alignment=false > k=4 > m=2 > plugin=jerasure > technique=reed_sol_van > w=8 > " > From what I've read it seems to be possible to create erasure code pools with higher than hosts K+M. Is this not so? > What am I doing wrong? Do I have to create any special crush map rule? > -- > Salsa > Sent with ProtonMail Secure Email. > > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
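For reference, a minimal sketch of the kind of CRUSH rule the small-clusters article above describes for k=4,m=2 over only 3 hosts (two chunks per host). The rule name and id are placeholders, not values from this thread, and the usual caveat applies: losing one host still takes out two chunks at once, so test this on a throwaway pool first.

# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# append a rule along these lines to crushmap.txt (id 50 is an arbitrary unused id):
rule ec4x2_3hosts {
    id 50
    type erasure
    min_size 6
    max_size 6
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}

# recompile, inject, and create the pool against the new rule
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool create test_ec4x2 16 16 erasure ec4x2rs ec4x2_3hosts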
[ceph-users] bluestore db & wal use spdk device how to ?
Hi All, I have multiple NVMe SSDs and I wish to use two of them via SPDK as the BlueStore db & wal. My assumption would be to put the following in ceph.conf, under the [osd] section:

bluestore_block_db_path = "spdk::01:00.0"
bluestore_block_db_size = 40 * 1024 * 1024 * 1024 (40G)

Then how do I prepare the OSD? ceph-volume lvm prepare --bluestore --data vg_ceph/lv_sas-sda --block.db spdk::01:00.0 ? And what if I have a second NVMe SSD (:1a:00.0) that I want to use for a different OSD? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
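Not an authoritative answer, but for comparison: the BlueStore SPDK notes in the Luminous/Mimic docs identify NVMe devices by serial number rather than PCI address, and size options take a literal byte count. A sketch of that notation only (the serial number below is a made-up placeholder, and I am not sure ceph-volume accepts an spdk: target for --block.db at all):

[osd]
# "spdk:" plus the NVMe device serial number (placeholder value shown)
bluestore_block_db_path = "spdk:55cd2e404bd73932"
# sizes are literal bytes, not an expression; this is 40 GiB
bluestore_block_db_size = 42949672960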
Re: [ceph-users] Changing the release cadence
It seems like since the change to the 9 months cadence it has been bumpy for the Debian based installs. Changing to a 12 month cadence sounds like a good idea. Perhaps some Debian maintainers can suggest a good month for them to get the packages in time for their release cycle. On 2019-06-05 12:16 pm, Alexandre DERUMIER wrote: Hi, - November: If we release Octopus 9 months from the Nautilus release (planned for Feb, released in Mar) then we'd target this November. We could shift to a 12 months candence after that. For the 2 last debian releases, the freeze was around january-february, november seem to be a good time for ceph release. - Mail original - De: "Sage Weil" À: "ceph-users" , "ceph-devel" , d...@ceph.io Envoyé: Mercredi 5 Juin 2019 17:57:52 Objet: Changing the release cadence Hi everyone, Since luminous, we have had the follow release cadence and policy: - release every 9 months - maintain backports for the last two releases - enable upgrades to move either 1 or 2 releases heads (e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; ...) This has mostly worked out well, except that the mimic release received less attention that we wanted due to the fact that multiple downstream Ceph products (from Red Has and SUSE) decided to based their next release on nautilus. Even though upstream every release is an "LTS" release, as a practical matter mimic got less attention than luminous or nautilus. We've had several requests/proposals to shift to a 12 month cadence. This has several advantages: - Stable/conservative clusters only have to be upgraded every 2 years (instead of every 18 months) - Yearly releases are more likely to intersect with downstream distribution release (e.g., Debian). In the past there have been problems where the Ceph releases included in consecutive releases of a distro weren't easily upgradeable. - Vendors that make downstream Ceph distributions/products tend to release yearly. Aligning with those vendors means they are more likely to productize *every* Ceph release. This will help make every Ceph release an "LTS" release (not just in name but also in terms of maintenance attention). So far the balance of opinion seems to favor a shift to a 12 month cycle[1], especially among developers, so it seems pretty likely we'll make that shift. (If you do have strong concerns about such a move, now is the time to raise them.) That brings us to an important decision: what time of year should we release? Once we pick the timing, we'll be releasing at that time *every year* for each release (barring another schedule shift, which we want to avoid), so let's choose carefully! A few options: - November: If we release Octopus 9 months from the Nautilus release (planned for Feb, released in Mar) then we'd target this November. We could shift to a 12 months candence after that. - February: That's 12 months from the Nautilus target. - March: That's 12 months from when Nautilus was *actually* released. November is nice in the sense that we'd wrap things up before the holidays. It's less good in that users may not be inclined to install the new release when many developers will be less available in December. February kind of sucked in that the scramble to get the last few things done happened during the holidays. OTOH, we should be doing what we can to avoid such scrambles, so that might not be something we should factor in. 
March may be a bit more balanced, with a solid 3 months before when people are productive, and 3 months after before they disappear on holiday to address any post-release issues. People tend to be somewhat less available over the summer months due to holidays etc, so an early or late summer release might also be less than ideal. Thoughts? If we can narrow it down to a few options maybe we could do a poll to gauge user preferences. Thanks! sage [1] https://twitter.com/larsmb/status/1130010208971952129 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online
When your node went down, you lost 100% of the copies of the objects that were stored on that node, so the cluster had to re-create a copy of everything. When the node came back online (and particularly since your usage was near-zero), the cluster discovered that many objects did not require changes and were still identical to their counterparts. The only moved objects would have been ones that had changed and ones that needed to be moved in order to satisfy the requirements of your crush map for the purposes of distribution.

On January 27, 2019 09:47:59 Götz Reinicke wrote: Dear all, thanks for your feedback, and I'll try to take every suggestion into consideration. I've rebooted the node in question and all 24 OSDs came online without any complaints. But what makes me wonder is: during the downtime the objects got rebalanced and placed on the remaining nodes. With the failed node back online, only a couple of hundred objects were misplaced, out of about 35 million. The question for me is: what happens to the objects on the OSDs that went down, after those OSDs come back online? Thanks for feedback

On 27.01.2019 at 04:17, Christian Balzer wrote: Hello, this is where (depending on your topology) something like: --- mon_osd_down_out_subtree_limit = host --- can come in very handy. Provided you have correct monitoring, alerting and operations, a down node can often be restored long before any recovery would be finished, and you also avoid the data movement back and forth. And if you see that recovering the node will take a long time, just manually set things out for the time being. Christian

On Sun, 27 Jan 2019 00:02:54 +0100 Götz Reinicke wrote: Dear Chris, Thanks for your feedback. The node/OSDs in question are part of an erasure coded pool, and during the weekend the workload should be close to none. But anyway, I could get a look at the console and the server; the power is up, but I can't use the console - the login prompt is shown, but no key is accepted. I'll have to reboot the server and check what it is complaining about tomorrow morning, as soon as I can access the server again. Fingers crossed and regards. Götz

On 26.01.2019 at 23:41, Chris wrote: It sort of depends on your workload/use case. Recovery operations can be computationally expensive. If your load is light because it's the weekend you should be able to turn that host back on as soon as you resolve whatever the issue is with minimal impact. You can also increase the priority of the recovery operation to make it go faster if you feel you can spare additional IO and it won't affect clients. We do this in our cluster regularly and have yet to see an issue (given that we take care to do it during periods of lower client io)

On January 26, 2019 17:16:38 Götz Reinicke wrote: Hi, one host out of 10 is down for yet unknown reasons. I guess a power failure. I could not yet see the server. The cluster is recovering and remapping fine, but still has some objects to process. My question: may I just switch the server back on and, in the best case, the 24 OSDs come back online and recovery will do the job without problems? Or what might be a good way to handle that host? Should I first wait till the recovery is finished? Thanks for feedback and suggestions - Happy Saturday Night :). Regards.
Götz -- Christian Balzer  Network/Systems Engineer  ch...@gol.com  Rakuten Communications

Götz Reinicke IT-Koordinator IT-OfficeNet +49 7141 969 82420 goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg http://www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzende des Aufsichtsrates: Petra Olschowski Staatssekretärin im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt Datenschutzerklärung | Transparenzinformation Data privacy statement | Transparency information ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
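To make the advice above concrete, a short sketch of the flags that keep the cluster from rebalancing while a whole host is briefly down (values are examples, not taken from this cluster):

# stop automatic out-marking while the host is being repaired
ceph osd set noout

# or make this the default behaviour for whole-host failures (ceph.conf on the mons):
[mon]
mon_osd_down_out_subtree_limit = host

# once the host and its 24 OSDs are back and the cluster is healthy again:
ceph osd unset noout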
Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online
It sort of depends on your workload/use case. Recovery operations can be computationally expensive. If your load is light because its the weekend you should be able to turn that host back on as soon as you resolve whatever the issue is with minimal impact. You can also increase the priority of the recovery operation to make it go faster if you feel you can spare additional IO and it won't affect clients. We do this in our cluster regularly and have yet to see an issue (given that we take care to do it during periods of lower client io) On January 26, 2019 17:16:38 Götz Reinicke wrote: Hi, one host out of 10 is down for yet unknown reasons. I guess a power failure. I could not yet see the server. The Cluster is recovering and remapping fine, but still has some objects to process. My question: May I just switch the server back on and in best case, the 24 OSDs get back online and recovering will do the job without problems. Or what might be a good way to handle that host? Should I first wait till the recover is finished? Thanks for feedback and suggestions - Happy Saturday Night :) . Regards . Götz -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Garbage collection growing and db_compaction with small file uploads
Hi all I'm seeing some behaviour I wish to check on a Luminous (12.2.10) cluster that I'm running for rbd and rgw (mostly SATA filestore with NVME journal with a few SATA only bluestore). There's a set of dedicated SSD OSDs running bluestore for the .rgw buckets.index pool and also holding the .rgw.gc pool There's a long running upload of small files, which I think is causing a large amount of leveldb compaction (on filestore nodes) and rocksdb compaction on bluestore nodes. The .rgw.buckets bluestore nodes were exhibiting noticeably higher load than filestore nodes, although this seems to have been solved following configuring the following options for bluestore SATA osds: bluestore cache size hdd = 10737418240 osd memory target = 10737418240 However the bluestore nodes are still showing significantly higher wait CPU and higher disk IO than filestore nodes, is there anything else that I should be looking at tuning for bluestore, or is this is expected due to the loss of file cache with filestore? Whilst the upload has been running a "radosgw-admin orphans find" was also being executed, although this was ended manually before completion, as a significant buildup in garbage collection has occurred. Looking into this, it looks like most of the outstanding garbage collection relates to a single bucket, which was shown to contain a large amount of multipart/shadow files. These are now being listed in the radosgw-admin gc list # radosgw-admin gc list | grep -c '"oid":' 224557347 # radosgw-admin gc list | grep '"oid":' | grep -v -c "default.1084171934.99" 3674322 # radosgw-admin gc list | head -1000 | grep '"oid":'| grep 1084171934 "oid": "default.1084171934.99__multipart_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1", "oid": "default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_1", "oid": "default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_2", "oid": "default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_3", "oid": "default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_4", "oid": "default.1084171934.99__multipart_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.2", Despite running multiple "radosgw-admin gc process" commands alongside our radosgw processes, which has helped clean up garbage collection in the past, our gc list is currently continuing to grow. I believe I can loop through this manually and use the rados rm command to remove the objects from the .rgw.buckets pool after having a look through some historic posts on this list, and then remove the garbage collection objects - is this a reasonable solution? Are there any recommendations for dealing with a garbage collection list of this size? If there's any additional information I should provide for context here, please let me know. Thanks for any help Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
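As a hedged sketch of the manual cleanup mentioned above (not an official procedure): pull the gc entries for that one bucket instance, remove the backing rados objects directly, then let gc processing clear the entries. The marker default.1084171934.99 and the .rgw.buckets pool name are taken from this post; double-check both, and be certain the multipart uploads really are abandoned before deleting anything.

radosgw-admin gc list --include-all \
  | awk -F'"' '/"oid":/ {print $4}' \
  | grep '^default\.1084171934\.99__' \
  | while read oid; do
      rados -p .rgw.buckets rm "$oid"
    done
# afterwards, run gc processing again so the now-dangling gc entries get dropped
radosgw-admin gc process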
Re: [ceph-users] Help Ceph Cluster Down
If you added OSDs and then deleted them repeatedly without waiting for replication to finish as the cluster attempted to re-balance across them, its highly likely that you are permanently missing PGs (especially if the disks were zapped each time). If those 3 down OSDs can be revived there is a (small) chance that you can right the ship, but 1400pg/OSD is pretty extreme. I'm surprised the cluster even let you do that - this sounds like a data loss event. Bring back the 3 OSD and see what those 2 inconsistent pgs look like with ceph pg query. On January 3, 2019 21:59:38 Arun POONIA wrote: Hi, Recently I tried adding a new node (OSD) to ceph cluster using ceph-deploy tool. Since I was experimenting with tool and ended up deleting OSD nodes on new server couple of times. Now since ceph OSDs are running on new server cluster PGs seems to be inactive (10-15%) and they are not recovering or rebalancing. Not sure what to do. I tried shutting down OSDs on new server. Status: [root@fre105 ~]# ceph -s 2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) No such file or directory cluster: id: adb9ad8e-f458-4124-bf58-7963a8d1391f health: HEALTH_ERR 3 pools have many more objects per pg than average 373907/12391198 objects misplaced (3.018%) 2 scrub errors 9677 PGs pending on creation Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 2717 pgs stale Possible data damage: 2 pgs inconsistent Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized 52486 slow requests are blocked > 32 sec 9287 stuck requests are blocked > 4096 sec too many PGs per OSD (2968 > max 200) services: mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 osd: 39 osds: 36 up, 36 in; 51 remapped pgs rgw: 1 daemon active data: pools: 18 pools, 54656 pgs objects: 6050k objects, 10941 GB usage: 21727 GB used, 45308 GB / 67035 GB avail pgs: 13.073% pgs not active 178350/12391198 objects degraded (1.439%) 373907/12391198 objects misplaced (3.018%) 46177 active+clean 5054 down 1173 stale+down 1084 stale+active+undersized 547 activating 201 stale+active+undersized+degraded 158 stale+activating 96activating+degraded 46stale+active+clean 42activating+remapped 34stale+activating+degraded 23stale+activating+remapped 6 stale+activating+undersized+degraded+remapped 6 activating+undersized+degraded+remapped 2 activating+degraded+remapped 2 active+clean+inconsistent 1 stale+activating+degraded+remapped 1 stale+active+clean+remapped 1 stale+remapped 1 down+remapped 1 remapped+peering io: client: 0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr Thanks -- Arun Poonia ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
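If the three down OSDs can be brought back, a rough sketch of the triage for the inconsistent and stuck PGs (the PG id below is a placeholder; take real ids from ceph health detail):

ceph health detail | egrep 'inconsistent|incomplete|stale' | head
ceph pg 1.2b query                            # replace 1.2b with an affected pg id
rados list-inconsistent-obj 1.2b --format=json-pretty
ceph pg repair 1.2b                           # only once you understand what differs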
[ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade
I am also having this problem. Zheng (or anyone else), any idea how to perform this downgrade on a node that is also a monitor and an OSD node? dpkg complains of a dependency conflict when I try to install ceph-mds_13.2.1-1xenial_amd64.deb: ``` dpkg: dependency problems prevent configuration of ceph-mds: ceph-mds depends on ceph-base (= 13.2.1-1xenial); however: Version of ceph-base on system is 13.2.2-1xenial. ``` I don't think I want to downgrade ceph-base to 13.2.1. Thank you, Chris Martin > Sorry. this is caused wrong backport. downgrading mds to 13.2.1 and > marking mds repaird can resolve this. > > Yan, Zheng > On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin wrote: > > > > Update: > > I discovered http://tracker.ceph.com/issues/24236 and > > https://github.com/ceph/ceph/pull/22146 > > Make sure that it is not relevant in your case before proceeding to > > operations that modify on-disk data. > > > > > > On 6.10.2018, at 03:17, Sergey Malinin wrote: > > > > I ended up rescanning the entire fs using alternate metadata pool approach > > as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/ > > The process has not competed yet because during the recovery our cluster > > encountered another problem with OSDs that I got fixed yesterday (thanks to > > Igor Fedotov @ SUSE). > > The first stage (scan_extents) completed in 84 hours (120M objects in data > > pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by > > OSDs failure so I have no timing stats but it seems to be runing 2-3 times > > faster than extents scan. > > As to root cause -- in my case I recall that during upgrade I had forgotten > > to restart 3 OSDs, one of which was holding metadata pool contents, before > > restarting MDS daemons and that seemed to had an impact on MDS journal > > corruption, because when I restarted those OSDs, MDS was able to start up > > but soon failed throwing lots of 'loaded dup inode' errors. > > > > > > On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky > gmail.com> wrote: > > > > Same problem... > > > > # cephfs-journal-tool --journal=purge_queue journal inspect > > 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c > > Overall journal integrity: DAMAGED > > Objects missing: > > 0x16c > > Corrupt regions: > > 0x5b00- > > > > Just after upgrade to 13.2.2 > > > > Did you fixed it? > > > > > > On 26/09/18 13:05, Sergey Malinin wrote: > > > > Hello, > > Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2. > > After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are > > damaged. Resetting purge_queue does not seem to work well as journal still > > appears to be damaged. > > Can anybody help? > > > > mds log: > > > > -789> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.mds2 Updating MDS map > > to version 586 from mon.2 > > -788> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 handle_mds_map i > > am now mds.0.583 > > -787> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 handle_mds_map > > state change up:rejoin --> up:active > > -786> 2018-09-26 18:42:32.527 7f70f78b1700 1 mds.0.583 recovery_done -- > > successful recovery! 
> > > >-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: > > Decode error at read_pos=0x322ec6636 > >-37> 2018-09-26 18:42:32.707 7f70f28a7700 5 mds.beacon.mds2 > > set_want_state: up:active -> down:damaged > >-36> 2018-09-26 18:42:32.707 7f70f28a7700 5 mds.beacon.mds2 _send > > down:damaged seq 137 > >-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: > > _send_mon_message to mon.ceph3 at mon:6789/0 > >-34> 2018-09-26 18:42:32.707 7f70f28a7700 1 -- mds:6800/e4cc09cf --> > > mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- > > 0x563b321ad480 con 0 > > > > -3> 2018-09-26 18:42:32.743 7f70f98b5700 5 -- mds:6800/3838577103 >> > > mon:6789/0 conn(0x563b3213e000 :-1 > > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq > > 29 0x563b321ab880 mdsbeaco > > n(85106/mds2 down:damaged seq 311 v587) v7 > > -2> 2018-09-26 18:42:32.743 7f70f98b5700 1 -- mds:6800/3838577103 <== > > mon.2 mon:6789/0 29 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 > > 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e > > 000 > > -1> 2018-09-26 18:42:32.743 7f70f98b5700
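A possible way through the dpkg conflict, sketched on the assumption that the 13.2.1 packages are still in the configured repo (untested here; on a node that is also a mon and OSD the whole suite has to move together, and the package list below may be incomplete - follow apt's dependency errors):

apt-get install --allow-downgrades \
    ceph-base=13.2.1-1xenial ceph-common=13.2.1-1xenial \
    ceph-mon=13.2.1-1xenial ceph-osd=13.2.1-1xenial ceph-mds=13.2.1-1xenial
# then restart the mds and mark the damaged rank repaired
systemctl restart ceph-mds.target
ceph mds repaired 0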
Re: [ceph-users] slow ops after cephfs snapshot removal
> On Nov 9, 2018, at 1:38 PM, Gregory Farnum wrote: > >> On Fri, Nov 9, 2018 at 2:24 AM Kenneth Waegeman >> wrote: >> Hi all, >> >> On Mimic 13.2.1, we are seeing blocked ops on cephfs after removing some >> snapshots: >> >> [root@osd001 ~]# ceph -s >>cluster: >> id: 92bfcf0a-1d39-43b3-b60f-44f01b630e47 >> health: HEALTH_WARN >> 5 slow ops, oldest one blocked for 1162 sec, mon.mds03 has >> slow ops >> >>services: >> mon: 3 daemons, quorum mds01,mds02,mds03 >> mgr: mds02(active), standbys: mds03, mds01 >> mds: ceph_fs-2/2/2 up {0=mds03=up:active,1=mds01=up:active}, 1 >> up:standby >> osd: 544 osds: 544 up, 544 in >> >>io: >> client: 5.4 KiB/s wr, 0 op/s rd, 0 op/s wr >> >> [root@osd001 ~]# ceph health detail >> HEALTH_WARN 5 slow ops, oldest one blocked for 1327 sec, mon.mds03 has >> slow ops >> SLOW_OPS 5 slow ops, oldest one blocked for 1327 sec, mon.mds03 has slow ops >> >> [root@osd001 ~]# ceph -v >> ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic >> (stable) >> >> Is this a known issue? > > It's not exactly a known issue, but from the output and story you've got here > it looks like the OSDs are deleting the snapshot data too fast and the MDS > isn't getting quick enough replies? Or maybe you have an overlarge CephFS > directory which is taking a long time to clean up somehow; you should get the > MDS ops and the MDS' objecter ops in flight and see what specifically is > taking so long. > -Greg We had a similar issue on ceph 10.2 and RBD images. It was fixed by slowing down snapshot removal by adding this to the ceph.conf. [osd] osd snap trim sleep = 0.6 > >> >> Cheers, >> >> Kenneth >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
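For anyone wanting to apply the same throttle without restarting OSDs, a small sketch (whether the value is picked up live varies by release, so verify it on one daemon afterwards):

ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.6'
ceph daemon osd.0 config get osd_snap_trim_sleep   # run on that OSD's host to verify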
Re: [ceph-users] Resolving Large omap objects in RGW index pool
Hi Tom, I used a slightly modified version of your script to generate a comparative list to mine (echoing out the bucket name, id and actual_id), which has returned substantially more indexes than mine, including a number that don't show any indication of resharding having been run, or versioning being enabled, including some with only minor different bucket_ids: 5813 buckets_with_multiple_reindexes2.txt (my script) 7999 buckets_with_multiple_reindexes3.txt (modified Tomasz script) For example a bucket has 2 entries: default.23404.6 default.23407.8 running "radosgw-admin bucket stats" against this bucket shows the current id as default.23407.9 None of the indexes (including the active one) shows multiple shards, or any resharding activities. Using the command: rados -p ,rgw,buckets.index listomapvals .dir.${id} Shows the other (lower) index ids as being empty, and the current one containing the index data. I'm wondering if it is possible some of these are remnants from upgrades (this cluster started as giant and has been upgraded through the LTS releases to Luminous)? Using radosgw-admin metadata get bucket.instance on my sample bucket shows different "ver" information between them all: old: "ver": { "tag": "__17wYsZGbXIhRKtx3goicMV", "ver": 1 }, "mtime": "2014-03-24 15:45:03.00Z" "ver": { "tag": "_x5RWprsckrL3Bj8h7Mbwklt", "ver": 1 }, "mtime": "2014-03-24 15:43:31.00Z" active: "ver": { "tag": "_6sTOABOHCGTSZ-EEIZ29VSN", "ver": 4 }, "mtime": "2017-08-10 15:06:38.940464Z", This obviously still leaves me with the original issue noticed, which is multiple instances of buckets that seem to have been repeatedly resharded to the same number of shards as the currently active index. From having a search around the tracker it seems like this may be worth following - "Aborted dynamic resharding should clean up created bucket index objs" : https://tracker.ceph.com/issues/35953 Again, any other suggestions or ideas are greatly welcomed on this :) Chris On Wed, 17 Oct 2018 at 12:29 Tomasz Płaza wrote: > Hi, > > I have a similar issue, and created a simple bash file to delete old > indexes (it is PoC and have not been tested on production): > > for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort` > do > actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'` > for instance in `radosgw-admin metadata list bucket.instance | jq -r > '.[]' | grep ${bucket}: | cut -d ':' -f 2` > do > if [ "$actual_id" != "$instance" ] > then > radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance} > radosgw-admin metadata rm bucket.instance:${bucket}:${instance} > fi > done > done > > I find it more readable than mentioned one liner. Any sugestions on this > topic are greatly appreciated. > Tom > > Hi, > > Having spent some time on the below issue, here are the steps I took to > resolve the "Large omap objects" warning. Hopefully this will help others > who find themselves in this situation. > > I got the object ID and OSD ID implicated from the ceph cluster logfile on > the mon. 
I then proceeded to the implicated host containing the OSD, and > extracted the implicated PG by running the following, and looking at which > PG had started and completed a deep-scrub around the warning being logged: > > grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large > omap|deep-scrub)' > > If the bucket had not been sharded sufficiently (IE the cluster log showed > a "Key Count" or "Size" over the thresholds), I ran through the manual > sharding procedure (shown here: > https://tracker.ceph.com/issues/24457#note-5) > > Once this was successfully sharded, or if the bucket was previously > sufficiently sharded by Ceph prior to disabling the functionality I was > able to use the following command (seemingly undocumented for Luminous > http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands): > > radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id} > > I then issued a ceph pg deep-scrub against the PG that had contained the > Large omap object. > > Once I had completed this procedure, my Large omap object warnings went > away and the cluster returned to HEALTH_OK. > > However our radosgw bucket indexes pool now seems to be using > substantially more space than previously. Having looked initially at this > bug, and in particular the first comment: > > http://tracker.ceph.com/issues/34307#note-1
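One extra sanity check that may help before purging anything: count how many index entries each instance object actually holds, so empty leftovers are obvious. The bucket ids below are the ones from the example above, and the pool name assumes the usual .rgw.buckets.index.

for id in default.23404.6 default.23407.8 default.23407.9; do
    printf '.dir.%s: ' "${id}"
    rados -p .rgw.buckets.index listomapkeys ".dir.${id}" | wc -l
done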
Re: [ceph-users] Resolving Large omap objects in RGW index pool
Hi, Having spent some time on the below issue, here are the steps I took to resolve the "Large omap objects" warning. Hopefully this will help others who find themselves in this situation. I got the object ID and OSD ID implicated from the ceph cluster logfile on the mon. I then proceeded to the implicated host containing the OSD, and extracted the implicated PG by running the following, and looking at which PG had started and completed a deep-scrub around the warning being logged: grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large omap|deep-scrub)' If the bucket had not been sharded sufficiently (IE the cluster log showed a "Key Count" or "Size" over the thresholds), I ran through the manual sharding procedure (shown here: https://tracker.ceph.com/issues/24457#note-5 ) Once this was successfully sharded, or if the bucket was previously sufficiently sharded by Ceph prior to disabling the functionality I was able to use the following command (seemingly undocumented for Luminous http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands): radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id} I then issued a ceph pg deep-scrub against the PG that had contained the Large omap object. Once I had completed this procedure, my Large omap object warnings went away and the cluster returned to HEALTH_OK. However our radosgw bucket indexes pool now seems to be using substantially more space than previously. Having looked initially at this bug, and in particular the first comment: http://tracker.ceph.com/issues/34307#note-1 I was able to extract a number of bucket indexes that had apparently been resharded, and removed the legacy index using the radosgw-admin bi purge --bucket ${bucket} ${marker}. I am still able to perform a radosgw-admin metadata get bucket.instance:${bucket}:${marker} successfully, however now when I run rados -p .rgw.buckets.index ls | grep ${marker} nothing is returned. Even after this, we were still seeing extremely high disk usage of our OSDs containing the bucket indexes (we have a dedicated pool for this). I then modified the one liner referenced in the previous link as follows: grep -E '"bucket"|"id"|"marker"' bucket-stats.out | awk -F ":" '{print $2}' | tr -d '",' | while read -r bucket; do read -r id; read -r marker; [ "$id" == "$marker" ] && true || NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${marker} | python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`; while [ ${NEWID} ]; do if [ "${NEWID}" != "${marker}" ] && [ ${NEWID} != ${bucket} ] ; then echo "$bucket $NEWID"; fi; NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${NEWID} | python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`; done; done > buckets_with_multiple_reindexes2.txt This loops through the buckets that have a different marker/bucket_id, and looks to see if a new_bucket_instance_id is there, and if so will loop through until there is no longer a "new_bucket_instance_id". After letting this complete, this suggests that I have over 5000 indexes for 74 buckets, some of these buckets have > 100 indexes apparently. 
:~# awk '{print $1}' buckets_with_multiple_reindexes2.txt | uniq | wc -l 74 ~# wc -l buckets_with_multiple_reindexes2.txt 5813 buckets_with_multiple_reindexes2.txt This is running a single realm, multiple zone configuration, and no multi site sync, but the closest I can find to this issue is this bug https://tracker.ceph.com/issues/24603 Should I be OK to loop through these indexes and remove any with a reshard_status of 2, a new_bucket_instance_id that does not match the bucket_instance_id returned by the command: radosgw-admin bucket stats --bucket ${bucket} I'd ideally like to get to a point where I can turn dynamic sharding back on safely for this cluster. Thanks for any assistance, let me know if there's any more information I should provide Chris On Thu, 4 Oct 2018 at 18:22 Chris Sarginson wrote: > Hi, > > Thanks for the response - I am still unsure as to what will happen to the > "marker" reference in the bucket metadata, as this is the object that is > being detected as Large. Will the bucket generate a new "marker" reference > in the bucket metadata? > > I've been reading this page to try and get a better understanding of this > http://docs.ceph.com/docs/luminous/radosgw/layout/ > > However I'm no clearer on this (and what the "marker" is used for), or why > there are multiple separate "bucket_id" values (with different mtime > stamps) that all show as having the same number
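To make the reshard chain easier to eyeball before deciding what to purge, a small sketch that pulls just the relevant fields out of an instance's metadata with jq (the rgw id is the one used in the script above; bucket and marker are placeholders):

bucket=BUCKET1; marker=default.281853840.479
radosgw-admin --id rgw.ceph-rgw-1 metadata get "bucket.instance:${bucket}:${marker}" \
  | jq '.data.bucket_info | {reshard_status, new_bucket_instance_id, num_shards}'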
Re: [ceph-users] Resolving Large omap objects in RGW index pool
Hi, Thanks for the response - I am still unsure as to what will happen to the "marker" reference in the bucket metadata, as this is the object that is being detected as Large. Will the bucket generate a new "marker" reference in the bucket metadata? I've been reading this page to try and get a better understanding of this http://docs.ceph.com/docs/luminous/radosgw/layout/ However I'm no clearer on this (and what the "marker" is used for), or why there are multiple separate "bucket_id" values (with different mtime stamps) that all show as having the same number of shards. If I were to remove the old bucket would I just be looking to execute rados - p .rgw.buckets.index rm .dir.default.5689810.107 Is the differing marker/bucket_id in the other buckets that was found also an indicator? As I say, there's a good number of these, here's some additional examples, though these aren't necessarily reporting as large omap objects: "BUCKET1", "default.281853840.479", "default.105206134.5", "BUCKET2", "default.364663174.1", "default.349712129.3674", Checking these other buckets, they are exhibiting the same sort of symptoms as the first (multiple instances of radosgw-admin metadata get showing what seem to be multiple resharding processes being run, with different mtimes recorded). Thanks Chris On Thu, 4 Oct 2018 at 16:21 Konstantin Shalygin wrote: > Hi, > > Ceph version: Luminous 12.2.7 > > Following upgrading to Luminous from Jewel we have been stuck with a > cluster in HEALTH_WARN state that is complaining about large omap objects. > These all seem to be located in our .rgw.buckets.index pool. We've > disabled auto resharding on bucket indexes due to seeming looping issues > after our upgrade. We've reduced the number reported of reported large > omap objects by initially increasing the following value: > > ~# ceph daemon mon.ceph-mon-1 config get > osd_deep_scrub_large_omap_object_value_sum_threshold > { > "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648 > <(214)%20748-3648>" > } > > However we're still getting a warning about a single large OMAP object, > however I don't believe this is related to an unsharded index - here's the > log entry: > > 2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 : > cluster [WRN] Large omap object found. 
Object: > 15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size > (bytes): 4458647149 <(445)%20864-7149> > > The object in the logs is the "marker" object, rather than the bucket_id - > I've put some details regarding the bucket here: > https://pastebin.com/hW53kTxL > > The bucket limit check shows that the index is sharded, so I think this > might be related to versioning, although I was unable to get confirmation > that the bucket in question has versioning enabled through the aws > cli(snipped debug output below) > > 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response > headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137', > 'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default', > 'content-type': 'application/xml'} > 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response > body: > xmlns="http://s3.amazonaws.com/doc/2006-03-01/;> > > After dumping the contents of large omap object mentioned above into a file > it does seem to be a simple listing of the bucket contents, potentially an > old index: > > ~# wc -l omap_keys > 17467251 omap_keys > > This is approximately 5 million below the currently reported number of > objects in the bucket. > > When running the commands listed > here:http://tracker.ceph.com/issues/34307#note-1 > > The problematic bucket is listed in the output (along with 72 other > buckets): > "CLIENTBUCKET", "default.294495648.690", "default.5689810.107" > > As this tests for bucket_id and marker fields not matching to print out the > information, is the implication here that both of these should match in > order to fully migrate to the new sharded index? > > I was able to do a "metadata get" using what appears to be the old index > object ID, which seems to support this (there's a "new_bucket_instance_id" > field, containing a newer "bucket_id" and reshard_status is 2, which seems > to suggest it has completed). > > I am able to take the "new_bucket_instance_id" and get additional metadata > about the bucket, each time I do this I get a slightly newer > "new_bucket_instance_id", until it stops suggesting updated indexes. > > It's probably worth po
[ceph-users] Resolving Large omap objects in RGW index pool
Hi, Ceph version: Luminous 12.2.7 Following upgrading to Luminous from Jewel we have been stuck with a cluster in HEALTH_WARN state that is complaining about large omap objects. These all seem to be located in our .rgw.buckets.index pool. We've disabled auto resharding on bucket indexes due to seeming looping issues after our upgrade. We've reduced the number reported of reported large omap objects by initially increasing the following value: ~# ceph daemon mon.ceph-mon-1 config get osd_deep_scrub_large_omap_object_value_sum_threshold { "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648" } However we're still getting a warning about a single large OMAP object, however I don't believe this is related to an unsharded index - here's the log entry: 2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 : cluster [WRN] Large omap object found. Object: 15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size (bytes): 4458647149 The object in the logs is the "marker" object, rather than the bucket_id - I've put some details regarding the bucket here: https://pastebin.com/hW53kTxL The bucket limit check shows that the index is sharded, so I think this might be related to versioning, although I was unable to get confirmation that the bucket in question has versioning enabled through the aws cli(snipped debug output below) 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137', 'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default', 'content-type': 'application/xml'} 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response body: http://s3.amazonaws.com/doc/2006-03-01/;> After dumping the contents of large omap object mentioned above into a file it does seem to be a simple listing of the bucket contents, potentially an old index: ~# wc -l omap_keys 17467251 omap_keys This is approximately 5 million below the currently reported number of objects in the bucket. When running the commands listed here: http://tracker.ceph.com/issues/34307#note-1 The problematic bucket is listed in the output (along with 72 other buckets): "CLIENTBUCKET", "default.294495648.690", "default.5689810.107" As this tests for bucket_id and marker fields not matching to print out the information, is the implication here that both of these should match in order to fully migrate to the new sharded index? I was able to do a "metadata get" using what appears to be the old index object ID, which seems to support this (there's a "new_bucket_instance_id" field, containing a newer "bucket_id" and reshard_status is 2, which seems to suggest it has completed). I am able to take the "new_bucket_instance_id" and get additional metadata about the bucket, each time I do this I get a slightly newer "new_bucket_instance_id", until it stops suggesting updated indexes. It's probably worth pointing out that when going through this process the final "bucket_id" doesn't match the one that I currently get when running 'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also suggests that no further resharding has been done as "reshard_status" = 0 and "new_bucket_instance_id" is blank. 
The output is available to view here: https://pastebin.com/g1TJfKLU It would be useful if anyone can offer some clarification on how to proceed from this situation, identifying and removing any old/stale indexes from the index pool (if that is the case), as I've not been able to spot anything in the archives. If there's any further information that is needed for additional context please let me know. Thanks Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
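For anyone retracing this, a sketch of one way to tie the warning back to a PG/OSD and dump the offending index object's keys (object and pool names taken from the log line above):

# which PG and OSDs hold the reported index object
ceph osd map .rgw.buckets.index .dir.default.5689810.107
# dump its omap keys to a file for inspection
rados -p .rgw.buckets.index listomapkeys .dir.default.5689810.107 > omap_keys
wc -l omap_keys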
[ceph-users] Intermittent slow/blocked requests on one node
Hi ceph-users, A few weeks ago, I had an OSD node -- ceph02 -- lock up hard with no indication why. I reset the system and everything came back OK, except that I now get intermittent warnings about slow/blocked requests from OSDs on the other nodes, waiting for a "subop" to complete on one of ceph02's OSDs. Each of these blocked requests will persist for a few (5-8?) minutes, then complete. (I see this using the admin socket to "dump_ops_in_flight" and "dump_historic_slow_ops".) I have tried several things to fix the issue, including rebuilding ceph02 completely! Wiping and reinstalling the OS, purging and re-creating OSDs. All disks reporting "OK" for SMART health status. The only effective intervention has been to mark all of ceph02's OSDs as "out". At this point I strongly suspect a hardware/firmware issue. Two questions for you folks while I dig into that: 1. Any more diagnostics that I should try to troubleshoot the delayed subops in Ceph? Perhaps identify what is causing the delay? 2. When an OSD is complaining about a slow/blocked request (waiting for sub ops), do RBD clients actually notice this, or does it appear to the client that the write has completed? Thank you! Information about my cluster and example warning messages follow. Chris Martin About my cluster: Luminous (12.2.4), 5 nodes, each with 12 OSDs (one rotary HDD per OSD), and a shared SSD in each node with 24 partitions for all the RocksDB databases and WALs. Systems are Supermicro 6028R-E1CR12T with RAID controller (LSI SAS 3108) set to JBOD mode. Deployed with ceph-ansible and using Bluestore. Bonded 10 gbps links throughout (20 gbps each for for client network and cluster network). ``` HEALTH_WARN 2 slow requests are blocked > 32 sec REQUEST_SLOW 2 slow requests are blocked > 32 sec 2 ops are blocked > 262.144 sec osd.2 has blocked requests > 262.144 sec ``` ``` { "description": "osd_op(client.84174831.0:45611220 10.1e0 10:07b8635b:::rbd_data.d091f474b0dc51.6084:head [write 716800~4096] snapc 3c3=[3c3] ondisk+write+known_if_redirected e7305)", "initiated_at": "2018-08-10 14:21:20.507929", "age": 317.226205, "duration": 294.342909, "type_data": { "flag_point": "commit sent; apply or cleanup", "client_info": { "client": "client.84174831", "client_addr": "10.140.120.206:0/2228066036", "tid": 45611220 }, "events": [ { "time": "2018-08-10 14:21:20.507929", "event": "initiated" }, { "time": "2018-08-10 14:21:20.508035", "event": "queued_for_pg" }, { "time": "2018-08-10 14:21:20.508102", "event": "reached_pg" }, { "time": "2018-08-10 14:21:20.508192", "event": "started" }, { "time": "2018-08-10 14:21:20.508331", "event": "waiting for subops from 12,21,60" }, { "time": "2018-08-10 14:21:20.509890", "event": "op_commit" }, { "time": "2018-08-10 14:21:20.509895", "event": "op_applied" }, { "time": "2018-08-10 14:21:20.510475", "event": "sub_op_commit_rec from 12" }, { "time": "2018-08-10 14:21:20.510526", "event": "sub_op_commit_rec from 21" }, { "time": "2018-08-10 14:26:14.850653", "event": "sub_op_commit_rec from 60" }, { "time": "2018-08-10 14:26:14.850728", "event": "commit_sent" }, { "time": "2018-08-10 14:26:14.850838", "event": "done" } ] } } ``` ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
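In case it helps others with similar symptoms, a sketch of checks that can narrow this down further (osd.60 is the OSD named in the 'sub_op_commit_rec from 60' event above; the address is a placeholder):

# compare commit/apply latency across OSDs; ceph02's should stand out if it is the disk path
ceph osd perf | sort -n -k2 | tail -20
# on ceph02, watch the suspect OSD while a request is blocked
ceph daemon osd.60 dump_ops_in_flight
ceph daemon osd.60 dump_historic_slow_ops
# basic cluster-network sanity between the peers (only if you run jumbo frames)
ping -M do -s 8972 -c 5 CEPH02_CLUSTER_IP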
Re: [ceph-users] ceph plugin balancer error
I have tried to modify /usr/lib64/ceph/mgr/balancer/module.py to replace iteritems() with items(), but I still get the following error:

g1:/usr/lib64/ceph/mgr/balancer # ceph balancer status
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 297, in handle_command
    return (0, json.dumps(s, indent=4), '')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib64/python3.6/json/encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'dict_keys' is not JSON serializable

It seems to me that ceph-mgr is compiled/written for Python 3.6 but the balancer plugin is written for Python 2.7... This might be related: https://github.com/ceph/ceph/pull/21446 This might be an openSUSE ceph package build issue. Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
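A hedged sketch of the hand-patching workaround implied above, until the packaging is fixed (paths are from this post; the exact spots that need list(...) wrapping still have to be found case by case):

cd /usr/lib64/ceph/mgr/balancer
cp module.py module.py.bak
# convert the py2-only dict iterators to their py3-safe equivalents
sed -i 's/\.iteritems()/\.items()/g; s/\.itervalues()/\.values()/g; s/\.iterkeys()/\.keys()/g' module.py
# the remaining TypeError means the status dict still contains dict_keys views;
# wrap those values in list(...) where the dict is built, then restart the mgr:
systemctl restart ceph-mgr.target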
Re: [ceph-users] ceph plugin balancer error
Weird thing is that:

g1:~ # locate /bin/python
/usr/bin/python
/usr/bin/python2
/usr/bin/python2.7
/usr/bin/python3
/usr/bin/python3.6
/usr/bin/python3.6m
g1:~ # ls /usr/bin/python* -al
lrwxrwxrwx 1 root root 9 May 13 07:41 /usr/bin/python -> python2.7
lrwxrwxrwx 1 root root 9 May 13 07:41 /usr/bin/python2 -> python2.7
-rwxr-xr-x 1 root root 6304 May 13 07:41 /usr/bin/python2.7
lrwxrwxrwx 1 root root 9 May 13 08:39 /usr/bin/python3 -> python3.6
-rwxr-xr-x 2 root root 10456 May 13 08:39 /usr/bin/python3.6
-rwxr-xr-x 2 root root 10456 May 13 08:39 /usr/bin/python3.6m

My default python environment is 2.7, so the dict object should have the iteritems method there.... Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
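One quick check that may clear up the confusion: the interpreter that matters for mgr modules is the one embedded in the ceph-mgr binary, not whatever /usr/bin/python points to.

ldd /usr/bin/ceph-mgr | grep -i libpython
# a libpython3.6 hit means the balancer module runs under Python 3
# regardless of the system default python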
[ceph-users] ceph plugin balancer error
Hi, I am running a test on Ceph Mimic 13.0.2.1874+ge31585919b-lp150.1.2 using openSUSE-Leap-15.0. When I run "ceph balancer status", it errors out:

g1:/var/log/ceph # ceph balancer status
Error EIO: Module 'balancer' has experienced an error and cannot handle commands: 'dict' object has no attribute 'iteritems'

What needs to be configured in order to get it working? Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Frequent slow requests
On 2018-06-19 12:17 pm, Frank de Bot (lists) wrote: Frank (lists) wrote: Hi, On a small cluster (3 nodes) I frequently have slow requests. When dumping the in-flight ops from the hanging OSD, it seems it doesn't get a 'response' for one of the subops. The events always look like: I've done some further testing; all slow requests are blocked by OSDs on a single host. How can I debug this problem further? I can't find any errors or other strange things on the host with OSDs that are seemingly not sending a response to an op.

I don't know if you have already checked, but we usually find a bad drive after running 'smartctl -t long', or the OSD node is starting to use the swap space because of memory usage. Regards, Frank de Bot ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
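To expand that into a concrete checklist for the suspect host (device names are placeholders):

smartctl -t long /dev/sdX            # later review the result with: smartctl -a /dev/sdX
iostat -x 1 5                        # look for one disk with outlying await/%util
free -m && vmstat 1 5                # non-zero si/so columns mean the node is swapping
dmesg -T | egrep -i 'error|reset|timeout' | tail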
Re: [ceph-users] Journal flushed on osd clean shutdown?
Excellent news - tks! On Wed, Jun 13, 2018 at 11:50:15AM +0200, Wido den Hollander wrote: On 06/13/2018 11:39 AM, Chris Dunlop wrote: Hi, Is the osd journal flushed completely on a clean shutdown? In this case, with Jewel, and FileStore osds, and a "clean shutdown" being: It is, a Jewel OSD will flush it's journal on a clean shutdown. The flush-journal is no longer needed. Wido ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Journal flushed on osd clean shutdown?
Hi, Is the osd journal flushed completely on a clean shutdown? In this case, with Jewel, and FileStore osds, and a "clean shutdown" being:

systemctl stop ceph-osd@${osd}

I understand it's documented practice to issue a --flush-journal after shutting down an osd if you're intending to do anything with the journal, but herein lies the sorry tale... I've accidentally issued a 'blkdiscard' on a whole SSD device containing the journals for multiple osds, rather than for a specific partition as intended. The affected osds themselves continue to work along happily. I assume the journals are write-only during normal operation, in which case it's understandable the osds are oblivious to the underlying zeroing of the journals (and partition table!).

The GPT partition table and the individual journal partition types and guids etc. have been recreated, so, in theory at least, a clean shutdown and restart should be fine *if* the clean shutdown means there's nothing in the journal to replay on startup. I've experimented with one of the affected osds (used for "scratch" purposes, so safe to play with), shutting it down and starting it up again, and it seems to be happy - somewhat to my surprise. I thought I'd have to at least use --mkjournal before it would start up again, to reinstate whatever header/signature is used in the journals.

There are other affected osds which hold live data, so I want to be more careful there. One option is to simply kill the affected osds and recreate them, and allow the data redundancy to take care of things. However I'm wondering if things should theoretically be ok if I carefully shut down and restart each of the remaining osds in turn, or am I taking some kind of data corruption risk? Tks, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
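For what it's worth, a cautious sketch of the per-OSD cycle for the remaining journals, assuming Jewel/FileStore as described (the osd id is a placeholder; do one OSD at a time and wait for HEALTH_OK in between):

ceph osd set noout
systemctl stop ceph-osd@12            # journal is flushed on a clean stop
ceph-osd -i 12 --mkjournal            # re-initialise the zeroed journal header
systemctl start ceph-osd@12
# after the last OSD has rejoined and the cluster is healthy:
ceph osd unset noout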
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi Caspar, Sean and I replaced the problematic DC S4600 disks (after all but one had failed) in our cluster with Samsung SM863a disks. There was an NDA for new Intel firmware (as mentioned earlier in the thread by David) but given the problems we were experiencing we moved all Intel disks to a single failure domain but were unable to get to deploy additional firmware to test. The Samsung should fit your requirements. http://www.samsung.com/semiconductor/minisite/ssd/product/enterprise/sm863a/ Regards Chris On Thu, 22 Feb 2018 at 12:50 Caspar Smit <caspars...@supernas.eu> wrote: > Hi Sean and David, > > Do you have any follow ups / news on the Intel DC S4600 case? We are > looking into this drives to use as DB/WAL devices for a new to be build > cluster. > > Did Intel provide anything (like new firmware) which should fix the issues > you were having or are these drives still unreliable? > > At the moment we are also looking into the Intel DC S3610 as an > alternative which are a step back in performance but should be very > reliable. > > Maybe any other recommendations for a ~200GB 2,5" SATA SSD to use as > DB/WAL? (Aiming for ~3 DWPD should be sufficient for DB/WAL?) > > Kind regards, > Caspar > > 2018-01-12 15:45 GMT+01:00 Sean Redmond <sean.redmo...@gmail.com>: > >> Hi David, >> >> To follow up on this I had a 4th drive fail (out of 12) and have opted to >> order the below disks as a replacement, I have an ongoing case with Intel >> via the supplier - Will report back anything useful - But I am going to >> avoid the Intel s4600 2TB SSD's for the moment. >> >> 1.92TB Samsung SM863a 2.5" Enterprise SSD, SATA3 6Gb/s, 2-bit MLC V-NAND >> >> Regards >> Sean Redmond >> >> On Wed, Jan 10, 2018 at 11:08 PM, Sean Redmond <sean.redmo...@gmail.com> >> wrote: >> >>> Hi David, >>> >>> Thanks for your email, they are connected inside Dell R730XD (2.5 inch >>> 24 disk model) in None RAID mode via a perc RAID card. >>> >>> The version of ceph is Jewel with kernel 4.13.X and ubuntu 16.04. >>> >>> Thanks for your feedback on the HGST disks. >>> >>> Thanks >>> >>> On Wed, Jan 10, 2018 at 10:55 PM, David Herselman <d...@syrex.co> wrote: >>> >>>> Hi Sean, >>>> >>>> >>>> >>>> No, Intel’s feedback has been… Pathetic… I have yet to receive anything >>>> more than a request to ‘sign’ a non-disclosure agreement, to obtain beta >>>> firmware. No official answer as to whether or not one can logically unlock >>>> the drives, no answer to my question whether or not Intel publish serial >>>> numbers anywhere pertaining to recalled batches and no information >>>> pertaining to whether or not firmware updates would address any known >>>> issues. >>>> >>>> >>>> >>>> This with us being an accredited Intel Gold partner… >>>> >>>> >>>> >>>> >>>> >>>> We’ve returned the lot and ended up with 9/12 of the drives failing in >>>> the same manner. The replaced drives, which had different serial number >>>> ranges, also failed. Very frustrating is that the drives fail in a way that >>>> result in unbootable servers, unless one adds ‘rootdelay=240’ to the >>>> kernel. >>>> >>>> >>>> >>>> >>>> >>>> I would be interested to know what platform your drives were in and >>>> whether or not they were connected to a RAID module/card. >>>> >>>> >>>> >>>> PS: After much searching we’ve decided to order the NVMe conversion kit >>>> and have ordered HGST UltraStar SN200 2.5 inch SFF drives with a 3 DWPD >>>> rating. 
>>>> >>>> >>>> >>>> >>>> >>>> Regards >>>> >>>> David Herselman >>>> >>>> >>>> >>>> *From:* Sean Redmond [mailto:sean.redmo...@gmail.com] >>>> *Sent:* Thursday, 11 January 2018 12:45 AM >>>> *To:* David Herselman <d...@syrex.co> >>>> *Cc:* Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com >>>> >>>> *Subject:* Re: [ceph-users] Many concurrent drive failures - How do I >>>> activate pgs? >>>> >>>> >>>> >>>> Hi, >>>> >>>> >>>> >>>> I have a case where 3 out to 12 of
[ceph-users] ceph mons de-synced from rest of cluster?
All, We recently doubled the number of OSDs in our cluster, and towards the end of the rebalancing I noticed that recovery IO fell to nothing and that the ceph mons eventually looked like this when I ran ceph -s:

  cluster:
    id: 6a65c3d0-b84e-4c89-bbf7-a38a1966d780
    health: HEALTH_WARN
            34922/4329975 objects misplaced (0.807%)
            Reduced data availability: 542 pgs inactive, 49 pgs peering, 13502 pgs stale
            Degraded data redundancy: 248778/4329975 objects degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs undersized
  services:
    mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
    mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
    osd: 376 osds: 376 up, 376 in
  data:
    pools:   9 pools, 13952 pgs
    objects: 1409k objects, 5992 GB
    usage:   31528 GB used, 1673 TB / 1704 TB avail
    pgs:     3.225% pgs unknown
             0.659% pgs not active
             248778/4329975 objects degraded (5.745%)
             34922/4329975 objects misplaced (0.807%)
             6141 stale+active+clean
             4537 stale+active+remapped+backfilling
             1575 stale+active+undersized+degraded
             489  stale+active+clean+remapped
             450  unknown
             396  stale+active+recovery_wait+degraded
             216  stale+active+undersized+degraded+remapped+backfilling
             40   stale+peering
             30   stale+activating
             24   stale+active+undersized+remapped
             22   stale+active+recovering+degraded
             13   stale+activating+degraded
             9    stale+remapped+peering
             4    stale+active+remapped+backfill_wait
             3    stale+active+clean+scrubbing+deep
             2    stale+active+undersized+degraded+remapped+backfill_wait
             1    stale+active+remapped

The problem is, everything works fine. If I run ceph health detail and do a pg query against one of the 'degraded' placement groups, it reports back as active+clean. All clients in the cluster can write and read at normal speeds, but no IO information is ever reported in ceph -s. From what I can see, everything in the cluster is working properly except the actual reporting of the cluster's status. Has anyone seen this before / know how to sync the mons up to what the OSDs are actually reporting? I see no connectivity errors in the logs of the mons or the osds. Thanks, --- v/r Chris Apsey bitskr...@bitskrieg.net https://www.bitskrieg.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
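A quick way to cross-check what the mons/mgr report against what the OSDs themselves believe, assuming a hypothetical pg id of 1.2f taken from ceph health detail:

# pgs the monitors currently consider stale or degraded
ceph health detail | grep -E 'stale|degraded' | head

# ask the acting OSDs directly; a genuinely healthy pg reports active+clean here
ceph pg 1.2f query | grep '"state"'

# if the reported status never catches up, restarting the active mgr (and then the
# mons, one at a time) is a common, low-risk thing to try
systemctl restart ceph-mgr@cephmon-0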
Re: [ceph-users] Increase recovery / backfilling speed (with many small objects)
You probably want to consider increasing osd_max_backfills. You should be able to inject this online: http://docs.ceph.com/docs/luminous/rados/configuration/osd-config-ref/ You might want to drop your osd_recovery_max_active setting back down to around 2 or 3, although with it being SSD your performance will probably be fine. On Fri, 5 Jan 2018 at 20:13 Stefan Kooman wrote: > Hi, > > I know I'm not the only one with this question, as I have seen similar > questions on this list: > How to speed up recovery / backfilling? > > Current status: > > pgs: 155325434/800312109 objects degraded (19.408%) > 1395 active+clean > 440 active+undersized+degraded+remapped+backfill_wait > 21 active+undersized+degraded+remapped+backfilling > > io: > client: 180 kB/s rd, 5776 kB/s wr, 273 op/s rd, 440 op/s wr > recovery: 2990 kB/s, 109 keys/s, 114 objects/s > > What we did: shut down one DC, fill the cluster with loads of objects, turn > the DC back on (size = 3, min_size = 2). To test exactly this: recovery. > > I have been going through all the recovery options (including legacy) but > I cannot get the recovery speed to increase: > > osd_recovery_op_priority 63 > osd_client_op_priority 3 > > ^^ yup, reversed those, to no avail > > osd_recovery_max_active 10 > > ^^ This helped for a short period of time, and then it went back to > "slow" mode > > osd_recovery_max_omap_entries_per_chunk 0 > osd_recovery_max_chunk 67108864 > > Haven't seen any change in recovery speed. > > osd_recovery_sleep_ssd 0.00 > ^^ default for SSD > > The whole cluster is idle, OSDs have very low load. What can be the > reason for the slow recovery? Something is holding it back but I cannot > think of what. > > Ceph Luminous 12.2.2 (bluestore on lvm, all SSD) > > Thanks, > > Stefan > > -- > | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
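If osd_max_backfills turns out to be the limit, it can be raised at runtime without restarting anything; a sketch, with the values purely as examples:

# allow more concurrent backfills per osd (the Luminous default is 1)
ceph tell osd.* injectargs '--osd_max_backfills 4'

# and bring recovery ops back down to something modest
ceph tell osd.* injectargs '--osd_recovery_max_active 3'

# watch the recovery line change
ceph -s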
Re: [ceph-users] Switch to replica 3
On 2017-11-20 3:39 am, Matteo Dacrema wrote: Yes I mean the existing Cluster. SSDs are on a fully separate pool. Cluster is not busy during recovery and deep scrubs but I think it’s better to limit replication in some way when switching to replica 3. My question is to understand if I need to set some options parameters to limit the impact of the creation of new objects.I’m also concerned about disk filling up during recovery because of inefficient data balancing. You can try using osd_recovery_sleep to slow down the backfilling so it does not cause the client io to hang. ceph tell osd.* injectargs "--osd_recovery_sleep 0.1" Here osd tree ID WEIGHTTYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY -10 19.69994 root ssd -11 5.06998 host ceph101 166 0.98999 osd.166 up 1.0 1.0 167 1.0 osd.167 up 1.0 1.0 168 1.0 osd.168 up 1.0 1.0 169 1.07999 osd.169 up 1.0 1.0 170 1.0 osd.170 up 1.0 1.0 -12 4.92998 host ceph102 171 0.98000 osd.171 up 1.0 1.0 172 0.92999 osd.172 up 1.0 1.0 173 0.98000 osd.173 up 1.0 1.0 174 1.0 osd.174 up 1.0 1.0 175 1.03999 osd.175 up 1.0 1.0 -13 4.69998 host ceph103 176 0.84999 osd.176 up 1.0 1.0 177 0.84999 osd.177 up 1.0 1.0 178 1.0 osd.178 up 1.0 1.0 179 1.0 osd.179 up 1.0 1.0 180 1.0 osd.180 up 1.0 1.0 -14 5.0 host ceph104 181 1.0 osd.181 up 1.0 1.0 182 1.0 osd.182 up 1.0 1.0 183 1.0 osd.183 up 1.0 1.0 184 1.0 osd.184 up 1.0 1.0 185 1.0 osd.185 up 1.0 1.0 -1 185.19835 root default -2 18.39980 host ceph001 63 0.7 osd.63up 1.0 1.0 64 0.7 osd.64up 1.0 1.0 65 0.7 osd.65up 1.0 1.0 146 0.7 osd.146 up 1.0 1.0 147 0.7 osd.147 up 1.0 1.0 148 0.90999 osd.148 up 1.0 1.0 149 0.7 osd.149 up 1.0 1.0 150 0.7 osd.150 up 1.0 1.0 151 0.7 osd.151 up 1.0 1.0 152 0.7 osd.152 up 1.0 1.0 153 0.7 osd.153 up 1.0 1.0 154 0.7 osd.154 up 1.0 1.0 155 0.8 osd.155 up 1.0 1.0 156 0.84999 osd.156 up 1.0 1.0 157 0.7 osd.157 up 1.0 1.0 158 0.7 osd.158 up 1.0 1.0 159 0.84999 osd.159 up 1.0 1.0 160 0.90999 osd.160 up 1.0 1.0 161 0.90999 osd.161 up 1.0 1.0 162 0.90999 osd.162 up 1.0 1.0 163 0.7 osd.163 up 1.0 1.0 164 0.90999 osd.164 up 1.0 1.0 165 0.64999 osd.165 up 1.0 1.0 -3 19.41982 host ceph002 23 0.7 osd.23up 1.0 1.0 24 0.7 osd.24up 1.0 1.0 25 0.90999 osd.25up 1.0 1.0 26 0.5 osd.26up 1.0 1.0 27 0.95000 osd.27up 1.0 1.0 28 0.64999 osd.28up 1.0 1.0 29 0.75000 osd.29up 1.0 1.0 30 0.8 osd.30up 1.0 1.0 31 0.90999 osd.31up 1.0 1.0 32 0.90999 osd.32up 1.0 1.0 33 0.8 osd.33up 1.0 1.0 34 0.90999 osd.34up 1.0 1.0 35 0.90999 osd.35up 1.0 1.0 36 0.84999 osd.36up 1.0 1.0 37 0.8 osd.37up 1.0 1.0 38 1.0 osd.38up 1.0 1.0 39 0.7 osd.39up 1.0 1.0 40 0.90999 osd.40up 1.0 1.0 41 0.84999 osd.41up 1.0 1.0 42
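A sketch of the sequence being discussed at the top of this message - throttle backfill first, then raise the pool to replica 3 - where the pool name and values are placeholders, not taken from this cluster:

# slow backfill down so client io keeps flowing
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_sleep 0.1'

# raise replication on the pool
ceph osd pool set volumes size 3

# keep an eye on free space while the third copies are created
ceph df
ceph -s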
Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time
I'll document the resolution here for anyone else who experiences similar issues. We determined that the root cause of the long boot time was a combination of ZFS version and tuning, together with how long filenames are handled.

## 1 ## Insufficient ARC cache size. Dramatically increasing arc_max and arc_meta_limit allowed better performance once the cache had time to populate. Previously, each call to getxattr took about 8ms (0.008 sec). Multiply that by millions of getxattr calls during OSD daemon startup and it was taking hours. This only became apparent when we upgraded to Jewel. Hammer does not appear to parse all of the extended attributes during startup; this appears to have been introduced in Jewel as part of the sortbitwise algorithm. Increasing arc_max and arc_meta_limit allowed more of the metadata to be cached in memory. This reduced getxattr call duration to between 10 and 100 microseconds (0.00001 to 0.0001 sec), an average of around 400x faster.

## 2 ## ZFS version 0.6.5.11 and its inability to store large amounts of meta info in the inode/dnode. My understanding is that the ability to use a larger dnode size to store metadata was not introduced until ZFS version 0.7.x. In version 0.6.5.11 this was causing large quantities of metadata to be stored in inefficient spill blocks, which were taking longer to access since they were not cached due to the (previously) undersized ARC settings.

## Summary ## Increasing the ARC cache settings improved performance, but performance will still be a concern whenever the ARC is purged/flushed, such as during a system reboot, until the cache rebuilds itself. Upgrading to ZFS version 0.7.x is one potential upgrade path, to utilize the larger dnode size. Another upgrade path is to switch to XFS, which is the recommended filesystem for Ceph. XFS does not appear to require any kind of meta cache due to its different handling of meta info in the inode.

-- Chris

From: Willem Jan Withagen <w...@digiware.nl> Sent: Wednesday, November 1, 2017 4:51:52 PM To: Chris Jones; Gregory Farnum Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

On 01/11/2017 18:04, Chris Jones wrote: > Greg, > > Thanks so much for the reply! > > We are not clear on why ZFS is behaving poorly under some circumstances > on getxattr system calls, but that appears to be the case. > > Since the last update we have discovered that back-to-back booting of > the OSD yields very fast boot time, and very fast getxattr system calls. > > A longer period between boots (or perhaps related to influx of new data) > correlates to longer boot duration. This is due to slow getxattr calls > of certain types. > > We suspect this may be a caching or fragmentation issue with ZFS for > xattrs. Use of longer filenames appears to make this worse. As far as I understand, a lot of this data is stored in the metadata, which is (or can be) a different set in the (l2)arc cache. So are you talking about an OSD reboot, or a system reboot? I don't quite understand what you mean by back-to-back... I have little experience with ZFS on Linux, so whether behaviour there is different is hard for me to tell. If you are rebooting the OSD, I can imagine that certain sequences of rebooting pre-load the meta-cache. Reboots further apart can lead to a different working set in the ZFS caches, and then all data needs to be refetched instead of being served from the l2arc. And note that in newer ZFS versions the in-memory ARC can even be compressed, leading to an even higher hit rate. 
For example on my development server with 32Gb memory: ARC: 20G Total, 1905M MFU, 16G MRU, 70K Anon, 557M Header, 1709M Other 17G Compressed, 42G Uncompressed, 2.49:1 Ratio --WjW > > We experimented on some OSDs with swapping over to XFS as the > filesystem, and the problem does not appear to be present on those OSDs. > > The two examples below are representative of a Long Boot (longer running > time and more data influx between osd rebooting) and a Short Boot where > we booted the same OSD back to back. > > Notice the drastic difference in time on the getxattr that yields the > ENODATA return. Around 0.009 secs for "long boot" and "0.0002" secs when > the same OSD is booted back to back. Long boot time is approx 40x to 50x > longer. Multiplied by thousands of getxattr calls, this is/was our > source of longer boot time. > > We are considering a full switch to XFS, but would love to hear any ZFS > tuning tips that might be a short term workaround. > > We are using ZFS 6.5.11 prior to implementation of the ability to use > large dnodes which would allow the use o
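For reference, a sketch of how the ARC limits mentioned in the resolution above are typically raised on ZFS on Linux 0.6.x/0.7.x; the byte values (64 GiB ARC, 32 GiB of it for metadata) are illustrative only and depend on the RAM available:

# runtime change, in bytes
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_meta_limit

# make it persistent across reboots
cat >> /etc/modprobe.d/zfs.conf <<EOF
options zfs zfs_arc_max=68719476736 zfs_arc_meta_limit=34359738368
EOF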
Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time
Greg, Thanks so much for the reply! We are not clear on why ZFS is behaving poorly under some circumstances on getxattr system calls, but that appears to be the case. Since the last update we have discovered that back-to-back booting of the OSD yields very fast boot time, and very fast getxattr system calls. A longer period between boots (or perhaps related to influx of new data) correlates to longer boot duration. This is due to slow getxattr calls of certain types. We suspect this may be a caching or fragmentation issue with ZFS for xattrs. Use of longer filenames appear to make this worse. We experimented on some OSDs with swapping over to XFS as the filesystem, and the problem does not appear to be present on those OSDs. The two examples below are representative of a Long Boot (longer running time and more data influx between osd rebooting) and a Short Boot where we booted the same OSD back to back. Notice the drastic difference in time on the getxattr that yields the ENODATA return. Around 0.009 secs for "long boot" and "0.0002" secs when the same OSD is booted back to back. Long boot time is approx 40x to 50x longer. Multiplied by thousands of getxattr calls, this is/was our source of longer boot time. We are considering a full switch to XFS, but would love to hear any ZFS tuning tips that might be a short term workaround. We are using ZFS 6.5.11 prior to implementation of the ability to use large dnodes which would allow the use of dnodesize=auto. #Long Boot <0.44>[pid 3413902] 13:08:00.884238 getxattr("/osd/9/current/20.86bs3_head/default.34597.7\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebana_1d9e1e82d623f49c994f_0_long", "user.cephos.lfn3", "default.34597.7\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-92d9df789f9aaf007c50c50bb66e70af__head_0177C86B__14__3", 1024) = 616 <0.44> <0.008875>[pid 3413902] 13:08:00.884476 getxattr("/osd/9/current/20.86bs3_head/default.34597.57\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_79a7acf2d32f4302a1a4_0_long", "user.cephos.lfn3-alt", 0x7f849bf95180, 1024) = -1 ENODATA (No data available) <0.008875> #Short Boot <0.15> [pid 3452111] 13:37:18.604442 getxattr("/osd/9/current/20.15c2s3_head/default.34597.22\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_efb8ca13c57689d76797_0_long", "user.cephos.lfn3", 
"default.34597.22\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-b519f8607a3d9de0f815d18b6905b27d__head_9726F5C2__14__3", 1024) = 617 <0.15> <0.18> [pid 3452111] 13:37:18.604546 getxattr("/osd/9/current/20.15c2s3_head/default.34597.66\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_0e6d86f58e03d0f6de04_0_long", "user.cephos.lfn3-alt", 0x7fd4e8017680, 1024) = -1 ENODATA (No data available) <0.18> ------ Christopher J. Jones From: Gregory Farnum <gfar...@redhat.com> Sent: Monday, October 30, 2017 6:20:15 PM To: Chris Jones Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time On Thu, Oct 26, 2017 at 11:33 AM Chris Jones <chris.jo...@ctl.io<mailto:chris.jo...@ctl.io&
Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time
The long running functionality appears to be related to clear_temp_objects(); from OSD.cc called from init(). What is this functionality intended to do? Is it required to be run on every OSD startup? Any configuration settings that would help speed this up? -- Christopher J. Jones From: Chris Jones Sent: Wednesday, October 25, 2017 12:52:13 PM To: ceph-users@lists.ceph.com Subject: Hammer to Jewel Upgrade - Extreme OSD Boot Time After upgrading from CEPH Hammer to Jewel, we are experiencing extremely long osd boot duration. This long boot time is a huge concern for us and are looking for insight into how we can speed up the boot time. In Hammer, OSD boot time was approx 3 minutes. After upgrading to Jewel, boot time is between 1 and 3 HOURS. This was not surprising during initial boot after the upgrade, however we are seeing this occur each time an OSD process is restarted. This is using ZFS. We added the following configuration to ceph.conf as part of the upgrade to overcome some filesystem startup issues per the recommendations at the following url: https://github.com/zfsonlinux/zfs/issues/4913 Added ceph.conf configuration: filestore_max_inline_xattrs = 10 filestore_max_inline_xattr_size = 65536 filestore_max_xattr_value_size = 65536 Example OSD Log (note the long duration at the line containing "osd.191 119292 crush map has features 281819681652736, adjusting msgr requires for osds": 2017-10-24 18:01:18.410249 7f1333d08700 1 leveldb: Generated table #524178: 158056 keys, 1502244 bytes 2017-10-24 18:01:18.805235 7f1333d08700 1 leveldb: Generated table #524179: 266429 keys, 2129196 bytes 2017-10-24 18:01:19.254798 7f1333d08700 1 leveldb: Generated table #524180: 197068 keys, 2128820 bytes 2017-10-24 18:01:20.070109 7f1333d08700 1 leveldb: Generated table #524181: 192675 keys, 2129122 bytes 2017-10-24 18:01:20.947818 7f1333d08700 1 leveldb: Generated table #524182: 196806 keys, 2128945 bytes 2017-10-24 18:01:21.183475 7f1333d08700 1 leveldb: Generated table #524183: 63421 keys, 828081 bytes 2017-10-24 18:01:21.477197 7f1333d08700 1 leveldb: Generated table #524184: 173331 keys, 1348407 bytes 2017-10-24 18:01:21.477226 7f1333d08700 1 leveldb: Compacted 1@2 + 12@3 files => 19838392 bytes 2017-10-24 18:01:21.509952 7f1333d08700 1 leveldb: compacted to: files[ 0 1 66 551 788 0 0 ] 2017-10-24 18:01:21.512235 7f1333d08700 1 leveldb: Delete type=2 #523994 2017-10-24 18:01:23.142853 7f1349d93800 0 filestore(/osd/191) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2017-10-24 18:01:27.927823 7f1349d93800 0 cls/hello/cls_hello.cc:305: loading cls_hello 2017-10-24 18:01:27.933105 7f1349d93800 0 cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan 2017-10-24 18:01:27.960283 7f1349d93800 0 osd.191 119292 crush map has features 281544803745792, adjusting msgr requires for clients 2017-10-24 18:01:27.960309 7f1349d93800 0 osd.191 119292 crush map has features 281819681652736 was 8705, adjusting msgr requires for mons 2017-10-24 18:01:27.960316 7f1349d93800 0 osd.191 119292 crush map has features 281819681652736, adjusting msgr requires for osds 2017-10-24 23:28:09.694213 7f1349d93800 0 osd.191 119292 load_pgs 2017-10-24 23:28:14.757449 7f1333d08700 1 leveldb: Compacting 1@1 + 13@2 files 2017-10-24 23:28:15.002381 7f1333d08700 1 leveldb: Generated table #524185: 17970 keys, 2128900 bytes 2017-10-24 23:28:15.198899 7f1333d08700 1 leveldb: Generated table #524186: 22386 keys, 2128610 bytes 2017-10-24 23:28:15.337819 7f1333d08700 1 leveldb: Generated table #524187: 3890 keys, 371799 
bytes 2017-10-24 23:28:15.693433 7f1333d08700 1 leveldb: Generated table #524188: 21984 keys, 2128947 bytes 2017-10-24 23:28:15.874955 7f1333d08700 1 leveldb: Generated table #524189: 9565 keys, 1207375 bytes 2017-10-24 23:28:16.253599 7f1333d08700 1 leveldb: Generated table #524190: 21999 keys, 2129625 bytes 2017-10-24 23:28:16.576250 7f1333d08700 1 leveldb: Generated table #524191: 21544 keys, 2128033 bytes Strace on an OSD process during startup reveals what appears to be parsing of objects and calling getxattr. The bulk of the time is spent on parsing the objects and performing the getxattr system calls... for example: (Full lines truncated intentionally for brevity). [pid 3068964] getxattr("/osd/174/current/20.6a4s7_head/default.7385.13\...(ommitted) [pid 3068964] getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(ommitted) [pid 3068964] getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(ommitted) Cluster details: - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 drives per host, collocated journals) - Pre-upgrade: Hammer (ceph version 0.94.6) - Post-upgrade: Jewel (ceph version 10.2.9) - object storage use only - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs) - fail
[ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time
After upgrading from CEPH Hammer to Jewel, we are experiencing extremely long osd boot duration. This long boot time is a huge concern for us and are looking for insight into how we can speed up the boot time. In Hammer, OSD boot time was approx 3 minutes. After upgrading to Jewel, boot time is between 1 and 3 HOURS. This was not surprising during initial boot after the upgrade, however we are seeing this occur each time an OSD process is restarted. This is using ZFS. We added the following configuration to ceph.conf as part of the upgrade to overcome some filesystem startup issues per the recommendations at the following url: https://github.com/zfsonlinux/zfs/issues/4913 Added ceph.conf configuration: filestore_max_inline_xattrs = 10 filestore_max_inline_xattr_size = 65536 filestore_max_xattr_value_size = 65536 Example OSD Log (note the long duration at the line containing "osd.191 119292 crush map has features 281819681652736, adjusting msgr requires for osds": 2017-10-24 18:01:18.410249 7f1333d08700 1 leveldb: Generated table #524178: 158056 keys, 1502244 bytes 2017-10-24 18:01:18.805235 7f1333d08700 1 leveldb: Generated table #524179: 266429 keys, 2129196 bytes 2017-10-24 18:01:19.254798 7f1333d08700 1 leveldb: Generated table #524180: 197068 keys, 2128820 bytes 2017-10-24 18:01:20.070109 7f1333d08700 1 leveldb: Generated table #524181: 192675 keys, 2129122 bytes 2017-10-24 18:01:20.947818 7f1333d08700 1 leveldb: Generated table #524182: 196806 keys, 2128945 bytes 2017-10-24 18:01:21.183475 7f1333d08700 1 leveldb: Generated table #524183: 63421 keys, 828081 bytes 2017-10-24 18:01:21.477197 7f1333d08700 1 leveldb: Generated table #524184: 173331 keys, 1348407 bytes 2017-10-24 18:01:21.477226 7f1333d08700 1 leveldb: Compacted 1@2 + 12@3 files => 19838392 bytes 2017-10-24 18:01:21.509952 7f1333d08700 1 leveldb: compacted to: files[ 0 1 66 551 788 0 0 ] 2017-10-24 18:01:21.512235 7f1333d08700 1 leveldb: Delete type=2 #523994 2017-10-24 18:01:23.142853 7f1349d93800 0 filestore(/osd/191) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2017-10-24 18:01:27.927823 7f1349d93800 0 cls/hello/cls_hello.cc:305: loading cls_hello 2017-10-24 18:01:27.933105 7f1349d93800 0 cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan 2017-10-24 18:01:27.960283 7f1349d93800 0 osd.191 119292 crush map has features 281544803745792, adjusting msgr requires for clients 2017-10-24 18:01:27.960309 7f1349d93800 0 osd.191 119292 crush map has features 281819681652736 was 8705, adjusting msgr requires for mons 2017-10-24 18:01:27.960316 7f1349d93800 0 osd.191 119292 crush map has features 281819681652736, adjusting msgr requires for osds 2017-10-24 23:28:09.694213 7f1349d93800 0 osd.191 119292 load_pgs 2017-10-24 23:28:14.757449 7f1333d08700 1 leveldb: Compacting 1@1 + 13@2 files 2017-10-24 23:28:15.002381 7f1333d08700 1 leveldb: Generated table #524185: 17970 keys, 2128900 bytes 2017-10-24 23:28:15.198899 7f1333d08700 1 leveldb: Generated table #524186: 22386 keys, 2128610 bytes 2017-10-24 23:28:15.337819 7f1333d08700 1 leveldb: Generated table #524187: 3890 keys, 371799 bytes 2017-10-24 23:28:15.693433 7f1333d08700 1 leveldb: Generated table #524188: 21984 keys, 2128947 bytes 2017-10-24 23:28:15.874955 7f1333d08700 1 leveldb: Generated table #524189: 9565 keys, 1207375 bytes 2017-10-24 23:28:16.253599 7f1333d08700 1 leveldb: Generated table #524190: 21999 keys, 2129625 bytes 2017-10-24 23:28:16.576250 7f1333d08700 1 leveldb: Generated table #524191: 21544 keys, 2128033 bytes Strace on an OSD 
process during startup reveals what appears to be parsing of objects and calling getxattr. The bulk of the time is spent on parsing the objects and performing the getxattr system calls... for example: (Full lines truncated intentionally for brevity). [pid 3068964] getxattr("/osd/174/current/20.6a4s7_head/default.7385.13\...(ommitted) [pid 3068964] getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(ommitted) [pid 3068964] getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(ommitted) Cluster details: - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 drives per host, collocated journals) - Pre-upgrade: Hammer (ceph version 0.94.6) - Post-upgrade: Jewel (ceph version 10.2.9) - object storage use only - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs) - failure domain of host - cluster is currently storing approx 500TB over 200 MObjects -- Christopher J. Jones ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
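Two hedged ways to sanity-check the xattr handling described above on a running FileStore osd (osd.191 and the /osd/191 mount point are taken from the log excerpt; the find pattern is only an illustration):

# settings as seen by the live daemon, via the admin socket
ceph daemon osd.191 config show | grep -E 'filestore_max_inline_xattr|filestore_max_xattr_value_size'

# dump the user-namespace xattrs of one object file to see what lands on disk
obj=$(find /osd/191/current -type f -name '*__head_*' | head -1)
getfattr -d -e hex "$obj"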
Re: [ceph-users] remove require_jewel_osds flag after upgrade to kraken
The flag is fine, it's just to ensure that OSDs from a release before Jewel can't be added to the cluster: See http://ceph.com/geen-categorie/v10-2-4-jewel-released/ under "Upgrading from hammer" On Thu, 13 Jul 2017 at 07:59 Jan Krcmarwrote: > hi, > > is it possible to remove the require_jewel_osds flag after upgrade to > kraken? > > $ ceph osd stat > osdmap e29021: 40 osds: 40 up, 40 in > flags sortbitwise,require_jewel_osds,require_kraken_osds > > it seems that ceph osd unset does not support require_jewel_osds > > $ ceph osd unset require_jewel_osds > Invalid command: require_jewel_osds not in > > full|pause|noup|nodown|noout|noin|nobackfill|norebalance|norecover|noscrub|nodeep-scrub|notieragent|sortbitwise > osd unset > full|pause|noup|nodown|noout|noin|nobackfill|norebalance|norecover|noscrub|nodeep-scrub|notieragent|sortbitwise > : unset > Error EINVAL: invalid command > > is there any way to remove it? > if not, is it ok to leave the flag there? > > thanks > fous > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] autoconfigured haproxy service?
Hi Sage, The automated tool Cepheus https://github.com/cepheus-io/cepheus does this with ceph-chef. It's based on json data for a given environment. It uses Chef and Ansible. If someone wanted to break out the haproxy (ADC) portion into a package then it has a good model for HAProxy they could look at. Originally created due to the need for our own software LB solution over a hardware LB. It also supports keep-alived and bird (BGP). Thanks On Tue, Jul 11, 2017 at 11:03 AM, Sage Weilwrote: > Hi all, > > Luminous features a new 'service map' that lets rgw's (and rgw nfs > gateways and iscsi gateways and rbd mirror daemons and ...) advertise > themselves to the cluster along with some metadata (like the addresses > they are binding to and the services the provide). > > It should be pretty straightforward to build a service that > auto-configures haproxy based on this information so that you can deploy > an rgw front-end that dynamically reconfigures itself when additional > rgw's are deployed or removed. haproxy has a facility to adjust its > backend configuration at runtime[1]. > > Anybody interested in tackling this? Setting up the load balancer in > front of rgw is one of the more annoying pieces of getting ceph up and > running in production and until now has been mostly treated as out of > scope. It would be awesome if there was an autoconfigured service that > did it out of the box (and had all the right haproxy options set). > > sage > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] osdmap several thousand epochs behind latest
All, Had a fairly substantial network interruption that knocked out about ~270 osds: health HEALTH_ERR [...] 273/384 in osds are down noup,nodown,noout flag(s) set monmap e2: 3 mons at {cephmon-0=10.10.6.0:6789/0,cephmon-1=10.10.6.1:6789/0,cephmon-2=10.10.6.2:6789/0} election epoch 138, quorum 0,1,2 cephmon-0,cephmon-1,cephmon-2 mgr no daemons active osdmap e37718: 384 osds: 111 up, 384 in; 16764 remapped pgs flags noup,nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds We've had network interruptions before, and normally OSDs come back on their own, or do so with a service restart. This time, no such luck (I'm guessing the scale was just too much). After a few hours of trying to figure out why OSD services were running on the hosts (according to systemd) but marked 'down' in ceph osd tree, I found this thread: http://ceph-devel.vger.kernel.narkive.com/ftEN7TOU/70-osd-are-down-and-not-coming-up which appears to perfectly describe the scenario (high CPU usage, osdmap way out of sync, etc.) I've taken the steps outlined and set the appropriate flags and am monitoring the 'catch up' progress of the OSDs. The OSD farthest behind is about 5000 epochs out of sync, so I assume it will be a few hours before I see CPU usage level out. Once the OSDs are caught up, are there any other steps I should take before 'ceph osd unset noup' (or anything to do after)? Thanks in advance, -- v/r Chris Apsey bitskr...@bitskrieg.net https://www.bitskrieg.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
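To see how far behind an individual osd still is, its own map range can be compared with the cluster epoch; osd.12 below is a placeholder:

# current cluster epoch, e.g. "osdmap e37718"
ceph osd stat

# oldest_map / newest_map known to one lagging osd (run on the host carrying it)
ceph daemon osd.12 status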
Re: [ceph-users] Sharing SSD journals and SSD drive choice
Adam, Before we deployed our cluster, we did extensive testing on all kinds of SSDs, from consumer-grade TLC SATA all the way to Enterprise PCI-E NVME Drives. We ended up going with a ratio of 1x Intel P3608 PCI-E 1.6 TB to 12x HGST 10TB SAS3 HDDs. It provided the best price/performance/density balance for us overall. As a frame of reference, we have 384 OSDs spread across 16 nodes. A few (anecdotal) notes: 1. Consumer SSDs have unpredictable performance under load; write latency can go from normal to unusable with almost no warning. Enterprise drives generally show much less load sensitivity. 2. Write endurance; while it may appear that having several consumer-grade SSDs backing a smaller number of OSDs will yield better longevity than an enterprise grade SSD backing a larger number of OSDs, the reality is that enterprise drives that use SLC or eMLC are generally an order of magnitude more reliable when all is said and done. 3. Power Loss protection (PLP). Consumer drives generally don't do well when power is suddenly lost. Yes, we should all have UPS, etc., but things happen. Enterprise drives are much more tolerant of environmental failures. Recovering from misplaced objects while also attempting to serve clients is no fun. --- v/r Chris Apsey bitskr...@bitskrieg.net https://www.bitskrieg.net On 2017-04-26 10:53, Adam Carheden wrote: What I'm trying to get from the list is /why/ the "enterprise" drives are important. Performance? Reliability? Something else? The Intel was the only one I was seriously considering. The others were just ones I had for other purposes, so I thought I'd see how they fared in benchmarks. The Intel was the clear winner, but my tests did show that throughput tanked with more threads. Hypothetically, if I was throwing 16 OSDs at it, all with osd op threads = 2, do the benchmarks below not show that the Hynix would be a better choice (at least for performance)? Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously the single drive leaves more bays free for OSD disks, but is there any other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s mean: a) fewer OSDs go down if the SSD fails b) better throughput (I'm speculating that the S3610 isn't 4 times faster than the S3520) c) load spread across 4 SATA channels (I suppose this doesn't really matter since the drives can't throttle the SATA bus). -- Adam Carheden On 04/26/2017 01:55 AM, Eneko Lacunza wrote: Adam, What David said before about SSD drives is very important. I will tell you another way: use enterprise grade SSD drives, not consumer grade. Also, pay attention to endurance. The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7, and probably it isn't even the most suitable SATA SSD disk from Intel; better use S3610 o S3710 series. Cheers Eneko El 25/04/17 a las 21:02, Adam Carheden escribió: On 04/25/2017 11:57 AM, David wrote: On 19 Apr 2017 18:01, "Adam Carheden" <carhe...@ucar.edu <mailto:carhe...@ucar.edu>> wrote: Does anyone know if XFS uses a single thread to write to it's journal? You probably know this but just to avoid any confusion, the journal in this context isn't the metadata journaling in XFS, it's a separate journal written to by the OSD daemons Ha! I didn't know that. I think the number of threads per OSD is controlled by the 'osd op threads' setting which defaults to 2 So the ideal (for performance) CEPH cluster would be one SSD per HDD with 'osd op threads' set to whatever value fio shows as the optimal number of threads for that drive then? 
I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps consider going up to a 37xx and putting more OSDs on it. Of course with the caveat that you'll lose more OSDs if it goes down. Why would you avoid the SanDisk and Hynix? Reliability (I think those two are both TLC)? Brand trust? If it's my benchmarks in my previous email, why not the Hynix? It's slower than the Intel, but sort of decent, at lease compared to the SanDisk. My final numbers are below, including an older Samsung Evo (MCL I think) which did horribly, though not as bad as the SanDisk. The Seagate is a 10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison. SanDisk SDSSDA240G, fio 1 jobs: 7.0 MB/s (5 trials) SanDisk SDSSDA240G, fio 2 jobs: 7.6 MB/s (5 trials) SanDisk SDSSDA240G, fio 4 jobs: 7.5 MB/s (5 trials) SanDisk SDSSDA240G, fio 8 jobs: 7.6 MB/s (5 trials) SanDisk SDSSDA240G, fio 16 jobs: 7.6 MB/s (5 trials) SanDisk SDSSDA240G, fio 32 jobs: 7.6 MB/s (5 trials) SanDisk SDSSDA240G, fio 64 jobs: 7.6 MB/s (5 trials) HFS250G32TND-N1A2A 3P10, fio 1 jobs: 4.2 MB/s (5 trials) HFS250G32TND-N1A2A 3P10, fio 2 jobs: 0.6 MB/s (5 trials) HFS250G32TND-N1A2A 3P10, fio 4 jobs: 7.5 MB/s (5 tri
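For anyone repeating these journal tests, the usual approach is small synchronous sequential writes, which is roughly what a FileStore journal does; a sketch with fio, where /dev/sdX and the job count are whatever you are evaluating - note this writes to, and destroys data on, the target device:

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=4 --iodepth=1 --runtime=60 \
    --time_based --group_reporting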
Re: [ceph-users] Creating journal on needed partition
Nikita, Take a look at https://git.cybbh.space/vta/saltstack/tree/master/apps/ceph Particularly files/init-journal.sh and files/osd-bootstrap.sh We use salt to do some of the legwork (templatizing the bootstrap process), but for the most part it is all just a bunch of shell scripts with some control flow. We partition an nvme device and then create symlinks from osds to the partitions in a pre-determined fashion. We don't use ceph-disk at all. --- v/r Chris Apsey bitskr...@bitskrieg.net https://www.bitskrieg.net

On 2017-04-17 08:56, Nikita Shalnov wrote: Hi all. Is there any way to create an osd manually which would use a designated partition of the journal disk (without using ceph-ansible)? I have journals on SSD disks and each journal disk contains 3 partitions for 3 osds. Example: one of the osds crashed. I changed a disk (sdaa) and want to prepare the disk for adding to the cluster. Here is the journal disk that should be used by the new osd:

/dev/sdaf :
/dev/sdaf2 ceph journal
/dev/sdaf3 ceph journal, for /dev/sdab1
/dev/sdaf1 ceph journal, for /dev/sdz1

You can see that the bad disk used the second partition. If I run ceph-disk prepare /dev/sdaa /dev/sdaf, ceph-disk creates a /dev/sdaf4 partition and sets it as the journal for the osd. But I want to use the second, empty partition (/dev/sdaf2). If I delete the /dev/sdaf2 partition, the behavior doesn't change. Can someone help me? BR, Nikita ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
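A sketch of pointing the rebuilt osd at the existing empty journal partition instead of letting ceph-disk allocate a new one; the osd id (30) is a placeholder and the osd must be stopped first:

systemctl stop ceph-osd@30

# point the osd at the spare partition via a stable path rather than /dev/sdaf2 directly
ln -sf /dev/disk/by-partuuid/$(blkid -s PARTUUID -o value /dev/sdaf2) /var/lib/ceph/osd/ceph-30/journal

# on Jewel the journal device may also need to be owned by ceph:ceph (normally udev
# handles this when the partition carries the ceph journal type GUID)

# write a fresh journal header onto that partition, then restart
ceph-osd -i 30 --mkjournal
systemctl start ceph-osd@30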
Re: [ceph-users] saving file on cephFS mount using vi takes pause/time
Is it related to the recovery behaviour of vim creating a swap file, which I think nano does not do? http://vimdoc.sourceforge.net/htmldoc/recover.html A sync into cephfs, I think, needs the write to be confirmed all the way down by the osds performing it before the confirmation is returned to the client calling the sync, though I stand to be corrected on that. On Thu, 13 Apr 2017 at 22:04 Deepak Naidu wrote: > Ok, I tried strace to check why vi slows or pauses. It seems to slow on > *fsync(3)* > > I didn't see the issue with the nano editor. > > -- > Deepak > > *From:* Deepak Naidu > *Sent:* Wednesday, April 12, 2017 2:18 PM > *To:* 'ceph-users' > *Subject:* saving file on cephFS mount using vi takes pause/time > > Folks, > > This is a bit of a weird issue. I am using the cephFS volume to read/write files > etc. and it's quick - less than seconds. But when editing a file on the cephFS > volume using vi, saving the file takes a couple of seconds, something like a > sync(flush). The same doesn't happen on a local filesystem. > > Any pointers are appreciated. > > -- > Deepak > -- > This email message is for the sole use of the intended recipient(s) and > may contain confidential information. Any unauthorized review, use, > disclosure or distribution is prohibited. If you are not the intended > recipient, please contact the sender by reply email and destroy all copies > of the original message. > -- > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
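The pause can be pinned on fsync (rather than the write itself) by timing the syscalls while saving, e.g. with a scratch file on the mount (the path is a placeholder):

strace -f -T -e trace=fsync,fdatasync -o /tmp/vi.trace vi /mnt/cephfs/testfile
grep -E 'fsync|fdatasync' /tmp/vi.trace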
Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster
Success! There was an issue related to my operating system install procedure that was causing the journals to become corrupt, but it was not caused by ceph! That bug fixed; now the procedure on shutdown in this thread has been verified to work as expected. Thanks for all the help. -Chris > On Mar 1, 2017, at 9:39 AM, Peter Maloney > <peter.malo...@brockmann-consult.de> wrote: > > On 03/01/17 15:36, Heller, Chris wrote: >> I see. My journal is specified in ceph.conf. I'm not removing it from the >> OSD so sounds like flushing isn't needed in my case. >> > Okay but it seems it's not right if it's saying it's a non-block journal. > (meaning a file, not a block device). > > Double check your ceph.conf... make sure the path works, and somehow make > sure the [osd.x] actually matches that osd (no idea how to test that, esp. if > the osd doesn't start ... maybe just increase logging). > > Or just make a symlink for now, just to see if it solves the problem, which > would imply the ceph.conf is wrong. > > >> -Chris >>> On Mar 1, 2017, at 9:31 AM, Peter Maloney >>> <peter.malo...@brockmann-consult.de >>> <mailto:peter.malo...@brockmann-consult.de>> wrote: >>> >>> On 03/01/17 14:41, Heller, Chris wrote: >>>> That is a good question, and I'm not sure how to answer. The journal is on >>>> its own volume, and is not a symlink. Also how does one flush the journal? >>>> That seems like an important step when bringing down a cluster safely. >>>> >>> You only need to flush the journal if you are removing it from the osd, >>> replacing it with a different journal. >>> >>> So since your journal is on its own, then you need either a symlink in the >>> osd directory named "journal" which points to the device (ideally not >>> /dev/sdx but /dev/disk/by-.../), or you put it in the ceph.conf. >>> >>> And since it said you have a non-block journal now, it probably means there >>> is a file... you should remove that (rename it to journal.junk until you're >>> sure it's not an important file, and delete it later). >>>> >>>>>> This is where I've stopped. All but one OSD came back online. One has >>>>>> this backtrace: >>>>>> >>>>>> 2017-02-28 17:44:54.884235 7fb2ba3187c0 -1 journal FileJournal::_open: >>>>>> disabling aio for non-block journal. Use journal_force_aio to force use >>>>>> of aio anyway >>>>> Are the journals inline? or separate? If they're separate, the above >>>>> means the journal symlink/config is missing, so it would possibly make a >>>>> new journal, which would be bad if you didn't flush the old journal >>>>> before. >>>>> >>>>> And also just one osd is easy enough to replace (which I wouldn't do >>>>> until the cluster settled down and recovered). So it's lame for it to be >>>>> broken, but it's still recoverable if that's the only issue. >>>> >>> >>> >> > > > -- > > > Peter Maloney > Brockmann Consult > Max-Planck-Str. 2 > 21502 Geesthacht > Germany > Tel: +49 4152 889 300 > Fax: +49 4152 889 333 > E-mail: peter.malo...@brockmann-consult.de > <mailto:peter.malo...@brockmann-consult.de> > Internet: http://www.brockmann-consult.de > <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.brockmann-2Dconsult.de=DwMD-g=96ZbZZcaMF4w0F4jpN6LZg=ylcFa5bBSUyTQqbx1Aqz47ec5BJJc7uk0YQ4EQKh-DY=fXi7JtWroHrS8RV824OLTqf8NbD_NERvG8hvrPFmUAA=lga4HYFhA45fm1KJHyov1htPfqKhBHZsNFVkt3bTJx0=> > smime.p7s Description: S/MIME cryptographic signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster
I see. My journal is specified in ceph.conf. I'm not removing it from the OSD so sounds like flushing isn't needed in my case. -Chris > On Mar 1, 2017, at 9:31 AM, Peter Maloney > <peter.malo...@brockmann-consult.de> wrote: > > On 03/01/17 14:41, Heller, Chris wrote: >> That is a good question, and I'm not sure how to answer. The journal is on >> its own volume, and is not a symlink. Also how does one flush the journal? >> That seems like an important step when bringing down a cluster safely. >> > You only need to flush the journal if you are removing it from the osd, > replacing it with a different journal. > > So since your journal is on its own, then you need either a symlink in the > osd directory named "journal" which points to the device (ideally not > /dev/sdx but /dev/disk/by-.../), or you put it in the ceph.conf. > > And since it said you have a non-block journal now, it probably means there > is a file... you should remove that (rename it to journal.junk until you're > sure it's not an important file, and delete it later). > >> -Chris >> >>> On Mar 1, 2017, at 8:37 AM, Peter Maloney >>> <peter.malo...@brockmann-consult.de >>> <mailto:peter.malo...@brockmann-consult.de>> wrote: >>> >>> On 02/28/17 18:55, Heller, Chris wrote: >>>> Quick update. So I'm trying out the procedure as documented here. >>>> >>>> So far I've: >>>> >>>> 1. Stopped ceph-mds >>>> 2. set noout, norecover, norebalance, nobackfill >>>> 3. Stopped all ceph-osd >>>> 4. Stopped ceph-mon >>>> 5. Installed new OS >>>> 6. Started ceph-mon >>>> 7. Started all ceph-osd >>>> >>>> This is where I've stopped. All but one OSD came back online. One has this >>>> backtrace: >>>> >>>> 2017-02-28 17:44:54.884235 7fb2ba3187c0 -1 journal FileJournal::_open: >>>> disabling aio for non-block journal. Use journal_force_aio to force use >>>> of aio anyway >>> Are the journals inline? or separate? If they're separate, the above means >>> the journal symlink/config is missing, so it would possibly make a new >>> journal, which would be bad if you didn't flush the old journal before. >>> >>> And also just one osd is easy enough to replace (which I wouldn't do until >>> the cluster settled down and recovered). So it's lame for it to be broken, >>> but it's still recoverable if that's the only issue. >> > > smime.p7s Description: S/MIME cryptographic signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster
That is a good question, and I'm not sure how to answer. The journal is on its own volume, and is not a symlink. Also how does one flush the journal? That seems like an important step when bringing down a cluster safely. -Chris > On Mar 1, 2017, at 8:37 AM, Peter Maloney > <peter.malo...@brockmann-consult.de> wrote: > > On 02/28/17 18:55, Heller, Chris wrote: >> Quick update. So I'm trying out the procedure as documented here. >> >> So far I've: >> >> 1. Stopped ceph-mds >> 2. set noout, norecover, norebalance, nobackfill >> 3. Stopped all ceph-osd >> 4. Stopped ceph-mon >> 5. Installed new OS >> 6. Started ceph-mon >> 7. Started all ceph-osd >> >> This is where I've stopped. All but one OSD came back online. One has this >> backtrace: >> >> 2017-02-28 17:44:54.884235 7fb2ba3187c0 -1 journal FileJournal::_open: >> disabling aio for non-block journal. Use journal_force_aio to force use of >> aio anyway > Are the journals inline? or separate? If they're separate, the above means > the journal symlink/config is missing, so it would possibly make a new > journal, which would be bad if you didn't flush the old journal before. > > And also just one osd is easy enough to replace (which I wouldn't do until > the cluster settled down and recovered). So it's lame for it to be broken, > but it's still recoverable if that's the only issue. smime.p7s Description: S/MIME cryptographic signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Antw: Safely Upgrading OS on a live Ceph Cluster
In my case the version will be identical. But I might have to do this node by node approach if I can't stabilize the more general shutdown/bring-up approach. There are 192 OSD in my cluster, so it will take a while to go node by node unfortunately. -Chris > On Mar 1, 2017, at 2:50 AM, Steffen Weißgerber <weissgerb...@ksnb.de> wrote: > > Hello, > > some time ago I upgraded our 6 node cluster (0.94.9) running on Ubuntu from > Trusty > to Xenial. > > The problem here was that with the os update also ceph is upgraded what we > did not want > in the same step because then we had to upgrade all nodes at the same time. > > Therefore we did it node by node first freeing the osd's on the node with > setting the weight to 0. > > After os update, configuring the right ceph version for our setup and testing > the reboot so that > all components start up correctly we set the osd weights to the normal value > so that the > cluster was rebalancing. > > With this procedure the cluster was always up. > > Regards > > Steffen > > >>>> "Heller, Chris" <chel...@akamai.com> schrieb am Montag, 27. Februar 2017 um > 18:01: >> I am attempting an operating system upgrade of a live Ceph cluster. Before I >> go an screw up my production system, I have been testing on a smaller >> installation, and I keep running into issues when bringing the Ceph FS >> metadata server online. >> >> My approach here has been to store all Ceph critical files on non-root >> partitions, so the OS install can safely proceed without overwriting any of >> the Ceph configuration or data. >> >> Here is how I proceed: >> >> First I bring down the Ceph FS via `ceph mds cluster_down`. >> Second, to prevent OSDs from trying to repair data, I run `ceph osd set >> noout` >> Finally I stop the ceph processes in the following order: ceph-mds, >> ceph-mon, >> ceph-osd >> >> Note my cluster has 1 mds and 1 mon, and 7 osd. >> >> I then install the new OS and then bring the cluster back up by walking the >> steps in reverse: >> >> First I start the ceph processes in the following order: ceph-osd, ceph-mon, >> ceph-mds >> Second I restore OSD functionality with `ceph osd unset noout` >> Finally I bring up the Ceph FS via `ceph mds cluster_up` >> >> Everything works smoothly except the Ceph FS bring up. The MDS starts in the >> active:replay state and eventually crashes with the following backtrace: >> >> starting mds.cuba at :/0 >> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors >> {default=true} >> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got >> (2) No such file or directory >> mds/SessionMap.cc <http://sessionmap.cc/>: In function 'void >> SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time >> 2017-02-27 16:56:08.537739 >> mds/SessionMap.cc <http://sessionmap.cc/>: 98: FAILED assert(0 == "failed to >> load sessionmap") >> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x8b) [0x98bb4b] >> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4] >> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5] >> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0] >> 5: (()+0x8192) [0x7f31d9c8f192] >> 6: (clone()+0x6d) [0x7f31d919c51d] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this. 
>> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc >> <http://sessionmap.cc/>: In function 'void SessionMap::_load_finish(int, >> ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739 >> mds/SessionMap.cc <http://sessionmap.cc/>: 98: FAILED assert(0 == "failed to >> load sessionmap") >> >> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x8b) [0x98bb4b] >> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4] >> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5] >> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0] >> 5: (()+0x8192) [0x7f31d9c8f192] >> 6: (clone()+0x6d) [0x7f31d919c51d] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this. >> >> -106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors >> {default=true} >> -1>
Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster
13: (ReplicatedBackend::do_pull(std::tr1::shared_ptr)+0xd6) [0x974836] 14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr)+0x3ed) [0x97b89d] 15: (ReplicatedPG::do_request(std::tr1::shared_ptr&, ThreadPool::TPHandle&)+0x19d) [0x80b84d] 16: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x3cd) [0x67720d] 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2f9) [0x6776f9] 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85c) [0xb3c7bc] 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb3e8b0] 20: (()+0x8192) [0x7fb2b9166192] 21: (clone()+0x6d) [0x7fb2b867351d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. I'm not sure why I would have encountered an issue since the data was at rest before the install (unless there is another step that was needed). Currently the cluster is recovering objects. Although `ceph osd stat` shows that the 'norecover' flag is still set. I'm going to wait out the recovery and see if the Ceph FS is OK. That would be huge if it is. But I am curious why I lost an OSD, and why recovery is happening with 'norecover' still set. -Chris > On Feb 28, 2017, at 4:51 AM, Peter Maloney > <peter.malo...@brockmann-consult.de> wrote: > > On 02/27/17 18:01, Heller, Chris wrote: >> First I bring down the Ceph FS via `ceph mds cluster_down`. >> Second, to prevent OSDs from trying to repair data, I run `ceph osd set >> noout` >> Finally I stop the ceph processes in the following order: ceph-mds, >> ceph-mon, ceph-osd >> > This is the wrong procedure. Likely it will just involve more cpu and memory > usage on startup, not broken behavior (unless you run out of RAM). After all, > it has to recover from power outages, so any order ought to work, just some > are better. > > I am unsure on the cephfs part... but I would think you have it right, except > I wouldn't do `ceph mds cluster_down` (but don't know if it's right to)... > maybe try without that. I never used that except when I want to remove all > mds nodes and destroy all the cephfs data. And I didn't find any docs on what > it really even does, except it won't let you remove all your mds and destroy > the cephfs without it. > > The correct procedure as far as I know is: > > ## 1. cluster must be healthy and to set noout, norecover, norebalance, > nobackfill > ceph -s > for s in noout norecover norebalance nobackfill; do ceph osd set $s; done > > ## 2. shut down all OSDs and then the all MONs - not MONs before OSDs > # all nodes > service ceph stop osd > > # see that all osds are down > ceph osd tree > > # all nodes again > ceph -s > service ceph stop > > ## 3. start MONs before OSDs. > # This already happens on boot per node, but not cluster wide. But with the > flags set, it likely doesn't matter. It seems unnecessary on a small cluster. > > ## 4. unset the flags > # see that all osds are up > ceph -s > ceph osd tree > for s in noout norecover norebalance nobackfill; do ceph osd unset $s; done > > >> Note my cluster has 1 mds and 1 mon, and 7 osd. >> >> I then install the new OS and then bring the cluster back up by walking the >> steps in reverse: >> >> First I start the ceph processes in the following order: ceph-osd, ceph-mon, >> ceph-mds >> Second I restore OSD functionality with `ceph osd unset noout` >> Finally I bring up the Ceph FS via `ceph mds cluster_up` >> > adjust those steps too... mons start first > >> Everything works smoothly except the Ceph FS bring up.[...snip...] 
> >> How can I safely stop a Ceph cluster, so that it will cleanly start back up >> again? >> > Don't know about the cephfs problem... all I can say is try the right general > procedure and see if the result changes. > > (and I'd love to cite a source on why that's the right procedure and yours > isn't, but don't know what to cite... for > examplehttp://docs.ceph.com/docs/jewel/rados/operations/operating/#id8 > <https://urldefense.proofpoint.com/v2/url?u=http-3A__docs.ceph.com_docs_jewel_rados_operations_operating_-23id8=DwMC-g=96ZbZZcaMF4w0F4jpN6LZg=ylcFa5bBSUyTQqbx1Aqz47ec5BJJc7uk0YQ4EQKh-DY=Z8WoC0W2zZz1lpQRQ7ZPyt6UhQYV0sd_92NRYqdlNfs=ht25eyn3seVNB8DsSgfz4p1j4TIoNXEN2wBq0P4sU-Y=> > says to use -a in the arguments, but doesn't say whether that's systemd or > not, or what it does exactly. I have only seen it discussed a few places, > like the mailing list and IRC) >> -Chris >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com <mailto:ceph-users@lis
[ceph-users] Safely Upgrading OS on a live Ceph Cluster
(d56bdf93ced6b80b07397d57e3fa68fe68304432) 1: ceph_mds() [0x89984a] 2: (()+0x10350) [0x7f31d9c97350] 3: (gsignal()+0x39) [0x7f31d90d8c49] 4: (abort()+0x148) [0x7f31d90dc058] 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555] 6: (()+0x5e6f6) [0x7f31d99e16f6] 7: (()+0x5e723) [0x7f31d99e1723] 8: (()+0x5e942) [0x7f31d99e1942] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38] 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4] 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5] 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0] 13: (()+0x8192) [0x7f31d9c8f192] 14: (clone()+0x6d) [0x7f31d919c51d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) ** in thread 7f31d30df700 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) 1: ceph_mds() [0x89984a] 2: (()+0x10350) [0x7f31d9c97350] 3: (gsignal()+0x39) [0x7f31d90d8c49] 4: (abort()+0x148) [0x7f31d90dc058] 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555] 6: (()+0x5e6f6) [0x7f31d99e16f6] 7: (()+0x5e723) [0x7f31d99e1723] 8: (()+0x5e942) [0x7f31d99e1942] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38] 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4] 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5] 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0] 13: (()+0x8192) [0x7f31d9c8f192] 14: (clone()+0x6d) [0x7f31d919c51d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. How can I safely stop a Ceph cluster, so that it will cleanly start back up again? -Chris smime.p7s Description: S/MIME cryptographic signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] civetweb daemon dies on https port
You look to have a typo in this line: rgw_frontends = "civetweb port=8080s ssl_certificate=/etc/pki/tls/ cephrgw01.crt" It would seem from the error it should be port=8080, not port=8080s. On Thu, 19 Jan 2017 at 08:59 Iban Cabrillowrote: > Dear cephers, >I just finish the integration between radosgw > (ceph-radosgw-10.2.5-0.el7.x86_64) and keystone. > >This is my ceph conf for radosgw: > > [client.rgw.cephrgw] > host = cephrgw01 > rgw_frontends = "civetweb port=8080s > ssl_certificate=/etc/pki/tls/cephrgw01.crt" > rgw_zone = RegionOne > keyring = /etc/ceph/ceph.client.rgw.cephrgw.keyring > log_file = /var/log/ceph/client.rgw.cephrgw.log > rgw_keystone_url = https://keystone:5000 > rgw_keystone_admin_user = > rgw_keystone_admin_password = YY > rgw_keystone_admin_tenant = service > rgw_keystone_accepted_roles = admin Member > rgw keystone admin project = service > rgw keystone admin domain = admin > rgw keystone api version = 2 > rgw_s3_auth_use_keystone = true > nss_db_path = /var/ceph/nss/ > rgw_keystone_verify_ssl = true > > Seems to be working fine using latets jewel version 10.2.5, but seems > that now I cannot listen on secure port. Older version was running fine the > rgw_frontends option (10.2.3), but now, Server starts (I can access to > https://cephrgw01:8080/) but after a couple of minutes the radosgw deamon > stops: > > error parsing int: 8080s: The option value '8080s' seems to be invalid > > Is there any change on this parameter with the new jewel version > (ceph-radosgw-10.2.5-0.el7.x86_64)? > > regards, I > > > > > -- > > > Iban Cabrillo Bartolome > Instituto de Fisica de Cantabria (IFCA) > Santander, Spain > Tel: +34942200969 <+34%20942%2020%2009%2069> > PGP PUBLIC KEY: > http://pgp.mit.edu/pks/lookup?op=get=0xD9DF0B3D6C8C08AC > > > Bertrand Russell:*"El problema con el mundo es que los estúpidos están > seguros de todo y los inteligentes están **llenos de dudas*" > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
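For reference, civetweb's convention is that a trailing `s` on the port marks the listener as SSL, so the two forms at play look roughly like the sketch below; whether the SSL form parses on a given radosgw build is exactly what this thread is about, so treat it as something to test rather than a statement about what 10.2.5 accepts:

    # intended HTTPS listener (trailing 's' marks the port as SSL in civetweb)
    rgw_frontends = "civetweb port=8080s ssl_certificate=/etc/pki/tls/cephrgw01.crt"

    # plain-HTTP form matching the "error parsing int: 8080s" complaint
    rgw_frontends = "civetweb port=8080"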
[ceph-users] Ceph.com
The site looks great! Good job! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Monitoring
Thanks. What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for those type and if so what criteria do you use? Thanks again! On Fri, Jan 13, 2017 at 3:28 PM, David Turner <david.tur...@storagecraft.com > wrote: > We don't use many critical alerts (that will have our NOC wake up an > engineer), but the main one that we do have is a check that tells us if > there are 2 or more hosts with osds that are down. We have clusters with > 60 servers in them, so having an osd die and backfill off of isn't > something to wake up for in the middle of the night, but having osds down > on 2 servers is 1 osd away from data loss. A quick reference to how to do > this check in bash is below. > > hosts_with_down_osds=`ceph osd tree | grep 'host\|down' | grep -B1 down | > grep host | wc -l` > if [ $hosts_with_down_osds -ge 2 ] > then > echo critical > elif [ $hosts_with_down_osds -eq 1 ] > then > echo warning > elif [ $hosts_with_down_osds -eq 0 ] > then > echo ok > else > echo unknown > fi > > -- > > <https://storagecraft.com> David Turner | Cloud Operations Engineer | > StorageCraft > Technology Corporation <https://storagecraft.com> > 380 Data Drive Suite 300 | Draper | Utah | 84020 > Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943 > <(385)%20224-2943> > > -- > > If you are not the intended recipient of this message or received it > erroneously, please notify the sender and delete it, together with any > attachments, and be advised that any dissemination or copying of this > message is prohibited. > > -- > > -- > *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Chris > Jones [cjo...@cloudm2.com] > *Sent:* Friday, January 13, 2017 1:15 PM > *To:* ceph-us...@ceph.com > *Subject:* [ceph-users] Ceph Monitoring > > General question/survey: > > Those that have larger clusters, how are you doing alerting/monitoring? > Meaning, do you trigger off of 'HEALTH_WARN', etc? Not really talking about > collectd related but more on initial alerts of an issue or potential issue? > What threshold do you use basically? Just trying to get a pulse of what > others are doing. > > Thanks in advance. > > -- > Best Regards, > Chris Jones > Bloomberg > > > > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
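Along the same lines as the check above, a minimal sketch of a blocked-ops alert; the thresholds are placeholders and the awk pattern assumes the Hammer/Jewel wording of the health summary ("NN requests are blocked > 32 sec"):

    blocked=$(ceph health detail 2>/dev/null | awk '/requests are blocked > 32 sec/ {print $1; exit}')
    blocked=${blocked:-0}
    if [ "$blocked" -ge 100 ]; then    # placeholder threshold
        echo critical
    elif [ "$blocked" -ge 1 ]; then
        echo warning
    else
        echo ok
    fi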
[ceph-users] Ceph Monitoring
General question/survey: Those that have larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc? Not really talking about collectd related but more on initial alerts of an issue or potential issue? What threshold do you use basically? Just trying to get a pulse of what others are doing. Thanks in advance. -- Best Regards, Chris Jones Bloomberg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Storage system
Based on this limited info: object storage, if it sits behind a proxy. We use Ceph behind HAProxy and hardware load-balancers at Bloomberg. Our Chef recipes are at https://github.com/ceph/ceph-chef and https://github.com/bloomberg/chef-bcs. The chef-bcs cookbooks show the HAProxy info. Thanks, Chris On Wed, Jan 4, 2017 at 11:51 AM, Patrick McGarry <pmcga...@redhat.com> wrote: > Moving this to ceph-user list where it'll get some attention. > > On Thu, Dec 22, 2016 at 2:08 PM, SIBALA, SATISH <ss9...@att.com> wrote: > >> Hi, >> >> >> >> Could you please give me a recommendation on the kind of Ceph storage to be >> used with an NGINX proxy server (Object / Block / FileSystem)? >> >> >> >> Best Regards >> >> Satish >> >> >> > > > > -- > > Best Regards, > > Patrick McGarry > Director Ceph Community || Red Hat > http://ceph.com || http://community.redhat.com > @scuttlemonkey || @ceph > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS FAILED assert(dn->get_linkage()->is_null())
Hi Goncarlo, In the end we ascertained that the assert was coming from reading corrupt data in the mds journal. We have followed the sections at the following link (http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/) in order down to (and including) MDS Table wipes (only wiping the "session" table in the final step). This resolved the problem we had with our mds asserting. We have also run a cephfs scrub to validate the data (ceph daemon mds.0 scrub_path / recursive repair), which has resulted in "metadata damage detected" health warning. This seems to perform a read of all objects involved in cephfs rados pools (anecdotal: performance of the scan against the data pool was much faster to process than the metadata pool itself). We are now working with the output of "ceph tell mds.0 damage ls", and looking at the following mailing list post as a starting point for proceeding with that: http://ceph-users.ceph.narkive.com/EfFTUPyP/how-to-fix-the-mds-damaged-issue Chris On Fri, 9 Dec 2016 at 19:26 Goncalo Borges <goncalo.bor...@sydney.edu.au> wrote: > Hi Sean, Rob. > > I saw on the tracker that you were able to resolve the mds assert by > manually cleaning the corrupted metadata. Since I am also hitting that > issue and I suspect that i will face an mds assert of the same type sooner > or later, can you please explain a bit further what operations did you do > to clean the problem? > Cheers > Goncalo > > From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rob > Pickerill [r.picker...@gmail.com] > Sent: 09 December 2016 07:13 > To: Sean Redmond; John Spray > Cc: ceph-users > Subject: Re: [ceph-users] CephFS FAILED > assert(dn->get_linkage()->is_null()) > > Hi John / All > > Thank you for the help so far. > > To add a further point to Sean's previous email, I see this log entry > before the assertion failure: > > -6> 2016-12-08 15:47:08.483700 7fb133dca700 12 > mds.0.cache.dir(1000a453344) remove_dentry [dentry > #100/stray9/1000a453344/config [2,head] auth NULL (dver > sion lock) v=540 inode=0 0x55e8664fede0] > -5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In > function 'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700 > time 2016-12-08 > 15:47:08.483704 > mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null()) > > And I can reference this with: > > root@ceph-mon1:~/1000a453344# rados -p ven-ceph-metadata-1 listomapkeys > 1000a453344. > 1470734502_head > config_head > > Would we also need to clean up this object, if so is there a safe we can > do this? > > Rob > > On Thu, 8 Dec 2016 at 19:58 Sean Redmond <sean.redmo...@gmail.com sean.redmo...@gmail.com>> wrote: > Hi John, > > Thanks for your pointers, I have extracted the onmap_keys and onmap_values > for an object I found in the metadata pool called '600.' and > dropped them at the below location > > https://www.dropbox.com/sh/wg6irrjg7kie95p/AABk38IB4PXsn2yINpNa9Js5a?dl=0 > > Could you explain how is it possible to identify stray directory fragments? > > Thanks > > On Thu, Dec 8, 2016 at 6:30 PM, John Spray <jsp...@redhat.com jsp...@redhat.com>> wrote: > On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond <sean.redmo...@gmail.com > <mailto:sean.redmo...@gmail.com>> wrote: > > Hi, > > > > We had no changes going on with the ceph pools or ceph servers at the > time. 
> > > > We have however been hitting this in the last week and it maybe related: > > > > http://tracker.ceph.com/issues/17177 > > Oh, okay -- so you've got corruption in your metadata pool as a result > of hitting that issue, presumably. > > I think in the past people have managed to get past this by taking > their MDSs offline and manually removing the omap entries in their > stray directory fragments (i.e. using the `rados` cli on the objects > starting "600."). > > John > > > > > Thanks > > > > On Thu, Dec 8, 2016 at 3:34 PM, John Spray <jsp...@redhat.com jsp...@redhat.com>> wrote: > >> > >> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond <sean.redmo...@gmail.com > <mailto:sean.redmo...@gmail.com>> > >> wrote: > >> > Hi, > >> > > >> > I have a CephFS cluster that is currently unable to start the mds > server > >> > as > >> > it is hitting an assert, the extract from the mds log is below, any > >> > pointers > >> > are welcome: > >> > > >> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b) > >> > > >> >
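For anyone following the same path, a hedged sketch of inspecting a stray directory fragment's omap before removing anything; the object name and key below are placeholders, the pool name is the one quoted earlier in the thread, and the removal step is destructive, so stop the MDS and copy the object out first:

    # list the dentry keys held by one stray directory fragment object
    rados -p ven-ceph-metadata-1 listomapkeys 600.00000000

    # dump one entry to a file before touching it
    rados -p ven-ceph-metadata-1 getomapval 600.00000000 <dentry_key> /tmp/dentry.bin

    # remove the corrupt entry (only with the MDS stopped and a backup in hand)
    rados -p ven-ceph-metadata-1 rmomapkey 600.00000000 <dentry_key>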
Re: [ceph-users] rgw civetweb ssl official documentation?
We terminate all of our TLS at the load-balancer. To make it simple, use HAProxy in front of your single instance. BTW, the latest versions of HAProxy can out perform expensive hardware LBs. We use both at Bloomberg. -CJ On Wed, Dec 7, 2016 at 1:44 PM, Puff, Jonathon <jonathon.p...@netapp.com> wrote: > There’s a few documents out around this subject, but I can’t find anything > official. Can someone point me to any official documentation for deploying > this? Other alternatives appear to be a HAproxy frontend. Currently > running 10.2.3 with a single radosgw. > > > > -JP > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
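A minimal haproxy.cfg sketch of that pattern, terminating TLS in front of a single civetweb radosgw; the bind address, certificate path and backend port are assumptions, and a production config would add timeouts, logging and health-check tuning:

    frontend rgw_https
        bind *:443 ssl crt /etc/haproxy/certs/rgw.pem
        mode http
        default_backend rgw_civetweb

    backend rgw_civetweb
        mode http
        server rgw1 127.0.0.1:8080 check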
Re: [ceph-users] How are replicas spread in default crush configuration?
Kevin, After changing the pool size to 3, make sure the min_size is set to 1 to allow 2 of the 3 hosts to be offline. http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values [2] How many MONs do you have and are they on the same OSD hosts? If you have 3 MONs running on the OSD hosts and two go offline, you will not have a quorum of MONs and I/O will be blocked. I would also check your CRUSH map. I believe you want to make sure your rules have "step chooseleaf firstn 0 type host" and not "... type osd" so that replicas are on different hosts. I have not had to make that change before so you will want to read up on it first. Don't take my word for it. http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters [3] Hope that helps. Chris On 2016-11-23 1:32 pm, Kevin Olbrich wrote: > Hi, > > just to make sure, as I did not find a reference in the docs: > Are replicas spread across hosts or "just" OSDs? > > I am using a 5 OSD cluster (4 pools, 128 pgs each) with size = 2. Currently > each OSD is a ZFS backed storage array. > Now I installed a server which is planned to host 4x OSDs (and setting size > to 3). > > I want to make sure we can resist two offline hosts (in terms of hardware). > Is my assumption correct? > > Mit freundlichen Grüßen / best regards, > Kevin Olbrich. > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1] Links: -- [1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values [3] http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
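For reference, the commands and the rule shape being described look roughly like the sketch below; the rule is the stock replicated ruleset as it appears in a decompiled CRUSH map, and the pool name is a placeholder:

    # size/min_size on an existing pool
    ceph osd pool set <pool> size 3
    ceph osd pool set <pool> min_size 1

    # decompiled CRUSH rule placing each replica on a different host
    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }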
Re: [ceph-users] ceph cluster having blocke requests very frequently
Maybe a long shot, but have you checked OSD memory usage? Are the OSD hosts low on RAM and swapping to disk? I am not familiar with your issue, but though that might cause it. Chris On 2016-11-14 3:29 pm, Brad Hubbard wrote: > Have you looked for clues in the output of dump_historic_ops ? > > On Tue, Nov 15, 2016 at 1:45 AM, Thomas Danan <thomas.da...@mycom-osi.com> > wrote: > > Thanks Luis, > > Here are some answers > > Journals are not on SSD and collocated with OSD daemons host. > > We look at the disk performances and did not notice anything wrong with > acceptable rw latency < 20ms. > > No issue on the network as well from what we have seen. > > There is only one pool in the cluster so pool size = cluster size. > Replication factor is default: 3 and there is no erasure coding. > > We tried to stop deep scrub but without notable effect. > > We have one near full OSD and adding new DN but I doubt this could be the > issue. > > I doubt we are hitting cluster limits but if it was the case, then adding new > DN should help. Also writing to primary OSD is working fine whereas writing > on secondary OSD is often blocked. Last is recovery can be very fast (several > GB/s) and seems never blocked, where client RW IOs are about several hundred > MB /s and are too much often blocked when writing replicas. > > Thomas > > FROM: Luis Periquito [mailto:periqu...@gmail.com] > SENT: lundi 14 novembre 2016 16:23 > TO: Thomas Danan > CC: ceph-users@lists.ceph.com > SUBJECT: Re: [ceph-users] ceph cluster having blocke requests very frequently > > Without knowing the cluster architecture it's hard to know exactly what may > be happening. > > How is the cluster hardware? Where are the journals? How busy are the disks > (% time busy)? What is the pool size? Are these replicated or EC pools? > > Have you tried tuning the deep-scrub processes? Have you tried stopping them > altogether? Are the journals on SSDs? As a first feeling the cluster may be > hitting it's limits (also you have at least one OSD getting full)... > > On Mon, Nov 14, 2016 at 3:16 PM, Thomas Danan <thomas.da...@mycom-osi.com> > wrote: > > Hi All, > > We have a cluster in production who is suffering from intermittent blocked > request (25 requests are blocked > 32 sec). The blocked request occurrences > are frequent and global to all OSDs. > > From the OSD daemon logs, I can see related messages: > > 16-11-11 18:25:29.917518 7fd28b989700 0 log_channel(cluster) log [WRN] : slow > request 30.429723 seconds old, received at 2016-11-11 18:24:59.487570: > osd_op(client.2406272.1:336025615 rbd_data.66e952ae8944a.00350167 > [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] > 0.8d3c9da5 snapc 248=[248,216] ondisk+write e201514) currently waiting for > subops from 210,499,821 > > . So I guess the issue is related to replication process when writing new > data on the cluster. Again it is never the same secondary OSDs that are > displayed in OSD daemon logs. > > As a result we are experiencing very important IO Write latency on ceph > client side (can be up to 1 hour !!!). > > We have checked Network health as well as disk health but we wre not able to > find any issue. > > Wanted to know if this issue was already observed or if you have ideas to > investigate / WA the issue. > > Many thanks... 
> > Thomas > > The cluster is composed with 37DN and 851 OSDs and 5 MONs > > The Ceph clients are accessing the client with RBD > > Cluster is Hammer 0.94.5 version > > cluster 1a26e029-3734-4b0e-b86e-ca2778d0c990 > > health HEALTH_WARN > > 25 requests are blocked > 32 sec > > 1 near full osd(s) > > noout flag(s) set > > monmap e3: 5 mons at > {NVMBD1CGK190D00=10.137.81.13:6789/0,nvmbd1cgy050d00=10.137.78.226:6789/0,nvmbd1cgy070d00=10.137.78.232:6789/0,nvmbd1cgy090d00=10.137.78.228:6789/0,nvmbd1cgy130d00=10.137.78.218:6789/0 > [1]} > > election epoch 664, quorum 0,1,2,3,4 > nvmbd1cgy130d00,nvmbd1cgy050d00,nvmbd1cgy090d00,nvmbd1cgy070d00,NVMBD1CGK190D00 > > > osdmap e205632: 851 osds: 850 up, 850 in > > flags noout > > pgmap v25919096: 10240 pgs, 1 pools, 197 TB data, 50664 kobjects > > 597 TB used, 233 TB / 831 TB avail > > 10208 active+clean > > 32 active+clean+scrubbing+deep > > client io 97822 kB/s rd, 205 MB/s wr, 2402 op/s > > THANK YOU > > THOMAS DANAN > > DIRECTOR OF PRODUCT DEVELOPMENT > > Office +33 1 49 03 77 53 [2] > > Mobile +33 7 76 35 76 43 [3] > > Skype thomas.danan > >
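Picking up Brad's suggestion, a hedged sketch of pulling the recent slow ops from one of the secondary OSDs named in the "waiting for subops from 210,499,821" line; run it on the host carrying that OSD, with its admin socket available:

    # ops currently stuck on this OSD
    ceph daemon osd.210 dump_ops_in_flight

    # the slowest recently completed ops, with per-stage timestamps
    # (look for long gaps around "waiting for subops" / "commit_sent")
    ceph daemon osd.210 dump_historic_ops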
Re: [ceph-users] cephfs slow delete
Just a thought, but since a directory tree is a first class item in cephfs, could the wire protocol be extended with an “recursive delete” operation, specifically for cases like this? On 10/14/16, 4:16 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Fri, Oct 14, 2016 at 1:11 PM, Heller, Chris <chel...@akamai.com> wrote: > Ok. Since I’m running through the Hadoop/ceph api, there is no syscall boundary so there is a simple place to improve the throughput here. Good to know, I’ll work on a patch… Ah yeah, if you're in whatever they call the recursive tree delete function you can unroll that loop a whole bunch. I forget where the boundary is so you may need to go play with the JNI code; not sure. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs slow delete
Ok. Since I’m running through the Hadoop/ceph api, there is no syscall boundary so there is a simple place to improve the throughput here. Good to know, I’ll work on a patch… On 10/14/16, 3:58 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Fri, Oct 14, 2016 at 11:41 AM, Heller, Chris <chel...@akamai.com> wrote: > Unfortunately, it was all in the unlink operation. Looks as if it took nearly 20 hours to remove the dir, roundtrip is a killer there. What can be done to reduce RTT to the MDS? Does the client really have to sequentially delete directories or can it have internal batching or parallelization? It's bound by the same syscall APIs as anything else. You can spin off multiple deleters; I'd either keep them on one client (if you want to work within a single directory) or if using multiple clients assign them to different portions of the hierarchy. That will let you parallelize across the IO latency until you hit a cap on the MDS' total throughput (should be 1-10k deletes/s based on latest tests IIRC). -Greg > > -Chris > > On 10/13/16, 4:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: > > On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris <chel...@akamai.com> wrote: > > I have a directory I’ve been trying to remove from cephfs (via > > cephfs-hadoop), the directory is a few hundred gigabytes in size and > > contains a few million files, but not in a single sub directory. I startd > > the delete yesterday at around 6:30 EST, and it’s still progressing. I can > > see from (ceph osd df) that the overall data usage on my cluster is > > decreasing, but at the rate its going it will be a month before the entire > > sub directory is gone. Is a recursive delete of a directory known to be a > > slow operation in CephFS or have I hit upon some bad configuration? What > > steps can I take to better debug this scenario? > > Is it the actual unlink operation taking a long time, or just the > reduction in used space? Unlinks require a round trip to the MDS > unfortunately, but you should be able to speed things up at least some > by issuing them in parallel on different directories. > > If it's the used space, you can let the MDS issue more RADOS delete > ops by adjusting the "mds max purge files" and "mds max purge ops" > config values. > -Greg > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs slow delete
Unfortunately, it was all in the unlink operation. Looks as if it took nearly 20 hours to remove the dir, roundtrip is a killer there. What can be done to reduce RTT to the MDS? Does the client really have to sequentially delete directories or can it have internal batching or parallelization? -Chris On 10/13/16, 4:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris <chel...@akamai.com> wrote: > I have a directory I’ve been trying to remove from cephfs (via > cephfs-hadoop), the directory is a few hundred gigabytes in size and > contains a few million files, but not in a single sub directory. I startd > the delete yesterday at around 6:30 EST, and it’s still progressing. I can > see from (ceph osd df) that the overall data usage on my cluster is > decreasing, but at the rate its going it will be a month before the entire > sub directory is gone. Is a recursive delete of a directory known to be a > slow operation in CephFS or have I hit upon some bad configuration? What > steps can I take to better debug this scenario? Is it the actual unlink operation taking a long time, or just the reduction in used space? Unlinks require a round trip to the MDS unfortunately, but you should be able to speed things up at least some by issuing them in parallel on different directories. If it's the used space, you can let the MDS issue more RADOS delete ops by adjusting the "mds max purge files" and "mds max purge ops" config values. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
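A hedged sketch of both suggestions from this thread: the purge throttles Greg names (values here are illustrative, not recommendations) and a crude way to fan unlinks out in parallel if the same tree is also reachable through a kernel or FUSE mount:

    # ceph.conf on the MDS hosts -- raise the purge throttles, then restart or inject
    [mds]
        mds max purge files = 256
        mds max purge ops = 32768

    # crude client-side parallelism: delete sibling subdirectories concurrently
    ls -d /mnt/cephfs/bigdir/*/ | xargs -P 8 -I{} rm -rf {}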
Re: [ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"
On 13/10/2016 11:49, Henrik Korkuc wrote: Is apt/dpkg doing something now? Is problem repeatable, e.g. by killing upgrade and starting again. Are there any stuck systemctl processes? I had no problems upgrading 10.2.x clusters to 10.2.3 On 16-10-13 13:41, Chris Murray wrote: On 22/09/2016 15:29, Chris Murray wrote: Hi all, Might anyone be able to help me troubleshoot an "apt-get dist-upgrade" which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"? I'm upgrading from 10.2.2. The two OSDs on this node are up, and think they are version 10.2.3, but the upgrade doesn't appear to be finishing ... ? Thank you in advance, Chris Hi, Are there possibly any pointers to help troubleshoot this? I've got a test system on which the same thing has happened. The cluster's status is "HEALTH_OK" before starting. I'm running Debian Jessie. dpkg.log only has the following: 2016-10-13 11:37:25 configure ceph-osd:amd64 10.2.3-1~bpo80+1 2016-10-13 11:37:25 status half-configured ceph-osd:amd64 10.2.3-1~bpo80+1 At this point, the ugrade gets stuck and doesn't go any further. Where could I look for the next clue? Thanks, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thank you Henrik, I see it's a systemctl process that's stuck. It is reproducible for me on every run of dpkg --configure -a And, indeed, reproducible across two separate machines. I'll pursue the stuck "/bin/systemctl start ceph-osd.target". Thanks again, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
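A few generic starting points for chasing a hung `systemctl start ceph-osd.target` (nothing here is Ceph-specific, just standard systemd tooling on Jessie):

    # is systemd actually waiting on a job, and which one?
    systemctl list-jobs

    # what the target and its units think is going on
    systemctl status ceph-osd.target
    journalctl -u ceph-osd.target --since "1 hour ago"

    # confirm which systemctl invocation dpkg is blocked behind
    ps -ef | grep '[s]ystemctl start ceph-osd.target'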
[ceph-users] cephfs slow delete
I have a directory I’ve been trying to remove from cephfs (via cephfs-hadoop); the directory is a few hundred gigabytes in size and contains a few million files, but not in a single sub directory. I started the delete yesterday at around 6:30 EST, and it’s still progressing. I can see from (ceph osd df) that the overall data usage on my cluster is decreasing, but at the rate it’s going it will be a month before the entire sub directory is gone. Is a recursive delete of a directory known to be a slow operation in CephFS, or have I hit upon some bad configuration? What steps can I take to better debug this scenario? -Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"
On 22/09/2016 15:29, Chris Murray wrote: Hi all, Might anyone be able to help me troubleshoot an "apt-get dist-upgrade" which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"? I'm upgrading from 10.2.2. The two OSDs on this node are up, and think they are version 10.2.3, but the upgrade doesn't appear to be finishing ... ? Thank you in advance, Chris Hi, Are there possibly any pointers to help troubleshoot this? I've got a test system on which the same thing has happened. The cluster's status is "HEALTH_OK" before starting. I'm running Debian Jessie. dpkg.log only has the following: 2016-10-13 11:37:25 configure ceph-osd:amd64 10.2.3-1~bpo80+1 2016-10-13 11:37:25 status half-configured ceph-osd:amd64 10.2.3-1~bpo80+1 At this point, the ugrade gets stuck and doesn't go any further. Where could I look for the next clue? Thanks, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New OSD Nodes, pgs haven't changed state
I see on this list often that peering issues are related to networking and MTU sizes. Perhaps the HP 5400's or the managed switches did not have jumbo frames enabled? Hope that helps you determine the issue in case you want to move the nodes back to the other location. Chris On 2016-10-11 2:30 pm, Mike Jacobacci wrote: > Hi Goncalo, > > Thanks for your reply! I finally figured out that our issue was with the > physical setup of the nodes. Se had one OSD and MON node in our office and > the others are co-located at our ISP. We have an almost dark fiber going > between our two buildings connected via HP 5400's, but it really isn't since > there are some switches in between doing VLAN rewriting (ISP managed). > > Even though all the interfaces were communicating without issue, no data > would move across the nodes. I ended up moving all nodes into the same rack > and data immediately started moving and the cluster is now working! So it > seems the storage traffic was being dropped/blocked by something on our ISP > side. > > Cheers, > Mike > > On Mon, Oct 10, 2016 at 5:22 PM, Goncalo Borges > <goncalo.bor...@sydney.edu.au> wrote: > >> Hi Mike... >> >> I was hoping that someone with a bit more experience would answer you since >> I never had similar situation. So, I'll try to step in and help. >> >> The peering process means that the OSDs are agreeing on the state of objects >> in the PGs they share. The peering process can take some time and is a hard >> operation to execute from a ceph point of view, specially if a lot of >> peering happens at the same time. This is one of the reasons why also the pg >> increase should be done in very small steps (normally increases of 256 pgs). >> >> Is your cluster slowly decreasing the number of pgs in peering? and the >> number of active pgs increasing? If you see no evolution at all after this >> time, you can have a problem. >> >> pgs which do not leave the peering state may be because: >> - incorrect crush map >> - issues in osds >> - issues with the network >> >> Check that your network is working as expected and that you do not have >> firewalls blocking traffic and so on. >> >> A pg query for one of those peering pgs may provide some further information >> about what could be wrong. >> >> Looking to osd logs may also show a bit of light. >> >> Cheers >> Goncalo >> >> >> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Mike >> Jacobacci [mi...@flowjo.com] >> Sent: 10 October 2016 01:55 >> To: ceph-us...@ceph.com >> Subject: [ceph-users] New OSD Nodes, pgs haven't changed state >> >> Hi, >> >> Yesterday morning I added two more OSD nodes and changed the crushmap from >> disk to node. It looked to me like everything went ok besides some disks >> missing that I can re-add later, but the cluster status hasn't changed since >> then. 
Here is the output of ceph -w: >> >> cluster 395fb046-0062-4252-914c-013258c5575c >> health HEALTH_ERR >> 1761 pgs are stuck inactive for more than 300 seconds >> 1761 pgs peering >> 1761 pgs stuck inactive >> 8 requests are blocked > 32 sec >> crush map has legacy tunables (require bobtail, min is firefly) >> monmap e2: 3 mons at {birkeland=192.168.10.190:6789/0,immanuel=192.168.10.1 >> [1]<http://192.168.10.190:6789/0,immanuel=192.168.10.1 [1]> >> 25:6789/0,peratt=192.168.10.187:6789/0 [2]<http://192.168.10.187:6789/0 [2]>} >> >> election epoch 14, quorum 0,1,2 immanuel,peratt,birkeland >> osdmap e186: 26 osds: 26 up, 26 in; 1796 remapped pgs >> flags sortbitwise >> pgmap v6599413: 1796 pgs, 4 pools, 1343 GB data, 336 kobjects >> 4049 GB used, 92779 GB / 96829 GB avail >> 1761 remapped+peering >> 35 active+clean >> 2016-10-09 07:00:00.000776 mon.0 [INF] HEALTH_ERR; 1761 pgs are stuck >> inactive f or more than 300 seconds; 1761 pgs peering; 1761 pgs stuck >> inactive; 8 requests are blocked > 32 sec; crush map has legacy tunables >> (require bobtail, min is fir efly) >> >> I have legacy tunables on since Ceph is only backing our Xenserver >> infrastructure. The number of pgs remapping and clean haven't changed and >> there isn't seem to be that much data... Is this normal behavior? >> >> Here is my crushmap: >> >> # begin crush map >> tunable choose_local_tries 0 >> tunable choose_local_fallback_tries 0 >> tunable choose_total_tries 50 >> tunable chooseleaf_descend_once
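Since MTU mismatches come up so often with stuck peering, a quick hedged check that jumbo frames actually survive end-to-end between two storage nodes; the address is a placeholder, and 8972 is 9000 bytes minus the 28 bytes of IP and ICMP headers:

    # from one OSD node to another, with fragmentation forbidden
    ping -M do -s 8972 -c 4 <peer-node-ip>    # succeeds only if jumbo frames pass cleanly
    ping -M do -s 1472 -c 4 <peer-node-ip>    # standard-MTU control case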
[ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"
Hi all, Might anyone be able to help me troubleshoot an "apt-get dist-upgrade" which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"? I'm upgrading from 10.2.2. The two OSDs on this node are up, and think they are version 10.2.3, but the upgrade doesn't appear to be finishing ... ? Thank you in advance, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
So just to put more info out there, here is what I’m seeing with a Spark/HDFS client: 2016-09-21 20:09:25.076595 7fd61c16f700 0 -- 192.168.1.157:0/634334964 >> 192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53864 s=2 pgs=50445 cs=1 l=0 c=0x7fd5fdd371d0).fault, initiating reconnect 2016-09-21 20:09:25.077328 7fd60c579700 0 -- 192.168.1.157:0/634334964 >> 192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=1 pgs=50445 cs=2 l=0 c=0x7fd5fdd371d0).connect got RESETSESSION 2016-09-21 20:09:25.077429 7fd60fd80700 0 client.585194220 ms_handle_remote_reset on 192.168.1.190:6802/32183 2016-09-21 20:20:55.990686 7fd61c16f700 0 -- 192.168.1.157:0/634334964 >> 192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=2 pgs=50630 cs=1 l=0 c=0x7fd5fdd371d0).fault, initiating reconnect 2016-09-21 20:20:55.990890 7fd60c579700 0 -- 192.168.1.157:0/634334964 >> 192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=1 pgs=50630 cs=2 l=0 c=0x7fd5fdd371d0).fault 2016-09-21 20:21:09.385228 7fd60c579700 0 -- 192.168.1.157:0/634334964 >> 192.168.1.154:6800/17142 pipe(0x7fd6401e8160 sd=184 :39160 s=1 pgs=0 cs=0 l=0 c=0x7fd6400433c0).fault And here is its session info from ‘session ls’: { "id": 585194220, "num_leases": 0, "num_caps": 16385, "state": "open", "replay_requests": 0, "reconnecting": false, "inst": "client.585194220 192.168.1.157:0\/634334964", "client_metadata": { "ceph_sha1": "d56bdf93ced6b80b07397d57e3fa68fe68304432", "ceph_version": "ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)", "entity_id": "hdfs.user", "hostname": "a192-168-1-157.d.a.com" } }, -Chris On 9/21/16, 9:27 PM, "Heller, Chris" <chel...@akamai.com> wrote: I also went and bumped mds_cache_size up to 1 million… still seeing cache pressure, but I might just need to evict those clients… On 9/21/16, 9:24 PM, "Heller, Chris" <chel...@akamai.com> wrote: What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’ leases appears to be, on average, 1. But caps seems to be 16385 for many many clients! -Chris On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris <chel...@akamai.com> wrote: > I’m suspecting something similar, we have millions of files and can read a huge subset of them at a time, presently the client is Spark 1.5.2 which I suspect is leaving the closing of file descriptors up to the garbage collector. That said, I’d like to know if I could verify this theory using the ceph tools. I’ll try upping “mds cache size”, are there any other configuration settings I might adjust to perhaps ease the problem while I track it down in the HDFS tools layer? That's the big one. You can also go through the admin socket commands for things like "session ls" that will tell you how many files the client is holding on to and compare. > > -Chris > > On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: > > On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> wrote: > > Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. > > That definitely could have had something to do with it, if say they > overloaded the MDS so much it got stuck in a directory read loop. > ...actually now I come to think of it, I think there was some problem > with Hadoop not being nice about closing files and so forcing clients > to keep them pinned, which will make the MDS pretty unhappy if they're > holding more than it's configured for. 
> > > > > Now here is the result of `ceph –s` > > > > # ceph -s > > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0 > > health HEALTH_OK > > monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0} > > election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191 > > mdsmap e18
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
I also went and bumped mds_cache_size up to 1 million… still seeing cache pressure, but I might just need to evict those clients… On 9/21/16, 9:24 PM, "Heller, Chris" <chel...@akamai.com> wrote: What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’ leases appears to be, on average, 1. But caps seems to be 16385 for many many clients! -Chris On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris <chel...@akamai.com> wrote: > I’m suspecting something similar, we have millions of files and can read a huge subset of them at a time, presently the client is Spark 1.5.2 which I suspect is leaving the closing of file descriptors up to the garbage collector. That said, I’d like to know if I could verify this theory using the ceph tools. I’ll try upping “mds cache size”, are there any other configuration settings I might adjust to perhaps ease the problem while I track it down in the HDFS tools layer? That's the big one. You can also go through the admin socket commands for things like "session ls" that will tell you how many files the client is holding on to and compare. > > -Chris > > On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: > > On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> wrote: > > Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. > > That definitely could have had something to do with it, if say they > overloaded the MDS so much it got stuck in a directory read loop. > ...actually now I come to think of it, I think there was some problem > with Hadoop not being nice about closing files and so forcing clients > to keep them pinned, which will make the MDS pretty unhappy if they're > holding more than it's configured for. > > > > > Now here is the result of `ceph –s` > > > > # ceph -s > > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0 > > health HEALTH_OK > > monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0} > > election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191 > > mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 up:standby > > osdmap e118886: 192 osds: 192 up, 192 in > > pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects > > 69601 GB used, 37656 GB / 104 TB avail > >11309 active+clean > > 13 active+clean+scrubbing > >6 active+clean+scrubbing+deep > > > > And here are the ops in flight: > > > > # ceph daemon mds.a190 dump_ops_in_flight > > { > > "ops": [], > > "num_ops": 0 > > } > > > > And a tail of the active mds log at debug_mds 5/5 > > > > 2016-09-21 20:15:53.354226 7fce3b626700 4 mds.0.server handle_client_request client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 20:15:53.352390) v2 > > 2016-09-21 20:15:53.354234 7fce3b626700 5 mds.0.server session closed|closing|killing, dropping > > This is also pretty solid evidence that the MDS is zapping clients > when they misbehave. > > You can increase "mds cache size" past its default 10 dentries and > see if that alleviates (or just draws out) the problem. 
> -Greg > > > 2016-09-21 20:15:54.867108 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 235) v1 from client.507429717 > > 2016-09-21 20:15:54.980907 7fce3851f700 2 mds.0.cache check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode > > 2016-09-21 20:15:54.980960 7fce3851f700 5 mds.0.bal mds.0 epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34> > > 2016-09-
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’ leases appears to be, on average, 1. But caps seems to be 16385 for many many clients! -Chris On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris <chel...@akamai.com> wrote: > I’m suspecting something similar, we have millions of files and can read a huge subset of them at a time, presently the client is Spark 1.5.2 which I suspect is leaving the closing of file descriptors up to the garbage collector. That said, I’d like to know if I could verify this theory using the ceph tools. I’ll try upping “mds cache size”, are there any other configuration settings I might adjust to perhaps ease the problem while I track it down in the HDFS tools layer? That's the big one. You can also go through the admin socket commands for things like "session ls" that will tell you how many files the client is holding on to and compare. > > -Chris > > On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: > > On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> wrote: > > Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. > > That definitely could have had something to do with it, if say they > overloaded the MDS so much it got stuck in a directory read loop. > ...actually now I come to think of it, I think there was some problem > with Hadoop not being nice about closing files and so forcing clients > to keep them pinned, which will make the MDS pretty unhappy if they're > holding more than it's configured for. > > > > > Now here is the result of `ceph –s` > > > > # ceph -s > > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0 > > health HEALTH_OK > > monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0} > > election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191 > > mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 up:standby > > osdmap e118886: 192 osds: 192 up, 192 in > > pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects > > 69601 GB used, 37656 GB / 104 TB avail > >11309 active+clean > > 13 active+clean+scrubbing > >6 active+clean+scrubbing+deep > > > > And here are the ops in flight: > > > > # ceph daemon mds.a190 dump_ops_in_flight > > { > > "ops": [], > > "num_ops": 0 > > } > > > > And a tail of the active mds log at debug_mds 5/5 > > > > 2016-09-21 20:15:53.354226 7fce3b626700 4 mds.0.server handle_client_request client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 20:15:53.352390) v2 > > 2016-09-21 20:15:53.354234 7fce3b626700 5 mds.0.server session closed|closing|killing, dropping > > This is also pretty solid evidence that the MDS is zapping clients > when they misbehave. > > You can increase "mds cache size" past its default 10 dentries and > see if that alleviates (or just draws out) the problem. 
> -Greg > > > 2016-09-21 20:15:54.867108 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 235) v1 from client.507429717 > > 2016-09-21 20:15:54.980907 7fce3851f700 2 mds.0.cache check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode > > 2016-09-21 20:15:54.980960 7fce3851f700 5 mds.0.bal mds.0 epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34> > > 2016-09-21 20:15:55.247885 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 233) v1 from client.538555196 > > 2016-09-21 20:15:55.455566 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 365) v1 from client.507390467 > > 2016-09-21 20:15:55.807704 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 367) v1 from client.538485341 > >
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
I’m suspecting something similar, we have millions of files and can read a huge subset of them at a time, presently the client is Spark 1.5.2 which I suspect is leaving the closing of file descriptors up to the garbage collector. That said, I’d like to know if I could verify this theory using the ceph tools. I’ll try upping “mds cache size”, are there any other configuration settings I might adjust to perhaps ease the problem while I track it down in the HDFS tools layer? -Chris On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> wrote: > Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. That definitely could have had something to do with it, if say they overloaded the MDS so much it got stuck in a directory read loop. ...actually now I come to think of it, I think there was some problem with Hadoop not being nice about closing files and so forcing clients to keep them pinned, which will make the MDS pretty unhappy if they're holding more than it's configured for. > > Now here is the result of `ceph –s` > > # ceph -s > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0 > health HEALTH_OK > monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0} > election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191 > mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 up:standby > osdmap e118886: 192 osds: 192 up, 192 in > pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects > 69601 GB used, 37656 GB / 104 TB avail >11309 active+clean > 13 active+clean+scrubbing >6 active+clean+scrubbing+deep > > And here are the ops in flight: > > # ceph daemon mds.a190 dump_ops_in_flight > { > "ops": [], > "num_ops": 0 > } > > And a tail of the active mds log at debug_mds 5/5 > > 2016-09-21 20:15:53.354226 7fce3b626700 4 mds.0.server handle_client_request client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 20:15:53.352390) v2 > 2016-09-21 20:15:53.354234 7fce3b626700 5 mds.0.server session closed|closing|killing, dropping This is also pretty solid evidence that the MDS is zapping clients when they misbehave. You can increase "mds cache size" past its default 10 dentries and see if that alleviates (or just draws out) the problem. 
-Greg > 2016-09-21 20:15:54.867108 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 235) v1 from client.507429717 > 2016-09-21 20:15:54.980907 7fce3851f700 2 mds.0.cache check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode > 2016-09-21 20:15:54.980960 7fce3851f700 5 mds.0.bal mds.0 epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34> > 2016-09-21 20:15:55.247885 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 233) v1 from client.538555196 > 2016-09-21 20:15:55.455566 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 365) v1 from client.507390467 > 2016-09-21 20:15:55.807704 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 367) v1 from client.538485341 > 2016-09-21 20:15:56.243462 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 189) v1 from client.538577596 > 2016-09-21 20:15:56.986901 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 232) v1 from client.507430372 > 2016-09-21 20:15:57.026206 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885158 > 2016-09-21 20:15:57.369281 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390682 > 2016-09-21 20:15:57.445687 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538485996 > 2016-09-21 20:15:57.579268 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486021 > 2016-09-21 20:15:57.595568 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.5
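For reference, the knob Greg is pointing at is `mds cache size`; a minimal sketch of raising it, with the caveat that the value is illustrative and the MDS needs enough RAM to back whatever ceiling you pick:

    # ceph.conf on the MDS hosts
    [mds]
        mds cache size = 1000000

    # or at runtime on the active MDS, via its admin socket
    ceph daemon mds.a190 config set mds_cache_size 1000000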
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. Now here is the result of `ceph –s` # ceph -s cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0 health HEALTH_OK monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0} election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191 mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 up:standby osdmap e118886: 192 osds: 192 up, 192 in pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects 69601 GB used, 37656 GB / 104 TB avail 11309 active+clean 13 active+clean+scrubbing 6 active+clean+scrubbing+deep And here are the ops in flight: # ceph daemon mds.a190 dump_ops_in_flight { "ops": [], "num_ops": 0 } And a tail of the active mds log at debug_mds 5/5 2016-09-21 20:15:53.354226 7fce3b626700 4 mds.0.server handle_client_request client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 20:15:53.352390) v2 2016-09-21 20:15:53.354234 7fce3b626700 5 mds.0.server session closed|closing|killing, dropping 2016-09-21 20:15:54.867108 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 235) v1 from client.507429717 2016-09-21 20:15:54.980907 7fce3851f700 2 mds.0.cache check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode 2016-09-21 20:15:54.980960 7fce3851f700 5 mds.0.bal mds.0 epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34> 2016-09-21 20:15:55.247885 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 233) v1 from client.538555196 2016-09-21 20:15:55.455566 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 365) v1 from client.507390467 2016-09-21 20:15:55.807704 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 367) v1 from client.538485341 2016-09-21 20:15:56.243462 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 189) v1 from client.538577596 2016-09-21 20:15:56.986901 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 232) v1 from client.507430372 2016-09-21 20:15:57.026206 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885158 2016-09-21 20:15:57.369281 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390682 2016-09-21 20:15:57.445687 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538485996 2016-09-21 20:15:57.579268 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486021 2016-09-21 20:15:57.595568 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390702 2016-09-21 20:15:57.604356 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390712 2016-09-21 20:15:57.693546 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390717 2016-09-21 20:15:57.819536 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885168 2016-09-21 20:15:57.894058 
7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390732 2016-09-21 20:15:57.983329 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390742 2016-09-21 20:15:58.077915 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486031 2016-09-21 20:15:58.141710 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885178 2016-09-21 20:15:58.159134 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885188 -Chris On 9/21/16, 11:23 AM, "Heller, Chris" <chel...@akamai.com> wrote: Perhaps related, I was watching the active mds with debug_mds set to 5/5, when I saw this in the log: 2016-09-21 15:13:26.067698 7fbaec248700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 c=0x5631ce0).fault with nothing to send, going to standby 2016-09-21 15:13:26.067717 7fbaf64ea700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3252234463 pipe(0x54d10
[ceph-users] Ceph Rust Librados
Ceph-rust for librados has been released. It's an API interface in Rust for all of librados that's a thin layer above the C APIs. There are low-level direct access and higher level Rust helpers that make working directly with librados simple. The official repo is: https://github.com/ceph/ceph-rust The Rust Crate is: ceph-rust Rust is a systems programing language that gives you the speed and low-level access of C but with the benefits of a higher level language. The main benefits of Rust are: 1. Speed 2. Prevents segfaults 3. Guarantees thread safety 4. Strong typing 5. Compiled You can find out more at: https://www.rust-lang.org Contributions are encouraged and welcomed. This is the base for a number of larger Ceph related projects. Updates to the library will be frequent. Also, there will be new Ceph tools coming soon and you can use the following for RGW/S3 access from Rust: (Supports V2 and V4 signatures) Crate: aws-sdk-rust - https://github.com/lambdastackio/aws-sdk-rust Thanks, Chris Jones ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
othing to send, going to standby 2016-09-21 15:13:26.067911 7fbb01196700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cf20).fault with nothing to send, going to standby 2016-09-21 15:13:26.068076 7fbafc64b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f7a100).fault with nothing to send, going to standby 2016-09-21 15:13:26.068095 7fbafff84700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.140:0/1112150414 pipe(0x5679000 sd=125 :6802 s=2 pgs=2 cs=1 l=0 c=0x41bc7e0).fault with nothing to send, going to standby 2016-09-21 15:13:26.068108 7fbb0de0e700 5 mds.0.953 handle_mds_map epoch 8471 from mon.3 2016-09-21 15:13:26.068114 7fbaf890e700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/1422203298 pipe(0x2963 sd=44 :6802 s=2 pgs=2 cs=1 l=0 c=0x3a740dc0).fault with not hing to send, going to standby 2016-09-21 15:13:26.068143 7fbae860c700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.217:0/1120082018 pipe(0x2a724000 sd=121 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f79e40).fault with no thing to send, going to standby 2016-09-21 15:13:26.068190 7fbb040c5700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.218:0/3945638891 pipe(0x50c sd=53 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f4420).fault with nothi ng to send, going to standby 2016-09-21 15:13:26.068200 7fbaf961b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.144:0/2952053583 pipe(0x318dc000 sd=81 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fa840).fault with not hing to send, going to standby 2016-09-21 15:13:26.068232 7fbaf981d700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.159:0/1872775873 pipe(0x268d7000 sd=38 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f6940).fault with noth ing to send, going to standby 2016-09-21 15:13:26.068253 7fbaeac32700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.186:0/4141441999 pipe(0x54e7000 sd=86 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fb760).fault with noth ing to send, going to standby 2016-09-21 15:13:26.068275 7fbb0de0e700 1 mds.-1.-1 handle_mds_map i (192.168.1.196:6802/13581) dne in the mdsmap, respawning myself 2016-09-21 15:13:26.068289 7fbb0de0e700 1 mds.-1.-1 respawn 2016-09-21 15:13:26.068294 7fbb0de0e700 1 mds.-1.-1 e: 'ceph-mds' 2016-09-21 15:13:26.173095 7f689baa8780 0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mds, pid 13581 2016-09-21 15:13:26.175664 7f689baa8780 -1 mds.-1.0 log_to_monitors {default=true} 2016-09-21 15:13:27.329181 7f68969e9700 1 mds.-1.0 handle_mds_map standby 2016-09-21 15:13:28.484148 7f68969e9700 1 mds.-1.0 handle_mds_map standby 2016-09-21 15:13:33.280376 7f68969e9700 1 mds.-1.0 handle_mds_map standby On 9/21/16, 10:48 AM, "Heller, Chris" <chel...@akamai.com> wrote: I’ll see if I can capture the output the next time this issue arises, but in general the output looks as if nothing is wrong. No OSD are down, a ‘ceph health detail’ results in HEALTH_OK, the mds server is in the up:active state, in general it’s as if nothing is wrong server side (at least from the summary). -Chris On 9/21/16, 10:46 AM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <chel...@akamai.com> wrote: > I’m running a production 0.94.7 Ceph cluster, and have been seeing a > periodic issue arise where in all my MDS clients will become stuck, and the > fix so far has been to restart the active MDS (sometimes I need to restart > the subsequent active MDS as well). 
> > > > These clients are using the cephfs-hadoop API, so there is no kernel client, > or fuse api involved. When I see clients get stuck, there are messages > printed to stderr like the following: > > > > 2016-09-21 10:31:12.285030 7fea4c7fb700 0 -- 192.168.1.241:0/1606648601 >> > 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 > c=0x7feaa0a0c500).fault > > > > I’m at somewhat of a loss on where to begin debugging this issue, and wanted > to ping the list for ideas. What's the full output of "ceph -s" when this happens? Have you looked at the MDS' admin socket's ops-in-flight, and that of the clients? http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as well. > > > > I managed to dump the mds cache during one of the stalled moments, which > hopefully is a useful starting point: > > > > e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8 > mdscachedump.txt.gz (https:
Re: [ceph-users] Faulting MDS clients, HEALTH_OK
I’ll see if I can capture the output the next time this issue arises, but in general the output looks as if nothing is wrong. No OSD are down, a ‘ceph health detail’ results in HEALTH_OK, the mds server is in the up:active state, in general it’s as if nothing is wrong server side (at least from the summary). -Chris On 9/21/16, 10:46 AM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <chel...@akamai.com> wrote: > I’m running a production 0.94.7 Ceph cluster, and have been seeing a > periodic issue arise where in all my MDS clients will become stuck, and the > fix so far has been to restart the active MDS (sometimes I need to restart > the subsequent active MDS as well). > > > > These clients are using the cephfs-hadoop API, so there is no kernel client, > or fuse api involved. When I see clients get stuck, there are messages > printed to stderr like the following: > > > > 2016-09-21 10:31:12.285030 7fea4c7fb700 0 – 192.168.1.241:0/1606648601 >> > 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 > c=0x7feaa0a0c500).fault > > > > I’m at somewhat of a loss on where to begin debugging this issue, and wanted > to ping the list for ideas. What's the full output of "ceph -s" when this happens? Have you looked at the MDS' admin socket's ops-in-flight, and that of the clients?j http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as well. > > > > I managed to dump the mds cache during one of the stalled moments, which > hopefully is a useful starting point: > > > > e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8 > mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg) > > > > > > -Chris > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
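A concrete way to act on the admin-socket suggestion above, assuming default socket paths under /var/run/ceph and hypothetical daemon/client names (an untested sketch; adjust names and paths to your deployment):

# on the active MDS host: what is the MDS currently working on?
ceph daemon mds.ceph01 dump_ops_in_flight
ceph daemon mds.ceph01 session ls
# on a stuck client host, against that client's admin socket:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok mds_requests
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests

If dump_ops_in_flight shows long-lived requests, their flag_point field usually narrows down whether the MDS is waiting on locks, on its journal, or on OSD operations.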
[ceph-users] Faulting MDS clients, HEALTH_OK
I’m running a production 0.94.7 Ceph cluster, and have been seeing a periodic issue arise wherein all my MDS clients become stuck, and the fix so far has been to restart the active MDS (sometimes I need to restart the subsequent active MDS as well). These clients are using the cephfs-hadoop API, so there is no kernel client or FUSE API involved. When I see clients get stuck, there are messages printed to stderr like the following: 2016-09-21 10:31:12.285030 7fea4c7fb700 0 -- 192.168.1.241:0/1606648601 >> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 c=0x7feaa0a0c500).fault I’m at somewhat of a loss on where to begin debugging this issue, and wanted to ping the list for ideas. I managed to dump the mds cache during one of the stalled moments, which hopefully is a useful starting point: e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8 mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg) -Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to associate a cephfs client id to its process
Ok. I’ll see about tracking down the logs (set to stderr for these tasks), and the metadata stuff looks interesting for future association. Thanks, Chris On 9/14/16, 5:04 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 14, 2016 at 7:02 AM, Heller, Chris <chel...@akamai.com> wrote: > I am making use of CephFS plus the cephfs-hadoop shim to replace HDFS in a > system I’ve been experimenting with. > > > > I’ve noticed that a large number of my HDFS clients have a ‘num_caps’ value > of 16385, as seen when running ‘session ls’ on the active mds. This appears > to be one larger than the default value for ‘client_cache_size’ so I presume > some relation, though I have not seen any documentation to corroborate this. > > > > What I was hoping to do is track down which ceph client is actually holding > all these ‘caps’, but since my system can have work scheduled dynamically > and multiple clients can be running on the same host, its not obvious how to > associate the client ‘id’ as reported by ‘session ls’ with any one process > on the give host. > > > > Is there steps I can follow to back track the client ‘id’ to a process id? Hmm, it looks like we no longer directly associate the process ID with the client session. There is a "client metadata" config option you can fill in with arbitrary "key=value[,key2=value2]* strings if you can persuade Hadoop to set that to something useful on each individual process. If you have logging or admin sockets enabled then you should also be able to find them named by client ID and trace those back to the pid with standard linux tooling. I've created a ticket to put this back in as part of the standard metadata: http://tracker.ceph.com/issues/17276 -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
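For the record, the admin-socket association Greg describes can be done with nothing but standard tooling, roughly like this (socket names below are hypothetical; they depend on how admin_socket is configured for the Hadoop clients):

# list the client admin sockets on the host
ls /var/run/ceph/ceph-client.*.asok
# find which process holds a given socket open
lsof /var/run/ceph/ceph-client.admin.4231.asok
ss -xlp | grep ceph-client.admin.4231.asok
# confirm the global client id that process is using with the MDS
ceph --admin-daemon /var/run/ceph/ceph-client.admin.4231.asok mds_sessions

The id reported by mds_sessions should match the 'id' column from 'session ls' on the MDS, which closes the loop back to the pid.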
[ceph-users] How to associate a cephfs client id to its process
I am making use of CephFS plus the cephfs-hadoop shim to replace HDFS in a system I’ve been experimenting with. I’ve noticed that a large number of my HDFS clients have a ‘num_caps’ value of 16385, as seen when running ‘session ls’ on the active mds. This appears to be one larger than the default value for ‘client_cache_size’ so I presume some relation, though I have not seen any documentation to corroborate this. What I was hoping to do is track down which ceph client is actually holding all these ‘caps’, but since my system can have work scheduled dynamically and multiple clients can be running on the same host, it’s not obvious how to associate the client ‘id’ as reported by ‘session ls’ with any one process on the given host. Are there steps I can follow to backtrack the client ‘id’ to a process id? -Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pools per hypervisor?
We are using a single pool for all our RBD images. You could create different pools based on performance and replication needs. Say one with all SSDs and one with SATA. Then put your RBD images in the appropriate pool. Each host is also using the same user. You could use a different user for each hypervisor but that would be up to you. Chris > On Sep 11, 2016, at 9:04 PM, Thomas <tho...@tgmedia.co.nz> wrote: > > Hi Guys, > > Hoping to find help here as I can't seem to find anything on the net. > > I have a ceph cluster and I'd want to use rbd as block storage on our > hypervisors (say 30) to mount drives to our guests. Would you create users > and pools per hypervisor? > > As adding more pools to a cluster seems to be a problem if you're not sure > how many pools you'll end up using, e.g. I kept adding more pools with pg_num > 256 and now I'm at pool no. 4 and my cluster complains about 'too many PGs > per OSD (324 > max 300)' - any ideas ? > > Cheers, > Thomas > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
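For example, a minimal version of the two-pool layout described above might look like this (pool names, PG counts and caps are illustrative; size pg_num so that the total across all pools stays in a sane PGs-per-OSD range):

ceph osd pool create rbd-ssd 128 128 replicated
ceph osd pool create rbd-sata 256 256 replicated
# one shared RBD user, limited to just those pools, used by every hypervisor
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow rwx pool=rbd-ssd, allow rwx pool=rbd-sata'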
[ceph-users] Ceph auth key generation algorithm documentation
I’d like to generate keys for ceph external to any system which would have ceph-authtool. Looking over the ceph website and googling have turned up nothing. Is the ceph auth key generation algorithm documented anywhere? -Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
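For what it's worth, a cephx secret appears to be nothing more than 16 random bytes behind a small binary header (key type, creation time, length), base64-encoded. A sketch that has circulated for generating one without ceph-authtool (treat it as unofficial and verify the result imports cleanly before trusting it):

python -c "import os, struct, time, base64; key = os.urandom(16); header = struct.pack('<hiih', 1, int(time.time()), 0, len(key)); print(base64.b64encode(header + key).decode())"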
Re: [ceph-users] Signature V2
I believe RGW Hammer and below use V2 and Jewel and above use V4. Thanks On Thu, Aug 18, 2016 at 7:32 AM, jan hugo prins <jpr...@betterbe.com> wrote: > did some more searching and according to some info I found RGW should > support V4 signatures. > > http://tracker.ceph.com/issues/10333 > http://tracker.ceph.com/issues/11858 > > The fact that everyone still modifies s3cmd to use Version 2 Signatures > suggests to me that we have a bug in this code. > > If I use V4 signatures most of my requests work fine, but some requests > fail on a signature error. > > Thanks, > Jan Hugo Prins > > > On 08/18/2016 12:46 PM, jan hugo prins wrote: > > Hi everyone. > > > > To connect to my S3 gateways using s3cmd I had to set the option > > signature_v2 in my s3cfg to true. > > If I didn't do that I would get Signature mismatch errors and this seems > > to be because Amazon uses Signature version 4 while the S3 gateway of > > Ceph only supports Signature Version 2. > > > > Now I see the following error in a Jave project we are building that > > should talk to S3. > > > > Aug 18, 2016 12:12:38 PM org.apache.catalina.core.StandardWrapperValve > > invoke > > SEVERE: Servlet.service() for servlet [Default] in context with path > > [/VehicleData] threw exception > > com.betterbe.vd.web.servlet.LsExceptionWrapper: xxx > > caused: com.amazonaws.services.s3.model.AmazonS3Exception: null > > (Service: Amazon S3; Status Code: 400; Error Code: > > XAmzContentSHA256Mismatch; Request ID: > > tx02cc6-0057b58a15-25bba-default), S3 Extended Request > > ID: 25bba-default-default > > at > > com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle( > DatasetRequestHandler.java:262) > > at com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141) > > at com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110) > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:646) > > > > To me this looks a bit the same, though I'm not a Java developer. > > Am I correct, and if so, can I tell the Java S3 client to use Version 2 > > signatures? > > > > > > -- > Met vriendelijke groet / Best regards, > > Jan Hugo Prins > Infra and Isilon storage consultant > > Better.be B.V. > Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527 > T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951 > jpr...@betterbe.com | www.betterbe.com > > This e-mail is intended exclusively for the addressee(s), and may not > be passed on to, or made available for use by any person other than > the addressee(s). Better.be B.V. rules out any and every liability > resulting from any electronic transmission. > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
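For s3cmd specifically, the V2 fallback can be forced per invocation as well as via the config file; on the Java side, if I remember correctly, the v1 AWS SDK can be pushed back to V2 by setting a signer override of "S3SignerType" on the ClientConfiguration. The s3cmd part (bucket name is an example):

# one-off
s3cmd --signature-v2 ls s3://mybucket
# or permanently in ~/.s3cfg
# signature_v2 = True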
Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up
I’d like to understand more why the down OSD would cause the PG to get stuck after CRUSH was able to locate enough OSD to map the PG. Is this some form of safety catch that prevents it from recovering, even though OSD.116 is no longer important for data integrity? Marking the OSD lost is an option here, but it’s not really lost … it just takes some time to get a machine rebooted. I’m still working out my operational procedures for CEPH and marking the OSD lost but having it pop back up once the system reboots could be an issue that I’m not yet sure how to resolve. Can an OSD be marked as ‘found’ once it returns to the network? -Chris From: Goncalo Borges <goncalo.bor...@sydney.edu.au> Date: Monday, August 15, 2016 at 11:36 PM To: "Heller, Chris" <chel...@akamai.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up Hi Chris... The precise osd set you see now [79,8,74] was obtained on epoch 104536 but this was after a lot of tries as showed by the recovery section. Actually, in the first try (on epoch 100767) osd 116 was selected somehow (maybe it was up at the time?) and probably the pg got stuck because it went down during the recover process? recovery_state": [ { "name": "Started\/Primary\/Peering\/GetInfo", "enter_time": "2016-08-11 11:45:06.052568", "requested_info_from": [] }, { "name": "Started\/Primary\/Peering", "enter_time": "2016-08-11 11:45:06.052558", "past_intervals": [ { "first": 100767, "last": 100777, "maybe_went_rw": 1, "up": [ 79, 116, 74 ], "acting": [ 79, 116, 74 ], "primary": 79, "up_primary": 79 }, The pg query also shows peering_blocked_by": [ { "osd": 116, "current_lost_at": 0, "comment": "starting or marking this osd lost may let us proceed" } Maybe, you can check the documentation in [1] and see if you think you could follow the suggestion inside the pg and mark osd 116 as lost. This should be done after proper evaluation from you. Another thing I found strange is that in the recovery section, there are a lot of tries where you do not get a proper osd set. The very last recover try was on epoch 104540. { "first": 104536, "last": 104540, "maybe_went_rw": 1, "up": [ 2147483647, 8, 74 ], "acting": [ 2147483647, 8, 74 ], "primary": 8, "up_primary": 8 } From [2], "When CRUSH fails to find enough OSDs to map to a PG, it will show as a 2147483647 which is ITEM_NONE or no OSD found.". This could be an artifact of the peering being blocked by osd.116, or a genuine problem where you are not being able to get a proper osd set. That could be for a variety of reasons: from network issues, to osds being almost full or simply because the system can't get 3 osds in 3 different hosts. 
Cheers Goncalo [1] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure [2] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ On 08/16/2016 11:42 AM, Heller, Chris wrote: Output of `ceph pg dump_stuck` # ceph pg dump_stuck ok pg_stat state up up_primary
Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up
Output of `ceph pg dump_stuck` # ceph pg dump_stuck ok pg_stat state up up_primary acting acting_primary 4.2a8 down+peering [79,8,74] 79 [79,8,74] 79 4.c3 down+peering [56,79,67] 56 [56,79,67] 56 -Chris From: Goncalo Borges <goncalo.bor...@sydney.edu.au> Date: Monday, August 15, 2016 at 9:03 PM To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>, "Heller, Chris" <chel...@akamai.com> Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up Hi Heller... Can you actually post the result of ceph pg dump_stuck ? Cheers G. On 08/15/2016 10:19 PM, Heller, Chris wrote: I’d like to better understand the current state of my CEPH cluster. I currently have 2 PG that are in the ‘stuck unclean’ state: # ceph health detail HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean pg 4.2a8 is stuck inactive for 124516.91, current state down+peering, last acting [79,8,74] pg 4.c3 is stuck inactive since forever, current state down+peering, last acting [56,79,67] pg 4.2a8 is stuck unclean for 124536.223284, current state down+peering, last acting [79,8,74] pg 4.c3 is stuck unclean since forever, current state down+peering, last acting [56,79,67] pg 4.2a8 is down+peering, acting [79,8,74] pg 4.c3 is down+peering, acting [56,79,67] While my cluster does currently have some down OSD, none are in the acting set for either PG: ceph osd tree | grep down 73 1.0 osd.73 down 0 1.0 96 1.0 osd.96 down 0 1.0 110 1.0 osd.110 down 0 1.0 116 1.0 osd.116 down 0 1.0 120 1.0 osd.120 down 0 1.0 126 1.0 osd.126 down 0 1.0 124 1.0 osd.124 down 0 1.0 119 1.0 osd.119 down 0 1.0 I’ve queried one of the two PG, and see that recovery is currently blocked on OSD.116, which is indeed down, but is not part of the acting set of OSD for that PG: http://pastebin.com/Rg2hK9GE This is all with CEPH version 0.94.3: # ceph version ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) Why does this PG remain ‘stuck unclean’? Are there some steps I can take to unstick it, given that all the acting OSD are up and in? (* Re-sent, now that I’m subscribed to list *) -Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Goncalo Borges Research Computing ARC Centre of Excellence for Particle Physics at the Terascale School of Physics A28 | University of Sydney, NSW 2006 T: +61 2 93511937 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
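For reference, the commands Goncalo's suggestion boils down to, if after evaluation you decide the cluster should stop waiting for osd.116 (pg id taken from this thread; marking an OSD lost tells peering to give up on whatever updates only that OSD may have had, so it is not a step to take lightly):

# confirm what is blocking peering
ceph pg 4.2a8 query | grep -A 5 peering_blocked_by
# allow peering to proceed without osd.116
ceph osd lost 116 --yes-i-really-mean-it

As far as I know there is no corresponding 'found' command; once the host is back and the OSD is up and in again, it simply re-peers and backfills as usual.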
[ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up
I’d like to better understand the current state of my CEPH cluster. I currently have 2 PG that are in the ‘stuck unclean’ state: # ceph health detail HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean pg 4.2a8 is stuck inactive for 124516.91, current state down+peering, last acting [79,8,74] pg 4.c3 is stuck inactive since forever, current state down+peering, last acting [56,79,67] pg 4.2a8 is stuck unclean for 124536.223284, current state down+peering, last acting [79,8,74] pg 4.c3 is stuck unclean since forever, current state down+peering, last acting [56,79,67] pg 4.2a8 is down+peering, acting [79,8,74] pg 4.c3 is down+peering, acting [56,79,67] While my cluster does currently have some down OSD, none are in the acting set for either PG: ceph osd tree | grep down 73 1.0 osd.73 down 0 1.0 96 1.0 osd.96 down 0 1.0 110 1.0 osd.110 down 0 1.0 116 1.0 osd.116 down 0 1.0 120 1.0 osd.120 down 0 1.0 126 1.0 osd.126 down 0 1.0 124 1.0 osd.124 down 0 1.0 119 1.0 osd.119 down 0 1.0 I’ve queried one of the two PG, and see that recovery is currently blocked on OSD.116, which is indeed down, but is not part of the acting set of OSD for that PG: http://pastebin.com/Rg2hK9GE This is all with CEPH version 0.94.3: # ceph version ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) Why does this PG remain ‘stuck unclean’? Are there some steps I can take to unstick it, given that all the acting OSD are up and in? (* Re-sent, now that I’m subscribed to list *) -Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW pools type
.rgw.buckets are all we have as EC. The remainder are replication. Thanks, CJ On Sun, Jun 12, 2016 at 4:12 AM, Василий Ангапов <anga...@gmail.com> wrote: > Hello! > > I have a question regarding RGW pools type: what pools can be Erasure > Coded? > More exactly, I have the following pools: > > .rgw.root (EC) > ed-1.rgw.control (EC) > ed-1.rgw.data.root (EC) > ed-1.rgw.gc (EC) > ed-1.rgw.intent-log (EC) > ed-1.rgw.buckets.data (EC) > ed-1.rgw.meta (EC) > ed-1.rgw.users.keys (REPL) > ed-1.rgw.users.email (REPL) > ed-1.rgw.users.uid (REPL) > ed-1.rgw.users.swift (REPL) > ed-1.rgw.users (REPL) > ed-1.rgw.log (REPL) > ed-1.rgw.buckets.index (REPL) > ed-1.rgw.buckets.non-ec (REPL) > ed-1.rgw.usage (REPL) > > Is that ok? > > Regards, Vasily > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
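If it helps, the shape of that layout when creating the pools by hand is roughly the following (profile, pool names and PG counts are illustrative; ruleset-failure-domain is the profile key name used in the Hammer/Jewel era):

ceph osd erasure-code-profile set rgw-ec k=4 m=2 ruleset-failure-domain=host
ceph osd pool create ed-1.rgw.buckets.data 256 256 erasure rgw-ec
ceph osd pool create ed-1.rgw.buckets.index 64 64 replicated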
Re: [ceph-users] Encryption for data at rest support
Hi Swami, Yes ceph supports encryption at rest using dmcrypt. The docs are here: http://docs.ceph.com/docs/jewel/rados/deployment/ceph-deploy-osd/ My team has integrated this functionality into the ceph-osd charm also if you'd like to try that out: https://jujucharms.com/ceph-osd/xenial/2 When combined with the ceph-mon charm you're up and running fast :) -Chris On 06/02/2016 03:57 AM, M Ranga Swami Reddy wrote: > Hello, > > Can you please share if the ceph supports the "data at rest" functionality? > If yes, how can I achieve this? Please share any docs available. > > Thanks > Swami > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
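For completeness, the same thing without the charms looks roughly like this (hostnames and devices are placeholders):

# via ceph-deploy
ceph-deploy osd create --dmcrypt node1:/dev/sdb
# or directly on the OSD node
ceph-disk prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys /dev/sdc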
Re: [ceph-users] Increasing pg_num
Hi Christian, On Tue, May 17, 2016 at 10:41:52AM +0900, Christian Balzer wrote: > On Tue, 17 May 2016 10:47:15 +1000 Chris Dunlop wrote: > Most your questions would be easily answered if you did spend a few > minutes with even the crappiest test cluster and observing things (with > atop and the likes). You're right of course. I'll set up a test cluster and start experimenting, which I should have done before asking questions here. > To wit, this is a test pool (12) created with 32 PGs and slightly filled > with data via rados bench: > --- > # ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\." > drwxr-xr-x 2 root root 4096 May 17 10:04 12.13_head > drwxr-xr-x 2 root root 4096 May 17 10:04 12.1e_head > drwxr-xr-x 2 root root 4096 May 17 10:04 12.b_head > # du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/ > 121M/var/lib/ceph/osd/ceph-8/current/12.13_head/ > --- > > After increasing that to 128 PGs we get this: > --- > # ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\." > drwxr-xr-x 2 root root 4096 May 17 10:18 12.13_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.1e_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.2b_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.33_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.3e_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.4b_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.53_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.5e_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.6b_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.73_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.7e_head > drwxr-xr-x 2 root root 4096 May 17 10:18 12.b_head > # du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/ > 25M /var/lib/ceph/osd/ceph-8/current/12.13_head/ > --- > > Now this was fairly uneventful even on my crappy test cluster, given the > small amount of data (which was mostly cached) and the fact that it's idle. > > However consider this with 100's of GB per PG and a busy cluster and you > get the idea where massive and very disruptive I/O comes from. Per above, I'll experiment with this, but my first thought is I suspect that's moving object/data files around rather than copying data, so the overheads are in directory operations rather than data copies - not that directory operations are free either of course. >> Hmmm, is there a generic command-line(ish) way of determining the number >> of OSDs involved in a pool? >> > Unless you have a pool with a very small pg_num and a very large cluster > the answer usually tends to be "all of them". Or, as in my case, several completely independent pools (i.e. different OSDs) in the one cluster. > And google ("ceph number of osds per pool") is your friend: > > http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd Crap. And I was just looking at that very page yesterday, in the context of the distribution of the PGs, and completely forgot about the SUM part. Thanks for taking the time to respond. Chris. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Increasing pg_num
On Mon, May 16, 2016 at 10:40:47PM +0200, Wido den Hollander wrote: > > Op 16 mei 2016 om 7:56 schreef Chris Dunlop <ch...@onthe.net.au>: > > Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num > > should be equal to the pg_num": under what circumstances might you want > > these different, apart from when actively increasing pg_num first then > > increasing pgp_num to match? (If they're supposed to be always the same, why > > not have a single parameter and do the "increase pg_num, then pgp_num" > > within ceph's internals?) > > pg_num is the actual amount of PGs. This you can increase without any actual > data moving. > > pgp_num is the number CRUSH uses in the calculations. pgp_num can't be > greater than pg_num for that reason. OK, I understand that from the docs. But why are they two separate parameters? E.g., why might you increase pg_num and not pgp_num? Or are the two parameters purely to separate splitting the PGs (pg_num) from moving data around (pgp_num)? > You can slowly increase pgp_num to make sure not all your data moves at the > same time. Why slowly increase pgp_num rather than rely on "osd max backfills"? I.e. what downsides are there to setting "osd max backfills" as appropriate, increasing pg_num in small steps to the target, then increasing pgp_num to the target in one step? If you're slowly increasing pgp_num, is the recommendation to "increase pg_num a bit, increase pgp_num a bit, repeat till target is reached" (and thus potentially moving some data multiple times), or is the recommendation to "increase pg_num a bit step by step to the target, then increase pgp_num bit by bit to the target"? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
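Concretely, the two-step dance is just the following (pool name and numbers are examples):

ceph osd pool set rbd pg_num 2048    # split: new PGs appear, little data moves yet
ceph osd pool set rbd pgp_num 2048   # CRUSH placement changes: data starts to move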
Re: [ceph-users] Increasing pg_num
On Tue, May 17, 2016 at 08:21:48AM +0900, Christian Balzer wrote: > On Mon, 16 May 2016 22:40:47 +0200 (CEST) Wido den Hollander wrote: > > > > pg_num is the actual amount of PGs. This you can increase without any > > actual data moving. > > Yes and no. > > Increasing the pg_num will split PGs, which causes potentially massive I/O. > Also AFAIK that I/O isn't regulated by the various recovery and backfill > parameters. Where is this potentially massive I/O coming from? I have this naive concept that the PGs are mathematically-calculated buckets, so splitting them would involve little or no I/O, although I can imagine there are management overheads (cpu, memory) involved in correctly maintaining state during the splitting process. > That's probably why recent Ceph versions will only let you increase pg_num > in smallish increments. Oh, I wasn't aware of that! Ok, so it looks like it's mon_osd_max_split_count, introduced by commit d8ccd73. Unfortunately it seems to be missing from the ceph docs. It's mentioned in the Suse docs: https://www.suse.com/documentation/ses-2/singlehtml/book_storage_admin/book_storage_admin.html#storage.bp.cluster_mntc.add_pgnum ...although, if I'm understanding "mon_osd_max_split_count" correctly, their script for calculating the maximum to which you can increase pg_num is incorrect in that it's calculating "current pg_num + mon_osd_max_split_count" when it should be "current pg_num + (mon_osd_max_split_count * number of pool OSDs)". Hmmm, is there a generic command-line(ish) way of determining the number of OSDs involved in a pool? > Moving data (as in redistributing amongst the OSD based on CRUSH) will > indeed not happen until pgp_num is also increased. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
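As a rough answer to the "how many OSDs serve this pool" question, something like this works (GNU awk, pool id 12 used as an example, column layout assumed from 'pg dump pgs_brief'; untested beyond small clusters):

ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^12\./ { gsub(/[][]/, "", $3); n = split($3, a, ","); for (i = 1; i <= n; i++) osds[a[i]] = 1 } END { print length(osds) }'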
Re: [ceph-users] v0.94.7 Hammer released
On Fri, May 13, 2016 at 10:21:51AM -0400, Sage Weil wrote: > This Hammer point release fixes several minor bugs. It also includes a > backport of an improved ‘ceph osd reweight-by-utilization’ command for > handling OSDs with higher-than-average utilizations. > > We recommend that all hammer v0.94.x users upgrade. Per http://download.ceph.com/debian-hammer/pool/main/c/ceph/ ceph-common_0.94.7-1trusty_amd64.deb 11-May-2016 16:08 5959876 ceph-common_0.94.7-1xenial_amd64.deb 11-May-2016 15:54 6037236 ceph-common_0.94.7-1xenial_arm64.deb 11-May-2016 16:06 5843722 ceph-common_0.94.7-1~bpo80+1_amd64.deb 11-May-2016 16:08 6028036 Once again, no debian wheezy (~bpo70) version? Ubuntu Precise missed out this time too. Oddly, the date on the previously released wheezy version changed at the same time as the 0.94.7 releases above, it was previously 15-Dec-2015 15:32: ceph-common_0.94.5-1~bpo70+1_amd64.deb 11-May-2016 15:57 9868188 Cheers, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Increasing pg_num
Hi, I'm trying to understand the potential impact on an active cluster of increasing pg_num/pgp_num. The conventional wisdom, as gleaned from the mailing lists and general google fu, seems to be to increase pg_num followed by pgp_num, both in small increments, to the target size, using "osd max backfills" (and perhaps "osd recovery max active"?) to control the rate and thus performance impact of data movement. I'd really like to understand what's going on rather than "cargo culting" it. I'm currently on Hammer, but I'm hoping the answers are broadly applicable across all versions for others following the trail. Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num should be equal to the pg_num": under what circumstances might you want these different, apart from when actively increasing pg_num first then increasing pgp_num to match? (If they're supposed to be always the same, why not have a single parameter and do the "increase pg_num, then pgp_num" within ceph's internals?) What do "osd backfill scan min" and "osd backfill scan max" actually control? The docs say "The minimum/maximum number of objects per backfill scan" but what does this actually mean and how does it affect the impact (if at all)? Is "osd recovery max active" actually relevant to this situation? It's mentioned in various places related to increasing pg_num/pgp_num but my understanding is it's related to recovery (e.g. osd falls out and comes back again and needs to catch up) rather than back filling (migrating pgs misplaced due to increasing pg_num, crush map changes etc.) Previously (back in Dumpling days): http://article.gmane.org/gmane.comp.file-systems.ceph.user/11490 From: Gregory Farnum Subject: Re: Throttle pool pg_num/pgp_num increase impact Newsgroups: gmane.comp.file-systems.ceph.user Date: 2014-07-08 17:01:30 GMT On Tuesday, July 8, 2014, Kostis Fardelas wrote: > Should we be worried that the pg/pgp num increase on the bigger pool will > have a 300X larger impact? The impact won't be 300 times bigger, but it will be bigger. There are two things impacting your cluster here 1) the initial "split" of the affected PGs into multiple child PGs. You can mitigate this by stepping through pg_num at small multiples. 2) the movement of data to its new location (when you adjust pgp_num). This can be adjusted by setting the "OSD max backfills" and related parameters; check the docs. -Greg Am I correct thinking "small multiples" in this context is along the lines of "1.1" rather than "2" or "4"?. Is there really much impact when increasing pg_num in a single large step e.g. 1024 to 4096? If so, what causes this impact? An initial trial of increasing pg_num by 10% (1024 to 1126) on one of my pools showed it completed in a matter of tens of seconds, too short to really measure any performance impact. But I'm concerned this could be exponential to the size of the step such that increasing by a large step (e.g. the rest of the way from 1126 to 4096) could cause problems. Given the use of "osd max backfills" to limit the impact of the data movement associated with increasing pgp_num, is there any advantage or disadvantage to increasing pgp_num in small increments (e.g. 10% at a time) vs "all at once", apart from small increments likely moving some data multiple times? E.g. with a large step is there a higher potential for problems if something else happens to the cluster the same time (e.g. an OSD dies) because the current state of the system is further from the expected state, or something like that? 
If small increments of pgp_num are advisable, should the process be "increase pg_num by a small increment, increase pgp_num to match, repeat until target reached", or is that no advantage to increasing pg_num (in multiple small increments or single large step) to the target, then increasing pgp_num in small increments to the target - and why? Given that increasing pg_num/pgp_num seem almost inevitable for a growing cluster, and that increasing these can be one of the most performance-impacting operations you can perform on a cluster, perhaps a document going into these details would be appropriate? Cheers, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
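To make the "small increments" question concrete, the procedure most answers seem to converge on looks something like this as a script (pool name, step sizes and target are examples; whether each step is accepted also depends on mon_osd_max_split_count, and the script is untested as written):

# throttle data movement first
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# walk pg_num up to the target in steps
for n in 1280 1536 2048 2560 3072 3584 4096; do
    ceph osd pool set rbd pg_num $n
    # crude wait for the new PGs to finish creating/peering before the next step
    while ceph -s | grep -qE 'creating|peering'; do sleep 60; done
done
# then let the data actually move (in one go, or in similar steps)
ceph osd pool set rbd pgp_num 4096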
Re: [ceph-users] Maximum MON Network Throughput Requirements
Mons and RGWs only use the public network, but Mons can have a good deal of traffic. I would not recommend 1Gb; if you are looking for lower bandwidth, 10Gb would be good for most. It all depends on the overall size of the cluster. You mentioned 40Gb. If the nodes are high density then 40Gb, but if they are lower density then 20Gb would be fine. -CJ On Mon, May 2, 2016 at 12:09 PM, Brady Deetz <bde...@gmail.com> wrote: > I'm working on finalizing designs for my Ceph deployment. I'm currently > leaning toward 40gbps ethernet for interconnect between OSD nodes and to my > MDS servers. But, I don't really want to run 40 gig to my mon servers > unless there is a reason. Would there be an issue with using 1 gig on my > monitor servers? > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] adding cache tier in productive hammer environment
Hi Oliver, Have you tried tuning some of the cluster settings to fix the IO errors in the VMs? We found some of the same issues when reweighting, backfilling and removing large snapshots. By minimizing the number of concurrent backfills and prioritizing client IO we can now add/remove OSDs without the VMs throwing those nasty IO errors. We have been running a 3 node cluster for about a year now on Hammer with 45 2TB SATA OSDs and no SSDs. It's backing KVM hosts and RBD images. Here are the things we changed: ceph tell osd.* injectargs '--osd-max-backfills 1' ceph tell osd.* injectargs '--osd-max-recovery-threads 1' ceph tell osd.* injectargs '--osd-recovery-op-priority 1' ceph tell osd.* injectargs '--osd-client-op-priority 63' ceph tell osd.* injectargs '--osd-recovery-max-active 1' ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1' Recovery may take a little longer while backfilling, but the cluster is still responsive and we have happy VMs now. I've collected these from various posts from the ceph-users list. Maybe they will help you if you haven't tried them already. Chris On 2016-04-07 4:18 am, Oliver Dzombic wrote: > Hi Christian, > > thank you for answering, i appriciate your time ! > > --- > > Its used for RBD hosted vm's and also cephfs hosted vm's. > > Well the basic problem is/was that single OSD's simply go out/down. > Ending in SATA BUS error's for the VM's which have to be rebooted, if > they anyway can, because as long as OSD's are missing in that szenario, > the customer cant start their vm's. > > Installing/checking munin discovered a very high drive utilization. And > this way simply an overload of the cluster. > > The initial setup was 4 nodes, with 4x mon and each 3x 6 TB HDD and 1x > SSD for journal. > > So i started to add more OSD's ( 2 nodes, with 3x 6 TB HDD and 1x SSD > for journal ). And, as first aid, reducing the replication from 3 to 2 > to reduce the (write) load of the cluster. > > I planed to wait until the new LTS is out, but i already added now > another node with 10x 3 TB HDD and 2x SSD for journal and 2-3x SSD for > tier cache ( changing strategy and increasing the number of drives while > reducing the size - was an design mistake from me ). > > osdmap e31602: 28 osds: 28 up, 28 in > flags noscrub,nodeep-scrub > pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects > 39270 GB used, 88290 GB / 124 TB avail > 1428 active+clean > > The range goes from 200 op/s to around 5000 op/s. > > The current avarage drive utilization is 20-30%. > > If we have backfill ( osd out/down ) or reweight the utilization of HDD > drives is streight 90-100%. > > Munin shows on all drives ( except the SSD's ) a dislatency of avarage > 170 ms. A minumum of 80-130 ms, and a maximum of 300-600ms. > > Currently, the 4 initial nodes are in datacenter A and the 3 other nodes > are, together with most of the VM's in datacenter B. > > I am currently cleaning the 4 initial nodes by doing > > ceph osd reweight to peut a peut reducing the usage, to remove the osd's > completely from there and just keeping up the monitors. > > The complete cluster have to move to one single datacenter together with > all VM's. > > --- > > I am reducing the number of nodes because out of administrative view, > its not very handy. I prefere extending the hardware power in terms of > CPU, RAM and HDD. 
> > So the endcluster will look like: > > 3x OSD Nodes, each: > > 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit Network, Adaptec HBA 1000-16e > to connect to external JBOD servers holding the cold storage HDD's. > Maybe ~ 24 drives in 2 or 3 TB SAS or SATA 7200 RPM's. > > I think SAS is, because of the reduces access times ( 4/5 ms vs. 10 ms ) > very useful in a ceph environment. But then again, maybe with a cache > tier the impact/difference is not really that big. > > That together with Samsung SM863 240 GB SSD's for journal and cache > tier, connected to the board directly or to a seperated Adaptec HBA > 1000-16i. > > So far the current idea/theory/plan. > > --- > > But to that point, its a long road. Last night i was doing a reweight of > 3 OSD's from 1.0 to 0.9 ending up in one hdd was going down/out, so i > had to restart the osd. ( with again IO errors in some of the vm's ). > > So based on your article, the cache tier solved your problem, and i > think i have basically the same. > > --- > > So a very good hint is, to activate the whole tier cache in the night, > when things are a bit more smooth. > > Any suggestions / critics / advices are highly welcome :-) > > Thank you! > > -- > Mit freundlichen Gruessen / Best regards > > Oliver Dzombic > IP-Inter
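For the tier-cache part of the plan, the basic wiring on Hammer looks roughly like this (pool names and the size limit are placeholders; cache-mode, hit_set and eviction settings need tuning for the workload before it goes anywhere near production):

ceph osd pool create rbd-cache 128 128 replicated
ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd rbd-cache
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache target_max_bytes 500000000000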
[ceph-users] OSD mounts without BTRFS compression
Hello all, Please can someone offer some advice. In ceph.conf, I use: osd_mkfs_type = btrfs osd_mount_options_btrfs = noatime,nodiratime,compress-force=lzo filestore btrfs snap = false However, some of my OSDs are becoming much more full than others, as not all are being mounted with the compress-force option. Is this a CEPH issue or a BTRFS issue? Or other? Take one host. Note that sdc1 is mounted twice, neither of which has the compress-force option. /dev/sdc1 on /var/lib/ceph/tmp/mnt.AywYKY type btrfs (rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/) /dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/) /dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs (rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/) /dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/) After a reboot, it's sdd1 this time. /dev/sdd1 on /var/lib/ceph/tmp/mnt.kWh2NA type btrfs (rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/) /dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs (rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/) /dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/) /dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/) Where should I look next? I'm on 0.94.6 Thanks in advance, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
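As a stop-gap while chasing the root cause, the affected OSD mounts can be checked and fixed up by hand (paths taken from the message above; note that only data written after the remount gets compressed, and I would not run a recompressing defragment against a live OSD without testing it first):

# what is the OSD actually mounted with?
findmnt -no OPTIONS /var/lib/ceph/osd/ceph-15
# add the missing option without unmounting
mount -o remount,compress-force=lzo /var/lib/ceph/osd/ceph-15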
Re: [ceph-users] v0.94.6 Hammer released
On Wed, Mar 23, 2016 at 01:22:45AM +0100, Loic Dachary wrote: > On 23/03/2016 01:12, Chris Dunlop wrote: >> On Wed, Mar 23, 2016 at 01:03:06AM +0100, Loic Dachary wrote: >>> On 23/03/2016 00:39, Chris Dunlop wrote: >>>> "The old OS'es" that were being supported up to v0.94.5 includes debian >>>> wheezy. It would be quite surprising and unexpected to drop support for an >>>> OS in the middle of a stable series. >>> >>> I'm unsure if wheezy is among the old OS'es. It predates my involvement in >>> the stable releases effort. I know for sure el6 and 12.04 are supported for >>> 0.94.x. >> >> From http://download.ceph.com/debian-hammer/pool/main/c/ceph/ >> >> ceph-common_0.94.1-1~bpo70+1_i386.deb 15-Dec-2015 15:32 >>10217628 >> ceph-common_0.94.3-1~bpo70+1_amd64.deb 19-Oct-2015 18:54 >> 9818964 >> ceph-common_0.94.4-1~bpo70+1_amd64.deb 26-Oct-2015 20:48 >> 9868020 >> ceph-common_0.94.5-1~bpo70+1_amd64.deb 15-Dec-2015 15:32 >> 9868188 >> >> That's all debian wheezy. >> >> (Huh. I'd never noticed 0.94.1 was i386 only!) >> > > Indeed. Were these packages created as a lucky side effect or because there > was a commitment at some point ? I'm curious to know the answer as well :-) Who would know? Sage? (cc'ed) Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.6 Hammer released
Hi Loïc, On Wed, Mar 23, 2016 at 01:03:06AM +0100, Loic Dachary wrote: > On 23/03/2016 00:39, Chris Dunlop wrote: >> "The old OS'es" that were being supported up to v0.94.5 includes debian >> wheezy. It would be quite surprising and unexpected to drop support for an >> OS in the middle of a stable series. > > I'm unsure if wheezy is among the old OS'es. It predates my involvement in > the stable releases effort. I know for sure el6 and 12.04 are supported for > 0.94.x. From http://download.ceph.com/debian-hammer/pool/main/c/ceph/ ceph-common_0.94.1-1~bpo70+1_i386.deb 15-Dec-2015 15:32 10217628 ceph-common_0.94.3-1~bpo70+1_amd64.deb 19-Oct-2015 18:54 9818964 ceph-common_0.94.4-1~bpo70+1_amd64.deb 26-Oct-2015 20:48 9868020 ceph-common_0.94.5-1~bpo70+1_amd64.deb 15-Dec-2015 15:32 9868188 That's all debian wheezy. (Huh. I'd never noticed 0.94.1 was i386 only!) Cheers, Chris, OnTheNet ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.6 Hammer released
Hi Loïc, On Wed, Mar 23, 2016 at 12:14:27AM +0100, Loic Dachary wrote: > On 22/03/2016 23:49, Chris Dunlop wrote: >> Hi Stable Release Team for v0.94, >> >> Let's try again... Any news on a release of v0.94.6 for debian wheezy >> (bpo70)? > > I don't think publishing a debian wheezy backport for v0.94.6 is planned. > Maybe it's a good opportunity to initiate a community effort ? Would you like > to work with me on this ? It's my understanding, from statements by both Sage and yourself, that existing OS'es would continue to be supported in the stable series, e.g.: On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: > I think you misread what Sage wrote : "The intention was to continue > building stable releases (0.94.x) on the old list of supported platforms > (which inclues 12.04 and el6)". In other words, the old OS'es are still > supported. Their absence is a glitch in the release process that will be > fixed. "The old OS'es" that were being supported up to v0.94.5 includes debian wheezy. It would be quite surprising and unexpected to drop support for an OS in the middle of a stable series. If that is indeed what's happening, and it's not just an oversight, I'd prefer to put my efforts into moving to a supported OS rather than keeping the older OS on life support. Just to be clear, I understand it is quite a burden maintaining releases for old OSes, I'm only voicing mild surprise and a touch of regret: I'm very happy with the Ceph project! Cheers, Chris, OnTheNet ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.6 Hammer released
Hi Stable Release Team for v0.94, Let's try again... Any news on a release of v0.94.6 for debian wheezy (bpo70)? Cheers, Chris On Thu, Mar 17, 2016 at 12:43:15PM +1100, Chris Dunlop wrote: > Hi Chen, > > On Thu, Mar 17, 2016 at 12:40:28AM +, Chen, Xiaoxi wrote: >> It’s already there, in >> http://download.ceph.com/debian-hammer/pool/main/c/ceph/. > > I can only see ceph*_0.94.6-1~bpo80+1_amd64.deb there. Debian wheezy would > be bpo70. > > Cheers, > > Chris > >> On 3/17/16, 7:20 AM, "Chris Dunlop" <ch...@onthe.net.au> wrote: >> >>> Hi Stable Release Team for v0.94, >>> >>> On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote: >>>> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: >>>>> I think you misread what Sage wrote : "The intention was to >>>>> continue building stable releases (0.94.x) on the old list of >>>>> supported platforms (which inclues 12.04 and el6)". In other >>>>> words, the old OS'es are still supported. Their absence is a >>>>> glitch in the release process that will be fixed. >>>> >>>> Any news on a release of v0.94.6 for debian wheezy? >>> >>> Any news on a release of v0.94.6 for debian wheezy? >>> >>> Cheers, >>> >>> Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.6 Hammer released
Hi Stable Release Team for v0.94, On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote: > On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: >> I think you misread what Sage wrote : "The intention was to >> continue building stable releases (0.94.x) on the old list of >> supported platforms (which inclues 12.04 and el6)". In other >> words, the old OS'es are still supported. Their absence is a >> glitch in the release process that will be fixed. > > Any news on a release of v0.94.6 for debian wheezy? Any news on a release of v0.94.6 for debian wheezy? Cheers, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.6 Hammer released
Hi Chen, On Thu, Mar 17, 2016 at 12:40:28AM +, Chen, Xiaoxi wrote: > It’s already there, in > http://download.ceph.com/debian-hammer/pool/main/c/ceph/. I can only see ceph*_0.94.6-1~bpo80+1_amd64.deb there. Debian wheezy would be bpo70. Cheers, Chris > On 3/17/16, 7:20 AM, "Chris Dunlop" <ch...@onthe.net.au> wrote: > >> Hi Stable Release Team for v0.94, >> >> On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote: >>> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: >>>> I think you misread what Sage wrote : "The intention was to >>>> continue building stable releases (0.94.x) on the old list of >>>> supported platforms (which inclues 12.04 and el6)". In other >>>> words, the old OS'es are still supported. Their absence is a >>>> glitch in the release process that will be fixed. >>> >>> Any news on a release of v0.94.6 for debian wheezy? >> >> Any news on a release of v0.94.6 for debian wheezy? >> >> Cheers, >> >> Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.6 Hammer released
Hi Loic, On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: > I think you misread what Sage wrote : "The intention was to > continue building stable releases (0.94.x) on the old list of > supported platforms (which inclues 12.04 and el6)". In other > words, the old OS'es are still supported. Their absence is a > glitch in the release process that will be fixed. Any news on a release of v0.94.6 for debian wheezy? Cheers, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Restrict cephx commands
Hey Ceph Users! I'm wondering if it's possible to restrict the ceph keyring to only being able to run certain commands. I think the answer to this is no but I just wanted to ask. I haven't seen any documentation indicating whether or not this is possible. Anyone know? Thanks, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
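As far as I know the answer is: not per command, but cephx caps can restrict a key to particular daemons, pools and read/write/execute access, which covers most use cases (I vaguely recall mon caps also have an 'allow command ...' form for whitelisting specific monitor commands, but check the docs for your release). Hypothetical examples:

ceph auth get-or-create client.monitoring mon 'allow r'
ceph auth get-or-create client.rbd-user mon 'allow r' osd 'allow rwx pool=rbd'
# tighten caps on an existing key later
ceph auth caps client.rbd-user mon 'allow r' osd 'allow r pool=rbd'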
Re: [ceph-users] v0.94.6 Hammer released
Hi, The "old list of supported platforms" includes debian wheezy. Will v0.94.6 be built for this? Chris On Mon, Feb 29, 2016 at 10:57:53AM -0500, Sage Weil wrote: > The intention was to continue building stable releases (0.94.x) on the old > list of supported platforms (which inclues 12.04 and el6). I think it was > just an oversight that they weren't built this time around. I the > overhead to doing so is just keeping a 12.04 and el6 jenkins build slave > around. > > Doing this builds in the existing environment sounds much better than > trying to pull in externally built binaries... > > sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Another corruption detection/correction question - exposure between 'event' and 'repair'?
After messing up some of my data in the past (my own doing, playing with BTRFS in old kernels), I've been extra cautious and now run a ZFS mirror across multiple RBD images. It's led me to believe that I have a faulty SSD in one of my hosts: sdb without a journal - fine (but slow) sdc without a journal - fine (but slow) sdd without a journal - fine (but slow) sdb with sda4 as journal - checksum errors appear in ZFS sdc with sda5 as journal - checksum errors appear in ZFS sdd with sda6 as journal - checksum errors appear in ZFS So, I believe the SSD in sda is in some way defective, but my question is around the detection and correction of this 'corruption'. "nodeep-scrub flag(s) set" currently, due to the performance impact. But, if it were set, it seems to find problems, which I can then repair. However ... is this a safe repair, using a good copy each object? Will it be with NewStore? I still seem to get errors regularly bubbling their way up into ZFS, but I can't reliably ascertain whether they're the result of a corruption which has happened *before* the next Ceph deep scrub (therefore still exposed anyway in this timeframe?), or is *after* a repair? I'm obviously hoping for an eventual scenario where this is all transparent to the ZFS layer and it stops detecting checksum errors :) Thanks, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
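For reference, the knobs involved on the Ceph side are below (pg id is a placeholder; my understanding is that on replicated filestore pools 'repair' essentially trusts the primary's copy rather than voting between replicas, which is exactly the caution being raised here):

# re-enable deep scrubbing once the performance hit is acceptable
ceph osd unset nodeep-scrub
# scrub / repair a specific placement group
ceph pg deep-scrub <pgid>
ceph pg repair <pgid>
# see which PGs are currently flagged
ceph health detail | grep inconsistent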
Re: [ceph-users] pg stuck in peering state
Hi Reno, "Peering", as far as I understand it, is the osds trying to talk to each other. You have approximately 1 OSD worth of pgs stuck (i.e. 264 / 8), and osd.0 appears in each of the stuck pgs, alongside either osd.2 or osd.3. I'd start by checking the comms between osd.0 and osds 2 and 3 (including the MTU). Cheers, Chris On Fri, Dec 18, 2015 at 02:50:18PM +0100, Reno Rainz wrote: > Hi all, > > I reboot all my osd node after, I got some pg stuck in peering state. > > root@ceph-osd-3:/var/log/ceph# ceph -s > cluster 186717a6-bf80-4203-91ed-50d54fe8dec4 > health HEALTH_WARN > clock skew detected on mon.ceph-osd-2 > 33 pgs peering > 33 pgs stuck inactive > 33 pgs stuck unclean > Monitor clock skew detected > monmap e1: 3 mons at {ceph-osd-1= > 10.200.1.11:6789/0,ceph-osd-2=10.200.1.12:6789/0,ceph-osd-3=10.200.1.13:6789/0 > } > election epoch 14, quorum 0,1,2 ceph-osd-1,ceph-osd-2,ceph-osd-3 > osdmap e66: 8 osds: 8 up, 8 in > pgmap v1346: 264 pgs, 3 pools, 272 MB data, 653 objects > 808 MB used, 31863 MB / 32672 MB avail > 231 active+clean > 33 peering > root@ceph-osd-3:/var/log/ceph# > > > root@ceph-osd-3:/var/log/ceph# ceph pg dump_stuck > ok > pg_stat state up up_primary acting acting_primary > 4.2d peering [2,0] 2 [2,0] 2 > 1.57 peering [3,0] 3 [3,0] 3 > 1.24 peering [3,0] 3 [3,0] 3 > 1.52 peering [0,2] 0 [0,2] 0 > 1.50 peering [2,0] 2 [2,0] 2 > 1.23 peering [3,0] 3 [3,0] 3 > 4.54 peering [2,0] 2 [2,0] 2 > 4.19 peering [3,0] 3 [3,0] 3 > 1.4b peering [0,3] 0 [0,3] 0 > 1.49 peering [0,3] 0 [0,3] 0 > 0.17 peering [0,3] 0 [0,3] 0 > 4.17 peering [0,3] 0 [0,3] 0 > 4.16 peering [0,3] 0 [0,3] 0 > 0.10 peering [0,3] 0 [0,3] 0 > 1.11 peering [0,2] 0 [0,2] 0 > 4.b peering [0,2] 0 [0,2] 0 > 1.3c peering [0,3] 0 [0,3] 0 > 0.c peering [0,3] 0 [0,3] 0 > 1.3a peering [3,0] 3 [3,0] 3 > 0.38 peering [2,0] 2 [2,0] 2 > 1.39 peering [0,2] 0 [0,2] 0 > 4.33 peering [2,0] 2 [2,0] 2 > 4.62 peering [2,0] 2 [2,0] 2 > 4.3 peering [0,2] 0 [0,2] 0 > 0.6 peering [0,2] 0 [0,2] 0 > 0.4 peering [2,0] 2 [2,0] 2 > 0.3 peering [2,0] 2 [2,0] 2 > 1.60 peering [0,3] 0 [0,3] 0 > 0.2 peering [3,0] 3 [3,0] 3 > 4.6 peering [3,0] 3 [3,0] 3 > 1.30 peering [0,3] 0 [0,3] 0 > 1.2f peering [0,2] 0 [0,2] 0 > 1.2a peering [3,0] 3 [3,0] 3 > root@ceph-osd-3:/var/log/ceph# > > > root@ceph-osd-3:/var/log/ceph# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -9 4.0 root default > -8 4.0 region eu-west-1 > -6 2.0 datacenter eu-west-1a > -2 2.0 host ceph-osd-1 > 0 1.0 osd.0 up 1.0 1.0 > 1 1.0 osd.1 up 1.0 1.0 > -4 2.0 host ceph-osd-3 > 4 1.0 osd.4 up 1.0 1.0 > 5 1.0 osd.5 up 1.0 1.0 > -7 2.0 datacenter eu-west-1b > -3 2.0 host ceph-osd-2 > 2 1.0 osd.2 up 1.0 1.0 > 3 1.0 osd.3 up 1.0 1.0 > -5 2.0 host ceph-osd-4 > 6 1.0 osd.6 up 1.0 1.0 > 7 1.0 osd.7 up 1.0 1.0 > root@ceph-osd-3:/var/log/ceph# > > Do you have guys any idea ? Why they stay in this state ? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
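A quick way to test the large-packet case between the OSD hosts (assuming a 9000-byte MTU; 8972 = 9000 minus 28 bytes of IP/ICMP headers, adjust to whatever MTU the cluster is supposed to be running):

# from the host carrying osd.0, towards the hosts carrying osd.2 and osd.3
ping -M do -s 8972 -c 3 <osd2-host>
ping -M do -s 8972 -c 3 <osd3-host>

If ordinary pings work but these fail or hang, a mismatched MTU somewhere in the path is a likely culprit.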
Re: [ceph-users] Deploying a Ceph storage cluster using Warewulf on Centos-7
Hi Chu, If you can use Chef then: https://github.com/ceph/ceph-chef An example of an actual project can be found at: https://github.com/bloomberg/chef-bcs Chris On Wed, Sep 23, 2015 at 4:11 PM, Chu Ruilin <ruilin...@gmail.com> wrote: > Hi, all > > I don't know which automation tool is best for deploying Ceph and I'd like > to know about. I'm comfortable with Warewulf since I've been using it for > HPC clusters. I find it quite convenient for Ceph too. I wrote a set of > scripts that can deploy a Ceph cluster quickly. Here is how I did it just > using virtualbox: > > > http://ruilinchu.blogspot.com/2015/09/deploying-ceph-storage-cluster-using.html > > comments are welcome! > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Best Regards, Chris Jones cjo...@cloudm2.com (p) 770.655.0770 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs: large files hang
Hi Bryan, Have you checked your MTUs? I was recently bitten by large packets not getting through where small packets would. (This list, Dec 14, "All pgs stuck peering".) Small files working but big files not working smells like it could be a similar problem. Cheers, Chris On Thu, Dec 17, 2015 at 07:43:54PM +, Bryan Wright wrote: > Hi folks, > > This is driving me crazy. I have a ceph filesystem that behaves normally > when I "ls" files, and behaves normally when I copy smallish files on or off > of the filesystem, but large files (~ GB size) hang after copying a few > megabytes. > > This is ceph 0.94.5 under Centos 6.7 under kernel 4.3.3-1.el6.elrepo.x86_64. > I've tried 64-bit and 32-bit clients with several different kernels, but > all behave the same. > > After copying the first few bytes I get a stream of "slow request" messages > for the osds, like this: > > 2015-12-17 14:20:40.458306 osd.208 [WRN] slow request 1922.166564 seconds > old, received at 2015-12-17 13:48:38.291683: osd_op(mds.0.14956:851 > 100010a7b92.000d [stat] 0.5d427a9a RETRY=5 > ack+retry+read+rwordered+known_if_redirected e193868) currently reached_pg > > It's not a single OSD misbehaving. It seems to be any OSD. The OSDs have > plenty of disk space, and there's nothing in the osd logs that points to a > problem. > > How can I find out what's blocking these requests? > > Bryan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All pgs stuck peering
On Mon, Dec 14, 2015 at 09:29:20PM +0800, Jaze Lee wrote: > Should we add big packet test in heartbeat? Right now the heartbeat > only test the little packet. If the MTU is mismatched, the heartbeat > can not find that. It would certainly have saved me a great deal of stress! I imagine you wouldn't want it doing a big packet test every heartbeat, perhaps every 10th or some configurable number. Something for the developers to consider? (cc'ed) Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
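Until something like that exists in the heartbeat itself, a crude operational stopgap is a periodic large-packet probe between the storage hosts. A minimal sketch, assuming a 9000-byte MTU and that the hostnames below are replaced with your own OSD hosts; run it from cron every few minutes and alert on the log message:

#!/bin/sh
# probe each peer with a full-MTU, non-fragmenting ping
# 8972-byte payload + 28 bytes of ICMP/IP headers = 9000-byte frame
for host in ceph-osd-1 ceph-osd-2 ceph-osd-3; do
    ping -c 1 -W 2 -M do -s 8972 "$host" >/dev/null 2>&1 \
        || logger -t mtu-probe "large packets to $host are being dropped"
done

It is no substitute for the heartbeat doing it, but it would have caught the mismatch described later in this thread as soon as it appeared.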
[ceph-users] All pgs stuck peering
Hi, ceph 0.94.5 After restarting one of our three osd hosts to increase the RAM and change from linux 3.18.21 to 4.1., the cluster is stuck with all pgs peering: # ceph -s cluster c6618970-0ce0-4cb2-bc9a-dd5f29b62e24 health HEALTH_WARN 3072 pgs peering 3072 pgs stuck inactive 3072 pgs stuck unclean 1450 requests are blocked > 32 sec noout flag(s) set monmap e9: 3 mons at {b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0} election epoch 74462, quorum 0,1,2 b2,b4,b5 osdmap e356963: 59 osds: 59 up, 59 in flags noout pgmap v69385733: 3072 pgs, 3 pools, 11973 GB data, 3340 kobjects 31768 GB used, 102 TB / 133 TB avail 3072 peering What can I do to diagnose (or better yet, fix!) this? Downgrading back to 3.18.21 hasn't helped. Each host (now) has 192G RAM. One has 17 osds, the other two have 21 osds each. I can see there's traffic going between the osd ports on the various osd hosts, but all small packets (122 or 131 bytes). Just prior to upgrading this osd host another one had also been upgraded (RAM + linux). The cluster had no trouble at that point and was healthy within a few minutes of that server starting up. The cluster has been working fine for years up to now, having had rolling upgrades since dumpling. Cheers, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
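For the diagnosis question, the usual first steps are to ask the cluster which PGs are stuck and then query one of them directly. A sketch (the pg id is a placeholder; use one reported as stuck on your cluster):

# ceph health detail
# ceph pg dump_stuck inactive
# ceph pg <pgid> query

health detail names the stuck PGs and the blocked requests, dump_stuck inactive shows which OSDs each stuck PG maps to (useful for spotting a common OSD or host), and pg query reports the peering state machine for that PG, including which OSDs it is still probing and whether anything is listed in peering_blocked_by. The next message in this thread shows the tail end of exactly that kind of query output.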
Re: [ceph-users] All pgs stuck peering
"up_primary": 0 }, { "first": 356959, "last": 356963, "maybe_went_rw": 1, "up": [ 6, 0 ], "acting": [ 6, 0 ], "primary": 6, "up_primary": 6 }, { "first": 356964, "last": 357025, "maybe_went_rw": 1, "up": [ 0 ], "acting": [ 0 ], "primary": 0, "up_primary": 0 }, { "first": 357026, "last": 357026, "maybe_went_rw": 0, "up": [], "acting": [], "primary": -1, "up_primary": -1 }, { "first": 357027, "last": 357041, "maybe_went_rw": 1, "up": [ 0 ], "acting": [ 0 ], "primary": 0, "up_primary": 0 }, { "first": 357042, "last": 357081, "maybe_went_rw": 1, "up": [ 6, 0 ], "acting": [ 6, 0 ], "primary": 6, "up_primary": 6 }, { "first": 357082, "last": 357082, "maybe_went_rw": 0, "up": [ 6 ], "acting": [ 6 ], "primary": 6, "up_primary": 6 }, { "first": 357083, "last": 357088, "maybe_went_rw": 0, "up": [ 6, 0 ], "acting": [ 6, 0 ], "primary": 6, "up_primary": 6 }, { "first": 357089, "last": 357089, "maybe_went_rw": 0, "up": [ 0 ], "acting": [ 0 ], "primary": 0, "up_primary": 0 }, { "first": 357090, "last": 357167, "maybe_went_rw": 1, "up": [ 6, 0 ], "acting": [ 6, 0 ], "primary": 6, "up_primary": 6 }, { "first": 357168, "last": 357217, "maybe_went_rw": 1, "up": [ 0 ], "acting": [ 0 ], "primary": 0, "up_primary": 0 } ], "probing_osds": [ "0", "6" ], "down_osds_we_would_probe": [], "peering_blocked_by": [] }, { "name": "Started", "enter_time": "2015-12-14 12:54:41.084717" } ], "agent_state": {} } Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com