Re: [ceph-users] Encryption questions
On Fri, Jan 11, 2019 at 11:24 AM Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:
> Thanks for the answers, guys!
>
> Am I right to assume msgr2 (http://docs.ceph.com/docs/mimic/dev/msgr2/)
> will provide encryption between Ceph daemons as well as between clients
> and daemons?
>
> Does anybody know if it will be available in Nautilus?

That's the intention; people are scrambling a bit to get it in soon enough to validate before the release.

> On Fri, Jan 11, 2019 at 8:10 AM Tobias Florek wrote:
>> Hi,
>>
>> as others pointed out, traffic in Ceph is unencrypted (internal traffic
>> as well as client traffic). I usually advise setting up IPsec or,
>> nowadays, WireGuard connections between all hosts. That takes care of
>> any traffic going over the wire, including Ceph.
>>
>> Cheers,
>> Tobias Florek

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
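As an aside for readers following Tobias's advice: until msgr2 lands, a host-to-host WireGuard tunnel is one way to encrypt Ceph traffic on the wire. The sketch below is a minimal two-host setup; every interface name, address, port, and key placeholder is an assumption for illustration, not something from the thread.

```shell
# Sketch: interim encryption of Ceph traffic with WireGuard.
# Addresses, ports, hostnames, and keys are placeholders.
umask 077
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key

cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
# This host's tunnel address; run the Ceph public/cluster network over it.
Address = 10.99.0.1/24
ListenPort = 51820
PrivateKey = <contents of private.key>

[Peer]
# One [Peer] section per other Ceph host.
PublicKey = <peer's public.key>
Endpoint = peer1.example.com:51820
AllowedIPs = 10.99.0.2/32
EOF

systemctl enable --now wg-quick@wg0
```

Ceph's mon/osd addresses would then be configured on the 10.99.0.0/24 tunnel network so all inter-daemon and client traffic traverses the encrypted link.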
Re: [ceph-users] cephfs kernel client instability
Hi Ilya,

thank you for the clarification. After setting "osd_map_messages_max" to 10, the I/O errors and the MDS error "MDS_CLIENT_LATE_RELEASE" are gone. The "mon session lost, hunting for new mon" messages didn't go away... could it be that this is related to https://tracker.ceph.com/issues/23537 ?

Best,
Martin

On Thu, Jan 24, 2019 at 10:16 PM Ilya Dryomov wrote:
>
> On Thu, Jan 24, 2019 at 6:21 PM Andras Pataki wrote:
> >
> > Hi Ilya,
> >
> > Thanks for the clarification - very helpful.
> > I've lowered osd_map_messages_max to 10, and this resolves the issue
> > about the kernel being unhappy about large messages when the OSDMap
> > changes. One comment here though: you mentioned that Luminous uses 40
> > as the default, which is indeed the case. The documentation for
> > Luminous (and master), however, says that the default is 100.
>
> Looks like that page hasn't been kept up to date. I'll fix that
> section.
>
> > One other follow-up question on the kernel client about something I've
> > been seeing while testing. Does the kernel client clean up when the MDS
> > asks due to cache pressure? On a machine I ran something that touches a
> > lot of files, so the kernel client accumulated over 4 million caps.
> > Many hours after all the activity finished (i.e. many hours after
> > anything accesses ceph on that node) the kernel client still holds
> > millions of caps, and the MDS periodically complains about clients not
> > responding to cache pressure. How is this supposed to be handled?
> > Obviously asking the kernel to drop caches via /proc/sys/vm/drop_caches
> > does a very thorough cleanup, but something in the middle would be better.
>
> The kernel client sitting on way too many caps for way too long is
> a long-standing issue. Adding Zheng, who has recently been doing some
> work to facilitate cap releases and put a limit on the overall cap
> count.
>
> Thanks,
> Ilya
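For reference, the tuning Martin describes can be made persistent or injected at runtime. A sketch follows; note the option is spelled osd_map_message_max (singular "message") in the config system, and the value 10 is the one used in this thread -- adjust for your cluster size.

```shell
# Persistent: set on the monitors in ceph.conf, then restart them.
#   [global]
#   osd_map_message_max = 10

# Or inject at runtime without a restart (sketch):
ceph tell mon.* injectargs '--osd_map_message_max=10'

# Verify on one of the mon hosts via the admin socket:
ceph daemon mon.$(hostname -s) config get osd_map_message_max
```

Lowering this caps how many osdmaps the mon bundles into a single message, which is what kept the kernel client under its 16 MB front-message limit here.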
Re: [ceph-users] krbd reboot hung
Looks like your network deactivated before the rbd volume was unmounted. This is a known issue without a good programmatic workaround, and you'll need to adjust your configuration.

On Tue, Jan 22, 2019 at 9:17 AM Gao, Wenjun wrote:
> I'm using krbd to map an rbd device to a VM. It appears that when the
> device is mounted, an OS reboot will hang for more than 7 minutes; on
> bare metal it can be more than 15 minutes. Even with the latest 5.0.0
> kernel the problem still occurs.
>
> Here are the console logs with the 4.15.18 kernel and a mimic rbd
> client; the reboot seems to be stuck in the rbd umount operation:
>
> [ OK ] Stopped Update UTMP about System Boot/Shutdown.
> [ OK ] Stopped Create Volatile Files and Directories.
> [ OK ] Stopped target Local File Systems.
>        Unmounting /run/user/110281572...
>        Unmounting /var/tmp...
>        Unmounting /root/test...
>        Unmounting /run/user/78402...
>        Unmounting Configuration File System...
> [ OK ] Stopped Configure read-only root support.
> [ OK ] Unmounted /var/tmp.
> [ OK ] Unmounted /run/user/78402.
> [ OK ] Unmounted /run/user/110281572.
> [ OK ] Stopped target Swap.
> [ OK ] Unmounted Configuration File System.
> [ 189.919062] libceph: mon4 XX.XX.XX.XX:6789 session lost, hunting for new mon
> [ 189.950085] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 189.950764] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 190.687090] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 190.694197] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 191.711080] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 191.745254] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 193.695065] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 193.727694] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 197.087076] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 197.121077] libceph: mon4 XX.XX.XX.XX:6789 connect error
> [ 197.663082] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 197.680671] libceph: mon4 XX.XX.XX.XX:6789 connect error
> [ 198.687122] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 198.719253] libceph: mon4 XX.XX.XX.XX:6789 connect error
> [ 200.671136] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 200.702717] libceph: mon4 XX.XX.XX.XX:6789 connect error
> [ 204.703115] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 204.736586] libceph: mon4 XX.XX.XX.XX:6789 connect error
> [ 209.887141] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 209.918721] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 210.719078] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 210.750378] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 211.679118] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 211.712246] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 213.663116] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 213.696943] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 217.695062] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 217.728511] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 225.759109] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 225.775869] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 233.951062] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 233.951997] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 234.719114] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 234.720083] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 235.679112] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 235.680060] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 237.663088] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 237.664121] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 241.695082] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 241.696500] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 249.823095] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 249.824101] libceph: mon3 XX.XX.XX.XX:6789 connect error
> [ 264.671119] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 264.672102] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 265.695109] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 265.696106] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 266.719145] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 266.720204] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 268.703121] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 268.704110] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 272.671115] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 272.672159] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 281.055087] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 281.056577] libceph: mon0 XX.XX.XX.XX:6789 connect error
> [ 294.879098] libceph: connect XX.XX.XX.XX:6789 error -101
> [ 294.88…
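For what it's worth, the usual configuration adjustment Greg alludes to is to make systemd tear the mount down while the network is still up. A sketch follows; the device, mountpoint, and filesystem are placeholders, not values from the thread.

```shell
# Mark the rbd-backed mount as network-dependent in /etc/fstab, so
# systemd orders its unmount before network shutdown:
#   /dev/rbd0  /root/test  ext4  defaults,noatime,_netdev  0 0

# If the device is mapped through /etc/ceph/rbdmap, enable the unit
# shipped with Ceph that maps on boot and unmaps before networking stops:
systemctl enable rbdmap.service
```

Without the ordering hint, systemd stops networking first, and the kernel client then spins in the mon reconnect loop shown above until its timeouts expire.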
Re: [ceph-users] backfill_toofull while OSDs are not full
This doesn't look familiar to me. Is the cluster still doing recovery, so we can at least expect them to make progress when the "out" OSDs get removed from the set?

On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander wrote:
> Hi,
>
> I've got a couple of PGs which are stuck in backfill_toofull, but none
> of them are actually full.
>
>     "up": [999, 1900, 145],
>     "acting": [701, 1146, 1880],
>     "backfill_targets": ["145", "999", "1900"],
>     "acting_recovery_backfill": ["145", "701", "999", "1146", "1880", "1900"],
>
> I checked all these OSDs, but they are all <75% utilization.
>
>     full_ratio 0.95
>     backfillfull_ratio 0.9
>     nearfull_ratio 0.9
>
> So I started checking all the PGs and I've noticed that each of these
> PGs has one OSD in the 'acting_recovery_backfill' which is marked as out.
>
> In this case osd.1880 is marked as out, and thus its capacity is shown
> as zero.
>
> [ceph@ceph-mgr ~]$ ceph osd df | grep 1880
> 1880 hdd 4.54599 0 0 B 0 B 0 B 0 0 27
> [ceph@ceph-mgr ~]$
>
> This is on a Mimic 13.2.4 cluster. Is this expected, or is this an
> unknown side-effect of one of the OSDs being marked as out?
>
> Thanks,
>
> Wido
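Wido's diagnosis can be automated: cross-reference the PG's acting/backfill sets with the OSDs whose reweight is 0 ("out"). A small sketch, using inlined samples shaped like the `ceph pg <pgid> query` and `ceph osd df -f json` output quoted above -- the field names are taken from this thread and should be treated as assumptions for other releases.

```python
# Sketch: flag OSDs that appear in a PG's acting_recovery_backfill set
# while being marked out (reweight 0), which is what made the backfill
# target report zero capacity in this thread.
import json

# Sample shaped like `ceph pg 7.x query` output (see Wido's mail above).
pg_query = json.loads("""
{"acting_recovery_backfill": ["145", "701", "999", "1146", "1880", "1900"]}
""")

# Sample shaped like `ceph osd df -f json`; only the relevant fields.
osd_df = json.loads("""
{"nodes": [{"id": 1880, "reweight": 0.0},
           {"id": 1900, "reweight": 1.0},
           {"id": 145,  "reweight": 1.0}]}
""")

out_osds = {n["id"] for n in osd_df["nodes"] if n["reweight"] == 0.0}
suspects = [int(o) for o in pg_query["acting_recovery_backfill"]
            if int(o) in out_osds]
print(suspects)
```

Run against live `ceph ... -f json` output, a non-empty `suspects` list for a backfill_toofull PG would point at the same out-OSD condition described here.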
Re: [ceph-users] create osd failed due to cephx authentication
Hi,

I added the --no-mon-config option to the command, since the error message suggested I could try it, and the command then executed successfully. I have now added the OSDs to the cluster and everything seems well. I'm wondering whether this option has any effect on the OSD or not?

Thanks

Marc Roos wrote on Fri, Jan 25, 2019 at 4:16 AM:
>
> ceph osd create
> ceph osd rm osd.15
> sudo -u ceph mkdir /var/lib/ceph/osd/ceph-15
> ceph-disk prepare --bluestore --zap-disk /dev/sdc   (bluestore)
>
> blkid /dev/sdb1
> echo "UUID=a300d511-8874-4655-b296-acf489d5cbc8 /var/lib/ceph/osd/ceph-15 xfs defaults 0 0" >> /etc/fstab
> mount /var/lib/ceph/osd/ceph-15
> chown ceph.ceph -R /var/lib/ceph/osd/ceph-15
>
> sudo -u ceph ceph-osd -i 15 --mkfs --mkkey --osd-uuid
> sudo -u ceph ceph auth add osd.15 osd 'allow *' mon 'allow profile osd' mgr 'allow profile osd' -i /var/lib/ceph/osd/ceph-15/keyring
>
> ceph osd create
> sudo -u ceph ceph osd crush add osd.15 0.4 host=c04
>
> systemctl start ceph-osd@15
> systemctl enable ceph-osd@15
>
> -----Original Message-----
> From: Zhenshi Zhou [mailto:deader...@gmail.com]
> Sent: 24 January 2019 10:32
> To: ceph-users
> Subject: [ceph-users] create osd failed due to cephx authentication
>
> Hi,
>
> I'm installing a new ceph cluster manually. I get errors when I create
> an osd:
>
> # ceph-osd -i 0 --mkfs --mkkey
> 2019-01-24 17:07:44.045 7f45f497b1c0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
> 2019-01-24 17:07:44.045 7f45f497b1c0 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication
>
> Some information is provided below -- did I miss anything?
>
> # cat /etc/ceph/ceph.conf:
> [global]
> ...
> [osd.0]
> host = ceph-osd1
> osd data = /var/lib/ceph/osd/ceph-0
> bluestore block path = /dev/disk/by-partlabel/bluestore_block_0
> bluestore block db path = /dev/disk/by-partlabel/bluestore_block_db_0
> bluestore block wal path = /dev/disk/by-partlabel/bluestore_block_wal_0
>
> # ls /var/lib/ceph/osd/ceph-0
> type
>
> # cat /var/lib/ceph/osd/ceph-0/type
> bluestore
>
> # ceph -s
>   cluster:
>     id: 7712ab7e-3c38-44b3-96d3-4e1de9da0ff6
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3
>     mgr: ceph-mon1(active), standbys: ceph-mon2, ceph-mon3
>     osd: 1 osds: 0 up, 0 in
>
>   data:
>     pools: 0 pools, 0 pgs
>     objects: 0 objects, 0 B
>     usage: 0 B used, 0 B / 0 B avail
>     pgs:
>
> # ceph --version
> ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
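For anyone hitting the same "unable to find a keyring" error: the keyring ceph-osd complains about can also be created up front with ceph-authtool and registered with the monitors before running --mkfs. The sketch below mirrors Marc's sequence for osd.0; the paths and id come from the thread, but verify the capability strings against your release.

```shell
# Create the keyring ceph-osd could not find, with a fresh key:
sudo -u ceph ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-0/keyring \
    --gen-key -n osd.0

# Register it with the monitors so the OSD can authenticate via cephx:
sudo -u ceph ceph auth add osd.0 \
    osd 'allow *' mon 'allow profile osd' mgr 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-0/keyring

# With the keyring in place, --mkfs no longer needs --mkkey:
sudo -u ceph ceph-osd -i 0 --mkfs
```

As for --no-mon-config: it only tells the command not to fetch configuration from the monitors at startup; it shouldn't change what is written to the OSD itself.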
Re: [ceph-users] Commercial support
Hi Ketil,

We also offer independent Ceph consulting, and have operated production clusters for more than 4 years, with up to 2500 OSDs. You can meet many of us in person at the next Cephalocon in Barcelona (https://ceph.com/cephalocon/barcelona-2019/).

Regards,
Joachim

Clyso GmbH
Homepage: https://www.clyso.com

On 24.01.2019 at 15:23, Matthew Vernon wrote:
> Hi,
>
> On 23/01/2019 22:28, Ketil Froyn wrote:
>> How is the commercial support for Ceph? More specifically, I was
>> recently pointed in the direction of the very interesting combination
>> of CephFS, Samba and ctdb. Is anyone familiar with companies that
>> provide commercial support for in-house solutions like this?
>
> To add to the answers you've already had:
>
> Ubuntu also offer Ceph & Swift support:
> https://www.ubuntu.com/support/plans-and-pricing#storage
>
> Croit offer their own managed Ceph product, but do also offer
> support/consulting for Ceph installs, I think: https://croit.io/
>
> There are some smaller consultancies, too, including 42on, which is run
> by Wido den Hollander, who you will have seen posting here:
> https://www.42on.com/
>
> Regards,
> Matthew
>
> disclaimer: I have no commercial relationship with any of the above
Re: [ceph-users] Salvage CEPHFS after lost PG
Thanks Marc,

When I next have physical access to the cluster, I'll add some more OSDs. Would that cause the hanging though?

No takers on the bluestore salvage?

thanks,
rik.

> On 20 Jan 2019, at 20:36, Marc Roos wrote:
>
> If you have a backfillfull, no pg's will be able to migrate.
> Better is to just add hard drives, because at least one of your osd's
> is too full.
>
> I know you can set the backfillfull ratios with commands like these:
>
> ceph tell osd.* injectargs '--mon_osd_full_ratio=0.97'
> ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.95'
>
> ceph tell osd.* injectargs '--mon_osd_full_ratio=0.95'
> ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.90'
>
> Or maybe decrease the weight of the full osd, check the osds with
> 'ceph osd status' and make sure your nodes have an even distribution
> of the storage.
>
> -----Original Message-----
> From: Rik [mailto:r...@kawaja.net]
> Sent: zondag 20 januari 2019 8:47
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Salvage CEPHFS after lost PG
>
> Hi all,
>
> I'm looking for some suggestions on how to do something inappropriate.
>
> In a nutshell, I've lost the WAL/DB for three bluestore OSDs on a small
> cluster and, as a result of those three OSDs going offline, I've lost a
> placement group (7.a7). How I achieved this feat is an embarrassing
> mistake, which I don't think has bearing on my question.
>
> The OSDs were created a few months ago with ceph-deploy:
>
> /usr/local/bin/ceph-deploy --overwrite-conf osd create --bluestore --data /dev/vdc1 --block-db /dev/vdf1 ceph-a
>
> With the 3 OSDs out, I'm sitting at OSD_BACKFILLFULL.
>
> First, the PG 7.a7 belongs to the data pool, rather than the metadata
> pool, and if I run "cephfs-data-scan pg_files / 7.a7" then I get a list
> of 4149 files/objects, but then it hangs. I don't understand why this
> would hang if it's only the data pool which is impacted (since pg_files
> only operates on the metadata pool?).
>
> The ceph log shows:
>
> cluster [WRN] slow request 30.894832 seconds old, received at 2019-01-20 18:00:12.555398: client_request(client.25017730:218006 lookup #0x10001c8ce15/01 2019-01-20 18:00:12.550421 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting
>
> Is the hang perhaps related to the OSD_BACKFILLFULL? If so, I could add
> some completely new OSDs to fix that problem. I have held off doing
> that for now, as it will trigger a whole lot of data movement which
> might be unnecessary.
>
> Or is the hang indeed related to the missing PG?
>
> Second, if I try to copy files out of the CEPHFS filesystem, I get a
> few hundred files and then it too hangs. None of the files I'm
> attempting to copy are listed in the pg_files output (although since
> pg_files hangs, perhaps it hadn't got to those files yet). Again,
> should I not be able to access files which are not associated with the
> missing data pool PG?
>
> Lastly, I want to know if there is some way to recreate the WAL/DB
> while leaving the OSD data intact and/or fool one of the OSDs into
> thinking everything is OK, allowing it to serve up the data it has in
> the missing PG.
>
> From reading the mailing list and documentation, I know that this is
> not a "safe" operation:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021713.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024268.html
>
> However, my current status indicates an unusable CEPHFS and limited
> access to the data. I'd like to get as much data off it as possible and
> then I expect to have to recreate it. With a combination of the backups
> I have and what I can salvage from the cluster, I should hopefully have
> most of what I need.
>
> I know what I *should* have done, but now I'm at this point, I know I'm
> asking for something which would never be required on a properly-run
> cluster.
>
> If it really is not possible to get the (possibly corrupt) PG back
> again, can I get the cluster back so the remainder of the files are
> accessible?
>
> Currently running mimic 13.2.4 on all nodes.
>
> Status:
>
> $ ceph health detail - https://gist.github.com/kawaja/f59d231179b3186748eca19aae26bcd4
> $ ceph fs get main - https://gist.github.com/kawaja/a7ab0b285d53dee6a950a4310be4fa5a
>
> Any advice on where I could go from here would be greatly appreciated.
>
> thanks,
> rik.
[ceph-users] Creating bootstrap keys
Greetings,

I have a ceph cluster that I've been running since the argonaut release. I've been upgrading it over the years and now it's mostly on mimic. A number of the tools (e.g. ceph-volume) require the bootstrap keys that are assumed to be in /var/lib/ceph/bootstrap-*/. Because of the history of this cluster, I do not have these keys. What is the correct way to create them?

Thanks

--
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741
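No answer appears in the archive, but the usual approach is to generate the bootstrap users on the monitors and write them into the expected paths. A sketch for the OSD bootstrap key follows (the profile and path match standard Mimic layouts, but verify against your release; repeat analogously for bootstrap-mds, bootstrap-rgw, and bootstrap-rbd):

```shell
# Recreate the OSD bootstrap keyring expected by ceph-volume et al.
mkdir -p /var/lib/ceph/bootstrap-osd
ceph auth get-or-create client.bootstrap-osd \
    mon 'allow profile bootstrap-osd' \
    -o /var/lib/ceph/bootstrap-osd/ceph.keyring
chown -R ceph:ceph /var/lib/ceph/bootstrap-osd
```

Because `get-or-create` is idempotent, it returns the existing key if the mon already knows the user, so this is safe to run on every node that needs the file.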
Re: [ceph-users] Radosgw s3 subuser permissions
This should do it, sort of:

{
  "Id": "Policy1548367105316",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1548367099807",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Principal": {"AWS": "arn:aws:iam::Company:user/testuser"},
      "Resource": "arn:aws:s3:::archive"
    },
    {
      "Sid": "Stmt1548369229354",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Principal": {"AWS": "arn:aws:iam::Company:user/testuser"},
      "Resource": "arn:aws:s3:::archive/folder2/*"
    }
  ]
}

-----Original Message-----
From: Matt Benjamin [mailto:mbenj...@redhat.com]
Sent: 24 January 2019 21:36
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Radosgw s3 subuser permissions

Hi Marc,

I'm not actually certain whether the traditional ACLs permit any solution for that, but I believe with bucket policy you can achieve precise control within and across tenants, for any set of desired resources (buckets).

Matt

On Thu, Jan 24, 2019 at 3:18 PM Marc Roos wrote:
>
> Is it correct that it is NOT possible for s3 subusers to have
> different permissions on folders created by the parent account?
> Thus the --access=[ read | write | readwrite | full ] is for
> everything the parent has created, and it is not possible to change
> that for specific folders/buckets?
>
> radosgw-admin subuser create --uid='Company$archive' --subuser=testuser --key-type=s3
>
> Thus if archive created this bucket/folder structure:
>
> └── bucket
>     ├── folder1
>     ├── folder2
>     └── folder3
>         └── folder4
>
> It is not possible to allow testuser to only write in folder2?

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
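Policies like Marc's are easy to mistype by hand; generating them programmatically helps. A small sketch that rebuilds the policy above from its parameters (tenant, user, and bucket names are the ones quoted in the thread; the helper itself is hypothetical, not a radosgw API):

```python
import json

def bucket_policy(tenant, user, bucket, rw_prefix):
    """Build a policy letting `user` list `bucket` but read/write only
    under `rw_prefix`/ -- mirrors the JSON quoted above."""
    principal = {"AWS": f"arn:aws:iam::{tenant}:user/{user}"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Principal": principal,
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Principal": principal,
                "Resource": f"arn:aws:s3:::{bucket}/{rw_prefix}/*",
            },
        ],
    }

policy = bucket_policy("Company", "testuser", "archive", "folder2")
print(json.dumps(policy, indent=2))
```

The resulting JSON can then be applied to the bucket with a policy-capable client, e.g. `s3cmd setpolicy policy.json s3://archive`.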
Re: [ceph-users] Commercial support
Hello Ketil,

we at croit offer commercial support for Ceph, as well as our own Ceph-based unified storage solution.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Wed, Jan 23, 2019 at 11:29 PM Ketil Froyn wrote:
> Hi,
>
> How is the commercial support for Ceph? More specifically, I was
> recently pointed in the direction of the very interesting combination
> of CephFS, Samba and ctdb. Is anyone familiar with companies that
> provide commercial support for in-house solutions like this?
>
> Regards, Ketil
[ceph-users] cephfs kernel client hung after eviction
Hi,

I have a cephfs kernel client (Ubuntu 18.04, 4.15.0-34-generic) that's completely hung after the client was evicted by the MDS. The client logged:

Jan 24 17:31:46 client kernel: [10733559.309496] libceph: FULL or reached pool quota
Jan 24 17:32:26 client kernel: [10733599.232213] libceph: mon0 n.n.n.n:6789 session lost, hunting for new mon

And the MDS logged:

2019-01-24 17:36:38.859 7f3ac7844700 0 log_channel(cluster) log [WRN] : evicting unresponsive client client:cephfs-client (86527773), after 300.081 seconds

Looking in mdsc shows:

% head /sys/kernel/debug/ceph/[id].client86527773/mdsc
20  mds0  getattr  #103d042
21  mds0  getattr  #103d042
22  mds0  getattr  #103d042
23  mds0  getattr  #103d042
24  mds0  getattr  #103d042
25  mds0  getattr  #103d042
26  mds0  getattr  #103d042
27  mds0  getattr  #103d042
28  mds0  getattr  #103d042
29  mds0  getattr  #103d042

But osdc hangs when I try to access it. I've tried umount -f, but it hangs too. umount -l hides the problem (df no longer hangs), but any processes that were trying to access the mount are still blocked. I've also tried switching back and forth to standby MDSs in case that unblocked something. There are no current OSD blacklist entries either.

It looks like rebooting is the only option, but that's somewhat of a pain to do. There are lots of people using this machine :-(

Any ideas?

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt
On 24.01.19 at 22:34, Alfredo Deza wrote:
>> I have both, plain and luks. At the moment I played around with the
>> plain dmcrypt OSDs and ran into this problem. I didn't test the luks
>> crypted OSDs.
>
> There is support in the JSON file to define the type of encryption
> with the key: encryption_type
>
> If this is undefined it will default to 'plain'. So that tells me
> that we may indeed have a problem here.

Ah yes. I set this in my json:

    "encryption_type": "plain"

But as far as I see, if this key is missing, plain is the default anyway. So this should be OK.

> I'm not sure what might be needed here, but I do recall having some
> trouble trying to understand what ceph-disk was doing. That is
> captured in this comment:
> https://github.com/ceph/ceph/blob/v12.2.10/src/ceph-volume/ceph_volume/devices/simple/activate.py#L150-L155
>
> Do you think that might be related?

The comment confuses me a bit. As far as I read the code, the base64-encoded key is retrieved from the monitor, decoded, and then passed via stdin to the cryptsetup command. At first I thought this was all OK, but after some investigation of what cryptsetup is doing, I think the following option should be passed to cryptsetup as well:

    --hash plain

ceph-disk uses the local keyfile, which is *not* base64 encoded. The way ceph-disk passes the keyfile to cryptsetup, cryptsetup will not hash the key with some default algorithm.
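The mismatch Manuel describes can be reproduced outside of Ceph. A sketch of the two invocation styles follows; /dev/sdX1 and the key path are placeholders, and the hashing behavior described in the comments is as reported in this thread (verify against your cryptsetup version):

```shell
# ceph-disk style: plain dm-crypt key read from a regular key file,
# used as-is without hashing.
cryptsetup --type plain open /dev/sdX1 osd-data \
    --key-file /etc/ceph/dmcrypt-keys/osd-key

# ceph-volume style: the key arrives on stdin. Per this thread, stdin
# input is run through cryptsetup's default password hash, so the
# mapping only matches the one above when --hash plain is added.
cat /etc/ceph/dmcrypt-keys/osd-key | \
    cryptsetup --type plain open /dev/sdX1 osd-data \
        --key-file - --hash plain
```

If the hashes differ, the derived dm-crypt key differs, and the "opened" device decodes to garbage rather than failing outright -- which is what makes plain mode mistakes like this so easy to miss.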
Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt
On Thu, Jan 24, 2019 at 4:13 PM mlausch wrote:
>
> Am 24.01.19 um 22:02 schrieb Alfredo Deza:
> >>
> >> Ok with a new empty journal the OSD will not start. I have now rescued
> >> the data with dd, re-encrypted it with another key, and copied the
> >> data back. This worked so far.
> >>
> >> Now I encoded the key with base64 and put it into the key-value store.
> >> Also created the necessary authkeys. Creating the json file by hand
> >> was quite easy.
> >>
> >> But now there is one problem.
> >> ceph-disk opens the crypt like
> >>     cryptsetup --key-file /etc/ceph/dmcrypt-keys/foobar ...
> >> ceph-volume pipes the key via stdin like this
> >>     cat foobar | cryptsetup --key-file - ...
> >>
> >> The big problem: if the key is given via stdin, cryptsetup hashes this
> >> key per default with some hash. Only if I set --hash plain does it
> >> work. I think this is a bug in ceph-volume.
> >>
> >> Can someone confirm this?
> >
> > Ah right, this is when it was supported to have keys in a file.
> >
> > What type of keys do you have: LUKS or plain?
>
> I have both, plain and luks.
> At the moment I played around with the plain dmcrypt OSDs and ran into
> this problem. I didn't test the luks crypted OSDs.

There is support in the JSON file to define the type of encryption with the key: encryption_type

If this is undefined it will default to 'plain'. So that tells me that we may indeed have a problem here.

I'm not sure what might be needed here, but I do recall having some trouble trying to understand what ceph-disk was doing. That is captured in this comment:

https://github.com/ceph/ceph/blob/v12.2.10/src/ceph-volume/ceph_volume/devices/simple/activate.py#L150-L155

Do you think that might be related?
Re: [ceph-users] cephfs kernel client instability
On Thu, Jan 24, 2019 at 6:21 PM Andras Pataki wrote:
>
> Hi Ilya,
>
> Thanks for the clarification - very helpful.
> I've lowered osd_map_messages_max to 10, and this resolves the issue
> about the kernel being unhappy about large messages when the OSDMap
> changes. One comment here though: you mentioned that Luminous uses 40
> as the default, which is indeed the case. The documentation for
> Luminous (and master), however, says that the default is 100.

Looks like that page hasn't been kept up to date. I'll fix that section.

> One other follow-up question on the kernel client about something I've
> been seeing while testing. Does the kernel client clean up when the MDS
> asks due to cache pressure? On a machine I ran something that touches a
> lot of files, so the kernel client accumulated over 4 million caps.
> Many hours after all the activity finished (i.e. many hours after
> anything accesses ceph on that node) the kernel client still holds
> millions of caps, and the MDS periodically complains about clients not
> responding to cache pressure. How is this supposed to be handled?
> Obviously asking the kernel to drop caches via /proc/sys/vm/drop_caches
> does a very thorough cleanup, but something in the middle would be better.

The kernel client sitting on way too many caps for way too long is a long-standing issue. Adding Zheng, who has recently been doing some work to facilitate cap releases and put a limit on the overall cap count.

Thanks,
Ilya
Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt
On 24.01.19 at 22:02, Alfredo Deza wrote:
>> Ok with a new empty journal the OSD will not start. I have now rescued
>> the data with dd, re-encrypted it with another key, and copied the
>> data back. This worked so far.
>>
>> Now I encoded the key with base64 and put it into the key-value store.
>> Also created the necessary authkeys. Creating the json file by hand
>> was quite easy.
>>
>> But now there is one problem.
>> ceph-disk opens the crypt like
>>     cryptsetup --key-file /etc/ceph/dmcrypt-keys/foobar ...
>> ceph-volume pipes the key via stdin like this
>>     cat foobar | cryptsetup --key-file - ...
>>
>> The big problem: if the key is given via stdin, cryptsetup hashes this
>> key per default with some hash. Only if I set --hash plain does it
>> work. I think this is a bug in ceph-volume.
>>
>> Can someone confirm this?
>
> Ah right, this is when it was supported to have keys in a file.
>
> What type of keys do you have: LUKS or plain?

I have both, plain and luks.
At the moment I played around with the plain dmcrypt OSDs and ran into this problem. I didn't test the luks crypted OSDs.
Re: [ceph-users] cephfs kernel client instability
On Thu, Jan 24, 2019 at 8:16 PM Martin Palma wrote:
>
> We are experiencing the same issues on clients with CephFS mounted
> using the kernel client and 4.x kernels.
>
> The problem shows up when we add new OSDs, on reboots after
> installing patches, and when changing the weight.
>
> Here are the logs of a misbehaving client:
>
> [6242967.890611] libceph: mon4 10.8.55.203:6789 session established
> [6242968.010242] libceph: osd534 10.7.55.23:6814 io error
> [6242968.259616] libceph: mon1 10.7.55.202:6789 io error
> [6242968.259658] libceph: mon1 10.7.55.202:6789 session lost, hunting for new mon
> [6242968.359031] libceph: mon4 10.8.55.203:6789 session established
> [6242968.622692] libceph: osd534 10.7.55.23:6814 io error
> [6242968.692274] libceph: mon4 10.8.55.203:6789 io error
> [6242968.692337] libceph: mon4 10.8.55.203:6789 session lost, hunting for new mon
> [6242968.694216] libceph: mon0 10.7.55.201:6789 session established
> [6242969.099862] libceph: mon0 10.7.55.201:6789 io error
> [6242969.099888] libceph: mon0 10.7.55.201:6789 session lost, hunting for new mon
> [6242969.224565] libceph: osd534 10.7.55.23:6814 io error
>
> In addition to the MON io errors we also got some OSD io errors.

This isn't surprising -- the kernel client can receive osdmaps from both monitors and OSDs.

> Moreover, when the error occurs, several clients cause a
> "MDS_CLIENT_LATE_RELEASE" error on the MDS server.
>
> We are currently running on Luminous 12.2.10 and have around 580 OSDs
> and 5 monitor nodes. The cluster is running on CentOS 7.6.
>
> The 'osd_map_message_max' setting is set to the default value of 40,
> but we are still getting these errors.

My advice is the same: set it to 20 or even 10. The problem is that this setting is expressed as a number of osdmaps instead of the size of the resulting message. I've filed http://tracker.ceph.com/issues/38040

Thanks,
Ilya
Re: [ceph-users] cephfs kernel client instability
Hi Ilya, Thanks for the clarification - very helpful. I've lowered osd_map_messages_max to 10, and this resolves the issue about the kernel being unhappy about large messages when the OSDMap changes. One comment here though: you mentioned that Luminous uses 40 as the default, which is indeed the case. The documentation for Luminous (and master), however, says that the default is 100. One other follow-up question on the kernel client about something I've been seeing while testing. Does the kernel client clean up when the MDS asks due to cache pressure? On a machine I ran something that touches a lot of files, so the kernel client accumulated over 4 million caps. Many hours after all the activity finished (i.e. many hours after anything accesses ceph on that node) the kernel client still holds millions of caps, and the MDS periodically complains about clients not responding to cache pressure. How is this supposed to be handled? Obviously asking the kernel to drop caches via /proc/sys/vm/drop_caches does a very thorough cleanup, but something in the middle would be better. Andras On 1/16/19 1:45 PM, Ilya Dryomov wrote: On Wed, Jan 16, 2019 at 7:12 PM Andras Pataki wrote: Hi Ilya/Kjetil, I've done some debugging and tcpdump-ing to see what the interaction between the kernel client and the mon looks like. Indeed - CEPH_MSG_MAX_FRONT defined as 16Mb seems low for the default mon messages for our cluster (with osd_mon_messages_max at 100). We have about 3500 osd's, and the kernel advertises itself as older than This is too big, especially for a fairly large cluster such as yours. The default was reduced to 40 in luminous. Given about 3500 OSDs, you might want to set it to 20 or even 10. Luminous, so it gets full map updates. The FRONT message size on the wire I saw was over 24Mb. I'll try setting osd_mon_messages_max to 30 and do some more testing, but from the debugging it definitely seems like the issue. 
Is the kernel driver really not up to date to be considered at least a Luminous client by the mon (i.e. it has some feature really missing)? I looked at the bits, and what the MON seems to want is bit 59 in ceph features, shared by FS_BTIME, FS_CHANGE_ATTR, MSG_ADDR2. Can the kernel client be used when setting require-min-compat-client to luminous (either with the 4.19.x kernel or the RedHat/CentOS 7.6 kernel)? Some background here would be helpful.

Yes, the kernel client is missing support for that feature bit, however 4.13+ and RHEL 7.5+ _can_ be used with require-min-compat-client set to luminous. See

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html

Thanks,

Ilya
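As a sketch of the check and the opt-in discussed above (commands as in Luminous, to be run against a live cluster -- verify against your version's docs):

```shell
# Show the feature bits and the release each connected client/daemon
# maps to; useful to see what the kernel client actually advertises.
ceph features

# Opt in to luminous-only features even though the kernel client
# reports itself as an older release. Only safe if all connected
# kernels are 4.13+ / RHEL 7.5+, as Ilya notes; the extra flag is
# needed because the mon sees "old" clients connected.
ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
```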
Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt
On Thu, Jan 24, 2019 at 3:17 PM Manuel Lausch wrote:
>
> On Wed, 23 Jan 2019 16:32:08 +0100 Manuel Lausch wrote:
>
> > > The key api for encryption is *very* odd and a lot of its quirks are
> > > undocumented. For example, ceph-volume is stuck supporting naming
> > > files and keys 'lockbox' (for backwards compatibility) but there is
> > > no real lockbox anymore. Another quirk is that when storing the
> > > secret in the monitor, it is done using the following convention:
> > >
> > > dm-crypt/osd/{OSD FSID}/luks
> > >
> > > The 'luks' part there doesn't indicate anything about the type of
> > > encryption (!!) so regardless of the type of encryption (luks or
> > > plain) the key would still go there.
> > >
> > > If you manage to get the keys into the monitors you still wouldn't
> > > be able to scan OSDs to produce the JSON files, but you would be
> > > able to create the JSON file with the metadata that ceph-volume
> > > needs to run the OSD.
> >
> > I think it is not that much of a problem to create the JSON files
> > myself. Moving the keys to the monitors and creating appropriate
> > auth keys should be more or less easy as well.
> >
> > The problem I see is that there are individual keys for the journal
> > and data partitions, while the new process uses only one key for
> > both partitions.
> >
> > Maybe I can recreate the journal partition with the other key. But is
> > this possible? Is there important data remaining on the journal after
> > cleanly stopping the OSD which I cannot throw away without trashing
> > the whole OSD?
>
> OK, with a new, empty journal the OSD will not start. I have now
> rescued the data with dd, re-encrypted it with the other key and
> copied the data back. This worked so far.
>
> Now I encoded the key with base64 and put it into the key-value store.
> I also created the necessary auth keys. Creating the JSON file by hand
> was quite easy.
>
> But now there is one problem: ceph-disk opens the crypt like
>
> cryptsetup --key-file /etc/ceph/dmcrypt-keys/foobar ...
>
> ceph-volume pipes the key via stdin, like this:
>
> cat foobar | cryptsetup --key-file - ...
>
> The big problem: if the key is given via stdin, cryptsetup hashes this
> key by default with some hash. Only if I set --hash plain does it
> work. I think this is a bug in ceph-volume.
>
> Can someone confirm this?

Ah right, this is from when it was supported to have keys in a file. What type of keys do you have: LUKS or plain?

> There is the related code I mean in ceph-volume:
> https://github.com/ceph/ceph/blob/v12.2.10/src/ceph-volume/ceph_volume/util/encryption.py#L59
>
> Regards
> Manuel
Re: [ceph-users] ceph-users Digest, Vol 72, Issue 20
I tried setting noout and that did provide a bit better result. Basically I could stop the OSD on the inactive server and everything still worked (after a 2-3 second pause), but then when I rebooted the inactive server everything hung again until it came back online and resynced with the cluster. This is what I saw in ceph -s:

 cluster eb2003cf-b16d-4551-adb7-892469447f89
  health HEALTH_WARN
         128 pgs degraded
         124 pgs stuck unclean
         128 pgs undersized
         recovery 805252/1610504 objects degraded (50.000%)
         mds cluster is degraded
         1/2 in osds are down
         noout flag(s) set
  monmap e1: 3 mons at {FILE1=10.1.1.201:6789/0,FILE2=10.1.1.202:6789/0,MON1=10.1.1.90:6789/0}
         election epoch 216, quorum 0,1,2 FILE1,FILE2,MON1
  fsmap e796: 1/1/1 up {0=FILE2=up:rejoin}
  osdmap e360: 2 osds: 1 up, 2 in; 128 remapped pgs
         flags noout,sortbitwise,require_jewel_osds
  pgmap v7056802: 128 pgs, 3 pools, 164 GB data, 786 kobjects
        349 GB used, 550 GB / 899 GB avail
        805252/1610504 objects degraded (50.000%)
        128 active+undersized+degraded
  client io 1379 B/s rd, 1 op/s rd, 0 op/s wr

These are the commands I ran and the results:

ceph osd set noout
systemctl stop ceph-mds@FILE2.service
# Everything still works on the clients...
systemctl stop ceph-osd@0.service   # This was on FILE2 while FILE1 was the active fsmap
# Fails over quickly, can still read content on the clients..
# Rebooted FILE2
# File access on the clients locked up until FILE2 rejoined

This is on Ubuntu 16 with kernel 4.4.0-141, so I'm not sure if that qualifies for David's warning about old kernels... Is there a command or a logfile I can look at that will better help to diagnose this issue? Is three servers (with only 2 OSDs) enough to run an HA cluster on ceph, or does it just die when it doesn't have 3 active servers for a quorum? Would installing MDS and MON on a 4th box (but sticking with 2 OSDs) be what's required to resolve this? I really don't want to do that, but if I have to I guess I can look into finding another box.
On 2019-01-21 5:01 p.m., ceph-users-requ...@lists.ceph.com wrote:

Message: 14
Date: Mon, 21 Jan 2019 10:05:15 +0100
From: Robert Sander
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How To Properly Failover a HA Setup
Message-ID: <587dac75-96bc-8719-ee62-38e71491c...@heinlein-support.de>
Content-Type: text/plain; charset="utf-8"

On 21.01.19 09:22, Charles Tassell wrote:

Hello Everyone, I've got a 3 node Jewel cluster setup, and I think I'm missing something. When I want to take one of my nodes down for maintenance (kernel upgrades or the like) all of my clients (running the kernel module for the cephfs filesystem) hang for a couple of minutes before the redundant servers kick in.

Have you set the noout flag before doing cluster maintenance?

ceph osd set noout

and afterwards

ceph osd unset noout

Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
Re: [ceph-users] Migrating to a dedicated cluster network
Split networks are rarely worth it. One fast network is usually better. And since you mentioned having only two interfaces: one bond is way better than two independent interfaces.

IPv4/6 dual stack setups will be supported in Nautilus; you currently have to use either IPv4 or IPv6.

Jumbo frames: often mentioned but usually not worth it. (Yes, I know that this is somewhat controversial and increasing the MTU is a standard trick for performance tuning, but I have yet to see a benchmark that actually shows a significant performance improvement. Some quick tests show that I can save around 5-10% CPU load on a system doing ~50 gbit/s of IO traffic, which is almost nothing given the total system load.)

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jan 23, 2019 at 11:41 AM Jan Kasprzak wrote:
>
> Jakub Jaszewski wrote:
> : Hi Yenya,
> :
> : Can I ask how your cluster looks and why you want to do the network
> : splitting?
>
> Jakub,
>
> we have deployed the Ceph cluster originally as a proof of concept for
> a private cloud. We run OpenNebula and Ceph on about 30 old servers
> with old HDDs (2 OSDs per host), all connected via 1 Gbit ethernet
> with 10Gbit backbone. Since then our private cloud got pretty popular
> among our users, so we are planning to upgrade it to a smaller number
> of modern servers. The new servers have two 10GbE interfaces, so the primary
> reasoning behind it is "why not use them both when we already have them".
> Of course, interface teaming/bonding is another option.
>
> Currently I see the network being saturated only when doing a live
> migration of a VM between the physical hosts, and then during a Ceph
> cluster rebalance.
>
> So, I don't think moving to a dedicated cluster network is a necessity for us.
>
> Anyway, does anybody use the cluster network with larger MTU (jumbo frames)?
> : We used to set up 9-12 OSD node (12-16 HDDs each) clusters using 2x10Gb
> : for access and 2x10Gb for cluster network; however, I don't see the reasons
> : not to use just one network for the next cluster setup.
>
> -Yenya
>
> : On Wed, 23 Jan 2019, 10:40, Jan Kasprzak wrote:
> :
> : > Hello, Ceph users,
> : >
> : > is it possible to migrate an already deployed Ceph cluster, which uses
> : > a public network only, to split public/dedicated networks? If so,
> : > can this be done without service disruption? I have now got new
> : > hardware which makes this possible, but I am not sure how to do it.
> : >
> : > Another question is whether the cluster network can be done
> : > solely on top of IPv6 link-local addresses without any public address
> : > prefix.
> : >
> : > When deploying this cluster (Ceph Firefly, IIRC), I had problems
> : > with mixed IPv4/IPv6 addressing, and ended up with ms_bind_ipv6 = false
> : > in my Ceph conf.
> : >
> : > Thanks,
> : >
> : > -Yenya
> : >
> : > --
> : > | Jan "Yenya" Kasprzak |
> : > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> : > This is the world we live in: the way to deal with computers is to google
> : > the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
>
> --
> | Jan "Yenya" Kasprzak |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> This is the world we live in: the way to deal with computers is to google
> the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
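For anyone experimenting with jumbo frames: a common pitfall is a switch port or a single host left at MTU 1500, which tends to show up as mysterious hangs rather than clean errors. A hedged sketch for verifying that 9000-byte frames actually pass end-to-end (Linux iputils ping; interface name and peer address are examples; 8972 = 9000 minus 20 bytes IPv4 header and 8 bytes ICMP header):

```shell
# Check the configured MTU on the interface (name is an example)
ip link show dev eth0

# Send a 9000-byte frame with the "don't fragment" bit set; if any hop
# on the path is still at MTU 1500 this fails with "Message too long"
# instead of silently fragmenting.
ping -M do -s 8972 -c 3 192.0.2.10
```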
Re: [ceph-users] Configure libvirt to 'see' already created snapshots of a vm rbd image
Hmm... if I am not wrong, this information has to be put into your config files... there isn't a mechanism which extracts this via rbd snap ls ...

On 7 January 2019 13:16:36 CET, Marc Roos wrote:
>
> How do you configure libvirt so it sees the snapshots already created on
> the rbd image it is using for the vm?
>
> I already have a vm running connected to the rbd pool via
> protocol='rbd', and rbd snap ls is showing the snapshots.
Re: [ceph-users] Radosgw s3 subuser permissions
Hi Marc, I'm not actually certain whether the traditional ACLs permit any solution for that, but I believe with bucket policy, you can achieve precise control within and across tenants, for any set of desired resources (buckets).

Matt

On Thu, Jan 24, 2019 at 3:18 PM Marc Roos wrote:
>
> It is correct that it is NOT possible for s3 subusers to have different
> permissions on folders created by the parent account? Thus the
> --access=[ read | write | readwrite | full ] is for everything the
> parent has created, and it is not possible to change that for specific
> folders/buckets?
>
> radosgw-admin subuser create --uid='Company$archive' --subuser=testuser --key-type=s3
>
> Thus if archive created this bucket/folder structure:
>
> └── bucket
>     ├── folder1
>     ├── folder2
>     └── folder3
>         └── folder4
>
> It is not possible to allow testuser to only write in folder2?

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
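To make that concrete, here is a hedged sketch of a bucket policy restricting writes to one prefix. The bucket/folder/tenant names are taken from Marc's example; the assumption is that 'testuser' is a regular rgw user under tenant 'Company' (policies address rgw users and tenants, not Swift-style subusers), and radosgw supports a subset of the AWS S3 policy language, so check your version's documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowWriteToFolder2Only",
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam::Company:user/testuser"]},
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::bucket/folder2/*"]
    }
  ]
}
```

Applied by the bucket owner with, for example, `s3cmd setpolicy policy.json s3://bucket`.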
[ceph-users] Performance issue due to tuned
Hi all, I want to share a performance issue I just encountered on a test cluster of mine, specifically related to tuned. I started by setting the "throughput-performance" tuned profile on my OSD nodes and ran some benchmarks. I then applied that same profile to my client node, which intuitively sounds like a reasonable thing to do (I do want to tweak my client to maximize throughput if that's possible).

Long story short, I found out that one of the tweaks made by the "throughput-performance" profile is to increase kernel.sched_wakeup_granularity_ns to 15000000 (15 ms), which reduces the maximum throughput I'm able to get from 1080 MB/s to 1060 MB/s (-2.8%). The default value for sched_wakeup_granularity_ns depends on the distro; on my system the default is 7.5 ms.

More info about the benchmark:
- The benchmark tool is 'rados bench'
- The cluster has about 10 nodes with older hardware
- The client node has only 4 CPUs, the OSD nodes have 16 CPUs and 5 OSDs each
- The throughput difference is always reproducible
- This was a read workload so that there is less volatility in the results
- I had all the data in BlueStore's cache on the OSD nodes so that accessing the HDDs wouldn't skew the results
- I was looking at the difference of throughput once the benchmark reaches steady state, during which the throughput is very stable (not surprising for a sequential read workload served from memory)

I have a theory which explains the reason for this reduced throughput. The sched_wakeup_granularity_ns setting sets the minimum time a process runs on a CPU before it can get preempted, so it looks like there might be too much of a delay for rados bench's threads to get scheduled on-cpu (higher latency from the moment a thread is woken up and goes into the CPU runqueue to the time it is scheduled in and starts running), which effectively results in a lower throughput overall.
We can measure that latency using 'perf sched timehist':

           time    cpu  task name                      wait time  sch delay   run time
                        [tid/pid]                         (msec)     (msec)     (msec)
 --------------- ------ ----------------------------- ---------- ---------- ----------
  3279952.180957 [0002] msgr-worker-1[50098/50094]         0.154      0.021      0.135

It is shown in the 5th column (sch delay). If we look at the average of 'sch delay' for a lower throughput run, we get:

$> perf sched timehist -i perf.data.slow | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'
0.0243015

And for a higher throughput run:

$> perf sched timehist -i perf.data.fast | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'
0.00401659

There is on average an extra ~0.02 ms (20 µs) of "wakeup-to-sched-in" delay with the throughput-performance profile enabled on the client, due to the sched_wakeup_granularity_ns setting. The fact that there are few CPUs on that node doesn't help. If I set the number of concurrent IOs to 1, I get the same throughput for both values of sched_wakeup_granularity, because there is (almost) always an available CPU, which means that rados bench's threads don't have to wait as long to get scheduled in and start consuming data. On the other hand, increasing sched_wakeup_granularity_ns on the OSD nodes doesn't reduce the throughput because there are more CPUs than there are OSDs, and the wakeup-to-sched delay is "diluted" by the latency of reading/writing/moving data around.

I'm curious to know if this theory makes sense, and if other people have encountered similar situations (with tuned or otherwise).

Mohamad
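For anyone reproducing this, the awk one-liner above simply averages the 5th whitespace-delimited field ("sch delay", in msec) of the timehist output. A quick sanity check against a captured sample (the two lines below are fabricated for illustration, not real measurements):

```shell
# Two fabricated timehist lines: time, cpu, task[tid/pid],
# wait time, sch delay, run time (all times in msec).
cat > timehist.sample <<'EOF'
 3279952.180957 [0002]  msgr-worker-1[50098/50094]  0.154  0.021  0.135
 3279952.181200 [0003]  rados[50123/50100]          0.410  0.030  0.220
EOF

# Same filter/average as in the message: keep rados bench and messenger
# threads, average the "sch delay" column ($5).
egrep 'msgr|rados' timehist.sample | awk '{ total += $5; count++ } END { print total/count }'
# prints 0.0255 for this sample ((0.021 + 0.030) / 2)
```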
Re: [ceph-users] logging of cluster status (Jewel vs Luminous and later)
Hi Matthew, Some of the logging was intentionally removed because it used to clutter up the logs. However, we are bringing back some of the useful stuff and have a tracker ticket https://tracker.ceph.com/issues/37886 open for it.

Thanks,
Neha

On Thu, Jan 24, 2019 at 12:13 PM Stefan Kooman wrote:
>
> Quoting Matthew Vernon (m...@sanger.ac.uk):
> > Hi,
> >
> > On our Jewel clusters, the mons keep a log of the cluster status e.g.
> >
> > 2019-01-24 14:00:00.028457 7f7a17bef700 0 log_channel(cluster) log [INF] : HEALTH_OK
> > 2019-01-24 14:00:00.646719 7f7a46423700 0 log_channel(cluster) log [INF] : pgmap v66631404: 173696 pgs: 10 active+clean+scrubbing+deep, 173686 active+clean; 2271 TB data, 6819 TB used, 9875 TB / 16695 TB avail; 1313 MB/s rd, 236 MB/s wr, 12921 op/s
> >
> > This is sometimes useful after a problem, to see when things started going
> > wrong (which can be helpful for incident response and analysis) and so on.
> > There doesn't appear to be any such logging in Luminous, either by mons or
> > mgrs. What am I missing?
>
> Our mons keep a log in /var/log/ceph/ceph.log (running luminous 12.2.8).
> Is that log present on your systems?
>
> Gr. Stefan
>
> --
> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
[ceph-users] Radosgw s3 subuser permissions
It is correct that it is NOT possible for s3 subusers to have different permissions on folders created by the parent account? Thus the --access=[ read | write | readwrite | full ] is for everything the parent has created, and it is not possible to change that for specific folders/buckets?

radosgw-admin subuser create --uid='Company$archive' --subuser=testuser --key-type=s3

Thus if archive created this bucket/folder structure:

└── bucket
    ├── folder1
    ├── folder2
    └── folder3
        └── folder4

It is not possible to allow testuser to only write in folder2?
Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt
On Wed, 23 Jan 2019 16:32:08 +0100 Manuel Lausch wrote:

> > The key api for encryption is *very* odd and a lot of its quirks are
> > undocumented. For example, ceph-volume is stuck supporting naming
> > files and keys 'lockbox' (for backwards compatibility) but there is
> > no real lockbox anymore. Another quirk is that when storing the
> > secret in the monitor, it is done using the following convention:
> >
> > dm-crypt/osd/{OSD FSID}/luks
> >
> > The 'luks' part there doesn't indicate anything about the type of
> > encryption (!!) so regardless of the type of encryption (luks or
> > plain) the key would still go there.
> >
> > If you manage to get the keys into the monitors you still wouldn't
> > be able to scan OSDs to produce the JSON files, but you would be
> > able to create the JSON file with the metadata that ceph-volume
> > needs to run the OSD.
>
> I think it is not that much of a problem to create the JSON files
> myself. Moving the keys to the monitors and creating appropriate auth
> keys should be more or less easy as well.
>
> The problem I see is that there are individual keys for the journal
> and data partitions, while the new process uses only one key for both
> partitions.
>
> Maybe I can recreate the journal partition with the other key. But is
> this possible? Is there important data remaining on the journal after
> cleanly stopping the OSD which I cannot throw away without trashing
> the whole OSD?

OK, with a new, empty journal the OSD will not start. I have now rescued the data with dd, re-encrypted it with the other key and copied the data back. This worked so far.

Now I encoded the key with base64 and put it into the key-value store. I also created the necessary auth keys. Creating the JSON file by hand was quite easy.

But now there is one problem: ceph-disk opens the crypt like

cryptsetup --key-file /etc/ceph/dmcrypt-keys/foobar ...

ceph-volume pipes the key via stdin, like this:

cat foobar | cryptsetup --key-file - ...

The big problem: if the key is given via stdin, cryptsetup hashes this key by default with some hash. Only if I set --hash plain does it work. I think this is a bug in ceph-volume.

Can someone confirm this?

There is the related code I mean in ceph-volume:
https://github.com/ceph/ceph/blob/v12.2.10/src/ceph-volume/ceph_volume/util/encryption.py#L59

Regards
Manuel
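The observed behavior appears consistent with how cryptsetup treats plain mode: a key file passed with --key-file is used as raw key material, but a key read from stdin ('--key-file -') is handled like an interactive passphrase and run through the passphrase hash, so a different volume key is derived. A hedged sketch of the two invocations (device name, key size and mapping name are examples; this only concerns plain mode -- LUKS stores its own key-derivation parameters in the header):

```shell
# Key material read from a file, used as-is (what ceph-disk did):
cryptsetup --key-file /etc/ceph/dmcrypt-keys/foobar \
    --key-size 256 open --type plain /dev/sdb1 osd-data

# Key piped on stdin (what ceph-volume does): without --hash plain,
# cryptsetup hashes the stdin "passphrase" first, yielding a key that
# does not match the one ceph-disk originally used.
cat /etc/ceph/dmcrypt-keys/foobar | cryptsetup --key-file - \
    --key-size 256 --hash plain open --type plain /dev/sdb1 osd-data
```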
Re: [ceph-users] create osd failed due to cephx authentication
ceph osd create
ceph osd rm osd.15
sudo -u ceph mkdir /var/lib/ceph/osd/ceph-15
ceph-disk prepare --bluestore --zap-disk /dev/sdc   # (bluestore)
blkid /dev/sdb1
echo "UUID=a300d511-8874-4655-b296-acf489d5cbc8 /var/lib/ceph/osd/ceph-15 xfs defaults 0 0" >> /etc/fstab
mount /var/lib/ceph/osd/ceph-15
chown ceph.ceph -R /var/lib/ceph/osd/ceph-15
sudo -u ceph ceph-osd -i 15 --mkfs --mkkey --osd-uuid
sudo -u ceph ceph auth add osd.15 osd 'allow *' mon 'allow profile osd' mgr 'allow profile osd' -i /var/lib/ceph/osd/ceph-15/keyring
ceph osd create
sudo -u ceph ceph osd crush add osd.15 0.4 host=c04
systemctl start ceph-osd@15
systemctl enable ceph-osd@15

-----Original Message-----
From: Zhenshi Zhou [mailto:deader...@gmail.com]
Sent: 24 January 2019 10:32
To: ceph-users
Subject: [ceph-users] create osd failed due to cephx authentication

Hi, I'm installing a new ceph cluster manually. I get errors when I create an osd:

# ceph-osd -i 0 --mkfs --mkkey
2019-01-24 17:07:44.045 7f45f497b1c0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2019-01-24 17:07:44.045 7f45f497b1c0 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication

Some information is provided below; did I miss anything?

# cat /etc/ceph/ceph.conf:
[global]
...

[osd.0]
host = ceph-osd1
osd data = /var/lib/ceph/osd/ceph-0
bluestore block path = /dev/disk/by-partlabel/bluestore_block_0
bluestore block db path = /dev/disk/by-partlabel/bluestore_block_db_0
bluestore block wal path = /dev/disk/by-partlabel/bluestore_block_wal_0

# ls /var/lib/ceph/osd/ceph-0
type
# cat /var/lib/ceph/osd/ceph-0/type
bluestore

# ceph -s
  cluster:
    id: 7712ab7e-3c38-44b3-96d3-4e1de9da0ff6
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3
    mgr: ceph-mon1(active), standbys: ceph-mon2, ceph-mon3
    osd: 1 osds: 0 up, 0 in
  data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs:

# ceph --version
ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
Re: [ceph-users] logging of cluster status (Jewel vs Luminous and later)
Quoting Matthew Vernon (m...@sanger.ac.uk):
> Hi,
>
> On our Jewel clusters, the mons keep a log of the cluster status e.g.
>
> 2019-01-24 14:00:00.028457 7f7a17bef700 0 log_channel(cluster) log [INF] : HEALTH_OK
> 2019-01-24 14:00:00.646719 7f7a46423700 0 log_channel(cluster) log [INF] : pgmap v66631404: 173696 pgs: 10 active+clean+scrubbing+deep, 173686 active+clean; 2271 TB data, 6819 TB used, 9875 TB / 16695 TB avail; 1313 MB/s rd, 236 MB/s wr, 12921 op/s
>
> This is sometimes useful after a problem, to see when things started going
> wrong (which can be helpful for incident response and analysis) and so on.
> There doesn't appear to be any such logging in Luminous, either by mons or
> mgrs. What am I missing?

Our mons keep a log in /var/log/ceph/ceph.log (running luminous 12.2.8). Is that log present on your systems?

Gr. Stefan

--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
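If the cluster log is missing or ends up somewhere unexpected, its location is controlled by mon options; a hedged ceph.conf sketch (option names as in Luminous; the values shown are illustrative, not the verified defaults):

```ini
# ceph.conf fragment for the monitor nodes. The cluster log
# (HEALTH_*, pgmap summaries, etc.) is written by the mons,
# independently of the per-daemon debug logs.
[mon]
mon_cluster_log_file = /var/log/ceph/$cluster.log
mon_cluster_log_file_level = info
```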
Re: [ceph-users] cephfs kernel client instability
We are experiencing the same issues on clients with CephFS mounted using the kernel client and 4.x kernels.

The problem shows up when we add new OSDs, on reboots after installing patches and when changing the weight.

Here are the logs of a misbehaving client:

[6242967.890611] libceph: mon4 10.8.55.203:6789 session established
[6242968.010242] libceph: osd534 10.7.55.23:6814 io error
[6242968.259616] libceph: mon1 10.7.55.202:6789 io error
[6242968.259658] libceph: mon1 10.7.55.202:6789 session lost, hunting for new mon
[6242968.359031] libceph: mon4 10.8.55.203:6789 session established
[6242968.622692] libceph: osd534 10.7.55.23:6814 io error
[6242968.692274] libceph: mon4 10.8.55.203:6789 io error
[6242968.692337] libceph: mon4 10.8.55.203:6789 session lost, hunting for new mon
[6242968.694216] libceph: mon0 10.7.55.201:6789 session established
[6242969.099862] libceph: mon0 10.7.55.201:6789 io error
[6242969.099888] libceph: mon0 10.7.55.201:6789 session lost, hunting for new mon
[6242969.224565] libceph: osd534 10.7.55.23:6814 io error

In addition to the MON io errors we also got some OSD io errors.

Moreover, when the error occurs, several clients cause a "MDS_CLIENT_LATE_RELEASE" error on the MDS server.

We are currently running Luminous 12.2.10 and have around 580 OSDs and 5 monitor nodes. The cluster is running on CentOS 7.6.

The 'osd_map_message_max' setting is set to the default value of 40. But we are still getting these errors.

Best,
Martin

On Wed, Jan 16, 2019 at 7:46 PM Ilya Dryomov wrote:
>
> On Wed, Jan 16, 2019 at 7:12 PM Andras Pataki wrote:
> >
> > Hi Ilya/Kjetil,
> >
> > I've done some debugging and tcpdump-ing to see what the interaction
> > between the kernel client and the mon looks like. Indeed -
> > CEPH_MSG_MAX_FRONT defined as 16Mb seems low for the default mon
> > messages for our cluster (with osd_mon_messages_max at 100).
> > We have about 3500 osd's, and the kernel advertises itself as older than
>
> This is too big, especially for a fairly large cluster such as yours.
> The default was reduced to 40 in luminous. Given about 3500 OSDs, you
> might want to set it to 20 or even 10.
>
> > Luminous, so it gets full map updates. The FRONT message size on the
> > wire I saw was over 24Mb. I'll try setting osd_mon_messages_max to 30
> > and do some more testing, but from the debugging it definitely seems
> > like the issue.
> >
> > Is the kernel driver really not up to date to be considered at least a
> > Luminous client by the mon (i.e. it has some feature really missing)?
> > I looked at the bits, and what the MON seems to want is bit 59 in ceph
> > features, shared by FS_BTIME, FS_CHANGE_ATTR, MSG_ADDR2. Can the kernel
> > client be used when setting require-min-compat-client to luminous
> > (either with the 4.19.x kernel or the RedHat/CentOS 7.6 kernel)? Some
> > background here would be helpful.
>
> Yes, the kernel client is missing support for that feature bit, however
> 4.13+ and RHEL 7.5+ _can_ be used with require-min-compat-client set to
> luminous. See
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/027002.html
>
> Thanks,
>
> Ilya
[ceph-users] create osd failed due to cephx authentication
Hi, I'm installing a new ceph cluster manually. I get errors when I create an osd:

# ceph-osd -i 0 --mkfs --mkkey
2019-01-24 17:07:44.045 7f45f497b1c0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2019-01-24 17:07:44.045 7f45f497b1c0 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication

Some information is provided below; did I miss anything?

# cat /etc/ceph/ceph.conf:
[global]
...

[osd.0]
host = ceph-osd1
osd data = /var/lib/ceph/osd/ceph-0
bluestore block path = /dev/disk/by-partlabel/bluestore_block_0
bluestore block db path = /dev/disk/by-partlabel/bluestore_block_db_0
bluestore block wal path = /dev/disk/by-partlabel/bluestore_block_wal_0

# ls /var/lib/ceph/osd/ceph-0
type
# cat /var/lib/ceph/osd/ceph-0/type
bluestore

# ceph -s
  cluster:
    id: 7712ab7e-3c38-44b3-96d3-4e1de9da0ff6
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3
    mgr: ceph-mon1(active), standbys: ceph-mon2, ceph-mon3
    osd: 1 osds: 0 up, 0 in
  data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs:

# ceph --version
ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
Re: [ceph-users] Commercial support
Hi,

On 23/01/2019 22:28, Ketil Froyn wrote:

How is the commercial support for Ceph? More specifically, I was recently pointed in the direction of the very interesting combination of CephFS, Samba and ctdb. Is anyone familiar with companies that provide commercial support for in-house solutions like this?

To add to the answers you've already had: Ubuntu also offer Ceph & Swift support: https://www.ubuntu.com/support/plans-and-pricing#storage

Croit offer their own managed Ceph product, but do also offer support/consulting for Ceph installs, I think: https://croit.io/

There are some smaller consultancies, too, including 42on which is run by Wido den Hollander who you will have seen posting here: https://www.42on.com/

Regards, Matthew

disclaimer: I have no commercial relationship to any of the above

--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
[ceph-users] logging of cluster status (Jewel vs Luminous and later)
Hi,

On our Jewel clusters, the mons keep a log of the cluster status e.g.

2019-01-24 14:00:00.028457 7f7a17bef700 0 log_channel(cluster) log [INF] : HEALTH_OK
2019-01-24 14:00:00.646719 7f7a46423700 0 log_channel(cluster) log [INF] : pgmap v66631404: 173696 pgs: 10 active+clean+scrubbing+deep, 173686 active+clean; 2271 TB data, 6819 TB used, 9875 TB / 16695 TB avail; 1313 MB/s rd, 236 MB/s wr, 12921 op/s

This is sometimes useful after a problem, to see when things started going wrong (which can be helpful for incident response and analysis) and so on. There doesn't appear to be any such logging in Luminous, either by mons or mgrs. What am I missing?

Thanks,

Matthew

--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.