[ceph-users] Memory usage of OSD
Hi, I noticed a strange situation in one of our clusters: the OSD daemons are taking too much RAM. We are running 12.2.12 with the default osd_memory_target (4 GiB). A heap dump shows:

osd.2969 dumping heap profile now.
MALLOC:     6381526944 ( 6085.9 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    173373288 (  165.3 MiB) Bytes in central cache freelist
MALLOC: +     17163520 (   16.4 MiB) Bytes in transfer cache freelist
MALLOC: +     95339512 (   90.9 MiB) Bytes in thread cache freelists
MALLOC: +     28995744 (   27.7 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   6696399008 ( 6386.2 MiB) Actual memory used (physical + swap)
MALLOC: +    218267648 (  208.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   6914666656 ( 6594.3 MiB) Virtual address space used
MALLOC:
MALLOC:         408276 Spans in use
MALLOC:             75 Thread heaps in use
MALLOC:           8192 Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). Bytes released to the OS take up virtual address space but no physical memory.

IMO "Bytes in use by application" should be less than osd_memory_target. Am I correct? I checked the heap dump with google-pprof and got the following results:

Total: 149.4 MB
  60.5  40.5%  40.5%   60.5  40.5%  rocksdb::UncompressBlockContentsForCompressionType
  34.2  22.9%  63.4%   34.2  22.9%  ceph::buffer::create_aligned_in_mempool
  11.9   7.9%  71.3%   12.1   8.1%  std::_Rb_tree::_M_emplace_hint_unique
  10.7   7.1%  78.5%   71.2  47.7%  rocksdb::ReadBlockContents

Does that mean most of the RAM is used by RocksDB? How can I take a deeper look into the memory usage?

Regards,
Rafał Wądołowski
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
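For a deeper look, the per-subsystem accounting from `ceph daemon osd.N dump_mempools` is usually more telling than the tcmalloc totals. Below is a minimal sketch that ranks the mempools by size; it assumes the `{"mempool": {"by_pool": {...}}}` JSON layout printed by recent releases (older Luminous builds may print a flatter structure):

```python
import json

def mempool_summary(dump_json: str):
    """Rank mempools from `ceph daemon osd.N dump_mempools` by bytes.

    Assumes the {"mempool": {"by_pool": {...}}} layout; adjust for
    releases that print the pools at the top level instead.
    """
    pools = json.loads(dump_json)["mempool"]["by_pool"]
    sizes = {name: p["bytes"] for name, p in pools.items()}
    total = sum(sizes.values())
    # Largest consumers first.
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    return total, ranked
```

Note that memory tcmalloc holds in its freelists is not accounted here, which is one reason the process RSS can exceed the mempool total.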
[ceph-users] Re: OSDs taking too much memory, for pglog
Hi Mark

Thank you for your feedback! The maximum number of PGs per OSD is only 123, but we have PGs with a lot of objects. For RGW there is an 8+3 EC pool with 1024 PGs holding 900M objects; maybe that is the problematic part. The OSDs are 510 hdd, 32 ssd.

Not sure: do you suggest using something like "ceph-objectstore-tool --op trim-pg-log"? When done correctly, would the risk be a lot of backfilling? Or also data loss?

Also, getting the cluster up is one thing; keeping it running seems to be a real challenge right now (OOM killer) ...

Cheers
Harry

On 13.05.20 07:10, Mark Nelson wrote:
Hi Herald, Changing the bluestore cache settings will have no effect at all on pglog memory consumption. You can try either reducing the number of PGs (you might want to check and see how many PGs you have and specifically how many PGs on that OSD), or decrease the number of pglog entries per PG. Keep in mind that fewer PG log entries may impact recovery. FWIW, 8.5GB of memory usage for pglog implies that you have a lot of PGs per OSD, so that's probably the first place to look. Good luck! Mark

On 5/12/20 5:10 PM, Harald Staub wrote:
Several OSDs of one of our clusters are down currently because RAM usage has increased during the last days. Now it is more than we can handle on some systems. Frequently OSDs get killed by the OOM killer. Looking at "ceph daemon osd.$OSD_ID dump_mempools", it shows that nearly all (about 8.5 GB) is taken by osd_pglog, e.g. "osd_pglog": { "items": 461859, "bytes": 8445595868 }, We tried to reduce it, with "osd memory target" and even with "bluestore cache autotune = false" (together with "bluestore cache size hdd"), but there was no effect at all. I remember the pglog_hardlimit parameter, but that is already set by default with Nautilus I read. I.e. this is on Nautilus, 14.2.8. Is there a way to limit this pglog memory?
Cheers Harry ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
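For what it's worth, the dump_mempools figures quoted in this thread can be turned into per-entry and per-PG numbers, which helps decide whether the problem is "too many PGs" or "huge log entries". A back-of-the-envelope sketch; the 123 PGs per OSD is taken from this thread, and everything else is simple arithmetic, not a measurement:

```python
def pglog_footprint(items: int, total_bytes: int, pgs_on_osd: int):
    """Derive average bytes per pglog item and per PG from the
    osd_pglog entry of `ceph daemon osd.N dump_mempools`."""
    per_item = total_bytes / items
    per_pg = total_bytes / pgs_on_osd
    return per_item, per_pg

# Numbers from the dump quoted above, assuming ~123 PGs on this OSD.
per_item, per_pg = pglog_footprint(461859, 8445595868, 123)
```

With these numbers each pglog item averages around 18 KB, which points at large log entries (or dup entries) rather than an excessive PG count.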
[ceph-users] Read speed low in cephfs volume exposed as samba share using vfs_ceph
Hi, I am running a small 3-node Ceph Nautilus 14.2.8 cluster on Ubuntu 18.04. I am testing exposing a CephFS volume as a Samba v4 share, so that users can later access it from Windows. Samba version is 4.7.6-Ubuntu and mount.cifs version is 6.8.

From the Ceph kernel mount, a dd write test reaches 600 MB/s, and reading the file back with md5sum runs at 300-400 MB/s. I exposed the same volume in Samba using "vfs_ceph" and mounted it via CIFS on another Ubuntu 18.04 client. There, dd write still reaches 600 MB/s, but reading the file with md5sum only runs at 65 MB/s. Reading the same file with smbclient instead gives 101 MB/s.

Why this difference? What could be the issue? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW STS Support in Nautilus ?
Matching other fields in the token as part of the Condition Statement is a work in progress, but isn't there in Nautilus. Thanks, Pritha On Tue, May 12, 2020 at 10:21 PM Wyllys Ingersoll < wyllys.ingers...@keepertech.com> wrote: > Does STS support using other fields from the token as part of the > Condition statement? For example looking for specific "sub" identities or > matching on custom token fields like lists of roles? > > > > On Tue, May 12, 2020 at 11:50 AM Matt Benjamin > wrote: > >> yay! thanks Wyllys, Pritha >> >> Matt >> >> On Tue, May 12, 2020 at 11:38 AM Wyllys Ingersoll >> wrote: >> > >> > >> > Thanks for the hint, I fixed my keycloak configuration for that >> application client so the token only includes a single audience value and >> now it works fine. >> > >> > thanks!! >> > >> > >> > On Tue, May 12, 2020 at 11:11 AM Wyllys Ingersoll < >> wyllys.ingers...@keepertech.com> wrote: >> >> >> >> The "aud" field in the introspection result is a list, not a single >> string. >> >> >> >> On Tue, May 12, 2020 at 11:02 AM Pritha Srivastava < >> prsri...@redhat.com> wrote: >> >>> >> >>> app_id must match with the 'aud' field in the token introspection >> result (In the example the value of 'aud' is 'customer-portal') >> >>> >> >>> Thanks, >> >>> Pritha >> >>> >> >>> On Tue, May 12, 2020 at 8:16 PM Wyllys Ingersoll < >> wyllys.ingers...@keepertech.com> wrote: >> >> >> Running Nautilus 14.2.9 and trying to follow the STS example given >> here: https://docs.ceph.com/docs/master/radosgw/STS/ to setup a policy >> for AssumeRoleWithWebIdentity using KeyCloak (8.0.1) as the OIDC provider. >> I am able to see in the rgw debug logs that the token being passed from the >> client is passing the introspection check, but it always ends up failing >> the final authorization to access the requested bucket resource and is >> rejected with a 403 status "AccessDenied". >> >> I configured my policy as described in the 2nd example on the STS >> page above. 
I suspect the problem is with the "StringEquals" condition >> statement in the AssumeRolePolicy document (I could be wrong though). >> >> The example shows using the keycloak URI followed by ":app_id" >> matching with the name of the keycloak client application >> ("customer-portal" in the example). My keycloak setup does not have any >> such field in the introspection result and I can't seem to figure out how >> to make this all work. >> >> I cranked up the logging to 20/20 and still did not see any hints as >> to what part of the policy is causing the access to be denied. >> >> Any suggestions? >> >> -Wyllys Ingersoll >> >> ___ >> Dev mailing list -- d...@ceph.io >> To unsubscribe send an email to dev-le...@ceph.io >> > >> > ___ >> > Dev mailing list -- d...@ceph.io >> > To unsubscribe send an email to dev-le...@ceph.io >> >> >> >> -- >> >> Matt Benjamin >> Red Hat, Inc. >> 315 West Huron Street, Suite 140A >> Ann Arbor, Michigan 48103 >> >> http://www.redhat.com/en/technologies/storage >> >> tel. 734-821-5101 >> fax. 734-769-8938 >> cel. 734-216-5309 >> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
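For reference, the shape of the trust policy from the docs example can be generated programmatically, which makes the `<issuer>:app_id` condition key easier to get right. A sketch assuming Keycloak-style issuer URLs; every name here is a placeholder mirroring the "customer-portal" example from the Ceph STS docs, not a value from a real deployment:

```python
import json

def assume_role_policy(provider_arn: str, issuer_host: str,
                       realm: str, client_id: str) -> str:
    """Trust policy for AssumeRoleWithWebIdentity that matches the
    Keycloak client id via the <issuer>:app_id condition key."""
    condition_key = f"{issuer_host}/auth/realms/{realm}:app_id"
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": [provider_arn]},
            "Action": ["sts:AssumeRoleWithWebIdentity"],
            "Condition": {"StringEquals": {condition_key: client_id}},
        }],
    }
    return json.dumps(policy)
```

As the thread shows, rgw also expects the token's "aud" claim to carry a single value matching this client id, so the Keycloak client mapper configuration matters as much as the policy text.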
[ceph-users] Re: OSD corruption and down PGs
Hello David

I have physical devices I can use to mirror the OSDs, no problem. But I don't think those disks are actually failing, since there are no bad sectors on them and they are brand new, with no issues reading from them. But they got a corrupt OSD superblock, which I believe happened because of a bad SAS controller or an unclean shutdown, and I can't find any way to get the data off them or repair the OSD superblock.

On Tue, May 12, 2020 at 11:47 PM David Turner wrote: > Do you have access to another Ceph cluster with enough available space to > create rbds that you dd these failing disks into? That's what I'm doing > right now with some failing disks. I've recovered 2 out of 6 osds that > failed in this way. I would recommend against using the same cluster for > this, but a stage cluster or something would be great. > > On Tue, May 12, 2020, 7:36 PM Kári Bertilsson > wrote: > >> Hi Paul >> >> I was able to mount both OSD's i need data from successfully using >> "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op fuse >> --mountpoint /osd92/" >> >> I see the PG slices that are missing in the mounted folder >> "41.b3s3_head" "41.ccs5_head" etc. And i can copy any data from inside the >> mounted folder and that works fine. >> >> But when i try to export it fails. I get the same error when trying to >> list. >> >> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op list >> --debug >> Output @ https://pastebin.com/nXScEL6L >> >> Any ideas ? >> >> On Tue, May 12, 2020 at 12:17 PM Paul Emmerich >> wrote: >> >> > First thing I'd try is to use objectstore-tool to scrape the >> > inactive/broken PGs from the dead OSDs using it's PG export feature. >> > Then import these PGs into any other OSD which will automatically >> recover >> > it. >> > >> > Paul >> > >> > -- >> > Paul Emmerich >> > >> > Looking for help with your Ceph cluster? Contact us at https://croit.io >> > >> > croit GmbH >> > Freseniusstr. 
31h >> > 81247 München >> > www.croit.io >> > Tel: +49 89 1896585 90 >> > >> > >> > On Tue, May 12, 2020 at 2:07 PM Kári Bertilsson >> > wrote: >> > >> >> Yes >> >> ceph osd df tree and ceph -s is at https://pastebin.com/By6b1ps1 >> >> >> >> On Tue, May 12, 2020 at 10:39 AM Eugen Block wrote: >> >> >> >> > Can you share your osd tree and the current ceph status? >> >> > >> >> > >> >> > Zitat von Kári Bertilsson : >> >> > >> >> > > Hello >> >> > > >> >> > > I had an incidence where 3 OSD's crashed at once completely and >> won't >> >> > power >> >> > > up. And during recovery 3 OSD's in another host have somehow become >> >> > > corrupted. I am running erasure coding with 8+2 setup using crush >> map >> >> > which >> >> > > takes 2 OSDs per host, and after losing the other 2 OSD i have few >> >> PG's >> >> > > down. Unfortunately these PG's seem to overlap almost all data on >> the >> >> > pool, >> >> > > so i believe the entire pool is mostly lost after only these 2% of >> >> PG's >> >> > > down. >> >> > > >> >> > > I am running ceph 14.2.9. >> >> > > >> >> > > OSD 92 log https://pastebin.com/5aq8SyCW >> >> > > OSD 97 log https://pastebin.com/uJELZxwr >> >> > > >> >> > > ceph-bluestore-tool repair without --deep showed "success" but >> OSD's >> >> > still >> >> > > fail with the log above. >> >> > > >> >> > > Log from trying ceph-bluestore-tool repair --deep which is still >> >> running, >> >> > > not sure if it will actually fix anything and log looks pretty bad. >> >> > > https://pastebin.com/gkqTZpY3 >> >> > > >> >> > > Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 >> >> --op >> >> > > list" gave me input/output error. But everything in SMART looks OK, >> >> and i >> >> > > see no indication of hardware read error in any logs. Same for both >> >> OSD. >> >> > > >> >> > > The OSD's with corruption have absolutely no bad sectors and likely >> >> have >> >> > > only a minor corruption but at important locations. 
>> >> > > >> >> > > Any ideas on how to recover this kind of scenario ? Any tips would >> be >> >> > > highly appreciated. >> >> > > >> >> > > Best regards, >> >> > > Kári Bertilsson >> >> > > ___ >> >> > > ceph-users mailing list -- ceph-users@ceph.io >> >> > > To unsubscribe send an email to ceph-users-le...@ceph.io >> >> > >> >> > >> >> > ___ >> >> > ceph-users mailing list -- ceph-users@ceph.io >> >> > To unsubscribe send an email to ceph-users-le...@ceph.io >> >> > >> >> ___ >> >> ceph-users mailing list -- ceph-users@ceph.io >> >> To unsubscribe send an email to ceph-users-le...@ceph.io >> >> >> > >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
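The export/import route Paul describes is scriptable; since `--op list` fails here but the fuse mount works, exporting each down PG individually is still worth trying. A sketch that only builds the command lines rather than running anything (the PG id and paths are placeholders from this thread, and the OSDs must be stopped before ceph-objectstore-tool touches them):

```python
def export_cmd(data_path: str, pgid: str, out_file: str) -> list:
    """ceph-objectstore-tool invocation to export one PG from a
    stopped OSD's data directory into a file."""
    return ["ceph-objectstore-tool", "--data-path", data_path,
            "--pgid", pgid, "--op", "export", "--file", out_file]

def import_cmd(data_path: str, in_file: str) -> list:
    """Matching import of the exported PG into another stopped OSD."""
    return ["ceph-objectstore-tool", "--data-path", data_path,
            "--op", "import", "--file", in_file]
```

Once an imported PG's OSD is started, recovery picks the PG up automatically, as Paul noted.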
[ceph-users] Difficulty creating a topic for bucket notifications
Hi, I am trying to create a topic so that I can use it to listen for object-creation notifications on a bucket. If I make my API call without supplying AWS authorization headers, the topic creation succeeds and the topic can be seen with a ListTopics call. However, in order to attach a topic to a bucket, the topic and bucket must have the same owner, so I tried creating the topic using AWS auth.

The credential header I tried was the same as what I use for get/put on a bucket: Credential=/20200512/us-east-1/s3/aws4_request. In this case, rather than succeeding, I get a NotImplemented error. If I change the service in the credential to something other than s3, e.g. Credential=/20200512/us-east-1/sns/aws4_request, I instead get a SignatureDoesNotMatch error.

What is the right way to authenticate a CreateTopic request?

Thanks, Alexis ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Unable to reshard bucket
Thank you. I looked through both logs and noticed this in the cancel one:

osd_op(unknown.0.0:4164 41.2 41:55b0279d:reshard::reshard.09:head [call rgw.reshard_remove] snapc 0=[] ondisk+write+known_if_redirected e24984) v8 -- 0x7fe9b3625710 con 0
osd_op_reply(4164 reshard.09 [call] v24984'105796943 uv105796922 ondisk = -2 ((2) No such file or directory)) v8 162+0+0 (203651653 0 0) 0x7fe9880044a0 con 0x7fe9b3625b70
ERROR: failed to remove entry from reshard log, oid=reshard.09 tenant= bucket=foo

Is there anything else that I should look for? It looks like the cancel process thinks that reshard.09 is present (and probably blocking my attempts at resharding), but it is not actually there and thus cannot be removed. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Unable to reshard bucket
Perhaps the next step is to examine the generated logs from:

radosgw-admin reshard status --bucket=foo --debug-rgw=20 --debug-ms=1
radosgw-admin reshard cancel --bucket foo --debug-rgw=20 --debug-ms=1

Eric -- J. Eric Ivancich he / him / his Red Hat Storage Ann Arbor, Michigan, USA

> On May 11, 2020, at 12:25 PM, Timothy Geier wrote: > > Hello all, > > I'm having an issue with a bucket that refuses to be resharded..for the > record, the cluster was recently upgraded from 13.2.4 to 13.2.10. > > # radosgw-admin reshard add --bucket foo --num-shards 3300 > ERROR: the bucket is currently undergoing resharding and cannot be added to > the reshard list at this time > > # radosgw-admin reshard list > [] > > # radosgw-admin reshard status --bucket=foo > [ > { > "reshard_status": "not-resharding", > "new_bucket_instance_id": "", > "num_shards": -1 > }, > > > # radosgw-admin reshard cancel --bucket foo > ERROR: failed to remove entry from reshard log, oid=reshard.09 > tenant= bucket=foo > > # radosgw-admin reshard stale-instances list > [] > > Is there anything else I should check to troubleshoot this? I was able to > reshard another bucket since the upgrade, so I suspect there's something > lingering that's blocking this. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
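When comparing the status output across shards, a small filter over the JSON that `radosgw-admin reshard status` prints can confirm whether any shard still claims an in-progress reshard. A sketch; the field values mirror the output quoted in this thread, and treating a numeric 0 as "not resharding" for older releases is an assumption:

```python
import json

def stuck_shards(status_json: str) -> list:
    """Return the entries from `radosgw-admin reshard status --bucket=...`
    that still claim an in-progress reshard."""
    entries = json.loads(status_json)
    return [e for e in entries
            if e.get("reshard_status") not in ("not-resharding", 0)]
```

If this returns nothing while `reshard add` still reports a reshard in progress, the leftover state is in the reshard log object rather than the bucket instance, which matches the failing rgw.reshard_remove call in the logs.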
[ceph-users] Re: Cluster network and public network
> I think, however, that a disappearing back network has no real consequences > as the heartbeats always go over both. FWIW this has not been my experience, at least through Luminous. What I’ve seen is that when the cluster/replication net is configured but unavailable, OSD heartbeats fail and peers report them to the mons as down. The mons send out a map accordingly, and the affected OSDs report “I’m not dead yet!”. Flap flap flap. YMMV ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster network and public network
Hi MJ,

this should work. Note that when using cloned devices all traffic will go through the same VLAN. In that case, I believe you can simply remove the cluster network definition and use just one IP; there is no point having a second IP on the same VLAN. You will probably have to set "noout,nodown" for the flip-over, which probably requires a restart of each OSD. I think, however, that a disappearing back network has no real consequences as the heartbeats always go over both. There might be stuck replication traffic for a while, but even this can be avoided with "osd pause".

Our configuration with 2 VLANs is this:

public network: ceph0.81: flags=4163 mtu 9000
cluster network: ceph0.82: flags=4163 mtu 9000
ceph0: flags=5187 mtu 9000
em1: flags=6211 mtu 9000
em2: flags=6211 mtu 9000
p1p1: flags=6211 mtu 9000
p1p2: flags=6211 mtu 9000
p2p1: flags=6211 mtu 9000
p2p2: flags=6211 mtu 9000

If you already have 2 VLANs with different IDs, then this flip-over is trivial. I did it without a service outage.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: mj Sent: 12 May 2020 13:12:47 To: ceph-users@ceph.io Subject: [ceph-users] Re: Cluster network and public network

Hi,

On 11/05/2020 08:50, Wido den Hollander wrote: > Great to hear! I'm still behind this idea and all the clusters I design > have a single (or LACP) network going to the host. > > One IP address per node where all traffic goes over. That's Ceph, SSH, > (SNMP) Monitoring, etc. > > Wido

We have an 'old-style' cluster with a separated LAN/cluster network. We would like to move over to the 'new-style'. Is it as easy as: define the NICs in a 2x10G LACP bond0, add both NICs to the bond0 config, and configure like:

> auto bond0
> iface bond0 inet static
> address 192.168.0.5
> netmask 255.255.255.0

and add our cluster IP as a second IP, like

> auto bond0:1
> iface bond0:1 inet static
> address 192.168.10.160
> netmask 255.255.255.0

On all nodes, reboot, and everything will work? 
Or are there ceph specifics to consider? Thanks, MJ ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph Apply/Commit vs Read/Write Op Latency
Hello, Bumping this in hopes that someone can shed some light on this. I've tried to find details on these metrics but I've come up empty handed. Thank you, John ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] DocuBetter Meeting -- EMEA 13 May 2020
There is a general documentation meeting called the "DocuBetter Meeting", and it is held every two weeks. The next DocuBetter Meeting will be on 13 May 2020 at 0830 PST, and will run for thirty minutes. Everyone with a documentation-related request or complaint is invited. The meeting will be held here: https://bluejeans.com/908675367 Send documentation-related requests and complaints to me by replying to this email and CCing me at zac.do...@gmail.com. The next DocuBetter meeting is scheduled for: 13 May 2020 0830 PST 13 May 2020 1630 UTC 14 May 2020 0230 AEST Etherpad: https://pad.ceph.com/p/Ceph_Documentation Meeting: https://bluejeans.com/908675367 Thanks, everyone. Zac Dover ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Zeroing out rbd image or volume
Thanks a lot, all. Looks like dd'ing zeros does not help much with improving security, but OSD encryption would be sufficient. best regards, Samuel huxia...@horebdata.cn From: Wido den Hollander Date: 2020-05-12 14:03 To: Paul Emmerich; Dillaman, Jason CC: Marc Roos; ceph-users Subject: [ceph-users] Re: Zeroing out rbd image or volume On 5/12/20 1:54 PM, Paul Emmerich wrote: > And many hypervisors will turn writing zeroes into an unmap/trim (qemu > detect-zeroes=unmap), so running trim on the entire empty disk is often the > same as writing zeroes. > So +1 for encryption being the proper way here > +1 And to add to this: No, a newly created RBD image will never have 'left over' bits and bytes from a previous RBD image. I had to explain this multiple times to people who were used to old (i)SCSI setups where partitions could have leftover data from a previously created LUN. With RBD this won't happen. Wido > > Paul > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW STS Support in Nautilus ?
yay! thanks Wyllys, Pritha Matt On Tue, May 12, 2020 at 11:38 AM Wyllys Ingersoll wrote: > > > Thanks for the hint, I fixed my keycloak configuration for that application > client so the token only includes a single audience value and now it works > fine. > > thanks!! > > > On Tue, May 12, 2020 at 11:11 AM Wyllys Ingersoll > wrote: >> >> The "aud" field in the introspection result is a list, not a single string. >> >> On Tue, May 12, 2020 at 11:02 AM Pritha Srivastava >> wrote: >>> >>> app_id must match with the 'aud' field in the token introspection result >>> (In the example the value of 'aud' is 'customer-portal') >>> >>> Thanks, >>> Pritha >>> >>> On Tue, May 12, 2020 at 8:16 PM Wyllys Ingersoll >>> wrote: Running Nautilus 14.2.9 and trying to follow the STS example given here: https://docs.ceph.com/docs/master/radosgw/STS/ to setup a policy for AssumeRoleWithWebIdentity using KeyCloak (8.0.1) as the OIDC provider. I am able to see in the rgw debug logs that the token being passed from the client is passing the introspection check, but it always ends up failing the final authorization to access the requested bucket resource and is rejected with a 403 status "AccessDenied". I configured my policy as described in the 2nd example on the STS page above. I suspect the problem is with the "StringEquals" condition statement in the AssumeRolePolicy document (I could be wrong though). The example shows using the keycloak URI followed by ":app_id" matching with the name of the keycloak client application ("customer-portal" in the example). My keycloak setup does not have any such field in the introspection result and I can't seem to figure out how to make this all work. I cranked up the logging to 20/20 and still did not see any hints as to what part of the policy is causing the access to be denied. Any suggestions? 
-Wyllys Ingersoll ___ Dev mailing list -- d...@ceph.io To unsubscribe send an email to dev-le...@ceph.io > > ___ > Dev mailing list -- d...@ceph.io > To unsubscribe send an email to dev-le...@ceph.io -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW STS Support in Nautilus ?
Thanks for the hint, I fixed my keycloak configuration for that application client so the token only includes a single audience value and now it works fine. thanks!! On Tue, May 12, 2020 at 11:11 AM Wyllys Ingersoll < wyllys.ingers...@keepertech.com> wrote: > The "aud" field in the introspection result is a list, not a single string. > > On Tue, May 12, 2020 at 11:02 AM Pritha Srivastava > wrote: > >> app_id must match with the 'aud' field in the token introspection result >> (In the example the value of 'aud' is 'customer-portal') >> >> Thanks, >> Pritha >> >> On Tue, May 12, 2020 at 8:16 PM Wyllys Ingersoll < >> wyllys.ingers...@keepertech.com> wrote: >> >>> >>> Running Nautilus 14.2.9 and trying to follow the STS example given here: >>> https://docs.ceph.com/docs/master/radosgw/STS/ to setup a policy >>> for AssumeRoleWithWebIdentity using KeyCloak (8.0.1) as the OIDC provider. >>> I am able to see in the rgw debug logs that the token being passed from the >>> client is passing the introspection check, but it always ends up failing >>> the final authorization to access the requested bucket resource and is >>> rejected with a 403 status "AccessDenied". >>> >>> I configured my policy as described in the 2nd example on the STS page >>> above. I suspect the problem is with the "StringEquals" condition statement >>> in the AssumeRolePolicy document (I could be wrong though). >>> >>> The example shows using the keycloak URI followed by ":app_id" matching >>> with the name of the keycloak client application ("customer-portal" in the >>> example). My keycloak setup does not have any such field in the >>> introspection result and I can't seem to figure out how to make this all >>> work. >>> >>> I cranked up the logging to 20/20 and still did not see any hints as to >>> what part of the policy is causing the access to be denied. >>> >>> Any suggestions? 
>>> >>> -Wyllys Ingersoll >>> >>> ___ >>> Dev mailing list -- d...@ceph.io >>> To unsubscribe send an email to dev-le...@ceph.io >>> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW STS Support in Nautilus ?
The "aud" field in the introspection result is a list, not a single string. On Tue, May 12, 2020 at 11:02 AM Pritha Srivastava wrote: > app_id must match with the 'aud' field in the token introspection result > (In the example the value of 'aud' is 'customer-portal') > > Thanks, > Pritha > > On Tue, May 12, 2020 at 8:16 PM Wyllys Ingersoll < > wyllys.ingers...@keepertech.com> wrote: > >> >> Running Nautilus 14.2.9 and trying to follow the STS example given here: >> https://docs.ceph.com/docs/master/radosgw/STS/ to setup a policy >> for AssumeRoleWithWebIdentity using KeyCloak (8.0.1) as the OIDC provider. >> I am able to see in the rgw debug logs that the token being passed from the >> client is passing the introspection check, but it always ends up failing >> the final authorization to access the requested bucket resource and is >> rejected with a 403 status "AccessDenied". >> >> I configured my policy as described in the 2nd example on the STS page >> above. I suspect the problem is with the "StringEquals" condition statement >> in the AssumeRolePolicy document (I could be wrong though). >> >> The example shows using the keycloak URI followed by ":app_id" matching >> with the name of the keycloak client application ("customer-portal" in the >> example). My keycloak setup does not have any such field in the >> introspection result and I can't seem to figure out how to make this all >> work. >> >> I cranked up the logging to 20/20 and still did not see any hints as to >> what part of the policy is causing the access to be denied. >> >> Any suggestions? >> >> -Wyllys Ingersoll >> >> ___ >> Dev mailing list -- d...@ceph.io >> To unsubscribe send an email to dev-le...@ceph.io >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW STS Support in Nautilus ?
app_id must match with the 'aud' field in the token introspection result (In the example the value of 'aud' is 'customer-portal') Thanks, Pritha On Tue, May 12, 2020 at 8:16 PM Wyllys Ingersoll < wyllys.ingers...@keepertech.com> wrote: > > Running Nautilus 14.2.9 and trying to follow the STS example given here: > https://docs.ceph.com/docs/master/radosgw/STS/ to setup a policy > for AssumeRoleWithWebIdentity using KeyCloak (8.0.1) as the OIDC provider. > I am able to see in the rgw debug logs that the token being passed from the > client is passing the introspection check, but it always ends up failing > the final authorization to access the requested bucket resource and is > rejected with a 403 status "AccessDenied". > > I configured my policy as described in the 2nd example on the STS page > above. I suspect the problem is with the "StringEquals" condition statement > in the AssumeRolePolicy document (I could be wrong though). > > The example shows using the keycloak URI followed by ":app_id" matching > with the name of the keycloak client application ("customer-portal" in the > example). My keycloak setup does not have any such field in the > introspection result and I can't seem to figure out how to make this all > work. > > I cranked up the logging to 20/20 and still did not see any hints as to > what part of the policy is causing the access to be denied. > > Any suggestions? > > -Wyllys Ingersoll > > ___ > Dev mailing list -- d...@ceph.io > To unsubscribe send an email to dev-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
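For readers following along: the 2nd example in the STS docs pairs the OIDC provider URL with `:app_id` in a StringEquals condition, and that value must equal the token's `aud` claim. Below is a hedged reconstruction of such a trust policy; the realm URL is a placeholder, and only the "customer-portal" value comes from the docs example discussed here.

```shell
# Placeholder realm URL; the ":app_id" value must match the single "aud"
# value that token introspection returns ("customer-portal" in the docs
# example this thread refers to).
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": ["arn:aws:iam:::oidc-provider/keycloak.example.com/auth/realms/demo"]
    },
    "Action": ["sts:AssumeRoleWithWebIdentity"],
    "Condition": {
      "StringEquals": {
        "keycloak.example.com/auth/realms/demo:app_id": "customer-portal"
      }
    }
  }]
}
EOF
python3 -m json.tool trust-policy.json
```

As the thread concludes further up, if Keycloak emits a list of audiences the condition will not match; configure the client so the token carries a single matching audience value.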
[ceph-users] Re: Add lvm in cephadm
Hello, Thank you very much Joshua, it worked. I have set up three nodes with the cephadm tool, which was very easy. But I asked myself: what if node 1 goes down? Before cephadm I could just manage everything from the other nodes with the ceph commands. Now I'm a bit stuck, because this cephadm container is just running on one node. I've installed it on the second one, but I'm getting a "[errno 13] RADOS permission denied (error connecting to the cluster)". Do I need some special "cephadm" keyring from the first node? Which one? And where to put it? Cephadm might be an easy-to-handle solution, but for me as a beginner, the added layer is very complicated to get into. We are trying to build a new Ceph cluster (never got in touch with it before) but I might not go with Octopus, and instead use Nautilus with ceph-deploy. That's a bit easier to understand, and the documentation out there is way better. Thanks in advance, Simon From: Joshua Schmid Sent: Tuesday, 5 May 2020 16:39:29 To: Simon Sutter Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: Add lvm in cephadm On 20/05/05 08:46, Simon Sutter wrote: > Sorry I missclicked, here the second part: > > ceph-volume --cluster ceph lvm prepare --data /dev/centos_node1/ceph > But that gives me just: > > Running command: /usr/bin/ceph-authtool --gen-print-key > Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new f3b442b1-68f7-456a-9991-92254e7c9c30 > stderr: [errno 13] RADOS permission denied (error connecting to the cluster) > --> RuntimeError: Unable to create a new OSD id Hey Simon, This still works but is now encapsulated in a cephadm command: ceph orch daemon add osd <host>:<vg>/<lv> so in your case: ceph orch daemon add osd $host:centos_node1/ceph hth -- Joshua Schmid Software Engineer SUSE Enterprise Storage ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
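A dry-run sketch of both steps discussed here. The `run` wrapper only prints the commands; `node1`/`node2` are placeholder host names, and copying the admin keyring is one workable way to let a second node run ceph commands, not the only one.

```shell
run() { echo "+ $*"; }  # print only; remove the wrapper to execute for real

# create the OSD from the existing LV, as Joshua described:
run ceph orch daemon add osd node1:centos_node1/ceph

# to run ceph commands from a second node, it needs the cluster config
# and a keyring; copying the admin keyring from the bootstrap node works:
run scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring node2:/etc/ceph/
```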
[ceph-users] Re: rgw user access questions
Rgw users are a higher-level feature, and they don't have a direct relationship to rados pools. Their permissions are controlled at the bucket/object level by the S3/Swift APIs. I would start by reading about S3's ACLs and bucket policies. On Mon, May 11, 2020 at 1:42 AM Vishwas Bm wrote: > > Hi, > > I am a newbie to ceph. I have gone through the ceph docs, we are planning > to use rgw for object storage. > > From the docs, what I have understood is that there are two types of users: > 1) ceph storage user > 2) radosgw user > > I am able to create user of both the types. But I am not able to understand > how to restrict the rgw user access to a pool. > > My questions are below: > 1) How to restrict the access of a rgw user to a particular pool ? Can this > be done using placement groups ? > > 2) Is it possible to restrict rgw user access to a particular namespace in > a pool ? > > 3) I can understand the flow till he is able to write to a bucket using the > .index pool object. But I am not able to understand the flow how the rgw > user can write objects in pool. Where can I check the permissions ? > > *Thanks & Regards,* > > *Vishwas * > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
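To make that concrete, here is a hedged example of a bucket policy granting one RGW user read-only access to one bucket. The user name `reader` and bucket `projectdata` are made-up illustrations; applying the policy with s3cmd's `setpolicy` is one option.

```shell
# Hypothetical bucket policy: user "reader" may list the bucket and read
# objects, nothing else. Names are examples, not from the thread.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/reader"]},
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::projectdata",
      "arn:aws:s3:::projectdata/*"
    ]
  }]
}
EOF
# sanity-check the JSON; applying it would look like:
#   s3cmd setpolicy policy.json s3://projectdata
python3 -m json.tool policy.json
```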
[ceph-users] Re: OSD corruption and down PGs
First thing I'd try is to use objectstore-tool to scrape the inactive/broken PGs from the dead OSDs using its PG export feature. Then import these PGs into any other OSD, which will automatically recover it. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Tue, May 12, 2020 at 2:07 PM Kári Bertilsson wrote: > Yes, ceph osd df tree and ceph -s is at https://pastebin.com/By6b1ps1 > > On Tue, May 12, 2020 at 10:39 AM Eugen Block wrote: > > Can you share your osd tree and the current ceph status? > > > > Zitat von Kári Bertilsson: > > > Hello, I had an incident where 3 OSDs crashed at once completely and won't power up. And during recovery 3 OSDs in another host have somehow become corrupted. I am running erasure coding with an 8+2 setup using a crush map which takes 2 OSDs per host, and after losing the other 2 OSDs I have a few PGs down. Unfortunately these PGs seem to overlap almost all data on the pool, so I believe the entire pool is mostly lost after only these 2% of PGs down. > > > I am running ceph 14.2.9. > > > OSD 92 log https://pastebin.com/5aq8SyCW > > > OSD 97 log https://pastebin.com/uJELZxwr > > > ceph-bluestore-tool repair without --deep showed "success" but the OSDs still fail with the log above. > > > Log from trying ceph-bluestore-tool repair --deep, which is still running; not sure if it will actually fix anything and the log looks pretty bad: https://pastebin.com/gkqTZpY3 > > > Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --op list" gave me an input/output error. But everything in SMART looks OK, and I see no indication of a hardware read error in any logs. Same for both OSDs.
> > > > > > The OSD's with corruption have absolutely no bad sectors and likely > have > > > only a minor corruption but at important locations. > > > > > > Any ideas on how to recover this kind of scenario ? Any tips would be > > > highly appreciated. > > > > > > Best regards, > > > Kári Bertilsson > > > ___ > > > ceph-users mailing list -- ceph-users@ceph.io > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
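Paul's export/import suggestion corresponds roughly to the following dry-run sketch. The `run` wrapper only prints the commands; the PG id `2.1a` and the OSD ids are placeholders, and both OSDs must be stopped while ceph-objectstore-tool operates on their data directories.

```shell
run() { echo "+ $*"; }  # print only: the real commands need stopped OSDs

# export a down PG from the dead-but-still-readable OSD:
run ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op export --pgid 2.1a --file /root/pg2.1a.export

# import it into any healthy OSD (also stopped), then start that OSD so
# the cluster can recover the PG from there:
run systemctl stop ceph-osd@10
run ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 --op import --file /root/pg2.1a.export
run systemctl start ceph-osd@10
```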
[ceph-users] Re: OSD corruption and down PGs
Yes ceph osd df tree and ceph -s is at https://pastebin.com/By6b1ps1 On Tue, May 12, 2020 at 10:39 AM Eugen Block wrote: > Can you share your osd tree and the current ceph status? > > > Zitat von Kári Bertilsson : > > > Hello > > > > I had an incidence where 3 OSD's crashed at once completely and won't > power > > up. And during recovery 3 OSD's in another host have somehow become > > corrupted. I am running erasure coding with 8+2 setup using crush map > which > > takes 2 OSDs per host, and after losing the other 2 OSD i have few PG's > > down. Unfortunately these PG's seem to overlap almost all data on the > pool, > > so i believe the entire pool is mostly lost after only these 2% of PG's > > down. > > > > I am running ceph 14.2.9. > > > > OSD 92 log https://pastebin.com/5aq8SyCW > > OSD 97 log https://pastebin.com/uJELZxwr > > > > ceph-bluestore-tool repair without --deep showed "success" but OSD's > still > > fail with the log above. > > > > Log from trying ceph-bluestore-tool repair --deep which is still running, > > not sure if it will actually fix anything and log looks pretty bad. > > https://pastebin.com/gkqTZpY3 > > > > Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --op > > list" gave me input/output error. But everything in SMART looks OK, and i > > see no indication of hardware read error in any logs. Same for both OSD. > > > > The OSD's with corruption have absolutely no bad sectors and likely have > > only a minor corruption but at important locations. > > > > Any ideas on how to recover this kind of scenario ? Any tips would be > > highly appreciated. 
> > > > Best regards, > > Kári Bertilsson > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Zeroing out rbd image or volume
On 5/12/20 1:54 PM, Paul Emmerich wrote: > And many hypervisors will turn writing zeroes into an unmap/trim (qemu > detect-zeroes=unmap), so running trim on the entire empty disk is often the > same as writing zeroes. > So +1 for encryption being the proper way here > +1 And to add to this: No, a newly created RBD image will never have 'left over' bits and bytes from a previous RBD image. I had to explain this multiple times to people who were used to old (i)SCSI setups where partitions could have leftover data from a previously created LUN. With RBD this won't happen. Wido > > Paul > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Zeroing out rbd image or volume
And many hypervisors will turn writing zeroes into an unmap/trim (qemu detect-zeroes=unmap), so running trim on the entire empty disk is often the same as writing zeroes. So +1 for encryption being the proper way here Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Tue, May 12, 2020 at 1:52 PM Jason Dillaman wrote: > I would also like to add that the OSDs can (and will) use redirect on write > techniques (not to mention the physical device hardware as well). > Therefore, your zeroing of the device might just cause the OSDs to allocate > new extents of zeros while the old extents remain intact (albeit > unreferenced and available for future writes). The correct solution would > be to layer LUKS/dm-crypt on top of the RBD device if you need a strong > security guarantee about a specific image, or use encrypted OSDs if the > concern is about the loss of the OSD physical device. > > On Tue, May 12, 2020 at 6:58 AM Marc Roos > wrote: > > > > > dd if=/dev/zero of=rbd :) but if you have encrypted osd's, what > > would be the use of this? > > > > > > > > -Original Message- > > From: huxia...@horebdata.cn [mailto:huxia...@horebdata.cn] > > Sent: 12 May 2020 12:55 > > To: ceph-users > > Subject: [ceph-users] Zeroing out rbd image or volume > > > > Hi, Ceph folks, > > > > Is there a rbd command, or any other way, to zero out rbd images or > > volume? I would like to write all zero data to an rbd image/volume > > before remove it. > > > > Any comments would be appreciated. 
> > > > best regards, > > > > samuel > > Horebdata AG > > Switzerland > > > > > > > > > > huxia...@horebdata.cn > > ___ > > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > > email to ceph-users-le...@ceph.io > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > -- > Jason > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Zeroing out rbd image or volume
I would also like to add that the OSDs can (and will) use redirect on write techniques (not to mention the physical device hardware as well). Therefore, your zeroing of the device might just cause the OSDs to allocate new extents of zeros while the old extents remain intact (albeit unreferenced and available for future writes). The correct solution would be to layer LUKS/dm-crypt on top of the RBD device if you need a strong security guarantee about a specific image, or use encrypted OSDs if the concern is about the loss of the OSD physical device. On Tue, May 12, 2020 at 6:58 AM Marc Roos wrote: > > dd if=/dev/zero of=rbd :) but if you have encrypted osd's, what > would be the use of this? > > > > -Original Message- > From: huxia...@horebdata.cn [mailto:huxia...@horebdata.cn] > Sent: 12 May 2020 12:55 > To: ceph-users > Subject: [ceph-users] Zeroing out rbd image or volume > > Hi, Ceph folks, > > Is there a rbd command, or any other way, to zero out rbd images or > volume? I would like to write all zero data to an rbd image/volume > before remove it. > > Any comments would be appreciated. > > best regards, > > samuel > Horebdata AG > Switzerland > > > > > huxia...@horebdata.cn > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > -- Jason ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
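Jason's suggestion of layering LUKS/dm-crypt on top of the RBD device could look like this dry-run sketch. The `run` wrapper only prints the commands; the pool/image names and the `/dev/rbd0` device path are assumptions for illustration.

```shell
run() { echo "+ $*"; }  # print only; the real commands need root and a cluster

run rbd map mypool/myimage            # assume this maps the image to /dev/rbd0
run cryptsetup luksFormat /dev/rbd0   # encrypt the block device
run cryptsetup luksOpen /dev/rbd0 secure-rbd
run mkfs.xfs /dev/mapper/secure-rbd   # use the mapped device from here on

# later, destroying the LUKS keyslots makes all data unreadable without
# having to zero the whole image:
run cryptsetup luksErase /dev/rbd0
```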
[ceph-users] Re: Write Caching to hot tier not working as expected
Thanks Eric. Using your command for SET reported that the OSD may need a restart (which sets it back to default anyway), but the below seems to work:

ceph tell osd.24 config set objecter_inflight_op_bytes 1073741824
ceph tell osd.24 config set objecter_inflight_ops 10240

Reading back the settings looks right:

[root@ceph00 ~]# ceph daemon osd.24 config show | grep objecter
"debug_objecter": "0/1",
"objecter_completion_locks_per_session": "32",
"objecter_debug_inject_relock_delay": "false",
"objecter_inflight_op_bytes": "1073741824",
"objecter_inflight_ops": "10240",
"objecter_inject_no_watch_ping": "false",
"objecter_retry_writes_after_first_reply": "false",
"objecter_tick_interval": "5.00",
"objecter_timeout": "10.00",
"osd_objecter_finishers": "1",

I've done that for the three OSDs that are in the cache tier. But the performance is unchanged - the writes still spill over to the HDD pool. Still, your idea sounds close - it does feel like something in the cache tier is hitting a limit. Regards, Steve

-----Original Message-----
From: Eric Smith
Sent: Monday, 11 May 2020 9:11 PM
To: Steve Hughes ; ceph-users@ceph.io
Subject: RE: [ceph-users] Re: Write Caching to hot tier not working as expected

Reading and setting them should be pretty easy:

READ (run from the host where the OSD is hosted):
ceph daemon osd.<id> config show | grep objecter

SET (assuming these can be set in memory):
ceph tell osd.<id> injectargs "--objecter-inflight-op-bytes=1073741824" (change to a 1 GB throttle)

To persist these you should add them to the ceph.conf (I'm not sure what section though - you might have to test this). And yes - the information is sketchy, I agree - I don't really have any input here. That's the best I can do for now. Eric

-----Original Message-----
From: Steve Hughes
Sent: Monday, May 11, 2020 6:44 AM
To: Eric Smith ; ceph-users@ceph.io
Subject: RE: [ceph-users] Re: Write Caching to hot tier not working as expected

It sounds like you might be bumping up against the default objecter_inflight_ops (1024) and/or objecter_inflight_op_bytes (100MB).
Though I'm surprised to bump into something like that on such a small system and at such low bandwidth. But the information I can find on those parameters is sketchy to say the least. Can you point me at some doco that explains what they do, how to read the current values and how to set them? Cheers, Steve -Original Message- From: Eric Smith Sent: Monday, 11 May 2020 8:00 PM To: Steve Hughes ; ceph-users@ceph.io Subject: RE: [ceph-users] Re: Write Caching to hot tier not working as expected It sounds like you might be bumping up against the default objecter_inflight_ops (1024) and/or objecter_inflight_op_bytes (100MB). -Original Message- From: ste...@scalar.com.au Sent: Monday, May 11, 2020 5:48 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Write Caching to hot tier not working as expected Interestingly, I have found that if I limit the rate at which data is written the tiering behaves as expected. I'm using a robocopy job from a Windows VM to copy large files from my existing storage array to a test Ceph volume. By using the /IPG parameter I can roughly control the rate at which data is written. I've found that if I limit the write rate to around 30MBytes/sec the data all goes to the hot tier, zero data goes to the HDD tier, and the observed write latency is about 5msec. If I go any higher than this I see data being written to the HDDs and the observed write latency goes way up. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster network and public network
Hi, On 11/05/2020 08:50, Wido den Hollander wrote:

> Great to hear! I'm still behind this idea and all the clusters I design
> have a single (or LACP) network going to the host.
>
> One IP address per node where all traffic goes over. That's Ceph, SSH,
> (SNMP) Monitoring, etc.
>
> Wido

We have an 'old-style' cluster with a separated LAN/cluster network. We would like to move over to the 'new-style'. Is it as easy as: define the NICs in a 2x10G LACP bond0 and configure it like:

auto bond0
iface bond0 inet static
address 192.168.0.5
netmask 255.255.255.0

and add our cluster IP as a second IP, like

auto bond0:1
iface bond0:1 inet static
address 192.168.10.160
netmask 255.255.255.0

On all nodes, reboot, and everything will work? Or are there ceph specifics to consider? Thanks, MJ ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Zeroing out rbd image or volume
dd if=/dev/zero of=rbd :) but if you have encrypted osd's, what would be the use of this? -Original Message- From: huxia...@horebdata.cn [mailto:huxia...@horebdata.cn] Sent: 12 May 2020 12:55 To: ceph-users Subject: [ceph-users] Zeroing out rbd image or volume Hi, Ceph folks, Is there a rbd command, or any other way, to zero out rbd images or volume? I would like to write all zero data to an rbd image/volume before remove it. Any comments would be appreciated. best regards, samuel Horebdata AG Switzerland huxia...@horebdata.cn ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Zeroing out rbd image or volume
Hi, Ceph folks, Is there a rbd command, or any other way, to zero out rbd images or volume? I would like to write all zero data to an rbd image/volume before remove it. Any comments would be appreciated. best regards, samuel Horebdata AG Switzerland huxia...@horebdata.cn ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD corruption and down PGs
Can you share your osd tree and the current ceph status? Zitat von Kári Bertilsson : Hello I had an incidence where 3 OSD's crashed at once completely and won't power up. And during recovery 3 OSD's in another host have somehow become corrupted. I am running erasure coding with 8+2 setup using crush map which takes 2 OSDs per host, and after losing the other 2 OSD i have few PG's down. Unfortunately these PG's seem to overlap almost all data on the pool, so i believe the entire pool is mostly lost after only these 2% of PG's down. I am running ceph 14.2.9. OSD 92 log https://pastebin.com/5aq8SyCW OSD 97 log https://pastebin.com/uJELZxwr ceph-bluestore-tool repair without --deep showed "success" but OSD's still fail with the log above. Log from trying ceph-bluestore-tool repair --deep which is still running, not sure if it will actually fix anything and log looks pretty bad. https://pastebin.com/gkqTZpY3 Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --op list" gave me input/output error. But everything in SMART looks OK, and i see no indication of hardware read error in any logs. Same for both OSD. The OSD's with corruption have absolutely no bad sectors and likely have only a minor corruption but at important locations. Any ideas on how to recover this kind of scenario ? Any tips would be highly appreciated. Best regards, Kári Bertilsson ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] OSD corruption and down PGs
Hello, I had an incident where 3 OSDs crashed at once completely and won't power up. And during recovery, 3 OSDs in another host have somehow become corrupted. I am running erasure coding with an 8+2 setup using a crush map which takes 2 OSDs per host, and after losing the other 2 OSDs I have a few PGs down. Unfortunately these PGs seem to overlap almost all data on the pool, so I believe the entire pool is mostly lost after only these 2% of PGs down. I am running ceph 14.2.9. OSD 92 log https://pastebin.com/5aq8SyCW OSD 97 log https://pastebin.com/uJELZxwr ceph-bluestore-tool repair without --deep showed "success" but the OSDs still fail with the log above. Log from trying ceph-bluestore-tool repair --deep, which is still running; not sure if it will actually fix anything and the log looks pretty bad: https://pastebin.com/gkqTZpY3 Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --op list" gave me an input/output error. But everything in SMART looks OK, and I see no indication of a hardware read error in any logs. Same for both OSDs. The OSDs with corruption have absolutely no bad sectors and likely have only minor corruption, but at important locations. Any ideas on how to recover from this kind of scenario? Any tips would be highly appreciated. Best regards, Kári Bertilsson ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rgw user access questions
Hi, any input on this?

Thanks & Regards,
Vishwas

On Mon, May 11, 2020 at 11:11 AM Vishwas Bm wrote:
> Hi,
>
> I am a newbie to Ceph. I have gone through the Ceph docs, and we are planning to use RGW for object storage.
>
> From the docs, what I have understood is that there are two types of users:
> 1) Ceph storage user
> 2) radosgw user
>
> I am able to create users of both types, but I am not able to understand how to restrict an RGW user's access to a pool.
>
> My questions are below:
> 1) How do I restrict an RGW user's access to a particular pool? Can this be done using placement groups?
>
> 2) Is it possible to restrict RGW user access to a particular namespace in a pool?
>
> 3) I can understand the flow up to where the user writes to a bucket using the .index pool object, but I do not understand how the RGW user writes objects into the data pool. Where can I check the permissions?
>
> Thanks & Regards,
> Vishwas
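[Editor's note] On question 1: RGW does not grant per-pool cephx permissions to individual S3 users; instead, a user's buckets are steered into specific pools via placement targets. A hedged sketch using the documented radosgw-admin placement commands; the placement id and pool names here are made up for illustration:

```
# Define a new placement target in the zonegroup
radosgw-admin zonegroup placement add \
    --rgw-zonegroup default --placement-id special-placement

# Map the placement target to dedicated pools in the zone
radosgw-admin zone placement add \
    --rgw-zone default --placement-id special-placement \
    --data-pool special.rgw.data \
    --index-pool special.rgw.index

# After restarting the RGWs, buckets can select this placement at
# creation time via the S3 LocationConstraint, e.g.
# "default:special-placement".
```

Note this controls where a bucket's data lands, not which user may touch which pool; access control between S3 users happens at the bucket/object layer (bucket policies, ACLs), not at the RADOS pool layer.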
[ceph-users] Re: nfs migrate to rgw
On 5/12/20 4:22 AM, Zhenshi Zhou wrote:
> Hi all,
>
> We have several NFS servers providing file storage, with an nginx in front of them to serve the clients. The files are mostly small, about 30TB in total.

What is small? How many objects/files are you talking about?

> I'm going to use Ceph RGW as the storage, and I want to know if it's appropriate to do so. Migrating the data from NFS to RGW is a huge job, and besides I'm not sure whether Ceph RGW is suitable in this scenario.

Yes, it is. But make sure you don't put millions of objects into a single bucket. Spread them out so that you have, let's say, at most 1M objects per bucket.

Wido

> Thanks
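[Editor's note] Wido's "spread them out" advice can be automated during a migration by hashing each source file path to one of N buckets, so no single bucket accumulates millions of objects. A minimal sketch; the bucket naming scheme and shard count are assumptions for illustration, not from the thread:

```python
import hashlib

def bucket_for(path: str, base: str = "nfs-migrate", shards: int = 64) -> str:
    """Map a file path deterministically to one of `shards` buckets.

    With ~30TB of small files, 64 buckets keeps each bucket well under
    the ~1M-objects-per-bucket guideline if the total object count is
    in the tens of millions; raise `shards` for larger counts.
    """
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return f"{base}-{int(digest, 16) % shards:03d}"

# The same path always maps to the same bucket, so re-running the
# migration is idempotent at the bucket-selection level.
print(bucket_for("/exports/home/alice/report.pdf"))
```

The object key inside the bucket can stay the original NFS path, which makes verifying the migration against the source tree straightforward.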
[ceph-users] Cifs slow read speed
Hi,

I am running a small Ceph Nautilus cluster on Ubuntu 18.04. I am testing exposing a CephFS volume as a Samba v4 share for users to access from Windows.

From a CephFS kernel mount, a dd write runs at 600 MB/s and an md5sum read of the file runs at 700-800 MB/s. I exposed the same volume through Samba using "vfs_ceph" and mounted it via CIFS on another Ubuntu 18.04 client. There, a dd write still gets 600 MB/s, but the md5sum read speed is only 65 MB/s.

What could be the problem? Has anyone faced a similar issue?
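[Editor's note] For reference, a minimal vfs_ceph share definition of the kind described above; the share name, path, and cephx user id are assumptions, not from the post. When writes are fast but reads through smbd are slow, the Samba async I/O settings are a common first thing to check:

```
[cephfs]
    path = /
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    read only = no
    # Enable async I/O in smbd for all request sizes; sequential
    # reads through the CIFS client often benefit from this.
    aio read size = 1
    aio write size = 1
```

This is a sketch of the setup being discussed, not a confirmed fix; comparing `smbstatus` and client-side mount options (e.g. the CIFS `rsize`) during a slow read would help narrow down where the 65 MB/s ceiling comes from.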