[ceph-users] Re: Question on multi-site
Replication works at the OSD layer; RGW is an HTTP frontend for objects. If you write an object via librados directly, RGW will not be aware of it.

k

Sent from my iPhone

> On 22 Feb 2021, at 18:52, Cary FitzHugh wrote:
>
> Question is - do files which are written directly to an OSD get replicated
> using the gateway, or is it only files which are written through the
> gateway that get replicated?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
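To make the distinction concrete, here is a minimal sketch of the two write paths. Pool, bucket, object and endpoint names are placeholders, not taken from the thread:

```shell
# Write via librados: the object lands in the pool, but RGW has no
# bucket-index entry or changelog record for it, so it never appears
# in S3 listings and multisite sync never replicates it.
rados -p mypool put myobject ./file.bin

# Write via the gateway: RGW updates the bucket index and logs the
# change, so multisite replication picks the object up.
s3cmd put ./file.bin s3://mybucket/myobject
```

In short: multisite replication is implemented by RGW on top of RADOS, so only objects that pass through RGW are tracked for sync.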
[ceph-users] Re: Storing 20 billions of immutable objects in Ceph, 75% <16KB
OMAP with keys works like database-style replication: new keys/updates come to the acting set as a data stream, not as full objects.

k

Sent from my iPhone

> On 22 Feb 2021, at 17:13, Benoît Knecht wrote:
>
> Is recovery faster for OMAP compared to the equivalent number of RADOS
> objects?
[ceph-users] Multisite sync shards cleanup
Hi,

Is there a way to clean up the sync shards and start from scratch?

Thank you
[ceph-users] Unable to delete bucket - endless multipart uploads?
Hi All,

We've been dealing with what seems to be a pretty annoying bug for a while now. We are unable to delete a customer's bucket that seems to have an extremely large number of aborted multipart uploads. I've had $(radosgw-admin bucket rm --bucket=pusulax --purge-objects) running in a screen session for almost 3 weeks now and it's still not finished; it's most likely stuck in a loop or something. The screen session with debug-rgw=10 spams billions of these messages:

2021-02-23 15:38:58.667 7f9b55704840 10 RGWRados::cls_bucket_list_unordered: got _multipart_04/d3/04d33e18-3f13-433c-b924-56602d702d60-31.msg.2~0DTalUjTHsnIiKraN1klwIFO88Vc2E3.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10 RGWRados::cls_bucket_list_unordered: got _multipart_04/d7/04d7ad26-c8ec-4a39-9938-329acd6d9da7-102.msg.2~K_gAeTpfEongNvaOMNa0IFwSGPpQ1iA.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10 RGWRados::cls_bucket_list_unordered: got _multipart_04/da/04da4147-c949-4c3a-aca6-e63298f5ff62-102.msg.2~-hXBSFcjQKbMkiyEqSgLaXMm75qFzEp.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10 RGWRados::cls_bucket_list_unordered: got _multipart_04/db/04dbb0e6-dfb0-42fb-9d0f-49cceb18457f-102.msg.2~B5EhGgBU5U_U7EA5r8IhVpO3Aj2OvKg.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10 RGWRados::cls_bucket_list_unordered: got _multipart_04/df/04df39be-06ab-4c72-bc63-3fac1d2700a9-11.msg.2~_8h5fWlkNrIMqcrZgNbAoJfc8BN1Xx-.meta[]

This is probably the 2nd or 3rd time I've been unable to delete this bucket. I also tried running $(radosgw-admin bucket check --fix --check-objects --bucket=pusulax) before kicking off the delete job, but that didn't work either.
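One approach that is sometimes suggested for this situation (not from the thread itself — endpoint and credentials here are placeholders, verify against your release) is to abort the stale multipart uploads through the S3 API first, so the bucket delete has far fewer leftover `_multipart_` entries to iterate over:

```shell
# Abort every in-progress/aborted multipart upload in the bucket.
# --endpoint-url, credentials and bucket name are assumptions.
BUCKET=pusulax
ENDPOINT=http://rgw.example.com

aws --endpoint-url "$ENDPOINT" s3api list-multipart-uploads \
    --bucket "$BUCKET" \
    --query 'Uploads[].[Key,UploadId]' --output text |
while read -r key upload_id; do
  aws --endpoint-url "$ENDPOINT" s3api abort-multipart-upload \
      --bucket "$BUCKET" --key "$key" --upload-id "$upload_id"
done
```

Going forward, a lifecycle rule with `AbortIncompleteMultipartUpload` on customer buckets can keep these from accumulating in the first place.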
Here is the bucket in question; the num_objects counter never decreases after trying to delete the bucket:

[root@os5 ~]# radosgw-admin bucket stats --bucket=pusulax
{
    "bucket": "pusulax",
    "num_shards": 144,
    "tenant": "",
    "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3209338.4",
    "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3292800.7",
    "index_type": "Normal",
    "owner": "REDACTED",
    "ver": "0#115613,1#115196,2#115884,3#115497,4#114649,5#114150,6#116127,7#114269,8#115220,9#115092,10#114003,11#114538,12#115235,13#113463,14#114928,15#115135,16#115535,17#114867,18#116010,19#115766,20#115274,21#114818,22#114805,23#114853,24#114099,25#114359,26#114966,27#115790,28#114572,29#114826,30#114767,31#115614,32#113995,33#115305,34#114227,35#114342,36#114144,37#114704,38#114088,39#114738,40#114133,41#114520,42#114420,43#114168,44#113820,45#115093,46#114788,47#115522,48#114713,49#115315,50#115055,51#114513,52#114086,53#114401,54#114079,55#113649,56#114089,57#114157,58#114064,59#115224,60#114753,61#114686,62#115169,63#114321,64#114949,65#115075,66#115003,67#114993,68#115320,69#114392,70#114893,71#114219,72#114190,73#114868,74#113432,75#114882,76#115300,77#114755,78#114598,79#114221,80#114895,81#114031,82#114566,83#113849,84#115155,85#113790,86#113334,87#113800,88#114856,89#114841,90#115073,91#113849,92#114554,93#114820,94#114256,95#113840,96#114838,97#113784,98#114876,99#115524,100#115686,101#112969,102#112156,103#112635,104#112732,105#112933,106#112412,107#113090,108#112239,109#112697,110#113444,111#111730,112#112446,113#114479,114#113318,115#113032,116#112048,117#112404,118#114545,119#112563,120#112341,121#112518,122#111719,123#112273,124#112014,125#112979,126#112209,127#112830,128#113186,129#112944,130#111991,131#112865,132#112688,133#113819,134#112586,135#113275,136#112172,137#113019,138#112872,139#113130,140#112716,141#112091,142#111859,143#112773",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0,32#0,33#0,34#0,35#0,36#0,37#0,38#0,39#0,40#0,41#0,42#0,43#0,44#0,45#0,46#0,47#0,48#0,49#0,50#0,51#0,52#0,53#0,54#0,55#0,56#0,57#0,58#0,59#0,60#0,61#0,62#0,63#0,64#0,65#0,66#0,67#0,68#0,69#0,70#0,71#0,72#0,73#0,74#0,75#0,76#0,77#0,78#0,79#0,80#0,81#0,82#0,83#0,84#0,85#0,86#0,87#0,88#0,89#0,90#0,91#0,92#0,93#0,94#0,95#0,96#0,97#0,98#0,99#0,100#0,101#0,102#0,103#0,104#0,105#0,106#0,107#0,108#0,109#0,110#0,111#0,112#0,113#0,114#0,115#0,116#0,117#0,118#0,119#0,120#0,121#0,122#0,123#0,124#0,125#0,126#0,127#0,128#0,129#0,130#0,131#0,132#0,133#0,134#0,135#0,136#0,137#0,138#0,139#0,140#0,141#0,142#0,143#0",
    "mtime": "2020-06-17 20:27:16.685833Z",
    "max_marker":
[ceph-users] Re: ceph-radosgw: Initialization timeout, failed to initialize
I increased the debug level to 20. There isn't anything additional being written:

2021-02-23 16:26:38.736642 7f2c45f3700 -1 Initialization timeout, failed to initialize
2021-02-23 16:26:38.931400 7f4d7bf4a000 0 deferred set uid:gid to 167:167 (ceph:ceph)
2021-02-23 16:26:38.931707 7f4d7bf4a000 0 ceph version 12.2.8-128.1.TEST.bz1742993.el7cp (87ba41fa7b3bcb79d916fb0ca41a9dd90eb877f8) luminous (stable), process radosgw, pid 7169
2021-02-23 16:31:38.931973 7f4d6b8e5700 -1 Initialization timeout, failed to initialize
2021-02-23 16:31:39.191898 7fa2fae31000 0 deferred set uid:gid to 167:167 (ceph:ceph)
2021-02-23 16:31:39.192119 7fa2fae31000 0 ceph version 12.2.8-128.1.TEST.bz1742993.el7cp (87ba41fa7b3bcb79d916fb0ca41a9dd90eb877f8) luminous (stable), process radosgw, pid 7204

Thank you,
Mathew

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Tuesday, February 23, 2021 10:57 AM, Janne Johansson wrote:

> Den tis 23 feb. 2021 kl 16:53 skrev Mathew Snyder mathew.sny...@protonmail.com:
>
> > We have a Red Hat installation of Luminous (full package version: 12.2.8-128.1). We're experiencing an issue where the ceph-radosgw service will time out during initialization and cycle through attempts every five minutes until it seems to just give up. Every other ceph service starts successfully.
> >
> > I tried looking at the health of the cluster, but any time I run a command, whether ceph or radosgw-admin just to see a list of users, it seems to time out as well.
> >
> > I've used strace when attempting to start radosgw directly and was presented with a missing keyring error. I would be inclined to think that might be the problem, but wouldn't that also impact all of the other services?
>
> No, a missing rgw key would stop only it, and the radosgw-admin command (if run on a box without a global admin key).
>
> > I haven't been able to find anything in the logs that would lead me down any paths. Everything I've looked at (journalctl, /var/log/messages, /var/log/ceph/ceph-rgw-server.log) all just says the same thing: the service attempted to start, it failed to initialize, entered a failed state, service stopped. This repeats.
>
> See if you can bump the debug log level of the radosgw when starting it.
>
> https://access.redhat.com/solutions/2085183
> and
> https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/#rados-gateway
>
> ---
> May the most significant bit of your life be positive.
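For reference, the debug-level bump Janne suggests can be done either on the command line at start-up or persistently in ceph.conf. A sketch for a Luminous-era gateway (the client name `client.rgw.server` and log path are examples, not from the thread):

```shell
# Run the gateway in the foreground with verbose RGW and messenger
# logging, capturing everything for inspection:
radosgw -d --name client.rgw.server \
        --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/rgw-debug.log

# Or make it persistent in /etc/ceph/ceph.conf and restart the service:
# [client.rgw.server]
#     debug rgw = 20
#     debug ms = 1
```

With `debug ms = 1` you can usually see whether the daemon is stuck trying to reach the monitors (which would also explain `ceph` and `radosgw-admin` timing out) rather than failing on its own keyring.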
[ceph-users] Re: ceph-radosgw: Initialization timeout, failed to initialize
Den tis 23 feb. 2021 kl 16:53 skrev Mathew Snyder:

> We have a Red Hat installation of Luminous (full package version: 12.2.8-128.1). We're experiencing an issue where the ceph-radosgw service will time out during initialization and cycle through attempts every five minutes until it seems to just give up. Every other ceph service starts successfully.
>
> I tried looking at the health of the cluster, but any time I run a command, whether ceph or radosgw-admin just to see a list of users, it seems to time out as well.
>
> I've used strace when attempting to start radosgw directly and was presented with a missing keyring error. I would be inclined to think that might be the problem, but wouldn't that also impact all of the other services?

No, a missing rgw key would stop only it, and the radosgw-admin command (if run on a box without a global admin key).

> I haven't been able to find anything in the logs that would lead me down any paths. Everything I've looked at (journalctl, /var/log/messages, /var/log/ceph/ceph-rgw-server.log) all just says the same thing: the service attempted to start, it failed to initialize, entered a failed state, service stopped. This repeats.

See if you can bump the debug log level of the radosgw when starting it.

https://access.redhat.com/solutions/2085183
and
https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/#rados-gateway

--
May the most significant bit of your life be positive.
[ceph-users] Re: Network design issues
On 2/21/21 9:51 AM, Frank Schilder wrote:

> Hi Stefan,
>
> thanks for the additional info. Dell will put me in touch with their deployment team soonish and then I can ask about matching abilities.
>
> It turns out that the problem I observed might have a much more profane reason. I saw really long periods with slow ping times yesterday and finally managed to pin it down to a flapping link. My best bet is that an SFP transceiver has gone bad. What I'm really surprised about is that the switch seems not to have any flapping detection; it happily takes the port up and down several times per second. Unfortunately, I can't find anything about server-side flapping detection on mode=4 bonds, nor for members of a LAG on the switch. Do you know of anything that does that? I might be looking for the wrong term.

Flapping detection would indeed be the thing to search for. Flaps (port down/up events) could be trapped with SNMP; not sure if you have an SNMP(trap) infra in place. Otherwise LibreNMS [1] is a nice tool to set up to gather network-related info. According to a couple of forum threads you should be able to do flapping detection and alerting based on that [2,3]. You might also want to drop all those traps in an IRC (or Matrix [4]) channel.

> We have quite high redundancy. I can lose up to 3 ports on a server before the aggregated bandwidth might get too small. Therefore, I would be happy to take the occasional false positive as long as we don't miss the real flaps. Something like "permanently shut down the interface if it does a down-up 3 times per second" would be perfect. Ideally without having to watch the logs.

Gr. Stefan

[1]: https://www.librenms.org/
[2]: https://community.librenms.org/t/selected-interface-flapping-detection/10658
[3]: https://community.librenms.org/t/alert-port-flapping-up-down-too-much/10380
[4]: https://matrix.org/
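Absent switch-side or SNMP tooling, the "permanently shut down a flapping member" behaviour Frank describes can be approximated with a small user-space watchdog. This is only a sketch under stated assumptions: Linux bonding, the interface name and thresholds are made up, and you would want this supervised by systemd in practice:

```shell
#!/bin/bash
# Watch one bond member via `ip monitor link`; if it goes DOWN more
# than LIMIT times within WINDOW seconds, take it out of service so
# the LACP bond stops renegotiating on every flap.
IFACE=ens1f0     # bond member to watch (example name)
LIMIT=3          # max DOWN events ...
WINDOW=10        # ... within this many seconds

count=0
start=$(date +%s)
ip monitor link |
  grep --line-buffered "$IFACE" |
  grep --line-buffered "state DOWN" |
  while read -r _; do
    now=$(date +%s)
    if (( now - start > WINDOW )); then
      count=0; start=$now        # reset the counting window
    fi
    if (( ++count >= LIMIT )); then
      logger "flap watchdog: downing $IFACE after $count flaps"
      ip link set "$IFACE" down  # permanent until operator re-enables
      break
    fi
  done
```

The false-positive cost is exactly the one Frank says he is willing to pay: a link that goes down a few times in quick succession for a benign reason stays down until someone brings it back.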
[ceph-users] splitting Volume Group with odd number of PE in 2 logical volumes
Hello,

Recently I deployed a small Ceph cluster using cephadm. In this cluster I have 3 OSD nodes, each with 8 Hitachi HDDs (9.1 TiB), 4 Micron 9300 NVMes (2.9 TiB), and 2 Intel Optane P4800X NVMes (375 GiB). I want to use the spinning disks for the data block, the 2.9 TiB NVMes for block.db, and the Intel Optanes for block.wal.

I tried with a spec file and also via the Ceph dashboard, but I encountered one problem. I would expect 1 LV on every data disk, 4 LVs on the WAL disks, and 2 LVs on the DB disks. The problem arises on the DB disks, where only 1 LV gets created. After some debugging, I think the problem occurs when the VG gets divided in 2. I have 763089 total PEs and the first LV was created using 381545 PEs (round-up of 763089/2). Because of that, the creation of the second LV fails:

Volume group "ceph-c7078851-d3c1-4745-96b6-f98a45d3da93" has insufficient free space (381544 extents): 381545 required.

Is this expected behavior or not? Should I create the LVs myself?

Gheorghita BUTNARU,
Gheorghe Asachi Technical University of Iasi
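The arithmetic behind the failure, and one possible manual workaround (VG and LV names below are examples, not from the report), can be sketched as:

```shell
# 763089 is odd, so "half the VG" rounds up to 381545 extents for the
# first LV, leaving only 381544 free -- one short for a second 381545.
echo $(( (763089 + 1) / 2 ))   # extents used by the first LV
echo $((  763089 - 381545 ))   # extents left for the second LV

# If creating the LVs by hand, size the second one by what actually
# remains instead of by a fixed extent count:
# lvcreate -l 381545   -n db-0 ceph-vg
# lvcreate -l 100%FREE -n db-1 ceph-vg
```

Sizing the last LV with `-l 100%FREE` sidesteps the rounding problem entirely, at the cost of the two DB LVs differing by one extent (typically 4 MiB).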
[ceph-users] Re: Ceph nvme timeout and then aborting
I don't think there are people here advising to use consumer-grade SSDs/NVMes. Enterprise SSDs often have higher write endurance (DWPD) and stay stable under constant high load. My 1.5-year-old SM863a still has 099 wear level and 097 power-on hours; another SM863a of 3.8 years has 099 wear level and 093 power-on hours.

When starting to use Ceph, try to keep your environment as simple and as standard as possible. You need to be aware of quite some details if you are trying to squeeze out the last few % Ceph can offer. For instance, disabling the SSD/HDD drive cache gives you better performance (I do not really notice it, but that is probably because my cluster has low load). Also realize, from this page of Vitalif, that at some point it will not become any faster.

I was also thinking of putting WAL/DB on SSDs for the HDD pool, but I skipped it for now. The HDD pool is not fast, but I am also not complaining about it.

This is my fio config [1] and a Micron SATA SSD drive result [2].

[1]
[global]
ioengine=libaio
#ioengine=posixaio
invalidate=1
ramp_time=30
iodepth=1
runtime=180
time_based
direct=1
filename=/dev/sdX
#filename=/mnt/cephfs/ssd/fio-bench.img

[write-4k-seq]
stonewall
bs=4k
rw=write

[randwrite-4k-seq]
stonewall
bs=4k
rw=randwrite
fsync=1

[randwrite-4k-d32-seq]
stonewall
bs=4k
rw=randwrite
iodepth=32

[read-4k-seq]
stonewall
bs=4k
rw=read

[randread-4k-seq]
stonewall
bs=4k
rw=randread
fsync=1

[randread-4k-d32-seq]
stonewall
bs=4k
rw=randread
iodepth=32

[rw-4k-seq]
stonewall
bs=4k
rw=rw

[randrw-4k-seq]
stonewall
bs=4k
rw=randrw

[randrw-4k-d4-seq]
stonewall
bs=4k
rw=randrw
iodepth=4

[write-128k-seq]
stonewall
bs=128k
rw=write

[randwrite-128k-seq]
stonewall
bs=128k
rw=randwrite

[read-128k-seq]
stonewall
bs=128k
rw=read

[randread-128k-seq]
stonewall
bs=128k
rw=randread

[rw-128k-seq]
stonewall
bs=128k
rw=rw

[randrw-128k-seq]
stonewall
bs=128k
rw=randrw

[write-1024k-seq]
stonewall
bs=1024k
rw=write

[randwrite-1024k-seq]
stonewall
bs=1024k
rw=randwrite

[read-1024k-seq]
stonewall
bs=1024k
rw=read

[randread-1024k-seq]
stonewall
bs=1024k
rw=randread

[rw-1024k-seq]
stonewall
bs=1024k
rw=rw

[randrw-1024k-seq]
stonewall
bs=1024k
rw=randrw

[write-4096k-seq]
stonewall
bs=4096k
rw=write

[write-4096k-d16-seq]
stonewall
bs=4M
rw=write
iodepth=16

[randwrite-4096k-seq]
stonewall
bs=4096k
rw=randwrite

[read-4096k-seq]
stonewall
bs=4096k
rw=read

[read-4096k-d16-seq]
stonewall
bs=4M
rw=read
iodepth=16

[randread-4096k-seq]
stonewall
bs=4096k
rw=randread

[rw-4096k-seq]
stonewall
bs=4096k
rw=rw

[randrw-4096k-seq]
stonewall
bs=4096k
rw=randrw

[2]
write-4k-seq: (groupid=0, jobs=1): err= 0: pid=982502: Sun Oct 4 16:13:28 2020
  write: IOPS=15.3k, BW=59.7MiB/s (62.6MB/s)(10.5GiB/180001msec)
    slat (usec): min=6, max=706, avg=12.09, stdev=5.40
    clat (nsec): min=1618, max=1154.4k, avg=50455.96, stdev=18670.50
    lat (usec): min=39, max=1161, avg=62.85, stdev=21.79
    clat percentiles (usec):
     |  1.00th=[  39],  5.00th=[  40], 10.00th=[  41], 20.00th=[  42],
     | 30.00th=[  43], 40.00th=[  43], 50.00th=[  45], 60.00th=[  48],
     | 70.00th=[  51], 80.00th=[  54], 90.00th=[  58], 95.00th=[  87],
     | 99.00th=[ 141], 99.50th=[ 153], 99.90th=[ 178], 99.95th=[ 188],
     | 99.99th=[ 235]
   bw (KiB/s): min=37570, max=63946, per=69.50%, avg=42495.21, stdev=3251.18, samples=359
   iops       : min=9392, max=15986, avg=10623.45, stdev=812.82, samples=359
  lat (usec)  : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=66.19%
  lat (usec)  : 100=30.09%, 250=3.70%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)  : 2=0.01%
  cpu         : usr=9.73%, sys=29.92%, ctx=2751526, majf=0, minf=53
  IO depths   : 1=116.8%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2751607,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency  : target=0, window=0, percentile=100.00%, depth=1
randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=983595: Sun Oct 4 16:13:28 2020
  write: IOPS=14.9k, BW=58.2MiB/s (61.0MB/s)(10.2GiB/180001msec)
    slat (usec): min=6, max=304, avg=10.89, stdev=4.80
    clat (nsec): min=1355, max=1258.5k, avg=49272.39, stdev=17923.95
    lat (usec): min=42, max=1265, avg=60.46, stdev=20.51
    clat percentiles (usec):
     |  1.00th=[  39],  5.00th=[  40], 10.00th=[  41], 20.00th=[  41],
     | 30.00th=[  42], 40.00th=[  43], 50.00th=[  43], 60.00th=[  46],
     | 70.00th=[  50], 80.00th=[  53], 90.00th=[  58], 95.00th=[  84],
     | 99.00th=[ 137], 99.50th=[ 151], 99.90th=[ 174], 99.95th=[ 184],
     | 99.99th=[ 231]
   bw (KiB/s): min=37665, max=62936, per=69.49%, avg=41402.23, stdev=2934.41, samples=359
   iops       : min=9416, max=15734, avg=10350.21, stdev=733.65, samples=359
  lat (usec)  : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=71.60%
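For anyone wanting to reproduce the benchmark, the job file in [1] can be run as below. Note that as written it targets a raw device and is destructive to whatever is on it; the file paths here are examples:

```shell
# Save the [1] job file as ssd-bench.fio, point filename= at a
# disposable device, then run; stonewall makes the jobs run one
# after another rather than in parallel.
fio ssd-bench.fio --output=ssd-bench.txt

# Non-destructive alternative: override the target with a plain file
# (requires an explicit size).
fio ssd-bench.fio --filename=/mnt/test/fio-bench.img --size=10G
```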
[ceph-users] Re: multiple-domain for S3 on rgws with same ceph backend on one zone
>>> Hello,
>>> We have a functional ceph swarm with a pair of S3 rgw in front that uses the A.B.C.D domain to be accessed.
>>>
>>> Now a new client asks to have access using the domain E.C.D, but to already existing buckets. This is not a scenario discussed in the docs. Apparently, looking at the code and by trying it, rgw does not support multiple domains for the variable rgw_dns_name.
>>>
>>> But reading through parts of the code (I am no dev, and my C++ is 25 years rusty), I get the impression that maybe we could just add a second pair of rgw S3 servers that would give service to the same buckets, but using a different domain.
>>>
>>> Am I wrong? Let's say this works: is this an unintended behaviour that the ceph team would remove down the road?
>>
>> We run this: a LB sends to one pool for one DNS name and to another pool for a different DNS name, and both rgws serve the "same" buckets.
>
> How can they serve the "same" buckets if they are in different ceph pools? Am I understanding you correctly? To me, same bucket means same objects.

I mean that a user can go via either one, and it works. And no, it is not different ceph pools; it is the same ceph pools underneath, only the rgw name in the conf differs.

> So if I were to deploy a new pair of RGWs with the new domain, would it create a bunch of new pools in ceph to store its objects, or reuse the preexisting ones?

It reuses the old pools. The pool names are not tied to the DNS name the rgw is using, so it starts looking for .rgw.root and from there divines which zones and zonegroups exist and (in our case) that the pools are default.rgw.buckets.index and so on, which is true for both sets of rgws.

>> Since S3 auth v4 the DNS name is very much a part of the hash that makes your access work, so whatever the client thinks is the DNS name is what it will use to make the hash-of-hash-of-hash* combination to auth itself.
>>
>> We haven't made a huge attempt to break it by doing wacky parallel accesses from both directions, but it seems to work to move clients off the old name to the new name; the stragglers that will never change get the old small LB pool, and the clients with a decent config get better service.
>
> I have a need for parallel access, have you tried it?

We have not tried, since we see it as: either you have moved to the new name or you haven't. I don't expect this to be a showstopper, since having N+1 rgws in all other cases is equally susceptible to races regardless of the DNS name the client used to reach an rgw. After auth is done, I expect it to be quite similar if your client and my client end up on different rgw daemons. Since N+1 rgw daemons are used in many, many installations, I consider that use-case tested well enough.

--
May the most significant bit of your life be positive.
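A minimal sketch of the setup described in this thread: two rgw instance sections sharing the same cluster, zone and pools, each answering to its own DNS name, with a load balancer routing by Host header. The section names, DNS names and ports here are examples, not taken from the thread:

```shell
# Append two per-instance sections to ceph.conf (run on each rgw host
# as appropriate); each daemon signs/validates S3 v4 requests against
# its own rgw_dns_name while reading the same .rgw.root and bucket pools.
cat >> /etc/ceph/ceph.conf <<'EOF'
[client.rgw.oldname]
rgw dns name = A.B.C.D
rgw frontends = civetweb port=8080

[client.rgw.newname]
rgw dns name = E.C.D
rgw frontends = civetweb port=8081
EOF
```

As an aside, releases newer than the one discussed here also accept a list of names in the zonegroup's `hostnames` field, which may let a single set of rgw daemons serve several domains without running two pools of gateways; check the docs for your version before relying on it.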