date:20230103

[ceph-users] RGW access logs with bucket name

2023-01-03 Thread Boris Behrens

Hi,
I am looking forward to move our logs from
/var/log/ceph/ceph-client...log to our logaggregator.

Is there a way to have the bucket name in the log file?

Or can I write the rgw_enable_ops_log into a file? Maybe I could work with this.

Cheers and happy new year
 Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: [EXTERNAL] Re: S3 Deletes in Multisite Sometimes Not Syncing

2023-01-03 Thread Alex Hussein-Kershaw (HE/HIM)

Hi Matthew,

That's interesting to hear - especially that you are not using bucket 
versioning and are seeing the same issue.

I was hoping this might go away if I turned off versioning, but if that's not 
the case this gets a bit more worrying for us! 

Thanks,
Alex

-Original Message-
From: Matthew Darwin  
Sent: Friday, December 23, 2022 3:13 PM
To: ceph-users@ceph.io
Subject: [EXTERNAL] [ceph-users] Re: S3 Deletes in Multisite Sometimes Not 
Syncing

Hi Alex,

We also have a multi-site setup (17.2.5). I just deleted a bunch of files from 
one side and some files got deleted on the other side but not others. I waited 
10 hours to see if the files would delete. I didn't do an exhaustive test like 
yours, but seems similar issues. In our case, like yours, the two ceph sites 
are geographically separated.

We don't have versioning enabled.

I would love to hear from anyone who has replication working perfectly.

On 2022-12-22 07:17, Alex Hussein-Kershaw (HE/HIM) wrote:
> Hi Folks,
>
> Have made a strange observation on one of our Storage Clusters.
>
>*   Running Ceph 15.2.13.
>*   Set up as a multisite pair of siteA and siteB. The two sites are 
> geographically separated.
>*   We are using S3 with a bucket in versioning suspended state (we 
> previously had versioning on but decided it’s not required).
>*   We’re using pubsub in conjunction with our S3 usage, don’t think this 
> is relevant but figured I should mention just in case.
>
> We wrote 2413 small objects (no more than a few MB each) into the cluster via 
> S3 on siteA. Then we deleted those objects via the S3 interface on siteA. 
> Once the deleting was complete, we had 11 objects of the 2413 in a strange 
> state on siteB but not siteA.
>
> On both sites the objects were set to zero size, I think this is expected. On 
> siteA, where the deletes were sent, the objects were marked with 
> “delete-marker”. On siteB, the objects were not marked with “delete-marker”. 
> “DELETE_MARKER_CREATE” pubsub events on siteA were generated for these 
> objects, but not on siteB (expecting the problem is not at the pubsub level).
>
> I followed a specific object through in logs and saw the following:
>
>*   Object created: 00:11:16
>*   Object deleted: 01:04:02
>*   Pubsub on SiteB generated “OBJECT_CREATE” events at 00:11:31, 
> 00:11:34, 01:04:18.
>
>
> My observations from this are:
>
>*   There is plenty time between the create and the delete for this not to 
> be some niche timing issue.
>*   The final “OBJECT_CREATE” event is after the delete so I expect is a 
> result of the multisite sync informing siteB of the change.
>*   I expect this final event to be a “DELETE_MARKER_CREATE” event, not an 
> “OBJECT_CREATE”.
>
> We can manually delete the objects from siteB to clean-up, but this is 
> painful and makes us look a bit silly when we get support calls from 
> customers for this sort of thing – so I’m keen find a better solution.
>
> I’ve failed to find a reason why this would occur due to us doing something 
> wrong in our setup, it seems this is not the intended behaviour given that 
> it’s only affecting a small number of the objects (most are marked as deleted 
> on both sites as expected).
>
>*   Has anyone else experienced this sort of thing?
>*   I wonder if it’s related to our versioning suspended state.
>*   How well tested is this scenario i.e., multisite + bucket versioning 
> together?
>*   Is there something we can do it mitigate it? As I understand, we can’t 
> return to a versioning disabled state for this bucket.
>
> Thanks, and Season’s Greetings 😊
>
> Alex Kershaw |alex...@microsoft.com
> Software Engineer | Azure for Operators
>
> ___
> ceph-users mailing list --ceph-users@ceph.io To unsubscribe send an 
> email toceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: [ext] Copying large file stuck, two cephfs-2 mounts on two cluster

2023-01-03 Thread Kuhring, Mathias

Trying to exclude clusters and/or clients might have gotten me on the right 
track. It might have been a client issue or actually a snapshot retention 
issue. As it turned out when I tried other routes for the data using a 
different client, the data was not available anymore since the snapshot had 
been trimmed.

We got behind syncing our snapshots a while ago (due to other issues). And now 
we are somewhere in between our weekly (16 weeks) and daily (30 days) 
snapshots. So, I assume before we catch up with daily (<30), there is a general 
risk that snapshots disappear while we are syncing them.

The funny/weird thing is though (and why I didn't catch up on this), the 
particular file (and potentially others) of this trimmed snapshot was 
apparently still available for the client I initially used for the transfer. 
I'm wondering if the client somehow cached the data until the snapshot got 
trimmed. And then just re-tried copying the incompletely cached data.

Continuing with the next available snapshot, mirroring/syncing is now catching 
up again. I expect it might happen again once we catch up to the 30-days 
threshold. If the time point of snapshot trimming falls into the syncinc time 
frame. But then I know to just cancel/skip the current snapshot and continue 
with the next one. Syncing time is short enough to get me over the hill then 
before the next trimming.

Note to myself: Next time something similar things happens, check if different 
clients AND different snapshots or original data behave the same.

On 12/22/2022 4:27 PM, Kuhring, Mathias wrote:

Dear ceph community,



We have two ceph cluster of equal size, one main and one mirror, both using 
cephadm and on version

ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)



We are stuck with copying a large file (~ 64G) between the CephFS file systems 
of the two clusters.


The source path is a snapshot (i.e. something like 
/my/path/.snap/schedule_some-date/…).
But I don't think that should make any difference.



First, I was thinking that I need to adapt some rsync parameters to work better 
with bigger files on CephFS.

But when confirming by just copying the file with cp, the transfer get's also 
stuck.

Without any error message, the process just keeps running (rsync or cp).

But the file size on the target doesn't increase anymore at some point (almost 
85%).



Main:

-rw--- 1 cockpit-ws printadmin 68360698297 16. Nov 13:40 
LB22_2764_dragen.bam



Mirror:

-rw--- 1 root root 58099499008 22. Dez 15:54 LB22_2764_dragen.bam



Our CephFS file size limit is with 10 TB more than generous.
And as far as I know from clients there are indeed files in TB ranges on the 
cluster without issues.



I don't know if this is the file's fault or if this is some issue with either 
of the CephFS' or cluster.

And I don't know how to look and troubleshoot this.

Can anybody give me a tip where I can start looking and debug these kind of 
issues?



Thank you very much.



Best Wishes,

Mathias
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io


--
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail:  mathias.kuhr...@bih-charite.de
Mobile: +49 172 3475576
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] increasing number of (deep) scrubs

2023-01-03 Thread Frank Schilder

Hi all,

we are using 16T and 18T spinning drives as OSDs and I'm observing that they 
are not scrubbed as often as I would like. It looks like too few scrubs are 
scheduled for these large OSDs. My estimate is as follows: we have 852 spinning 
OSDs backing a 8+2 pool with 2024 and an 8+3 pool with 8192 PGs. On average I 
see something like 10PGs of pool 1 and 12 PGs of pool 2 (deep) scrubbing. This 
amounts to only 232 out of 852 OSDs scrubbing and seems to be due to a 
conservative rate of (deep) scrubs being scheduled. The PGs (dep) scrub fairly 
quickly.

I would like to increase gently the number of scrubs scheduled for these drives 
and *not* the number of scrubs per OSD. I'm looking at parameters like:

osd_scrub_backoff_ratio
osd_deep_scrub_randomize_ratio

I'm wondering if lowering osd_scrub_backoff_ratio to 0.5 and, maybe, increasing 
osd_deep_scrub_randomize_ratio to 0.2 would have the desired effect? Are there 
other parameters to look at that allow gradual changes in the number of scrubs 
going on?

Thanks a lot for your help!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] mon scrub error (scrub mismatch)

2023-01-03 Thread Frank Schilder

Hi all,

we have these messages in our logs daily:

1/3/23 12:20:00 PM[INF]overall HEALTH_OK
1/3/23 12:19:46 PM[ERR] mon.2 ScrubResult(keys 
{auth=77,config=2,health=11,logm=10} crc 
{auth=688385498,config=4279003239,health=3522308637,logm=132403602})
1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys 
{auth=78,config=2,health=11,logm=9} crc 
{auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
1/3/23 12:19:46 PM[ERR]scrub mismatch
1/3/23 12:19:46 PM[ERR] mon.1 ScrubResult(keys 
{auth=77,config=2,health=11,logm=10} crc 
{auth=688385498,config=4279003239,health=3522308637,logm=132403602})
1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys 
{auth=78,config=2,health=11,logm=9} crc 
{auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
1/3/23 12:19:46 PM[ERR]scrub mismatch
1/3/23 12:17:04 PM[INF]Cluster is now healthy
1/3/23 12:17:04 PM[INF]Health check cleared: MON_CLOCK_SKEW (was: clock skew 
detected on mon.tceph-02)

Cluster is health OK:

# ceph status
  cluster:
id: bf1f51f5-b381-4cf7-b3db-88d044c1960c
health: HEALTH_OK
 
  services:
mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 3M)
mgr: tceph-01(active, since 8w), standbys: tceph-03, tceph-02
mds: fs:1 {0=tceph-02=up:active} 2 up:standby
osd: 9 osds: 9 up (since 3M), 9 in
 
  task status:
 
  data:
pools:   4 pools, 321 pgs
objects: 9.94M objects, 336 GiB
usage:   1.6 TiB used, 830 GiB / 2.4 TiB avail
pgs: 321 active+clean

Unfortunately, google wasn't of too much help. Is this scrub error something to 
worry about?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] rgw - unable to remove some orphans

2023-01-03 Thread Andrei Mikhailovsky

Happy New Year everyone! 

I have a bit of an issue with removing some of the orphan objects that were 
generated with the rgw-orphan-list tool. Over the years rgw generated over 14 
million orphans with an overall waste of over 100TB in size, considering the 
overall data stored in rgw was well under 10TB at max. Anyways, I have managed 
to remove around 12m objects over the holiday season, but there are just over 
2m orphans which were not removed. Here is an example of one of the objects 
taken from the orphans list file: 

$ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
 

error removing .rgw.buckets>default.775634629.1__shadow_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
 (2) No such file or directory 

Checking the presence of the object with the rados tool shows that the object 
is there. 

$ cat orphan-list-20230103105849.out |grep -a JSOaysLdFs |grep -a 92 
default.775634629.1__shadow_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
 

$ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92 
default.775634629.1__shadow_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
 


Why can't I remove it? I have around 2m objects which can't be removed. What 
can I do to remove them? 

Thanks 

Andrei 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: rgw - unable to remove some orphans

2023-01-03 Thread Boris Behrens

Hi Andrei,
happy new year to you too.

The file might be already removed.
You can check if the radosobject is there with `rados -p ls ...`
You can also check if the file is is still in the bucket with
`radosgw-admin bucket radoslist --bucket BUCKET`

Cheers
 Boris

Am Di., 3. Jan. 2023 um 13:47 Uhr schrieb Andrei Mikhailovsky
:
>
> Happy New Year everyone!
>
> I have a bit of an issue with removing some of the orphan objects that were 
> generated with the rgw-orphan-list tool. Over the years rgw generated over 14 
> million orphans with an overall waste of over 100TB in size, considering the 
> overall data stored in rgw was well under 10TB at max. Anyways, I have 
> managed to remove around 12m objects over the holiday season, but there are 
> just over 2m orphans which were not removed. Here is an example of one of the 
> objects taken from the orphans list file:
>
> $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL 
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
>
> error removing .rgw.buckets>default.775634629.1__shadow_SQL 
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
>  (2) No such file or directory
>
> Checking the presence of the object with the rados tool shows that the object 
> is there.
>
> $ cat orphan-list-20230103105849.out |grep -a JSOaysLdFs |grep -a 92
> default.775634629.1__shadow_SQL 
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>
> $ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92
> default.775634629.1__shadow_SQL 
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>
>
> Why can't I remove it? I have around 2m objects which can't be removed. What 
> can I do to remove them?
>
> Thanks
>
> Andrei
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend
im groÃƒ¼en Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: rgw - unable to remove some orphans

2023-01-03 Thread Manuel Rios - EDH

Object index database get corrupted and no ones can fix. We wipped a 500TB 
cluster years ago and move out ceph due this orphans bugs.
After move all our data we saw in disk more than 100TB data unable to be 
deleted by ceph, also know as orphans... no sense.

We expended thousand hours with this bug, the best solution replicate valid 
data to a new ceph cluster.

Some providers solve this with x4 replica  but no money sense. 

Regards,
Manuel

CONFIDENTIALITY NOTICE:
This e-mail message and all attachments transmitted with it may contain legally 
privileged, proprietary and/or confidential information intended solely for the 
use of the addressee. If you are not the intended recipient, you are hereby 
notified that any review, dissemination, distribution, duplication or other use 
of this message and/or its attachments is strictly prohibited. If you are not 
the intended recipient, please contact the sender by reply e-mail and destroy 
all copies of the original message and its attachments. Thank you.
No imprimas si no es necesario. Protejamos el Medio Ambiente.


-Original Message-
From: Andrei Mikhailovsky  
Sent: martes, 3 de enero de 2023 13:46
To: ceph-users 
Subject: [ceph-users] rgw - unable to remove some orphans

Happy New Year everyone! 

I have a bit of an issue with removing some of the orphan objects that were 
generated with the rgw-orphan-list tool. Over the years rgw generated over 14 
million orphans with an overall waste of over 100TB in size, considering the 
overall data stored in rgw was well under 10TB at max. Anyways, I have managed 
to remove around 12m objects over the holiday season, but there are just over 
2m orphans which were not removed. Here is an example of one of the objects 
taken from the orphans list file: 

$ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
 

error removing .rgw.buckets>default.775634629.1__shadow_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
 (2) No such file or directory 

Checking the presence of the object with the rados tool shows that the object 
is there. 

$ cat orphan-list-20230103105849.out |grep -a JSOaysLdFs |grep -a 92 
default.775634629.1__shadow_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
 

$ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92 
default.775634629.1__shadow_SQL 
Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
 


Why can't I remove it? I have around 2m objects which can't be removed. What 
can I do to remove them? 

Thanks 

Andrei 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: mon scrub error (scrub mismatch)

2023-01-03 Thread Eugen Block


Hi Frank,

I had this a few years back and ended up recreating the MON with the  
scrub mismatch, so in your case it probably would be mon.0. To test if  
the problem still exists you can trigger a mon scrub manually:


ceph mon scrub

Are all MONs on rocksdb back end in this cluster? I didn't check back  
then if this was the case in our cluster, so I'm just wondering if  
that could be an explanation.


Regards,
Eugen

Zitat von Frank Schilder :


Hi all,

we have these messages in our logs daily:

1/3/23 12:20:00 PM[INF]overall HEALTH_OK
1/3/23 12:19:46 PM[ERR] mon.2 ScrubResult(keys  
{auth=77,config=2,health=11,logm=10} crc  
{auth=688385498,config=4279003239,health=3522308637,logm=132403602})
1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys  
{auth=78,config=2,health=11,logm=9} crc  
{auth=325876668,config=4279003239,health=3522308637,logm=1083913445})

1/3/23 12:19:46 PM[ERR]scrub mismatch
1/3/23 12:19:46 PM[ERR] mon.1 ScrubResult(keys  
{auth=77,config=2,health=11,logm=10} crc  
{auth=688385498,config=4279003239,health=3522308637,logm=132403602})
1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys  
{auth=78,config=2,health=11,logm=9} crc  
{auth=325876668,config=4279003239,health=3522308637,logm=1083913445})

1/3/23 12:19:46 PM[ERR]scrub mismatch
1/3/23 12:17:04 PM[INF]Cluster is now healthy
1/3/23 12:17:04 PM[INF]Health check cleared: MON_CLOCK_SKEW (was:  
clock skew detected on mon.tceph-02)


Cluster is health OK:

# ceph status
  cluster:
id: bf1f51f5-b381-4cf7-b3db-88d044c1960c
health: HEALTH_OK

  services:
mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 3M)
mgr: tceph-01(active, since 8w), standbys: tceph-03, tceph-02
mds: fs:1 {0=tceph-02=up:active} 2 up:standby
osd: 9 osds: 9 up (since 3M), 9 in

  task status:

  data:
pools:   4 pools, 321 pgs
objects: 9.94M objects, 336 GiB
usage:   1.6 TiB used, 830 GiB / 2.4 TiB avail
pgs: 321 active+clean

Unfortunately, google wasn't of too much help. Is this scrub error  
something to worry about?


Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: pg deep scrubbing issue

2023-01-03 Thread Jeffrey Turmelle

Thank you Anthony.  I did have an empty pool that I had provisioned for 
developers that was never used.  I’ve removed that pool and the 0 object PGs 
are gone.  I don’t know why I didn’t realize that.  Removing that pool halved 
the # of PGs not scrubbed in time.

This is entirely an HDD cluster.  I don’t constrain my scrubs, and I had 
already set the osd_deep_scrub_interval to 2 weeks, and increased the 
osd_scrub_load_threshold to 5.  But that didn’t help much.

I’ve moved our operations to our failover cluster so hopefully this one can 
catch up now.  I don’t understand how this started out of the blue, but at 
least now, the number is decreasing.

Jeff


> On Jan 3, 2023, at 12:57 AM, Anthony D'Atri  wrote:
> 
> Look closely at your output. The PGs with 0 objects. Are only “every other” 
> due to how the command happened to order the output.
> 
> Note that the empty PGs all have IDs matching “3.*”. The numeric prefix of a 
> PG ID reflects the cardinal ID of the pool to which it belongs.   I strongly 
> suspect that you have a pool with no data.
> 
> 
> 
>>> Strangely, ceph pg dump gives shows every other PG with 0 objects.  An 
>>> attempt to perform a deep scrub (or scrub) on one of these PGs does 
>>> nothing.   The cluster appears to be running fine, but obviously there’s an 
>>> issue.   What should my next steps be to troubleshoot ?
 PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
 OMAP_BYTES* OMAP_KEYS* LOG  DISK_LOG STATE   
 STATE_STAMPVERSION   REPORTED   UP
 UP_PRIMARY ACTINGACTING_PRIMARY LAST_SCRUBSCRUB_STAMP  
   LAST_DEEP_SCRUB DEEP_SCRUB_STAMP   SNAPTRIMQ_LEN
 3.e9b 0  00 0   00 
   0  000active+clean 
 2022-12-31 22:49:07.629579   0'023686:19820   [28,79]  
28   [28,79] 28   0'0 2022-12-31 
 22:49:07.629508 0'0 2022-12-31 22:49:07.629508 0
 1.e99 60594  00 0   0 177433523272 
   0  0 3046 3046active+clean 
 2022-12-21 14:35:08.175858  23686'268137  23686:1732399 [178,115]  
   178 [178,115]178  23675'267613 2022-12-21 
 11:01:10.40352523675'267613 2022-12-21 11:01:10.403525 0
 3.e9a 0  00 0   00 
   0  000active+clean 
 2022-12-31 09:16:48.644619   0'023686:22855  [51,140]  
51  [51,140] 51   0'0 2022-12-31 
 09:16:48.644568 0'0 2022-12-30 02:35:23.367344 0
 1.e98 59962  00 0   0 177218669411 
   0  0 3035 3035active+clean 
 2022-12-28 14:14:49.908560  23686'265576  23686:1357499   [92,86]  
92   [92,86] 92  23686'265445 2022-12-28 
 14:14:49.90852223686'265445 2022-12-28 14:14:49.908522 0
 3.e95 0  00 0   00 
   0  000active+clean 
 2022-12-31 06:09:39.442932   0'023686:22757   [48,83]  
48   [48,83] 48   0'0 2022-12-31 
 06:09:39.442879 0'0 2022-12-18 09:33:47.892142 0
> 
> 
> As to your PGs not scrubbed in time, what sort of hardware are your OSDs?  
> Here are some thoughts, especially if they’re HDDs.
> 
> * If you don’t need that empty pool, delete it, then evaluate how many PGs on 
> average your OSDs  hold (eg. `ceph osd df`).  If you have an unusually high 
> number of PGs per, maybe just maybe you’re running afoul of 
> osd_scrub_extended_sleep / osd_scrub_sleep .  In other words, individual 
> scrubs on empty PGs may naturally be very fast, but they may be DoSing 
> because of the efforts Ceph makes to spread out the impact of scrubs.
> 
> * Do you limit scrubs to certain times via osd_scrub_begin_hour, 
> osd_scrub_end_hour, osd_scrub_begin_week_day, osd_scrub_end_week_day?  I’ve 
> seen operators who constraint scrubs to only a few overnight / weekend hours, 
> but doing so can hobble Ceph’s ability to get through them all in time.
> 
> * Similarly, a value of osd_scrub_load_threshold that’s too low can also 
> result in starvation.  The load average statistic can be misleading on modern 
> SMP systems with lots of cores.  I’ve witnessed 32c/64t OSD nodes report a 
> load average of like 40, but with tools like htop one could see that they 
> were barely breaking a sweat.
> 
> * If you have osd_scrub_during_recovery disabled and experience a lot of 
> backfill / recovery / rebalance traffic, that can s

[ceph-users] Re: rgw - unable to remove some orphans

2023-01-03 Thread Andrei Mikhailovsky

Hi Boris,

The objects do exist and I can see it with ls. I can also verify that the total 
amount of objects in the pool is over 2m more than the amount of files. The 
total used space of all the buckets is about 10TB less than the total space 
used up by the .rgw.buckets pool.

My colleague has suggested that there are unprintable characters in the object 
names and thus they can't be removed with cli tools. Could this be the case and 
if so, how do I remove them?

Cheers

Andrei

- Original Message -
> From: "Boris Behrens" 
> To: "ceph-users" 
> Sent: Tuesday, 3 January, 2023 12:53:29
> Subject: [ceph-users] Re: rgw - unable to remove some orphans

> Hi Andrei,
> happy new year to you too.
> 
> The file might be already removed.
> You can check if the radosobject is there with `rados -p ls ...`
> You can also check if the file is is still in the bucket with
> `radosgw-admin bucket radoslist --bucket BUCKET`
> 
> Cheers
> Boris
> 
> Am Di., 3. Jan. 2023 um 13:47 Uhr schrieb Andrei Mikhailovsky
> :
>>
>> Happy New Year everyone!
>>
>> I have a bit of an issue with removing some of the orphan objects that were
>> generated with the rgw-orphan-list tool. Over the years rgw generated over 14
>> million orphans with an overall waste of over 100TB in size, considering the
>> overall data stored in rgw was well under 10TB at max. Anyways, I have 
>> managed
>> to remove around 12m objects over the holiday season, but there are just over
>> 2m orphans which were not removed. Here is an example of one of the objects
>> taken from the orphans list file:
>>
>> $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
>>
>> error removing .rgw.buckets>default.775634629.1__shadow_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
>> (2) No such file or directory
>>
>> Checking the presence of the object with the rados tool shows that the 
>> object is
>> there.
>>
>> $ cat orphan-list-20230103105849.out |grep -a JSOaysLdFs |grep -a 92
>> default.775634629.1__shadow_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>>
>> $ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92
>> default.775634629.1__shadow_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>>
>>
>> Why can't I remove it? I have around 2m objects which can't be removed. What 
>> can
>> I do to remove them?
>>
>> Thanks
>>
>> Andrei
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend
> im groÃƒ¼en Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: rgw - unable to remove some orphans

2023-01-03 Thread Andrei Mikhailovsky

Manuel,

Wow, I am pretty surprised to hear that the ceph developers hasn't addressed 
this issue already. It looks like it is a big issue, which is costing a lot of 
money to keep this orphan data unresolved.

Could someone from the developers comment on the issue and let us know if there 
is a workaround?

Cheers

Andrei

- Original Message -
> From: "EDH" 
> To: "Andrei Mikhailovsky" , "ceph-users" 
> 
> Sent: Tuesday, 3 January, 2023 13:36:19
> Subject: RE: rgw - unable to remove some orphans

> Object index database get corrupted and no ones can fix. We wipped a 500TB
> cluster years ago and move out ceph due this orphans bugs.
> After move all our data we saw in disk more than 100TB data unable to be 
> deleted
> by ceph, also know as orphans... no sense.
> 
> We expended thousand hours with this bug, the best solution replicate valid 
> data
> to a new ceph cluster.
> 
> Some providers solve this with x4 replica  but no money sense.
> 
> Regards,
> Manuel
> 
> CONFIDENTIALITY NOTICE:
> This e-mail message and all attachments transmitted with it may contain 
> legally
> privileged, proprietary and/or confidential information intended solely for 
> the
> use of the addressee. If you are not the intended recipient, you are hereby
> notified that any review, dissemination, distribution, duplication or other 
> use
> of this message and/or its attachments is strictly prohibited. If you are not
> the intended recipient, please contact the sender by reply e-mail and destroy
> all copies of the original message and its attachments. Thank you.
> No imprimas si no es necesario. Protejamos el Medio Ambiente.
> 
> 
> -Original Message-
> From: Andrei Mikhailovsky 
> Sent: martes, 3 de enero de 2023 13:46
> To: ceph-users 
> Subject: [ceph-users] rgw - unable to remove some orphans
> 
> Happy New Year everyone!
> 
> I have a bit of an issue with removing some of the orphan objects that were
> generated with the rgw-orphan-list tool. Over the years rgw generated over 14
> million orphans with an overall waste of over 100TB in size, considering the
> overall data stored in rgw was well under 10TB at max. Anyways, I have managed
> to remove around 12m objects over the holiday season, but there are just over
> 2m orphans which were not removed. Here is an example of one of the objects
> taken from the orphans list file:
> 
> $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
> 
> error removing .rgw.buckets>default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
> (2) No such file or directory
> 
> Checking the presence of the object with the rados tool shows that the object 
> is
> there.
> 
> $ cat orphan-list-20230103105849.out |grep -a JSOaysLdFs |grep -a 92
> default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
> 
> $ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92
> default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
> 
> 
> Why can't I remove it? I have around 2m objects which can't be removed. What 
> can
> I do to remove them?
> 
> Thanks
> 
> Andrei
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] RGW - Keyring Storage Cluster Users ceph for secondary RGW multisite

2023-01-03 Thread Guillaume Morin

Hello, i need help for configure a Storage Cluster Users for a  secondary rados 
gateway.

My multisite RGW configuration & sync works with lot of capabilities (osd 
'allow rwx, mon 'allow profile simple-rados-client', mgr 'allow profile rbd') 
but i would like  avoided to use osd 'allow rwx'.

Actually, for the master zone, it's work with the follow  osd cap:
'allow rwx pool=myrootpool, allow rwx pool=myzone.rgw.buckets.index, allow rwx 
pool=myzone.rgw.buckets.data, ...'

but for the secondary zone, it doesn't work with an exaustive pool list but 
just with osd 'allow rwx'.
Is there need other osd cap for  secondary RGW ?

configuration
ceph version 16.2.9
1 cluster ceph
i don't use .rgw.root for realm and zone info
two radosGW (master and secondary)


secondary error logs:
   -20> 2023-01-03T16:10:58.519+0100 7f14fe28d840  5 asok(0x55c71bd1c100) 
register_command sync trace show hook 0x7f14f0002700
   -19> 2023-01-03T16:10:58.519+0100 7f14fe28d840  5 asok(0x55c71bd1c100) 
register_command sync trace history hook 0x7f14f0002700
   -18> 2023-01-03T16:10:58.519+0100 7f14fe28d840  5 asok(0x55c71bd1c100) 
register_command sync trace active hook 0x7f14f0002700
   -17> 2023-01-03T16:10:58.519+0100 7f14fe28d840  5 asok(0x55c71bd1c100) 
register_command sync trace active_short hook 0x7f14f0002700
   -16> 2023-01-03T16:10:58.523+0100 7f1464ff9700  5 rgw object expirer Worker 
thread: process_single_shard(): failed to acquire lock on 
obj_delete_at_hint.02
   -15> 2023-01-03T16:10:58.523+0100 7f14fe28d840  5 rgw main: starting data 
sync thread for zone pvid-qualif-0.s3
   -14> 2023-01-03T16:10:58.523+0100 7f1464ff9700  5 rgw object expirer Worker 
thread: process_single_shard(): failed to acquire lock on 
obj_delete_at_hint.03
   -13> 2023-01-03T16:10:58.523+0100 7f1457fef700  5 lifecycle: schedule life 
cycle next start time: Tue Jan  3 23:00:00 2023
   -12> 2023-01-03T16:10:58.523+0100 7f1455feb700  5 lifecycle: schedule life 
cycle next start time: Tue Jan  3 23:00:00 2023
   -11> 2023-01-03T16:10:58.523+0100 7f1453fe7700  5 lifecycle: schedule life 
cycle next start time: Tue Jan  3 23:00:00 2023
   -10> 2023-01-03T16:10:58.523+0100 7f1464ff9700  5 rgw object expirer Worker 
thread: process_single_shard(): failed to acquire lock on 
obj_delete_at_hint.04
-9> 2023-01-03T16:10:58.523+0100 7f1464ff9700  5 rgw object expirer Worker 
thread: process_single_shard(): failed to acquire lock on 
obj_delete_at_hint.05
-8> 2023-01-03T16:10:58.523+0100 7f1464ff9700  5 rgw object expirer Worker 
thread: process_single_shard(): failed to acquire lock on 
obj_delete_at_hint.06
-7> 2023-01-03T16:10:58.527+0100 7f14fe28d840  0 framework: beast
-6> 2023-01-03T16:10:58.527+0100 7f14fe28d840  0 framework conf key: 
ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
-5> 2023-01-03T16:10:58.527+0100 7f14fe28d840  0 framework conf key: 
ssl_private_key, val: config://rgw/cert/$realm/$zone.key
-4> 2023-01-03T16:10:58.527+0100 7f14fe28d840  0 starting handler: beast
-3> 2023-01-03T16:10:58.527+0100 7f1464ff9700  5 rgw object expirer Worker 
thread: process_single_shard(): failed to acquire lock on 
obj_delete_at_hint.07
-2> 2023-01-03T16:10:58.527+0100 7f14fe28d840  4 frontend listening on 
0.0.0.0:443
-1> 2023-01-03T16:10:58.527+0100 7f14fe28d840  4 frontend listening on 
[::]:443
 0> 2023-01-03T16:10:58.527+0100 7f1452fe5700 -1 *** Caught signal 
(Aborted) **
 in thread 7f1452fe5700 thread_name:rgw_user_st_syn

 ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3bd60) [0x7f150a491d60]
 2: gsignal()
 3: abort()
 4: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9a7ec) [0x7f1500ac67ec]
 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5966) [0x7f1500ad1966]
 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa59d1) [0x7f1500ad19d1]
 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5c65) [0x7f1500ad1c65]
 8: /lib/librados.so.2(+0x36b7a) [0x7f150a03cb7a]
 9: /lib/librados.so.2(+0x7cd20) [0x7f150a082d20]
 10: (librados::v14_2_0::IoCtx::nobjects_begin(librados::v14_2_0::ObjectCursor 
const&, ceph::buffer::v15_2_0::list const&)+0x59) [0x7f150a08d749]
 11: (RGWSI_RADOS::Pool::List::init(DoutPrefixProvider const*, 
std::__cxx11::basic_string, std::allocator > 
const&, RGWAccessListFilter*)+0x2e5) [0x7f150b12d615]
 12: (RGWSI_SysObj_Core::pool_list_objects_init(DoutPrefixProvider const*, 
rgw_pool const&, std::__cxx11::basic_string, 
std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, 
RGWSI_SysObj::Pool::ListCtx*)+0x24f) [0x7f150abb6c6f]
 13: (RGWSI_MetaBackend_SObj::list_init(DoutPrefixProvider const*, 
RGWSI_MetaBackend::Context*, std::__cxx11::basic_string, std::allocator > const&)+0x235) [0x7f150b11f8e5]
 14: (RGWMetadataHandler_GenericMetaBE::list_keys_init(DoutPrefixProvider 
const*, std::__cxx11::basic_string, 
std::allocator > const&, void**)+0x41) [0x7f150ace3f71]
 15: (

[ceph-users] Re: mon scrub error (scrub mismatch)

2023-01-03 Thread Frank Schilder

Hi Eugen,

thanks for your answer. All our mons use rocksdb.

I found some old threads, but they never really explained anything. What 
irritates me is that this is a silent corruption. If you don't read the logs 
every day, you will not see it, ceph status reports health ok. That's also why 
I'm wondering if this is a real issue or not.

It would be great if someone could shed light on (1) how serious this is, (2) 
why it doesn't trigger a health warning/error and (3) why the affected mon 
doesn't sync back from the majority right away.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 03 January 2023 15:04:34
To: ceph-users@ceph.io
Subject: [ceph-users] Re: mon scrub error (scrub mismatch)

Hi Frank,

I had this a few years back and ended up recreating the MON with the
scrub mismatch, so in your case it probably would be mon.0. To test if
the problem still exists you can trigger a mon scrub manually:

ceph mon scrub

Are all MONs on rocksdb back end in this cluster? I didn't check back
then if this was the case in our cluster, so I'm just wondering if
that could be an explanation.

Regards,
Eugen

Zitat von Frank Schilder :

> Hi all,
>
> we have these messages in our logs daily:
>
> 1/3/23 12:20:00 PM[INF]overall HEALTH_OK
> 1/3/23 12:19:46 PM[ERR] mon.2 ScrubResult(keys
> {auth=77,config=2,health=11,logm=10} crc
> {auth=688385498,config=4279003239,health=3522308637,logm=132403602})
> 1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys
> {auth=78,config=2,health=11,logm=9} crc
> {auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
> 1/3/23 12:19:46 PM[ERR]scrub mismatch
> 1/3/23 12:19:46 PM[ERR] mon.1 ScrubResult(keys
> {auth=77,config=2,health=11,logm=10} crc
> {auth=688385498,config=4279003239,health=3522308637,logm=132403602})
> 1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys
> {auth=78,config=2,health=11,logm=9} crc
> {auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
> 1/3/23 12:19:46 PM[ERR]scrub mismatch
> 1/3/23 12:17:04 PM[INF]Cluster is now healthy
> 1/3/23 12:17:04 PM[INF]Health check cleared: MON_CLOCK_SKEW (was:
> clock skew detected on mon.tceph-02)
>
> Cluster is health OK:
>
> # ceph status
>   cluster:
> id: bf1f51f5-b381-4cf7-b3db-88d044c1960c
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 3M)
> mgr: tceph-01(active, since 8w), standbys: tceph-03, tceph-02
> mds: fs:1 {0=tceph-02=up:active} 2 up:standby
> osd: 9 osds: 9 up (since 3M), 9 in
>
>   task status:
>
>   data:
> pools:   4 pools, 321 pgs
> objects: 9.94M objects, 336 GiB
> usage:   1.6 TiB used, 830 GiB / 2.4 TiB avail
> pgs: 321 active+clean
>
> Unfortunately, google wasn't of too much help. Is this scrub error
> something to worry about?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: mon scrub error (scrub mismatch)

2023-01-03 Thread Dan van der Ster

Hi Frank,

Can you work backwards in the logs to when this first appeared?
The scrub error is showing that mon.0 has 78 auth keys and the other
two have 77. So you'd have query the auth keys of each mon to see if
you get a different response each time (e.g. ceph auth list), and
compare with what you expect.

Cheers, Dan

On Tue, Jan 3, 2023 at 9:29 AM Frank Schilder  wrote:
>
> Hi Eugen,
>
> thanks for your answer. All our mons use rocksdb.
>
> I found some old threads, but they never really explained anything. What 
> irritates me is that this is a silent corruption. If you don't read the logs 
> every day, you will not see it, ceph status reports health ok. That's also 
> why I'm wondering if this is a real issue or not.
>
> It would be great if someone could shed light on (1) how serious this is, (2) 
> why it doesn't trigger a health warning/error and (3) why the affected mon 
> doesn't sync back from the majority right away.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: 03 January 2023 15:04:34
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: mon scrub error (scrub mismatch)
>
> Hi Frank,
>
> I had this a few years back and ended up recreating the MON with the
> scrub mismatch, so in your case it probably would be mon.0. To test if
> the problem still exists you can trigger a mon scrub manually:
>
> ceph mon scrub
>
> Are all MONs on rocksdb back end in this cluster? I didn't check back
> then if this was the case in our cluster, so I'm just wondering if
> that could be an explanation.
>
> Regards,
> Eugen
>
> Zitat von Frank Schilder :
>
> > Hi all,
> >
> > we have these messages in our logs daily:
> >
> > 1/3/23 12:20:00 PM[INF]overall HEALTH_OK
> > 1/3/23 12:19:46 PM[ERR] mon.2 ScrubResult(keys
> > {auth=77,config=2,health=11,logm=10} crc
> > {auth=688385498,config=4279003239,health=3522308637,logm=132403602})
> > 1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys
> > {auth=78,config=2,health=11,logm=9} crc
> > {auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
> > 1/3/23 12:19:46 PM[ERR]scrub mismatch
> > 1/3/23 12:19:46 PM[ERR] mon.1 ScrubResult(keys
> > {auth=77,config=2,health=11,logm=10} crc
> > {auth=688385498,config=4279003239,health=3522308637,logm=132403602})
> > 1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys
> > {auth=78,config=2,health=11,logm=9} crc
> > {auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
> > 1/3/23 12:19:46 PM[ERR]scrub mismatch
> > 1/3/23 12:17:04 PM[INF]Cluster is now healthy
> > 1/3/23 12:17:04 PM[INF]Health check cleared: MON_CLOCK_SKEW (was:
> > clock skew detected on mon.tceph-02)
> >
> > Cluster is health OK:
> >
> > # ceph status
> >   cluster:
> > id: bf1f51f5-b381-4cf7-b3db-88d044c1960c
> > health: HEALTH_OK
> >
> >   services:
> > mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 3M)
> > mgr: tceph-01(active, since 8w), standbys: tceph-03, tceph-02
> > mds: fs:1 {0=tceph-02=up:active} 2 up:standby
> > osd: 9 osds: 9 up (since 3M), 9 in
> >
> >   task status:
> >
> >   data:
> > pools:   4 pools, 321 pgs
> > objects: 9.94M objects, 336 GiB
> > usage:   1.6 TiB used, 830 GiB / 2.4 TiB avail
> > pgs: 321 active+clean
> >
> > Unfortunately, google wasn't of too much help. Is this scrub error
> > something to worry about?
> >
> > Thanks and best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Telemetry service is temporarily down

2023-01-03 Thread Yaarit Hatuka

Hi everyone,

We are having some infrastructure issues with our telemetry backend, and we
are working on fixing it.
Thanks Jan Horacek for opening this issue
 [1]. We will update once the
service is back up.
We are sorry for any inconvenience you may be experiencing, and appreciate
your patience.

Thanks,
Yaarit

[1] https://tracker.ceph.com/issues/58371
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Does Raid Controller p420i in HBA mode become Bottleneck?

2023-01-03 Thread hosseinz8...@yahoo.com

Hi Experts,In my new cluster, each of my storage nodes have 6x PM1643 Samsung 
SSD with P420i Raid Controller in HBA Mode.My Main concern is P420i working in 
HBA mode become bottleneck in IOPs & throughput or not.Each PM1643 support 30k 
write and 6 count of PM1643 result in 180k iops (30k * 6).I don't know P420i in 
HBA mode supports 180k iops or not.
Thanks in advance.  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: rgw - unable to remove some orphans

2023-01-03 Thread Fabio Pasetti

Hi everyone,
we’ve got the same issue with our cluster Ceph (release Pacific) and we saw 
this issue for the first time when we start to use it as offload storage for 
Veeam Backup. In fact Veeam, at the end of the offload job, when it try to 
delete the oldest files, gave us the “unknown error” which is related with the 
impossible delete of multiple object. At the very beginning we supposed to be 
an s3 api implementation bug with the multiple delete request, but digging into 
the radosgw-admin commands we found the orphan list and we saw that we had a 
lot (I mean hundreds of thousands) of orphans files. Our cluster is about 2.7TB 
raw capacity but are 50% full of orphan files.

Is there a way to delete them in a safe way? Or is it possible to change the 
garbage collector configuration to avoid this issue with the orphan files?

Thank you all, I was pretty scared that the issue was related with my fault 
during the cluster setup 😊

Fabio



From: Andrei Mikhailovsky 
Date: Tuesday, 3 January 2023 at 16:35
To: EDH 
Cc: ceph-users 
Subject: [ceph-users] Re: rgw - unable to remove some orphans
Manuel,

Wow, I am pretty surprised to hear that the ceph developers hasn't addressed 
this issue already. It looks like it is a big issue, which is costing a lot of 
money to keep this orphan data unresolved.

Could someone from the developers comment on the issue and let us know if there 
is a workaround?

Cheers

Andrei

- Original Message -
> From: "EDH" 
> To: "Andrei Mikhailovsky" , "ceph-users" 
> 
> Sent: Tuesday, 3 January, 2023 13:36:19
> Subject: RE: rgw - unable to remove some orphans

> Object index database get corrupted and no ones can fix. We wipped a 500TB
> cluster years ago and move out ceph due this orphans bugs.
> After move all our data we saw in disk more than 100TB data unable to be 
> deleted
> by ceph, also know as orphans... no sense.
>
> We expended thousand hours with this bug, the best solution replicate valid 
> data
> to a new ceph cluster.
>
> Some providers solve this with x4 replica  but no money sense.
>
> Regards,
> Manuel
>
> CONFIDENTIALITY NOTICE:
> This e-mail message and all attachments transmitted with it may contain 
> legally
> privileged, proprietary and/or confidential information intended solely for 
> the
> use of the addressee. If you are not the intended recipient, you are hereby
> notified that any review, dissemination, distribution, duplication or other 
> use
> of this message and/or its attachments is strictly prohibited. If you are not
> the intended recipient, please contact the sender by reply e-mail and destroy
> all copies of the original message and its attachments. Thank you.
> No imprimas si no es necesario. Protejamos el Medio Ambiente.
>
>
> -Original Message-
> From: Andrei Mikhailovsky 
> Sent: martes, 3 de enero de 2023 13:46
> To: ceph-users 
> Subject: [ceph-users] rgw - unable to remove some orphans
>
> Happy New Year everyone!
>
> I have a bit of an issue with removing some of the orphan objects that were
> generated with the rgw-orphan-list tool. Over the years rgw generated over 14
> million orphans with an overall waste of over 100TB in size, considering the
> overall data stored in rgw was well under 10TB at max. Anyways, I have managed
> to remove around 12m objects over the holiday season, but there are just over
> 2m orphans which were not removed. Here is an example of one of the objects
> taken from the orphans list file:
>
> $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
>
> error removing .rgw.buckets>default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
> (2) No such file or directory
>
> Checking the presence of the object with the rados tool shows that the object 
> is
> there.
>
> $ cat orphan-list-20230103105849.out |grep -a JSOaysLdFs |grep -a 92
> default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>
> $ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92
> default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>
>
> Why can't I remove it? I have around 2m objects which can't be removed. What 
> can
> I do to remove them?
>
> Thanks
>
> Andrei
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -

[ceph-users] RGW access logs with bucket name

[ceph-users] Re: [EXTERNAL] Re: S3 Deletes in Multisite Sometimes Not Syncing

[ceph-users] Re: [ext] Copying large file stuck, two cephfs-2 mounts on two cluster

[ceph-users] increasing number of (deep) scrubs

[ceph-users] mon scrub error (scrub mismatch)

[ceph-users] rgw - unable to remove some orphans

[ceph-users] Re: rgw - unable to remove some orphans

[ceph-users] Re: rgw - unable to remove some orphans

[ceph-users] Re: mon scrub error (scrub mismatch)

[ceph-users] Re: pg deep scrubbing issue

[ceph-users] Re: rgw - unable to remove some orphans

[ceph-users] Re: rgw - unable to remove some orphans

[ceph-users] RGW - Keyring Storage Cluster Users ceph for secondary RGW multisite

[ceph-users] Re: mon scrub error (scrub mismatch)

[ceph-users] Re: mon scrub error (scrub mismatch)

[ceph-users] Telemetry service is temporarily down

[ceph-users] Does Raid Controller p420i in HBA mode become Bottleneck?

[ceph-users] Re: rgw - unable to remove some orphans

18 matches

Site Navigation

Mail list logo

Footer information