[ceph-users] Enable LUKS encryption on a snapshot created from unencrypted image

2023-04-27 Thread Will Gorman
Is there a way to enable the LUKS encryption format on a snapshot that was 
created from an unencrypted image without losing data?  I've seen in 
https://docs.ceph.com/en/quincy/rbd/rbd-encryption/ that "Any data written to 
the image prior to its format may become unreadable, though it may still occupy 
storage resources." and observed that to be the case when running `encryption 
format` on an image that already has data in it.  However, is there any way to
take a snapshot of an unencrypted image and enable encryption on the snapshot
(or even on a new image cloned from the snapshot)?
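For reference, the sequence I have in mind looks roughly like this (pool/image
names are placeholders); what I can't tell is whether the "data written prior to
its format" caveat also applies to the clone:

  # placeholders: pool "rbd", source image "src", clone "src-enc"
  rbd snap create rbd/src@snap1
  rbd snap protect rbd/src@snap1
  rbd clone rbd/src@snap1 rbd/src-enc
  # the step in question - does the caveat from the docs apply here as well?
  rbd encryption format rbd/src-enc luks2 passphrase.bin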
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: Massive OMAP remediation

2023-04-27 Thread Ben . Zieglmeier
Hi Dan,

Thanks for the response. No, I have not yet told the OSDs participating in that
PG to compact. It was something I had thought about, but I was somewhat concerned
about what that might do, what performance impact it might have, or whether the
OSD would come out alive on the other side. I think we may have found a less
impactful way to trim these bilog entries by using `--start-marker` and
`--end-marker` and simply looping, incrementing those marker values by 1000 each
time. This is far less impactful than running the command without those flags:
there it was taking ~45 seconds each time just to enumerate the bilog entries to
trim, during which the lead OSD was nearly unresponsive. It took diving into the
source code and the help of a few colleagues (as well as some trial and error on
non-production systems) to figure out what values those arguments actually
wanted. Thankfully I was able to get a listing of all OMAP keys for that object
a couple of weeks ago. I'm still not sure how comfortable I would be doing this
to a bucket that was actually mission critical (this one contains non-critical
data), but I think we may have a way forward to dislodge this large OMAP object
by trimming. Thanks again!
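Roughly, the loop looks like the following (the bucket name and the zero-padded
marker strings are purely illustrative - the real marker values are bucket/shard
specific and come from `radosgw-admin bilog list`):

  # illustrative sketch only
  BUCKET=the-problem-bucket
  for start in $(seq 0 1000 360000000); do
      end=$((start + 1000))
      radosgw-admin bilog trim --bucket="$BUCKET" \
          --start-marker="$(printf '%012d' "$start")" \
          --end-marker="$(printf '%012d' "$end")"
  done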

-Ben

From: Dan van der Ster 
Date: Wednesday, April 26, 2023 at 11:11 AM
To: Ben.Zieglmeier 
Cc: ceph-users@ceph.io 
Subject: [EXTERNAL] Re: [ceph-users] Massive OMAP remediation
Hi Ben,

Are you compacting the relevant osds periodically? `ceph tell osd.x
compact` (for the three osds holding the bilog) would help reshape the
rocksdb levels so they at least perform better for a little while, until the
next round of bilog trims.
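Something along these lines (the PG id is a placeholder, and it assumes jq is
available) would compact exactly the osds in that PG's acting set:

  # placeholder PG id - use the index-pool PG holding the big omap object
  PGID=10.2f
  for osd in $(ceph pg map "$PGID" -f json | jq -r '.acting[]'); do
      ceph tell osd."$osd" compact
  done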

Otherwise, I have experience deleting ~50M object indices in one step
in the past, probably back in the luminous days IIRC. It will likely
lock up the relevant osds for a while as the omap is removed. If you
dare take that step, it might help to set nodown; that might prevent
other osds from flapping and creating more work.

Cheers, Dan

__
Clyso GmbH | 
https://www.clyso.com


On Tue, Apr 25, 2023 at 2:45 PM Ben.Zieglmeier
 wrote:
>
> Hi All,
>
> We have a RGW cluster running Luminous (12.2.11) that has one object with an 
> extremely large OMAP database in the index pool. Listomapkeys on the object 
> returned 390 Million keys to start. Through bilog trim commands, we’ve 
> whittled that down to about 360 Million. This is a bucket index for a 
> regrettably unsharded bucket. There are only about 37K objects actually in 
> the bucket, but through years of neglect, the bilog has grown completely out of 
> control. We’ve hit some major problems trying to deal with this particular 
> OMAP object. We just crashed 4 OSDs when a bilog trim caused enough churn to 
> knock one of the OSDs housing this PG out of the cluster temporarily. The OSD 
> disks are 6.4TB NVMe, but are split into 4 partitions, each housing their own 
> OSD daemon (collocated journal).
>
> We want to be rid of this large OMAP object, but are running out of options 
> to deal with it. Reshard outright does not seem like a viable option, as we 
> believe the deletion would deadlock OSDs and could cause impact. Continuing 
> to run `bilog trim` 1000 records at a time has been what we’ve done, but this 
> also seems to be creating impacts to performance/stability. We are seeking 
> options to remove this problematic object without creating additional 
> problems. It is quite likely this bucket is abandoned, so we could remove the 
> data, but I fear even the deletion of such a large OMAP could bring OSDs down 
> and cause potential for metadata loss (the other bucket indexes on that same 
> PG).
>
> Any insight available would be highly appreciated.
>
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: import OSD after host OS reinstallation

2023-04-27 Thread Tony Liu
I tried [1] already, but got an error:
Created no osd(s) on host ceph-4; already created?

The error is from [2] in deploy_osd_daemons_for_existing_osds().

Not sure what's missing.
Should the OSD be removed, removed with --replace, or left untouched before the
host reinstallation?

[1] 
https://docs.ceph.com/en/pacific/cephadm/services/osd/#activate-existing-osds
[2] 
https://github.com/ceph/ceph/blob/0a5b3b373b8a5ba3081f1f110cec24d82299cac8/src/pybind/mgr/cephadm/services/osd.py#L196
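For reference, what I ran per [1] (host name as in the error above) was
essentially:

  # after reinstalling the OS and re-adding the host to the cluster
  ceph cephadm osd activate ceph-4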

Thanks!
Tony

From: Tony Liu 
Sent: April 27, 2023 10:20 PM
To: ceph-users@ceph.io; d...@ceph.io
Subject: [ceph-users] import OSD after host OS reinstallation

Hi,

The cluster is on Pacific and deployed by cephadm in containers.
The case is to import OSDs after a host OS reinstallation.
All OSDs are SSDs with DB/WAL and data together.
I did some research, but was not able to find a working solution.
I'm wondering if anyone has experience with this?
What needs to be done before the host OS reinstallation, and what afterwards?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] import OSD after host OS reinstallation

2023-04-27 Thread Tony Liu
Hi,

The cluster is on Pacific and deployed by cephadm in containers.
The case is to import OSDs after a host OS reinstallation.
All OSDs are SSDs with DB/WAL and data together.
I did some research, but was not able to find a working solution.
I'm wondering if anyone has experience with this?
What needs to be done before the host OS reinstallation, and what afterwards?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs - max snapshot limit?

2023-04-27 Thread Milind Changire
There's a default/hard limit of 50 snaps that's maintained for any dir via
the definition MAX_SNAPS_PER_PATH = 50 in the source file
src/pybind/mgr/snap_schedule/fs/schedule_client.py.
Every time the snapshot names are read for pruning, the last step is to check
the length of the list, keep only MAX_SNAPS_PER_PATH snapshots, and prune the
rest.

Jakob Haufe has pointed it out correctly.
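To see what the module has scheduled and what it actually keeps, something along
these lines can help (the path and mount point are just examples):

  ceph fs snap-schedule status /PATH
  ceph fs snap-schedule list /PATH --recursive
  # count the snapshots currently present under the directory
  ls /mnt/cephfs/PATH/.snap | wc -l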



On Thu, Apr 27, 2023 at 12:38 PM Tobias Hachmer  wrote:

> Hello,
>
> we are running a 3-node ceph cluster with version 17.2.6.
>
> For CephFS snapshots we have configured the following snap schedule with
> retention:
>
> /PATH 2h 72h15d6m
>
> But we observed that a maximum of 50 snapshots are preserved. If a new snapshot is
> created, the oldest one (the 51st) is deleted.
>
> Is there a limit on the maximum number of cephfs snapshots, or is this maybe a bug?
>
> I have found the setting "mds_max_snaps_per_dir" which is 100 by default
> but I think this is not related to my problem?
>
> Thanks,
>
> Tobias
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Milind
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph stretch mode / POOL_BACKFILLFULL

2023-04-27 Thread Gregory Farnum
On Fri, Apr 21, 2023 at 7:26 AM Kilian Ries  wrote:
>
> Still didn't find out what will happen when the pool is full - but I tried a 
> little bit in our testing environment and I was not able to get the pool 
> full before an OSD got full. So in the first place one OSD reached the full ratio 
> (pool not quite full, about 98%) and IO stopped (as expected when an OSD 
> reaches the full ratio).

I *think* pool full doesn't actually matter if you haven't set quotas,
but those properties have seen some code changes recently. CCing RADOS
people.
We do have a proposed fix but it seems to have languished. :(
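If you want to double-check which limits are actually in play, something like
this shows the full/backfillfull ratios and any pool quotas (pool name taken
from your `ceph df` output):

  ceph osd dump | grep ratio            # full_ratio, backfillfull_ratio, nearfull_ratio
  ceph df detail                        # per-pool usage including quota columns
  ceph osd pool get-quota vm_stretch_live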
-Greg

> I was able to rebalance the OSDs by manually doing reweights. Now the 
> cluster is much more balanced and even the pool shows more free space (about 
> 75% used).
>
> Also the pg-autoscaler does not really play well with the stretch crush rule 
> ... had to increase / adjust the PGs manually to get a better distribution.
>
> Regards,
> Kilian
> 
> Von: Kilian Ries 
> Gesendet: Mittwoch, 19. April 2023 12:18:06
> An: ceph-users
> Betreff: [ceph-users] Ceph stretch mode / POOL_BACKFILLFULL
>
> Hi,
>
>
> we run a ceph cluster in stretch mode with one pool. We know about this bug:
>
>
> https://tracker.ceph.com/issues/56650
>
> https://github.com/ceph/ceph/pull/47189
>
>
> Can anyone tell me what happens when a pool gets to 100% full? At the moment 
> raw OSD usage is about 54% but ceph throws me a "POOL_BACKFILLFULL" error:
>
>
> $ ceph df
>
> --- RAW STORAGE ---
>
> CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
> ssd    63 TiB  29 TiB  34 TiB  34 TiB    54.19
> TOTAL  63 TiB  29 TiB  34 TiB  34 TiB    54.19
>
> --- POOLS ---
>
> POOL             ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> .mgr              1    1  415 MiB      105  1.2 GiB   0.04    1.1 TiB
> vm_stretch_live   2   64   15 TiB    4.02M   34 TiB  95.53    406 GiB
>
>
>
> So the pool warning / calculation is just a bug, because it thinks it's 50% of 
> the total size. I know ceph will stop IO / set OSDs to read-only if they hit the 
> "backfillfull_ratio" ... but what will happen if the pool gets to 100% full?
>
>
> Will IO still be possible?
>
>
> No limits / quotas are set on the pool ...
>
>
> Thanks
>
> Regards,
>
> Kilian
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Thanks, Igor. I mentioned earlier that according to the OSD logs compaction
wasn't an issue. I did run `ceph-kvstore-tool` offline though; it completed
rather quickly without any warnings or errors, but the OSD kept showing
excessive latency.
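For reference, the offline compaction I ran was essentially the following (OSD
id and path are examples; how the OSD is stopped/started depends on how it is
deployed):

  systemctl stop ceph-osd@12            # or stop the OSD container under cephadm
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
  systemctl start ceph-osd@12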

I did something rather radical: rebooted the node and redeployed all its
OSDs, and now the "slow" OSD is showing latency more in line with that of the
other OSDs.

/Z

On Thu, 27 Apr 2023 at 23:10, Igor Fedotov  wrote:

> Hi Zakhar,
>
> you might want to try offline DB compaction using ceph-kvstore-tool for
> this specific OSD.
>
> Periodically we observe OSD perf drop due to degraded RocksDB
> performance, particularly after bulk data removal/migration.. Compaction
> is quite helpful in this case.
>
>
> Thanks,
>
> Igor
>
>
>
> On 26/04/2023 20:22, Zakhar Kirpichenko wrote:
> > Hi,
> >
> > I have a Ceph 16.2.12 cluster with uniform hardware, same drive
> make/model,
> > etc. A particular OSD is showing higher latency than usual in `ceph osd
> > perf`, usually mid to high tens of milliseconds while other OSDs show low
> > single digits, although its drive's I/O stats don't look different from
> > those of other drives. The workload is mainly random 4K reads and writes,
> > the cluster is being used as Openstack VM storage.
> >
> > Is there a way to trace, which particular PG, pool and disk image or
> object
> > cause this OSD's excessive latency? Is there a way to tell Ceph to
> >
> > I would appreciate any advice or pointers.
> >
> > Best regards,
> > Zakhar
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: architecture help (iscsi, rbd, backups?)

2023-04-27 Thread Alex Gorbachev
Hi Angelo,

Just some thoughts to consider from our experience with similar setups:

1. Use Proxmox instead of VMWare, or anything KVM based.  These VMs can
consume Ceph directly, and provide the same level of service (some may say
better) for live ,migration, hyperconvergence etc.  Then you run Windows
VMs in KVM, bring RBD storage to them as virtual disks and share out as
needed.

2. Use NFS - all modern Windows  OSs support it.  You can use any NFS
gateway you like, or set up your own machine or cluster (which is what we
did with Storcium) and export your storage as needed.

3. If you must use VMware, you can present datastores via NFS as well; this
has a lot of indirection but is easier to manage.

--
Alex Gorbachev
ISS Storcium
https://www.iss-integration.com



On Thu, Apr 27, 2023 at 5:06 PM Angelo Höngens  wrote:

> Hey guys and girls,
>
> I'm working on a project to build storage for one of our departments,
> and I want to ask you guys and girls for input on the high-level
> overview part. It's a long one, I hope you read along and comment.
>
> SUMMARY
>
> I made a plan last year to build a 'storage solution' including ceph
> and some windows VM's to expose the data over SMB to clients. A year
> later I finally have the hardware, built a ceph cluster, and I'm doing
> tests. Ceph itself runs great, but when I wanted to start exposing the
> data using iscsi to our VMware farm, I ran into some issues. I know
> the iscsi gateways will introduce some new performance bottlenecks,
> but I'm seeing really slow performance, still working on that.
>
> But then I ran into the warning on the iscsi gateway page: "The iSCSI
> gateway is in maintenance as of November 2022. This means that it is
> no longer in active development and will not be updated to add new
> features.". Wait, what? Why!? What does this mean? Does this mean that
> iSCSI is now 'feature complete' and will still be supported the next 5
> years, or will it be deprecated in the future? I tried searching, but
> couldn't find any info on the decision and the roadmap.
>
> My goal is to build a future-proof setup, and using deprecated
> components should not be part of that of course.
>
> If the iscsi gateway will still be supported the next few years and I
> can iron out the performance issues, I can still go on with my
> original plan. If not, I have to go back to the drawing board. And
> maybe you guys would advise me to take another route anyway.
>
> GOALS
>
> My goals/considerations are:
>
> - we want >1PB of storage capacity for cheap (on a tight budget) for
> research data. Most of it is 'store once, read sometimes'. <1% of the
> data is 'hot'.
> - focus is on capacity, but it would be nice to have > 200MB/s of
> sequential write/read performance and not 'totally suck' on random
> i/o. Yes, not very well quantified, but ah. Sequential writes are most
> important.
> - end users all run Windows computers (mostly VDI's) and a lot of
> applications require SMB shares.
> - security is a big thing, we want really tight ACL's, specific
> monitoring agents, etc.
> - our data is incredibly important to us, we still want the 3-2-1
> backup rule. Primary storage solution, a second storage solution in a
> different place, and some of the data that is not reproducible is also
> written to tape. We also want to be protected from ransomware or user
> errors (so no direct replication to the second storage).
> - I like open source, reliability, no fork-lift upgrades, no vendor
> lock-in, blah, well, I'm on the ceph list here, no need to convince
> you guys ;)
> - We're hiring a commercial company to do ceph maintenance and support
> for when I'm on leave or leaving the company, but they won't support
> clients, backup software, etc, so I want something as simple as
> possible. We do have multiple Windows/VMware admins, but no other real
> linux guru's.
>
> THE INITIAL PLAN
>
> Given these considerations, I ordered two identical clusters, each
> consisting of 3 monitor nodes and 8 osd nodes, Each osd node has 2
> ssd's and 10 capacity disks (EC 4:2 for the data), and each node is
> connected using a 2x25Gbps bond. Ceph is running like a charm. Now I
> just have to think about exposing the data to end users, and I've been
> testing different setups.
>
> My original plan was to expose for example 10x100TB rbd images using
> iSCSI to our VMware farm, formatting the luns with VMFS6, and run for
> example 2 Windows file servers per datastore on that with a single DFS
> namespace to end users. Then backup the file servers using our
> existing Veeam infrastructure to RGW running on the second cluster
> with an immutable bucket. This way we would have easily defined
> security boundaries: the clients can only reach the file servers, the
> file servers only see their local VMDK's, ESX only sees the luns on
> the iSCSI target, etc. When a file server would be compromised, it
> would have no access to ceph. We have easy incremental backups,
> immutability for 

[ceph-users] Re: architecture help (iscsi, rbd, backups?)

2023-04-27 Thread Anthony D'Atri
There is also a direct RBD client for MS Windows, though it's relatively young.
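Once the Ceph for Windows installer (which bundles the WNBD driver) and a
ceph.conf/keyring are in place, mapping an image looks roughly like this (pool
and image names are placeholders):

  rbd device map rbd/test-image        # attaches the image as a local disk
  rbd device list
  rbd device unmap rbd/test-image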

> On Apr 27, 2023, at 18:20, Bailey Allison  wrote:
> 
> Hey Angelo,
> 
> Just to make sure I'm understanding correctly, the main idea for the use
> case is to be able to present Ceph storage to windows clients as SMB? 
> 
> If so, you can absolutely use CephFS to get that done. This is something we
> do all the time with our cluster configurations; if we're looking to present
> ceph storage to windows clients as a file server, this is our standard choice.
> To your point about security/ACLs, we can join the samba server to an existing
> Active Directory and then assign permissions through Windows.
> 
> I will provide a high level overview of an average setup to hopefully
> explain it better, and of course if you have any questions please let me
> know. I understand that this is way different of a setup of what you
> currently have planned, but it's a different choice that could prove useful
> in your case.
> 
> Essentially how it works is we have ceph cluster with CephFS configured, of
> which we map CephFS kernel mounts onto some gateway nodes, at which point we
> expose to clients via CTDB with SMB shares (CTDB for high availability). 
> 
> i.e
> 
> ceph cluster > ceph fs > map cephfs kernel mount on linux client > create
> smb share on top of cephfs kernel mount > connect to samba share with
> windows clients.
> 
> The SMB gateway nodes hosting samba also can be joined to an Active
> Directory to allow setting Windows ACL permissions to allow more in depth
> control of ACLs.
> 
> Also I will say +1 for the RBD driver on Windows, something we also make use
> of a lot and have a lot of success with.
> 
> Again, please let me know if you need any insight or clarification, or have
> any further questions. Hope this is of assistance.
> 
> Regards,
> 
> Bailey
> 
> -Original Message-
>> From: Angelo Höngens  
>> Sent: April 27, 2023 6:06 PM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] architecture help (iscsi, rbd, backups?)
>> 
>> Hey guys and girls,
>> 
>> I'm working on a project to build storage for one of our departments, and I
> want to ask you guys and girls for input on the high-level overview part.
> It's a long one, I hope you read along and comment.
>> 
>> SUMMARY
>> 
>> I made a plan last year to build a 'storage solution' including ceph and
> some windows VM's to expose the data over SMB to clients. A year later I
> finally have the hardware, built a ceph cluster, and I'm doing tests. Ceph
> itself runs great, but when I wanted to start exposing the data using iscsi
> to our VMware farm, I ran into some issues. I know the iscsi gateways will
> introduce some new performance bottlenecks, but I'm seeing really slow
> performance, still working on that.
>> 
>> But then I ran into the warning on the iscsi gateway page: "The iSCSI
> gateway is in maintenance as of November 2022. This means that it is no
> longer in active development and will not be updated to add new features.".
> Wait, what? Why!? What does this mean? Does this mean that iSCSI is now
> 'feature complete' and will still be supported the next 5 years, or will it
> be deprecated in the future? I tried searching, but couldn't find any info
> on the decision and the roadmap.
>> 
>> My goal is to build a future-proof setup, and using deprecated components
> should not be part of that of course.
>> 
>> If the iscsi gateway will still be supported the next few years and I can
> iron out the performance issues, I can still go on with my original plan. If
> not, I have to go back to the drawing board. And maybe you guys would advise
> me to take another route anyway.
>> 
>> GOALS
>> 
>> My goals/considerations are:
>> 
>> - we want >1PB of storage capacity for cheap (on a tight budget) for
> research data. Most of it is 'store once, read sometimes'. <1% of the data
> is 'hot'.
>> - focus is on capacity, but it would be nice to have > 200MB/s of
> sequential write/read performance and not 'totally suck' on random i/o. Yes,
> not very well quantified, but ah. Sequential writes are most important.
>> - end users all run Windows computers (mostly VDI's) and a lot of
> applications require SMB shares.
>> - security is a big thing, we want really tight ACL's, specific monitoring
> agents, etc.
>> - our data is incredibly important to us, we still want the 3-2-1 backup
> rule. Primary storage solution, a second storage solution in a different
> place, and some of the data that is not reproducible is also written to
> tape. We also want to be protected from ransomware or user errors (so no
> direct replication to the second storage).
>> - I like open source, reliability, no fork-lift upgrades, no vendor
> lock-in, blah, well, I'm on the ceph list here, no need to convince you guys
> ;)
>> - We're hiring a commercial company to do ceph maintenance and support for
> when I'm on leave or leaving the company, but they 

[ceph-users] Re: architecture help (iscsi, rbd, backups?)

2023-04-27 Thread Bailey Allison
Hey Angelo,

Just to make sure I'm understanding correctly, the main idea for the use
case is to be able to present Ceph storage to windows clients as SMB? 

If so, you can absolutely use CephFS to get that done. This is something we
do all the time with our cluster configurations; if we're looking to present
ceph storage to windows clients as a file server, this is our standard choice.
To your point about security/ACLs, we can join the samba server to an existing
Active Directory and then assign permissions through Windows.

I will provide a high level overview of an average setup to hopefully
explain it better, and of course if you have any questions please let me
know. I understand that this is way different of a setup of what you
currently have planned, but it's a different choice that could prove useful
in your case.

Essentially how it works is we have ceph cluster with CephFS configured, of
which we map CephFS kernel mounts onto some gateway nodes, at which point we
expose to clients via CTDB with SMB shares (CTDB for high availability). 

i.e

ceph cluster > ceph fs > map cephfs kernel mount on linux client > create
smb share on top of cephfs kernel mount > connect to samba share with
windows clients.

The SMB gateway nodes hosting samba also can be joined to an Active
Directory to allow setting Windows ACL permissions to allow more in depth
control of ACLs.
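A very rough sketch of the gateway side, just to illustrate the idea (monitor
address, secret, mount point and share name are placeholders, and a real
CTDB/AD setup needs more than this):

  # kernel-mount CephFS on the gateway node
  mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=samba,secretfile=/etc/ceph/samba.secret

  # minimal share definition added to /etc/samba/smb.conf on top of that mount
  [projects]
      path = /mnt/cephfs/projects
      read only = no
      vfs objects = acl_xattr

  systemctl restart smbd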

Also I will say +1 for the RBD driver on Windows, something we also make use
of a lot and have a lot of success with.

Again, please let me know if you need any insight or clarification, or have
any further questions. Hope this is of assistance.

Regards,

Bailey

-Original Message-
>From: Angelo Höngens  
>Sent: April 27, 2023 6:06 PM
>To: ceph-users@ceph.io
>Subject: [ceph-users] architecture help (iscsi, rbd, backups?)
>
>Hey guys and girls,
>
>I'm working on a project to build storage for one of our departments, and I
want to ask you guys and girls for input on the high-level overview part.
It's a long one, I hope you read along and comment.
>
>SUMMARY
>
>I made a plan last year to build a 'storage solution' including ceph and
some windows VM's to expose the data over SMB to clients. A year later I
finally have the hardware, built a ceph cluster, and I'm doing tests. Ceph
itself runs great, but when I wanted to start exposing the data using iscsi
to our VMware farm, I ran into some issues. I know the iscsi gateways will
introduce some new performance bottlenecks, but I'm seeing really slow
performance, still working on that.
>
>But then I ran into the warning on the iscsi gateway page: "The iSCSI
gateway is in maintenance as of November 2022. This means that it is no
longer in active development and will not be updated to add new features.".
Wait, what? Why!? What does this mean? Does this mean that iSCSI is now
'feature complete' and will still be supported the next 5 years, or will it
be deprecated in the future? I tried searching, but couldn't find any info
on the decision and the roadmap.
>
>My goal is to build a future-proof setup, and using deprecated components
should not be part of that of course.
>
>If the iscsi gateway will still be supported the next few years and I can
iron out the performance issues, I can still go on with my original plan. If
not, I have to go back to the drawing board. And maybe you guys would advise
me to take another route anyway.
>
>GOALS
>
>My goals/considerations are:
>
>- we want >1PB of storage capacity for cheap (on a tight budget) for
research data. Most of it is 'store once, read sometimes'. <1% of the data
is 'hot'.
>- focus is on capacity, but it would be nice to have > 200MB/s of
sequential write/read performance and not 'totally suck' on random i/o. Yes,
not very well quantified, but ah. Sequential writes are most important.
>- end users all run Windows computers (mostly VDI's) and a lot of
applications require SMB shares.
>- security is a big thing, we want really tight ACL's, specific monitoring
agents, etc.
>- our data is incredibly important to us, we still want the 3-2-1 backup
rule. Primary storage solution, a second storage solution in a different
place, and some of the data that is not reproducible is also written to
tape. We also want to be protected from ransomware or user errors (so no
direct replication to the second storage).
>- I like open source, reliability, no fork-lift upgrades, no vendor
lock-in, blah, well, I'm on the ceph list here, no need to convince you guys
;)
>- We're hiring a commercial company to do ceph maintenance and support for
when I'm on leave or leaving the company, but they won't support clients,
backup software, etc, so I want something as simple as possible. We do have
multiple Windows/VMware admins, but no other real linux guru's.
>
>THE INITIAL PLAN
>
>Given these considerations, I ordered two identical clusters, each
consisting of 3 monitor nodes and 8 osd nodes, Each osd node has 2 ssd's and
10 capacity disks 

[ceph-users] 16.2.13 pacific QE validation status

2023-04-27 Thread Yuri Weinstein
Details of this release are summarized here:

https://tracker.ceph.com/issues/59542#note-1
Release Notes - TBD

Seeking approvals for:

smoke - Radek, Laura
rados - Radek, Laura
  rook - Sébastien Han
  cephadm - Adam K
  dashboard - Ernesto

rgw - Casey
rbd - Ilya
krbd - Ilya
fs - Venky, Patrick
upgrade/octopus-x (pacific) - Laura (look the same as in 16.2.8)
upgrade/pacific-p2p - Laura
powercycle - Brad (SELinux denials)
ceph-volume - Guillaume, Adam K

Thx
YuriW
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] architecture help (iscsi, rbd, backups?)

2023-04-27 Thread Angelo Höngens
Hey guys and girls,

I'm working on a project to build storage for one of our departments,
and I want to ask you guys and girls for input on the high-level
overview part. It's a long one, I hope you read along and comment.

SUMMARY

I made a plan last year to build a 'storage solution' including ceph
and some windows VM's to expose the data over SMB to clients. A year
later I finally have the hardware, built a ceph cluster, and I'm doing
tests. Ceph itself runs great, but when I wanted to start exposing the
data using iscsi to our VMware farm, I ran into some issues. I know
the iscsi gateways will introduce some new performance bottlenecks,
but I'm seeing really slow performance, still working on that.

But then I ran into the warning on the iscsi gateway page: "The iSCSI
gateway is in maintenance as of November 2022. This means that it is
no longer in active development and will not be updated to add new
features.". Wait, what? Why!? What does this mean? Does this mean that
iSCSI is now 'feature complete' and will still be supported the next 5
years, or will it be deprecated in the future? I tried searching, but
couldn't find any info on the decision and the roadmap.

My goal is to build a future-proof setup, and using deprecated
components should not be part of that of course.

If the iscsi gateway will still be supported the next few years and I
can iron out the performance issues, I can still go on with my
original plan. If not, I have to go back to the drawing board. And
maybe you guys would advise me to take another route anyway.

GOALS

My goals/considerations are:

- we want >1PB of storage capacity for cheap (on a tight budget) for
research data. Most of it is 'store once, read sometimes'. <1% of the
data is 'hot'.
- focus is on capacity, but it would be nice to have > 200MB/s of
sequential write/read performance and not 'totally suck' on random
i/o. Yes, not very well quantified, but ah. Sequential writes are most
important.
- end users all run Windows computers (mostly VDI's) and a lot of
applications require SMB shares.
- security is a big thing, we want really tight ACL's, specific
monitoring agents, etc.
- our data is incredibly important to us, we still want the 3-2-1
backup rule. Primary storage solution, a second storage solution in a
different place, and some of the data that is not reproducible is also
written to tape. We also want to be protected from ransomware or user
errors (so no direct replication to the second storage).
- I like open source, reliability, no fork-lift upgrades, no vendor
lock-in, blah, well, I'm on the ceph list here, no need to convince
you guys ;)
- We're hiring a commercial company to do ceph maintenance and support
for when I'm on leave or leaving the company, but they won't support
clients, backup software, etc, so I want something as simple as
possible. We do have multiple Windows/VMware admins, but no other real
linux guru's.

THE INITIAL PLAN

Given these considerations, I ordered two identical clusters, each
consisting of 3 monitor nodes and 8 osd nodes. Each osd node has 2
ssd's and 10 capacity disks (EC 4:2 for the data), and each node is
connected using a 2x25Gbps bond. Ceph is running like a charm. Now I
just have to think about exposing the data to end users, and I've been
testing different setups.

My original plan was to expose for example 10x100TB rbd images using
iSCSI to our VMware farm, formatting the luns with VMFS6, and run for
example 2 Windows file servers per datastore on that with a single DFS
namespace to end users. Then backup the file servers using our
existing Veeam infrastructure to RGW running on the second cluster
with an immutable bucket. This way we would have easily defined
security boundaries: the clients can only reach the file servers, the
file servers only see their local VMDK's, ESX only sees the luns on
the iSCSI target, etc. When a file server would be compromised, it
would have no access to ceph. We have easy incremental backups,
immutability for ransomware protection, etc. And the best part is that
the ceph admin can worry about ceph, the vmware admin can focus on
ESX, VMFS and all the vmware stuff, and the Windows admins can focus
on the Windows boxes, Windows-specific ACLS and tools and Veeam
backups and stuff.

CURRENT SITUATION

I'm building out this plan now, but I'm running into issues with
iSCSI. Are any of you doing something similar? What is your iscsi
performance compared to direct rbd?
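For what it's worth, on the rbd side I get a repeatable baseline with something
like the following (image name is a placeholder):

  # sequential write baseline straight against the image, bypassing iSCSI/SMB
  rbd bench --io-type write --io-size 4M --io-threads 16 --io-total 10G rbd/testimage
  # and a small random-write pass
  rbd bench --io-type write --io-size 4K --io-pattern rand --io-total 1G rbd/testimage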

In regard to performance: If I take 2 test windows VM's, I put one on
an iSCSI datastore and another with direct rbd access using the
windows rbd driver, I create a share on those boxes and push data to
it, I see different results (of course). Copying some iso images over
SMB to the 'windows vm running direct rbd' I see around 800MB/s write,
and 200MB/s read, which is pretty okay. When I send data to the
'windows vm running on top of iscsi' it starts writing at around
350MB/s, but after like 10-20 seconds drops to 100MB/s and 

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Igor Fedotov

Hi Zakhar,

you might want to try offline DB compaction using ceph-kvstore-tool for 
this specific OSD.


Periodically we observe OSD perf drop due to degraded RocksDB 
performance, particularly after bulk data removal/migration.. Compaction 
is quite helpful in this case.



Thanks,

Igor



On 26/04/2023 20:22, Zakhar Kirpichenko wrote:

Hi,

I have a Ceph 16.2.12 cluster with uniform hardware, same drive make/model,
etc. A particular OSD is showing higher latency than usual in `ceph osd
perf`, usually mid to high tens of milliseconds while other OSDs show low
single digits, although its drive's I/O stats don't look different from
those of other drives. The workload is mainly random 4K reads and writes,
the cluster is being used as Openstack VM storage.

Is there a way to trace, which particular PG, pool and disk image or object
cause this OSD's excessive latency? Is there a way to tell Ceph to

I would appreciate any advice or pointers.

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Frank Schilder
Hi, I asked a similar question about increasing scrub throughput some time ago 
and couldn't get a fully satisfying answer: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NHOHZLVQ3CKM7P7XJWGVXZUXY24ZE7RK

My observation is that much fewer (deep) scrubs are scheduled than could be 
executed. Some people wrote scripts to do scrub scheduling in a more efficient 
way (by last-scrub time stamp), but I don't want to go this route (yet). 
Unfortunately, the thread above does not contain the full conversation, I think 
it forked into a second one with the same or a similar title.

About performance calculations, along the lines of

> they were never setup to have enough IOPS to support the maintenance load,
> never mind the maintenance load plus the user load

> initially setup, it is nearly empty, so it appears
> to perform well even if it was setup with inexpensive but slow/large
> HDDs, then it becomes fuller and therefore heavily congested

There is a bit more to that. HDDs have the unfortunate property that sector 
reads/writes are not independent of which sector is read/written to. An empty 
drive will serve IO from the beginning of the disk when everything is fast. As 
drives fill up, they start using slower and slower regions. This performance 
degradation is in addition to the effects of longer seek paths and 
fragmentation.

Here I'm talking only about enterprise data centre drives with proper sustained 
performance profiles, not cheap stuff that falls apart once you go serious.

Unfortunately, ceph adds on top of that the lack of tail merging support, which 
makes small objects extra expensive.

Still, ceph was written for HDDs and actually performs well if IO calculations 
are done properly. For example, 8TB vs. 18TB drives. 8TB drives start with 
about 150MB/s bandwidth at the fast part and slow down to 80-100MB/s when you 
reach the end. 18TB drives are not just 8TB drives with denser packing, they 
actually have more platters. That means, they start out at 250MB/s and reach 
something like 100-130MB/s towards the end. It's more than double the capacity, 
but not more than double the throughput. IOP/s are roughly the same, so IOP/s 
per TB go down a lot with capacity.

When is this fine and when is it problematic? It's fine if you have large 
objects that are never modified. Then ceph will usually reach sequential 
read/write performance and scrubbing will be done within a week (with less than 
10% utilisation, which is good). The other extreme is many small objects, in 
which case your observed performance/throughput can be terrible and scrubbing 
might never end.

For being able to make reasonable estimates, you need to know real-life object 
size distributions and if full object writes are effectively sequential 
(meaning you have large bluestore alloc sizes in general; look at the bluestore 
performance counters, which will indicate how many large and how many small 
writes you have).
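For example, something like this shows the big/small write split for one OSD
(OSD id is an example, and it assumes jq):

  # counts of small (sub-alloc-size) vs. big (alloc-size-aligned) writes since OSD start
  ceph tell osd.0 perf dump | jq '.bluestore | {bluestore_write_big, bluestore_write_small}'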

We have a fairly mixed size distribution with, unfortunately, quite a 
percentage of small objects on our ceph fs. We do have 18T drives, which are 
about 30% utilised. Scrubbing still finishes within less than 2 weeks even with 
the outliers due to "not ideal" scrub scheduling (thread above). I'm willing to 
accept up to 4 weeks tail time, which will probably give me 50-60% utilisation 
before things go below acceptable.

In essence, the 18T average performance drives are something like 10T pretty 
good performance drives compared with the usual 8T drives. You just have to let 
go of 100% capacity utilisation. The limit is what comes first, capacity- or 
IOP/s saturation. Once admin workload cannot complete in time, that's it, the 
disks are full and one needs to expand.

We have about 900 HDDs in our cluster and I maintain this large number mostly 
for performance reasons. I don't think I will ever see more than 50% 
utilisation before we change deployment or add drives.

Looking at our data in more detail, most of it is ice cold. Therefore, in the 
long run we plan to go for tiered OSDs (bcache/dm-cache) with sufficient total 
SSD capacity to hold about 2 times all hot data. Then, maybe, we can fill big 
drives a bit more.

I was looking into large capacity SSDs and, I'm afraid, when going to the 
>=18TB SSD section they either have bad and often worse performance than 
spinners, or are massively expensive. With performance here I mean bandwidth. 
Large SSDs can have a sustained bandwidth of 30MB/s. They will still do about 
500-1000IOP/s per TB, but large file transfer or backfill will become a pain.

I looked at models with reasonable bandwidth and asked if I could get a price. 
The answer was that one such disk costs more than an entire of our standard 
storage servers. Clearly not our league. A better solution is to combine the 
best of both worlds and have a more intelligent software that can differentiate 
between hot and cold data and may be able to adapt to workloads.

> the best that can be 

[ceph-users] Re: Massive OMAP remediation

2023-04-27 Thread dongdong tao
Luminous 12.2.11 still defaults to bluefs_buffered_io = false, which will 
disable the kernel-side cache for rocksdb. 
It's possible that your NVMe is saturated by the massive rocksdb workload 
when bluefs_buffered_io is disabled.
So one thing you could try is to set bluefs_buffered_io = true on those OSDs 
that hold the large OMAP object.
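On Luminous that would be roughly the following (OSD id is an example; the
option may only fully take effect after an OSD restart, so persisting it in
ceph.conf is the safer route):

  # runtime injection on the OSDs holding that PG
  ceph tell osd.12 injectargs '--bluefs_buffered_io=true'
  # and persist it in the [osd] section of ceph.conf:
  #   bluefs_buffered_io = true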
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Peter Grandi
On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only numbers 
already divided by replication factor), you need 55 days to scrub it once.

That's 8x larger than the default scrub factor [...] Also, even if I set
the default scrub interval to 8x larger, my disks will still be thrashing
seeks 100% of the time, affecting the cluster's throughput and latency
performance.


Indeed! Every Ceph instance I have seen (not many) and almost every HPC 
storage system I have seen have this problem, and that's because they 
were never set up to have enough IOPS to support the maintenance load, 
never mind the maintenance load plus the user load (and as a rule not 
even the user load).


There is a simple reason why this happens: when a large Ceph (etc.) 
storage instance is initially set up, it is nearly empty, so it appears 
to perform well even if it was set up with inexpensive but slow/large 
HDDs; then it becomes fuller and therefore heavily congested, but whoever 
set it up has already changed jobs or been promoted because of their 
initial success (or they invent excuses).


A figure-of-merit that matters is IOPS-per-used-TB, and making it large 
enough to support concurrent maintenance (scrubbing, backfilling, 
rebalancing, backup) and user workloads. That is *expensive*, so in my 
experience very few storage instance buyers aim for that.


The CERN IT people discovered long ago that quotes for storage workers 
always used very slow/large HDDs that performed very poorly if the specs 
were given as mere capacity, so they switched to requiring a different 
metric, 18MB/s transfer rate of *interleaved* read and write per TB of 
capacity, that is at least two parallel access streams per TB.


https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18MB/s of interleaved read and write 
per TB is high enough to support simultaneous maintenance and user loads 
for most Ceph instances, especially in HPC.


I have seen HPC storage systems "designed" around 10TB and even 18TB 
HDDs, and the best that can be said about those HDDs is that they should 
be considered "tapes" with some random access ability.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Object data missing, but metadata is OK (Quincy 17.2.3)

2023-04-27 Thread Jeff Briden
We have a large cluster on Quincy 17.2.3 with a bucket holding 8.9 million 
small (15~20 MiB) objects.  
All the objects were multipart uploads from scripts using `aws s3 cp` 
The data is static (write-once, read-many) with no manual deletions and no new 
writes for months.
We recently found 3 objects in this bucket that cannot be retrieved.
The symptom is exactly the same as https://tracker.ceph.com/issues/47866 and 
https://bugzilla.redhat.com/show_bug.cgi?id=1892644 which were fixed a long 
time ago.
Any form of listing (`aws s3 ls`, radosgw-admin object stat, radoslist, http 
head request, etc)  returns good data, but the objects cannot be retrieved and 
rados -p ls shows the object data is missing.
Any suggestions on how to troubleshoot this further?
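For reference, the kind of checks run so far (bucket/object/pool names are
placeholders for the real ones):

  radosgw-admin object stat --bucket=mybucket --object=myobject     # metadata looks fine
  radosgw-admin bucket radoslist --bucket=mybucket | grep myobject  # lists the expected rados objects
  # ...but the corresponding objects are gone from the data pool:
  rados -p default.rgw.buckets.data ls | grep <object-marker>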
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug, pg_upmap_primaries.empty()

2023-04-27 Thread Nguetchouang Ngongang Kevin
I don't think it's a commit from yesterday; I have had this issue since last
week.

The command "ceph features" shows me that my clients have the luminous
version, but I don't know how to upgrade the client version (ceph osd
set-require-min-compat-client is not upgrading the client version).
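For what it's worth, I assume the relevant things to look at are along these
lines (the pg id is a placeholder; the rm-pg-upmap-primary part is only my guess):

  ceph features                              # clients still report luminous
  ceph osd dump | grep pg_upmap_primary      # the mappings the old clients choke on
  # presumably these could be removed one by one with:
  #   ceph osd rm-pg-upmap-primary <pgid>
  ceph osd set-require-min-compat-client reef   # raises the requirement, does not upgrade clients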

---
Nguetchouang Ngongang Kevin
ENS de Lyon
https://perso.ens-lyon.fr/kevin.nguetchouang/ 

Le 2023-04-26 15:58, Gregory Farnum a écrit :

> Looks like you've somehow managed to enable the upmap balancer while
> allowing a client that's too old to understand it to mount.
> 
> Radek, this is a commit from yesterday; is it a known issue?
> 
> On Wed, Apr 26, 2023 at 7:49 AM Nguetchouang Ngongang Kevin
>  wrote: 
> Good morning, i found a bug on ceph reef
> 
> After installing ceph and deploying 9 osds with a cephfs layer. I got
> this error after many writing and reading operations on the ceph fs i
> deployed.
> 
> ```{
> "assert_condition": "pg_upmap_primaries.empty()",
> "assert_file":
> "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-3593-g1e73409b/rpm/el8/BUILD/ceph-18.0.0-3593-g1e73409b/src/osd/OSDMap.cc",
> "assert_func": "void OSDMap::encode(ceph::buffer::v15_2_0::list&,
> uint64_t) const",
> "assert_line": 3239,
> "assert_msg":
> "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-3593-g1e73409b/rpm/el8/BUILD/ceph-18.0.0-3593-g1e73409b/src/osd/OSDMap.cc:
> In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t)
> const' thread 7f86cb8e5700 time
> 2023-04-26T12:25:12.278025+\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-3593-g1e73409b/rpm/el8/BUILD/ceph-18.0.0-3593-g1e73409b/src/osd/OSDMap.cc:
> 3239: FAILED ceph_assert(pg_upmap_primaries.empty())\n",
> "assert_thread_name": "msgr-worker-0",
> "backtrace": [
> "/lib64/libpthread.so.0(+0x12cf0) [0x7f86d0d21cf0]",
> "gsignal()",
> "abort()",
> "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x18f) [0x55ce1794774b]",
> "/usr/bin/ceph-osd(+0x6368b7) [0x55ce179478b7]",
> "(OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long)
> const+0x1229) [0x55ce183e0449]",
> "(MOSDMap::encode_payload(unsigned long)+0x396)
> [0x55ce17ae2576]",
> "(Message::encode(unsigned long, int, bool)+0x2e)
> [0x55ce1825dbee]",
> "(ProtocolV1::prepare_send_message(unsigned long, Message*,
> ceph::buffer::v15_2_0::list&)+0x54) [0x55ce184e5914]",
> "(ProtocolV1::write_event()+0x511) [0x55ce184f4ce1]",
> "(EventCenter::process_events(unsigned int,
> std::chrono::duration *)+0xa64) 
> [0x55ce182eb484]", "/usr/bin/ceph-osd(+0xfdf276) [0x55ce182f0276]",
> "/lib64/libstdc++.so.6(+0xc2b13) [0x7f86d0369b13]",
> "/lib64/libpthread.so.0(+0x81ca) [0x7f86d0d171ca]",
> "clone()"
> ],
> "ceph_version": "18.0.0-3593-g1e73409b",
> "crash_id":
> "2023-04-26T12:25:12.286947Z_55675d7c-7833-4e91-b0eb-6df705104c2e",
> "entity_name": "osd.0",
> "os_id": "centos",
> "os_name": "CentOS Stream",
> "os_version": "8",
> "os_version_id": "8",
> "process_name": "ceph-osd",
> "stack_sig":
> "0ffad2c4bc07caf68ff1e124d3911823bc6fa6f5772444754b7f0a998774c8fe",
> "timestamp": "2023-04-26T12:25:12.286947Z",
> "utsname_hostname": "node1-link-1",
> "utsname_machine": "x86_64",
> "utsname_release": "5.4.0-100-generic",
> "utsname_sysname": "Linux",
> "utsname_version": "#113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022"
> }
> 
> ```
> 
> I really don't know what is this error for, Will appreciate any help.
> 
> Cordially,
> 
> --
> Nguetchouang Ngongang Kevin
> ENS de Lyon
> https://perso.ens-lyon.fr/kevin.nguetchouang/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Radosgw multisite replication issues

2023-04-27 Thread Casey Bodley
On Thu, Apr 27, 2023 at 11:36 AM Tarrago, Eli (RIS-BCT)
 wrote:
>
> After working on this issue for a bit.
> The active plan is to fail over master, to the “west” dc. Perform a realm 
> pull from the west so that it forces the failover to occur. Then have the 
> “east” DC, then pull the realm data back. Hopefully will get both sides back 
> in sync..
>
> My concern with this approach is both sides are “active”, meaning the client 
> has been writing data to both endpoints. Will this cause an issue where 
> “west” will have data that the metadata does not have record of, and then 
> delete the data?

no object data would be deleted as a result of metadata failover issues, no

>
> Thanks
>
> From: Tarrago, Eli (RIS-BCT) 
> Date: Thursday, April 20, 2023 at 3:13 PM
> To: Ceph Users 
> Subject: Radosgw multisite replication issues
> Good Afternoon,
>
> I am experiencing an issue where east-1 is no longer able to replicate from 
> west-1, however, after a realm pull, west-1 is now able to replicate from 
> east-1.
>
> In other words:
> West <- Can Replicate <- East
> West -> Cannot Replicate -> East
>
> After confirming the access and secret keys are identical on both sides, I 
> restarted all radosgw services.
>
> Here is the current status of the cluster below.
>
> Thank you for your help,
>
> Eli Tarrago
>
>
> root@east01:~# radosgw-admin zone get
> {
> "id": "ddd66ab8-0417-46ee-a53b-043352a63f93",
> "name": "rgw-east",
> "domain_root": "rgw-east.rgw.meta:root",
> "control_pool": "rgw-east.rgw.control",
> "gc_pool": "rgw-east.rgw.log:gc",
> "lc_pool": "rgw-east.rgw.log:lc",
> "log_pool": "rgw-east.rgw.log",
> "intent_log_pool": "rgw-east.rgw.log:intent",
> "usage_log_pool": "rgw-east.rgw.log:usage",
> "roles_pool": "rgw-east.rgw.meta:roles",
> "reshard_pool": "rgw-east.rgw.log:reshard",
> "user_keys_pool": "rgw-east.rgw.meta:users.keys",
> "user_email_pool": "rgw-east.rgw.meta:users.email",
> "user_swift_pool": "rgw-east.rgw.meta:users.swift",
> "user_uid_pool": "rgw-east.rgw.meta:users.uid",
> "otp_pool": "rgw-east.rgw.otp",
> "system_key": {
> "access_key": "PW",
> "secret_key": "H6"
> },
> "placement_pools": [
> {
> "key": "default-placement",
> "val": {
> "index_pool": "rgw-east.rgw.buckets.index",
> "storage_classes": {
> "STANDARD": {
> "data_pool": "rgw-east.rgw.buckets.data"
> }
> },
> "data_extra_pool": "rgw-east.rgw.buckets.non-ec",
> "index_type": 0
> }
> }
> ],
> "realm_id": "98e0e391-16fb-48da-80a5-08437fd81789",
> "notif_pool": "rgw-east.rgw.log:notif"
> }
>
> root@west01:~# radosgw-admin zone get
> {
>"id": "b2a4a31c-1505-4fdc-b2e0-ea07d9463da1",
> "name": "rgw-west",
> "domain_root": "rgw-west.rgw.meta:root",
> "control_pool": "rgw-west.rgw.control",
> "gc_pool": "rgw-west.rgw.log:gc",
> "lc_pool": "rgw-west.rgw.log:lc",
> "log_pool": "rgw-west.rgw.log",
> "intent_log_pool": "rgw-west.rgw.log:intent",
> "usage_log_pool": "rgw-west.rgw.log:usage",
> "roles_pool": "rgw-west.rgw.meta:roles",
> "reshard_pool": "rgw-west.rgw.log:reshard",
> "user_keys_pool": "rgw-west.rgw.meta:users.keys",
> "user_email_pool": "rgw-west.rgw.meta:users.email",
> "user_swift_pool": "rgw-west.rgw.meta:users.swift",
> "user_uid_pool": "rgw-west.rgw.meta:users.uid",
> "otp_pool": "rgw-west.rgw.otp",
> "system_key": {
> "access_key": "PxxW",
> "secret_key": "Hxx6"
> },
> "placement_pools": [
> {
> "key": "default-placement",
> "val": {
> "index_pool": "rgw-west.rgw.buckets.index",
> "storage_classes": {
> "STANDARD": {
> "data_pool": "rgw-west.rgw.buckets.data"
> }
> },
> "data_extra_pool": "rgw-west.rgw.buckets.non-ec",
> "index_type": 0
> }
> }
> ],
> "realm_id": "98e0e391-16fb-48da-80a5-08437fd81789",
> "notif_pool": "rgw-west.rgw.log:notif"
> east01:~# radosgw-admin metadata sync status
> {
> "sync_status": {
> "info": {
> "status": "init",
> "num_shards": 0,
> "period": "",
> "realm_epoch": 0
> },
> "markers": []
> },
> "full_sync": {
> "total": 0,
> "complete": 0
> }
> }
>
> west01:~#  radosgw-admin metadata sync status
> {
> "sync_status": {
> "info": {
> "status": "sync",
> "num_shards": 64,
> "period": "44b6b308-e2d8-4835-8518-c90447e7b55c",
> "realm_epoch": 3
> },
> "markers": [
>  

[ceph-users] Re: Radosgw multisite replication issues

2023-04-27 Thread Tarrago, Eli (RIS-BCT)
After working on this issue for a bit.
The active plan is to fail the master over to the “west” DC: perform a realm pull 
from the west so that it forces the failover to occur, then have the “east” DC 
pull the realm data back. Hopefully that will get both sides back in sync.

My concern with this approach is both sides are “active”, meaning the client 
has been writing data to both endpoints. Will this cause an issue where “west” 
will have data that the metadata does not have record of, and then delete the 
data?
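For reference, the failover sequence I have in mind is roughly the following
(endpoint URL, port and keys are placeholders; the RGW service name depends on
the deployment):

  # on a west-zone node: promote rgw-west to master
  radosgw-admin zone modify --rgw-zone=rgw-west --master --default
  radosgw-admin period update --commit
  # on an east-zone node: pull the realm/period back from west, then restart the gateways
  radosgw-admin realm pull --url=http://west01:8080 --access-key=<key> --secret=<secret>
  radosgw-admin period pull --url=http://west01:8080 --access-key=<key> --secret=<secret>
  systemctl restart ceph-radosgw.target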

Thanks

From: Tarrago, Eli (RIS-BCT) 
Date: Thursday, April 20, 2023 at 3:13 PM
To: Ceph Users 
Subject: Radosgw multisite replication issues
Good Afternoon,

I am experiencing an issue where east-1 is no longer able to replicate from 
west-1, however, after a realm pull, west-1 is now able to replicate from 
east-1.

In other words:
West <- Can Replicate <- East
West -> Cannot Replicate -> East

After confirming the access and secret keys are identical on both sides, I 
restarted all radosgw services.

Here is the current status of the cluster below.

Thank you for your help,

Eli Tarrago


root@east01:~# radosgw-admin zone get
{
"id": "ddd66ab8-0417-46ee-a53b-043352a63f93",
"name": "rgw-east",
"domain_root": "rgw-east.rgw.meta:root",
"control_pool": "rgw-east.rgw.control",
"gc_pool": "rgw-east.rgw.log:gc",
"lc_pool": "rgw-east.rgw.log:lc",
"log_pool": "rgw-east.rgw.log",
"intent_log_pool": "rgw-east.rgw.log:intent",
"usage_log_pool": "rgw-east.rgw.log:usage",
"roles_pool": "rgw-east.rgw.meta:roles",
"reshard_pool": "rgw-east.rgw.log:reshard",
"user_keys_pool": "rgw-east.rgw.meta:users.keys",
"user_email_pool": "rgw-east.rgw.meta:users.email",
"user_swift_pool": "rgw-east.rgw.meta:users.swift",
"user_uid_pool": "rgw-east.rgw.meta:users.uid",
"otp_pool": "rgw-east.rgw.otp",
"system_key": {
"access_key": "PW",
"secret_key": "H6"
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "rgw-east.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "rgw-east.rgw.buckets.data"
}
},
"data_extra_pool": "rgw-east.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"realm_id": "98e0e391-16fb-48da-80a5-08437fd81789",
"notif_pool": "rgw-east.rgw.log:notif"
}

root@west01:~# radosgw-admin zone get
{
   "id": "b2a4a31c-1505-4fdc-b2e0-ea07d9463da1",
"name": "rgw-west",
"domain_root": "rgw-west.rgw.meta:root",
"control_pool": "rgw-west.rgw.control",
"gc_pool": "rgw-west.rgw.log:gc",
"lc_pool": "rgw-west.rgw.log:lc",
"log_pool": "rgw-west.rgw.log",
"intent_log_pool": "rgw-west.rgw.log:intent",
"usage_log_pool": "rgw-west.rgw.log:usage",
"roles_pool": "rgw-west.rgw.meta:roles",
"reshard_pool": "rgw-west.rgw.log:reshard",
"user_keys_pool": "rgw-west.rgw.meta:users.keys",
"user_email_pool": "rgw-west.rgw.meta:users.email",
"user_swift_pool": "rgw-west.rgw.meta:users.swift",
"user_uid_pool": "rgw-west.rgw.meta:users.uid",
"otp_pool": "rgw-west.rgw.otp",
"system_key": {
"access_key": "PxxW",
"secret_key": "Hxx6"
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "rgw-west.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "rgw-west.rgw.buckets.data"
}
},
"data_extra_pool": "rgw-west.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"realm_id": "98e0e391-16fb-48da-80a5-08437fd81789",
"notif_pool": "rgw-west.rgw.log:notif"
east01:~# radosgw-admin metadata sync status
{
"sync_status": {
"info": {
"status": "init",
"num_shards": 0,
"period": "",
"realm_epoch": 0
},
"markers": []
},
"full_sync": {
"total": 0,
"complete": 0
}
}

west01:~#  radosgw-admin metadata sync status
{
"sync_status": {
"info": {
"status": "sync",
"num_shards": 64,
"period": "44b6b308-e2d8-4835-8518-c90447e7b55c",
"realm_epoch": 3
},
"markers": [
{
"key": 0,
"val": {
"state": 1,
"marker": "",
"next_step_marker": "",
"total_entries": 46,
"pos": 0,
"timestamp": "0.00",
"realm_epoch": 3
}
},
 goes on for a long time…
{
"key": 

[ceph-users] Re: Bucket notification

2023-04-27 Thread Szabo, Istvan (Agoda)
Hi,

I think the sasl handshake is the issue:

On gateway:
2023-04-26T15:25:49.341+0700 7f7f04d21700  1 ERROR: failed to create push 
endpoint:  due to: pubsub endpoint configuration error: unknown schema in:
2023-04-26T15:25:49.341+0700 7f7f04d21700  5 req 245249540 0.00365s 
s3:delete_obj WARNING: publishing notification failed, with error: -22
2023-04-26T15:25:49.341+0700 7f7f04d21700  2 req 245249540 0.00365s 
s3:delete_obj completing
2023-04-26T15:25:49.341+0700 7f7f04d21700  2 req 245249540 0.00365s 
s3:delete_obj op status=0
2023-04-26T15:25:49.341+0700 7f7f04d21700  2 req 245249540 0.00365s 
s3:delete_obj http status=204

2023-04-26T15:44:31.978+0700 7f7fc3e9f700 20 notification: 'bulknotif' on 
topic: 'bulk-upload-tool-ceph-notifications' and bucket: 
'connectivity-bulk-upload-file-bucket' (unique topic: 
'bulknotif_bulk-upload-tool-ceph-notifications') apply to event of type: 
's3:ObjectCreated:Put'
2023-04-26T15:44:31.978+0700 7f7fc3e9f700  1 ERROR: failed to create push 
endpoint:  due to: pubsub endpoint configuration error: unknown schema in:
2023-04-26T15:44:31.978+0700 7f7fc3e9f700  5 req 245262095 0.00456s 
s3:put_obj WARNING: publishing notification failed, with error: -22
2023-04-26T15:44:31.978+0700 7f7fc3e9f700  2 req 245262095 0.00456s 
s3:put_obj completing

In kafka we see this:
[2023-04-26 15:45:20,921] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] 
Failed authentication with /xx.93.1 (Unexpected Kafka request of type METADATA 
during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2023-04-26 15:45:21,921] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] 
Failed authentication with /xx.93.1 (Unexpected Kafka request of type METADATA 
during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2023-04-26 15:45:22,921] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] 
Failed authentication with /xx.93.1 (Unexpected Kafka request of type METADATA 
during SASL handshake.) (org.apache.kafka.common.network.Selector)
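
For reference, since the gateway logs an empty endpoint after "unknown
schema in:", it may be worth dumping the stored topic to confirm the
push-endpoint attribute was actually persisted. A sketch (substitute the
topic name from the log above):

radosgw-admin topic list
radosgw-admin topic get --topic <topic-name>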


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Apr 27., at 15:08, Yuval Lifshitz  wrote:



Hi Istvan,
Looks like you are using user/password and SSL on the communication channels 
between RGW and the Kafka broker.
Maybe the issue is around the certificate? could you please increase RGW debug 
logs to 20 and see if there are any kafka related errors there?

Yuval

On Tue, Apr 25, 2023 at 5:48 PM Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
Hi,

I'm trying to set a kafka endpoint for bucket object create operation 
notifications but the notification is not created in kafka endpoint.
Settings seem to be fine because I can upload objects to the bucket when these
settings are applied:

<NotificationConfiguration>
    <TopicConfiguration>
        <Id>bulknotif</Id>
        <Topic>arn:aws:sns:default::butcen</Topic>
        <Event>s3:ObjectCreated:*</Event>
        <Event>s3:ObjectRemoved:*</Event>
    </TopicConfiguration>
</NotificationConfiguration>

but it simply does not create any message in kafka.

This is my topic creation post request:

https://xxx.local/?
Action=CreateTopic&
Name=butcen&
kafka-ack-level=broker&
use-ssl=true&
push-endpoint=kafka://ceph:pw@xxx.local:9093

Am I missing something or it's definitely kafka issue?

Thank you



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Eugen,

Thanks again for your suggestions! The cluster is balanced, OSDs on this
host and other OSDs in the cluster are almost evenly utilized:

ID  CLASS  WEIGHT   REWEIGHT  SIZE RAW USE   DATA OMAP META
AVAIL%USE   VAR   PGS  STATUS
...
11hdd  9.38680   1.0  9.4 TiB   1.2 TiB  883 GiB  6.6 MiB  2.7 GiB
 8.2 TiB  12.29  1.16   90  up
12hdd  9.38680 0  0 B   0 B  0 B  0 B  0 B
 0 B  0 00  up -- this one is intentionally out
13hdd  9.38680   1.0  9.4 TiB   1.1 TiB  838 GiB  8.2 MiB  2.7 GiB
 8.3 TiB  11.82  1.12   92  up
14hdd  9.38680   1.0  9.4 TiB   1.1 TiB  838 GiB  7.6 MiB  2.4 GiB
 8.3 TiB  11.82  1.12   86  up
15hdd  9.38680   1.0  9.4 TiB   1.1 TiB  830 GiB  6.2 MiB  2.7 GiB
 8.3 TiB  11.74  1.11   80  up
16hdd  9.38680   1.0  9.4 TiB   1.1 TiB  809 GiB   11 MiB  2.7 GiB
 8.3 TiB  11.52  1.09   89  up
17hdd  9.38680   1.0  9.4 TiB   1.1 TiB  876 GiB  3.2 MiB  2.8 GiB
 8.2 TiB  12.22  1.16   86  up
18hdd  9.38680   1.0  9.4 TiB   1.1 TiB  826 GiB  3.0 MiB  2.4 GiB
 8.3 TiB  11.70  1.11   83  up
19hdd  9.38680   1.0  9.4 TiB   1.2 TiB  916 GiB  5.7 MiB  2.7 GiB
 8.2 TiB  12.64  1.20   99  up

I tried primary-affinity=0 for this OSD, this didn't have a noticeable
effect. The drive utilization is actually lower than the other ones:

04/27/2023 01:51:41 PM
Devicer/s rMB/s   rrqm/s  %rrqm r_await rareq-sz w/s
  wMB/s   wrqm/s  %wrqm w_await wareq-sz d/s dMB/s   drqm/s  %drqm
d_await dareq-sz  aqu-sz  %util
sda 23.73  1.08 9.89  29.421.7346.45   52.71
   0.79 5.63   9.660.3915.370.00  0.00 0.00   0.00
   0.00 0.000.05   6.59
sdb 16.60  0.72 6.69  28.742.4744.66   39.13
   0.59 4.83  10.980.4715.490.00  0.00 0.00   0.00
   0.00 0.000.02   3.07
sdc 20.33  0.99 9.33  31.461.4450.08   50.48
   0.78 5.27   9.450.5315.760.00  0.00 0.00   0.00
   0.00 0.000.06   2.46
>>> sdd 20.40  1.01 9.65  32.110.1950.89
52.07  0.83 5.84  10.080.8016.400.00  0.00 0.00
  0.000.00 0.000.04   2.34
sde 20.84  0.98 9.12  30.430.7848.34   49.57
   0.75 4.86   8.930.0415.560.00  0.00 0.00   0.00
   0.00 0.000.02   0.79
sdf 21.53  1.03 9.58  30.781.5949.01   48.30
   0.79 5.10   9.541.0616.700.00  0.00 0.00   0.00
   0.00 0.000.02   5.85
sdg 22.41  1.06 9.85  30.540.9348.32   48.60
   0.81 5.58  10.290.1416.990.00  0.00 0.00   0.00
   0.00 0.000.03   1.42
sdh 20.09  0.97 9.20  31.411.8349.55   50.06
   0.77 5.20   9.420.1815.660.00  0.00 0.00   0.00
   0.00 0.000.05   0.02
sdi 24.95  1.1410.42  29.451.2946.81   54.25
   0.88 6.10  10.100.2116.550.00  0.00 0.00   0.00
   0.00 0.000.03   5.21

There's a considerable difference between this and other OSDs in terms of
write speed:

# ceph tell osd.13 bench -f plain
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 11.20604339,
"bytes_per_sec": 95818103.377877429,
"iops": 22.844816059560163
}
# ceph tell osd.14 bench -f plain
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 26.16093169899,
"bytes_per_sec": 41043714.969870269,
"iops": 9.7855842041659997
}

In general, somehow OSD performance isn't great although the drives are
plenty fast and can easily do about 200 MB/s sequential reads and writes;
specifically, the one showing high latency is only half as fast as the other
OSDs.

I added `osd_scrub_sleep=0.1` for now in case scrubbing is a factor, and will
observe whether that changes anything; so far it has had no effect.
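
(For reference, a sketch of setting that cluster-wide via the central
config; the value is seconds of sleep injected between scrub chunks:

ceph config set osd osd_scrub_sleep 0.1
)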

/Z

On Thu, 27 Apr 2023 at 15:49, Eugen Block  wrote:

> I don't see anything obvious in the pg output, they are relatively
> small and don't hold many objects. If deep-scrubs would impact
> performance that much you would see that in the iostat output as well.
> Have you watched it for a while, maybe with -xmt options to see the
> %util column as well? Does that OSD show a higher utilization than
> other OSDs? Is the cluster evenly balanced (ceph osd df)? And also try
> the primary-affinity = 0 part, this would set most of the primary PGs
> on that OSD to non-primary and others would take over. If the new
> primary OSDs show increased latencies as well there might be something
> else going on.
>
> Zitat von Zakhar Kirpichenko :
>
> > Thanks, Eugen. I very much appreciate your time and replies.
> >
> > It's a hybrid OSD with DB/WAL on NVME (Micron_7300_MTFDHBE1T6TDG) and
> block
> > 

[ceph-users] Re: Bucket empty after resharding on multisite environment

2023-04-27 Thread Boris Behrens
Ok, I was able to do a backflip and revert to the old index files:

# Get stuff
radosgw-admin metadata get bucket.instance:BUCKET_NAME:NEW_BUCKET_ID >
bucket.instance:BUCKET_NAME:NEW_BUCKET_ID.json
radosgw-admin metadata get bucket:BUCKET_NAME > bucket:BUCKET_NAME.json

# create copy for fast rollback
cp bucket.instance:BUCKET_NAME:NEW_BUCKET_ID.json
new.bucket.instance:BUCKET_NAME:NEW_BUCKET_ID.json
cp bucket:BUCKET_NAME.json new.bucket:BUCKET_NAME.json

# edit the new.* files and replace all required fields with the correct
values.

# del stuff
radosgw-admin metadata rm bucket:BUCKET_NAME
radosgw-admin metadata rm bucket.instance:BUCKET_NAME:NEW_BUCKET_ID

# upload stuff
radosgw-admin metadata put bucket:BUCKET_NAME < new.bucket:BUCKET_NAME.json
radosgw-admin metadata put bucket.instance:BUCKET_NAME:OLD_BUCKET_ID <
new.bucket.instance:BUCKET_NAME:OLD_BUCKET_ID.json

# rollback in case it did not work
radosgw-admin metadata rm bucket:BUCKET_NAME
radosgw-admin metadata rm bucket.instance:BUCKET_NAME:OLD_BUCKET_ID
radosgw-admin metadata put bucket:BUCKET_NAME < bucket:BUCKET_NAME.json
radosgw-admin metadata put bucket.instance:BUCKET_NAME:OLD_BUCKET_ID <
bucket.instance:BUCKET_NAME:NEW_BUCKET_ID.json


Am Do., 27. Apr. 2023 um 13:32 Uhr schrieb Boris Behrens :

> To clarify a bit:
> The bucket data is not in the main zonegroup.
> I wanted to start the reshard in the zonegroup where the bucket and the
> data is located, but rgw told me to do it in the primary zonegroup.
>
> So I did it there and the index on the zonegroup where the bucket is
> located is empty.
>
> We only sync metadata between the zonegroups not the actual data
> (basically have working credentials in all replicated zones but the buckets
> only life in one place)
>
> radosgw-admin bucket stats shows me the correct ID/marker and the amount
> of shards in all location.
> radosgw-admin reshard status shows 101 entries with "not-resharding"
> radosgw-admin reshard stale-instances list --yes-i-really-mean-it does NOT
> show the bucket
> radosgw-admin bucket radoslist --bucket BUCKET is empty
> radosgw-admin bucket bi list --bucket BUCKET is empty
> radosgw-admin bucket radoslist --bucket-id BUCKETID list files
>
> Am Do., 27. Apr. 2023 um 13:08 Uhr schrieb Boris Behrens :
>
>> Hi,
>> I just resharded a bucket on an octopus multisite environment from 11 to
>> 101.
>>
>> I did it on the master zone and it went through very fast.
>> But now the index is empty.
>>
>> The files are still there when doing a radosgw-admin bucket radoslist
>> --bucket-id
>> Do I just need to wait or do I need to recover that somehow?
>>
>>
>>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bucket notification

2023-04-27 Thread Yuval Lifshitz
Hi Istvan,
Looks like you are using user/password and SSL on the communication
channels between RGW and the Kafka broker.
Maybe the issue is around the certificate? could you please increase RGW
debug logs to 20 and see if there are any kafka related errors there?

Yuval

On Tue, Apr 25, 2023 at 5:48 PM Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com> wrote:

> Hi,
>
> I'm trying to set a kafka endpoint for bucket object create operation
> notifications but the notification is not created in kafka endpoint.
> Settings seems to be fine because I can upload to the bucket objects when
> these settings are applied:
>
> <NotificationConfiguration>
>     <TopicConfiguration>
>         <Id>bulknotif</Id>
>         <Topic>arn:aws:sns:default::butcen</Topic>
>         <Event>s3:ObjectCreated:*</Event>
>         <Event>s3:ObjectRemoved:*</Event>
>     </TopicConfiguration>
> </NotificationConfiguration>
>
> but it simply not created any message in kafka.
>
> This is my topic creation post request:
>
> https://xxx.local/?
> Action=CreateTopic&
> Name=butcen&
> kafka-ack-level=broker&
> use-ssl=true&
> push-endpoint=kafka://ceph:pw@xxx.local:9093
>
> Am I missing something or it's definitely kafka issue?
>
> Thank you
>
>
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lua scripting in the rados gateway

2023-04-27 Thread Yuval Lifshitz
Hi Thomas,
Thanks for the detailed info!
RGW lua scripting was never tested in a cephadm deployment :-(
Opened a tracker: https://tracker.ceph.com/issues/59574 to make sure this
would work out of the box.

Yuval


On Tue, Apr 25, 2023 at 10:25 PM Thomas Bennett  wrote:

> Hi ceph users,
>
> I've been trying out the lua scripting for the rados gateway (thanks
> Yuval).
>
> As in my previous email I mentioned that there is an error when trying to
> load the luasocket module. However, I thought it was a good time to report
> on my progress.
>
> My 'hello world' example below is called *test.lua* below includes the
> following checks:
>
>1. Can I write to the debug log?
>2. Can I use the lua socket package to do something stupid but
>intersting, like connect to a webservice?
>
> Before you continue reading this, you might need to know that I run all
> ceph processes in a *CentOS Stream release 8 *container deployed using ceph
> orchestrator running *Ceph v17.2.5*, so please view the information below
> in that context.
>
> For anyone looking for a reference, I suggest going to the ceph lua rados
> gateway documentation at radosgw/lua-scripting
> .
>
> There are two new switches you need to know about in the radosgw-admin:
>
>- *script* -> loading your lua script
>- *script-package* -> loading supporting packages for your script - e.i.
>luasocket in this case.
>
> For a basic setup, you'll need to have a few dependencies in your
> containers:
>
>- cephadm container: requires luarocks (I've checked the code - it runs
>a luarocks search command)
>- radosgw container: requires luarocks, gcc, make,  m4, wget (wget just
>in case).
>
> To achieve the above, I updated the container image for our running system.
> I needed to do this because I needed to redeploy the rados gateway
> container to inject the lua script packages into the radosgw runtime
> process. This will start with a fresh container based on the global config
> *container_image* setting on your running system.
>
> For us this is currently captured in *quay.io/tsolo/ceph:v17.2.5-3
> * and included the following extra
> steps (including installing the lua dev from an rpm because there is no
> centos package in yum):
> yum install luarocks gcc make wget m4
> rpm -i
>
> https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/lua-devel-5.3.4-12.el8.x86_64.rpm
>
> You will notice that I've included a compiler and compiler support into the
> image. This is because luarocks on the radosgw needs to compile luasocket (the
> package I want to install). This will happen at start time when the radosgw
> is restarted from ceph orch.
>
> In the cephadm container I still need to update our cephadm shell so I need
> to install luarocks by hand:
> yum install luarocks
>
> Then set the updated image to use:
> ceph config set global container_image quay.io/tsolo/ceph:v17.2.5-3
>
> I now create a file called: *test.lua* in the cephadm container. This
> contains the following lines to write to the log and then do a get request
> to google. This is not practical in production, but it serves the purpose
> of testing the infrastructure:
>
> RGWDebugLog("Tsolo start lua script")
> local LuaSocket = require("socket")
> client = LuaSocket.connect("google.com", 80)
> client:send("GET / HTTP/1.0\r\nHost: google.com\r\n\r\n")
> while true do
>   s, status, partial = client:receive('*a')
>   RGWDebugLog(s or partial)
>   if status == "closed" then
>     break
>   end
> end
> client:close()
> RGWDebugLog("Tsolo stop lua")
>
> Next I run:
> radosgw-admin script-package add --package=luasocket --allow-compilation
>
> And then list the added package to make sure it is there:
> radosgw-admin script-package list
>
> Note - at this point the radosgw has not been modified, it must first be
> restarted.
>
> Then I put the *test.lua *script into the pre request context:
> radosgw-admin script put --infile=test.lua --context=preRequest
>
> You also need to raise the debug log level on the running rados gateway:
> ceph daemon
> /var/run/ceph/ceph-client.rgw.xxx.xxx-cms1.x.x.xx.asok
> config set debug_rgw 20
>
> Inside the radosgw container I apply my fix (as per previous email):
> cp -ru /tmp/luarocks/client.rgw.xx.xxx--.pcoulb/lib64/*
> /tmp/luarocks/client.rgw.xx.xxx--.pcoulb/lib/
>
> Outside on the host running the radosgw-admin container I follow the
> journalctl for the radosgw container (to get the logs):
> journalctl -fu ceph-----@rgw.
> xxx.xxx-cms1.x.x.xx.service
>
> Then I run an s3cmd to put data in via the rados gateway and check the
> journalctl logs and see:
> Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: Tsolo start lua
> Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: HTTP/1.0 301 Moved
> Permanently
> Apr 25 20:54:47 brp-ceph-cms1 

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Eugen Block
I don't see anything obvious in the pg output, they are relatively  
small and don't hold many objects. If deep-scrubs would impact  
performance that much you would see that in the iostat output as well.  
Have you watched it for a while, maybe with -xmt options to see the  
%util column as well? Does that OSD show a higher utilization than  
other OSDs? Is the cluster evenly balanced (ceph osd df)? And also try  
the primary-affinity = 0 part, this would set most of the primary PGs  
on that OSD to non-primary and others would take over. If the new  
primary OSDs show increased latencies as well there might be something  
else going on.


Zitat von Zakhar Kirpichenko :


Thanks, Eugen. I very much appreciate your time and replies.

It's a hybrid OSD with DB/WAL on NVME (Micron_7300_MTFDHBE1T6TDG) and block
storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x
DB/WAL NVMEs and 9 x HDDs each, each NVME hosts DB/WAL for 4-5 OSDs. The
cluster was installed with Ceph 16.2.0, i.e. not upgraded from a previous
Ceph version. The general host utilization is minimal:

---
  totalusedfree  shared  buff/cache
available
Mem:  394859228   162089492 22783924468   230491344
230135560
Swap:   8388604  410624 7977980
---

The host has 2 x Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz CPUs, 48 cores
and 96 threads total. The load averages are < 1.5 most of the time. Iostat
doesn't show anything dodgy:

---
Device tpskB_read/skB_wrtn/skB_dscd/skB_read
 kB_wrtnkB_dscd
dm-0 40.6114.05  1426.88   274.81  899585953
91343089664 17592304524
dm-1 95.80  1085.29   804.00 0.00 69476048192
51469036032  0
dm-10   261.82 1.64  1046.98 0.00  104964624
67023666460  0
dm-1187.62  1036.39   801.12 0.00 66345393128
51284224352  0
dm-12   265.95 1.65  1063.50 0.00  105717636
68081300084  0
dm-1390.39  1064.38   820.32 0.00 68137422692
52513309008  0
dm-14   260.81 1.65  1042.94 0.00  105460360
66764843944  0
dm-1588.73   976.58   778.68 0.00 62516667260
49847871016  0
dm-16   266.54 1.62  1065.84 0.00  103731332
68230531868  0
dm-17   100.70  1148.40   892.47 0.00 73516251072
57132462352  0
dm-18   279.91 1.77  1119.29 0.00  113498508
71652321256  0
dm-1946.05   158.57   283.19 0.00 10150971936
18128700644  0
dm-2277.15 1.75  1108.26 0.00  112204480
70946082624  0
dm-2049.98   161.48   248.13 0.00 10337605436
15884020104  0
dm-3 69.60   722.59   596.21 0.00 46257546968
38166860968  0
dm-4210.51 1.02   841.90 0.00   65369612
53894908104  0
dm-5 89.88  1000.15   789.46 0.00 64025323664
50537848140  0
dm-6273.40 1.65  1093.31 0.00  105643468
69989257428  0
dm-7 87.50  1019.36   847.10 0.00 65255481416
54228140196  0
dm-8254.77 1.70  1018.76 0.00  109124164
65217134588  0
dm-9 88.66   989.21   766.84 0.00 63325285524
49089975468  0
loop0 0.01 1.54 0.00 0.00   98623259
   0  0
loop1 0.01 1.62 0.00 0.00  103719536
   0  0
loop100.01 1.04 0.00 0.00   66341543
   0  0
loop110.00 0.00 0.00 0.00 36
   0  0
loop2 0.01 1.61 0.00 0.00  102824919
   0  0
loop3 0.01 1.57 0.00 0.00  100808077
   0  0
loop4 0.01 1.56 0.00 0.00  100081689
   0  0
loop5 0.01 1.53 0.00 0.00   97741555
   0  0
loop6 0.01 1.47 0.00 0.00   93867958
   0  0
loop7 0.01 1.16 0.00 0.00   74491285
   0  0
loop8 0.01 1.05 0.00 0.00   67308404
   0  0
loop9 0.01 0.72 0.00 0.00   45939669
   0  0
md0  44.3033.75  1413.88   397.42 2160234553
90511235396 25441160328
nvme0n1 518.1224.41  5339.3573.24 1562435128
341803564504 4688433152
nvme1n1 391.0322.11  4063.5568.36 1415308200

[ceph-users] Memory leak in MGR after upgrading to pacific.

2023-04-27 Thread Gary Molenkamp

Good morning,

After upgrading from Octopus (15.2.17) to Pacific (16.2.12) two days 
ago, I'm noticing that the MGR daemons keep failing over to standby and 
then back every 24hrs.   Watching the output of 'ceph orch ps' I can see 
that the memory consumption of the mgr is steadily growing until it 
becomes unresponsive.


When the mgr becomes unresponsive, tasks such as RESTful calls start to 
fail, and the standby eventually takes over after ~20 minutes. I've 
included a log of memory consumption (in 10 minute intervals) at the end 
of this message. While the cluster recovers during this issue, the loss 
of usage data during the outage, and the fact that it keeps occurring, 
are problematic. Any assistance would be appreciated.


Note, this is a cluster that has been upgraded from an original jewel-based 
ceph using filestore, through bluestore conversion, container conversion, 
and now to Pacific. The data below shows memory use with three mgr modules 
enabled: cephadm, restful, iostat. By disabling iostat, I can reduce the 
rate of memory growth to about 200MB/hr.
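
For reference, a sketch of the commands involved (standard cephadm
tooling assumed):

ceph mgr module ls                      # list enabled mgr modules
ceph mgr module disable iostat          # disable the iostat module
ceph orch ps --daemon-type mgr          # watch MEM USE of the mgr daemons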


Thanks
Gary.


Wed Apr 26 12:50:08 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(3h)  7m ago   3h    1099M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(16h) 7m ago  16h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 13:00:09 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(3h)  6m ago   3h    1161M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(16h) 6m ago  16h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 13:10:10 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(3h)  6m ago   3h    1230M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(16h) 6m ago  16h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 13:20:11 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(4h)  6m ago   4h    1250M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(16h) 6m ago  16h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 13:30:12 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(4h)  5m ago   4h    1318M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(16h) 5m ago  16h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 13:40:13 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(4h)  5m ago   4h    1379M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(17h) 5m ago  17h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 13:50:13 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(4h)  5m ago   4h    1442M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(17h) 5m ago  17h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 14:00:14 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(4h)  4m ago   4h    1498M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(17h) 4m ago  17h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 14:10:15 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(4h)  4m ago   4h    1554M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(17h) 4m ago  17h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 14:20:16 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(5h)  4m ago   5h    1617M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(17h) 3m ago  17h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 14:30:17 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(5h)  3m ago   5h    1677M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(17h) 3m ago  17h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 14:40:18 EDT 2023
mgr.controller04.chhvsy  controller04.mydomain  running 
(5h)  3m ago   5h    1735M    -  16.2.12  921c4e969fff e898a9606028
mgr.storage02.jnvnrm storage02.mydomain *:8003  running 
(18h) 3m ago  18h 383M    -  16.2.12  921c4e969fff 6fed006895e8

Wed Apr 26 14:50:19 EDT 2023

[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Anthony D'Atri

> Indeed! Every Ceph instance I have seen (not many) and almost every HPC 
> storage system I have seen have this problem, and that's because they were 
> never setup to have enough IOPS to support the maintenance load, never mind 
> the maintenance load plus the user load (and as a rule not even the user 
> load).

Yep, this is one of the false economies of spinners.  The SNIA TCO calculator 
includes a performance factor for just this reason.

> There is a simple reason why this happens: when a large Ceph (etc. storage 
> instance is initially setup, it is nearly empty, so it appears to perform 
> well even if it was setup with inexpensive but slow/large HDDs, then it 
> becomes fuller and therefore heavily congested

Data fragments over time with organic growth, and the drive spends a larger 
fraction of time seeking.  I’ve predicted then seen this even on a cluster 
whose hardware had been blessed by a certain professional services company 
(*ahem*).

> but whoever set it up has already changed jobs or been promoted because of 
> their initial success (or they invent excuses).

`mkfs.xfs -n size=65536` will haunt my nightmares until the end of my days.  As 
well as an inadequate LFF HDD architecture I was not permitted to fix, 
*including the mons*.  But I digress.

> A figure-of-merit that matters is IOPS-per-used-TB, and making it large 
> enough to support concurrent maintenance (scrubbing, backfilling, 
> rebalancing, backup) and user workloads. That is *expensive*, so in my 
> experience very few storage instance buyers aim for that.

^^^ This.  Moreover, it’s all too common to try to band-aid this with 
expensive, fussy RoC HBAs with cache RAM and BBU/supercap.  The money spent on 
those, and spent on jumping through their hoops, can easily debulk the HDD-SSD 
CapEx gap.  Plus if your solution doesn’t do the job it needs to do, it is no 
bargain at any price.

This correlates with IOPS/$, a metric in which HDDs are abysmal.

> The CERN IT people discovered long ago that quotes for storage workers always 
> used very slow/large HDDs that performed very poorly if the specs were given 
> as mere capacity, so they switched to requiring a different metric, 18MB/s 
> transfer rate of *interleaved* read and write per TB of capacity, that is at 
> least two parallel access streams per TB.

At least one major SSD manufacturer attends specifically to reads under write 
pressure.

> https://www.sabi.co.uk/blog/13-two.html?131227#131227
> "The issue with disk drives with multi-TB capacities"
> 
> BTW I am not sure that a floor of 18MB/s of interleaved read and write per TB 
> is high enough to support simultaneous maintenance and user loads for most 
> Ceph instances, especially in HPC.
> 
> I have seen HPC storage systems "designed" around 10TB and even 18TB HDDs, 
> and the best that can be said about those HDDs is that they should be 
> considered "tapes" with some random access ability.

Yes!  This harks back to DECtape https://www.vt100.net/timeline/1964.html which 
was literally this; people even used it as a filesystem. Some years ago I had 
Brian Kernighan sign one “Wow I haven’t seen one of these in YEARS!”

— aad

> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bucket empty after resharding on multisite environment

2023-04-27 Thread Boris Behrens
To clarify a bit:
The bucket data is not in the main zonegroup.
I wanted to start the reshard in the zonegroup where the bucket and the
data is located, but rgw told me to do it in the primary zonegroup.

So I did it there and the index on the zonegroup where the bucket is
located is empty.

We only sync metadata between the zonegroups not the actual data (basically
have working credentials in all replicated zones but the buckets only life
in one place)

radosgw-admin bucket stats shows me the correct ID/marker and the amount of
shards in all location.
radosgw-admin reshard status shows 101 entries with "not-resharding"
radosgw-admin reshard stale-instances list --yes-i-really-mean-it does NOT
show the bucket
radosgw-admin bucket radoslist --bucket BUCKET is empty
radosgw-admin bucket bi list --bucket BUCKET is empty
radosgw-admin bucket radoslist --bucket-id BUCKETID list files

Am Do., 27. Apr. 2023 um 13:08 Uhr schrieb Boris Behrens :

> Hi,
> I just resharded a bucket on an octopus multisite environment from 11 to
> 101.
>
> I did it on the master zone and it went through very fast.
> But now the index is empty.
>
> The files are still there when doing a radosgw-admin bucket radoslist
> --bucket-id
> Do I just need to wait or do I need to recover that somehow?
>
>
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bucket empty after resharding on multisite environment

2023-04-27 Thread Boris Behrens
Hi,
I just resharded a bucket on an octopus multisite environment from 11 to
101.

I did it on the master zone and it went through very fast.
But now the index is empty.

The files are still there when doing a radosgw-admin bucket radoslist
--bucket-id
Do I just need to wait or do I need to recover that somehow?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS recovery

2023-04-27 Thread Kotresh Hiremath Ravishankar
Hi,

First of all I would suggest upgrading your cluster on one of the supported
releases.

I think full recovery is recommended to get back the mds.

1. Stop the mdses and all the clients.

2. Fail the fs.

a. ceph fs fail 
3. Backup the journal: (If the below command fails, make rados level copy
using http://tracker.ceph.com/issues/9902). Since the mds is corrupted, we
can skip this too ?

# cephfs-journal-tool journal export backup.bin

4. Cleanup up ancillary data generated during if any previous recovery.

# cephfs-data-scan cleanup []

5. Recover_dentries, reset session, and reset_journal:

# cephfs-journal-tool --rank :0 event recover_dentries list

# cephfs-table-tool :all reset session

# cephfs-journal-tool --rank :0 journal reset

6. Execute scan_extents on each of the x4 tools pods in parallel:

# cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 --filesystem
 

# cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 --filesystem
 

# cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 --filesystem
 

# cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 --filesystem
 

 7. Execute scan_inodes on each of the x4 tools pods in parallel:

# cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 --filesystem
 

# cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 --filesystem
 

# cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 --filesystem
 

# cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 --filesystem
 

 8. scan_links:

# cephfs-data-scan scan_links --filesystem 

9. Mark the filesystem joinable from pod/rook-ceph-tools:

# ceph fs set  joinable true

10. Startup MDSs

11. Scrub online fs

   # ceph tell mds.- scrub start / recursive
repair

12. Check scrub status:

   # ceph tell mds.-{pick-active-mds| a or b} scrub status

For more information please look into
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/

Thanks,

Kotresh H R

On Wed, Apr 26, 2023 at 3:08 AM  wrote:

> Hi All,
>
> We have a CephFS cluster running Octopus with three control nodes each
> running an MDS, Monitor, and Manager on Ubuntu 20.04. The OS drive on one
> of these nodes failed recently and we had to do a fresh install, but made
> the mistake of installing Ubuntu 22.04 where Octopus is not available. We
> tried to force apt to use the Ubuntu 20.04 repo when installing Ceph so
> that it would install Octopus, but for some reason Quincy was still
> installed. We re-integrated this node and it seemed to work fine for about
> a week until our cluster reported damage to an MDS daemon and placed our
> filesystem into a degraded state.
>
> cluster:
> id: 692905c0-f271-4cd8-9e43-1c32ef8abd13
> health: HEALTH_ERR
> mons are allowing insecure global_id reclaim
> 1 filesystem is degraded
> 1 filesystem is offline
> 1 mds daemon damaged
> noout flag(s) set
> 161 scrub errors
> Possible data damage: 24 pgs inconsistent
> 8 pgs not deep-scrubbed in time
> 4 pgs not scrubbed in time
> 6 daemons have recently crashed
>
>   services:
> mon: 3 daemons, quorum database-0,file-server,webhost (age 12d)
> mgr: database-0(active, since 4w), standbys: webhost, file-server
> mds: cephfs:0/1 3 up:standby, 1 damaged
> osd: 91 osds: 90 up (since 32h), 90 in (since 5M)
>  flags noout
>
>   task status:
>
>   data:
> pools:   7 pools, 633 pgs
> objects: 169.18M objects, 640 TiB
> usage:   883 TiB used, 251 TiB / 1.1 PiB avail
> pgs: 605 active+clean
>  23  active+clean+inconsistent
>  4   active+clean+scrubbing+deep
>  1   active+clean+scrubbing+deep+inconsistent
>
> We are not sure if the Quincy/Octopus version mismatch is the problem, but
> we are in the process of downgrading this node now to ensure all nodes are
> running Octopus. Before doing that, we ran the following commands to try
> and recover:
>
> $ cephfs-journal-tool --rank=cephfs:all journal export backup.bin
>
> $ sudo cephfs-journal-tool --rank=cephfs:all event recover_dentries
> summary:
>
> Events by type:
>   OPEN: 29589
>   PURGED: 1
>   SESSION: 16
>   SESSIONS: 4
>   SUBTREEMAP: 127
>   UPDATE: 70438
> Errors: 0
>
> $ cephfs-journal-tool --rank=cephfs:0 journal reset:
>
> old journal was 170234219175~232148677
> new journal start will be 170469097472 (2729620 bytes past old end)
> writing journal head
> writing EResetJournal entry
> done
>
> $ cephfs-table-tool all reset session
>
> All of our MDS daemons are down and fail to restart with the following
> errors:
>
> -3> 2023-04-20T10:25:15.072-0700 7f0465069700 -1 log_channel(cluster) log
> [ERR] : journal replay alloc 0x153af79 not in free
> 

[ceph-users] Re: How to find the bucket name from Radosgw log?

2023-04-27 Thread Boris Behrens
Cheers Dan,

would it be an option to enable the ops log? I still haven't figured out how
it actually works.
But I am also thinking about moving to log parsing in HAProxy and disabling
the access log on the RGW instances.
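
For reference, a rough sketch of enabling the ops log (option names from
memory, and behaviour differs between releases; older releases write it to
RADOS or to a unix socket rather than to a file):

ceph config set client.rgw rgw_enable_ops_log true
ceph config set client.rgw rgw_ops_log_rados true
# or, where supported, a socket that a small consumer can read:
# ceph config set client.rgw rgw_ops_log_socket_path /var/run/ceph/rgw-ops.sock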

Am Mi., 26. Apr. 2023 um 18:21 Uhr schrieb Dan van der Ster <
dan.vanders...@clyso.com>:

> Hi,
>
> Your cluster probably has dns-style buckets enabled.
> ..
> In that case the path does not include the bucket name, and neither
> does the rgw log.
> Do you have a frontend lb like haproxy? You'll find the bucket names there.
>
> -- Dan
>
> __
> Clyso GmbH | https://www.clyso.com
>
>
> On Tue, Apr 25, 2023 at 2:34 PM  wrote:
> >
> > I find a log like this, and I thought the bucket name should be "photos":
> >
> > [2023-04-19 15:48:47.0.5541s] "GET /photos/shares/
> >
> > But I can not find it:
> >
> > radosgw-admin bucket stats --bucket photos
> > failure: 2023-04-19 15:48:53.969 7f69dce49a80  0 could not get bucket
> info for bucket=photos
> > (2002) Unknown error 2002
> >
> > How does this happen? Thanks
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Disks are filling up

2023-04-27 Thread Omar Siam

No this is a cephadm setup. Not rook.

Over the last few days it has still been deep scrubbing and filling up. We 
have to do something about it, as it now impacts our K8s cluster (very slow 
cephfs access) and we are running out of (allocated) disk space again.


Some more details now that I had a few more days to think about our 
particular setup:
* This is a setup with ESXi/vSphere virtualization. The ceph nodes are 
just some VMs. We don't have access to the bare servers or even direct 
access to the HDDs/SSDs ceph runs on.
* The setup is "asymmetric": there are 2 nodes on SSDs and one on HDDs 
(they are all RAIDx with hardware controllers, but we have no say in 
this). I labeled all OSDs as HDDs (even when VMWare reported SSD).
* We looked at the OSDs' device usage and it is 100% (from the VM's point 
of view) for the HDDs (20% on average for the SSD nodes).


My suspicion is:
* deep scrubbing means every new write goes to unallocated space, no 
more overwrite/deleting while deep scrubbing. I didn't find it in the 
docs. Maybe I missed it, maybe that is common wisdom among the initiated.
* We write more new data per second to cephfs than can be scrubbed so 
scrubbing never ends and the PGs fill up.


We now ordered SSDs for the HDD only node to prevent this in the future.
Meanwhile we need to do something so we think about moving the data in 
cephfs to a new PG that does not need deep scrubbing at the moment.
Also we are thinking about moving the OSD from the physical host that only 
has HDDs to one with SSDs, ruining redundancy for a short while and hoping 
for the best.
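
For reference, a sketch of commands to watch scrub progress and to allow a
little more scrub concurrency (defaults assumed; tune with care on a loaded
cluster):

ceph pg dump pgs_brief | grep -i scrub      # PGs currently (deep-)scrubbing
ceph config set osd osd_max_scrubs 2        # default is 1 on older releases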


Am 26.04.2023 um 02:28 schrieb A Asraoui:


Omar, glad to see cephfs with kubernetes up and running.. did you guys 
use rook to deploy this ??


Abdelillah
On Mon, Apr 24, 2023 at 6:56 AM Omar Siam  wrote:

Hi list,

we created a cluster for using cephfs with a kubernetes cluster.
Since a
few weeks now the cluster keeps filling up at an alarming rate
(100 GB per day).
This is while the most relevant pg is deep scrubbing and was
interupted
a few times.

We use about 150G (du using the mounted filesystem) on the cephfs
filesystem and try not to use snapshots (.snap directories "exist"
but
are empty).
We do not understand why the pgs get bigger and bigger while cephfs
stays about the same size (overwrites on files certainly happen).
I suspect some snapshots mechanism. Any ideas how to debug this to
stop it?

Maybe we should try to speed up the deep scrubbing somehow?


Best regards

--
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for 
disabled persons
Bäckerstraße 13, 1010 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.s...@oeaw.ac.at  |www.oeaw.ac.at/acdh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Veeam backups to radosgw seem to be very slow

2023-04-27 Thread Boris Behrens
Thanks Janne, I will hand that to the customer.

> Look at https://community.veeam.com/blogs-and-podcasts-57/sobr-veeam
> -capacity-tier-calculations-and-considerations-in-v11-2548
> for "extra large blocks" to make them 8M at least.
> We had one Veeam installation vomit millions of files onto our rgw-S3
> at an average size of 180k per object, and at those sizes, you will
> see very poor throughput and the many objs/MB will hurt all other
> kinds of performance like listing the bucket and so on.
>

@joachim
What do you mean by "default region"?
I just checked the period and it aligns. I've told them to try to get more
information from it.

"bucket does not exist" or "permission denied".
> Had received similar error messages with another client program. The
> default region did not match the region of the cluster.
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Peter Grandi

> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers already divided by replication factor), you need 55 days
> to scrub it once.
> That's 8x larger than the default scrub factor [...] Also, even
> if I set the default scrub interval to 8x larger, it my disks
> will still be thrashing seeks 100% of the time, affecting the
> cluster's  throughput and latency performance.

Indeed! Every Ceph instance I have seen (not many) and almost every HPC 
storage system I have seen have this problem, and that's because they 
were never setup to have enough IOPS to support the maintenance load, 
never mind the maintenance load plus the user load (and as a rule not 
even the user load).


There is a simple reason why this happens: when a large Ceph (etc. 
storage instance is initially setup, it is nearly empty, so it appears 
to perform well even if it was setup with inexpensive but slow/large 
HDDs, then it becomes fuller and therefore heavily congested but whoever 
set it up has already changed jobs or been promoted because of their 
initial success (or they invent excuses).


A figure-of-merit that matters is IOPS-per-used-TB, and making it large 
enough to support concurrent maintenance (scrubbing, backfilling, 
rebalancing, backup) and user workloads. That is *expensive*, so in my 
experience very few storage instance buyers aim for that.


The CERN IT people discovered long ago that quotes for storage workers 
always used very slow/large HDDs that performed very poorly if the specs 
were given as mere capacity, so they switched to requiring a different 
metric, 18MB/s transfer rate of *interleaved* read and write per TB of 
capacity, that is at least two parallel access streams per TB.


https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18MB/s of interleaved read and write 
per TB is high enough to support simultaneous maintenance and user loads 
for most Ceph instances, especially in HPC.


I have seen HPC storage systems "designed" around 10TB and even 18TB 
HDDs, and the best that can be said about those HDDs is that they should 
be considered "tapes" with some random access ability.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Thanks, Eugen. I very much appreciate your time and replies.

It's a hybrid OSD with DB/WAL on NVME (Micron_7300_MTFDHBE1T6TDG) and block
storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x
DB/WAL NVMEs and 9 x HDDs each, each NVME hosts DB/WAL for 4-5 OSDs. The
cluster was installed with Ceph 16.2.0, i.e. not upgraded from a previous
Ceph version. The general host utilization is minimal:

---
  totalusedfree  shared  buff/cache
available
Mem:  394859228   162089492 22783924468   230491344
230135560
Swap:   8388604  410624 7977980
---

The host has 2 x Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz CPUs, 48 cores
and 96 threads total. The load averages are < 1.5 most of the time. Iostat
doesn't show anything dodgy:

---
Device tpskB_read/skB_wrtn/skB_dscd/skB_read
 kB_wrtnkB_dscd
dm-0 40.6114.05  1426.88   274.81  899585953
91343089664 17592304524
dm-1 95.80  1085.29   804.00 0.00 69476048192
51469036032  0
dm-10   261.82 1.64  1046.98 0.00  104964624
67023666460  0
dm-1187.62  1036.39   801.12 0.00 66345393128
51284224352  0
dm-12   265.95 1.65  1063.50 0.00  105717636
68081300084  0
dm-1390.39  1064.38   820.32 0.00 68137422692
52513309008  0
dm-14   260.81 1.65  1042.94 0.00  105460360
66764843944  0
dm-1588.73   976.58   778.68 0.00 62516667260
49847871016  0
dm-16   266.54 1.62  1065.84 0.00  103731332
68230531868  0
dm-17   100.70  1148.40   892.47 0.00 73516251072
57132462352  0
dm-18   279.91 1.77  1119.29 0.00  113498508
71652321256  0
dm-1946.05   158.57   283.19 0.00 10150971936
18128700644  0
dm-2277.15 1.75  1108.26 0.00  112204480
70946082624  0
dm-2049.98   161.48   248.13 0.00 10337605436
15884020104  0
dm-3 69.60   722.59   596.21 0.00 46257546968
38166860968  0
dm-4210.51 1.02   841.90 0.00   65369612
53894908104  0
dm-5 89.88  1000.15   789.46 0.00 64025323664
50537848140  0
dm-6273.40 1.65  1093.31 0.00  105643468
69989257428  0
dm-7 87.50  1019.36   847.10 0.00 65255481416
54228140196  0
dm-8254.77 1.70  1018.76 0.00  109124164
65217134588  0
dm-9 88.66   989.21   766.84 0.00 63325285524
49089975468  0
loop0 0.01 1.54 0.00 0.00   98623259
   0  0
loop1 0.01 1.62 0.00 0.00  103719536
   0  0
loop100.01 1.04 0.00 0.00   66341543
   0  0
loop110.00 0.00 0.00 0.00 36
   0  0
loop2 0.01 1.61 0.00 0.00  102824919
   0  0
loop3 0.01 1.57 0.00 0.00  100808077
   0  0
loop4 0.01 1.56 0.00 0.00  100081689
   0  0
loop5 0.01 1.53 0.00 0.00   97741555
   0  0
loop6 0.01 1.47 0.00 0.00   93867958
   0  0
loop7 0.01 1.16 0.00 0.00   74491285
   0  0
loop8 0.01 1.05 0.00 0.00   67308404
   0  0
loop9 0.01 0.72 0.00 0.00   45939669
   0  0
md0  44.3033.75  1413.88   397.42 2160234553
90511235396 25441160328
nvme0n1 518.1224.41  5339.3573.24 1562435128
341803564504 4688433152
nvme1n1 391.0322.11  4063.5568.36 1415308200
260132142151 4375871488
nvme2n1  33.99   175.52   288.87   195.30 11236255296
18492074823 12502441984
nvme3n1  36.74   177.43   253.04   195.30 11358616904
16198706451 12502441984
nvme4n1  36.34   130.81  1417.08   275.71 8374240889
90715974981 17649735268
nvme5n1  35.97   101.47  1417.08   274.81 6495703006
90715974997 17592304524
sda  76.43  1102.34   810.08 0.00 70567310268
51858036484  0
sdb  55.74   741.38   606.07 0.00 47460332504
38798003512  0
sdc  70.79  1017.90   795.50 0.00 65161638916

[ceph-users] Re: cephfs - max snapshot limit?

2023-04-27 Thread Tobias Hachmer

Hi sur5r,

Am 4/27/23 um 10:33 schrieb Jakob Haufe:
> On Thu, 27 Apr 2023 09:07:10 +0200
> Tobias Hachmer  wrote:
>
>> But we observed that max 50 snapshot are preserved. If a new snapshot is
>> created the oldest 51st is deleted.
>>
>> Is there a limit for maximum cephfs snapshots or maybe this is a bug?
>
> I've been wondering the same thing for about 6 months now and found the
> reason just yesterday.
>
> The snap-schedule mgr module has a hard limit on how many snapshots it
> preserves, see [1]. It's even documented at [2] in section
> "Limitations" near the end of the page.
>
> The commit[3] implementing this does not only not explain the reason
> for the number at all, it doesn't even mention the fact it implements
> this.

Thanks. I've read the documentation, but it's not clear enough. I thought 
"the retention list will be shortened to the newest 50 snapshots" would 
just truncate the list and not actually delete the snapshots.


So as you stated the max. number of snapshots is currently a hard limit.

Can anyone clarify the reasons for this? If there's a good reason for this 
hard limit, it would be great to be able to schedule snapshots with finer 
granularity, e.g. Mon-Fri every two hours between 8am and 6pm.


> Given the limitation is per directory, I'm currently trying this:
>
> / 1d 30d
> /foo 1h 48h
> /bar 1h 48h
>
> I forgot to activate the new schedules yesterday so I can't say whether
> it works as expected yet.

Please let me know if this works.

Thanks,
Tobias


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Marc


> >
> > > The question you should ask yourself, why you want to
> > change/investigate this?
> >
> > Because if scrubbing takes 10x longer thrashing seeks, my scrubs never
> > finish in time (the default is 1 week).
> > I end with e.g.
> >
> > > 267 pgs not deep-scrubbed in time
> >
> > On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers
> > already divided by replication factor), you need 55 days to scrub it
> > once.
> >
> > That's 8x larger than the default scrub factor, so I'll get warnings
> and
> > my risk of data degradation increases.
> >
> > Also, even if I set the default scrub interval to 8x larger, it my
> disks
> > will still be thrashing seeks 100% of the time, affecting the
> cluster's
> > throughput and latency performance.
> >
> 
> Oh I get it. Interesting. I think if you will expand the cluster in the
> future with more disks you will spread the load have more iops, this
> will disappear.
> I am not sure if you will be able to fix this other than to increase the
> scrub interval. If you are sure it is nothing related to hardware.
> 
> For you reference I have included how my disk io / performance looks
> like when I issue a deep-scrub. You can see it reads 2 disks here at
> ~70MB/s and the atop shows it is at 100% load. Nothing more you can do
> here.
> 
> #ceph osd pool ls detail
> #ceph pg ls | grep '^53'
> #ceph osd tree
> #ceph pg deep-scrub 53.38
> #dstat -D sdd,sde
> 
> 
> [@~]# dstat -d -D sdd,sde,sdj
> --dsk/sdd-dsk/sde-dsk/sdj--
>  read  writ: read  writ: read  writ
> 2493k  177k:5086k  316k:5352k  422k
>   70M0 :  89M0 :   068k
>   78M0 :  59M0 :   0 0
>   68M0 :  68M0 :   028k
>   90M 4096B:  90M   80k:4096B   24k
>   76M0 :  78M0 :   012k
>   66M0 :  64M0 :   012k
>   70M0 :  80M0 :4096B   52k
>   77M0 :  70M0 :   0 0
> 
> atop:
> |
> DSK |  sdd  | busy 97%  | read1462  | write  4  |
> KiB/r469  | KiB/w  5  | MBr/s   67.0  | MBw/s0.0  | avq
> 1.01  | avio 6.59 ms  |
> DSK |  sde  | busy 64%  | read1472  | write  4  |
> KiB/r465  | KiB/w  6  | MBr/s   67.0  | MBw/s0.0  | avq
> 1.01  | avio 4.32 ms  |
> DSK |  sdb  | busy  1%  | read   0  | write 82  |
> KiB/r  0  | KiB/w  9  | MBr/s0.0  | MBw/s0.1  | avq
> 1.30  | avio 1.29 ms  |
> 
> 

I did this on a pool with larger archived objects; when doing this on a 
filesystem with repo copies (rpm files), the performance already drops.

DSK |  sdh  | busy 86%  | read1875  | write 26  | KiB/r
254  | KiB/w  4  | MBr/s   46.7  | MBw/s0.0  | avq 1.59  | avio 
4.50 ms
DSK |  sdd  | busy 79%  | read1598  | write 63  | KiB/r
245  | KiB/w 16  | MBr/s   38.4  | MBw/s0.1  | avq 1.89  | avio 
4.77 ms
DSK |  sdf  | busy 33%  | read1383  | write139  | KiB/r
357  | KiB/w  7  | MBr/s   48.3  | MBw/s0.1  | avq 1.14  | avio 
2.20 ms

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs - max snapshot limit?

2023-04-27 Thread Jakob Haufe
On Thu, 27 Apr 2023 09:07:10 +0200
Tobias Hachmer  wrote:

> But we observed that max 50 snapshot are preserved. If a new snapshot is 
> created the oldest 51st is deleted.
> 
> Is there a limit for maximum cephfs snapshots or maybe this is a bug?

I've been wondering the same thing for about 6 months now and found the
reason just yesterday.

The snap-schedule mgr module has a hard limit on how many snapshots it
preserves, see [1]. It's even documented at [2] in section
"Limitations" near the end of the page.

The commit[3] implementing this not only fails to explain the reason
for the number, it doesn't even mention the fact that it implements
this limit.

Given the limitation is per directory, I'm currently trying this:

/ 1d 30d
/foo 1h 48h
/bar 1h 48h

I forgot to activate the new schedules yesterday so I can't say whether
it works as expected yet.
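
For what it's worth, a sketch of how such a layout is set up with the
snap-schedule module (paths and specs as above, default filesystem
assumed):

ceph fs snap-schedule add / 1d
ceph fs snap-schedule retention add / 30d
ceph fs snap-schedule add /foo 1h
ceph fs snap-schedule retention add /foo 48h
ceph fs snap-schedule activate /
ceph fs snap-schedule status /foo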

Cheers,
sur5r

[1] 
https://github.com/ceph/ceph/blob/3d7761bd59b8e5ebac1d9a136d020f0f8d2eaf32/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L21
[2] https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
[3] https://github.com/ceph/ceph/commit/a48efa43dbe4c623ae88b84ef538ee306fc1eee8

-- 
ceterum censeo microsoftem esse delendam.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Marc


> 
> > The question you should ask yourself, why you want to
> change/investigate this?
> 
> Because if scrubbing takes 10x longer thrashing seeks, my scrubs never
> finish in time (the default is 1 week).
> I end with e.g.
> 
> > 267 pgs not deep-scrubbed in time
> 
> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only numbers
> already divided by replication factor), you need 55 days to scrub it
> once.
> 
> That's 8x longer than the default scrub interval, so I'll get warnings
> and my risk of data degradation increases.
> 
> Also, even if I set the scrub interval 8x larger, my disks will still
> be thrashing seeks 100% of the time, affecting the cluster's
> throughput and latency.
> 

Oh, I get it. Interesting. I think if you expand the cluster with more disks
in the future, you will spread the load and have more iops, and this problem
will disappear. I am not sure you can fix it other than by increasing the
scrub interval, assuming you are sure it is nothing related to the hardware.
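
If you go the interval route, something along these lines should do it
(values are only examples, and option names can differ a bit between
releases, so verify them with `ceph config help` first):

  # allow 4 weeks between deep scrubs instead of the default 1 week
  ceph config set osd osd_deep_scrub_interval 2419200
  # optionally let more scrubs run in parallel per OSD (costs client latency)
  ceph config set osd osd_max_scrubs 2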

For your reference I have included what my disk io / performance looks like
when I issue a deep-scrub. You can see it reads 2 disks here at ~70MB/s and
atop shows it at ~100% load. Nothing more you can do here.

#ceph osd pool ls detail
#ceph pg ls | grep '^53'
#ceph osd tree
#ceph pg deep-scrub 53.38
#dstat -D sdd,sde


[@~]# dstat -d -D sdd,sde,sdj
---dsk/sdd--- ---dsk/sde--- ---dsk/sdj---
 read  writ :  read  writ :  read  writ
2493k  177k : 5086k  316k : 5352k  422k
  70M     0 :   89M     0 :     0   68k
  78M     0 :   59M     0 :     0     0
  68M     0 :   68M     0 :     0   28k
  90M 4096B :   90M   80k : 4096B   24k
  76M     0 :   78M     0 :     0   12k
  66M     0 :   64M     0 :     0   12k
  70M     0 :   80M     0 : 4096B   52k
  77M     0 :   70M     0 :     0     0

atop:
DSK | sdd | busy 97% | read 1462 | write  4 | KiB/r 469 | KiB/w 5 | MBr/s 67.0 | MBw/s 0.0 | avq 1.01 | avio 6.59 ms
DSK | sde | busy 64% | read 1472 | write  4 | KiB/r 465 | KiB/w 6 | MBr/s 67.0 | MBw/s 0.0 | avq 1.01 | avio 4.32 ms
DSK | sdb | busy  1% | read    0 | write 82 | KiB/r   0 | KiB/w 9 | MBr/s  0.0 | MBw/s 0.1 | avq 1.30 | avio 1.29 ms


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Eugen Block
Those numbers look really high to me; more than 2 seconds for a write
is awful. Is this an HDD-only cluster/pool? But even then it would be
too high. I just compared with our HDD-backed cluster (although
rocksDB is SSD-backed) which also mainly serves RBD to openstack. What
is the general utilization of that host? Is it an upgraded cluster
which could suffer from the performance degradation that was
discussed in a recent thread? But I'd expect more OSDs to be
affected by that. How many PGs and objects are on that OSD (ceph pg
ls-by-osd )? Have you tried to restart and/or compact the OSD and
see if anything improves?
You could set its primary-affinity to 0, or in the worst case rebuild
that OSD. And are there any smart errors or anything reported in dmesg
about this disk?
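
For reference, the commands I mean are roughly these (using osd.5 as a
made-up example id, adjust to the affected OSD and its backing device):

  ceph pg ls-by-osd 5                # PGs and object counts on that OSD
  ceph tell osd.5 compact            # online RocksDB compaction
  ceph osd primary-affinity 5 0      # stop it acting as primary for its PGs
  smartctl -a /dev/sdX               # on the host, check the backing disk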


Zitat von Zakhar Kirpichenko :


Thanks, Eugen!

It's a bunch of entries like this https://pastebin.com/TGPu6PAT - I'm not
really sure what to make of them. I checked adjacent OSDs and they have
similar ops, but aren't showing excessive latency.

/Z

On Thu, 27 Apr 2023 at 10:42, Eugen Block  wrote:


Hi,

I would monitor the historic_ops_by_duration for a while and see if
any specific operation takes unusually long.

# this is within the container
[ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration
| head
{
 "size": 20,
 "duration": 600,
 "ops": [
 {
 "description": "osd_repop(client.9384193.0:2056545 12.6
e2233/2221 12:6192870f:::obj_delete_at_hint.53:head v
2233'696390, mlcod=2233'696388)",
 "initiated_at": "2023-04-27T07:37:35.046036+",
 "age": 54.80501619997,
 "duration": 0.5819846869995,
...

The output contains the PG (so you know which pool is involved) and
the duration of the operation, not sure if that helps though.

Zitat von Zakhar Kirpichenko :

> As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
> many, and they're somewhat difficult to interpret:
>
> "description": "osd_op(client.250533532.0:56821 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299120+",
> "description": "osd_op(client.250533532.0:56822 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299132+",
> "description": "osd_op(client.250533532.0:56823 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299138+",
> "description": "osd_op(client.250533532.0:56824 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299148+",
> "description": "osd_op(client.250533532.0:56825 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299188+",
> "description": "osd_op(client.250533532.0:56826 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299198+",
> "description": "osd_op(client.250533532.0:56827 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299207+",
> "description": "osd_op(client.250533532.0:56828 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 398~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299250+",
> "description": "osd_op(client.250533532.0:56829 13.16f
> 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> "initiated_at": "2023-04-26T07:00:58.299270+",
>
> There's a lot more information there ofc. I also tried to
> `dump_ops_in_flight` and there aren't many, usually 0-10 ops at a time,
but
> the OSD latency remains high even when the ops count is low or zero. Any
> ideas?
>
> > I would very much appreciate it if someone could please point me to the
> > documentation on interpreting the output of ops dump.
>
> /Z
>
>
> On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko 
wrote:
>
>> Hi,
>>
>> I have a Ceph 16.2.12 

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Zakhar Kirpichenko
Thanks, Eugen!

It's a bunch of entries like this https://pastebin.com/TGPu6PAT - I'm not
really sure what to make of them. I checked adjacent OSDs and they have
similar ops, but aren't showing excessive latency.

/Z

On Thu, 27 Apr 2023 at 10:42, Eugen Block  wrote:

> Hi,
>
> I would monitor the historic_ops_by_duration for a while and see if
> any specific operation takes unusually long.
>
> # this is within the container
> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration
> | head
> {
>  "size": 20,
>  "duration": 600,
>  "ops": [
>  {
>  "description": "osd_repop(client.9384193.0:2056545 12.6
> e2233/2221 12:6192870f:::obj_delete_at_hint.53:head v
> 2233'696390, mlcod=2233'696388)",
>  "initiated_at": "2023-04-27T07:37:35.046036+",
>  "age": 54.80501619997,
>  "duration": 0.5819846869995,
> ...
>
> The output contains the PG (so you know which pool is involved) and
> the duration of the operation, not sure if that helps though.
>
> Zitat von Zakhar Kirpichenko :
>
> > As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
> > many, and they're somewhat difficult to interpret:
> >
> > "description": "osd_op(client.250533532.0:56821 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299120+",
> > "description": "osd_op(client.250533532.0:56822 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299132+",
> > "description": "osd_op(client.250533532.0:56823 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299138+",
> > "description": "osd_op(client.250533532.0:56824 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299148+",
> > "description": "osd_op(client.250533532.0:56825 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299188+",
> > "description": "osd_op(client.250533532.0:56826 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299198+",
> > "description": "osd_op(client.250533532.0:56827 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299207+",
> > "description": "osd_op(client.250533532.0:56828 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 398~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299250+",
> > "description": "osd_op(client.250533532.0:56829 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
> > 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299270+",
> >
> > There's a lot more information there ofc. I also tried to
> > `dump_ops_in_flight` and there aren't many, usually 0-10 ops at a time,
> but
> > the OSD latency remains high even when the ops count is low or zero. Any
> > ideas?
> >
> > I would very much appreciate it if someone could please point me to the
> > documentation on interpreting the output of ops dump.
> >
> > /Z
> >
> >
> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko 
> wrote:
> >
> >> Hi,
> >>
> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive
> >> make/model, etc. A particular OSD is showing higher latency than usual
> in
> >> `ceph osd perf`, usually mid to high tens of milliseconds while other
> OSDs
> >> show low single digits, although its drive's I/O stats don't look
> different
> >> from those of other drives. The workload is mainly random 4K reads and
> >> writes, the cluster is being used as Openstack VM storage.
> >>
> >> Is there a way to trace, which particular PG, pool and disk image or
> >> object cause this OSD's excessive latency? Is there a way to tell Ceph
> to
> >>
> >> I would appreciate any advice or pointers.
> >>
> >> Best regards,
> >> Zakhar
> >>
> 

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Eugen Block

Hi,

I would monitor the historic_ops_by_duration for a while and see if  
any specific operation takes unusually long.


# this is within the container
[ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration
| head
{
"size": 20,
"duration": 600,
"ops": [
{
"description": "osd_repop(client.9384193.0:2056545 12.6  
e2233/2221 12:6192870f:::obj_delete_at_hint.53:head v  
2233'696390, mlcod=2233'696388)",

"initiated_at": "2023-04-27T07:37:35.046036+",
"age": 54.80501619997,
"duration": 0.5819846869995,
...

The output contains the PG (so you know which pool is involved) and  
the duration of the operation, not sure if that helps though.
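
If you want to map such an entry back to something concrete: the leading
"13.16f" in the ops quoted below is the PG (pool id 13), and the
"rbd_data." prefix identifies the RBD image. Something like this should
work (untested sketch, replace <pool> with the pool name you find):

  ceph osd pool ls detail | grep "pool 13 "   # which pool has id 13
  ceph pg 13.16f query | less                 # per-PG details
  # find the image whose block_name_prefix matches the rbd_data prefix
  for img in $(rbd ls <pool>); do
    rbd info <pool>/"$img" | grep -q eed629ecc1f946 && echo "$img"
  done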


Zitat von Zakhar Kirpichenko :


As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
many, and they're somewhat difficult to interpret:

"description": "osd_op(client.250533532.0:56821 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299120+",
"description": "osd_op(client.250533532.0:56822 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299132+",
"description": "osd_op(client.250533532.0:56823 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299138+",
"description": "osd_op(client.250533532.0:56824 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299148+",
"description": "osd_op(client.250533532.0:56825 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299188+",
"description": "osd_op(client.250533532.0:56826 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299198+",
"description": "osd_op(client.250533532.0:56827 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299207+",
"description": "osd_op(client.250533532.0:56828 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
398~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299250+",
"description": "osd_op(client.250533532.0:56829 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
"initiated_at": "2023-04-26T07:00:58.299270+",

There's a lot more information there ofc. I also tried to
`dump_ops_in_flight` and there aren't many, usually 0-10 ops at a time, but
the OSD latency remains high even when the ops count is low or zero. Any
ideas?

I would very much appreciate it if someone could please point me to the
documentation on interpreting the output of ops dump.

/Z


On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko  wrote:


Hi,

I have a Ceph 16.2.12 cluster with uniform hardware, same drive
make/model, etc. A particular OSD is showing higher latency than usual in
`ceph osd perf`, usually mid to high tens of milliseconds while other OSDs
show low single digits, although its drive's I/O stats don't look different
from those of other drives. The workload is mainly random 4K reads and
writes, the cluster is being used as Openstack VM storage.

Is there a way to trace which particular PG, pool, and disk image or
object causes this OSD's excessive latency? Is there a way to tell Ceph to

I would appreciate any advice or pointers.

Best regards,
Zakhar


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs - max snapshot limit?

2023-04-27 Thread Tobias Hachmer

Hello,

we are running a 3-node ceph cluster with version 17.2.6.

For CephFS snapshots we have configured the following snap schedule with 
retention:


/PATH 2h 72h15d6m

But we observed that at most 50 snapshots are preserved. If a new
snapshot is created, the oldest one is deleted so that only 50 remain.


Is there a limit on the maximum number of CephFS snapshots, or is this maybe a bug?

I have found the setting "mds_max_snaps_per_dir", which is 100 by default, 
but I think this is not related to my problem?
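
In case it helps, checking or raising that setting would look roughly like
this (a sketch only, we have not actually changed it):

  ceph config get mds mds_max_snaps_per_dir
  ceph config set mds mds_max_snaps_per_dir 150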


Thanks,

Tobias


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Janne Johansson
On Wed, 26 Apr 2023 at 21:20, Niklas Hambüchen wrote:
> > 100MB/s is sequential, your scrubbing is random. afaik everything is random.
>
> Are there any docs that explain this, any code, or another definitive
> answer?
> Also, wouldn't it make sense for scrubbing to read the disk linearly,
> at least to some significant extent?

Scrubs only read the data that actually exists in ceph, object by
object, not every sector of the drive whether written or not. This is
why many small objects make the MB/s look "low": scrubbing reads the
objects individually instead of doing dumb sequential cylinder reads.

It is not the same as hw raid boxes doing "patrol reads", or whatever
they call it, where they have no idea what they are reading and just
check that the drives don't report errors.

This is more "pretend you are a ceph client with low priority reading
all the data from this PG from start to end".
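
A quick way to sanity-check the "many small objects" effect is to compare
stored bytes with object counts per pool, for example (column names vary a
little between releases):

  ceph df detail     # STORED bytes vs OBJECTS per pool
  rados df           # similar per-pool object/byte counts

If stored/objects comes out to a few hundred KB per object, a few tens of
MB/s of scrub reads per HDD is roughly what random-ish reads will give you.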

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io