[ceph-users] Re: Recoveries without any misplaced objects?

2024-04-24 Thread David Orman
> On Apr 24, 2024, at 15:37, David Orman wrote: >> >> Did you ever figure out what was happening here? >> >> David >> >> On Mon, May 29, 2023, at 07:16, Hector Martin wrote: >>> On 29/05/2023 20.55, Anthony D'Atri wrote: >>>> Check the up

[ceph-users] Re: Recoveries without any misplaced objects?

2024-04-24 Thread David Orman
Did you ever figure out what was happening here? David On Mon, May 29, 2023, at 07:16, Hector Martin wrote: > On 29/05/2023 20.55, Anthony D'Atri wrote: >> Check the uptime for the OSDs in question > > I restarted all my OSDs within the past 10 days or so. Maybe OSD > restarts are somehow

[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread David Orman
I would suggest that you might consider EC vs. replication for index data, and the latency implications. There's more than just the nvme vs. rotational discussion to entertain, especially if using the more widely spread EC modes like 8+3. It would be worth testing for your particular workload.

[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-07 Thread David Orman
That tracker's last update indicates it's slated for inclusion. On Thu, Feb 1, 2024, at 10:47, Zakhar Kirpichenko wrote: > Hi, > > Please consider not leaving this behind: > https://github.com/ceph/ceph/pull/55109 > > It's a serious bug, which potentially affects a whole node stability if >

[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-02-05 Thread David Orman
Hi, Just looking back through PyO3 issues, it would appear this functionality was never supported: https://github.com/PyO3/pyo3/issues/3451 https://github.com/PyO3/pyo3/issues/576 It just appears attempting to use this functionality (which does not work/exist) wasn't successfully prevented

[ceph-users] Re: RFI: Prometheus, Etc, Services - Optimum Number To Run

2024-01-21 Thread David Orman
The "right" way to do this is to not run your metrics system on the cluster you want to monitor. Use the provided metrics via the exporter and ingest them using your own system (ours is Mimir/Loki/Grafana + related alerting), so if you have failures of nodes/etc you still have access to, at a

[ceph-users] CLT Meeting Minutes 2024-01-03

2024-01-03 Thread David Orman
Happy 2024! Today's CLT meeting covered the following: 1. 2024 brings a focus on performance of Crimson (some information here: https://docs.ceph.com/en/reef/dev/crimson/crimson/ ) 1. Status is available here: https://github.com/ceph/ceph.io/pull/635 2. There will be a new Crimson

[ceph-users] Re: RadosGW public HA traffic - best practices?

2023-11-17 Thread David Orman
will likely happen, so the impact won't be zero, but it also won't be catastrophic. David On Fri, Nov 17, 2023, at 10:09, David Orman wrote: > Use BGP/ECMP with something like exabgp on the haproxy servers. > > David > > On Fri, Nov 17, 2023, at 04:09, Boris Behrens wrote: >>

[ceph-users] Re: RadosGW public HA traffic - best practices?

2023-11-17 Thread David Orman
Use BGP/ECMP with something like exabgp on the haproxy servers. David On Fri, Nov 17, 2023, at 04:09, Boris Behrens wrote: > Hi, > I am looking for some experience on how people make their RGW public. > > Currently we use the follow: > 3 IP addresses that get distributed via keepalived between
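For reference, a minimal exabgp configuration sketch for announcing a shared RGW service address from each haproxy node so the upstream router can ECMP across them; the addresses and AS numbers are placeholders and the exact syntax should be checked against the exabgp version in use:
neighbor 192.0.2.1 {                    # upstream router (placeholder)
    router-id 192.0.2.10;
    local-address 192.0.2.10;           # this haproxy node (placeholder)
    local-as 65010;
    peer-as 65000;
    static {
        route 198.51.100.80/32 next-hop self;   # shared RGW service address
    }
}
Each node announces the same /32; withdrawing the route (for example from a health-check-driven exabgp process) typically removes a failed node from the forwarding path.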

[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-08 Thread David Orman
I would suggest updating: https://tracker.ceph.com/issues/59580 We did notice it with 16.2.13, as well, after upgrading from .10, so likely in-between those two releases. David On Fri, Sep 8, 2023, at 04:00, Loïc Tortay wrote: > On 07/09/2023 21:33, Mark Nelson wrote: >> Hi Rok, >> >> We're

[ceph-users] Re: MGR Memory Leak in Restful

2023-09-08 Thread David Orman
Hi, I do not believe this is actively being worked on, but there is a tracker open, if you can submit an update it may help attract attention/develop a fix: https://tracker.ceph.com/issues/59580 David On Fri, Sep 8, 2023, at 03:29, Chris Palmer wrote: > I first posted this on 17 April but did

[ceph-users] Re: OSDs spam log with scrub starts

2023-08-31 Thread David Orman
https://github.com/ceph/ceph/pull/48070 may be relevant. I think this may have gone out in 16.2.11. I would tend to agree, personally this feels quite noisy at default logging levels for production clusters. David On Thu, Aug 31, 2023, at 11:17, Zakhar Kirpichenko wrote: > This is happening to

[ceph-users] Re: Another Pacific point release?

2023-07-17 Thread David Orman
I'm hoping to see at least one more, if not more than that, but I have no crystal ball. I definitely support this idea, and strongly suggest it's given some thought. There have been a lot of delays/missed releases due to all of the lab issues, and it's significantly impacted the release cadence

[ceph-users] Re: Slow recovery on Quincy

2023-05-22 Thread David Orman
Someone who's got data regarding this should file a bug report, it sounds like a quick fix for defaults if this holds true. On Sat, May 20, 2023, at 00:59, Hector Martin wrote: > On 17/05/2023 03.07, 胡 玮文 wrote: >> Hi Sake, >> >> We are experiencing the same. I set

[ceph-users] Re: ceph pg stuck - missing on 1 osd how to proceed

2023-04-18 Thread David Orman
You may want to consider disabling deep scrubs and scrubs while attempting to complete a backfill operation. On Tue, Apr 18, 2023, at 01:46, Eugen Block wrote: > I didn't mean you should split your PGs now, that won't help because > there is already backfilling going on. I would revert the

[ceph-users] Re: Issue upgrading 17.2.0 to 17.2.5

2023-03-06 Thread David Orman
I've seen what appears to be the same post on Reddit, previously, and attempted to assist. My suspicion is a "stop" command was passed to ceph orch upgrade in an attempt to stop it, but with the --image flag preceding it, setting the image to stop. I asked the user to do an actual upgrade stop,

[ceph-users] Re: Undo "radosgw-admin bi purge"

2023-02-22 Thread David Orman
If it's a test cluster, you could try: root@ceph01:/# radosgw-admin bucket check -h |grep -A1 check-objects --check-objects bucket check: rebuilds bucket index according to actual objects state On Wed, Feb 22, 2023, at 02:22, Robert Sander wrote: > On
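For reference, a sketch of the full invocation this points at (bucket name is a placeholder; as noted, try it on a test cluster first):
# report index inconsistencies
radosgw-admin bucket check --bucket=mybucket --check-objects
# rebuild the bucket index from the actual object state
radosgw-admin bucket check --bucket=mybucket --check-objects --fix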

[ceph-users] Re: Replacing OSD with containerized deployment

2023-01-31 Thread David Orman
06  /dev/sdm  hdd   TOSHIBA_X_X 16.0T 21m ago *locked* >>> >>> >>> It shows locked and is not automatically added now, which is good i >>> think? otherwise it would probably be a new osd 307. >>> >>> >>> root@ceph-a2-01:/# ceph orch osd rm

[ceph-users] Re: Replacing OSD with containerized deployment

2023-01-30 Thread David Orman
    pgs: 3236 active+clean > > > This is the new disk shown as locked (because unzapped at the moment). > > # ceph orch device ls > > ceph-a1-06  /dev/sdm  hdd   TOSHIBA_X_X 16.0T 9m ago > locked > > > Best > > Ken > > > On

[ceph-users] Re: Replacing OSD with containerized deployment

2023-01-29 Thread David Orman
What does "ceph orch osd rm status" show before you try the zap? Is your cluster still backfilling to the other OSDs for the PGs that were on the failed disk? David On Fri, Jan 27, 2023, at 03:25, mailing-lists wrote: > Dear Ceph-Users, > > i am struggling to replace a disk. My ceph-cluster is

[ceph-users] Re: Current min_alloc_size of OSD?

2023-01-13 Thread David Orman
I think this would be valuable to have easily accessible during runtime, perhaps submit a report (and patch if possible)? David On Fri, Jan 13, 2023, at 08:14, Robert Sander wrote: > Hi, > > Am 13.01.23 um 14:35 schrieb Konstantin Shalygin: > > > ceph-kvstore-tool bluestore-kv

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
there must be numerous Ceph sites with hundreds of OSD nodes, > so I'm a bit surprised this isn't more automated... > > Cheers, > > Erik > > -- > Erik Lindahl > On 10 Jan 2023 at 00:09 +0100, Anthony D'Atri , wrote: > > > > > > > On Jan 9, 2023, a

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
h replacing a few outlier drives to sleep better. > > Cheers, > > Erik > > -- > Erik Lindahl > On 9 Jan 2023 at 23:06 +0100, David Orman , wrote: > > "dmesg" on all the linux hosts and look for signs of failing drives. Look > > at smart data

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
"dmesg" on all the linux hosts and look for signs of failing drives. Look at smart data, your HBAs/disk controllers, OOB management logs, and so forth. If you're seeing scrub errors, it's probably a bad disk backing an OSD or OSDs. Is there a common OSD in the PGs you've run the repairs on? On

[ceph-users] Ceph Leadership Team Meeting - 2022/01/04

2023-01-04 Thread David Orman
Today's CLT meeting had the following topics of discussion: * Docs questions * crushtool options could use additional documentation * This is being addressed * sticky header on documentation pages obscuring titles when anchor links are used * There will be a follow-up email

[ceph-users] CLT meeting summary 2022-09-21

2022-09-22 Thread David Orman
This was a short meeting, and in summary: * Testing of upgrades for 17.2.4 in Gibba commenced and slowness during upgrade has been investigated. * Workaround available; not a release blocker ___ ceph-users mailing list -- ceph-users@ceph.io To

[ceph-users] Re: Wide variation in osd_mclock_max_capacity_iops_hdd

2022-09-06 Thread David Orman
Yes. Rotational drives can generally do 100-200 IOPS (some outliers, of course). Do you have all forms of caching disabled on your storage controllers/disks? On Tue, Sep 6, 2022 at 11:32 AM Vladimir Brik < vladimir.b...@icecube.wisc.edu> wrote: > Setting osd_mclock_force_run_benchmark_on_init to
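A sketch of how to inspect the measured values and disable the drives' volatile write cache (device paths are placeholders; RAID/HBA cache is controller-specific):
# per-OSD capacity values the startup benchmark stored in the config database
ceph config dump | grep osd_mclock_max_capacity_iops
# disable write cache on a SATA drive
hdparm -W 0 /dev/sdX
# disable write cache on a SAS drive
sdparm --clear=WCE /dev/sdX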

[ceph-users] Re: Cephadm old spec Feature `crush_device_class` is not supported

2022-08-04 Thread David Orman
https://github.com/ceph/ceph/pull/46480 - you can see the backports/dates there. Perhaps it isn't in the version you're running? On Thu, Aug 4, 2022 at 7:51 AM Kenneth Waegeman wrote: > Hi all, > > I’m trying to deploy this spec: > > spec: > data_devices: > model: Dell Ent NVMe AGN MU

[ceph-users] Re: PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-15 Thread David Orman
Apologies, backport link should be: https://github.com/ceph/ceph/pull/46845 On Fri, Jul 15, 2022 at 9:14 PM David Orman wrote: > I think you may have hit the same bug we encountered. Cory submitted a > fix, see if it fits what you've encountered: > > https://github.com/ceph/cep

[ceph-users] Re: PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-15 Thread David Orman
I think you may have hit the same bug we encountered. Cory submitted a fix, see if it fits what you've encountered: https://github.com/ceph/ceph/pull/46727 (backport to Pacific here: https://github.com/ceph/ceph/pull/46877 ) https://tracker.ceph.com/issues/54172 On Fri, Jul 15, 2022 at 8:52 AM

[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-13 Thread David Orman
Is this something that makes sense to do the 'quick' fix on for the next pacific release to minimize impact to users until the improved iteration can be implemented? On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov wrote: > Hi Dan, > > I can confirm this is a regression introduced by >

[ceph-users] Ceph Leadership Team Meeting Minutes (2022-07-06)

2022-07-06 Thread David Orman
Here are the main topics of discussion during the CLT meeting today: - make-check/API tests - Ignoring the doc/ directory would skip an expensive git checkout operation and save time - Stale PRs - Currently an issue with stalebot which is being investigated - Cephalocon

[ceph-users] Re: Set device-class via service specification file

2022-06-27 Thread David Orman
Hi Robert, We had the same question and ended up creating a PR for this: https://github.com/ceph/ceph/pull/46480 - there are backports, as well, so I'd expect it will be in the next release or two. David On Mon, Jun 27, 2022 at 8:07 AM Robert Reihs wrote: > Hi, > We are setting up a test
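For reference, a sketch of an OSD service specification using that option, assuming a release that includes the backport (class, placement, and filter values are examples only):
service_type: osd
service_id: nvme_osds
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
  crush_device_class: nvme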

[ceph-users] Re: OSDs getting OOM-killed right after startup

2022-06-10 Thread David Orman
Are you thinking it might be a permutation of: https://tracker.ceph.com/issues/53729 ? There are some posts in it to check for the issue, #53 and #65 had a few potential ways to check. On Fri, Jun 10, 2022 at 5:32 AM Marius Leustean wrote: > Did you check the mempools? > > ceph daemon osd.X

[ceph-users] Re: OpenStack Swift on top of CephFS

2022-06-09 Thread David Orman
I agree with this, just because you can doesn't mean you should. It will likely be significantly less painful to upgrade the infrastructure to support doing this the more-correct way, vs. trying to layer swift on top of cephfs. I say this having a lot of personal experience with Swift at extremely

[ceph-users] Re: Slow delete speed through the s3 API

2022-06-03 Thread David Orman
Is your client using the DeleteObjects call to delete 1000 per request?: https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html On Fri, Jun 3, 2022 at 9:35 AM J-P Methot wrote: > Read/writes are super fast. It's only deletes that are incredibly slow, > both through the s3 api
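For reference, a sketch of a batched delete via the aws CLI against RGW (bucket, keys, and endpoint are placeholders); each DeleteObjects call can carry up to 1000 keys:
cat > delete.json <<'EOF'
{ "Objects": [ { "Key": "obj-0001" }, { "Key": "obj-0002" } ], "Quiet": true }
EOF
aws s3api delete-objects --bucket mybucket --delete file://delete.json \
  --endpoint-url https://rgw.example.com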

[ceph-users] Re: Replacing OSD with DB on shared NVMe

2022-05-25 Thread David Orman
In your example, you can login to the server in question with the OSD, and run "ceph-volume lvm zap --osd-id --destroy" and it will purge the DB/WAL LV. You don't need to reapply your osd spec, it will detect the available space on the nvme and redeploy that OSD. On Wed, May 25, 2022 at 3:37 PM
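A sketch of the sequence typically used here (OSD id 12 is a placeholder; the rm step applies if the OSD hasn't already been removed, and on containerized deployments ceph-volume runs inside "cephadm shell"):
# remove the OSD but keep its id free for reuse
ceph orch osd rm 12 --replace
# once drained, on that OSD's host: purge the data device and its DB/WAL LV
ceph-volume lvm zap --osd-id 12 --destroy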

[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-17 Thread David Orman
test cluster that you upgraded that didn't exhibit the new > issue in 16.2.8? Thanks. > > Respectfully, > > *Wes Dillingham* > w...@wesdillingham.com > LinkedIn <http://www.linkedin.com/in/wesleydillingham> > > > On Tue, May 17, 2022 at 10:24 AM David Orman wrote:

[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-17 Thread David Orman
We had an issue with our original fix in 45963 which was resolved in https://github.com/ceph/ceph/pull/46096. It includes the fix as well as handling for upgraded clusters. This is in the 16.2.8 release. I'm not sure if it will resolve your problem (or help mitigate it) but it would be worth

[ceph-users] Re: Recommendations on books

2022-04-27 Thread David Orman
Hi, I don't have any book suggestions, but in my experience, the best way to learn is to set up a cluster and start intentionally breaking things, and see how you can fix them. Perform upgrades, add load, etc. I do suggest starting with Pacific (the upcoming 16.2.8 release would likely be a good

[ceph-users] Re: [EXTERNAL] Re: radosgw-admin bi list failing with Input/output error

2022-04-21 Thread David Orman
https://tracker.ceph.com/issues/51429 with https://github.com/ceph/ceph/pull/45088 for Octopus. We're also working on: https://tracker.ceph.com/issues/55324 which is somewhat related in a sense. On Thu, Apr 21, 2022 at 11:19 AM Guillaume Nobiron wrote: > Yes, all the buckets in the reshard

[ceph-users] Re: radosgw-admin bi list failing with Input/output error

2022-04-21 Thread David Orman
Is this a versioned bucket? On Thu, Apr 21, 2022 at 9:51 AM Guillaume Nobiron wrote: > Hello, > > I have an issue on my ceph cluster (octopus 15.2.16) with several buckets > raising a LARGE_OMAP_OBJECTS warning. > I found the buckets in the resharding list but ceph fails to reshard them. > >

[ceph-users] Re: Laggy OSDs

2022-03-29 Thread David Orman
We're definitely dealing with something that sounds similar, but hard to state definitively without more detail. Do you have object lock/versioned buckets in use (especially if one started being used around the time of the slowdown)? Was this cluster always 16.2.7? What is your pool configuration

[ceph-users] Re: [RGW] Too much index objects and OMAP keys on them

2022-03-25 Thread David Orman
Hi Gilles, Did you ever figure this out? Also, your rados ls output indicates that the prod cluster has fewer objects in the index pool than the backup cluster, or am I misreading this? David On Wed, Dec 1, 2021 at 4:32 AM Gilles Mocellin < gilles.mocel...@nuagelibre.org> wrote: > Hello, > >

[ceph-users] Re: Cephadm is stable or not in product?

2022-03-08 Thread David Orman
We use it without major issues, at this point. There are still flaws, but there are flaws in almost any deployment and management system, and this is not unique to cephadm. I agree with the general sentiment that you need to have some knowledge about containers, however. I don't think that's

[ceph-users] Re: RGW automation encryption - still testing only?

2022-02-08 Thread David Orman
in the quincy release - and if not, we'll backport it to > > quincy in an early point release > > > > can SSE-S3 with PutBucketEncryption satisfy your use case? > > > > [1] > https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html > >

[ceph-users] RGW automation encryption - still testing only?

2022-02-08 Thread David Orman
Is RGW encryption for all objects at rest still testing only, and if not, which version is it considered stable in?: https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only David ___ ceph-users mailing list --

[ceph-users] Re: Monitoring ceph cluster

2022-01-26 Thread David Orman
What version of Ceph are you using? Newer versions deploy a dashboard and prometheus module, which has some of this built in. It's a great start to seeing what can be done using Prometheus and the built in exporter. Once you learn this, if you decide you want something more robust, you can do an

[ceph-users] Re: Ideas for Powersaving on archive Cluster ?

2022-01-12 Thread David Orman
If performance isn't as big a concern, most servers have firmware settings that enable more aggressive power saving, at the cost of added latency/reduced cpu power/etc. HPE would be accessible/configurable via HP's ILO, Dells with DRAC, etc. They'd want to test and see how much of an impact it

[ceph-users] Re: cephadm issues

2022-01-07 Thread David Orman
What are you trying to do that won't work? If you need resources from outside the container, it doesn't sound like something you should need to be entering a shell inside the container to accomplish. On Fri, Jan 7, 2022 at 1:49 PM François RONVAUX wrote: > Thanks for the answer. > > I would

[ceph-users] Re: Repair/Rebalance slows down

2022-01-06 Thread David Orman
What's iostat show for the drive in question? What you're seeing is the cluster rebalancing initially, then at the end, it's probably that single drive being filled. I'd expect 25-100MB/s to be the fill rate of the newly added drive with backfills per osd set to 2 or so (much more than that

[ceph-users] Re: 16.2.7 pacific QE validation status, RC1 available for testing

2021-12-03 Thread David Orman
We've been testing RC1 since release on our 504 OSD / 21 host test cluster with split db/wal, and have experienced no issues on upgrade or operation so far. On Mon, Nov 29, 2021 at 11:23 AM Yuri Weinstein wrote: > Details of this release are summarized here: > >

[ceph-users] Re: Is it normal for a orch osd rm drain to take so long?

2021-12-02 Thread David Orman
.72899 > 0 > 0 B > 0 B > 0 B > 0 B > 0 B > 0 B > 0 > 0 > 1 > up > > Zach > > On 2021-12-01 5:20 PM, David Orman wrote: > > What's "ceph osd df" show? > > On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC) > wrote: > >> I wanted to

[ceph-users] Re: Is it normal for a orch osd rm drain to take so long?

2021-12-01 Thread David Orman
What's "ceph osd df" show? On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC) wrote: > I wanted to swap out on existing OSD, preserve the number, and then remove > the HDD that had it (osd.14 in this case) and give the ID of 14 to a new > SSD that would be taking its place in the same node. First

[ceph-users] Re: Pg autoscaling and device_health_metrics pool pg sizing

2021-11-02 Thread David Orman
I suggest continuing with manual PG sizing for now. With 16.2.6 we have seen the autoscaler scale up the device health metrics to 16000+ PGs on brand new clusters, which we know is incorrect. It's on our company backlog to investigate, but far down the backlog. It's bitten us enough times in the
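A sketch of pinning this manually until the autoscaler behavior is understood (pool name as created by default; PG counts are examples):
# stop the autoscaler from resizing this pool
ceph osd pool set device_health_metrics pg_autoscale_mode off
# set a fixed size
ceph osd pool set device_health_metrics pg_num 1
# make new pools default to manual sizing
ceph config set global osd_pool_default_pg_autoscale_mode off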

[ceph-users] Re: Free space in ec-pool should I worry?

2021-11-01 Thread David Orman
The balancer does a pretty good job. It's the PG autoscaler that has bitten us frequently enough that we always ensure it is disabled for all pools. David On Mon, Nov 1, 2021 at 2:08 PM Alexander Closs wrote: > I can add another 2 positive datapoints for the balancer, my personal and > work

[ceph-users] Re: Adopting "unmanaged" OSDs into OSD service specification

2021-10-13 Thread David Orman
for a more smooth way to do that. > > Luis Domingues > > ‐‐‐ Original Message ‐‐‐ > > On Monday, October 4th, 2021 at 10:01 PM, David Orman < > orma...@corenode.com> wrote: > > > We have an older cluster which has been iterated on many times. It's

[ceph-users] Re: RFP for arm64 test nodes

2021-10-09 Thread David Orman
If there's intent to use this for performance comparisons between releases, I would propose that you include rotational drive(s), as well. It will be quite some time before everyone is running pure NVME/SSD clusters with the storage costs associated with that type of workload, and this should be

[ceph-users] Adopting "unmanaged" OSDs into OSD service specification

2021-10-04 Thread David Orman
We have an older cluster which has been iterated on many times. It's always been cephadm deployed, but I am certain the OSD specification used has changed over time. I believe at some point, it may have been 'rm'd. So here's our current state: root@ceph02:/# ceph orch ls osd --export
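For context, a minimal OSD specification of the kind discussed here, which can be (re)applied so existing/unmanaged OSDs are matched again (values are examples only):
service_type: osd
service_id: osd_spec
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  filter_logic: AND
Applied with "ceph orch apply -i osd_spec.yaml".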

[ceph-users] Re: [16.2.6] When adding new host, cephadm deploys ceph image that no longer exists

2021-09-29 Thread David Orman
It appears when an updated container for 16.2.6 (there was a remoto version included with a bug in the first release) was pushed, the old one was removed from quay. We had to update our 16.2.6 clusters to the 'new' 16.2.6 version, and just did the typical upgrade with the image specified. This

[ceph-users] Re: prometheus - figure out which mgr (metrics endpoint) that is active

2021-09-28 Thread David Orman
We scrape all mgr endpoints since we use external Prometheus clusters, as well. The query results will have {instance=activemgrhost}. The dashboards in upstream don't have multiple cluster support, so we have to modify them to work with our deployments since we have multiple ceph clusters being
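For reference, a sketch of an external Prometheus scrape config covering all mgr endpoints (hostnames and the cluster label are placeholders; 9283 is the mgr prometheus module's default port). Only the active mgr serves metrics, so the resulting series carry the active mgr host as the instance label:
scrape_configs:
  - job_name: 'ceph'
    honor_labels: true
    static_configs:
      - targets: ['ceph01:9283', 'ceph02:9283', 'ceph03:9283']
        labels:
          cluster: 'prod-a'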

[ceph-users] Re: Change max backfills

2021-09-24 Thread David Orman
With recent releases, 'ceph config' is probably a better option; do keep in mind this sets things cluster-wide. If you're just wanting to target specific daemons, then tell may be better for your use case. # get current value ceph config get osd osd_max_backfills # set new value to 2, for
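Continuing the example (osd.12 is a placeholder):
# set the new default cluster-wide via the config database
ceph config set osd osd_max_backfills 2
# or adjust one running daemon without touching the stored config
ceph tell osd.12 config set osd_max_backfills 2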

[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-4b2736a28c ^^ if people want to test and provide feedback for a potential merge to EPEL8 stable. David On Wed, Sep 22, 2021 at 11:43 AM David Orman wrote: > > I'm wondering if this was installed using pip/pypi before, and now >

[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
I'm wondering if this was installed using pip/pypi before, and now switched to using EPEL? That would explain it - 1.2.1 may never have been pushed to EPEL. David On Wed, Sep 22, 2021 at 11:26 AM David Orman wrote: > > We'd worked on pushing a change to fix > https://tracker.ceph.c

[ceph-users] Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
deployments with medium to large counts of OSDs or split db/wal devices, like many modern deployments. https://koji.fedoraproject.org/koji/packageinfo?packageID=18747 https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/ Respectfully, David Orman

[ceph-users] Re: rocksdb corruption with 16.2.6

2021-09-20 Thread David Orman
Same question here, for clarity, was this on upgrading to 16.2.6 from 16.2.5? Or upgrading from some other release? On Mon, Sep 20, 2021 at 8:57 AM Sean wrote: > > I also ran into this with v16. In my case, trying to run a repair totally > exhausted the RAM on the box, and was unable to

[ceph-users] Re: rocksdb corruption with 16.2.6

2021-09-20 Thread David Orman
For clarity, was this on upgrading to 16.2.6 from 16.2.5? Or upgrading from some other release? On Mon, Sep 20, 2021 at 8:33 AM Paul Mezzanini wrote: > > I got the exact same error on one of my OSDs when upgrading to 16. I > used it as an exercise on trying to fix a corrupt rocksdb. A spent a

[ceph-users] Re: OSD based ec-code

2021-09-14 Thread David Orman
ices Co., Ltd. > e: istvan.sz...@agoda.com > ------- > > -Original Message- > From: David Orman > Sent: Tuesday, September 14, 2021 8:55 PM > To: Eugen Block > Cc: ceph-users > Subject: [ceph-users] Re: OSD based ec-code &g

[ceph-users] Re: OSD based ec-code

2021-09-14 Thread David Orman
Keep in mind performance, as well. Once you start getting into higher 'k' values with EC, you've got a lot more drives involved that need to return completions for operations, and on rotational drives this becomes especially painful. We use 8+3 for a lot of our purposes, as it's a good balance of
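For reference, a sketch of creating such a profile and a pool that uses it (names and PG counts are examples); note that with a host failure domain, 8+3 needs at least 11 hosts to place each PG:
ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
ceph osd pool create rgw.buckets.data 128 128 erasure ec-8-3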

[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-09 Thread David Orman
No problem, and it looks like they will. Glad it worked out for you! David On Thu, Sep 9, 2021 at 9:31 AM mabi wrote: > > Thank you Eugen. Indeed the answer went to Spam :( > > So thanks to David for his workaround, it worked like a charm. Hopefully > these patches can make it into the next

[ceph-users] Re: Smarter DB disk replacement

2021-09-09 Thread David Orman
Exactly, we minimize the blast radius/data destruction by allocating more devices for DB/WAL of smaller size than less of larger size. We encountered this same issue on an earlier iteration of our hardware design. With rotational drives and NVMEs, we are now aiming for a 6:1 ratio based on our

[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-08 Thread David Orman
undeploy, then re-add the label, and it will redeploy. On Wed, Sep 8, 2021 at 7:03 AM David Orman wrote: > > This sounds a lot like: https://tracker.ceph.com/issues/51027 which is > fixed in https://github.com/ceph/ceph/pull/42690 > > David > > On Tue, Sep 7, 2021 a

[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-08 Thread David Orman
This sounds a lot like: https://tracker.ceph.com/issues/51027 which is fixed in https://github.com/ceph/ceph/pull/42690 David On Tue, Sep 7, 2021 at 7:31 AM mabi wrote: > > Hello > > I have a test ceph octopus 16.2.5 cluster with cephadm out of 7 nodes on > Ubuntu 20.04 LTS bare metal. I just

[ceph-users] Re: Cephadm cannot aquire lock

2021-09-02 Thread David Orman
It may be this: https://tracker.ceph.com/issues/50526 https://github.com/alfredodeza/remoto/issues/62 Which we resolved with: https://github.com/alfredodeza/remoto/pull/63 What version of ceph are you running, and is it impacted by the above? David On Thu, Sep 2, 2021 at 9:53 AM fcid wrote:

[ceph-users] Re: Missing OSD in SSD after disk failure

2021-08-30 Thread David Orman
"filter_logic: AND" in the yaml file > and the result was the same. > > Best regards, > Eric > > > -Original Message- > From: David Orman [mailto:orma...@corenode.com] > Sent: 27 August 2021 14:56 > To: Eric Fahnle > Cc: ceph-users@ceph.io > Subject:

[ceph-users] Re: Missing OSD in SSD after disk failure

2021-08-27 Thread David Orman
This was a bug in some versions of ceph, which has been fixed: https://tracker.ceph.com/issues/49014 https://github.com/ceph/ceph/pull/39083 You'll want to upgrade Ceph to resolve this behavior, or you can use size or something else to filter if that is not possible. David On Thu, Aug 19, 2021

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-12 Thread David Orman
- On 9 Aug 2021 at 18:15, David Orman orma...@corenode.com wrote: > > > Hi, > > > > We are seeing very similar behavior on 16.2.5, and also have noticed > > that an undeploy/deploy cycle fixes things. Before we go rummaging > > through the source code tryi

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-10 Thread David Orman
Just adding our feedback - this is affecting us as well. We reboot periodically to test durability of the clusters we run, and this is fairly impactful. I could see power loss/other scenarios in which this could end quite poorly for those with less than perfect redundancy in DCs across multiple

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-09 Thread David Orman
Hi, We are seeing very similar behavior on 16.2.5, and also have noticed that an undeploy/deploy cycle fixes things. Before we go rummaging through the source code trying to determine the root cause, has anybody else figured this out? It seems odd that a repeatable issue (I've seen other mailing

[ceph-users] Re: Having issues to start more than 24 OSDs per host

2021-06-22 Thread David Orman
https://tracker.ceph.com/issues/50526 https://github.com/alfredodeza/remoto/issues/62 If you're brave (YMMV, test first non-prod), we pushed an image with the issue we encountered fixed as per above here: https://hub.docker.com/repository/docker/ormandj/ceph/tags?page=1 that you can use to

[ceph-users] Re: Ceph Managers dieing?

2021-06-17 Thread David Orman
Hi Peter, We fixed this bug: https://tracker.ceph.com/issues/47738 recently here: https://github.com/ceph/ceph/commit/b4316d257e928b3789b818054927c2e98bb3c0d6 which should hopefully be in the next release(s). David On Thu, Jun 17, 2021 at 12:13 PM Peter Childs wrote: > > Found the issue in

[ceph-users] Re: Fwd: Re: Ceph osd will not start.

2021-06-01 Thread David Orman
make it clear. On Tue, Jun 1, 2021 at 2:30 AM David Orman wrote: > > I do not believe it was in 16.2.4. I will build another patched version of > the image tomorrow based on that version. I do agree, I feel this breaks new > deploys as well as existing, and hope a point release will c

[ceph-users] Re: Fwd: Re: Ceph osd will not start.

2021-06-01 Thread David Orman
since we began using it in > luminous/mimic, but situations such as this are hard to look past. It's > really unfortunate as our existing production clusters have been rock solid > thus far, but this does shake one's confidence, and I would wager that I'm > not alone. > > Marco > >

[ceph-users] Re: Fwd: Re: Ceph osd will not start.

2021-05-31 Thread David Orman
ing but not >> detected by Linux, which makes me think I'm hitting some kernel limit. >> >> At this point I'm going to cut my loses and give up and use the small >> slightly more powerful 30x drive systems I have (with 256g memory), maybe >> transplanting the larger

[ceph-users] Re: Fwd: Re: Ceph osd will not start.

2021-05-29 Thread David Orman
You may be running into the same issue we ran into (make sure to read the first issue, there's a few mingled in there), for which we submitted a patch: https://tracker.ceph.com/issues/50526 https://github.com/alfredodeza/remoto/issues/62 If you're brave (YMMV, test first non-prod), we pushed an

[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD

2021-05-26 Thread David Orman
We've found that after doing the osd rm, you can use: "ceph-volume lvm zap --osd-id 178 --destroy" on the server with that OSD as per: https://docs.ceph.com/en/latest/ceph-volume/lvm/zap/#removing-devices and it will clean things up so they work as expected. On Tue, May 25, 2021 at 6:51 AM Kai

[ceph-users] Re: Ceph 16.2.3 issues during upgrade from 15.2.10 with cephadm/lvm list

2021-05-14 Thread David Orman
We've created a PR to fix the root cause of this issue: https://github.com/alfredodeza/remoto/pull/63 Thank you, David On Mon, May 10, 2021 at 7:29 PM David Orman wrote: > > Hi Sage, > > We've got 2.0.27 installed. I restarted all the manager pods, just in > case, and I have th

[ceph-users] Re: Ceph 16.2.3 issues during upgrade from 15.2.10 with cephadm/lvm list

2021-05-10 Thread David Orman
problem. What version are you using? The > kubic repos currently have 2.0.27. See > https://build.opensuse.org/project/show/devel:kubic:libcontainers:stable > > We'll make sure the next release has the verbosity workaround! > > sage > > On Mon, May 10, 2021 at 5:47 PM David

[ceph-users] Re: Ceph 16.2.3 issues during upgrade from 15.2.10 with cephadm/lvm list

2021-05-10 Thread David Orman
/ 12 OSDs per NVME), even when new OSDs are not being deployed, as it still tries to apply the OSD specification. On Mon, May 10, 2021 at 4:03 PM David Orman wrote: > > Hi, > > We are seeing the mgr attempt to apply our OSD spec on the various > hosts, then block. When we inve

[ceph-users] Ceph 16.2.3 issues during upgrade from 15.2.10 with cephadm/lvm list

2021-05-10 Thread David Orman
Hi, We are seeing the mgr attempt to apply our OSD spec on the various hosts, then block. When we investigate, we see the mgr has executed cephadm calls like so, which are blocking: root 1522444 0.0 0.0 102740 23216 ? S 17:32 0:00 \_ /usr/bin/python3

[ceph-users] Re: Stuck OSD service specification - can't remove

2021-05-10 Thread David Orman
We are using 16.2.3. Thanks, David On Fri, May 7, 2021 at 9:06 AM David Orman wrote: > > Hi, > > I'm not attempting to remove the OSDs, but instead the > service/placement specification. I want the OSDs/data to persist. > --force did not work on the service, as noted in the original

[ceph-users] Re: x-amz-request-id logging with beast + rgw (ceph 15.2.10/containerized)?

2021-05-07 Thread David Orman
. David On Fri, May 7, 2021 at 4:21 PM Matt Benjamin wrote: > > Hi David, > > I think the solution is most likely the ops log. It is called for > every op, and has the transaction id. > > Matt > > On Fri, May 7, 2021 at 4:58 PM David Orman wrote: > > > >

[ceph-users] Re: x-amz-request-id logging with beast + rgw (ceph 15.2.10/containerized)?

2021-05-07 Thread David Orman
> using lua scripting on the RGW: > https://docs.ceph.com/en/pacific/radosgw/lua-scripting/ > > Yuval > > On Thu, Apr 1, 2021 at 7:11 PM David Orman wrote: >> >> Hi, >> >> Is there any way to log the x-amz-request-id along with the request in >> the rg

[ceph-users] Re: Stuck OSD service specification - can't remove

2021-05-07 Thread David Orman
that everything was fine again. This is a Ceph 15.2.11 cluster on > Ubuntu 20.04 and podman. > > Hope that helps. > > ‐‐‐ Original Message ‐‐‐ > On Friday, May 7, 2021 1:24 AM, David Orman wrote: > > > Has anybody run into a 'stuck' OSD service specification? I've tried &

[ceph-users] Stuck OSD service specification - can't remove

2021-05-06 Thread David Orman
Has anybody run into a 'stuck' OSD service specification? I've tried to delete it, but it's stuck in 'deleting' state, and has been for quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3: NAME PORTS RUNNING REFRESHED AGE PLACEMENT osd.osd_spec

[ceph-users] Re: Failed cephadm Upgrade - ValueError

2021-05-04 Thread David Orman
Can you please run: "cat /sys/kernel/security/apparmor/profiles"? See if any of the lines have a label but no mode. Let us know what you find! Thanks, David On Mon, May 3, 2021 at 8:58 AM Ashley Merrick wrote: > Created BugTicket : https://tracker.ceph.com/issues/50616 > > On Mon May 03 2021
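A quick way to spot such lines, assuming the usual "name (mode)" format of that file:
# profile lines normally end in a mode such as (enforce); print any that don't
grep -v ')' /sys/kernel/security/apparmor/profiles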

[ceph-users] Re: using ec pool with rgw

2021-05-03 Thread David Orman
We haven't found a more 'elegant' way, but the process we follow: we pre-create all the pools prior to creating the realm/zonegroup/zone, then we period apply, then we remove the default zonegroup/zone, period apply, then remove the default pools. Hope this is at least somewhat helpful, David On

[ceph-users] Re: Version of podman for Ceph 15.2.10

2021-04-09 Thread David Orman
> 08c4e95c0c03 docker.io/prom/prometheus:v2.18.1 --config.file=/et... 15 > hours ago Up 15 hours ago > ceph-8d47792c-987d-11eb-9bb6-a5302e00e1fa-prometheus.ceph1a > 19944dbf7a63 docker.io/prom/alertmanager:v0.20.0 --web.listen-addr... 15 > hours ago Up 15

[ceph-users] Re: Version of podman for Ceph 15.2.10

2021-04-08 Thread David Orman
The latest podman 3.0.1 release is fine (we have many production clusters running this). We have not tested 3.1 yet, however, but will soon. > On Apr 8, 2021, at 10:32, mabi wrote: > > Hello, > > I would like to install Ceph 15.2.10 using cephadm and just found the > following table by

[ceph-users] bluestore_min_alloc_size_hdd on Octopus (15.2.10) / XFS formatted RBDs

2021-04-07 Thread David Orman
Now that the hybrid allocator appears to be enabled by default in Octopus, is it safe to change bluestore_min_alloc_size_hdd to 4k from 64k on Octopus 15.2.10 clusters, and then redeploy every OSD to switch to the smaller allocation size, without massive performance impact for RBD? We're seeing a
