[ceph-users] Multi-MDS CephFS upgrades limitation

2021-10-06 Thread Bryan Stillwell
One of the main limitations of using CephFS is the requirement to reduce the number of active MDS daemons to one during upgrades. As far as I can tell this has been a known problem since Luminous (~2017). This issue essentially requires downtime during upgrades for any CephFS cluster that needs
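
For reference, the workaround the upgrade docs describe is to drop to a single active MDS before upgrading and restore the count afterwards. A minimal sketch, assuming a filesystem named cephfs that normally runs two active ranks:

  # reduce to one active MDS and wait for the extra ranks to stop
  ceph fs set cephfs max_mds 1
  ceph status                  # wait until only rank 0 remains active

  # ... upgrade/restart the MDS daemons ...

  # restore the original number of active ranks
  ceph fs set cephfs max_mds 2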

[ceph-users] Re: v16.2.5 Pacific released

2021-07-09 Thread Bryan Stillwell
Thanks David! This looks good now. :) > On Jul 8, 2021, at 6:28 PM, David Galloway wrote: > > Done! > > On 7/8/21 3:51 PM, Bryan Stillwell wrote: >> There appears to be arm64 packages built for Ubuntu Bionic, but not for >> Focal. Any chance Focal pa

[ceph-users] name alertmanager/node-exporter already in use with v16.2.5

2021-07-08 Thread Bryan Stillwell
I upgraded one of my clusters to v16.2.5 today and now I'm seeing these messages from 'ceph -W cephadm': 2021-07-08T22:01:55.356953+ mgr.excalibur.kuumco [ERR] Failed to apply alertmanager spec AlertManagerSpec({'placement': PlacementSpec(count=1), 'service_type': 'alertmanager',
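
A few commands that may help show where the duplicate service name comes from (diagnostic steps only, not the confirmed fix from this thread):

  ceph orch ls --export                      # dump all service specs
  ceph orch ps --daemon-type alertmanager    # list deployed daemons
  ceph health detail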

[ceph-users] Re: v16.2.5 Pacific released

2021-07-08 Thread Bryan Stillwell
There appear to be arm64 packages built for Ubuntu Bionic, but not for Focal. Any chance Focal packages can be built as well? Thanks, Bryan > On Jul 8, 2021, at 12:20 PM, David Galloway wrote:

[ceph-users] Re: cephadm removed mon. key when adding new mon node

2021-06-01 Thread Bryan Stillwell
upgrades after that, which means the global container image name was never changed. Bryan On Jun 1, 2021, at 9:38 AM, Bryan Stillwell wrote: This morning I tried adding a mon node to my home Ceph cluster with the following command: ceph orch daemon add mon

[ceph-users] cephadm removed mon. key when adding new mon node

2021-06-01 Thread Bryan Stillwell
This morning I tried adding a mon node to my home Ceph cluster with the following command: ceph orch daemon add mon ether This seemed to work at first, but then it decided to remove it fairly quickly, which broke the cluster because the mon. keyring was also removed:
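
When mons are managed by cephadm, it appears safer to declare the desired placement in the mon service spec instead of adding a daemon directly, so the scheduler doesn't remove it again. A sketch with example hostnames:

  # declare the intended mon hosts (names are examples)
  ceph orch apply mon --placement="aladdin,excalibur,ether"

  # confirm what the scheduler intends to run
  ceph orch ls mon
  ceph orch ps --daemon-type mon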

[ceph-users] Re: CRUSH rule for EC 6+2 on 6-node cluster

2021-05-15 Thread Bryan Stillwell
[8,17,4,1,14,0,19,8]p8 2021-05-11T22:41:11.332885+ 2021-05-11T22:41:11.332885+ I'm now considering using device classes and assigning the OSDs to either hdd1 or hdd2... Unless someone has another idea? Thanks, Bryan > On May 14, 2021, at 12:35 PM, Bryan Stillwell wrote: > > This wor

[ceph-users] Re: CRUSH rule for EC 6+2 on 6-node cluster

2021-05-14 Thread Bryan Stillwell
oseleaf indep 1 type osd > step emit > > J. > > ‐‐‐ Original Message ‐‐‐ > > On Wednesday, May 12th, 2021 at 17:58, Bryan Stillwell > wrote: > >> I'm trying to figure out a CRUSH rule that will spread data out across my >> cluster as much as possib

[ceph-users] cephadm stalled after adjusting placement

2021-05-14 Thread Bryan Stillwell
I'm looking for help in figuring out why cephadm isn't making any progress after I told it to redeploy an mds daemon with: ceph orch daemon redeploy mds.cephfs.aladdin.kgokhr ceph/ceph:v15.2.12 The output from 'ceph -W cephadm' just says: 2021-05-14T16:24:46.628084+ mgr.paris.glbvov [INF]
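
When the cephadm module seems wedged, one common way to kick it (not necessarily the resolution of this thread) is to refresh its inventory and fail over the active mgr:

  ceph orch ps --refresh
  ceph mgr fail          # force a standby mgr to take over
  ceph -W cephadm        # watch whether the redeploy resumes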

[ceph-users] Re: CRUSH rule for EC 6+2 on 6-node cluster

2021-05-12 Thread Bryan Stillwell
2 mandalaybay 2 paris ... Hopefully someone else will find this useful. Bryan > On May 12, 2021, at 9:58 AM, Bryan Stillwell wrote: > > I'm trying to figure out a CRUSH rule that will spread data out across my > cluster as much as possible, but not more than 2 chunks per host
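
A sketch of one way to produce per-host chunk counts like the ones above (the pool name is an example):

  # for each PG in the EC pool, count how many chunks land on each host
  for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r '.pg_stats[].pgid'); do
      echo -n "$pg: "
      for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
          ceph osd find $osd | jq -r '.host'
      done | sort | uniq -c | sort -rn | xargs
  done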

[ceph-users] CRUSH rule for EC 6+2 on 6-node cluster

2021-05-12 Thread Bryan Stillwell
I'm trying to figure out a CRUSH rule that will spread data out across my cluster as much as possible, but not more than 2 chunks per host. If I use the default rule with an osd failure domain like this: step take default step choose indep 0 type osd step emit I get clustering of 3-4 chunks on
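
A common pattern for keeping an 8-chunk EC profile to two chunks per host is a two-step rule (pick 4 hosts, then 2 OSDs from each). A sketch of editing the decompiled CRUSH map by hand; the rule name and id are examples, and this is not necessarily the rule settled on in this thread:

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # add a rule along these lines to crushmap.txt:
  #   rule ec62 {
  #       id 2
  #       type erasure
  #       step set_chooseleaf_tries 5
  #       step set_choose_tries 100
  #       step take default
  #       step choose indep 4 type host
  #       step choose indep 2 type osd
  #       step emit
  #   }

  crushtool -c crushmap.txt -o crushmap-new.bin
  ceph osd setcrushmap -i crushmap-new.bin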

[ceph-users] Upgrade to 15.2.7 fails on mixed x86_64/arm64 cluster

2020-12-01 Thread Bryan Stillwell
I tried upgrading my home cluster to 15.2.7 (from 15.2.5) today and it appears to be entering a loop when trying to match docker images for ceph:v15.2.7: 2020-12-01T16:47:26.761950-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr daemons... 2020-12-01T16:47:26.769581-0700 mgr.aladdin.liknom
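
General upgrade controls that may help when the orchestrator loops on image resolution (not the specific fix for the mixed-arch problem):

  ceph orch upgrade status
  ceph orch upgrade pause
  # retry with an explicit image reference
  ceph orch upgrade start --image docker.io/ceph/ceph:v15.2.7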

[ceph-users] Is it possible to rebuild a bucket instance?

2020-08-06 Thread Bryan Stillwell
I have a cluster running Nautilus where the bucket instance (backups.190) has gone missing: # radosgw-admin metadata list bucket | grep 'backups.19[0-1]' | sort "backups.190", "backups.191", # radosgw-admin metadata list bucket.instance | grep 'backups.19[0-1]' | sort
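
A few radosgw-admin calls that may help show what metadata still exists for the bucket before attempting any rebuild (inspection only):

  radosgw-admin metadata get bucket:backups.190
  radosgw-admin bucket stats --bucket=backups.190
  radosgw-admin bucket check --bucket=backups.190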

[ceph-users] Multiple outages when disabling scrubbing

2020-06-03 Thread Bryan Stillwell
The last two days we've experienced a couple short outages shortly after setting both 'noscrub' and 'nodeep-scrub' on one of our largest Ceph clusters (~2,200 OSDs). This cluster is running Nautilus (14.2.6) and setting/unsetting these flags has been done many times in the past without a problem.

[ceph-users] Re: v15.2.0 Octopus released

2020-03-25 Thread Bryan Stillwell
On Mar 24, 2020, at 5:38 AM, Abhishek Lekshmanan wrote: > #. Upgrade monitors by installing the new packages and restarting the > monitor daemons. For example, on each monitor host,:: > > # systemctl restart ceph-mon.target > > Once all monitors are up, verify that the monitor upgrade
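
The verification step the release notes refer to is checking the mon map for the new release string, roughly:

  ceph mon dump | grep min_mon_release
  # expected once all mons are upgraded:
  #   min_mon_release 15 (octopus)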

[ceph-users] Re: v15.2.0 Octopus released

2020-03-24 Thread Bryan Stillwell
Great work! Thanks to everyone involved! One minor thing I've noticed so far with the Ubuntu Bionic build is that it's reporting the release as an RC instead of 'stable': $ ceph versions | grep octopus "ceph version 15.2.0 (dc6a0b5c3cbf6a5e1d6d4f20b5ad466d76b96247) octopus (rc)": 1

[ceph-users] Re: Ubuntu Bionic arm64 repo missing packages

2019-12-20 Thread Bryan Stillwell
I just noticed that arm64 packages only exist for xenial. Is there a reason why bionic packages aren't being built? Thanks, Bryan > On Dec 20, 2019, at 4:22 PM, Bryan Stillwell wrote: > > I was going to try adding an OSD to my home cluster using one of the 4GB > Raspber

[ceph-users] Ubuntu Bionic arm64 repo missing packages

2019-12-20 Thread Bryan Stillwell
I was going to try adding an OSD to my home cluster using one of the 4GB Raspberry Pis today, but it appears that the Ubuntu Bionic arm64 repo is missing a bunch of packages: $ sudo grep ^Package: /var/lib/apt/lists/download.ceph.com_debian-nautilus_dists_bionic_main_binary-arm64_Packages

[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.5

2019-12-18 Thread Bryan Stillwell
On Dec 18, 2019, at 1:48 PM, e...@lapsus.org wrote: > > That sounds very similar to what I described there: > https://tracker.ceph.com/issues/43364 I would agree that they're quite similar if not the same thing! Now that you mention it I see the thread is named mgr-fin in 'top -H' as well. I

[ceph-users] High CPU usage by ceph-mgr in 14.2.5

2019-12-18 Thread Bryan Stillwell
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
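
For anyone profiling a similar issue, the thread and syscall inspection described above can be reproduced with something like (the TID is whatever 'top -H' shows as busy):

  top -H -p $(pidof ceph-mgr)    # identify the hot thread (TID)
  strace -c -p <TID>             # Ctrl-C after a while for a syscall summary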

[ceph-users] Re: ceph-mon using 100% CPU after upgrade to 14.2.5

2019-12-16 Thread Bryan Stillwell
, 2019, at 10:27 AM, Sasha Litvak wrote: Bryan, Were you able to resolve this? If yes, can you please share with the list? On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell

[ceph-users] Re: ceph-mon using 100% CPU after upgrade to 14.2.5

2019-12-13 Thread Bryan Stillwell
alFrameEx 0.55% [kernel] [k] _raw_spin_unlock_irqrestore I increased mon debugging to 20 and nothing stuck out to me. Bryan > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell wrote: > > On our test cluster after upgrading to 14.2.5 I'm having problems with the

[ceph-users] ceph-mon using 100% CPU after upgrade to 14.2.5

2019-12-12 Thread Bryan Stillwell
On our test cluster after upgrading to 14.2.5 I'm having problems with the mons pegging a CPU core while moving data around. I'm currently converting the OSDs from FileStore to BlueStore by marking the OSDs out in multiple nodes, destroying the OSDs, and then recreating them with ceph-volume
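
The per-OSD replacement cycle described above roughly follows the documented FileStore-to-BlueStore migration; a sketch for a single OSD, with the ID and device as examples:

  ceph osd out 12
  # wait for the data to migrate off, then:
  systemctl stop ceph-osd@12
  ceph osd destroy 12 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdc --destroy
  ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 12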

[ceph-users] Re: RESEND: Re: PG Balancer Upmap mode not working

2019-12-10 Thread Bryan Stillwell
Rich, What's your failure domain (osd? host? chassis? rack?) and how big is each of them? For example I have a failure domain of type rack in one of my clusters with mostly even rack sizes: # ceph osd crush rule dump | jq -r '.[].steps' [ { "op": "take", "item": -1, "item_name":
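
For the sizing half of the question, the per-failure-domain weights and utilization can be pulled from something like:

  ceph osd df tree | grep -E 'rack|host'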

[ceph-users] Re: osdmaps not trimmed until ceph-mon's restarted (if cluster has a down osd)

2019-12-09 Thread Bryan Stillwell
On Nov 18, 2019, at 8:12 AM, Dan van der Ster wrote: > > On Fri, Nov 15, 2019 at 4:45 PM Joao Eduardo Luis wrote: >> >> On 19/11/14 11:04AM, Gregory Farnum wrote: >>> On Thu, Nov 14, 2019 at 8:14 AM Dan van der Ster >>> wrote: Hi Joao, I might have found the reason why

[ceph-users] Re: mgr hangs with upmap balancer

2019-11-22 Thread Bryan Stillwell
a solution yet so I'll stick with disabled balancer > for now since the current pg placement is fine. > > Regards, > Eugen > > > [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56994.html > [2] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56890.h

[ceph-users] mgr hangs with upmap balancer

2019-11-19 Thread Bryan Stillwell
On multiple clusters we are seeing the mgr hang frequently when the balancer is enabled. It seems that the balancer is getting caught in some kind of infinite loop which chews up all the CPU for the mgr which causes problems with other modules like prometheus (we don't have the devicehealth
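
Commands that may help confirm whether the balancer is the module spinning, and stop it while debugging (general controls, not the root-cause fix):

  ceph balancer status
  ceph balancer off
  ceph mgr module ls      # see which modules are enabled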

[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-19 Thread Bryan Stillwell
On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell wrote: >> Closing the loop here. I figured ou

[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-19 Thread Bryan Stillwell
this was to track down, maybe a check should be added before enabling msgr2 to make sure the require-osd-release is set to nautilus? Bryan > On Nov 18, 2019, at 5:41 PM, Bryan Stillwell wrote: > > I cranked up debug_ms to 20 on two of these clusters today and I'm still not > understand
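
Based on that conclusion, checking and setting the release flag before enabling msgr2 would look roughly like:

  ceph osd dump | grep require_osd_release
  ceph osd require-osd-release nautilus
  ceph mon enable-msgr2
  # after the OSDs re-register, confirm v2 addresses show up:
  ceph osd find 0 | jq -r '.addrs'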

[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-18 Thread Bryan Stillwell
5.979 7f917becf700 1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0) The learned address is v1:10.0.13.2:0/0. What else can I do to figure out why it's deciding to use the legacy protocol only? Thanks, Bryan > On Nov 15, 2019, at

[ceph-users] msgr2 not used on OSDs in some Nautilus clusters

2019-11-15 Thread Bryan Stillwell
I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some of the clusters (3 out of 7) the OSDs aren't using msgr2 at all. Here's the output for osd.0 on 2 clusters of each type: ### Cluster 1 (v1 only): # ceph osd find 0 | jq -r '.addrs' { "addrvec": [ {

[ceph-users] Bad links on ceph.io for mailing lists

2019-11-14 Thread Bryan Stillwell
There are some bad links to the mailing list subscribe/unsubscribe/archives on this page that should get updated: https://ceph.io/resources/ The subscribe/unsubscribe/archives links point to the old lists vger and lists.ceph.com, and not the new lists on lists.ceph.io: ceph-devel

[ceph-users] Counting OSD maps

2019-11-13 Thread Bryan Stillwell
With FileStore you can get the number of OSD maps for an OSD by using a simple find command: # rpm -q ceph ceph-12.2.12-0.el7.x86_64 # find /var/lib/ceph/osd/ceph-420/current/meta/ -name 'osdmap*' | wc -l 42486 Does anyone know of an equivalent command that can be used with BlueStore? Thanks,
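
Two approaches that might answer this for BlueStore (neither confirmed in the thread): compare the oldest/newest map epochs over the admin socket, or list the meta objects with ceph-objectstore-tool on a stopped OSD:

  # with the OSD running
  ceph daemon osd.420 status | jq '.newest_map - .oldest_map'

  # with the OSD stopped (counts full maps only, like the FileStore find)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-420 \
      --op meta-list | grep -c '"osdmap\.'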

[ceph-users] Re: RGW compression not compressing

2019-11-07 Thread Bryan Stillwell
Thanks Casey! Adding the following to my swiftclient put_object call caused it to start compressing the data: headers={'x-object-storage-class': 'STANDARD'} I appreciate the help! Bryan > On Nov 7, 2019, at 9:26 AM, Casey Bodley wrote: > > On 11/7/19 10:35 AM, Bryan Stillw

[ceph-users] Re: Splitting PGs not happening on Nautilus 14.2.2

2019-10-30 Thread Bryan Stillwell
Responding to myself to follow up with what I found. While going over the release notes for 14.2.3/14.2.4 I found this was a known problem that has already been fixed. Upgrading the cluster to 14.2.4 fixed the issue. Bryan > On Oct 30, 2019, at 10:33 AM, Bryan Stillwell wr

[ceph-users] Re: Compression on existing RGW buckets

2019-10-29 Thread Bryan Stillwell
- > just note that some 'helpful' s3 clients will insert a > 'x-amz-storage-class: STANDARD' header to requests that don't specify > one, and the presence of this header will override the user's default > storage class. > > On 10/29/19 12:20 PM, Bryan Stillwell wrote: >>

[ceph-users] Re: Several ceph osd commands hang

2019-10-29 Thread Bryan Stillwell
1, in start >self.tick() > File > "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line > 2090, in tick >s, ssl_env = self.ssl_adapter.wrap(s) > File > "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", >

[ceph-users] Compression on existing RGW buckets

2019-10-29 Thread Bryan Stillwell
I'm wondering if it's possible to enable compression on existing RGW buckets? The cluster is running Luminous 12.2.12 with FileStore as the backend (no BlueStore compression then). We have a cluster that recently started to rapidly fill up with compressible content (qcow2 images) and I would
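
For what it's worth, RGW compression is normally enabled per placement target in the zone and only applies to newly written data; a sketch assuming the default zone and placement:

  radosgw-admin zone placement modify \
      --rgw-zone=default \
      --placement-id=default-placement \
      --compression=zlib
  # restart the radosgw daemons, then compare size vs size_utilized:
  radosgw-admin bucket stats --bucket=<name> | grep -i utilized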

[ceph-users] Re: Slow peering caused by "wait for new map"

2019-09-04 Thread Bryan Stillwell
ile" OSDs * check if everything is ok look ing their logs * taking off the NOUP flag * Take a coffee and wait till all data are drain []'s Arthur (aKa Guilherme Geronimo) On 04/09/2019 15:32, Bryan Stillwell wrote: We are not using jumbo frames anywhere on this cluster (all mtu 1500)

[ceph-users] Slow peering caused by "wait for new map"

2019-09-04 Thread Bryan Stillwell
Our test cluster is seeing a problem where peering is going incredibly slow shortly after upgrading it to Nautilus (14.2.2) from Luminous (12.2.12). From what I can tell it seems to be caused by "wait for new map" taking a long time. When looking at dump_historic_slow_ops on pretty much any
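
The slow op data referenced above comes from the OSD admin socket; roughly:

  ceph daemon osd.0 dump_historic_slow_ops    # look for "wait for new map" in the event lists
  ceph daemon osd.0 dump_ops_in_flight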