[ceph-users] Corrupted RBD image

2020-10-29 Thread Ing . Luis Felipe Domínguez Vega
Hi: I tried to get info from an RBD image, but:
-
root@fond-beagle:/# rbd list --pool cinder-ceph | grep volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
root@fond-beagle:/# rbd info
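For reference, a complete form of the truncated command above would look roughly like this, using the image name from the listing (standard rbd pool/image syntax, not taken from the original post):

    rbd info cinder-ceph/volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
    # or equivalently
    rbd info --pool cinder-ceph --image volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda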

[ceph-users] MDS restarts after enabling msgr2

2020-10-29 Thread Stefan Kooman
Hi List, After a successful upgrade from Mimic 13.2.8 to Nautilus 14.2.12 we enabled msgr2. Soon after that both of the MDS servers (active / active-standby) restarted. We did not hit any ASSERTS this time, so that's good :>. However, I have not seen this happening on four different test
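For context, enabling msgr2 on a Nautilus cluster and checking which address families the mons advertise are the standard steps below (generic commands, not quoted from the thread):

    ceph mon enable-msgr2
    ceph mon dump    # mons with msgr2 enabled list v2:<ip>:3300 alongside v1:<ip>:6789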

[ceph-users] Re: Fix PGs states

2020-10-29 Thread Ing . Luis Felipe Domínguez Vega
Great, thanks. I fixed all the unknown PGs with that command; now the incomplete, down, etc. ones are left. On 2020-10-29 23:57, 胡 玮文 wrote: Hi, I have not tried it, but maybe this will help with the unknown PGs, if you don't care about data loss. ceph osd force-create-pg On 2020-10-30 at 10:46, Ing. Luis

[ceph-users] Re: Fix PGs states

2020-10-29 Thread 胡 玮文
Hi, I have not tried it, but maybe this will help with the unknown PGs, if you don't care about data loss. ceph osd force-create-pg On 2020-10-30 at 10:46, Ing. Luis Felipe Domínguez Vega wrote: Hi: I have this ceph status:
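The suggested command takes a PG id and, on recent releases, an explicit confirmation flag. A sketch with a hypothetical PG id (2.1a is a placeholder, not from the thread); note the recreated PG comes back empty:

    ceph pg ls | grep unknown                              # find the affected PG ids
    ceph osd force-create-pg 2.1a --yes-i-really-mean-it   # DATA LOSS: recreates PG 2.1a as an empty PG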

[ceph-users] Fix PGs states

2020-10-29 Thread Ing . Luis Felipe Domínguez Vega
Hi: I have this ceph status:
-
  cluster:
    id:     039bf268-b5a6-11e9-bbb7-d06726ca4a78
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            Reduced data availability: 191 pgs

[ceph-users] bluefs mount failed(crash) after a long time

2020-10-29 Thread Elians Wan
Can anyone help? Bluefs mount failed after a long time. The error messages:
2020-10-30 05:33:54.906725 7f1ad73f5e00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-30/block size 7.28TiB
2020-10-30 05:33:54.906758 7f1ad73f5e00  1 bluefs mount
2020-10-30 06:00:32.881850 7f1ad73f5e00 -1
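When bluefs fails to mount, one commonly suggested first diagnostic is an offline consistency check of that OSD with the daemon stopped; this is a generic sketch (the OSD id is taken from the log path above), not a step from the original report:

    systemctl stop ceph-osd@30
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-30   # a deep fsck variant exists for a more thorough check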

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Stefan Kooman
On 2020-10-29 06:55, Mark Johnson wrote: > I've been struggling with this one for a few days now. We had an OSD report > as near full a few days ago. Had this happen a couple of times before and a > reweight-by-utilization has sorted it out in the past. Tried the same again > but this time
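For readers following along, the reweight step mentioned here can be dry-run before applying it (standard commands; 120 is the usual overload-threshold argument, not a value from the thread):

    ceph osd test-reweight-by-utilization 120   # report what would change, no effect
    ceph osd reweight-by-utilization 120        # actually lower the weight of overfull OSDs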

[ceph-users] Re: Huge HDD ceph monitor usage [EXT]

2020-10-29 Thread Frank Schilder
> ... I will now use only one site, but first need to stabilize the cluster to remove the EC (erasure coding) pool and use replication ... If you change to one site only, there is no point in getting rid of the EC pool. Your main problem will be restoring the lost data. Do you have backups of everything?

[ceph-users] Re: frequent Monitor down

2020-10-29 Thread Janne Johansson
On Thu, 29 Oct 2020 at 20:16, Tony Liu wrote: > Typically, the number of nodes is 2n+1 to cover n failures. > It's OK to have 4 nodes; from a failure-coverage POV, it's the same > as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes are down, the > cluster is down. It works, it just doesn't make much sense.

[ceph-users] How to reset Log Levels

2020-10-29 Thread Ml Ml
Hello, I played around with some log level I can't remember, and my logs are now getting bigger than my DVD movie collection. E.g.: journalctl -b -u ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service > out.file is 1.1 GB big. I did already try: ceph tell mon.ceph03 config set debug_mon
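A way to put the monitor debug level back to its default and reclaim journal space might look like the following (a sketch: 1/5 is the stock default for debug_mon, and the vacuum size is an arbitrary example):

    ceph tell mon.ceph03 config set debug_mon 1/5   # back to the default level
    ceph config rm mon debug_mon                    # drop any persistent override in the config database
    journalctl --vacuum-size=500M                   # trim the already-written journal on the host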

[ceph-users] Re: frequent Monitor down

2020-10-29 Thread Tony Liu
Typically, the number of nodes is 2n+1 to cover n failures. It's OK to have 4 nodes; from a failure-coverage POV, it's the same as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes are down, the cluster is down. It works, it just doesn't make much sense. Thanks! Tony > -Original Message- > From: Marc

[ceph-users] Re: Huge HDD ceph monitor usage [EXT]

2020-10-29 Thread Ing . Luis Felipe Domínguez Vega
Uff.. now two of the OSDs are crashing with... https://pastebin.ubuntu.com/p/qd6Tc2rpfm/ On 2020-10-29 13:11, Frank Schilder wrote: ... I will now use only one site, but first need to stabilize the cluster to remove the EC (erasure coding) pool and use replication ... If you change to one site only,

[ceph-users] Re: Huge HDD ceph monitor usage [EXT]

2020-10-29 Thread Ing . Luis Felipe Domínguez Vega
Thanks for the response... I don't have the old OSDs (and no backups, because this cluster is not that important; it is the development cluster), so I need to delete the unknown PGs (how can I do that?). But I don't want to wipe the whole Ceph cluster; if I can delete just the unknown and incomplete PGs, well

[ceph-users] Very high read IO during backfilling

2020-10-29 Thread Kamil Szczygieł
Hi, We're running Octopus and we have 3 control plane nodes (12 cores, 64 GB memory each) that are running mon, mds and mgr, and also 4 data nodes (12 cores, 256 GB memory, 13x10TB HDDs each). We've increased the number of PGs in our pool, which resulted in all OSDs going crazy and reading the

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
Hi Alex, We found that there were a huge number of keys in the "logm" and "osdmap" tables while using ceph-monstore-tool. I think that could be the root cause. Well, some pages also say that disabling the 'insight' module can resolve this issue, but I checked our cluster and we didn't enable this
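The key counting described here is done against the store with the mon stopped; a sketch of the kind of invocation involved (the store path is a placeholder for the local mon's data directory):

    ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys | awk '{print $1}' | sort | uniq -c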

[ceph-users] Re: Monitor persistently out-of-quorum

2020-10-29 Thread Ki Wong
Thanks, David. I just double checked and they can all connect to one another, on both v1 and v2 ports. -kc > On Oct 29, 2020, at 12:41 AM, David Caro wrote: > > On 10/28 17:26, Ki Wong wrote: >> Hello, >> >> I am at my wit's end. >> >> So I made a mistake in the configuration of my router
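For anyone debugging a similar quorum problem, the per-port reachability check plus a look at the local mon's own view of the monmap might look like this (hostname is a placeholder):

    nc -vz mon1 3300 && nc -vz mon1 6789        # msgr2 and msgr1 ports
    ceph daemon mon.$(hostname -s) mon_status   # rank, state and the monmap this mon sees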

[ceph-users] Re: Not all OSDs in rack marked as down when the rack fails

2020-10-29 Thread Dan van der Ster
Hi Wido, Could it be one of these? mon_osd_min_up_ratio / mon_osd_min_in_ratio. 36/120 is 0.3, so it might be one of those magic ratios at play. Cheers, Dan On Thu, 29 Oct 2020, 18:05 Wido den Hollander, wrote: > Hi, > > I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
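The two options Dan mentions can be read straight off a monitor's admin socket; mon_osd_min_up_ratio defaults to 0.3, which is what makes 36/120 suspicious (generic commands, not from the thread):

    ceph daemon mon.$(hostname -s) config get mon_osd_min_up_ratio
    ceph daemon mon.$(hostname -s) config get mon_osd_min_in_ratio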

[ceph-users] Not all OSDs in rack marked as down when the rack fails

2020-10-29 Thread Wido den Hollander
Hi, I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as down when the network is cut to that rack. Situation: - Nautilus cluster - 3 racks - 120 OSDs, 40 per rack We performed a test where we turned off the network Top-of-Rack for each rack. This worked as expected with

[ceph-users] Re: How to reset Log Levels

2020-10-29 Thread Patrick Donnelly
On Thu, Oct 29, 2020 at 9:26 AM Ml Ml wrote: > > Hello, > I played around with some log level I can't remember and my logs are > now getting bigger than my DVD movie collection. > E.g.: journalctl -b -u > ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service > > out.file is 1.1 GB big. > >

[ceph-users] Re: Very high read IO during backfilling

2020-10-29 Thread Eugen Block
Hi, you could lower the recovery settings to the defaults and see if that helps: osd_max_backfills = 1, osd_recovery_max_active = 3. Regards, Eugen Quoting Kamil Szczygieł: Hi, We're running Octopus and we have 3 control plane nodes (12 core, 64 GB memory each) that are running mon, mds
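Those values can be applied at runtime without restarting anything; a sketch using either the centralized config or injectargs (same option names Eugen gives):

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 3
    # or only for the currently running OSDs:
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3'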

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Frank Schilder
I think you really need to sit down and explain the full story. Dropping one-liners with new information will not work via e-mail. I have never heard of the problem you are facing, so you did something that possibly no-one else has done before. Unless we know the full history from the last

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Alex Gracie
We hit this issue over the weekend on our HDD backed EC Nautilus cluster while removing a single OSD. We also did not have any luck using compaction. The mon-logs filled up our entire root disk on the mon servers and we were running on a single monitor for hours while we tried to finish

[ceph-users] Cloud Sync Module

2020-10-29 Thread Sailaja Yedugundla
I am trying to configure the cloud sync module in my Ceph cluster to implement backup to AWS S3. I could not get it configured using the available documentation. Can someone help me implement this? Thanks, Sailaja
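As a rough pointer only (not a tested recipe), the cloud sync module is configured by creating a zone with tier type "cloud" inside an existing zonegroup and pointing its tier config at the remote S3 endpoint; zonegroup/zone names, endpoint and credentials below are placeholders:

    radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=cloud-backup --tier-type=cloud
    radosgw-admin zone modify --rgw-zone=cloud-backup \
        --tier-config=connection.endpoint=https://s3.amazonaws.com,connection.access_key=<KEY>,connection.secret=<SECRET>
    radosgw-admin period update --commit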

[ceph-users] Re: dashboard object gateway not working

2020-10-29 Thread Siegfried Höllrigl
On the machines with the radosgateways, there is also a haproxy running (and it does the https->http conversion). I have tried it both ways already: on port 443 (resolving to the external IP) and on the internal port (with a hosts entry pointing to the internal IP; on the machine where the

[ceph-users] Re: Monitor persistently out-of-quorum

2020-10-29 Thread Stefan Kooman
On 2020-10-29 01:26, Ki Wong wrote: > Hello, > > I am at my wit's end. > > So I made a mistake in the configuration of my router and one > of the monitors (out of 3) dropped out of the quorum and nothing > I’ve done allow it to rejoin. That includes reinstalling the > monitor with ceph-ansible.

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
I then followed someone's guidance, added 'mon compact on start = true' to the config and restarted one mon. That mon did not join the cluster until I added two mons deployed on virtual machines with SSDs into the cluster. And now the cluster is fine except for the PG status.
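Besides the compact-on-start option, a running monitor can also be asked to compact its store online, which avoids a restart (standard command; the mon id is a placeholder):

    ceph tell mon.<id> compact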

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
Hi, I was so anxious a few hours ago because the SST files were growing so fast and I didn't think the space on the mon servers could hold it. Let me tell it from the beginning. I have a cluster with OSDs deployed on SATA (7200 rpm) disks, 10 TB per OSD, and I used an EC pool for more space. I added new OSDs into

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
He he. > It will prevent OSDs from being marked out if you shut them down or the . ... down or the MONs lose heartbeats due to high network load during peering. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
It will prevent OSDs from being marked out if you shut them down or the . Changing PG counts does not require shutting down OSDs, but sometimes OSDs get overloaded by peering traffic and the MONs can lose contact for a while. Setting noout will prevent flapping and also reduce the
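Setting and clearing the flag are one-liners, which is why it is a cheap safety net around this kind of work (standard commands, added for completeness):

    ceph osd set noout      # before the PG changes / expected peering load
    ceph osd unset noout    # once the cluster has settled again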

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Frank Schilder
This does not explain incomplete and inactive PGs. Are you hitting https://tracker.ceph.com/issues/46847 (see also the thread "Ceph does not recover from OSD restart")? In that case, temporarily stopping and restarting all new OSDs might help. Best regards, = Frank Schilder AIT Risø

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
CephFS pools are uncritical, because CephFS splits very large files into chunks of object size. The RGW pool is the problem, because RGW does not, as far as I know. A few 1 TB uploads and you have a problem. The calculation is confusing because the term PG is used in two different meanings,

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Frank Schilder
Your problem is the overall cluster health. The MONs store cluster history information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs only makes things worse right now. The health status is a mess, no MGR, a bunch of PGs inactive, etc. This is what you need to resolve. How

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
Hi Mark, it looks like you have some very large PGs. Also, you run with a quite low PG count, in particular, for the large pool. Please post the output of "ceph df" and "ceph osd pool ls detail" to see how much data is in each pool and some pool info. I guess you need to increase the PG count
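Collecting the requested information and the eventual pg_num change would be along these lines (pool name and target value are placeholders; on older releases pgp_num has to be raised alongside pg_num):

    ceph df
    ceph osd pool ls detail
    ceph osd pool set <pool> pg_num 256
    ceph osd pool set <pool> pgp_num 256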

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
I reset the pg_num after adding OSDs; it made some PGs inactive (stuck in the activating state). Frank Schilder wrote on Thu, Oct 29, 2020 at 3:56 PM: > This does not explain incomplete and inactive PGs. Are you hitting > https://tracker.ceph.com/issues/46847 (see also the thread "Ceph does not > recover from OSD restart")? In

[ceph-users] Re: frequent Monitor down

2020-10-29 Thread Marc Roos
Really? This is the first time I've read this here; AFAIK you can get a split brain like this. -Original Message- Sent: Thursday, October 29, 2020 12:16 AM To: Eugen Block Cc: ceph-users Subject: [ceph-users] Re: frequent Monitor down Eugen, I've got four physical servers and I've installed mon on

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Mark Johnson
Thanks again Frank. That gives me something to digest (and try to understand). One question regarding maintenance mode, these are production systems that are required to be available all the time. What, exactly, will happen if I issue this command for maintenance mode? Thanks, Mark On Thu,

[ceph-users] Re: Monitor persistently out-of-quorum

2020-10-29 Thread David Caro
On 10/28 17:26, Ki Wong wrote: > Hello, > > I am at my wit's end. > > So I made a mistake in the configuration of my router and one > of the monitors (out of 3) dropped out of the quorum and nothing > I’ve done allow it to rejoin. That includes reinstalling the > monitor with ceph-ansible. > >

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
After adding OSDs into the cluster, the recovery and backfill have not finished yet. Zhenshi Zhou wrote on Thu, Oct 29, 2020 at 3:29 PM: > I stopped the MGR because it took too much memory. > For the PG status, I added some OSDs to this cluster, and it > > Frank Schilder wrote on Thu, Oct 29, 2020 at 3:27 PM: > >>

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
I stopped the MGR because it took too much memory. For the PG status, I added some OSDs to this cluster, and it Frank Schilder wrote on Thu, Oct 29, 2020 at 3:27 PM: > Your problem is the overall cluster health. The MONs store cluster history > information that will be trimmed once it reaches HEALTH_OK.

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Mark Johnson
Thanks for your swift reply. Below is the requested information. I understand the bit about not being able to reduce the PG count, as we've come across this issue once before. This is the reason I've been hesitant to make any changes there without being 100% certain of getting it right and the

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
MISTAKE: the version is 14.2.12. Zhenshi Zhou wrote on Thu, Oct 29, 2020 at 2:38 PM: > My cluster is 12.2.12, with all SATA disks. > The space of store.db: > > How can I deal with it? > > Zhenshi Zhou wrote on Thu, Oct 29, 2020 at 2:37 PM: > >> Hi all, >> >> My cluster is in a bad state. SST files in

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
My cluster is 12.2.12, with all SATA disks. The space of store.db: How can I deal with it? Zhenshi Zhou wrote on Thu, Oct 29, 2020 at 2:37 PM: > Hi all, > > My cluster is in a bad state. SST files in /var/lib/ceph/mon/xxx/store.db > continue growing. It claims the mons are using a lot of disk

[ceph-users] monitor sst files continue growing

2020-10-29 Thread Zhenshi Zhou
Hi all, My cluster is in a bad state. SST files in /var/lib/ceph/mon/xxx/store.db keep growing. The cluster claims the mons are using a lot of disk space. I set "mon compact on start = true" and restarted one of the monitors, but it started compacting and ran for a long time; it seems to have no end.
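For anyone measuring the same symptom, the store size can be watched directly on the mon host and compared with the health warning the cluster raises (the path is a placeholder for the local mon directory):

    du -sh /var/lib/ceph/mon/*/store.db
    ceph health detail    # the "using a lot of disk space" warning appears as MON_DISK_BIG on Nautilus and later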