[ceph-users] Re: after upgrade to 16.2.3 16.2.4 and after adding few hdd's OSD's started to fail 1 by 1.

2021-05-14 Thread Igor Fedotov

This looks similar to #50656 indeed.

Hopefully we will fix that next week.


Thanks,

Igor

On 5/14/2021 9:09 PM, Neha Ojha wrote:

On Fri, May 14, 2021 at 10:47 AM Andrius Jurkus
 wrote:

Hello, I will try to keep it sad and short :) :( PS: sorry if this is a
duplicate, I tried to post it from the web as well.

Today I upgraded from 16.2.3 to 16.2.4 and added a few hosts and OSDs.
After data migration for a few hours, 1 SSD failed, then another and
another, 1 by 1. Now I have the cluster in pause and 5 failed SSDs; the same
hosts have both SSDs and HDDs, but only the SSDs are failing, so I think this
has to be a balancing/refilling bug or something similar, and probably not an
upgrade bug.

The cluster has been in pause for 4 hours and no more OSDs are failing.

full trace
https://pastebin.com/UxbfFYpb

This looks very similar to https://tracker.ceph.com/issues/50656.
Adding Igor for more ideas.

Neha


Now I'm googling and learning, but is there a way to easily test, let's say,
a 15.2.x version on an OSD without losing anything?

Any help would be appreciated.

The errors start like this:

May 14 16:58:52 dragon-ball-radar systemd[1]: Starting Ceph osd.2 for
4e01640b-951b-4f75-8dca-0bad4faf1b11...
May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
16:58:53.057836433 + UTC m=+0.454352919 container create
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
GIT_BRANCH=HEAD, maintainer=D
May 14 16:58:53 dragon-ball-radar systemd[1]: Started libcrun container.
May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
16:58:53.3394116 + UTC m=+0.735928098 container init
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
maintainer=Dimitri Savineau  ceph-volume lvm
activate successful for osd ID: 2
May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
16:58:53.8147653 + UTC m=+1.211281741 container died
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate)
May 14 16:58:55 dragon-ball-radar podman[113650]: 2021-05-14
16:58:55.044964534 + UTC m=+2.441480996 container remove
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
CEPH_POINT_RELEASE=-16.2.4, R
May 14 16:58:55 dragon-ball-radar podman[113909]: 2021-05-14
16:58:55.594265612 + UTC m=+0.369978347 container create
31364008fcb8b290643d6e892fba16d19618f5682f590373feabed23061749da
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2, RELEASE=HEAD,
org.label-schema.build-d
May 14 16:58:55 dragon-ball-radar podman[113909]: 2021-05-14
16:58:55.864589286 + UTC m=+0.640302021 container init
31364008fcb8b290643d6e892fba16d19618f5682f590373feabed23061749da
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2,
org.label-schema.schema-version=1.0, GIT
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 0 set uid:gid to 167:167
(ceph:ceph)
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 0 ceph version 16.2.4
(3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process
ceph-osd, pid 2
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 0 pidfile_write: ignore empty
--pid-file
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 1 bdev(0x564ad3a8c800
/var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.900+ 7fcf16aa2080 1 bdev(0x564ad3a8c800
/var/lib/ceph/osd/ceph-2/block) open size 500103643136 (0x747080,
466 GiB) block_size 4096 (4 KiB) non-rotational discard supported
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.900+ 7fcf16aa2080 1
bluestore(/var/lib/ceph/osd/ceph-2) _set_cache_sizes cache_size
3221225472 meta 0.45 kv 0.45 data 0.06
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.900+ 7fcf16aa2080 1 bdev(0x564ad3a8cc00
/var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.900+ 7fcf16aa2080 1 bdev(0x564ad3a8cc00
/var/lib/ceph/osd/ceph-2/block) open 

[ceph-users] Re: CRUSH rule for EC 6+2 on 6-node cluster

2021-05-14 Thread Bryan Stillwell
This works better than my solution.  It allows the cluster to put more PGs on 
the systems with more space on them:

# for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r 
'.pg_stats[].pgid'); do
>   echo $pg
>   for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
> ceph osd find $osd | jq -r '.host'
>   done | sort | uniq -c | sort -n -k1
> done
8.0
  1 excalibur
  1 mandalaybay
  2 aladdin
  2 harrahs
  2 paris
8.1
  1 aladdin
  1 excalibur
  1 harrahs
  1 mirage
  2 mandalaybay
  2 paris
8.2
  1 aladdin
  1 mandalaybay
  2 harrahs
  2 mirage
  2 paris
...

Thanks!
Bryan
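
A candidate rule can also be checked offline with crushtool before touching the
live map; a rough sketch (the rule id 1 and --num-rep 8 below are assumptions
that need adjusting to the actual map and pool):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the rule in crushmap.txt, then recompile and test the placements
crushtool -c crushmap.txt -o crushmap-new.bin
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 8 --show-mappings
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 8 --show-utilization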

> On May 13, 2021, at 2:58 AM, Ján Senko  wrote:
> 
> 
> Would something like this work?
> 
> step take default
> step choose indep 4 type host
> step chooseleaf indep 1 type osd
> step emit
> step take default
> step choose indep 0 type host
> step chooseleaf indep 1 type osd
> step emit
> 
> J.
> 
> ‐‐‐ Original Message ‐‐‐
> 
> On Wednesday, May 12th, 2021 at 17:58, Bryan Stillwell 
>  wrote:
> 
>> I'm trying to figure out a CRUSH rule that will spread data out across my 
>> cluster as much as possible, but not more than 2 chunks per host.
>> 
>> If I use the default rule with an osd failure domain like this:
>> 
>> step take default
>> 
>> step choose indep 0 type osd
>> 
>> step emit
>> 
>> I get clustering of 3-4 chunks on some of the hosts:
>> 
>> for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r 
>> '.pg_stats[].pgid'); do
>> ===
>> 
>>> echo $pg
>>> 
>>> for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
>>> 
>>> ceph osd find $osd | jq -r '.host'
>>> 
>>> done | sort | uniq -c | sort -n -k1
>> 
>> 8.0
>> 
>> 1 harrahs
>> 
>> 3 paris
>> 
>> 4 aladdin
>> 
>> 8.1
>> 
>> 1 aladdin
>> 
>> 1 excalibur
>> 
>> 2 mandalaybay
>> 
>> 4 paris
>> 
>> 8.2
>> 
>> 1 harrahs
>> 
>> 2 aladdin
>> 
>> 2 mirage
>> 
>> 3 paris
>> 
>> ...
>> 
>> However, if I change the rule to use:
>> 
>> step take default
>> 
>> step choose indep 0 type host
>> 
>> step chooseleaf indep 2 type osd
>> 
>> step emit
>> 
>> I get the data spread across 4 hosts with 2 chunks per host:
>> 
>> for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r 
>> '.pg_stats[].pgid'); do
>> ===
>> 
>>> echo $pg
>>> 
>>> for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
>>> 
>>> ceph osd find $osd | jq -r '.host'
>>> 
>>> done | sort | uniq -c | sort -n -k1
>>> 
>>> done
>> 
>> 8.0
>> 
>> 2 aladdin
>> 
>> 2 harrahs
>> 
>> 2 mandalaybay
>> 
>> 2 paris
>> 
>> 8.1
>> 
>> 2 aladdin
>> 
>> 2 harrahs
>> 
>> 2 mandalaybay
>> 
>> 2 paris
>> 
>> 8.2
>> 
>> 2 harrahs
>> 
>> 2 mandalaybay
>> 
>> 2 mirage
>> 
>> 2 paris
>> 
>> ...
>> 
>> Is it possible to get the data to spread out over more hosts? I plan on 
>> expanding the cluster in the near future and would like to see more hosts 
>> get 1 chunk instead of 2.
>> 
>> Also, before you recommend adding two more hosts and switching to a 
>> host-based failure domain, the cluster is on a variety of hardware with 
>> between 2-6 drives per host and drives that are 4TB-12TB in size (it's part 
>> of my home lab).
>> 
>> Thanks,
>> 
>> Bryan
>> 
>> ceph-users mailing list -- ceph-users@ceph.io
>> 
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] after upgrade to 16.2.3 16.2.4 and after adding few hdd's OSD's started to fail 1 by 1.

2021-05-14 Thread Andrius Jurkus

Hello, I will try to keep it sad and short :) :( PS: sorry if this is a
duplicate, I tried to post it from the web as well.


Today I upgraded from 16.2.3 to 16.2.4 and added a few hosts and OSDs.
After data migration for a few hours, 1 SSD failed, then another and
another, 1 by 1. Now I have the cluster in pause and 5 failed SSDs; the same
hosts have both SSDs and HDDs, but only the SSDs are failing, so I think this
has to be a balancing/refilling bug or something similar, and probably not an
upgrade bug.

The cluster has been in pause for 4 hours and no more OSDs are failing.

full trace 
https://pastebin.com/UxbfFYpb 


Now I'm googling and learning, but is there a way to easily test, let's say,
a 15.2.x version on an OSD without losing anything?

Any help would be appreciated. 

The errors start like this:


May 14 16:58:52 dragon-ball-radar systemd[1]: Starting Ceph osd.2 for
4e01640b-951b-4f75-8dca-0bad4faf1b11...
May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
16:58:53.057836433 + UTC m=+0.454352919 container create
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
GIT_BRANCH=HEAD, maintainer=D
May 14 16:58:53 dragon-ball-radar systemd[1]: Started libcrun container.
May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
16:58:53.3394116 + UTC m=+0.735928098 container init
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
maintainer=Dimitri Savineau org.label-schema.name=CentOS 
May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:

/usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
/usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-45e6ef2e-fbdc-4289-a900-3d1ffc81ee14/osd-block-973cfe73-06c8-4ea0-9aea-1361d063eb25
--path /var/lib/ceph/osd/ceph-2 --no-mon-config
May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
/usr/bin/ln -snf
/dev/ceph-45e6ef2e-fbdc-4289-a900-3d1ffc81ee14/osd-block-973cfe73-06c8-4ea0-9aea-1361d063eb25
/var/lib/ceph/osd/ceph-2/block
May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
/usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
/usr/bin/chown -R ceph:ceph /dev/dm-1
May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
/usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
May 14 16:58:53 dragon-ball-radar bash[113558]: --> ceph-volume lvm
activate successful for osd ID: 2
May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
16:58:53.8147653 + UTC m=+1.211281741 container died
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate)
May 14 16:58:55 dragon-ball-radar podman[113650]: 2021-05-14
16:58:55.044964534 + UTC m=+2.441480996 container remove
3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
CEPH_POINT_RELEASE=-16.2.4, R
May 14 16:58:55 dragon-ball-radar podman[113909]: 2021-05-14
16:58:55.594265612 + UTC m=+0.369978347 container create
31364008fcb8b290643d6e892fba16d19618f5682f590373feabed23061749da
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2, RELEASE=HEAD,
org.label-schema.build-d
May 14 16:58:55 dragon-ball-radar podman[113909]: 2021-05-14
16:58:55.864589286 + UTC m=+0.640302021 container init
31364008fcb8b290643d6e892fba16d19618f5682f590373feabed23061749da
(image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2,
org.label-schema.schema-version=1.0, GIT
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 0 set uid:gid to 167:167
(ceph:ceph)
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 0 ceph version 16.2.4
(3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process
ceph-osd, pid 2
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 0 pidfile_write: ignore empty
--pid-file
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug
2021-05-14T16:58:55.896+ 7fcf16aa2080 1 bdev(0x564ad3a8c800
/var/lib/ceph/osd/ceph-2/block) open path /var/lib/ceph/osd/ceph-2/block
May 14 16:58:55 dragon-ball-radar conmon[113957]: debug

[ceph-users] radosgw lost config during upgrade 14.2.16 -> 21

2021-05-14 Thread Jan Kasprzak
Hello,

I have just upgraded my cluster from 14.2.16 to 14.2.21, and after the
upgrade, radosgw was listening on the default port 7480 instead of the SSL port
it used before the upgrade. It might be I mishandled
"ceph config assimilate-conf" previously or forgot to restart radosgw
after the assimilate-conf or something. What is the correct
way to store radosgw configuration in ceph config?

I have the following (which I think worked previously, but I might be wrong,
e.g. forgot to restart radosgw or something):

# ceph config dump
[...]
client.rgw.   basic   rgw_frontends   beast ssl_port= 
ssl_certificate=/etc/pki/tls/certs/.crt+bundle 
ssl_private_key=/etc/pki/tls/private/.key *  

However, after rgw startup, there was the following in
/var/log/ceph/ceph-client.rgw..log:

2021-05-14 21:38:35.075 7f6ffd621900  1 mgrc service_daemon_register 
rgw. metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph 
version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) nautilus 
(stable),ceph_version_short=14.2.21,cpu=AMD 
...,distro=centos,distro_description=CentOS Linux 7 
(Core),distro_version=7,frontend_config#0=beast 
port=7480,frontend_type#0=beast,hostname=,kernel_description=#1 SMP 
...,kernel_version=...,mem_swap_kb=...,mem_total_kb=...,num_handles=1,os=Linux,pid=20451,zone_id=...,zone_name=default,zonegroup_id=...,zonegroup_name=default}
 

(note the port=7480 and no SSL).

After adding the following into /etc/ceph/ceph.conf on the host where
rgw is running, it started to use the correct SSL port again:

[client.rgw.]
rgw_frontends = beast ssl_port= 
ssl_certificate=/etc/pki/tls/certs/.crt+bundle 
ssl_private_key=/etc/pki/tls/private/.key

How can I configure this using "ceph config"?
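
What I would have expected to work is something along these lines, with
client.rgw.myhost and port 443 standing in for the redacted values, followed by
a radosgw restart (but maybe I am missing something):

ceph config set client.rgw.myhost rgw_frontends "beast ssl_port=443 ssl_certificate=/etc/pki/tls/certs/myhost.crt+bundle ssl_private_key=/etc/pki/tls/private/myhost.key"
ceph config dump | grep rgw_frontends

Presumably the entity name has to match the exact name the daemon runs as,
otherwise radosgw falls back to the defaults.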
Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: after upgrade to 16.2.3 16.2.4 and after adding few hdd's OSD's started to fail 1 by 1.

2021-05-14 Thread Neha Ojha
You are welcome! We still need to get to the bottom of this, I will
update the tracker to make a note of this occurrence.

Thanks,
Neha
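
For anyone else hitting this before a fix is released, the workaround Andrius
describes below amounts to something like this sketch (osd.2 is just an example
id):

ceph config set osd bluestore_allocator bitmap
# then restart each affected OSD, e.g.
ceph orch daemon restart osd.2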

On Fri, May 14, 2021 at 12:25 PM Andrius Jurkus
 wrote:
>
> Big thanks, Much appreciated help.
>
> It probably is same bug.
>
> bluestore_allocator = bitmap
>
> by setting this parameter all failed OSD started.
>
> Thanks again!
>
> On 2021-05-14 21:09, Neha Ojha wrote:
> > On Fri, May 14, 2021 at 10:47 AM Andrius Jurkus
> >  wrote:
> >>
> >> Hello, I will try to keep it sad and short :) :( PS: sorry if this is a
> >> duplicate, I tried to post it from the web as well.
> >>
> >> Today I upgraded from 16.2.3 to 16.2.4 and added a few hosts and OSDs.
> >> After data migration for a few hours, 1 SSD failed, then another and
> >> another, 1 by 1. Now I have the cluster in pause and 5 failed SSDs; the
> >> same hosts have both SSDs and HDDs, but only the SSDs are failing, so I
> >> think this has to be a balancing/refilling bug or something similar, and
> >> probably not an upgrade bug.
> >>
> >> The cluster has been in pause for 4 hours and no more OSDs are failing.
> >>
> >> full trace
> >> https://pastebin.com/UxbfFYpb
> >
> > This looks very similar to https://tracker.ceph.com/issues/50656.
> > Adding Igor for more ideas.
> >
> > Neha
> >
> >>
> >> Now I'm googling and learning, but is there a way to easily test, let's
> >> say, a 15.2.x version on an OSD without losing anything?
> >>
> >> Any help would be appreciated.
> >>
> >> Error start like this
> >>
> >> May 14 16:58:52 dragon-ball-radar systemd[1]: Starting Ceph osd.2 for
> >> 4e01640b-951b-4f75-8dca-0bad4faf1b11...
> >> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> >> 16:58:53.057836433 + UTC m=+0.454352919 container create
> >> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> >> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> >> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> >> GIT_BRANCH=HEAD, maintainer=D
> >> May 14 16:58:53 dragon-ball-radar systemd[1]: Started libcrun
> >> container.
> >> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> >> 16:58:53.3394116 + UTC m=+0.735928098 container init
> >> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> >> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> >> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> >> maintainer=Dimitri Savineau  >> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> >> 16:58:53.446921192 + UTC m=+0.843437626 container start
> >> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> >> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> >> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> >> GIT_BRANCH=HEAD, org.label-sch
> >> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> >> 16:58:53.447050119 + UTC m=+0.843566553 container attach
> >> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> >> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> >> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> >> org.label-schema.name=CentOS
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> >> /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> >> /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
> >> /dev/ceph-45e6ef2e-fbdc-4289-a900-3d1ffc81ee14/osd-block-973cfe73-06c8-4ea0-9aea-1361d063eb25
> >> --path /var/lib/ceph/osd/ceph-2 --no-mon-config
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> >> /usr/bin/ln -snf
> >> /dev/ceph-45e6ef2e-fbdc-4289-a900-3d1ffc81ee14/osd-block-973cfe73-06c8-4ea0-9aea-1361d063eb25
> >> /var/lib/ceph/osd/ceph-2/block
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> >> /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> >> /usr/bin/chown -R ceph:ceph /dev/dm-1
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> >> /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> >> May 14 16:58:53 dragon-ball-radar bash[113558]: --> ceph-volume lvm
> >> activate successful for osd ID: 2
> >> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> >> 16:58:53.8147653 + UTC m=+1.211281741 container died
> >> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> >> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> >> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate)
> >> May 14 16:58:55 dragon-ball-radar podman[113650]: 2021-05-14
> >> 16:58:55.044964534 + UTC m=+2.441480996 container remove
> >> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> 

[ceph-users] Re: "No space left on device" when deleting a file

2021-05-14 Thread Mark Schouten
On Tue, May 11, 2021 at 02:55:05PM +0200, Mark Schouten wrote:
> On Tue, May 11, 2021 at 09:53:10AM +0200, Mark Schouten wrote:
> > This helped me too. However, should I see num_strays decrease again?
> > I'm  running a `find -ls` over my CephFS tree..
> 
> This helps, the amount of stray files is slowly decreasing. But given
> the number of files in the cluster, it'll take a while ...


Deactivating one of the MDS'es triggered a lot of work too.
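
For reference, the stray count comes from the MDS perf counters and can be
watched with something like this (mds.<name> being the active daemon):

ceph daemon mds.<name> perf dump | jq '.mds_cache.num_strays'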

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade tips from Luminous to Nautilus?

2021-05-14 Thread Mark Schouten
On Mon, May 10, 2021 at 10:46:45PM +0200, Mark Schouten wrote:
> I still have three active ranks. Do I simply restart two of the MDS'es
> and force max_mds to one daemon, or is there a nicer way to move two
> mds'es from active to standby?

It seems (the documentation was no longer available, so it took some
searching) that I needed to run ceph mds deactivate $fs:$rank for every
MDS I wanted to deactivate.

That helped!
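
For what it's worth, on newer releases the explicit deactivate command is gone
and the same result is achieved by just lowering max_mds and letting the extra
ranks stop on their own, roughly:

ceph fs set <fs_name> max_mds 1
ceph status   # wait until only one rank remains active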

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm stalled after adjusting placement

2021-05-14 Thread Bryan Stillwell
I'm looking for help in figuring out why cephadm isn't making any progress 
after I told it to redeploy an mds daemon with:

ceph orch daemon redeploy mds.cephfs.aladdin.kgokhr ceph/ceph:v15.2.12


The output from 'ceph -W cephadm' just says:

2021-05-14T16:24:46.628084+ mgr.paris.glbvov [INF] Schedule redeploy daemon 
mds.cephfs.aladdin.kgokhr


However, the mds never gets redeployed.  I do see this warning in 'ceph health 
detail' which might have something to do with it:

Module 'cephadm' has failed: 'NoneType' object has no attribute 'target_id'


What steps can I do to figure out why cephadm is hung?
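
Some starting points that come to mind (not sure they are the right approach):

ceph log last cephadm                  # recent cephadm module log messages
ceph orch ps --daemon-type mds         # what the orchestrator thinks is running
ceph mgr fail paris.glbvov             # fail over the active mgr to restart the cephadm module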

Thanks,
Bryan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: after upgrade to 16.2.3 16.2.4 and after adding few hdd's OSD's started to fail 1 by 1.

2021-05-14 Thread Neha Ojha
On Fri, May 14, 2021 at 10:47 AM Andrius Jurkus
 wrote:
>
> Hello, I will try to keep it sad and short :) :( PS: sorry if this is a
> duplicate, I tried to post it from the web as well.
>
> Today I upgraded from 16.2.3 to 16.2.4 and added a few hosts and OSDs.
> After data migration for a few hours, 1 SSD failed, then another and
> another, 1 by 1. Now I have the cluster in pause and 5 failed SSDs; the same
> hosts have both SSDs and HDDs, but only the SSDs are failing, so I think this
> has to be a balancing/refilling bug or something similar, and probably not an
> upgrade bug.
>
> The cluster has been in pause for 4 hours and no more OSDs are failing.
>
> full trace
> https://pastebin.com/UxbfFYpb

This looks very similar to https://tracker.ceph.com/issues/50656.
Adding Igor for more ideas.

Neha

>
> Now I'm googling and learning, but is there a way to easily test, let's
> say, a 15.2.x version on an OSD without losing anything?
>
> Any help would be appreciated.
>
> Error start like this
>
> May 14 16:58:52 dragon-ball-radar systemd[1]: Starting Ceph osd.2 for
> 4e01640b-951b-4f75-8dca-0bad4faf1b11...
> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> 16:58:53.057836433 + UTC m=+0.454352919 container create
> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> GIT_BRANCH=HEAD, maintainer=D
> May 14 16:58:53 dragon-ball-radar systemd[1]: Started libcrun container.
> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> 16:58:53.3394116 + UTC m=+0.735928098 container init
> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> maintainer=Dimitri Savineau  May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> 16:58:53.446921192 + UTC m=+0.843437626 container start
> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> GIT_BRANCH=HEAD, org.label-sch
> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> 16:58:53.447050119 + UTC m=+0.843566553 container attach
> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> org.label-schema.name=CentOS
> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
> /dev/ceph-45e6ef2e-fbdc-4289-a900-3d1ffc81ee14/osd-block-973cfe73-06c8-4ea0-9aea-1361d063eb25
> --path /var/lib/ceph/osd/ceph-2 --no-mon-config
> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> /usr/bin/ln -snf
> /dev/ceph-45e6ef2e-fbdc-4289-a900-3d1ffc81ee14/osd-block-973cfe73-06c8-4ea0-9aea-1361d063eb25
> /var/lib/ceph/osd/ceph-2/block
> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> /usr/bin/chown -R ceph:ceph /dev/dm-1
> May 14 16:58:53 dragon-ball-radar bash[113558]: Running command:
> /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> May 14 16:58:53 dragon-ball-radar bash[113558]: --> ceph-volume lvm
> activate successful for osd ID: 2
> May 14 16:58:53 dragon-ball-radar podman[113650]: 2021-05-14
> 16:58:53.8147653 + UTC m=+1.211281741 container died
> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate)
> May 14 16:58:55 dragon-ball-radar podman[113650]: 2021-05-14
> 16:58:55.044964534 + UTC m=+2.441480996 container remove
> 3b44520aa651b8196cd0bf0c96daa2bd03845ef5f8cfaf9a689410a1f98d84dd
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2-activate,
> CEPH_POINT_RELEASE=-16.2.4, R
> May 14 16:58:55 dragon-ball-radar podman[113909]: 2021-05-14
> 16:58:55.594265612 + UTC m=+0.369978347 container create
> 31364008fcb8b290643d6e892fba16d19618f5682f590373feabed23061749da
> (image=docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949,
> name=ceph-4e01640b-951b-4f75-8dca-0bad4faf1b11-osd.2, RELEASE=HEAD,
> org.label-schema.build-d
> May 14 16:58:55 dragon-ball-radar podman[113909]: 2021-05-14
> 16:58:55.864589286 

[ceph-users] ceph-Dokan on windows 10 not working after upgrade to pacific

2021-05-14 Thread Robert W. Eckert
Hi - I recently upgraded to Pacific, and I am now getting an error connecting from
my Windows 10 machine.
The error is handle_auth_bad_method; I tried a few combinations of
cephx/none on the monitors, but I keep getting the same error.

The same config (with paths updated) and keyring works on my WSL instance
running an old Luminous client (I can't seem to get a newer client to
install).

Do you have any suggestions on where to look?
Thanks,
Rob.
-

PS C:\Program Files\Ceph\bin> .\ceph-dokan.exe --id rob -l Q
2021-05-14T12:19:58.172Eastern Daylight Time 5 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2]
failed to fetch mon config (--no-mon-config to skip)

PS C:\Program Files\Ceph\bin> cat  c:/ProgramData/ceph/ceph.client.rob.keyring
[client.rob]
key = 
caps mon = "allow rwx"
caps osd = "allow rwx"

PS C:\Program Files\Ceph\bin> cat C:\ProgramData\Ceph\ceph.conf
# minimal ceph.conf
[global]
log to stderr = true
; Uncomment the following in order to use the Windows Event Log
log to syslog = true

run dir = C:/ProgramData/ceph/out
crash dir = C:/ProgramData/ceph/out

; Use the following to change the cephfs client log level
debug client = 2
[global]
fsid = 
mon_host = []
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
[client]
keyring = c:/ProgramData/ceph/ceph.client.rob.keyring
log file = C:/ProgramData/ceph/out/$name.$pid.log
admin socket = C:/ProgramData/ceph/out/$name.$pid.asok
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.3 issues during upgrade from 15.2.10 with cephadm/lvm list

2021-05-14 Thread David Orman
We've created a PR to fix the root cause of this issue:
https://github.com/alfredodeza/remoto/pull/63
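
In the meantime, checking which conmon build a host has (relative to the
versions Sage mentions below) is just:

conmon --version
# or, on RPM-based hosts
rpm -q conmon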

Thank you,
David

On Mon, May 10, 2021 at 7:29 PM David Orman  wrote:
>
> Hi Sage,
>
> We've got 2.0.27 installed. I restarted all the manager pods, just in
> case, and I have the same behavior afterwards.
>
> David
>
> On Mon, May 10, 2021 at 6:53 PM Sage Weil  wrote:
> >
> > The root cause is a bug in conmon.  If you can upgrade to >= 2.0.26
> > this will also fix the problem.  What version are you using?  The
> > kubic repos currently have 2.0.27.  See
> > https://build.opensuse.org/project/show/devel:kubic:libcontainers:stable
> >
> > We'll make sure the next release has the verbosity workaround!
> >
> > sage
> >
> > On Mon, May 10, 2021 at 5:47 PM David Orman  wrote:
> > >
> > > I think I may have found the issue:
> > >
> > > https://tracker.ceph.com/issues/50526
> > > It seems it may be fixed in: https://github.com/ceph/ceph/pull/41045
> > >
> > > I hope this can be prioritized as an urgent fix as it's broken
> > > upgrades on clusters of a relatively normal size (14 nodes, 24x OSDs,
> > > 2x NVME for DB/WAL w/ 12 OSDs per NVME), even when new OSDs are not
> > > being deployed, as it still tries to apply the OSD specification.
> > >
> > > On Mon, May 10, 2021 at 4:03 PM David Orman  wrote:
> > > >
> > > > Hi,
> > > >
> > > > We are seeing the mgr attempt to apply our OSD spec on the various
> > > > hosts, then block. When we investigate, we see the mgr has executed
> > > > cephadm calls like so, which are blocking:
> > > >
> > > > root 1522444  0.0  0.0 102740 23216 ?S17:32   0:00
> > > >  \_ /usr/bin/python3
> > > > /var/lib/ceph/X/cephadm.30cb78bdbbafb384af862e1c2292b944f15942b586128e91262b43e91e11ae90
> > > > --image 
> > > > docker.io/ceph/ceph@sha256:694ba9cdcbe6cb7d25ab14b34113c42c2d1af18d4c79c7ba4d1f62cf43d145fe
> > > > ceph-volume --fsid X -- lvm list --format json
> > > >
> > > > This occurs on all hosts in the cluster, following
> > > > starting/restarting/failing over a manager. It's blocking an
> > > > in-progress upgrade post-manager updates on one cluster, currently.
> > > >
> > > > Looking at the cephadm logs on the host(s) in question, we see the
> > > > last entry appears to be truncated, like:
> > > >
> > > > 2021-05-10 17:32:06,471 INFO /usr/bin/podman:
> > > > "ceph.db_uuid": "1n2f5v-EEgO-1Kn6-hQd2-v5QF-AN9o-XPkL6b",
> > > > 2021-05-10 17:32:06,471 INFO /usr/bin/podman:
> > > > "ceph.encrypted": "0",
> > > > 2021-05-10 17:32:06,471 INFO /usr/bin/podman:
> > > > "ceph.osd_fsid": "",
> > > > 2021-05-10 17:32:06,471 INFO /usr/bin/podman:
> > > > "ceph.osd_id": "205",
> > > > 2021-05-10 17:32:06,471 INFO /usr/bin/podman:
> > > > "ceph.osdspec_affinity": "osd_spec",
> > > > 2021-05-10 17:32:06,471 INFO /usr/bin/podman:
> > > > "ceph.type": "block",
> > > >
> > > > The previous entry looks like this:
> > > >
> > > > 2021-05-10 17:32:06,469 INFO /usr/bin/podman:
> > > > "ceph.db_uuid": "TMTPD5-MLqp-06O2-raqp-S8o5-TfRG-hbFmpu",
> > > > 2021-05-10 17:32:06,469 INFO /usr/bin/podman:
> > > > "ceph.encrypted": "0",
> > > > 2021-05-10 17:32:06,469 INFO /usr/bin/podman:
> > > > "ceph.osd_fsid": "",
> > > > 2021-05-10 17:32:06,469 INFO /usr/bin/podman:
> > > > "ceph.osd_id": "195",
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman:
> > > > "ceph.osdspec_affinity": "osd_spec",
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman:
> > > > "ceph.type": "block",
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman: 
> > > > "ceph.vdo": "0"
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman: },
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman: "type": 
> > > > "block",
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman: "vg_name":
> > > > "ceph-ffd1a4a7-316c-4c85-acde-06459e26f2c4"
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman: }
> > > > 2021-05-10 17:32:06,470 INFO /usr/bin/podman: ],
> > > >
> > > > We'd like to get to the bottom of this, please let us know what other
> > > > information we can provide.
> > > >
> > > > Thank you,
> > > > David
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
Yup, I just saw, should have 3GB :/ I will wait until the system goes back to 
normal and will increase.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Irek Fasikhov 
Sent: Friday, May 14, 2021 6:28 PM
To: Szabo, Istvan (Agoda) 
Cc: Konstantin Shalygin ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io 
turn to true

Hi.

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/ceph_object_gateway_for_production/deploying_a_cluster#reserving_free_memory_for_osds

Fri, 14 May 2021 at 14:21, Szabo, Istvan (Agoda)
mailto:istvan.sz...@agoda.com>>:
How much is yours? Mine is vm.min_free_kbytes = 90112.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com>
---

From: Konstantin Shalygin mailto:k0...@k0ste.ru>>
Sent: Friday, May 14, 2021 6:07 PM
To: Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] [Suspicious newsletter] Re: bluefs_buffered_io turn 
to true

I suggest to look into vm.min_free_kbytes kernel option, and increase it twice


k


On 14 May 2021, at 13:45, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>>
 wrote:

Is there anything that should be set just to make sure an OOM kill does not
happen? Or nothing?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
How much is yours? Mine is vm.min_free_kbytes = 90112.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Konstantin Shalygin 
Sent: Friday, May 14, 2021 6:07 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] [Suspicious newsletter] Re: bluefs_buffered_io turn 
to true

I suggest to look into vm.min_free_kbytes kernel option, and increase it twice


k


On 14 May 2021, at 13:45, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:

Is there anything that should be set just to make sure an OOM kill does not
happen? Or nothing?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
Is there anything that should be set just to make sure an OOM kill does not
happen? Or nothing?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Konstantin Shalygin 
Sent: Friday, May 14, 2021 5:32 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] [Suspicious newsletter] Re: bluefs_buffered_io turn 
to true

It's enough, should be true now...


k


On 14 May 2021, at 12:51, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:

Did I do something wrong?
I set the bluefs option in the global config and restarted ceph.target on the
OSD node :/ ?

Does this need something special to apply?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
OK, it seems like it doesn't go below 600MB out of the 256GB; let's wait until
the PG degradation is healed.

Did I do something wrong?
I set the bluefs option in the global config and restarted ceph.target on the
OSD node :/ ?

Does this need something special to apply?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Konstantin Shalygin 
Sent: Friday, May 14, 2021 3:26 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] [Suspicious newsletter] Re: bluefs_buffered_io turn 
to true

Nope, the kernel reserves enough memory to free under pressure; for example, a
36-OSD, 0.5 TiB RAM host:

              total        used        free      shared  buff/cache   available
Mem:           502G        168G        2.9G         18M        331G        472G
Swap:          952M        248M        704M


k


On 14 May 2021, at 11:20, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:

When does this stop? When it dies … :D



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mon vanished after cephadm upgrade

2021-05-14 Thread Ashley Merrick
Hello, it is not listed under ceph -s, and ceph -s reports no issues on the cluster.
It is listed under orch ps and the dashboard, but reports "mon.sn-m01 sn-m01 stopped
114s ago  4M  -  ". Let me know if there is anything
else useful you would like before I try to remove and redeploy. Thanks
> On Fri May 14 2021 21:44:11 GMT+0800 (Singapore Standard Time), Sebastian 
> Wagner  wrote:
> Hi Ashley,
> 
> is sn-m01 listed in `ceph -s`? Which hosts are listed in `ceph orch ps
> --daemon-type mon`?
> 
> 
> Otherwise, there are two helpful commands now:
> 
>   * `ceph orch daemon rm mon.sn-m01` to remove the mon
>   * `ceph orch daemon start mon.sn-m01` to start it again
> 
> On 14.05.21 at 14:14, Ashley Merrick wrote:
>> I had a 3-mon Ceph cluster; after updating from 15.2.x to 16.2.x one of my
>> mons is showing as a stopped state in the Ceph Dashboard. Checking the
>> cephadm logs on the server in question I can see "/usr/bin/docker: Error: No
>> such object: ceph-30449cba-44e4-11eb-ba64-dda10beff041-mon.sn-m01". There are a
>> few OSD services running on the same physical server and they all are
>> starting/running fine via Docker. I tried to do a cephadm apply mon to push a
>> new mon to the same host, but it seems to not do anything; nothing shows in
>> the same log file on sn-m01. Also, ceph -s shows full health and no errors and
>> has no trace of the "failed" mon (not sure if this is expected); only in the
>> Ceph dashboard under services can I see the stopped, not-running mon.
>>   
>> Sent via MXlogin
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> 

 
Sent via MXlogin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mon vanished after cephadm upgrade

2021-05-14 Thread Sebastian Wagner

Hi Ashley,

is sn-m01 listed in `ceph -s`? Which hosts are listed in `ceph orch ps
--daemon-type mon`?



Otherwise, there are two helpful commands now:

 * `ceph orch daemon rm mon.sn-m01` to remove the mon
 * `ceph orch daemon start mon.sn-m01` to start it again

On 14.05.21 at 14:14, Ashley Merrick wrote:

I had a 3-mon Ceph cluster; after updating from 15.2.x to 16.2.x one of my mons is showing as a
stopped state in the Ceph Dashboard. Checking the cephadm logs on the server in question I can
see "/usr/bin/docker: Error: No such object:
ceph-30449cba-44e4-11eb-ba64-dda10beff041-mon.sn-m01". There are a few OSD services running on
the same physical server and they all are starting/running fine via Docker. I tried to do a cephadm
apply mon to push a new mon to the same host, but it seems to not do anything; nothing shows in the
same log file on sn-m01. Also, ceph -s shows full health and no errors and has no trace of the
"failed" mon (not sure if this is expected); only in the Ceph dashboard under services
can I see the stopped, not-running mon.
  
Sent via MXlogin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Limit memory of ceph-mgr

2021-05-14 Thread mabi
Hello,

I just noticed on my small Octopus cluster that the ceph-mgr on a mgr/mon node 
uses 3.6GB of resident memory (RES) as you can see below from the top output:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2704 167   20   0 5030528   3.6g  35796 S   6.6  47.2  23:08.18 ceph-mgr
   2699 167   20   0 1291504 884796  23672 S   4.6  11.1  13:23.63 ceph-mon

Is there a way to limit the memory usage of ceph-mgr just like one can do with 
ceph OSD (osd_memory_target)?

I tried something like mgr_memory_target but that parameter does not exist.

Thanks,
Mabi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
When does this stop? When it dies … :D

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Konstantin Shalygin 
Sent: Friday, May 14, 2021 3:00 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: Re: [Suspicious newsletter] [ceph-users] Re: bluefs_buffered_io turn 
to true


On 14 May 2021, at 10:50, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:

Is it also normal that when this buffered_io is turned on, it eats all the memory
on the system? Hmmm.

This is what this option actually does - it uses all free memory as cache to
speed up bluefs



k


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] mon vanished after cephadm upgrade

2021-05-14 Thread Ashley Merrick
I had a 3-mon Ceph cluster; after updating from 15.2.x to 16.2.x one of my
mons is showing as a stopped state in the Ceph Dashboard. Checking the
cephadm logs on the server in question I can see "/usr/bin/docker: Error: No
such object: ceph-30449cba-44e4-11eb-ba64-dda10beff041-mon.sn-m01". There are a
few OSD services running on the same physical server and they all are
starting/running fine via Docker. I tried to do a cephadm apply mon to push a
new mon to the same host, but it seems to not do anything; nothing shows in the
same log file on sn-m01. Also, ceph -s shows full health and no errors and has no
trace of the "failed" mon (not sure if this is expected); only in the Ceph
dashboard under services can I see the stopped, not-running mon.
 
Sent via MXlogin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
Is it also normal that when this buffered_io is turned on, it eats all the memory
on the system? Hmmm.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Konstantin Shalygin 
Sent: Friday, May 14, 2021 2:12 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: Re: [Suspicious newsletter] [ceph-users] Re: bluefs_buffered_io turn 
to true

I recommend upgrading at least to 12.2.13; for Luminous, even .12 vs .13 is a
significant difference in code.



k

On 14 May 2021, at 09:22, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:

It is quite an older cluster, luminous 12.2.8.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin


> On 14 May 2021, at 14:20, Szabo, Istvan (Agoda)  
> wrote:
> 
> How much is yours? Mine is vm.min_free_kbytes = 90112.

I use 135168

> On 14 May 2021, at 14:31, Szabo, Istvan (Agoda)  
> wrote:
> 
> Yup, I just saw, should have 3GB :/ I will wait until the system goes back to 
> normal and will increase.
GBytes are okay too. You can write the value into the kernel at any time and the
kernel will reclaim RAM to this target.
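
A minimal sketch of doing that (3145728 KiB is roughly the 3 GiB mentioned
above; the sysctl.d file name is arbitrary):

# check the current value (in KiB)
sysctl vm.min_free_kbytes
# raise it at runtime
sysctl -w vm.min_free_kbytes=3145728
# persist it across reboots
echo 'vm.min_free_kbytes = 3145728' >> /etc/sysctl.d/90-ceph.conf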



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Irek Fasikhov
Hi.

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/ceph_object_gateway_for_production/deploying_a_cluster#reserving_free_memory_for_osds

Fri, 14 May 2021 at 14:21, Szabo, Istvan (Agoda) :

> How much is yours? Mine is vm.min_free_kbytes = 90112.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> From: Konstantin Shalygin 
> Sent: Friday, May 14, 2021 6:07 PM
> To: Szabo, Istvan (Agoda) 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] [Suspicious newsletter] Re: bluefs_buffered_io
> turn to true
>
> I suggest to look into vm.min_free_kbytes kernel option, and increase it
> twice
>
>
> k
>
>
> On 14 May 2021, at 13:45, Szabo, Istvan (Agoda)  > wrote:
>
> Is there anything that should be set just to make sure an OOM kill does not
> happen? Or nothing?
>
>
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin
I suggest looking into the vm.min_free_kbytes kernel option, and increasing it to twice its current value


k

> On 14 May 2021, at 13:45, Szabo, Istvan (Agoda)  
> wrote:
> 
> Is there anything that should be set just to make sure an OOM kill does not
> happen? Or nothing?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to "out" a mon/mgr node with orchestrator

2021-05-14 Thread mabi
Hello,

I need to re-install one node of my Octopus cluster (installed with cephadm) 
which is a mon/mgr node and did not find in the documentation how to do that 
with the new ceph orchestrator commands.

So my question would be what are the "ceph orch" commands I need to run in 
order to "out" nicely the mgr and mon services from that specific node?

I have a standby manager and 3 mons in total so from the redundancy it should 
be no problem to take that one node out for re-installing it.
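
From the docs I would guess it boils down to re-applying the mon and mgr
placements without that host and then removing the host, roughly like the
sketch below (node names made up), but I am not sure this is the intended way:

ceph orch apply mon --placement="node1,node2"
ceph orch apply mgr --placement="node1,node2"
ceph orch host rm node3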

Best regards,
Mabi



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin
It's enough, should be true now...
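
If in doubt, the value the running daemon actually uses can be checked on the
OSD host via the admin socket (osd.12 is just an example id):

ceph daemon osd.12 config show | grep bluefs_buffered_io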


k

> On 14 May 2021, at 12:51, Szabo, Istvan (Agoda)  
> wrote:
> 
> Did I do something wrong?
> I set the bluefs option in the global config and restarted ceph.target on
> the OSD node :/ ?
>
> Does this need something special to apply?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
Hi,

It is quite an older cluster, luminous 12.2.8.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Konstantin Shalygin  
Sent: Friday, May 14, 2021 1:12 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: [Suspicious newsletter] [ceph-users] Re: bluefs_buffered_io turn to 
true

Hi,

This is not normal; it's something different I think, like a CRUSH change on
restart. This option will be enabled by default again in the next Nautilus
release, so you can use it now with 14.2.19-20.


k

Sent from my iPhone

> On 14 May 2021, at 08:21, Szabo, Istvan (Agoda)  
> wrote:
> 
> Hi,
> 
> I had an issue with snaptrim after a huge amount of deleted data; it slows
> down the team's operations due to the snaptrim and snaptrim_wait PGs.
> 
> I've changed couple of things:
> 
> debug_ms = 0/0 #default 0/5
> osd_snap_trim_priority = 1 # default 5 
> osd_pg_max_concurrent_snap_trims = 1 # default 2
> 
> But didn't help.
> 
> I've found this thread about buffered io and seems like it helped to them:
> https://forum.proxmox.com/threads/ceph-storage-all-pgs-snaptrim-every-
> night-slowing-down-vms.71573/
> 
> I don't use swap on the OSD nodes, so I gave it a try on 1 OSD node and it
> basically caused the complete node's PGs to become degraded. Is that normal? I
> hope it will not rebalance the complete node because I don't have space for
> that. I changed it back but it is still slowly decreasing, so I'm not sure
> whether this setting is correct or whether this behavior is good or not.
> 
> 2021-05-14 12:18:11.447628 mon.2004 [WRN] Health check update: 
> 3353/91976715 objects misplaced (0.004%) (OBJECT_MISPLACED)
> 2021-05-14 12:18:11.447640 mon.2004 [WRN] Health check update: 
> Degraded data redundancy: 33078466/91976715 objects degraded 
> (35.964%), 254 pgs degraded, 253 pgs undersized (PG_DEGRADED)
> 
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW segmentation fault on Pacific 16.2.1 with multipart upload

2021-05-14 Thread Daniel Iwan
Precisely this!
Thank you very much for the links.

This caught me by surprise after upgrading my test cluster to 16.2.1. It looks
like a regression in Pacific, and as far as I understand the fix is not
included in 16.2.3.
Actually, while trying to solve the problem I upgraded to 16.2.3 at some
point, and RGW did not even start, with an error from systemd.
That is probably a completely separate issue, though.
Any hints on that one?
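
For reference, on a cephadm-managed cluster the exact startup error can usually 
be pulled from the daemon's journal first; a sketch, where the rgw daemon name 
is a placeholder to be taken from the ps output:

# find the exact name and status of the rgw daemon(s)
ceph orch ps | grep rgw

# fetch that daemon's journal output through cephadm (name is a placeholder)
cephadm logs --name rgw.myrealm.host1.abcdefg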

Thanks again
Daniel

On Thu, 13 May 2021 at 14:22, Daniel Gryniewicz  wrote:

> This tracker:
> https://tracker.ceph.com/issues/50556
>
> and this PR:
> https://github.com/ceph/ceph/pull/41288
>
> Daniel
>
> On 5/12/21 7:00 AM, Daniel Iwan wrote:
> > Hi
> > I have started to see segfaults during multipart upload to one of the
> > buckets.
> > File is about 60MB in size
> > Upload of the same file to a brand new bucket works OK
> >
> > Command used
> > aws --profile=tester --endpoint=$HOST_S3_API --region="" s3 cp
> > ./pack-a9201afb4682b74c7c5a5d6070e661662bdfea1a.pack
> > s3://tester-bucket/pack-a9201afb4682b74c7c5a5d6070e661662bdfea1a.pack
> >
> > For some reason the log shows upload to tester-bucket-2 ???
> > Bucket tester-bucket-2 is owned by the same user TESTER.
> >
> > I'm using Ceph 16.2.1 (recently upgraded from Octopus).
> > Installed with cephadm in Docker
> > OS Ubuntu 18.04.5 LTS
> >
> > Logs show as below
> >
> > May 11 11:00:46 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:46.891+ 7ffb0e25e700  1 == starting new request
> > req=0x7ffa8e15d620 =
> > May 11 11:00:46 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:46.907+ 7ffb0b258700  1 == req done
> > req=0x7ffa8e15d620 op status=0 http_status=200 latency=0.011999841s
> ==
> > May 11 11:00:46 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:46.907+ 7ffb0b258700  1 beast: 0x7ffa8e15d620:
> > 11.1.150.14 - TESTER [11/May/2021:11:00:46.891 +] "POST
> >
> /tester-bucket-2/pack-a9201afb4682b74c7c5a5d6070e661662bdfea1a.pack?uploads
> > HTTP/1.1" 200 296 - "aws-cli/2.1.23 Python/3.7.3
> > Linux/4.19.128-microsoft-standard exe/x86_64.ubuntu.18 p
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.055+ 7ffb09254700  1 == starting new request
> > req=0x7ffa8e15d620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.355+ 7ffb51ae5700  1 == starting new request
> > req=0x7ffa8e0dc620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.355+ 7ffb4eadf700  1 == starting new request
> > req=0x7ffa8e05b620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.355+ 7ffb46acf700  1 == starting new request
> > req=0x7ffa8df59620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.355+ 7ffb44acb700  1 == starting new request
> > req=0x7ffa8ded8620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.355+ 7ffb3dabd700  1 == starting new request
> > req=0x7ffa8dfda620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.359+ 7ffb1d27c700  1 == starting new request
> > req=0x7ffa8de57620 =
> > May 11 11:00:47 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:47.359+ 7ffb22a87700  1 == starting new request
> > req=0x7ffa8ddd6620 =
> > May 11 11:00:48 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:48.275+ 7ffb2d29c700  1 == req done
> > req=0x7ffa8e15d620 op status=0 http_status=200 latency=1.219983697s
> ==
> > May 11 11:00:48 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:48.275+ 7ffb2d29c700  1 beast: 0x7ffa8e15d620:
> > 11.1.150.14 - TESTER [11/May/2021:11:00:47.055 +] "PUT
> >
> /tester-bucket-2/pack-a9201afb4682b74c7c5a5d6070e661662bdfea1a.pack?uploadId=2~JhGavMwngl_FH6-LcE2vFxMRjcf4qTF=8
> > HTTP/1.1" 200 2485288 - "aws-cli/2.1.23 Python/3.7.3 Linux
> > May 11 11:00:54 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:54.695+ 7ffad89f3700  1 == req done
> > req=0x7ffa8ddd6620 op status=0 http_status=200 latency=7.335902214s
> ==
> > May 11 11:00:54 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:54.695+ 7ffad89f3700  1 beast: 0x7ffa8ddd6620:
> > 11.1.150.14 - TESTER [11/May/2021:11:00:47.359 +] "PUT
> >
> /tester-bucket-2/pack-a9201afb4682b74c7c5a5d6070e661662bdfea1a.pack?uploadId=2~JhGavMwngl_FH6-LcE2vFxMRjcf4qTF=6
> > HTTP/1.1" 200 8388608 - "aws-cli/2.1.23 Python/3.7.3 Linux
> > May 11 11:00:56 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:56.871+ 7ffb11a65700  1 == req done
> > req=0x7ffa8e0dc620 op status=0 http_status=200 latency=9.515872955s
> ==
> > May 11 11:00:56 ceph-om-vm-node1 bash[27881]: debug
> > 2021-05-11T11:00:56.871+ 7ffb11a65700  1 beast: 0x7ffa8e0dc620:
> > 11.1.150.14 - TESTER [11/May/2021:11:00:47.355 +] "PUT
> >
> 

[ceph-users] Re: Zabbix module Octopus 15.2.3

2021-05-14 Thread Gerdriaan Mulder
Hi Reed, Gert, list,

On 28/07/2020 23:42, Reed Dier wrote:
> I'm going to resurrect this thread to throw my hat in the ring as I am
> having this issue as well.

I did not see any solution to this thread (be it this one, or any more recent 
thread), so forgive me for re-resurrecting it :-).

> I just moved to 15.2.4 on Ubuntu 18.04/bionic, and Zabbix is 5.0.2.
>> $ ceph zabbix config-show
>> Error EINVAL: Traceback (most recent call last):
>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1167, in _handle_command
>>     return self.handle_command(inbuf, cmd)
>>   File "/usr/share/ceph/mgr/zabbix/module.py", line 407, in handle_command
>>     return 0, json.dumps(self.config, index=4, sort_keys=True), ''
>>   File "/usr/lib/python3.6/json/__init__.py", line 238, in dumps
>>     **kw).encode(obj)
>> TypeError: __init__() got an unexpected keyword argument 'index'
> 
> Which looks to be exactly the same as your error.

As I ran into this exact error as well, I dug a bit deeper into the source
code. I am currently running ceph-mgr-modules-core on Octopus 15.2.6 / Ubuntu
20.04:

# ceph zabbix config-show
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1177, in _handle_command
return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/zabbix/module.py", line 407, in handle_command
return 0, json.dumps(self.config, index=4, sort_keys=True), ''
  File "/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
TypeError: __init__() got an unexpected keyword argument 'index'

This bug was introduced here
,
where `index=4` was used, but it should have been `indent=4`.

It has been fixed in

for pacific, as well as backported to octopus:

via .

Relevant bug tracker links:
* 
* 

On an updated git clone of :

$ git tag --contains c40d97ae23d912532ad0c9330b0ed96b1477f20c | sort -V
v15.2.8
v15.2.9
v15.2.10
v15.2.11
v15.2.12

So, you could upgrade to at least 15.2.8 to have this bug fixed. Or:
simply patch the affected file /usr/share/ceph/mgr/zabbix/module.py (see
the backported commit
),
disable and enable the zabbix module:

# ceph mgr module disable zabbix
# ceph mgr module enable zabbix
# ceph zabbix config-show
{
"discovery_interval": 100,
"identifier": "ceph.redacted.local",
"interval": 60,
"log_level": "",
"log_to_cluster": false,
"log_to_cluster_level": "info",
"log_to_file": false,
"zabbix_host": "zabbix.redacted.local",
"zabbix_port": 10051,
"zabbix_sender": "/usr/bin/zabbix_sender"
}
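
The in-place edit itself (done before the module is disabled and re-enabled as 
above) can be as small as a one-character change, for example (a sketch for a 
package-based install; a containerized mgr would need the edit inside its 
container, and keeping a backup is advisable):

# fix the typo directly in the module (keeps a .bak copy of the original)
sed -i.bak 's/index=4/indent=4/' /usr/share/ceph/mgr/zabbix/module.py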

Hope this helps.

Best regards,
Gerdriaan Mulder
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] bluefs_buffered_io turn to true

2021-05-14 Thread Szabo, Istvan (Agoda)
Hi,

I had an issue with snaptrim after a huge amount of deleted data; it slowed down 
the team's operations due to the snaptrim and snaptrim_wait PGs.

I've changed a couple of things:

debug_ms = 0/0 #default 0/5
osd_snap_trim_priority = 1 # default 5
osd_pg_max_concurrent_snap_trims = 1 # default 2

But didn't help.
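
(For reference, options like these can usually also be changed on running OSDs 
without a restart; a sketch using injectargs with the values above. Such a 
change is not persistent, so it also has to go into ceph.conf to survive 
restarts.)

# apply to all running OSDs at once
ceph tell osd.* injectargs '--osd_snap_trim_priority 1 --osd_pg_max_concurrent_snap_trims 1'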

I've found this thread about buffered IO, and it seems like it helped them:
https://forum.proxmox.com/threads/ceph-storage-all-pgs-snaptrim-every-night-slowing-down-vms.71573/

I don't use swap on the OSD nodes, so I gave it a try on 1 OSD node, and it 
basically caused the whole node's PGs to become degraded. Is that normal? I hope 
it will not rebalance the complete node, because I don't have space for that. I 
changed it back, but the degraded count is still decreasing only slowly, so I'm 
not sure whether this setting is correct or whether this behavior is expected.

2021-05-14 12:18:11.447628 mon.2004 [WRN] Health check update: 3353/91976715 
objects misplaced (0.004%) (OBJECT_MISPLACED)
2021-05-14 12:18:11.447640 mon.2004 [WRN] Health check update: Degraded data 
redundancy: 33078466/91976715 objects degraded (35.964%), 254 pgs degraded, 253 
pgs undersized (PG_DEGRADED)

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---



This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v14.2.21 Nautilus released

2021-05-14 Thread Ilya Dryomov
On Fri, May 14, 2021 at 8:20 AM Rainer Krienke  wrote:
>
> Hello,
>
> has the "negative progress bug" also been fixed in 14.2.21? I cannot
> find any info about this in the changelog?

Unfortunately not -- this was a hotfix release driven by rgw and
dashboard CVEs.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin
Nope, the kernel keeps enough memory that it can free under pressure; for 
example, a 36-OSD host with 0.5 TiB RAM:

              total        used        free      shared  buff/cache   available
Mem:           502G        168G        2.9G         18M        331G        472G
Swap:          952M        248M        704M


k

> On 14 May 2021, at 11:20, Szabo, Istvan (Agoda)  
> wrote:
> 
> When does this stop? When it dies … :D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin


> On 14 May 2021, at 10:50, Szabo, Istvan (Agoda)  
> wrote:
> 
> Is it also normal that with buffered_io turned on, it eats all the memory on 
> the system? Hmmm.
> 
This is exactly what this option does: it uses all free memory as page cache to 
speed up BlueFS.



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin
I recommend upgrading to at least 12.2.13; for Luminous, even between .12 and 
.13 there is a significant difference in code.



k

> On 14 May 2021, at 09:22, Szabo, Istvan (Agoda)  
> wrote:
> 
> It is quite an old cluster, Luminous 12.2.8.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v14.2.21 Nautilus released

2021-05-14 Thread Rainer Krienke

Hello,

has the "negative progress bug" also been fixed in 14.2.21? I cannot 
find any info about this in the changelog?


Thanks
Rainer

Am 14.05.21 um 02:01 schrieb David Galloway:

This is a hotfix release addressing a number of security issues and
regressions.  We recommend all users update to this release. For a
detailed release notes with links & changelog please refer to the
official blog entry at https://ceph.io/releases/v14-2-21-nautilus-released

Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-14.2.21.tar.gz
* For packages, see https://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 5ef401921d7a88aea18ec7558f7f9374ebd8f5a6
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 
1001312

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs_buffered_io turn to true

2021-05-14 Thread Konstantin Shalygin
Hi,

This is not normal; it is something different, I think, like a CRUSH change on 
restart. This option will be enabled by default again in the next Nautilus 
release, so you can use it now with 14.2.19-20.


k

Sent from my iPhone

> On 14 May 2021, at 08:21, Szabo, Istvan (Agoda)  
> wrote:
> 
> Hi,
> 
> I had an issue with snaptrim after a huge amount of deleted data; it slowed 
> down the team's operations due to the snaptrim and snaptrim_wait PGs.
> 
> I've changed a couple of things:
> 
> debug_ms = 0/0 #default 0/5
> osd_snap_trim_priority = 1 # default 5
> osd_pg_max_concurrent_snap_trims = 1 # default 2
> 
> But didn't help.
> 
> I've found this thread about buffered IO, and it seems like it helped them:
> https://forum.proxmox.com/threads/ceph-storage-all-pgs-snaptrim-every-night-slowing-down-vms.71573/
> 
> I don't use swap on the OSD nodes, so I gave it a try on 1 OSD node, and it 
> basically caused the whole node's PGs to become degraded. Is that normal? I 
> hope it will not rebalance the complete node, because I don't have space for 
> that. I changed it back, but the degraded count is still decreasing only 
> slowly, so I'm not sure whether this setting is correct or whether this 
> behavior is expected.
> 
> 2021-05-14 12:18:11.447628 mon.2004 [WRN] Health check update: 3353/91976715 
> objects misplaced (0.004%) (OBJECT_MISPLACED)
> 2021-05-14 12:18:11.447640 mon.2004 [WRN] Health check update: Degraded data 
> redundancy: 33078466/91976715 objects degraded (35.964%), 254 pgs degraded, 
> 253 pgs undersized (PG_DEGRADED)
> 
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
> 
> 
> 
> This message is confidential and is for the sole use of the intended 
> recipient(s). It may also be privileged or otherwise protected by copyright 
> or other legal rules. If you have received it by mistake please let us know 
> by reply email and delete it from your system. It is prohibited to copy this 
> message or disclose its content to anyone. Any confidentiality or privilege 
> is not waived or lost by any mistaken delivery or unauthorized disclosure of 
> the message. All messages sent to and from Agoda may be monitored to ensure 
> compliance with company policies, to protect the company's interests and to 
> remove potential malware. Electronic messages may be intercepted, amended, 
> lost or deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io