[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Stefan Kooman

On 3/19/21 2:20 AM, Philip Brown wrote:

yup cephadm and orch was used to set all this up.

Current state of things:

ceph osd tree shows

  33   hdd   1.84698   osd.33   destroyed   0   1.0



^^ Destroyed, ehh, this doesn't look good to me. Ceph thinks this OSD is 
destroyed. Do you know what might have happened to osd.33? Did you 
perform a "kill an OSD" while testing?


AFAIK you can't fix that anymore. You will have to remove it and redeploy 
it. It might even get a new osd id.





cephadm logs --name osd.33 --fsid xx-xx-xx-xx

along with the systemctl stuff I already saw, showed me new things such as

ceph-osd[1645438]: did not load config file, using default settings.

ceph-osd[1645438]: 2021-03-18T14:31:32.990-0700 7f8bf14e3bc0 -1 parse_file: 
filesystem error: cannot get file size: No such file or directory

This suggested to me that I needed to copy over /etc/ceph/ceph.conf to the OSD 
node.
which I did.
I then also copied over the admin key and generated a fresh bootstrap-osd key 
with it, just for good measure, with
   ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring



I had saved the previous output of ceph-volume lvm list
and on the OSD node, ran

ceph-volume lvm prepare --data  --block.db 

But it says osd is already prepared.


I tried an activate... it tells me

--> ceph-volume lvm activate successful for osd ID: 33



but now the cephadm logs output shows me


ceph-osd[1677135]: 2021-03-18T17:57:47.982-0700 7ff64593f700 -1 
monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i 
only support [2]



Not the best error message :-}



Indeed, it would be nice to have a reference for [2]. But I think the reason you 
get this is the destroyed OSD. I would use the cephadm docs on how to replace 
an OSD. Does that exist? We had a large thread about this 
"container" topic (see "[ceph-users] ceph-ansible in Pacific and beyond?").




Now what do I need to do?


I would remove osd.33, even manually editing the crushmap if needed 
(it should not be necessary), and then redeploy this OSD and wait for recovery.
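Something like this, as a sketch only (osd id taken from this thread; purge 
is destructive, so double-check the id first):

  ceph osd purge 33 --yes-i-really-mean-it   # removes crush entry, auth key and osd id in one go
  # or the long form:
  #   ceph osd crush remove osd.33
  #   ceph auth del osd.33
  #   ceph osd rm 33

After that a fresh OSD can be deployed on the wiped device.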



If you have not manually "destroyed" this OSD, then either things work 
differently in Octopus from what I have seen so far, my memory is 
failing me, or some really weird stuff is happening and I would really 
like to know what that is.


What version are you running? Do note that 15.2.10 has been released.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Philip Brown
Unfortunately, the pod won't stay up, so "podman logs" won't work for it.

It is not even visible with "podman ps -a".



- Original Message -
From: "胡 玮文" 
To: "Philip Brown" 
Cc: "ceph-users" 
Sent: Thursday, March 18, 2021 5:56:20 PM
Subject: Re: [ceph-users] ceph octopus mysterious OSD crash

“podman logs ceph-xxx-osd-xxx” may contain additional logs.

> On Mar 19, 2021, at 04:29, Philip Brown wrote:
> 
> I've been banging on my ceph octopus test cluster for a few days now.
> 8 nodes. each node has 2 SSDs and 8 HDDs. 
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as 
> a db partition.
> 
> service_type: osd
> service_id: osd_spec_default
> placement:
>  host_pattern: '*'
> data_devices:
>  rotational: 1
> db_devices:
>  rotational: 0
> 
> 
> things were going pretty good, until... yesterday.. i noticed TWO of the OSDs 
> were "down".
> 
> I went to check the logs, with 
> journalctl -u ceph-x...@osd.xxx
> 
> all it showed were a bunch of generic debug info, and the fact that it 
> stopped.
> and various automatic attempts to restart.
> but no indication of what was wrong, and why the restarts KEEP failing.
> 
> 
> sample output:
> 
> 
> systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
> systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
> bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
> bash[9340]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> bash[9340]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 
> container create
> podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container 
> init
> .
> bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
> podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 
> container died 
> bash[1611473]: ceph-xx-xx-xx-xx-osd.33
> bash[1611473]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> (repeat, repeat...)
> podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 
> container create
> 
> 
> systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
> systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, 
> code=exited, status=1/FAILURE
> bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate
> 
> and eventually it just gives up.
> 
> smartctl -a doesnt show any errors on the HDD
> 
> 
> dmesg doesnt show anything.
> 
> So... what do I do?
> 
> 
> 
> 
> 
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
> 5 Peters Canyon Rd Suite 250 
> Irvine CA 92606 
> Office 714.918.1310| Fax 714.918.1325 
> pbr...@medata.com| 
> www.medata.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Philip Brown
yup cephadm and orch was used to set all this up.

Current state of things:

ceph osd tree shows

 33   hdd   1.84698   osd.33   destroyed   0   1.0


cephadm logs --name osd.33 --fsid xx-xx-xx-xx

along with the systemctl stuff I already saw, showed me new things such as

ceph-osd[1645438]: did not load config file, using default settings.

ceph-osd[1645438]: 2021-03-18T14:31:32.990-0700 7f8bf14e3bc0 -1 parse_file: 
filesystem error: cannot get file size: No such file or directory

This suggested to me that I needed to copy /etc/ceph/ceph.conf over to the OSD 
node, which I did.
I then also copied over the admin key and, just for good measure, generated a 
fresh bootstrap-osd key with it, using
  ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring



I had saved the previous output of ceph-volume lvm list
and on the OSD node, ran

ceph-volume lvm prepare --data  --block.db 

But it says the OSD is already prepared.


I tried an activate... it tells me

--> ceph-volume lvm activate successful for osd ID: 33



but now the cephadm logs output shows me


ceph-osd[1677135]: 2021-03-18T17:57:47.982-0700 7ff64593f700 -1 
monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i 
only support [2]



Not the best error message :-}

Now what do I need to do?





- Original Message -
From: "Stefan Kooman" 
To: "Philip Brown" , "ceph-users" 
Sent: Thursday, March 18, 2021 2:04:09 PM
Subject: Re: [ceph-users] ceph octopus mysterious OSD crash

On 3/18/21 9:28 PM, Philip Brown wrote:
> I've been banging on my ceph octopus test cluster for a few days now.
> 8 nodes. each node has 2 SSDs and 8 HDDs.
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as 
> a db partition.
> 
> service_type: osd
> service_id: osd_spec_default
> placement:
>host_pattern: '*'
> data_devices:
>rotational: 1
> db_devices:
>rotational: 0
> 
> 
> things were going pretty good, until... yesterday.. i noticed TWO of the OSDs 
> were "down".
> 
> I went to check the logs, with
> journalctl -u ceph-x...@osd.xxx
> 
> all it showed were a bunch of generic debug info, and the fact that it 
> stopped.
> and various automatic attempts to restart.
> but no indication of what was wrong, and why the restarts KEEP failing.
> 

It's a deployment made with cephadm? Looks like it as I see podman 
messages. Are these all the log messages you can find on those OSDs? 
I.e. have you tried to gather logs with cephadm logs [1].

Gr. Stefan

[1]: 
https://docs.ceph.com/en/latest/cephadm/troubleshooting/#gathering-log-files
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Stefan Kooman

On 3/18/21 9:28 PM, Philip Brown wrote:

I've been banging on my ceph octopus test cluster for a few days now.
8 nodes. each node has 2 SSDs and 8 HDDs.
They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a 
db partition.

service_type: osd
service_id: osd_spec_default
placement:
   host_pattern: '*'
data_devices:
   rotational: 1
db_devices:
   rotational: 0


things were going pretty good, until... yesterday.. i noticed TWO of the OSDs were 
"down".

I went to check the logs, with
journalctl -u ceph-x...@osd.xxx

all it showed were a bunch of generic debug info, and the fact that it stopped.
and various automatic attempts to restart.
but no indication of what was wrong, and why the restarts KEEP failing.



It's a deployment made with cephadm? Looks like it as I see podman 
messages. Are these all the log messages you can find on those OSDs? 
I.e. have you tried to gather logs with cephadm logs [1].


Gr. Stefan

[1]: 
https://docs.ceph.com/en/latest/cephadm/troubleshooting/#gathering-log-files

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread 胡 玮文
“podman logs ceph-xxx-osd-xxx” may contain additional logs.

> On Mar 19, 2021, at 04:29, Philip Brown wrote:
> 
> I've been banging on my ceph octopus test cluster for a few days now.
> 8 nodes. each node has 2 SSDs and 8 HDDs. 
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as 
> a db partition.
> 
> service_type: osd
> service_id: osd_spec_default
> placement:
>  host_pattern: '*'
> data_devices:
>  rotational: 1
> db_devices:
>  rotational: 0
> 
> 
> things were going pretty good, until... yesterday.. i noticed TWO of the OSDs 
> were "down".
> 
> I went to check the logs, with 
> journalctl -u ceph-x...@osd.xxx
> 
> all it showed were a bunch of generic debug info, and the fact that it 
> stopped.
> and various automatic attempts to restart.
> but no indication of what was wrong, and why the restarts KEEP failing.
> 
> 
> sample output:
> 
> 
> systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
> systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
> bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
> bash[9340]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> bash[9340]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 
> container create
> podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container 
> init
> .
> bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
> podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 
> container died 
> bash[1611473]: ceph-xx-xx-xx-xx-osd.33
> bash[1611473]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> (repeat, repeat...)
> podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 
> container create
> 
> 
> systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
> systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, 
> code=exited, status=1/FAILURE
> bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate
> 
> and eventually it just gives up.
> 
> smartctl -a doesnt show any errors on the HDD
> 
> 
> dmesg doesnt show anything.
> 
> So... what do I do?
> 
> 
> 
> 
> 
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
> 5 Peters Canyon Rd Suite 250 
> Irvine CA 92606 
> Office 714.918.1310| Fax 714.918.1325 
> pbr...@medata.com| 
> www.medata.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread David Orman
Use journalctl -xe (maybe with -S/-U if you want to filter) to find
the time period in which a restart attempt happened, and see
what's logged for that period. If that's not helpful, then what you may
want to do is disable that service (systemctl disable blah), get
the ExecStart out of it, and try running it by hand to see what
happens (the symlinked systemd unit will refer to a unit.run file
in /var/lib/ceph that has the actual podman cmd). If the pod
dies, you should still see it in podman ps -a, and you can run podman logs
on it to get the details. Then you can correct the issue, re-enable
the service, and restart it properly so the housekeeping gets done.
Follow these directions at your own risk; make sure you understand the
ramifications of whatever you might be doing!
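Roughly, as a sketch (the unit name, fsid, timestamps and container name
below are placeholders; adjust them to what cephadm created on your host):

  journalctl -xe -u ceph-<fsid>@osd.33 -S "2021-03-18 10:20" -U "2021-03-18 10:30"
  systemctl disable ceph-<fsid>@osd.33.service
  cat /var/lib/ceph/<fsid>/osd.33/unit.run     # the actual podman command cephadm runs
  # run that podman command by hand, then:
  podman ps -a | grep osd.33
  podman logs <container id from the line above>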

David

On Thu, Mar 18, 2021 at 3:29 PM Philip Brown  wrote:
>
> I've been banging on my ceph octopus test cluster for a few days now.
> 8 nodes. each node has 2 SSDs and 8 HDDs.
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as 
> a db partition.
>
> service_type: osd
> service_id: osd_spec_default
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
>
>
> things were going pretty good, until... yesterday.. i noticed TWO of the OSDs 
> were "down".
>
> I went to check the logs, with
> journalctl -u ceph-x...@osd.xxx
>
> all it showed were a bunch of generic debug info, and the fact that it 
> stopped.
> and various automatic attempts to restart.
> but no indication of what was wrong, and why the restarts KEEP failing.
>
>
> sample output:
>
>
> systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
> systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
> bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
> bash[9340]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> bash[9340]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 
> container create
> podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container 
> init
> .
> bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
> podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 
> container died
> bash[1611473]: ceph-xx-xx-xx-xx-osd.33
> bash[1611473]: WARNING: The same type, major and minor should not be used for 
> multiple devices.
> (repeat, repeat...)
> podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 
> container create
>
> 
> systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
> systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, 
> code=exited, status=1/FAILURE
> bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate
>
> and eventually it just gives up.
>
> smartctl -a doesnt show any errors on the HDD
>
>
> dmesg doesnt show anything.
>
> So... what do I do?
>
>
>
>
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph octopus mysterious OSD crash

2021-03-18 Thread Philip Brown
I've been banging on my ceph octopus test cluster for a few days now.
8 nodes; each node has 2 SSDs and 8 HDDs. 
They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a 
db partition.

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0


Things were going pretty well until... yesterday... I noticed TWO of the OSDs 
were "down".

I went to check the logs with 
journalctl -u ceph-x...@osd.xxx

All it showed was a bunch of generic debug info, the fact that it stopped, 
and various automatic attempts to restart, 
but no indication of what was wrong, or why the restarts KEEP failing.


sample output:


systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
bash[9340]: WARNING: The same type, major and minor should not be used for 
multiple devices.
bash[9340]: WARNING: The same type, major and minor should not be used for 
multiple devices.
podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 container 
create
podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container 
init
.
bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 
container died 
bash[1611473]: ceph-xx-xx-xx-xx-osd.33
bash[1611473]: WARNING: The same type, major and minor should not be used for 
multiple devices.
(repeat, repeat...)
podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 
container create


systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, code=exited, 
status=1/FAILURE
bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate

and eventually it just gives up.

smartctl -a doesn't show any errors on the HDD


dmesg doesn't show anything.

So... what do I do?





--
Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
5 Peters Canyon Rd Suite 250 
Irvine CA 92606 
Office 714.918.1310| Fax 714.918.1325 
pbr...@medata.com| www.medata.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Reed Dier
I too would be amenable to a cephadm orchestrator on bare metal if that were an 
option.

I thought that I would drop this video for anyone that hasn't seen it yet.
It hits on a lot of people's sentiments about containers being not great for 
debugging, while also going into some of the pros to the containerization at 
the same time.
Great presentation.
https://www.youtube.com/watch?v=pPZsN_urpqw 


Reed

> On Mar 18, 2021, at 11:10 AM, Milan Kupcevic  
> wrote:
> 
> On 3/18/21 2:36 AM, Lars Täuber wrote:
>> I vote for an SSH orchestrator for a bare metal installation too!
> 
> 
> +1
> 
> 
> Cephadm with a no containers option would do.
> 
> 
> Milan
> 
> 
> -- 
> Milan Kupcevic
> Senior Cyberinfrastructure Engineer at Project NESE
> Harvard University
> FAS Research Computing
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] howto:: emergency shutdown procedure and maintenance

2021-03-18 Thread Adrian Sevcenco

Hi! What steps/procedures are required for emergency shutdown and for machine 
maintenance?

Thanks a lot!
Adrian




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] v15.2.10 Octopus released

2021-03-18 Thread Szabo, Istvan (Agoda)
Hi David,

I guess this one fixes the non-containerized deployment too, doesn't it?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: David Galloway 
Sent: Thursday, March 18, 2021 9:10 PM
To: ceph-annou...@ceph.io; ceph-users@ceph.io; d...@ceph.io; 
ceph-maintain...@ceph.io
Subject: [Suspicious newsletter] [ceph-users] v15.2.10 Octopus released

We're happy to announce the 10th backport release in the Octopus series.
We recommend users to update to this release. For a detailed release notes with 
links & changelog please refer to the official blog entry at 
https://ceph.io/releases/v15-2-10-octopus-released

Notable Changes
---

* The containers include an updated tcmalloc that avoids crashes seen on 
15.2.9.  See `issue#49618 `_ for details.

* RADOS: BlueStore handling of huge(>4GB) writes from RocksDB to BlueFS has 
been fixed.

* When upgrading from a previous cephadm release, systemctl may hang when 
trying to start or restart the monitoring containers. (This is caused by a 
change in the systemd unit to use `type=forking`.) After the upgrade, please 
run::

ceph orch redeploy nfs
ceph orch redeploy iscsi
ceph orch redeploy node-exporter
ceph orch redeploy prometheus
ceph orch redeploy grafana
ceph orch redeploy alertmanager


Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.10.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 27917a557cca91e4da407489bbaa64ad4352cc02
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommendations on problem with PG

2021-03-18 Thread Gabriel Medve

Hi,

I want to clarify that I am running Ceph 15.2.9 under Docker.

I await comments.

--

El 12/3/21 a las 15:42, Gabriel Medve escribió:

Hi,

We have a problem with a PG that was inconsistent; currently the PGs in 
our cluster have 3 copies.


It was not possible for us to repair this PG with "ceph pg repair" 
(this PG is on OSDs 14, 1 and 2), so we deleted the copy on osd 14 
with the following command:
ceph-objectstore-tool --data-path /var/lib/ceph/osd.14/ --pgid 22.f 
--op remove --force
This triggered an automatic attempt to recreate the missing copy (the PG 
entered the backfilling state), but doing so crashed osd 1 and 2, dropped 
the IOPS to 0, and froze the cluster.


Is there any way to remove this entire pg or try to recreate the 
missing copy or ignore it completely? It causes instability in the 
cluster.


Thank you, I await comments


--
Untitled Document


Gabriel I. Medve

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond? [EXT]

2021-03-18 Thread Matthew Vernon

Hi,

On 18/03/2021 15:03, Guillaume Abrioux wrote:


ceph-ansible@stable-6.0 supports pacific and the current content in the
branch 'master' (future stable-7.0) is intended to support Ceph Quincy.

I can't speak on behalf of Dimitri but I'm personally willing to keep
maintaining ceph-ansible if there are interests, but people must be aware
that:


This is good to know, thank you :)

I hadn't realised my question would spawn such a monster thread!

Regards,

Matthew



--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v15.2.10 Octopus released

2021-03-18 Thread David Galloway
Terribly sorry for the mistake.  There was a bug in the script I use to
sync packages to download.ceph.com that wasn't listing directories in
the desired order.  That meant the download.ceph.com/{rpm,deb}-octopus
symlinks still pointed to 15.2.9.  This is fixed.

I'm re-running the container jobs to get those pushed too.

On 3/18/21 10:45 AM, David Orman wrote:
> Hi David,
> 
> The "For Packages" link in your email/the blog posts do not appear to
> work. Additionally, we browsed the repo, and it doesn't appear the
> packages are uploaded, at least for debian-octopus:
> http://download.ceph.com/debian-octopus/pool/main/c/ceph/. We only use
> the release packages for cephadm bootstrapping, so it's not a
> deal-breaker for us, just wanted to give you a head's up.
> 
> Cheers,
> David Orman
> 
> On Thu, Mar 18, 2021 at 9:11 AM David Galloway  wrote:
>>
>> We're happy to announce the 10th backport release in the Octopus series.
>> We recommend users to update to this release. For a detailed release
>> notes with links & changelog please refer to the official blog entry at
>> https://ceph.io/releases/v15-2-10-octopus-released
>>
>> Notable Changes
>> ---
>>
>> * The containers include an updated tcmalloc that avoids crashes seen on
>> 15.2.9.  See `issue#49618 `_ for
>> details.
>>
>> * RADOS: BlueStore handling of huge(>4GB) writes from RocksDB to BlueFS
>> has been fixed.
>>
>> * When upgrading from a previous cephadm release, systemctl may hang
>> when trying to start or restart the monitoring containers. (This is
>> caused by a change in the systemd unit to use `type=forking`.) After the
>> upgrade, please run::
>>
>> ceph orch redeploy nfs
>> ceph orch redeploy iscsi
>> ceph orch redeploy node-exporter
>> ceph orch redeploy prometheus
>> ceph orch redeploy grafana
>> ceph orch redeploy alertmanager
>>
>>
>> Getting Ceph
>> 
>> * Git at git://github.com/ceph/ceph.git
>> * Tarball at http://download.ceph.com/tarballs/ceph-15.2.10.tar.gz
>> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
>> * Release git sha1: 27917a557cca91e4da407489bbaa64ad4352cc02
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Email alerts from Ceph [EXT]

2021-03-18 Thread Konstantin Shalygin
Just use ceph-dash and check_ceph_dash [1]



[1] https://github.com/Crapworks/check_ceph_dash
k

Sent from my iPhone

> On 18 Mar 2021, at 12:02, Matthew Vernon  wrote:
> 
> I'm afraid we used our existing Nagios infrastructure for checking HEALTH 
> status, and have a script that runs daily to report on failed OSDs.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Milan Kupcevic
On 3/18/21 2:36 AM, Lars Täuber wrote:
> I vote for an SSH orchestrator for a bare metal installation too!


+1


Cephadm with a no containers option would do.


Milan


-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] v15.2.10 Octopus released

2021-03-18 Thread Dimitri Savineau
Hi,

https://download.ceph.com/rpm-octopus/ symlink isn't updated to the new
release [1] and the content is still 15.2.9 [2]

As a consequence, the new Octopus container images 15.2.10 can't be built.

[1] https://download.ceph.com/rpm-15.2.10/
[2] https://download.ceph.com/rpm-15.2.9/

Regards,

Dimitri
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG export import

2021-03-18 Thread Szabo, Istvan (Agoda)
Yeah, it finally started, just super slow.
Currently I want to export/import the PGs from the dead OSDs so the cluster 
can start CephFS and the data can be saved. I am also looking for some space 
to export the PGs to, because they are quite big, 100s of GB.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Frank Schilder  
Sent: Thursday, March 18, 2021 6:16 PM
To: Szabo, Istvan (Agoda) ; Ceph Users 

Subject: Re: PG export import

It sounds like there is a general problem on this cluster with OSDs not 
starting. You probably need to go back to the logs and try to find out why the 
MONs don't allow the OSDs to join. MON IPs, cluster ID, network config in 
ceph.conf and on host, cluster name, authentication, ports, messenger version 
etc.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 March 2021 10:48:05
To: Ceph Users
Subject: [ceph-users] PG export import

Hi,

I’ve tried to save some pg from a dead osd, I made this:

Picked on the same server an osd which is not really used and stopped that osd 
and import the exported one from the dead one.

root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 
--no-mon-config --pgid 44.c0s0 --op export --file ./pg44c0s0 Exporting 44.c0s0 
info 44.c0s0( empty local-lis/les=0/0 n=0 ec=192123/175799 
lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493) Export successful

root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 
--no-mon-config --op import --file ./pg44c0s0 get_pg_num_history pg_num_history 
pg_num_history(e5583546 pg_nums 
{20={173213=256},21={219434=64},22={220991=64},24={219240=32},25={1446965=128},42={175793=32},43={197388=64},44={192123=512}}
 deleted_pools ) Importing pgid 44.c0s0 write_pg epoch 4865498 info 44.c0s0( 
empty local-lis/les=0/0 n=0 ec=192123/175799 lis/c=4865474/4851556 
les/c/f=4865475/4851557/0 sis=4865493) Import successful

Started back 34 and it says the osd is running but in the cluster map it is 
down :/

root@server:~# systemctl status ceph-osd@34 -l ● 
ceph-osd@34.service - Ceph object storage daemon 
osd.34
 Loaded: loaded 
(/lib/systemd/system/ceph-osd@.service;
 enabled-runtime; vendor preset: enabled)
 Active: active (running) since Thu 2021-03-18 10:38:00 CET; 8min ago
Process: 45388 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
${CLUSTER} --id 34 (code=exited, sta>
   Main PID: 45392 (ceph-osd)
  Tasks: 60
 Memory: 856.2M
 CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@34.service
 └─45392 /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph 
--setgroup ceph

Mar 18 10:38:00 server systemd[1]: Starting Ceph object storage daemon osd.34...
Mar 18 10:38:00 server systemd[1]: Started Ceph object storage daemon osd.34.
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.817+0100 
7f41738d5dc0 -1 osd.34 5583546 log_to_mon> Mar 18 10:38:21 server 
ceph-osd[45392]: 2021-03-18T10:38:21.825+0100 7f41738d5dc0 -1 osd.34 5583546 
mon_cmd_ma>


Any idea?


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG export import

2021-03-18 Thread Frank Schilder
It sounds like there is a general problem on this cluster with OSDs not 
starting. You probably need to go back to the logs and try to find out why the 
MONs don't allow the OSDs to join. MON IPs, cluster ID, network config in 
ceph.conf and on host, cluster name, authentication, ports, messenger version 
etc.
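A few checks that might help narrow this down, as a sketch (osd.34 and the 
default paths are taken from this thread):

  ceph -s                                    # are the mons reachable from this host at all?
  ceph osd dump | grep -E 'fsid|require'     # cluster fsid and required release/features
  grep fsid /etc/ceph/ceph.conf              # cluster fsid the local conf points at
  cat /var/lib/ceph/osd/ceph-34/ceph_fsid    # cluster fsid stored in the osd data dir
  journalctl -u ceph-osd@34 --since "1 hour ago"   # full startup log of the osd
  ceph daemon osd.34 status                  # what the running daemon thinks its state is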

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 March 2021 10:48:05
To: Ceph Users
Subject: [ceph-users] PG export import

Hi,

I’ve tried to save some pg from a dead osd, I made this:

Picked on the same server an osd which is not really used and stopped that osd 
and import the exported one from the dead one.

root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 
--no-mon-config --pgid 44.c0s0 --op export --file ./pg44c0s0
Exporting 44.c0s0 info 44.c0s0( empty local-lis/les=0/0 n=0 ec=192123/175799 
lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493)
Export successful

root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 
--no-mon-config --op import --file ./pg44c0s0
get_pg_num_history pg_num_history pg_num_history(e5583546 pg_nums 
{20={173213=256},21={219434=64},22={220991=64},24={219240=32},25={1446965=128},42={175793=32},43={197388=64},44={192123=512}}
 deleted_pools )
Importing pgid 44.c0s0
write_pg epoch 4865498 info 44.c0s0( empty local-lis/les=0/0 n=0 
ec=192123/175799 lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493)
Import successful

Started back 34 and it says the osd is running but in the cluster map it is 
down :/

root@server:~# systemctl status ceph-osd@34 -l
● ceph-osd@34.service - Ceph object storage daemon 
osd.34
 Loaded: loaded 
(/lib/systemd/system/ceph-osd@.service;
 enabled-runtime; vendor preset: enabled)
 Active: active (running) since Thu 2021-03-18 10:38:00 CET; 8min ago
Process: 45388 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
${CLUSTER} --id 34 (code=exited, sta>
   Main PID: 45392 (ceph-osd)
  Tasks: 60
 Memory: 856.2M
 CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@34.service
 └─45392 /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph 
--setgroup ceph

Mar 18 10:38:00 server systemd[1]: Starting Ceph object storage daemon osd.34...
Mar 18 10:38:00 server systemd[1]: Started Ceph object storage daemon osd.34.
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.817+0100 
7f41738d5dc0 -1 osd.34 5583546 log_to_mon>
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.825+0100 
7f41738d5dc0 -1 osd.34 5583546 mon_cmd_ma>


Any idea?


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Guillaume Abrioux
Hi all,

ceph-ansible@stable-6.0 supports pacific and the current content in the
branch 'master' (future stable-7.0) is intended to support Ceph Quincy.

I can't speak on behalf of Dimitri but I'm personally willing to keep
maintaining ceph-ansible if there is interest, but people must be aware
that:

- the official and supported installer will be cephadm,
- officially, ceph-ansible will be unsupported, with the consequences that
might bring (testing efforts, CI resources, etc.),
- fewer engineering efforts will be carried out (contributions are welcome!)

Thanks,

On Thu, 18 Mar 2021 at 11:26, Stefan Kooman  wrote:

> On 3/18/21 9:09 AM, Janne Johansson wrote:
> > Den ons 17 mars 2021 kl 20:17 skrev Matthew H  >:
> >>
> >> "A containerized environment just makes troubleshooting more difficult,
> getting access and retrieving details on Ceph processes isn't as
> straightforward as with a non containerized infrastructure. I am still not
> convinced that containerizing everything brings any benefits except the
> collocation of services."
> >>
> >> It changes the way you troubleshoot, but I don't find it more difficult
> in the issues I have seen and had. Even today without containers, all
> services can be co-located within the same hosts (mons,mgrs,osds,mds).. Is
> there a situation you've seen where that has not been the case?
> >
> > New ceph users pop in all the time on the #ceph IRC and have
> > absolutely no idea on how to see the relevant logs from the
> > containerized services.
>
> While you might not need that much Ceph knowledge to get Ceph up and
> running, it does require users to know how container deployments work. I
> had to put quite a bit of work in to get what ceph-ansible was doing to
> deploy the containers, and why it would fail (after some other tries).
> You do need to have Ceph knowledge, still, when things do not go as
> expected, and even beforehand to make the right decisions on how to set
> up all the infrastructure. So arguably you need even more knowledge to
> understand what is going on under the hood, be it Ceph or containers.
>
> >
> > Me being one of the people that do run services on bare metal (and
> > VMs) I actually can't help them, and it seems several other old ceph
> > admins can't either.
> >
> > Not that it is impossible or might not even be hard to get them, but
> > somewhere in the "it is so easy to get it up and running, just pop a
> > container and off you go" docs there seem to be a lack of the parts
> > "when the OSD crashes at boot, run this to export the file normally
> > called /var/log/ceph/ceph-osd.12.log" meaning it becomes a black box
> > to the users and they are left to wipe/reinstall or something else
> > when it doesn't work. At the end, I guess the project will see less
> > useful reports with Assert Failed logs from impossible conditions and
> > more people turning away from something that could be fixed in the
> > long run.
>
> There is a ceph manager module for that:
> https://docs.ceph.com/en/latest/mgr/crash/
>
> I guess an option to "always send crash logs to Ceph" could be build in.
> If you trust Ceph with this data of course (opt-in).
>
>
> >
> > I get some of the advantages, and for stateless services elsewhere it
> > might be gold to have containers, I am not equally enthusiastic about
> > it for ceph.
> >
>
> Yeah, so I think it's good to discuss pros and cons and see what problem
> it solves, and what extra problems it creates.
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 

*Guillaume Abrioux*
Senior Software Engineer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v15.2.10 Octopus released

2021-03-18 Thread Kenneth Waegeman

Hi,

The 15.2.10 image is not yet on docker? (An hour ago *14*.2.10 image was 
pushed)


Thanks!

Kenneth

On 18/03/2021 15:10, David Galloway wrote:

We're happy to announce the 10th backport release in the Octopus series.
We recommend users to update to this release. For a detailed release
notes with links & changelog please refer to the official blog entry at
https://ceph.io/releases/v15-2-10-octopus-released

Notable Changes
---

* The containers include an updated tcmalloc that avoids crashes seen on
15.2.9.  See `issue#49618 `_ for
details.

* RADOS: BlueStore handling of huge(>4GB) writes from RocksDB to BlueFS
has been fixed.

* When upgrading from a previous cephadm release, systemctl may hang
when trying to start or restart the monitoring containers. (This is
caused by a change in the systemd unit to use `type=forking`.) After the
upgrade, please run::

 ceph orch redeploy nfs
 ceph orch redeploy iscsi
 ceph orch redeploy node-exporter
 ceph orch redeploy prometheus
 ceph orch redeploy grafana
 ceph orch redeploy alertmanager


Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.10.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 27917a557cca91e4da407489bbaa64ad4352cc02
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON slow ops and growing MON store

2021-03-18 Thread Janek Bevendorff
We just had the same problem again after a power outage that took out 
62% of our cluster and three out of five MONs. Once everything was back 
up, the MONs started lagging and piling up slow ops while the MON store 
grew to double-digit gigabytes. It was so bad that I couldn't even list 
the in-flight ops anymore, because ceph daemon mon.XXX ops did not 
return at all.
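For anyone hitting the same thing, the store size and pending ops can be 
checked per MON roughly like this (a sketch; default paths and the local 
mon id are assumed):

  du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db   # on-disk size of the mon store
  ceph daemon mon.$(hostname -s) ops | head                # in-flight ops on this mon
  ceph tell mon.$(hostname -s) compact                     # trigger a rocksdb compaction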


Like last time, after I restarted all five MONs, the store size 
decreased and everything went back to normal. I also had to restart the 
MGRs and MDSs afterwards. This is starting to look like a bug to me.


Janek


On 26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it's 
not a problem any more (for now).


Unfortunately, just disabling clog_to_monitors didn't have the wanted 
effect when I tried it yesterday. But I also believe that it is 
somehow related. I could not find any specific reason for the incident 
yesterday in the logs besides a few more RocksDB status and compact 
messages than usual, but that's more symptomatic.



On 26/02/2021 13:05, Mykola Golub wrote:

On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:


On the first MON, the command doesn’t even return, but I was able to
get a dump from the one I restarted most recently. The oldest ops
look like this:

 {
 "description": "log(1000 entries from seq 17876238 at 
2021-02-25T15:13:20.306487+0100)",

 "initiated_at": "2021-02-25T20:40:34.698932+0100",
 "age": 183.762551121,
 "duration": 183.762599201,

The mon stores cluster log messages in the mon db. You mentioned
problems with osds flooding with log messages. It looks like related.

If you still observe the db growth you may try temporarily disable
clog_to_monitors, i.e. set for all osds:

  clog_to_monitors = false

And see if it stops growing after this and if it helps with the slow
ops (it might make sense to restar mons if some look like get
stuck). You can apply the config option on the fly (without restarting
the osds, e.g with injectargs), but when re-enabling back you will
have to restart the osds to avoid crashes due to this bug [1].

[1] https://tracker.ceph.com/issues/48946


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Stefan Kooman

On 3/18/21 9:09 AM, Janne Johansson wrote:

Den ons 17 mars 2021 kl 20:17 skrev Matthew H :


"A containerized environment just makes troubleshooting more difficult, getting 
access and retrieving details on Ceph processes isn't as straightforward as with a non 
containerized infrastructure. I am still not convinced that containerizing everything 
brings any benefits except the collocation of services."

It changes the way you troubleshoot, but I don't find it more difficult in the 
issues I have seen and had. Even today without containers, all services can be 
co-located within the same hosts (mons,mgrs,osds,mds).. Is there a situation 
you've seen where that has not been the case?


New ceph users pop in all the time on the #ceph IRC and have
absolutely no idea on how to see the relevant logs from the
containerized services.


While you might not need that much Ceph knowledge to get Ceph up and 
running, it does require users to know how container deployments work. I 
had to put quite a bit of work into figuring out what ceph-ansible was doing 
to deploy the containers, and why it would fail (after some other tries). 
You still do need to have Ceph knowledge when things do not go as 
expected, and even beforehand, to make the right decisions on how to set 
up all the infrastructure. So arguably you need even more knowledge to 
understand what is going on under the hood, be it Ceph or containers.




Me being one of the people that do run services on bare metal (and
VMs) I actually can't help them, and it seems several other old ceph
admins can't either.

Not that it is impossible or might not even be hard to get them, but
somewhere in the "it is so easy to get it up and running, just pop a
container and off you go" docs there seem to be a lack of the parts
"when the OSD crashes at boot, run this to export the file normally
called /var/log/ceph/ceph-osd.12.log" meaning it becomes a black box
to the users and they are left to wipe/reinstall or something else
when it doesn't work. At the end, I guess the project will see less
useful reports with Assert Failed logs from impossible conditions and
more people turning away from something that could be fixed in the
long run.


There is a ceph manager module for that: 
https://docs.ceph.com/en/latest/mgr/crash/


I guess an option to "always send crash logs to Ceph" could be built in, 
if you trust Ceph with this data of course (opt-in).
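For those who have not used it, a minimal example of the crash module 
commands (assuming the module is enabled, which it should be by default on 
recent releases):

  ceph mgr module enable crash    # only needed if it was disabled
  ceph crash ls                   # list recent daemon crashes
  ceph crash info <crash-id>      # metadata and backtrace for one crash
  ceph crash archive-all          # acknowledge them so the health warning clears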





I get some of the advantages, and for stateless services elsewhere it
might be gold to have containers, I am not equally enthusiastic about
it for ceph.



Yeah, so I think it's good to discuss pros and cons and see what problem 
it solves, and what extra problems it creates.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] v15.2.10 Octopus released

2021-03-18 Thread David Galloway
We're happy to announce the 10th backport release in the Octopus series.
We recommend users to update to this release. For a detailed release
notes with links & changelog please refer to the official blog entry at
https://ceph.io/releases/v15-2-10-octopus-released

Notable Changes
---

* The containers include an updated tcmalloc that avoids crashes seen on
15.2.9.  See `issue#49618 `_ for
details.

* RADOS: BlueStore handling of huge(>4GB) writes from RocksDB to BlueFS
has been fixed.

* When upgrading from a previous cephadm release, systemctl may hang
when trying to start or restart the monitoring containers. (This is
caused by a change in the systemd unit to use `type=forking`.) After the
upgrade, please run::

ceph orch redeploy nfs
ceph orch redeploy iscsi
ceph orch redeploy node-exporter
ceph orch redeploy prometheus
ceph orch redeploy grafana
ceph orch redeploy alertmanager


Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.10.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 27917a557cca91e4da407489bbaa64ad4352cc02
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG export import

2021-03-18 Thread Szabo, Istvan (Agoda)
Hi,

I've tried to save some PGs from a dead OSD. Here is what I did:

I picked an OSD on the same server which is not really used, stopped that OSD, 
and imported into it the PG exported from the dead one.

root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 
--no-mon-config --pgid 44.c0s0 --op export --file ./pg44c0s0
Exporting 44.c0s0 info 44.c0s0( empty local-lis/les=0/0 n=0 ec=192123/175799 
lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493)
Export successful

root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 
--no-mon-config --op import --file ./pg44c0s0
get_pg_num_history pg_num_history pg_num_history(e5583546 pg_nums 
{20={173213=256},21={219434=64},22={220991=64},24={219240=32},25={1446965=128},42={175793=32},43={197388=64},44={192123=512}}
 deleted_pools )
Importing pgid 44.c0s0
write_pg epoch 4865498 info 44.c0s0( empty local-lis/les=0/0 n=0 
ec=192123/175799 lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493)
Import successful

Started back 34 and it says the osd is running but in the cluster map it is 
down :/

root@server:~# systemctl status ceph-osd@34 -l
● ceph-osd@34.service - Ceph object storage daemon osd.34
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2021-03-18 10:38:00 CET; 8min ago
    Process: 45388 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 34 (code=exited, sta>
   Main PID: 45392 (ceph-osd)
      Tasks: 60
     Memory: 856.2M
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@34.service
             └─45392 /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph --setgroup ceph

Mar 18 10:38:00 server systemd[1]: Starting Ceph object storage daemon osd.34...
Mar 18 10:38:00 server systemd[1]: Started Ceph object storage daemon osd.34.
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.817+0100 
7f41738d5dc0 -1 osd.34 5583546 log_to_mon>
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.825+0100 
7f41738d5dc0 -1 osd.34 5583546 mon_cmd_ma>


Any idea?


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Email alerts from Ceph [EXT]

2021-03-18 Thread Matthew Vernon

Hi,

On 17/03/2021 22:26, Andrew Walker-Brown wrote:


How have folks implemented getting email or snmp alerts out of Ceph?
Getting things like osd/pool nearly full or osd/daemon failures etc.
I'm afraid we used our existing Nagios infrastructure for checking 
HEALTH status, and have a script that runs daily to report on failed OSDs.


Our existing metrics infrastructure is collectd/graphite/grafana so we 
have dashboards and so on, but as far as I'm aware the Octopus dashboard 
only supports prometheus, so we're a bit stuck there :-(


Regards,

Matthew


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Jaroslaw Owsiewski
On Thu, 18 Mar 2021 at 09:42, Wido den Hollander wrote:

>
> Me being one of them.
>
> Yes, it's all possible with containers, but it's different. And I don't
> see the true benefit of running Ceph in Docker just yet.
>
> Another layer of abstraction which you need to understand. Also, when
> you need to do real emergency stuff like working with
> ceph-objectstore-tool to fix broken OSDs/PGs it's just much easier to
> work on a bare-metal box than with containers (if you ask me).
>
> So no, I am not convinced yet. Not against it, but personally I would
> say it's not the only way forward.
>
> DEB and RPM packages are still alive and kicking.
>
> Wido
>
>
+1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Martin Verges
> So no, I am not convinced yet. Not against it, but personally I would say
it's not the only way forward.

100% agree with your whole answer

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, 18 Mar 2021 at 09:42, Wido den Hollander  wrote:

>
>
> On 18/03/2021 09:09, Janne Johansson wrote:
> > Den ons 17 mars 2021 kl 20:17 skrev Matthew H  >:
> >>
> >> "A containerized environment just makes troubleshooting more difficult,
> getting access and retrieving details on Ceph processes isn't as
> straightforward as with a non containerized infrastructure. I am still not
> convinced that containerizing everything brings any benefits except the
> collocation of services."
> >>
> >> It changes the way you troubleshoot, but I don't find it more difficult
> in the issues I have seen and had. Even today without containers, all
> services can be co-located within the same hosts (mons,mgrs,osds,mds).. Is
> there a situation you've seen where that has not been the case?
> >
> > New ceph users pop in all the time on the #ceph IRC and have
> > absolutely no idea on how to see the relevant logs from the
> > containerized services.
> >
> > Me being one of the people that do run services on bare metal (and
> > VMs) I actually can't help them, and it seems several other old ceph
> > admins can't either.
> >
>
> Me being one of them.
>
> Yes, it's all possible with containers, but it's different. And I don't
> see the true benefit of running Ceph in Docker just yet.
>
> Another layer of abstraction which you need to understand. Also, when
> you need to do real emergency stuff like working with
> ceph-objectstore-tool to fix broken OSDs/PGs it's just much easier to
> work on a bare-metal box than with containers (if you ask me).
>
> So no, I am not convinced yet. Not against it, but personally I would
> say it's not the only way forward.
>
> DEB and RPM packages are still alive and kicking.
>
> Wido
>
> > Not that it is impossible or might not even be hard to get them, but
> > somewhere in the "it is so easy to get it up and running, just pop a
> > container and off you go" docs there seem to be a lack of the parts
> > "when the OSD crashes at boot, run this to export the file normally
> > called /var/log/ceph/ceph-osd.12.log" meaning it becomes a black box
> > to the users and they are left to wipe/reinstall or something else
> > when it doesn't work. At the end, I guess the project will see less
> > useful reports with Assert Failed logs from impossible conditions and
> > more people turning away from something that could be fixed in the
> > long run.
> >
> > I get some of the advantages, and for stateless services elsewhere it
> > might be gold to have containers, I am not equally enthusiastic about
> > it for ceph.
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Wido den Hollander




On 18/03/2021 09:09, Janne Johansson wrote:

Den ons 17 mars 2021 kl 20:17 skrev Matthew H :


"A containerized environment just makes troubleshooting more difficult, getting 
access and retrieving details on Ceph processes isn't as straightforward as with a non 
containerized infrastructure. I am still not convinced that containerizing everything 
brings any benefits except the collocation of services."

It changes the way you troubleshoot, but I don't find it more difficult in the 
issues I have seen and had. Even today without containers, all services can be 
co-located within the same hosts (mons,mgrs,osds,mds).. Is there a situation 
you've seen where that has not been the case?


New ceph users pop in all the time on the #ceph IRC and have
absolutely no idea on how to see the relevant logs from the
containerized services.

Me being one of the people that do run services on bare metal (and
VMs) I actually can't help them, and it seems several other old ceph
admins can't either.



Me being one of them.

Yes, it's all possible with containers, but it's different. And I don't 
see the true benefit of running Ceph in Docker just yet.


Another layer of abstraction which you need to understand. Also, when 
you need to do real emergency stuff like working with 
ceph-objectstore-tool to fix broken OSDs/PGs it's just much easier to 
work on a bare-metal box than with containers (if you ask me).


So no, I am not convinced yet. Not against it, but personally I would 
say it's not the only way forward.


DEB and RPM packages are still alive and kicking.

Wido


Not that it is impossible or might not even be hard to get them, but
somewhere in the "it is so easy to get it up and running, just pop a
container and off you go" docs there seem to be a lack of the parts
"when the OSD crashes at boot, run this to export the file normally
called /var/log/ceph/ceph-osd.12.log" meaning it becomes a black box
to the users and they are left to wipe/reinstall or something else
when it doesn't work. At the end, I guess the project will see less
useful reports with Assert Failed logs from impossible conditions and
more people turning away from something that could be fixed in the
long run.

I get some of the advantages, and for stateless services elsewhere it
might be gold to have containers, I am not equally enthusiastic about
it for ceph.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Janne Johansson
Den ons 17 mars 2021 kl 20:17 skrev Matthew H :
>
> "A containerized environment just makes troubleshooting more difficult, 
> getting access and retrieving details on Ceph processes isn't as 
> straightforward as with a non containerized infrastructure. I am still not 
> convinced that containerizing everything brings any benefits except the 
> collocation of services."
>
> It changes the way you troubleshoot, but I don't find it more difficult in 
> the issues I have seen and had. Even today without containers, all services 
> can be co-located within the same hosts (mons,mgrs,osds,mds).. Is there a 
> situation you've seen where that has not been the case?

New ceph users pop in all the time on the #ceph IRC and have
absolutely no idea on how to see the relevant logs from the
containerized services.

Me being one of the people that do run services on bare metal (and
VMs) I actually can't help them, and it seems several other old ceph
admins can't either.

Not that it is impossible, or even necessarily hard, to get them, but
somewhere in the "it is so easy to get it up and running, just pop a
container and off you go" docs there seems to be a missing part:
"when the OSD crashes at boot, run this to export the file normally
called /var/log/ceph/ceph-osd.12.log". So it becomes a black box
to the users, and they are left to wipe/reinstall or something else
when it doesn't work. In the end, I guess the project will see fewer
useful reports with Assert Failed logs from impossible conditions and
more people turning away from something that could be fixed in the
long run.

I get some of the advantages, and for stateless services elsewhere it
might be gold to have containers, I am not equally enthusiastic about
it for ceph.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Email alerts from Ceph

2021-03-18 Thread Martin Verges
by adding a hook script within croit "onHealthDegrate" and
"onHealthRecover" that notifies us using telegram/slack/... ;)

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Wed, 17 Mar 2021 at 23:27, Andrew Walker-Brown <
andrew_jbr...@hotmail.com> wrote:

> Hi all,
>
> How have folks implemented getting email or snmp alerts out of Ceph?
> Getting things like osd/pool nearly full or osd/daemon failures etc.
>
> Kind regards
>
> Andrew
>
> Sent from my iPhone
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io