[ceph-users] Re: How to remove stuck daemon?

2022-01-26 Thread Fyodor Ustinov
Hi!

No one knows how to fix it?


- Original Message -
> From: "Fyodor Ustinov" 
> To: "ceph-users" 
> Sent: Tuesday, 25 January, 2022 11:29:53
> Subject: [ceph-users] How to remove stuck daemon?

> Hi!
> 
> I have Ceph cluster version 16.2.7 with this error:
> 
> root@s-26-9-19-mon-m1:~# ceph health detail
> HEALTH_WARN 1 failed cephadm daemon(s)
> [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
>daemon osd.91 on s-26-8-2-1 is in error state
> 
> But I don't have that osd anymore. I deleted it.
> 
> root@s-26-9-19-mon-m1:~# ceph orch ps|grep s-26-8-2-1
> crash.s-26-8-2-1          s-26-8-2-1          running (2d)  1h ago  3M  9651k  -      16.2.7  cc266d6139f4  2ed049f74b66
> node-exporter.s-26-8-2-1  s-26-8-2-1  *:9100  running (2d)  1h ago  3M  24.3M  -      0.18.1  e5a616e4b9cf  817cc5370e7e
> osd.90                    s-26-8-2-1          running (2d)  1h ago  3M  25.6G  4096M  16.2.7  cc266d6139f4  beb2ea3efb3b
> 
> root@s-26-8-2-1:~# cephadm ls|grep osd
>"name": "osd.90",
>"systemd_unit": "ceph-1ef45b26-dbac-11eb-a357-616c355f48cb@osd.90",
>"service_name": "osd",
> 
> Can you please tell me how to reset this error message?
> 
> WBR,
>Fyodor
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to remove stuck daemon?

2022-01-26 Thread Eugen Block

Hi,

have you tried to failover the mgr service? I noticed similar  
behaviour in Octopus.



Zitat von Fyodor Ustinov :


Hi!

No one knows how to fix it?


- Original Message -

From: "Fyodor Ustinov" 
To: "ceph-users" 
Sent: Tuesday, 25 January, 2022 11:29:53
Subject: [ceph-users] How to remove stuck daemon?



Hi!

I have Ceph cluster version 16.2.7 with this error:

root@s-26-9-19-mon-m1:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
   daemon osd.91 on s-26-8-2-1 is in error state

But I don't have that osd anymore. I deleted it.

root@s-26-9-19-mon-m1:~# ceph orch ps|grep s-26-8-2-1
crash.s-26-8-2-1          s-26-8-2-1          running (2d)  1h ago  3M  9651k  -      16.2.7  cc266d6139f4  2ed049f74b66
node-exporter.s-26-8-2-1  s-26-8-2-1  *:9100  running (2d)  1h ago  3M  24.3M  -      0.18.1  e5a616e4b9cf  817cc5370e7e
osd.90                    s-26-8-2-1          running (2d)  1h ago  3M  25.6G  4096M  16.2.7  cc266d6139f4  beb2ea3efb3b

root@s-26-8-2-1:~# cephadm ls|grep osd
   "name": "osd.90",
   "systemd_unit": "ceph-1ef45b26-dbac-11eb-a357-616c355f48cb@osd.90",
   "service_name": "osd",

Can you please tell me how to reset this error message?

WBR,
   Fyodor
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Limitations of ceph fs snapshot mirror for read-only folders?

2022-01-26 Thread Manuel Holtgrewe
Dear all,

I want to mirror a snapshot in Ceph v16.2.6 deployed with cephadm
using the stock quay.io images. My source file system has a folder
"/src/folder/x" where "/src/folder" has mode "ug=r,o=", in other words
no write permissions for the owner (root).

The sync of a snapshot "initial" now fails with the following log excerpt.

remote_mkdir: remote epath=./src/folder/x
remote_mkdir: failed to create remote directory=./src/folder/x: (13)
Permission denied
do_synchronize: closing local directory=./src/folder
do_synchronize: closing local directory=./src/
do_synchronize: closing local directory=.
post_sync_close_handles
do_sync_snaps: failed to synchronize dir_root=/src/folder, snapshot=initial
sync_snaps: failed to sync snapshots for dir_root=/src/folder

The capabilities on the remote site are:

client.mirror-tier-2-remote
   key: REDACTED
   caps: [mds] allow * fsname=cephfs
   caps: [mon] allow r fsname=cephfs
   caps: [osd] allow * tag cephfs data=cephfs

I also just reported this in the tracker [1]. Can anyone think of a
workaround (along the lines of "sudo, make me a sandwich" ;-))?

Best wishes,
Manuel

[1] https://tracker.ceph.com/issues/54017
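
One untested idea, assuming the remote mkdir only fails because the snapshotted
read-only mode of the parent is applied before its children are created: relax
the mode on the source before taking the snapshot that gets mirrored, and
restore it afterwards (the mirrored copy will then record the relaxed mode):

chmod u+w /src/folder                  # make the parent writable for root again
mkdir /src/folder/.snap/initial-rw     # CephFS snapshots are created via mkdir in .snap
# ... wait for cephfs-mirror to sync this snapshot to the remote cluster ...
chmod u-w /src/folder                  # back to the original ug=r,o= mode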
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is it possible to stripe rados object?

2022-01-26 Thread lin yunfan
Hi,
I know that with RBD and CephFS there is a striping setting that stripes
data across multiple RADOS objects.
Is it possible to use the librados API to stripe a large object into many
small ones?

linyunfan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to remove stuck daemon?

2022-01-26 Thread Fyodor Ustinov
Hi!

I restarted mgr - it didn't help. Or do you mean something else?
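
For what it's worth, restarting the mgr daemon and failing it over to a standby
are slightly different operations; the failover Eugen suggests can be triggered
like this (assuming at least one standby mgr is available):

ceph mgr stat          # shows the currently active mgr
ceph mgr fail          # fail the active mgr so a standby takes over (a name can also be given)
ceph health detail     # re-check whether the stale osd.91 entry is gone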

> Hi,
> 
> have you tried to failover the mgr service? I noticed similar
> behaviour in Octopus.
> 
> 
> Zitat von Fyodor Ustinov :
> 
>> Hi!
>>
>> No one knows how to fix it?
>>
>>
>> - Original Message -
>>> From: "Fyodor Ustinov" 
>>> To: "ceph-users" 
>>> Sent: Tuesday, 25 January, 2022 11:29:53
>>> Subject: [ceph-users] How to remove stuck daemon?
>>
>>> Hi!
>>>
>>> I have Ceph cluster version 16.2.7 with this error:
>>>
>>> root@s-26-9-19-mon-m1:~# ceph health detail
>>> HEALTH_WARN 1 failed cephadm daemon(s)
>>> [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
>>>daemon osd.91 on s-26-8-2-1 is in error state
>>>
>>> But I don't have that osd anymore. I deleted it.
>>>
>>> root@s-26-9-19-mon-m1:~# ceph orch ps|grep s-26-8-2-1
>>> crash.s-26-8-2-1          s-26-8-2-1          running (2d)  1h ago  3M  9651k  -      16.2.7  cc266d6139f4  2ed049f74b66
>>> node-exporter.s-26-8-2-1  s-26-8-2-1  *:9100  running (2d)  1h ago  3M  24.3M  -      0.18.1  e5a616e4b9cf  817cc5370e7e
>>> osd.90                    s-26-8-2-1          running (2d)  1h ago  3M  25.6G  4096M  16.2.7  cc266d6139f4  beb2ea3efb3b
>>>
>>> root@s-26-8-2-1:~# cephadm ls|grep osd
>>>"name": "osd.90",
>>>"systemd_unit": "ceph-1ef45b26-dbac-11eb-a357-616c355f48cb@osd.90",
>>>"service_name": "osd",
>>>
>>> Can you please tell me how to reset this error message?
>>>
>>> WBR,
>>>Fyodor
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: switch restart facilitating cluster/client network.

2022-01-26 Thread Marc
Thanks for the tips!!!

> 
> I would still set noout on relevant parts of the cluster in case something
> goes south and it does take longer than 2 minutes. Otherwise OSDs will
> start outing themselves after 10 minutes or so by default and then you
> have a lot of churn going on.
> 
> The monitors will be fine unless you lose quorum, but even so
> they'll just recover once the switch comes back. You just won't be able to
> make changes to the cluster if you lose mon quorum, nor will the OSDs
> start recovering etc. until that occurs.
> 
> Depending on which version of Ceph/libvirt/etc. you are running: with older
> releases I have seen issues where a handful of VMs get indefinitely stuck
> with really high I/O wait afterwards and need to be manually rebooted on
> occasion when doing something like this.
> 
> As another user mentioned, the kernel's softlockup handler kicks in after
> 120 seconds by default, so you'll see lots of stack traces in the VMs from
> processes blocked on I/O if the reboot and re-peering don't all happen
> within those two minutes.
> 
> If you can afford to shut down all the VMs in the cluster, it might be for
> the best, as they'll be losing I/O anyway...
> 
> 
> On Tue, Jan 25, 2022, 4:27 AM Marc wrote:
> 
> 
> 
> If the switch needs an update and has to be restarted (expected downtime:
> 2 minutes), can I just leave the cluster as it is, because Ceph will handle
> this correctly? Or should I e.g. put some VMs I am running in pause mode,
> or even stop them? What happens to the monitors? Can they handle this, or
> would it be better to go from 3 monitors to 1?
> 
> 
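
For reference, a minimal sketch of the noout approach mentioned above (host
names are placeholders; the set-group form is available on recent releases):

ceph osd set noout                     # before the switch reboot, cluster-wide
# ... switch maintenance ...
ceph osd unset noout                   # afterwards

# or limit the flag to the hosts behind that switch:
ceph osd set-group noout host1 host2
ceph osd unset-group noout host1 host2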

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible to stripe rados object?

2022-01-26 Thread Sebastian Wagner
libradosstriper ?
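
For a quick test without writing any code, the rados CLI can also exercise the
striper; a rough sketch (pool and object names are made up, and the chunk names
shown are only what I would expect libradosstriper to produce):

rados --pool testpool --striper put bigobject ./bigfile
# the data should now be spread over several backing objects, typically named
# bigobject.0000000000000000, bigobject.0000000000000001, ...
rados --pool testpool ls | grep bigobject
rados --pool testpool --striper get bigobject ./bigfile.out

The same layout is exposed programmatically via the libradosstriper API
(radosstriper/libradosstriper.h) on top of a normal librados IoCtx.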

Am 26.01.22 um 10:16 schrieb lin yunfan:
> Hi,
> I know that with RBD and CephFS there is a striping setting that stripes
> data across multiple RADOS objects.
> Is it possible to use the librados API to stripe a large object into many
> small ones?
>
> linyunfan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Different OSD file structure

2022-01-26 Thread Zoth
I've got a cluster with different OSD directory structures: some OSDs were
updated to 15.2.12 and the others are still on 15.2.9 (BlueStore).


No problem so far with the cluster, but I think it's better to normalize 
the situation.


*15.2.9*
drwxr-xr-x 23 ceph ceph 4096 Nov 30 15:50 ../
lrwxrwxrwx  1 ceph ceph   24 Nov 22 14:20 block -> /dev/data-vg03/data-lv03
lrwxrwxrwx  1 ceph ceph   20 Nov 22 14:20 block.db -> /dev/db-vg03/db-lv03
lrwxrwxrwx  1 ceph ceph   22 Nov 22 14:20 block.wal -> /dev/wal-vg03/wal-lv03

-rw-------  1 ceph ceph   37 Nov 22 14:20 ceph_fsid
-rw-------  1 ceph ceph   37 Nov 22 14:20 fsid
-rw-------  1 ceph ceph   56 Nov 22 14:20 keyring
-rw-------  1 ceph ceph    6 Nov 22 14:20 ready
-rw-------  1 ceph ceph    3 Nov 22 14:21 require_osd_release
-rw-------  1 ceph ceph   10 Nov 22 14:20 type
-rw-------  1 ceph ceph    3 Nov 22 14:20 whoami

*15.2.12*
-rw-r--r--  1 ceph ceph  460 Jan 17 21:29 activate.monmap
lrwxrwxrwx  1 ceph ceph   24 Jan 17 21:29 block -> /dev/data-vg01/data-lv01
lrwxrwxrwx  1 ceph ceph   20 Jan 17 21:29 block.db -> /dev/db-vg01/db-lv01
lrwxrwxrwx  1 ceph ceph   22 Jan 17 21:29 block.wal -> /dev/wal-vg01/wal-lv01

-rw-------  1 ceph ceph    2 Jan 17 21:29 bluefs
-rw-------  1 ceph ceph   37 Jan 17 21:29 ceph_fsid
-rw-r--r--  1 ceph ceph   37 Jan 17 21:29 fsid
-rw-------  1 ceph ceph   56 Jan 17 21:29 keyring
-rw-------  1 ceph ceph    8 Jan 17 21:29 kv_backend
-rw-------  1 ceph ceph   21 Jan 17 21:29 magic
-rw-------  1 ceph ceph    4 Jan 17 21:29 mkfs_done
-rw-------  1 ceph ceph   41 Jan 17 21:29 osd_key
-rw-------  1 ceph ceph    6 Jan 17 21:29 ready
-rw-------  1 ceph ceph    3 Jan 17 21:29 require_osd_release
-rw-------  1 ceph ceph   10 Jan 17 21:29 type
-rw-------  1 ceph ceph    3 Jan 17 21:29 whoami


What's the best way to normalize all the disks to version 15.2.12 ?
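
Not an answer to the directory layout itself, but for checking which daemons
are still running the older code before and after normalizing, something like
this helps:

ceph versions              # running version per daemon type, cluster-wide
ceph tell 'osd.*' version  # ask each OSD individually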


Thx guys !
Sylvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-26 Thread Irek Fasikhov
Hi.
Basic logic:
1. bucket lifecycle policy transition (see the hedged example after this list)
2. radosgw-admin gc process --include-all

3.1. rados ls -p pool | grep  > bucket_objects.txt
3.2. rados listxattr -p pool objname | xargs -L1 echo rados getattr -p pool objname >> objname.txt
3.3. rados create -p pool objname
3.4. cat objname.txt | xargs -L1 echo rados setattr -p pool objname attr value

4. radosgw-admin metadata get bucket.instance.. | tee bucket.json bucket_backup.json
5. change placement_rule to "default-placement/NEW_CLASS"
6. radosgw-admin metadata rm bucket.instance..
7. radosgw-admin metadata put bucket.instance.. < bucket.json
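
As a hedged illustration of step 1 above (bucket name, endpoint and storage
class are placeholders; RGW applies S3 lifecycle transitions to storage
classes, and rgw_lc_debug_interval can speed this up while testing):

# lc.json -- transition every current object to the new storage class
cat > lc.json <<'EOF'
{
  "Rules": [
    {
      "ID": "move-to-new-class",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 1, "StorageClass": "NEW_CLASS" } ]
    }
  ]
}
EOF

aws --endpoint-url http://rgw.example.com:8080 \
    s3api put-bucket-lifecycle-configuration \
    --bucket mybucket --lifecycle-configuration file://lc.json

# step 2, once the lifecycle has processed the bucket:
radosgw-admin gc process --include-all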

Tue, 25 Jan 2022 at 21:31, Frédéric Nass :

>
> > On 25/01/2022 at 18:28, Casey Bodley wrote:
> > On Tue, Jan 25, 2022 at 11:59 AM Frédéric Nass
> >  wrote:
> >>
> >>> On 25/01/2022 at 14:48, Casey Bodley wrote:
> >>> On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
> >>>  wrote:
>  Hello,
> 
>  I've just heard about storage classes and imagined how we could use
> them
>  to migrate all S3 objects within a placement pool from an ec pool to a
>  replicated pool (or vice-versa) for data resiliency reasons, not to
> save
>  space.
> 
>  It looks possible since ;
> 
>  1. data pools are associated to storage classes in a placement pool
>  2. bucket lifecycle policies can take care of moving data from a
> storage
>  class to another
>  3. we can set a user's default_storage_class to have all new objects
>  written by this user reach the new storage class / data pool.
>  4. after all objects have been transitioned to the new storage class,
> we
>  can delete the old storage class, rename the new storage class to
>  STANDARD so that it's been used by default and unset any user's
>  default_storage_class setting.
> >>> i don't think renaming the storage class will work the way you're
> >>> hoping. this storage class string is stored in each object and used to
> >>> locate its data, so renaming it could render the transitioned objects
> >>> unreadable
> >> Hello Casey,
> >>
> >> Thanks for pointing that out.
> >>
> >> Do you believe this scenario would work if stopped at step 3.? (keeping
> >> default_storage_class set on users's profiles and not renaming the new
> >> storage class to STANDARD. Could we delete the STANDARD storage class
> >> btw since we would not use it anymore?).
> >>
> >> If there is no way to define the default storage class of a placement
> >> pool without naming it STANDARD could we imaging transitioning all
> >> objects again by:
> >>
> >> 4. deleting the storage class named STANDARD
> >> 5. creating a new one named STANDARD (using a ceph pool of the same data
> >> placement scheme than the one used by the temporary storage class
> >> created above)
> > instead of deleting/recreating STANDARD, you could probably just
> > modify its data pool. only do this once you're certain that there are
> > no more objects in the old data pool. you might need to wait for
> > garbage collection to clean up the tail objects there too (or force it
> > with 'radosgw-admin gc process --include-all')
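
A hedged sketch of what "modify its data pool" could look like (zone,
placement id and pool names are placeholders; double-check against
'radosgw-admin zone placement modify --help' on your release and restart the
RGWs afterwards):

radosgw-admin zone placement modify \
    --rgw-zone default \
    --placement-id default-placement \
    --storage-class STANDARD \
    --data-pool default.rgw.buckets.data.new
radosgw-admin period update --commit          # needed in realm/multisite setups
radosgw-admin zone get --rgw-zone default     # verify the new data_pool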
>
> Interesting scenario. So in the end we'd have objects named after both
> storage classes in the same ceph pool, the old ones named after the new
> storage class name and the new ones being written after the STANDARD
> storage class, right?
>
> >
> >> 6. transitioning all objects again to the new STANDARD storage class.
> >> Then delete the temporary storage class.
> > i think this step 6 would run into the
> > https://tracker.ceph.com/issues/50974 that Konstantin shared - if the
> > two storage classes have the same pool name, the transition doesn't
> > actually take effect. you might consider leaving this 'temporary'
> > storage class around, but pointing the defaults back at STANDARD
>
> Well, in step 6., I'd thought about using another new pool for the
> recreated STANDARD storage class (to avoid the issue shared by
> Konstantin , thanks to him btw) and move all objects to this new pool
> again in a new global transition.
>
> But, I understand you'd recommend avoiding deleting/recreating STANDARD
> and just modify the STANDARD data pool after GC execution, am I right?
>
> Frédéric.
>
> >
> >> ?
> >>
> >> Best regards,
> >>
> >> Frédéric.
> >>
>  Would that work?
> 
>  Anyone tried this with success yet?
> 
>  Best regards,
> 
>  Frédéric.
> 
>  --
>  Cordialement,
> 
>  Frédéric Nass
>  Direction du Numérique
>  Sous-direction Infrastructures et Services
> 
>  Tél : 03.72.74.11.35
> 
>  ___
>  ceph-users mailing list -- ceph-users@ceph.io
>  To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Do not use VMware Storage I/O Control with Ceph iSCSI GWs!

2022-01-26 Thread Frédéric Nass

Hi,

For anyone using VMware ESXi (6.7) with Ceph iSCSI gateways (Nautilus), I
thought you might benefit from our experience: I have finally identified
what was causing a permanent ~500 MB/s and ~4k IOPS load on our cluster,
specifically on one of our RBD images used as a VMware datastore, and it
was Storage I/O Control. I am not sure whether this is a bug that could be
taken care of on the Ceph side (such as a misinterpretation of a SCSI
instruction that the ESXi would then replay madly), but disabling Storage I/O
Control definitely solved the problem. By disabling I mean choosing
"Disable Storage I/O Control **and** statistics collection" on each
datastore.


Regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring ceph cluster

2022-01-26 Thread David Orman
What version of Ceph are you using? Newer versions deploy a dashboard and
prometheus module, which has some of this built in. It's a great start to
seeing what can be done using Prometheus and the built in exporter. Once
you learn this, if you decide you want something more robust, you can do an
external deployment of Prometheus (clusters), Alertmanager, Grafana, and
all the other tooling that might interest you for a more scalable solution
when dealing with more clusters. It's the perfect way to get your feet wet
and it showcases a lot of the interesting things you can do with this
solution!

https://docs.ceph.com/en/latest/mgr/dashboard/
https://docs.ceph.com/en/latest/mgr/prometheus/
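
If these modules are not already enabled (cephadm usually turns them on by
default), getting a first look is just:

ceph mgr module enable prometheus    # exporter on the active mgr, port 9283 by default
ceph mgr module enable dashboard
ceph mgr module ls                   # confirm what is enabled
curl -s http://<active-mgr-host>:9283/metrics | head    # sanity-check the exporter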

David

On Wed, Jan 26, 2022 at 1:42 AM Michel Niyoyita  wrote:

> Thank you for your email Szabo, these can be helpful. Can you provide
> links so that I can start to work on it?
>
> Michel.
>
> On Tue, 25 Jan 2022, 18:51 Szabo, Istvan (Agoda), 
> wrote:
>
> > Which monitoring tool? Like prometheus or nagios style thing?
> > We use sensu for keepalive and ceph health reporting + prometheus with
> > grafana for metrics collection.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2022. Jan 25., at 22:38, Michel Niyoyita  wrote:
> >
> >
> > Hello team,
> >
> > I would like to monitor my ceph cluster using one of the
> > monitoring tools, does someone have any advice on that?
> >
> > Michel
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS Snapshot Scheduling stops creating Snapshots after a restart of the Manager

2022-01-26 Thread Sebastian Mazza
I have a problem with the snap_schedule MGR module. It seems to forget at least
parts of the configuration after the active MGR is restarted.
The following CLI commands (lines starting with ‘$’) and their stdout (lines
starting with ‘>’) demonstrate the problem.

$ ceph fs snap-schedule add /shares/users 1h 2021-10-31T18:00
> Schedule set for path /shares/users

$ ceph fs snap-schedule retention add /shares/users 14h10d12m 
> Retention added to path /shares/users

Wait until the next complete hour.

$ ceph fs snap-schedule status /shares/users
> {"fs": "cephfs", "subvol": null, "path": "/shares/users", "rel_path": 
> "/shares/users", "schedule": "1h", "retention": {"h": 14, "d": 10, "m": 12}, 
> "start": "2021-10-31T18:00:00", "created": "2022-01-26T23:52:03", "first": 
> "2022-01-27T00:00:00", "last": "2022-01-27T00:00:00", "last_pruned": 
> "2022-01-27T00:00:00", "created_count": 1, "pruned_count": 1, "active": true}

Now everything looks and works as expected. However, if I restart the active
MGR, no new snapshots are created and the status command unexpectedly
reports null for some of the properties.

$ systemctl restart ceph-mgr@apollon.service

$ ceph fs snap-schedule status /shares/users
> {"fs": "cephfs", "subvol": null, "path": "/shares/users", "rel_path": 
> "/shares/users", "schedule": "1h", "retention": {}, "start": 
> "2021-10-31T18:00:00", "created": "2022-01-26T23:52:03", "first": null, 
> "last": null, "last_pruned": null, "created_count": 0, "pruned_count": 0, 
> "active": true}


I did look into the source file mgr/snap_schedule/fs/schedule.py. Since I have
never used Python, I do not understand much of it, but I do understand the SQL
code it contains.
Therefore, I saved the SQLite DB dump before and after an MGR restart with the
following commands:

List RADOS objects in order to find the SQLite DB dump:
$ rados --pool fs.metadata-root-pool --namespace cephfs-snap-schedule ls 
> snap_db_v0

Copy the SQLite DB dump into a regular file:
$ rados --pool fs.metadata-root-pool --namespace cephfs-snap-schedule get 
snap_db_v0 /tmp/snap_db_v0

To my surprise, the SQLite DB dump never contains the information for
retention, first, last, and last_pruned.
The SQLite DB dump always looks like this:

BEGIN TRANSACTION;
CREATE TABLE schedules(
id INTEGER PRIMARY KEY ASC,
path TEXT NOT NULL UNIQUE,
subvol TEXT,
retention TEXT DEFAULT '{}',
rel_path TEXT NOT NULL
);
INSERT INTO "schedules" VALUES(2,'/shares/groups',NULL,'{}','/shares/groups');
INSERT INTO "schedules" 
VALUES(3,'/shares/backup-clients',NULL,'{}','/shares/backup-clients');
INSERT INTO "schedules" VALUES(4,'/shares/users',NULL,'{}','/shares/users');
CREATE TABLE schedules_meta(
id INTEGER PRIMARY KEY ASC,
schedule_id INT,
start TEXT NOT NULL,
first TEXT,
last TEXT,
last_pruned TEXT,
created TEXT NOT NULL,
repeat INT NOT NULL,
schedule TEXT NOT NULL,
created_count INT DEFAULT 0,
pruned_count INT DEFAULT 0,
active INT NOT NULL,
FOREIGN KEY(schedule_id) REFERENCES schedules(id) ON DELETE CASCADE,
UNIQUE (schedule_id, start, repeat)
);
INSERT INTO "schedules_meta" 
VALUES(2,2,'2021-10-31T18:00:00',NULL,NULL,NULL,'2022-01-21T11:41:35',3600,'1h',0,0,1);
INSERT INTO "schedules_meta" 
VALUES(3,3,'2021-10-31T13:30:00',NULL,NULL,NULL,'2022-01-21T11:41:41',21600,'6h',0,0,1);
INSERT INTO "schedules_meta" 
VALUES(4,4,'2021-10-31T18:00:00',NULL,NULL,NULL,'2022-01-26T23:52:03',3600,'1h',0,0,1);
COMMIT;


Why is the information about retention, first, last, and last_pruned not
part of the SQLite dump?
Is this the reason why my snapshot scheduling stops working after the active
MGR is restarted?


My ceph version is: 16.2.6


Thanks in advance,
Sebastian 



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring ceph cluster

2022-01-26 Thread Anthony D'Atri

What David said!

A couple of additional thoughts:

o Nagios (and derivatives like Icinga and check_mk) have been popular for 
years.  Note that they’re monitoring solutions vs metrics solutions — it’s good 
to have both.  One issue I’ve seen multiple times with Nagios-family monitoring 
is that over time as checks and the fleet grow, the server tends to bog down, 
and the litany of active checks starts taking longer to run than the check 
interval.  Prometheus alertmanager is more scalable, and with some thought most 
active checks can be recast in terms of metrics.

o Prometheus (forked node_exporter) was INVALUABLE to me when characterizing 
and engaging two separate SSD firmware design flaw issues. It includes a data 
query interface for ad-hoc queries and expression development

o Grafana pairs well with Prometheus for dashboard-style visualization and 
trending across many clusters / nodes


> On Jan 26, 2022, at 1:22 PM, David Orman  wrote:
> 
> What version of Ceph are you using? Newer versions deploy a dashboard and
> prometheus module, which has some of this built in. It's a great start to
> seeing what can be done using Prometheus and the built in exporter. Once
> you learn this, if you decide you want something more robust, you can do an
> external deployment of Prometheus (clusters), Alertmanager, Grafana, and
> all the other tooling that might interest you for a more scalable solution
> when dealing with more clusters. It's the perfect way to get your feet wet
> and it showcases a lot of the interesting things you can do with this
> solution!
> 
> https://docs.ceph.com/en/latest/mgr/dashboard/
> https://docs.ceph.com/en/latest/mgr/prometheus/
> 
> David
> 
> On Wed, Jan 26, 2022 at 1:42 AM Michel Niyoyita  wrote:
> 
>> Thank you for your email Szabo, these can be helpful. Can you provide
>> links so that I can start to work on it?
>> 
>> Michel.
>> 
>> On Tue, 25 Jan 2022, 18:51 Szabo, Istvan (Agoda), 
>> wrote:
>> 
>>> Which monitoring tool? Like prometheus or nagios style thing?
>>> We use sensu for keepalive and ceph health reporting + prometheus with
>>> grafana for metrics collection.
>>> 
>>> Istvan Szabo
>>> Senior Infrastructure Engineer
>>> ---
>>> Agoda Services Co., Ltd.
>>> e: istvan.sz...@agoda.com
>>> ---
>>> 
>>> On 2022. Jan 25., at 22:38, Michel Niyoyita  wrote:
>>> 
>>> 
>>> Hello team,
>>> 
>>> I would like to monitor my ceph cluster using one of the
>>> monitoring tools, does someone have any advice on that?
>>> 
>>> Michel
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> 
>>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG count deviation alert on OSDs of high weight

2022-01-26 Thread Nicola Mori
I set up a test cluster (Pacific 16.2.7 deployed with cephadm) with 
several HDDs of different sizes, 1.8 TB and 3.6 TB; they have weights 1.8 
and 3.6, respectively, with 2 pools (metadata + data for CephFS). I 
currently have a PG count varying from 177 to 182 for OSDs with small 
disks and from 344 to 352 for OSDs with big disks. To me everything looks fine: 
big disks have more PGs than small ones, and the ratio reflects the disk 
weight ratio quite nicely.
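
For reference, the per-OSD CRUSH weights and PG counts described above can be
cross-checked with:

ceph osd df tree    # shows CRUSH weight, size, and the PGS column per OSD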


Still I have one "high pg count deviation" warning for all the big OSDs 
in the monitoring section of the Ceph dashboard, with messages like this:


  OSD osd.4 on bofur deviates by more than 30% from average PG count.

I don't understand the reason for these warnings since, as I explained 
above, the PG count looks good to me. It looks like the monitoring 
doesn't take the disk weights into account, considering only the raw PG 
count for this metric and thus inevitably generating a warning. Can this 
be true? If so, is this the intended behavior or a bug?


Thanks in advance for any help.

Nicola
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io