[ceph-users] Ceph EC PG calculation

2020-11-17 Thread Szabo, Istvan (Agoda)
Hi,

I have 36 OSDs and get this error:
Error ERANGE:  pg_num 4096 size 6 would mean 25011 total pgs, which exceeds max 
10500 (mon_max_pg_per_osd 250 * num_in_osds 42)

If I want to calculate the max PGs in my cluster, how does it work if I have an EC pool?

I have a 4+2 EC data pool, and the others are replicated.

These are the pools:
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 597 
flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 598 flags 
hashpspool stripe_width 0 application rgw
pool 6 'sin.rgw.log' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 599 flags 
hashpspool stripe_width 0 application rgw
pool 7 'sin.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 600 flags 
hashpspool stripe_width 0 application rgw
pool 8 'sin.rgw.meta' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 601 lfor 0/393/391 
flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 10 'sin.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 602 
lfor 0/529/527 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 
application rgw
pool 11 'sin.rgw.buckets.data.old' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 603 
flags hashpspool stripe_width 0 application rgw
pool 12 'sin.rgw.buckets.data' erasure profile data-ec size 6 min_size 5 
crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn 
last_change 604 flags hashpspool,ec_overwrites stripe_width 16384 application 
rgw

So how can I calculate the PGs?
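
As far as I understand it, the check behind that error is just the sum over all
pools of pg_num * size (replica count for replicated pools, k+m for EC),
compared against mon_max_pg_per_osd * num_in_osds; the jq one-liner below is
only an illustration and assumes the JSON field names of a recent release:

  # replicated pools: (1 + 32 + 32 + 32 + 8 + 8 + 32) PGs * size 3  =   435
  # EC pool at the requested pg_num: 4096 PGs * size 6 (k=4 + m=2)  = 24576
  # total = 25011 PG instances vs. limit 250 * 42 in-OSDs           = 10500
  ceph osd pool ls detail -f json | jq '[.[] | .pg_num * .size] | add'

If that is right, with 42 "in" OSDs the EC pool would have to stay around
pg_num 1024 ((10500 - 435) / 6 is roughly 1677), unless mon_max_pg_per_osd is
raised. Is that the correct way to look at it?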

This is my osd tree:
ID   CLASS  WEIGHT     TYPE NAME              STATUS  REWEIGHT  PRI-AFF
 -1         534.38354  root default
 -5          89.06392      host cephosd-6s01
 36   nvme    1.74660          osd.36             up       1.0      1.0
  0   ssd    14.55289          osd.0              up       1.0      1.0
  8   ssd    14.55289          osd.8              up       1.0      1.0
 15   ssd    14.55289          osd.15             up       1.0      1.0
 18   ssd    14.55289          osd.18             up       1.0      1.0
 24   ssd    14.55289          osd.24             up       1.0      1.0
 30   ssd    14.55289          osd.30             up       1.0      1.0
 -3          89.06392      host cephosd-6s02
 37   nvme    1.74660          osd.37             up       1.0      1.0
  1   ssd    14.55289          osd.1              up       1.0      1.0
 11   ssd    14.55289          osd.11             up       1.0      1.0
 17   ssd    14.55289          osd.17             up       1.0      1.0
 23   ssd    14.55289          osd.23             up       1.0      1.0
 28   ssd    14.55289          osd.28             up       1.0      1.0
 35   ssd    14.55289          osd.35             up       1.0      1.0
-11          89.06392      host cephosd-6s03
 41   nvme    1.74660          osd.41             up       1.0      1.0
  2   ssd    14.55289          osd.2              up       1.0      1.0
  6   ssd    14.55289          osd.6              up       1.0      1.0
 13   ssd    14.55289          osd.13             up       1.0      1.0
 19   ssd    14.55289          osd.19             up       1.0      1.0
 26   ssd    14.55289          osd.26             up       1.0      1.0
 32   ssd    14.55289          osd.32             up       1.0      1.0
-13          89.06392      host cephosd-6s04
 38   nvme    1.74660          osd.38             up       1.0      1.0
  5   ssd    14.55289          osd.5              up       1.0      1.0
  7   ssd    14.55289          osd.7              up       1.0      1.0
 14   ssd    14.55289          osd.14             up       1.0      1.0
 20   ssd    14.55289          osd.20             up       1.0      1.0
 25   ssd    14.55289          osd.25             up       1.0      1.0
 31   ssd    14.55289          osd.31             up       1.0      1.0
 -9          89.06392      host cephosd-6s05
 40   nvme    1.74660          osd.40             up       1.0      1.0
  3   ssd    14.55289          osd.3              up       1.0      1.0
 10   ssd    14.55289          osd.10             up       1.0      1.0
 12   ssd    14.55289          osd.12             up       1.0      1.0
 21   ssd    14.55289          osd.21             up       1.0      1.0
 29   ssd    14.55289          osd.29             up       1.0      1.0
 33   ssd    14.55289

[ceph-users] Re: Accessing Ceph Storage Data via Ceph Block Storage

2020-11-17 Thread DHilsbos
Vaughan;

An absolute minimum for a Ceph cluster is really 3 servers, and even then
usable space will be 1/3 of raw space (see the archives of this mailing list
for many discussions of why size=2 is bad).

While it is possible to run other tasks on Ceph servers, memory utilization of 
Ceph processes can be quite large, so it's often discouraged, especially on 
memory constrained servers.

Would it be feasible to acquire a system with sufficient RAM to run both VMs?

I believe RBD can be cached, but I can't speak to how it's configured, or how 
well it works.  I believe you would want a really fast drive (SSD) to store the 
cache on.
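
For what it's worth, the knobs below are where the RAM-backed librbd cache
lives; this is only a sketch of where to look, not something I have tuned
myself, and an SSD-backed cache would need something outside librbd (e.g.
bcache/dm-cache on the client host):

  [client]
      rbd cache = true
      rbd cache writethrough until flush = true
      rbd cache size = 134217728        # e.g. 128 MB per image; default is 32 MB
      rbd cache max dirty = 100663296   # keep below rbd cache size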

Depending on your performance and storage volume needs, you might be able to 
get away with building a micro-cluster, based on ARM CPUs.

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com

-Original Message-
From: Vaughan Beckwith [mailto:vaughan.beckw...@bluesphere.co.za] 
Sent: Tuesday, November 17, 2020 3:54 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Accessing Ceph Storage Data via Ceph Block Storage

Hi All,

I'm not sure if this is the correct place to ask this question; I have tried
the channels, but have received very little help there.

I am currently very new to Ceph and am investigating it as a possible
replacement for a legacy application which used to provide us with replication.

At the moment my company has three servers: two primary servers running Ubuntu
and a backup server, also running Ubuntu. The two primary servers each host a
virtual machine, and it is these virtual machines that the office workers use
for shared folder access, email and as a domain server; the office workers are
not aware of the underlying Linux servers.  In the past the legacy software
would replicate the running VM files on both primary servers to the backup
server.  The replication is done at the underlying Linux host level and not
from within the guest VMs.  I was hoping that I could get Ceph to do this as
well.  From what I have read, and I speak under correction, the best Ceph
client type for this would be block access, whereby I would then mount the
block device and start up the VMs.  As I would be running the VMs as per the
normal routine, would Ceph then have to retrieve the large VM files from the
storage nodes across the LAN and bring the data back to the client to run the
VM?  Is there an option to cache certain parts of the data on certain clients?

Also, none of the primary servers as they currently stand have the capacity to
run both VMs together, so each primary has a dedicated VM which it runs.  The
backup server currently keeps replicated copies of both VM images from each
primary, with the replication provided by the legacy application.  I'm also
wondering if I need to get a fourth server, so that I have 2 clients and 2
storage nodes.

Any suggestions or help would be greatly appreciated.

Yours sincerely

Vaughan Beckwith
Bluesphere Technologies
BSC I.T. (Honours)

vaughan.beckw...@bluesphere.co.za
Telephone: 011 675 6354
Fax: (011) 675 6423

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Accessing Ceph Storage Data via Ceph Block Storage

2020-11-17 Thread Vaughan Beckwith
Hi All,

I'm not sure if this is the correct place to ask this question; I have tried
the channels, but have received very little help there.

I am currently very new to Ceph and am investigating it as a possible
replacement for a legacy application which used to provide us with replication.

At the moment my company has three servers: two primary servers running Ubuntu
and a backup server, also running Ubuntu. The two primary servers each host a
virtual machine, and it is these virtual machines that the office workers use
for shared folder access, email and as a domain server; the office workers are
not aware of the underlying Linux servers.  In the past the legacy software
would replicate the running VM files on both primary servers to the backup
server.  The replication is done at the underlying Linux host level and not
from within the guest VMs.  I was hoping that I could get Ceph to do this as
well.  From what I have read, and I speak under correction, the best Ceph
client type for this would be block access, whereby I would then mount the
block device and start up the VMs.  As I would be running the VMs as per the
normal routine, would Ceph then have to retrieve the large VM files from the
storage nodes across the LAN and bring the data back to the client to run the
VM?  Is there an option to cache certain parts of the data on certain clients?

Also, none of the primary servers as they currently stand have the capacity to
run both VMs together, so each primary has a dedicated VM which it runs.  The
backup server currently keeps replicated copies of both VM images from each
primary, with the replication provided by the legacy application.  I'm also
wondering if I need to get a fourth server, so that I have 2 clients and 2
storage nodes.

Any suggestions or help would be greatly appreciated.

Yours sincerely

Vaughan Beckwith
Bluesphere Technologies
BSC I.T. (Honours)

vaughan.beckw...@bluesphere.co.za
Telephone: 011 675 6354
Fax: (011) 675 6423

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS error: currently failed to rdlock, waiting. clients crashing and evicted

2020-11-17 Thread Thomas Hukkelberg
Hi all!

Hopefully some of you can shed some light on this. We have big problems with 
samba crashing when macOS smb clients access certain/random folders/files over 
vfs_ceph.

When browsing the CephFS folder in question directly on a Ceph node where
CephFS is mounted, we experience issues like slow directory listings. We
suspect that macOS fetching xattr metadata creates a lot of traffic, but it
should not lock up the cluster like this. In the logs we see both rdlocks and
wrlocks, but mostly rdlocks.

End clients experience spurious disconnects when the issue occurs, roughly up
to a handful of times a day. Is this a config issue? Have we hit a bug? It's
certainly not a feature :/
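
If it helps with debugging, we can capture output from the admin socket on the
active MDS while it happens; as far as I know these are just the standard
op-tracker/session commands (the daemon name is a placeholder):

  ceph health detail
  ceph daemon mds.<active-mds> dump_ops_in_flight   # should show the stuck rdlock requests
  ceph daemon mds.<active-mds> dump_blocked_ops
  ceph daemon mds.<active-mds> session ls           # map the client IDs above to hosts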

Any pointers on how to troubleshoot or rectify this problem are most welcome.

ceph version 14.2.11
samba version 4.12.10-SerNet-Ubuntu-10.focal
Supermicro X11, Intel Silver 4110, 9 Ceph nodes, 2x40GbE network, 150 OSD
spinners, NVMe DB/journal

--

2020-11-17 22:09:07.525706 [WRN] evicting unresponsive client bo-samba-03 
(3887652779), after 301.746 seconds
2020-11-17 22:09:07.525580 [INF] Evicting (and blacklisting) client session 
3877970532 (10.40.30.133:0/3971626932)
2020-11-17 22:09:07.525536 [WRN] evicting unresponsive client bo-samba-03 
(3877970532), after 302.034 seconds
2020-11-17 22:07:23.915412 [INF] Cluster is now healthy
2020-11-17 22:07:23.915381 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 
MDSs report slow requests)
2020-11-17 22:07:23.915330 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE 
(was: 1 clients failing to respond to capability release)
2020-11-17 22:07:23.064492 [INF] MDS health message cleared (mds.?): 1 slow 
requests are blocked > 30 secs
2020-11-17 22:07:23.064457 [INF] MDS health message cleared (mds.?): Client 
bo-samba-03 failing to respond to capability release
2020-11-17 22:07:17.524023 [WRN] client.3887663354 isn't responding to 
mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, 
sent 63.325997 seconds ago
2020-11-17 22:07:17.523987 [INF] Evicting (and blacklisting) client session 
3887663354 (10.40.30.133:0/3230547239)
2020-11-17 22:07:17.523967 [WRN] evicting unresponsive client bo-samba-03 
(3887663354), after 64.5412 seconds
2020-11-17 22:07:17.523610 [WRN] slow request 63.325528 seconds old, received 
at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup 
#0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=39, 
caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:07:17.523596 [WRN] 1 slow requests, 1 included below; oldest 
blocked for > 63.325529 secs
2020-11-17 22:07:19.255177 [WRN] Health check failed: 1 clients failing to 
respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:07:12.523453 [WRN] 1 slow requests, 0 included below; oldest 
blocked for > 58.325433 secs
2020-11-17 22:07:07.523382 [WRN] 1 slow requests, 0 included below; oldest 
blocked for > 53.325362 secs
2020-11-17 22:07:02.523360 [WRN] 1 slow requests, 0 included below; oldest 
blocked for > 48.325307 secs
2020-11-17 22:06:57.523218 [WRN] 1 slow requests, 0 included below; oldest 
blocked for > 43.325199 secs
2020-11-17 22:06:52.523203 [WRN] 1 slow requests, 0 included below; oldest 
blocked for > 38.325158 secs
2020-11-17 22:06:47.523105 [WRN] slow request 33.325065 seconds old, received 
at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup 
#0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=39, 
caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:06:47.523100 [WRN] 1 slow requests, 1 included below; oldest 
blocked for > 33.325065 secs
2020-11-17 22:06:51.431745 [WRN] Health check failed: 1 MDSs report slow 
requests (MDS_SLOW_REQUEST)
2020-11-17 22:06:20.045030 [INF] Cluster is now healthy
2020-11-17 22:06:20.045008 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 
MDSs report slow requests)
2020-11-17 22:06:20.044960 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE 
(was: 1 clients failing to respond to capability release)
2020-11-17 22:06:19.062307 [INF] MDS health message cleared (mds.?): 1 slow 
requests are blocked > 30 secs
2020-11-17 22:06:19.062253 [INF] MDS health message cleared (mds.?): Client 
bo-samba-03 failing to respond to capability release
2020-11-17 22:06:15.936150 [WRN] Health check failed: 1 clients failing to 
respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:06:12.522624 [WRN] client.3869410498 isn't responding to 
mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, 
sent 64.045677 seconds ago


--thomas

--
Thomas Hukkelberg
tho...@hovedkvarteret.no
+47 971 81 192
--
supp...@hovedkvarteret.no
+47 966 44 999




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGR restart loop

2020-11-17 Thread Frank Schilder
Addition: this happens only when I stop mon.ceph-01; I can stop any other MON
daemon without problems. I checked network connectivity and all hosts can see
all other hosts.

I already increased mon_mgr_beacon_grace to a huge value due to another bug a 
long time ago:

global advanced mon_mgr_beacon_grace 86400

This restart cycle seems to have a different cause. The log contains these
lines just before the MGR goes out:

Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.179 7f7c544ea700  1 mgr 
send_beacon active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.193 7f7c544ea700  0 
log_channel(cluster) log [DBG] : pgmap v4: 3215 pgs: 3208 active+clean, 7 
active+clean+scrubbing+deep; 689 TiB data, 877 TiB used, 1.1 PiB / 1.9 PiB avail
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 
log_channel(cluster) log [INF] : Manager daemon ceph-03 is unresponsive.  No 
standby daemons available.
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 
log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.313 7f756700  0 
log_channel(cluster) log [DBG] : mgrmap e1330: no daemons active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700 -1 mgr 
handle_mgr_map I was active but no longer am
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700  1 mgr 
respawn  e: '/usr/bin/ceph-mgr'

The beacon has been sent. Why does it not arrive at the MONs? There is very
little load right now.
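
If anyone wants more data, I can capture something like the following while
reproducing it (daemon IDs are placeholders, just a debugging sketch):

  ceph mgr dump | grep -E '"epoch"|"active_name"|"active_change"'
  # raise beacon/monclient logging on the mgr and the quorum leader for a minute
  ceph daemon mgr.ceph-03 config set debug_mgr 10
  ceph daemon mgr.ceph-03 config set debug_monc 10
  ceph daemon mon.ceph-02 config set debug_mon 10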

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 17 November 2020 16:25:36
To: ceph-users@ceph.io
Subject: [ceph-users] MGR restart loop

Dear cephers,

I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do 
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON 
and MGR collocated on a host, 3 hosts in total. So far, the procedure was to stop
the daemons on the server and do the maintenance. Now I'm stuck at the last
server, because MGR fail-over does not work. The remaining MGR instances go
into a restart loop.

In an attempt to mitigate this, I stopped all but 1 MGR on a node that is done 
with maintenance. Everything fine. However, as soon as I stop the last MON I 
need to do maintenance on, the last remaining MGR goes into a restart loop all 
by itself. As far as I can see, the MGR does actually not restart, it just gets 
thrown out of the cluster. Here is a ceph status before stopping mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3208 active+clean
             7    active+clean+scrubbing+deep

As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is 
collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and 
mon.ceph-03. Here are ceph status snapshots after shutting down mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
             8    active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
no active mgr
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: no daemons active
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
             8    active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active, starting)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects:

[ceph-users] MGR restart loop

2020-11-17 Thread Frank Schilder
Dear cephers,

I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do 
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON 
and MGR collocated on a host, 3 hosts in total. So far, the procedure was to stop
the daemons on the server and do the maintenance. Now I'm stuck at the last
server, because MGR fail-over does not work. The remaining MGR instances go
into a restart loop.

In an attempt to mitigate this, I stopped all but 1 MGR on a node that is done 
with maintenance. Everything fine. However, as soon as I stop the last MON I 
need to do maintenance on, the last remaining MGR goes into a restart loop all 
by itself. As far as I can see, the MGR does actually not restart, it just gets 
thrown out of the cluster. Here is a ceph status before stopping mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
 
  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3208 active+clean
             7    active+clean+scrubbing+deep

As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is 
collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and 
mon.ceph-03. Here are ceph status snapshots after shutting down mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
 
  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
             8    active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
no active mgr
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
 
  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: no daemons active
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
             8    active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
 
  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active, starting)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
             8    active+clean+scrubbing+deep

It is cycling through these 3 states and I couldn't find a reason why. The node 
ceph-01 is not special in any way.

Any hint would be greatly appreciated.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

2020-11-17 Thread Anthony D'Atri

> 
> I'm probably going to get crucified for this

Naw.   The <> in your From: header, though ….

;)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

2020-11-17 Thread DHilsbos
Phil;

I'm probably going to get crucified for this, but I put a year of testing into 
this before determining it was sufficient to the needs of my organization...

If the primary concerns are capability and cost (not top of the line 
performance), then I can tell you that we have had great success utilizing 
Intel Atom C3000 series CPUs.  We have built 2 clusters with capacities on the 
order of 130TiB, for less than $30,000 each.  The initial clusters cost $20,000 
each, for half the capacity.  Our testing cluster cost $8,000 to build, and 
most of that hardware could have been wrapped into the first production cluster 
build.

For those keeping track, no that is not the lowest cost / unit space.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: Phil Merricks [mailto:seffyr...@gmail.com] 
Sent: Monday, November 16, 2020 5:52 PM
To: Janne Johansson
Cc: Hans van den Bogert; ceph-users
Subject: [ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - 
Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

Thanks for all the replies folks.  I think it's testament to the
versatility of Ceph that there are some differences of opinion and
experience here.

With regards to the purpose of this cluster, it is providing distributed
storage for stateful workloads of containers.  The data produced is
somewhat immutable, it can be regenerated over time, however that does
cause some slowdown for the teams that use the data as part of their
development pipeline.  To the best of my understanding the goals here were
to provide a data loss safety net but still make efficient use of the block
devices assigned to the cluster, which is I imagine where the EC direction
came from.  The cluster is 3 nodes with the OSDs themselves mainly housed
in two of those.  Additionally there was an initiative to 'use what we
have' (or as I like to put it, 'cobble it together') with commodity
hardware that was immediately available to hand.  The departure of my
predecessor has left some unanswered questions so I am not going to bother
second guessing beyond what I already know.  As I understand it my steps
are:

1:  Move off the data and scrap the cluster as it stands currently.
(already under way)
2:  Group the block devices into pools of the same geometry and type (and
maybe do some tiering?)
3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily
compromised by a loss at the bare metal level
4. Add more hosts/OSDs if EC is the right solution (this may be outside of
the scope of this implementation, but I'll keep a-cobblin'!)

The additional ceph outputs follow:
ceph osd tree 
ceph osd erasure-code-profile get cephfs-media-ec 

I am fully prepared to do away with EC to keep things simple and efficient
in terms of CPU occupancy.



On Mon, 16 Nov 2020 at 02:32, Janne Johansson  wrote:

> On Mon, 16 Nov 2020 at 10:54, Hans van den Bogert <
> hansbog...@gmail.com> wrote:
>
> > > With this profile you can only loose one OSD at a time, which is really
> > > not that redundant.
> > That's rather situation dependent. I don't have really large disks, so
> > the repair time isn't that large.
> > Further, my SLO isn't that high that I need 99.xxx% uptime, if 2 disks
> > break in the same repair window, that would be unfortunate, but I'd just
> > grab a backup from a mirroring cluster. Looking at it from another
> > perspective, I came from a single host RAID5 scenario, I'd argue this is
> > better since I can survive a host failure.
> >
> > Also this is a sliding problem right? Someone with K+3 could argue K+2
> >   is not enough as well.
> >
>
> There are a few situations like when you are moving data or when a scrub
> found a bad PG where you are suddenly out of copies in case something bad
> happens. I think Raid5 operators also found this out, when your cold spare
> disk kicks in, you find that old undetected error on one of the other disks
> and think repairs are bad or stress your raid too much.
>
> As with raids, the cheapest resource is often the actual disks and not
> operator time, restore-wait-times and so on, so that is why many on this
> list advocates for K+2-or-more, or Repl=3 because we have seen the errors
> one normally didn't expect. Yes, a double surprise of two disks failing in
> the same night after running for years is uncommon, but it is not as
> uncommon to resize pools, move PGs around or find a scrub error or two some
> day.
>
> So while one could always say "one more drive is better than your amount",
> there are people losing data with repl=2 or K+1 because some more normal
> operation was in flight and _then_ a single surprise happens.  So you can
> have a weird reboot, causing those PGs needing backfill later, and if one
> of the uptodate hosts have any single surprise during the recove

[ceph-users] Re: Ceph RBD - High IOWait during the Writes

2020-11-17 Thread Tony Liu
I am not sure any configuration tuning would help here.
The bottleneck is on the HDDs. In my case, I have an SSD for
WAL/DB and it provides pretty good write performance.
The part I don't quite understand in your case is that
random read is quite fast. Due to HDD seek latency,
random reads are normally slow. Not sure how they are so
fast in your case.
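
To put numbers on it, I would run a quick 4k fio directly against an RBD image
from the client (the device name below is a placeholder, and note that the
randwrite run destroys data on that image):

  fio --name=randwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based
  fio --name=randread --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 --runtime=60 --time_based

If the randread number is far above what spinners can physically deliver, it
is probably being served from a cache somewhere along the path.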

Tony
> -Original Message-
> From: athreyavc 
> Sent: Tuesday, November 17, 2020 8:40 AM
> Cc: ceph-users 
> Subject: [ceph-users] Re: Ceph RBD - High IOWait during the Writes
> 
> I disabled CephX authentication now. Though the performance is
> slightly better, it is not yet there.
> 
> Are there any other recommendations for all HDD ceph clusters ?
> 
> From another thread
> https://lists.ceph.io/hyperkitty/list/ceph-
> us...@ceph.io/thread/DFHXXN4KKI5PS7LYPZJO4GYHU67JYVVL/
> 
> 
> *In our test based on v15.2.2, I found
> osd_numa_prefer_iface/osd_numa_auto_affinity make only half the CPU used.
> For 4K RW, it makes performance drop much. So you can check whether this
> occurs.*
> 
> I do see "set_numa_affinity unable to identify cluster interface" alerts.
> But I am not sure that is a cause for concern.
> 
> Thanks and regards,
> 
> Athreya
> 
> On Thu, Nov 12, 2020 at 1:30 PM athreyavc  wrote:
> 
> > Hi,
> >
> > Thanks for the email, but we are not using RAID at all; we are using
> > LSI HBA 9400-8e HBAs. Each HDD is configured as an OSD.
> >
> > On Thu, Nov 12, 2020 at 12:19 PM Edward kalk  wrote:
> >
> >> for certain CPU architecture, disable spectre and meltdown
> mitigations.
> >> (be certain network to physical nodes is secure from internet access)
> >> (use apt proxy, http(s), curl proxy servers) Try to toggle on or off
> >> the physical on disk cache. (raid controller
> >> command)
> >> ^I had same issue, doing both of these fixed it. In my case the disks
> >> I had needed on disk cache hard set to ‘on’. raid card default was
> not good.
> >> (be sure to have diverse power and UPS protection if needed to run on
> >> disk cache on) (good RAID. battery if using raid cache improves
> >> perf.)
> >>
> >> to see the perf impact of spec. and melt. mitigation vs. off, run: dd
> >> if=/dev/zero of=/dev/null ^i run for 5 seconds and then ctl+c will
> >> show a max north bridge ops.
> >>
> >> to see the difference in await and IOPs when toggle RAID card
> >> features and on disk cache I run: iostat -xtc 2 and use fio to
> >> generate disk load for testing IOPs. (google fio example
> >> commands)
> >> ^south bridge +raid controller to disks ops and latency.
> >>
> >> -Edward Kalk
> >> Datacenter Virtualization
> >> Performance Engineering
> >> Socket Telecom
> >> Columbia, MO, USA
> >> ek...@socket.net
> >>
> >> > On Nov 12, 2020, at 4:45 AM, athreyavc  wrote:
> >> >
> >> > Jumbo frames enabled  and MTU is 9000
> >>
> >>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reclassify crush map

2020-11-17 Thread Seena Fallah
Also, when I reclassify-bucket to a non-existent base bucket it says: "default
parent test does not exist".
But as documented in
https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/ it should
create it!
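
My current guess (untested, bucket names below are placeholders) is that every
bucket between the reclassified root and the OSDs -- the datacenter and each
rack -- also needs its own --reclassify-bucket entry, and then comparing the
maps before injecting anything:

  crushtool -i crush-map --reclassify \
      --reclassify-root default hdd \
      --set-subtree-class default hdd \
      --reclassify-root hiops ssd \
      --reclassify-bucket <datacenter-bucket> ssd default \
      --reclassify-bucket <rack-bucket> ssd default \
      --reclassify-bucket hiops ssd default \
      -o new-crush-map
  crushtool -i crush-map --compare new-crush-map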

On Tue, Nov 17, 2020 at 6:05 PM Seena Fallah  wrote:

> Hi all,
>
> I want to reclassify my crushmap. I have two roots, one hiops and one
> default. In hiops root I have one datacenter and in that I have three rack
> and in each rack I have 3 osds. When I run the command below it says "item
> -55 in bucket -54 is not also a reclassified bucket". I see the new
> crushmap before reclassify-bucket command and item -55 was for my
> datacenter! What should I do? If I reclassify my datacenter I will lose my
> datacenter!
> crushtool -i crush-map --reclassify \
> --reclassify-root default hdd \
> --set-subtree-class default hdd \
> --reclassify-root hiops ssd \
> --reclassify-bucket hiops ssd default \
> -o new-crush-map
>
> Thanks
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Module 'dashboard' has failed: '_cffi_backend.CDataGCP' object has no attribute 'type'

2020-11-17 Thread Marcelo
Hello all.

I'm trying to deploy the dashboard (Nautilus 14.2.8), and after I run ceph
dashboard create-self-signed-cert the cluster started showing this error:
# ceph health detail
HEALTH_ERR Module 'dashboard' has failed: '_cffi_backend.CDataGCP' object
has no attribute 'type'
MGR_MODULE_ERROR Module 'dashboard' has failed: '_cffi_backend.CDataGCP'
object has no attribute 'type'
Module 'dashboard' has failed: '_cffi_backend.CDataGCP' object has no
attribute 'type'

If I set ceph config set mgr mgr/dashboard/ssl false, the error goes away.

I tried to manually upload the certs, but I'm still hitting the error.
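
For reference, the manual route was roughly the following (paths are
placeholders), and it still ends up with the same cffi error:

  openssl req -new -nodes -x509 -days 365 \
      -subj "/CN=ceph-dashboard" \
      -keyout dashboard.key -out dashboard.crt
  ceph dashboard set-ssl-certificate -i dashboard.crt
  ceph dashboard set-ssl-certificate-key -i dashboard.key
  ceph mgr module disable dashboard
  ceph mgr module enable dashboard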

Has anyone experienced something similar?

Thanks, Marcelo.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD - High IOWait during the Writes

2020-11-17 Thread athreyavc
I disabled CephX authentication now. Though the performance is slightly
better, it is not yet there.

Are there any other recommendations for all HDD ceph clusters ?

From another thread
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DFHXXN4KKI5PS7LYPZJO4GYHU67JYVVL/


*In our test based on v15.2.2, I found
osd_numa_prefer_iface/osd_numa_auto_affinity make only half the CPU used. For
4K RW, it makes performance drop much. So you can check whether this occurs.*

I do see "set_numa_affinity unable to identify cluster interface" alerts.
But I am not sure that is a cause for concern.

Thanks and regards,

Athreya

On Thu, Nov 12, 2020 at 1:30 PM athreyavc  wrote:

> Hi,
>
> Thanks for the email, but we are not using RAID at all; we are using
> LSI HBA 9400-8e HBAs. Each HDD is configured as an OSD.
>
> On Thu, Nov 12, 2020 at 12:19 PM Edward kalk  wrote:
>
>> for certain CPU architecture, disable spectre and meltdown mitigations.
>> (be certain network to physical nodes is secure from internet access) (use
>> apt proxy, http(s), curl proxy servers)
>> Try to toggle on or off the physical on disk cache. (raid controller
>> command)
>> ^I had same issue, doing both of these fixed it. In my case the disks I
>> had needed on disk cache hard set to ‘on’. raid card default was not good.
>> (be sure to have diverse power and UPS protection if needed to run on disk
>> cache on) (good RAID. battery if using raid cache improves perf.)
>>
>> to see the perf impact of spec. and melt. mitigation vs. off, run: dd
>> if=/dev/zero of=/dev/null
>> ^i run for 5 seconds and then ctl+c
>> will show a max north bridge ops.
>>
>> to see the difference in await and IOPs when toggle RAID card features
>> and on disk cache I run: iostat -xtc 2
>> and use fio to generate disk load for testing IOPs. (google fio example
>> commands)
>> ^south bridge +raid controller to disks ops and latency.
>>
>> -Edward Kalk
>> Datacenter Virtualization
>> Performance Engineering
>> Socket Telecom
>> Columbia, MO, USA
>> ek...@socket.net
>>
>> > On Nov 12, 2020, at 4:45 AM, athreyavc  wrote:
>> >
>> > Jumbo frames enabled  and MTU is 9000
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bucket notification is working strange

2020-11-17 Thread Yuval Lifshitz
Hi Krasaev,
Thanks for pointing out this issue! This is currently under review here:
[1], and tracked here: [2].
Once merged, the fix would be available on the master development branch,
and the plan is to backport the fix to Octopus in the future.

Yuval

[1] https://github.com/ceph/ceph/pull/38136
[2] https://tracker.ceph.com/issues/47904

On Mon, Oct 19, 2020 at 7:26 PM Krasaev  wrote:

> Hi everyone, I asked the same question in stackoverflow, but will repeat
> here.
>
> I configured bucket notifications using the bucket owner's creds, and when the
> owner performs actions I can see new events in the configured endpoint (Kafka,
> actually). However, when I try to do actions in the bucket with another
> user's creds, I do not see events in the configured notification topic. Is it
> expected behaviour that each user has to configure their own topic (is that
> possible if the user is not a system user at all)? Or have I missed something?
> Thank you.
>
>
>
> https://stackoverflow.com/questions/64384060/enable-bucket-notifications-for-all-users-in-ceph-octopus
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Reclassify crush map

2020-11-17 Thread Seena Fallah
Hi all,

I want to reclassify my crushmap. I have two roots, one hiops and one
default. In the hiops root I have one datacenter, in that datacenter I have
three racks, and in each rack I have 3 OSDs. When I run the command below it
says "item -55 in bucket -54 is not also a reclassified bucket". I looked at
the new crushmap before the reclassify-bucket command and item -55 was my
datacenter! What should I do? If I reclassify my datacenter I will lose my
datacenter!
crushtool -i crush-map --reclassify \
--reclassify-root default hdd \
--set-subtree-class default hdd \
--reclassify-root hiops ssd \
--reclassify-bucket hiops ssd default \
-o new-crush-map

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Kalle Happonen
Hi,

> I don't think the default osd_min_pg_log_entries has changed recently.
> In https://tracker.ceph.com/issues/47775 I proposed that we limit the
> pg log length by memory -- if it is indeed possible for log entries to
> get into several MB, then this would be necessary IMHO.

I've had a surprising crash course on pg_log in the last 36 hours. As for the
size of each entry, you're right. I counted pg log * OSDs, and did not factor
in pg log * OSDs * PGs per OSD. Still, the total memory an OSD process uses
for pg_log was ~22GB.
   
 
> But you said you were trimming PG logs with the offline tool? How long
> were those logs that needed to be trimmed?

The logs we are trimming were ~3000 entries; we trimmed them to a new size of
500. After restarting the OSDs, the pg_log memory usage dropped from ~22GB to
what we guess is 2-3GB, but with the cluster in this state it's hard to be
specific.
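
For anyone who wants to check their own OSDs, the pg_log footprint shows up in
the OSD mempool stats, e.g. (the exact JSON path may differ by release):

  ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.osd_pglog'
  # -> { "items": ..., "bytes": ... }  -- roughly the in-memory pg_log footprint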

Cheers,
Kalle


 
> -- dan
> 
> 
> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen  wrote:
>>
>> Another idea, which I don't know if has any merit.
>>
>> If 8 MB is a realistic log size (or has this grown for some reason?), did the
>> enforcement (or default) of the minimum value change lately
>> (osd_min_pg_log_entries)?
>>
>> If the minimum amount would be set to 1000, at 8 MB per log, we would have
>> issues with memory.
>>
>> Cheers,
>> Kalle
>>
>>
>>
>> - Original Message -
>> > From: "Kalle Happonen" 
>> > To: "Dan van der Ster" 
>> > Cc: "ceph-users" 
>> > Sent: Tuesday, 17 November, 2020 12:45:25
>> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>
>> > Hi Dan @ co.,
>> > Thanks for the support (moral and technical).
>> >
>> > That sounds like a good guess, but it seems like there is nothing alarming 
>> > here.
>> > In all our pools, some pgs are a bit over 3100, but not at any exceptional
>> > values.
>> >
>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
>> > select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>> >  "pgid": "37.2b9",
>> >  "ondisk_log_size": 3103,
>> >  "pgid": "33.e",
>> >  "ondisk_log_size": 3229,
>> >  "pgid": "7.2",
>> >  "ondisk_log_size": 3111,
>> >  "pgid": "26.4",
>> >  "ondisk_log_size": 3185,
>> >  "pgid": "33.4",
>> >  "ondisk_log_size": 3311,
>> >  "pgid": "33.8",
>> >  "ondisk_log_size": 3278,
>> >
>> > I also have no idea what the average size of a pg log entry should be, in 
>> > our
>> > case it seems it's around 8 MB (22GB/3000 entires).
>> >
>> > Cheers,
>> > Kalle
>> >
>> > - Original Message -
>> >> From: "Dan van der Ster" 
>> >> To: "Kalle Happonen" 
>> >> Cc: "ceph-users" , "xie xingguo" 
>> >> ,
>> >> "Samuel Just" 
>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>> >> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
>> >
>> >> Hi Kalle,
>> >>
>> >> Do you have active PGs now with huge pglogs?
>> >> You can do something like this to find them:
>> >>
>> >>   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
>> >> select(.ondisk_log_size > 3000)'
>> >>
>> >> If you find some, could you increase to debug_osd = 10 then share the osd 
>> >> log.
>> >> I am interested in the debug lines from calc_trim_to_aggressively (or
>> >> calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
>> >> might show other issues.
>> >>
>> >> Cheers, dan
>> >>
>> >>
>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  
>> >> wrote:
>> >>>
>> >>> Hi Kalle,
>> >>>
>> >>> Strangely and luckily, in our case the memory explosion didn't reoccur
>> >>> after that incident. So I can mostly only offer moral support.
>> >>>
>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
>> >>> think this is suspicious:
>> >>>
>> >>>b670715eb4 osd/PeeringState: do not trim pg log past 
>> >>> last_update_ondisk
>> >>>
>> >>>https://github.com/ceph/ceph/commit/b670715eb4
>> >>>
>> >>> Given that it adds a case where the pg_log is not trimmed, I wonder if
>> >>> there could be an unforeseen condition where `last_update_ondisk`
>> >>> isn't being updated correctly, and therefore the osd stops trimming
>> >>> the pg_log altogether.
>> >>>
>> >>> Xie or Samuel: does that sound possible?
>> >>>
>> >>> Cheers, Dan
>> >>>
>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  
>> >>> wrote:
>> >>> >
>> >>> > Hello all,
>> >>> > wrt:
>> >>> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>> >>> >
>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to the 
>> >>> > thread above.
>> >>> >
>> >>> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk 
>> >>> > per node.
>> >>> > We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
>> >>> >
>> >>> > The cluster has been running fine, and (as relevant to the post) the 
>> >>> > memory
>> >>> > usage has been stable at 100 GB / node. We've had the default pg_log 
>> >>> > of 3000.
>> >>> > The user traffic doesn't seem to have been exception

[ceph-users] CephFS: Recovering from broken Mount

2020-11-17 Thread Julian Fölsch

Hello,

We are running an Octopus cluster; however, we still have some older Ubuntu
16.04 clients connecting using libcephfs2 version 14.2.13-1xenial.


From time to time it has happened that the network had issues and the
clients lost their connection to the cluster.
But the system still thinks the mount is alive, and it has to be
remounted manually.
In some cases we even had to restart the whole machine because it
would refuse to unmount.

I also attached a dmesg log so you can see the system's behaviour.
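
So far the only workarounds we know are forced or lazy unmounts followed by a
remount, roughly as below; the recover_session=clean mount option would be
nicer, but as far as I know it needs a 5.4+ kernel client, so it won't help
the 16.04 machines as they are (mon host, user and paths are placeholders):

  umount -f /mnt/cephfs        # force unmount; can still hang if dirty caps exist
  umount -l /mnt/cephfs        # lazy unmount as a last resort, then remount
  mount -t ceph <mon-host>:/ /mnt/cephfs \
      -o name=<user>,secretfile=/etc/ceph/<user>.secret,recover_session=clean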

How do you deal with such issues?


Kind regards,
Julian Fölsch

--
Julian Fölsch

   Arbeitsgemeinschaft Dresdner Studentennetz (AG DSN)

   Telefon: +49 351 271816 69
   Mobil: +49 152 22915871
   Fax: +49 351 46469685
   Email: julian.foel...@agdsn.de

   Studierendenrat der TU Dresden
   Helmholtzstr. 10
   01069 Dresden
[1366901.940605] libceph: mon2 10.144.0.4:6789 session lost, hunting for new mon
[1366937.780384] ceph: mds0 caps stale
[1367164.851326] ceph: mds0 hung
[1367819.440486] libceph: mds0 10.144.0.3:6801 socket closed (con state OPEN)
[1367950.511242] libceph: mds0 10.144.0.3:6801 socket closed (con state 
CONNECTING)
[1368016.048263] libceph: mds0 10.144.0.3:6801 connection reset
[1368016.048578] libceph: reset on mds0
[1368016.048588] ceph: mds0 closed our session
[1368016.048589] ceph: mds0 reconnect start
[1368016.223200] libceph: mds0 10.144.0.3:6801 socket closed (con state 
NEGOTIATING)
[1368016.752349] ceph: mds0 rejected session
[1368024.244740] libceph: mon0 10.144.0.2:6789 session established
[1369816.744074] libceph: mds0 10.144.0.3:6801 socket closed (con state OPEN)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

2020-11-17 Thread Robert Sander
Hi Phil,

thanks for the background info.

On 17.11.20 at 01:51, Phil Merricks wrote:

> 1:  Move off the data and scrap the cluster as it stands currently.
> (already under way)
> 2:  Group the block devices into pools of the same geometry and type (and
> maybe do some tiering?)
> 3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily
> compromised by a loss at the bare metal level
> 4. Add more hosts/OSDs if EC is the right solution (this may be outside of
> the scope of this implementation, but I'll keep a-cobblin'!)

This looks like a plan.

> 
> The additional ceph outputs follow:
> ceph osd tree 
> ceph osd erasure-code-profile get cephfs-media-ec 

Your EC profile will not work on two hosts:

crush-device-class=
crush-failure-domain=host
crush-root=default
k=2
m=2

You need k+m=4 independent hosts for the EC parts, but your CRUSH map
only shows two hosts. This is why all your PGs are undersized and degraded.
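
As a rough illustration only (the profile names are made up, and as far as I
know you cannot change the profile of an existing pool -- you would create a
new pool with the new profile and migrate the data): on three hosts you can
either add a fourth host, shrink the profile so that k+m fits the number of
hosts, or place chunks per OSD instead of per host, accepting that a single
host failure can then take out several chunks of a PG:

  ceph osd erasure-code-profile set ec22-osd k=2 m=2 crush-failure-domain=osd
  # or, once a fourth host exists, keep the host failure domain:
  ceph osd erasure-code-profile set ec22-host k=2 m=2 crush-failure-domain=host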

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Mark Nelson

Hi Dan,


I 100% agree with your proposal.  One of the goals I had in mind with 
the prioritycache framework is that pglog could end up becoming another 
prioritycache target that is balanced against the other caches.  The 
idea would be that we try to keep some amount of pglog data in memory at 
high priority but ultimately the longer the log gets the less priority 
it gets relative to onode cache and other things (with some 
minimums/maximums in place as well).  Just yesterday Josh and I were 
also talking about the possibility of keeping a longer running log on 
disk than what's represented in memory as well.  This could have 
implications for peering performance, but frankly I don't see how we 
keep using log based recovery in a world where we are putting OSDs on 
devices capable of hundreds of thousands of write IOPS.



Mark


On 11/17/20 5:13 AM, Dan van der Ster wrote:

I don't think the default osd_min_pg_log_entries has changed recently.
In https://tracker.ceph.com/issues/47775 I proposed that we limit the
pg log length by memory -- if it is indeed possible for log entries to
get into several MB, then this would be necessary IMHO.

But you said you were trimming PG logs with the offline tool? How long
were those logs that needed to be trimmed?

-- dan


On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen  wrote:

Another idea, which I don't know if has any merit.

If 8 MB is a realistic log size (or has this grown for some reason?), did the 
enforcement (or default) of the minimum value change lately 
(osd_min_pg_log_entries)?

If the minimum amount would be set to 1000, at 8 MB per log, we would have 
issues with memory.

Cheers,
Kalle



- Original Message -

From: "Kalle Happonen" 
To: "Dan van der Ster" 
Cc: "ceph-users" 
Sent: Tuesday, 17 November, 2020 12:45:25
Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
Hi Dan @ co.,
Thanks for the support (moral and technical).

That sounds like a good guess, but it seems like there is nothing alarming here.
In all our pools, some pgs are a bit over 3100, but not at any exceptional
values.

cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
  "pgid": "37.2b9",
  "ondisk_log_size": 3103,
  "pgid": "33.e",
  "ondisk_log_size": 3229,
  "pgid": "7.2",
  "ondisk_log_size": 3111,
  "pgid": "26.4",
  "ondisk_log_size": 3185,
  "pgid": "33.4",
  "ondisk_log_size": 3311,
  "pgid": "33.8",
  "ondisk_log_size": 3278,

I also have no idea what the average size of a pg log entry should be, in our
case it seems it's around 8 MB (22GB/3000 entires).

Cheers,
Kalle

- Original Message -

From: "Dan van der Ster" 
To: "Kalle Happonen" 
Cc: "ceph-users" , "xie xingguo" ,
"Samuel Just" 
Sent: Tuesday, 17 November, 2020 12:22:28
Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
Hi Kalle,

Do you have active PGs now with huge pglogs?
You can do something like this to find them:

   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
select(.ondisk_log_size > 3000)'

If you find some, could you increase to debug_osd = 10 then share the osd log.
I am interested in the debug lines from calc_trim_to_aggressively (or
calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
might show other issues.

Cheers, dan


On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  wrote:

Hi Kalle,

Strangely and luckily, in our case the memory explosion didn't reoccur
after that incident. So I can mostly only offer moral support.

But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
think this is suspicious:

b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk

https://github.com/ceph/ceph/commit/b670715eb4

Given that it adds a case where the pg_log is not trimmed, I wonder if
there could be an unforeseen condition where `last_update_ondisk`
isn't being updated correctly, and therefore the osd stops trimming
the pg_log altogether.

Xie or Samuel: does that sound possible?

Cheers, Dan

On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  wrote:

Hello all,
wrt:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/

Yesterday we hit a problem with osd_pglog memory, similar to the thread above.

We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per node.
We run 8+3 EC for the data pool (metadata is on replicated nvme pool).

The cluster has been running fine, and (as relevant to the post) the memory
usage has been stable at 100 GB / node. We've had the default pg_log of 3000.
The user traffic doesn't seem to have been exceptional lately.

Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory
usage on OSD nodes started to grow. On each node it grew steadily about 30
GB/day, until the servers started OOM killing OSD processes.

After a lot of debugging we found that the pg_logs were huge. Each OSD process
pg_log had grown to ~22GB, which we naturally didn't hav

[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Dan van der Ster
I don't think the default osd_min_pg_log_entries has changed recently.
In https://tracker.ceph.com/issues/47775 I proposed that we limit the
pg log length by memory -- if it is indeed possible for log entries to
get into several MB, then this would be necessary IMHO.

But you said you were trimming PG logs with the offline tool? How long
were those logs that needed to be trimmed?

-- dan


On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen  wrote:
>
> Another idea, which I don't know if has any merit.
>
> If 8 MB is a realistic log size (or has this grown for some reason?), did the 
> enforcement (or default) of the minimum value change lately 
> (osd_min_pg_log_entries)?
>
> If the minimum amount would be set to 1000, at 8 MB per log, we would have 
> issues with memory.
>
> Cheers,
> Kalle
>
>
>
> - Original Message -
> > From: "Kalle Happonen" 
> > To: "Dan van der Ster" 
> > Cc: "ceph-users" 
> > Sent: Tuesday, 17 November, 2020 12:45:25
> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>
> > Hi Dan @ co.,
> > Thanks for the support (moral and technical).
> >
> > That sounds like a good guess, but it seems like there is nothing alarming 
> > here.
> > In all our pools, some pgs are a bit over 3100, but not at any exceptional
> > values.
> >
> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
> > select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
> >  "pgid": "37.2b9",
> >  "ondisk_log_size": 3103,
> >  "pgid": "33.e",
> >  "ondisk_log_size": 3229,
> >  "pgid": "7.2",
> >  "ondisk_log_size": 3111,
> >  "pgid": "26.4",
> >  "ondisk_log_size": 3185,
> >  "pgid": "33.4",
> >  "ondisk_log_size": 3311,
> >  "pgid": "33.8",
> >  "ondisk_log_size": 3278,
> >
> > I also have no idea what the average size of a pg log entry should be, in 
> > our
> > case it seems it's around 8 MB (22GB/3000 entires).
> >
> > Cheers,
> > Kalle
> >
> > - Original Message -
> >> From: "Dan van der Ster" 
> >> To: "Kalle Happonen" 
> >> Cc: "ceph-users" , "xie xingguo" 
> >> ,
> >> "Samuel Just" 
> >> Sent: Tuesday, 17 November, 2020 12:22:28
> >> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
> >
> >> Hi Kalle,
> >>
> >> Do you have active PGs now with huge pglogs?
> >> You can do something like this to find them:
> >>
> >>   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
> >> select(.ondisk_log_size > 3000)'
> >>
> >> If you find some, could you increase to debug_osd = 10 then share the osd 
> >> log.
> >> I am interested in the debug lines from calc_trim_to_aggressively (or
> >> calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
> >> might show other issues.
> >>
> >> Cheers, dan
> >>
> >>
> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  
> >> wrote:
> >>>
> >>> Hi Kalle,
> >>>
> >>> Strangely and luckily, in our case the memory explosion didn't reoccur
> >>> after that incident. So I can mostly only offer moral support.
> >>>
> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
> >>> think this is suspicious:
> >>>
> >>>b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
> >>>
> >>>https://github.com/ceph/ceph/commit/b670715eb4
> >>>
> >>> Given that it adds a case where the pg_log is not trimmed, I wonder if
> >>> there could be an unforeseen condition where `last_update_ondisk`
> >>> isn't being updated correctly, and therefore the osd stops trimming
> >>> the pg_log altogether.
> >>>
> >>> Xie or Samuel: does that sound possible?
> >>>
> >>> Cheers, Dan
> >>>
> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  
> >>> wrote:
> >>> >
> >>> > Hello all,
> >>> > wrt:
> >>> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
> >>> >
> >>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread 
> >>> > above.
> >>> >
> >>> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk 
> >>> > per node.
> >>> > We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
> >>> >
> >>> > The cluster has been running fine, and (as relevant to the post) the 
> >>> > memory
> >>> > usage has been stable at 100 GB / node. We've had the default pg_log of 
> >>> > 3000.
> >>> > The user traffic doesn't seem to have been exceptional lately.
> >>> >
> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
> >>> > memory
> >>> > usage on OSD nodes started to grow. On each node it grew steadily about 
> >>> > 30
> >>> > GB/day, until the servers started OOM killing OSD processes.
> >>> >
> >>> > After a lot of debugging we found that the pg_logs were huge. Each OSD 
> >>> > process
> >>> > pg_log had grown to ~22GB, which we naturally didn't have memory for, 
> >>> > and then
> >>> > the cluster was in an unstable situation. This is significantly more 
> >>> > than the
> >>> > 1,5 GB in the post above. We do have ~20k pgs, which may directly 
> >>> > affect the
> >>> > size.
> >>> >
> >>> > We've 

[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Kalle Happonen
Another idea, which I don't know if it has any merit.

If 8 MB is a realistic log size (or has it grown for some reason?), did the
enforcement (or default) of the minimum value (osd_min_pg_log_entries) change
lately?

If the minimum were set to 1000, at 8 MB per log, we would have
issues with memory.

Cheers,
Kalle



- Original Message -
> From: "Kalle Happonen" 
> To: "Dan van der Ster" 
> Cc: "ceph-users" 
> Sent: Tuesday, 17 November, 2020 12:45:25
> Subject: [ceph-users] Re: osd_pglog memory hoarding - another case

> Hi Dan @ co.,
> Thanks for the support (moral and technical).
> 
> That sounds like a good guess, but it seems like there is nothing alarming 
> here.
> In all our pools, some pgs are a bit over 3100, but not at any exceptional
> values.
> 
> cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
> select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>  "pgid": "37.2b9",
>  "ondisk_log_size": 3103,
>  "pgid": "33.e",
>  "ondisk_log_size": 3229,
>  "pgid": "7.2",
>  "ondisk_log_size": 3111,
>  "pgid": "26.4",
>  "ondisk_log_size": 3185,
>  "pgid": "33.4",
>  "ondisk_log_size": 3311,
>  "pgid": "33.8",
>  "ondisk_log_size": 3278,
> 
> I also have no idea what the average size of a pg log entry should be, in our
> case it seems it's around 8 MB (22GB/3000 entries).
> 
> Cheers,
> Kalle
> 
> - Original Message -
>> From: "Dan van der Ster" 
>> To: "Kalle Happonen" 
>> Cc: "ceph-users" , "xie xingguo" 
>> ,
>> "Samuel Just" 
>> Sent: Tuesday, 17 November, 2020 12:22:28
>> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
> 
>> Hi Kalle,
>> 
>> Do you have active PGs now with huge pglogs?
>> You can do something like this to find them:
>> 
>>   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
>> select(.ondisk_log_size > 3000)'
>> 
>> If you find some, could you increase to debug_osd = 10 then share the osd 
>> log.
>> I am interested in the debug lines from calc_trim_to_aggressively (or
>> calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
>> might show other issues.
>> 
>> Cheers, dan
>> 
>> 
>> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  wrote:
>>>
>>> Hi Kalle,
>>>
>>> Strangely and luckily, in our case the memory explosion didn't reoccur
>>> after that incident. So I can mostly only offer moral support.
>>>
>>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
>>> think this is suspicious:
>>>
>>>b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>>
>>>https://github.com/ceph/ceph/commit/b670715eb4
>>>
>>> Given that it adds a case where the pg_log is not trimmed, I wonder if
>>> there could be an unforeseen condition where `last_update_ondisk`
>>> isn't being updated correctly, and therefore the osd stops trimming
>>> the pg_log altogether.
>>>
>>> Xie or Samuel: does that sound possible?
>>>
>>> Cheers, Dan
>>>
>>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  
>>> wrote:
>>> >
>>> > Hello all,
>>> > wrt:
>>> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>>> >
>>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread 
>>> > above.
>>> >
>>> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per 
>>> > node.
>>> > We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
>>> >
>>> > The cluster has been running fine, and (as relevant to the post) the 
>>> > memory
>>> > usage has been stable at 100 GB / node. We've had the default pg_log of 
>>> > 3000.
>>> > The user traffic doesn't seem to have been exceptional lately.
>>> >
>>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
>>> > memory
>>> > usage on OSD nodes started to grow. On each node it grew steadily about 30
>>> > GB/day, until the servers started OOM killing OSD processes.
>>> >
>>> > After a lot of debugging we found that the pg_logs were huge. Each OSD 
>>> > process
>>> > pg_log had grown to ~22GB, which we naturally didn't have memory for, and 
>>> > then
>>> > the cluster was in an unstable situation. This is significantly more than 
>>> > the
>>> > 1,5 GB in the post above. We do have ~20k pgs, which may directly affect 
>>> > the
>>> > size.
>>> >
>>> > We've reduced the pg_log to 500, and started offline trimming it where we 
>>> > can,
>>> > and also just waited. The pg_log size dropped to ~1,2 GB on at least some
> >>> > nodes, but we're still recovering, and have a lot of OSDs down and out 
>>> > still.
>>> >
>>> > We're unsure if version 14.2.13 triggered this, or if the osd restarts 
>>> > triggered
>>> > this (or something unrelated we don't see).
>>> >
>>> > This mail is mostly to figure out if there are good guesses why the 
>>> > pg_log size
>>> > per OSD process exploded? Any technical (and moral) support is 
>>> > appreciated.
>>> > Also, currently we're not sure if 14.2.13 triggered this, so this is also 
>>> > to
>>> > put a data point out there for othe

[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Dan van der Ster
On Tue, Nov 17, 2020 at 11:45 AM Kalle Happonen  wrote:
>
> Hi Dan @ co.,
> Thanks for the support (moral and technical).
>
> That sounds like a good guess, but it seems like there is nothing alarming 
> here. In all our pools, some pgs are a bit over 3100, but not at any 
> exceptional values.
>
> cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
> select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>   "pgid": "37.2b9",
>   "ondisk_log_size": 3103,
>   "pgid": "33.e",
>   "ondisk_log_size": 3229,
>   "pgid": "7.2",
>   "ondisk_log_size": 3111,
>   "pgid": "26.4",
>   "ondisk_log_size": 3185,
>   "pgid": "33.4",
>   "ondisk_log_size": 3311,
>   "pgid": "33.8",
>   "ondisk_log_size": 3278,
>
> I also have no idea what the average size of a pg log entry should be, in our 
> case it seems it's around 8 MB (22GB/3000 entries).

I also have no idea how large the average PG log entry *should* be.
(BTW I think you forgot a factor: the number of PGs on each OSD.)

Here's a sample from one of our S3 4+2 OSDs:

71 PGs,

"osd_pglog": {
"items": 249530,
"bytes": 33925360
},

So that's ~32 MB for roughly 500*71 = 35,500 entries, or around 1 kB each.
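
If it helps to compare numbers on other clusters, the figures above can be read per 
OSD from the admin socket; a rough sketch (osd.NN is a placeholder, and the exact 
JSON nesting may differ slightly between releases):

   ceph daemon osd.NN dump_mempools | jq '.mempool.by_pool.osd_pglog'
   ceph daemon osd.NN status | jq '.num_pgs'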

Anyway you raised a good point -- this isn't necessarily a "pg log not
trimming" bug, but rather it might be a "pg log entries are huge" bug.

-- dan


>
> Cheers,
> Kalle
>
> - Original Message -
> > From: "Dan van der Ster" 
> > To: "Kalle Happonen" 
> > Cc: "ceph-users" , "xie xingguo" 
> > , "Samuel Just" 
> > Sent: Tuesday, 17 November, 2020 12:22:28
> > Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
>
> > Hi Kalle,
> >
> > Do you have active PGs now with huge pglogs?
> > You can do something like this to find them:
> >
> >   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
> > select(.ondisk_log_size > 3000)'
> >
> > If you find some, could you increase to debug_osd = 10 then share the osd 
> > log.
> > I am interested in the debug lines from calc_trim_to_aggressively (or
> > calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
> > might show other issues.
> >
> > Cheers, dan
> >
> >
> > On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  
> > wrote:
> >>
> >> Hi Kalle,
> >>
> >> Strangely and luckily, in our case the memory explosion didn't reoccur
> >> after that incident. So I can mostly only offer moral support.
> >>
> >> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
> >> think this is suspicious:
> >>
> >>b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
> >>
> >>https://github.com/ceph/ceph/commit/b670715eb4
> >>
> >> Given that it adds a case where the pg_log is not trimmed, I wonder if
> >> there could be an unforeseen condition where `last_update_ondisk`
> >> isn't being updated correctly, and therefore the osd stops trimming
> >> the pg_log altogether.
> >>
> >> Xie or Samuel: does that sound possible?
> >>
> >> Cheers, Dan
> >>
> >> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  
> >> wrote:
> >> >
> >> > Hello all,
> >> > wrt:
> >> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
> >> >
> >> > Yesterday we hit a problem with osd_pglog memory, similar to the thread 
> >> > above.
> >> >
> >> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per 
> >> > node.
> >> > We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
> >> >
> >> > The cluster has been running fine, and (as relevant to the post) the 
> >> > memory
> >> > usage has been stable at 100 GB / node. We've had the default pg_log of 
> >> > 3000.
> >> > The user traffic doesn't seem to have been exceptional lately.
> >> >
> >> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
> >> > memory
> >> > usage on OSD nodes started to grow. On each node it grew steadily about 
> >> > 30
> >> > GB/day, until the servers started OOM killing OSD processes.
> >> >
> >> > After a lot of debugging we found that the pg_logs were huge. Each OSD 
> >> > process
> >> > pg_log had grown to ~22GB, which we naturally didn't have memory for, 
> >> > and then
> >> > the cluster was in an unstable situation. This is significantly more 
> >> > than the
> >> > 1,5 GB in the post above. We do have ~20k pgs, which may directly affect 
> >> > the
> >> > size.
> >> >
> >> > We've reduced the pg_log to 500, and started offline trimming it where 
> >> > we can,
> >> > and also just waited. The pg_log size dropped to ~1,2 GB on at least some
> >> > nodes, but we're still recovering, and have a lot of OSDs down and out 
> >> > still.
> >> >
> >> > We're unsure if version 14.2.13 triggered this, or if the osd restarts 
> >> > triggered
> >> > this (or something unrelated we don't see).
> >> >
> >> > This mail is mostly to figure out if there are good guesses why the 
> >> > pg_log size
> >> > per OSD process exploded? Any technical (and moral) support is 
> >> > appr

[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Kalle Happonen
Hi Dan @ co.,
Thanks for the support (moral and technical).

That sounds like a good guess, but it seems like there is nothing alarming 
here. In all our pools, some pgs are a bit over 3100, but not at any 
exceptional values.

cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
  "pgid": "37.2b9",
  "ondisk_log_size": 3103,
  "pgid": "33.e",
  "ondisk_log_size": 3229,
  "pgid": "7.2",
  "ondisk_log_size": 3111,
  "pgid": "26.4",
  "ondisk_log_size": 3185,
  "pgid": "33.4",
  "ondisk_log_size": 3311,
  "pgid": "33.8",
  "ondisk_log_size": 3278,

I also have no idea what the average size of a pg log entry should be; in our 
case it seems to be around 8 MB (22 GB / 3000 entries).

Cheers,
Kalle

- Original Message -
> From: "Dan van der Ster" 
> To: "Kalle Happonen" 
> Cc: "ceph-users" , "xie xingguo" 
> , "Samuel Just" 
> Sent: Tuesday, 17 November, 2020 12:22:28
> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case

> Hi Kalle,
> 
> Do you have active PGs now with huge pglogs?
> You can do something like this to find them:
> 
>   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
> select(.ondisk_log_size > 3000)'
> 
> If you find some, could you increase to debug_osd = 10 then share the osd log.
> I am interested in the debug lines from calc_trim_to_aggressively (or
> calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
> might show other issues.
> 
> Cheers, dan
> 
> 
> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  wrote:
>>
>> Hi Kalle,
>>
>> Strangely and luckily, in our case the memory explosion didn't reoccur
>> after that incident. So I can mostly only offer moral support.
>>
>> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
>> think this is suspicious:
>>
>>b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>>
>>https://github.com/ceph/ceph/commit/b670715eb4
>>
>> Given that it adds a case where the pg_log is not trimmed, I wonder if
>> there could be an unforeseen condition where `last_update_ondisk`
>> isn't being updated correctly, and therefore the osd stops trimming
>> the pg_log altogether.
>>
>> Xie or Samuel: does that sound possible?
>>
>> Cheers, Dan
>>
>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  wrote:
>> >
>> > Hello all,
>> > wrt:
>> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>> >
>> > Yesterday we hit a problem with osd_pglog memory, similar to the thread 
>> > above.
>> >
>> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per 
>> > node.
>> > We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
>> >
>> > The cluster has been running fine, and (as relevant to the post) the memory
>> > usage has been stable at 100 GB / node. We've had the default pg_log of 
>> > 3000.
>> > The user traffic doesn't seem to have been exceptional lately.
>> >
>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
>> > memory
>> > usage on OSD nodes started to grow. On each node it grew steadily about 30
>> > GB/day, until the servers started OOM killing OSD processes.
>> >
>> > After a lot of debugging we found that the pg_logs were huge. Each OSD 
>> > process
>> > pg_log had grown to ~22GB, which we naturally didn't have memory for, and 
>> > then
>> > the cluster was in an unstable situation. This is significantly more than 
>> > the
>> > 1,5 GB in the post above. We do have ~20k pgs, which may directly affect 
>> > the
>> > size.
>> >
>> > We've reduced the pg_log to 500, and started offline trimming it where we 
>> > can,
>> > and also just waited. The pg_log size dropped to ~1,2 GB on at least some
>> > nodes, but we're still recovering, and have a lot of OSDs down and out 
>> > still.
>> >
>> > We're unsure if version 14.2.13 triggered this, or if the osd restarts 
>> > triggered
>> > this (or something unrelated we don't see).
>> >
>> > This mail is mostly to figure out if there are good guesses why the pg_log 
>> > size
>> > per OSD process exploded? Any technical (and moral) support is appreciated.
>> > Also, currently we're not sure if 14.2.13 triggered this, so this is also 
>> > to
>> > put a data point out there for other debuggers.
>> >
>> > Cheers,
>> > Kalle Happonen
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Dan van der Ster
Hi Kalle,

Do you have active PGs now with huge pglogs?
You can do something like this to find them:

   ceph pg dump -f json | jq '.pg_map.pg_stats[] |
select(.ondisk_log_size > 3000)'

If you find some, could you increase to debug_osd = 10 then share the osd log.
I am interested in the debug lines from calc_trim_to_aggressively (or
calc_trim_to if you didn't enable pglog_hardlimit), but the whole log
might show other issues.
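
In case it helps, raising and later reverting the log level on a single affected 
OSD could look roughly like this (N is a placeholder, the log path assumes the 
default /var/log/ceph location, and 1/5 is the usual debug_osd default):

   ceph tell osd.N injectargs '--debug_osd=10'
   # wait long enough for a few trim attempts, then:
   grep calc_trim_to /var/log/ceph/ceph-osd.N.log
   ceph tell osd.N injectargs '--debug_osd=1/5'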

Cheers, dan


On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster  wrote:
>
> Hi Kalle,
>
> Strangely and luckily, in our case the memory explosion didn't reoccur
> after that incident. So I can mostly only offer moral support.
>
> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
> think this is suspicious:
>
>b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>
>https://github.com/ceph/ceph/commit/b670715eb4
>
> Given that it adds a case where the pg_log is not trimmed, I wonder if
> there could be an unforeseen condition where `last_update_ondisk`
> isn't being updated correctly, and therefore the osd stops trimming
> the pg_log altogether.
>
> Xie or Samuel: does that sound possible?
>
> Cheers, Dan
>
> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  wrote:
> >
> > Hello all,
> > wrt: 
> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
> >
> > Yesterday we hit a problem with osd_pglog memory, similar to the thread 
> > above.
> >
> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per 
> > node. We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
> >
> > The cluster has been running fine, and (as relevant to the post) the memory 
> > usage has been stable at 100 GB / node. We've had the default pg_log of 
> > 3000. The user traffic doesn't seem to have been exceptional lately.
> >
> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
> > memory usage on OSD nodes started to grow. On each node it grew steadily 
> > about 30 GB/day, until the servers started OOM killing OSD processes.
> >
> > After a lot of debugging we found that the pg_logs were huge. Each OSD 
> > process pg_log had grown to ~22GB, which we naturally didn't have memory 
> > for, and then the cluster was in an unstable situation. This is 
> > significantly more than the 1,5 GB in the post above. We do have ~20k pgs, 
> > which may directly affect the size.
> >
> > We've reduced the pg_log to 500, and started offline trimming it where we 
> > can, and also just waited. The pg_log size dropped to ~1,2 GB on at least 
> > some nodes, but we're still recovering, and have a lot of OSDs down and 
> > out still.
> >
> > We're unsure if version 14.2.13 triggered this, or if the osd restarts 
> > triggered this (or something unrelated we don't see).
> >
> > This mail is mostly to figure out if there are good guesses why the pg_log 
> > size per OSD process exploded? Any technical (and moral) support is 
> > appreciated. Also, currently we're not sure if 14.2.13 triggered this, so 
> > this is also to put a data point out there for other debuggers.
> >
> > Cheers,
> > Kalle Happonen
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Dan van der Ster
Hi Xie,

On Tue, Nov 17, 2020 at 11:14 AM  wrote:
>
> Hi Dan,
>
>
> > Given that it adds a case where the pg_log is not trimmed, I wonder if
> > there could be an unforeseen condition where `last_update_ondisk`
> > isn't being updated correctly, and therefore the osd stops trimming
> > the pg_log altogether.
>
> >
>
> > Xie or Samuel: does that sound possible?
>
>
> "b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk"
>
>
> sounds like the culprit to me if the cluster pgs never go active and recover 
> under min_size.

Thanks for the reply. In our case the cluster was HEALTH_OK -- all PGs
active and running for two weeks after upgrading to v14.2.11 (from
12.2.12). It took two weeks for us to notice that the pg logs were
growing without bound.

-- dan

>
>
>
> Original message
> From: Dan van der Ster
> To: Kalle Happonen;
> Cc: Ceph Users; Xie Xingguo 10072465; Samuel Just;
> Date: 2020-11-17 16:56
> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
> Hi Kalle,
>
> Strangely and luckily, in our case the memory explosion didn't reoccur
> after that incident. So I can mostly only offer moral support.
>
> But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
> think this is suspicious:
>
>b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk
>
>https://github.com/ceph/ceph/commit/b670715eb4
>
> Given that it adds a case where the pg_log is not trimmed, I wonder if
> there could be an unforeseen condition where `last_update_ondisk`
> isn't being updated correctly, and therefore the osd stops trimming
> the pg_log altogether.
>
> Xie or Samuel: does that sound possible?
>
> Cheers, Dan
>
> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  wrote:
> >
> > Hello all,
> > wrt: 
> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
> >
> > Yesterday we hit a problem with osd_pglog memory, similar to the thread 
> > above.
> >
> > We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per 
> > node. We run 8+3 EC for the data pool (metadata is on replicated nvme 
> > pool).
> >
> > The cluster has been running fine, and (as relevant to the post) the memory 
> > usage has been stable at 100 GB / node. We've had the default pg_log of 
> > 3000. The user traffic doesn't seem to have been exceptional lately.
> >
> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
> > memory usage on OSD nodes started to grow. On each node it grew steadily 
> > about 30 GB/day, until the servers started OOM killing OSD processes.
> >
> > After a lot of debugging we found that the pg_logs were huge. Each OSD 
> > process pg_log had grown to ~22GB, which we naturally didn't have memory 
> > for, and then the cluster was in an unstable situation. This is 
> > significantly more than the 1,5 GB in the post above. We do have ~20k pgs, 
> > which may directly affect the size.
> >
> > We've reduced the pg_log to 500, and started offline trimming it where we 
> > can, and also just waited. The pg_log size dropped to ~1,2 GB on at least 
> > some nodes, but we're still recovering, and have a lot of OSDs down and 
> > out still.
> >
> > We're unsure if version 14.2.13 triggered this, or if the osd restarts 
> > triggered this (or something unrelated we don't see).
> >
> > This mail is mostly to figure out if there are good guesses why the pg_log 
> > size per OSD process exploded? Any technical (and moral) support is 
> > appreciated. Also, currently we're not sure if 14.2.13 triggered this, so 
> > this is also to put a data point out there for other debuggers.
> >
> > Cheers,
> > Kalle Happonen
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread xie.xingguo
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-17 Thread Dan van der Ster
Hi Kalle,

Strangely and luckily, in our case the memory explosion didn't reoccur
after that incident. So I can mostly only offer moral support.

But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
think this is suspicious:

   b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk

   https://github.com/ceph/ceph/commit/b670715eb4

Given that it adds a case where the pg_log is not trimmed, I wonder if
there could be an unforeseen condition where `last_update_ondisk`
isn't being updated correctly, and therefore the osd stops trimming
the pg_log altogether.

Xie or Samuel: does that sound possible?

Cheers, Dan

On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen  wrote:
>
> Hello all,
> wrt: 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>
> Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>
> We have a 56 node object storage (S3+SWIFT) cluster with 25 OSD disk per 
> node. We run 8+3 EC for the data pool (metadata is on replicated nvme pool).
>
> The cluster has been running fine, and (as relevant to the post) the memory 
> usage has been stable at 100 GB / node. We've had the default pg_log of 3000. 
> The user traffic doesn't seem to have been exceptional lately.
>
> Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
> memory usage on OSD nodes started to grow. On each node it grew steadily 
> about 30 GB/day, until the servers started OOM killing OSD processes.
>
> After a lot of debugging we found that the pg_logs were huge. Each OSD 
> process pg_log had grown to ~22GB, which we naturally didn't have memory for, 
> and then the cluster was in an unstable situation. This is significantly more 
> than the 1,5 GB in the post above. We do have ~20k pgs, which may directly 
> affect the size.
>
> We've reduced the pg_log to 500, and started offline trimming it where we 
> can, and also just waited. The pg_log size dropped to ~1,2 GB on at least 
> some nodes, but we're still recovering, and have a lot of OSDs down and out 
> still.
>
> We're unsure if version 14.2.13 triggered this, or if the osd restarts 
> triggered this (or something unrelated we don't see).
>
> This mail is mostly to figure out if there are good guesses why the pg_log 
> size per OSD process exploded? Any technical (and moral) support is 
> appreciated. Also, currently we're not sure if 14.2.13 triggered this, so 
> this is also to put a data point out there for other debuggers.
>
> Cheers,
> Kalle Happonen
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] EC cluster cascade failures and performance problems

2020-11-17 Thread Paul Kramme
Hello,

currently, we are experiencing problems with a cluster used for storing
RBD backups. Config:

* 8 nodes, each with 6 HDD OSDs and 1 SSD used for blockdb and WAL
* k=4 m=2 EC
* dual 25GbE NIC
* v14.2.8

ceph health detail shows the following messages:

HEALTH_WARN BlueFS spillover detected on 1 OSD(s); 45 pgs not
deep-scrubbed in time; snap trim queue for 2 pg(s) >= 32768
(mon_osd_snap_trim_queue_warn_on); 1 slow ops, oldest one blocked for
18629 sec, mon.cloud10-1517 has slow ops
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
 osd.0 spilled over 68 MiB metadata from 'db' device (35 GiB used of
185 GiB) to slow device
PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
[--- cut ---]
PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768
(mon_osd_snap_trim_queue_warn_on)
snap trim queue for pg 18.2c5 at 41630
snap trim queue for pg 18.d6 at 44079
longest queue on pg 18.d6 at 44079
try decreasing "osd snap trim sleep" and/or increasing "osd pg max
concurrent snap trims".
SLOW_OPS 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517
has slow ops
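
As an aside, the two knobs named in the PG_SLOW_SNAP_TRIMMING hint above can be 
changed at runtime; a minimal sketch, assuming the centralized config store is in 
use and with purely illustrative values:

   ceph config set osd osd_pg_max_concurrent_snap_trims 4   # allow more parallel snap trims per OSD
   ceph config set osd osd_snap_trim_sleep 0                # a lower sleep trims faster, at the cost of HDD load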

We've made some observations on that cluster:
* The BlueFS spillover goes away with "ceph tell osd.0 compact" but
comes back eventually
* The blockdb/WAL SSD is highly utilized, while the HDDs are not
* When one OSD fails, there is a cascade failure taking down many other
OSDs across all nodes. Most of the time, the cluster comes back when
setting the nodown flag and restarting all failed OSDs one by one (see the
sketch after this list)
* Sometimes, especially during maintenance, "Long heartbeat ping times
on front/back interface seen, longest is 1390.076 msec" messages pop up
* The cluster performance deteriorates sharply when upgrading from
14.2.8 to 14.2.11 or later, so we've rolled back to 14.2.8
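
The nodown workaround above, spelled out as a sketch (N is a placeholder; remember 
to unset the flag once the OSDs are stable again):

   ceph osd set nodown
   systemctl restart ceph-osd@N       # repeat per failed OSD, one at a time
   ceph osd unset nodown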

Of these problems, the OSD cascade failure is the most important, and is
responsible for lengthy downtimes in the past few weeks.

Do you have any ideas on how to combat these problems?

Thank you,

Paul

-- 
Kind regards,
  Paul Kramme
Your Profihost Team

---
Profihost AG
Expo Plaza 1
30539 Hannover
Germany

Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com

Registered office: Hannover, VAT ID DE813460827
Court of registration: Amtsgericht Hannover, registration no. HRB 202350
Executive board: Cristoph Bluhm, Stefan Priebe, Marc Zocher, Dr. Claus Boyens,
Daniel Hagemeier
Supervisory board: Gabriele Pulvermüller (Chair)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osd_pglog memory hoarding - another case

2020-11-17 Thread Kalle Happonen
Hello all,
wrt: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/

Yesterday we hit a problem with osd_pglog memory, similar to the thread above.

We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per node. 
We run 8+3 EC for the data pool (metadata is on replicated nvme pool).

The cluster has been running fine, and (as relevant to the post) the memory 
usage has been stable at 100 GB / node. We've had the default pg_log of 3000. 
The user traffic doesn't seem to have been exceptional lately.

Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory 
usage on OSD nodes started to grow. On each node it grew steadily about 30 
GB/day, until the servers started OOM killing OSD processes. 

After a lot of debugging we found that the pg_logs were huge. Each OSD process 
pg_log had grown to ~22GB, which we naturally didn't have memory for, and then 
the cluster was in an unstable situation. This is significantly more than the 
1,5 GB in the post above. We do have ~20k pgs, which may directly affect the 
size.

We've reduced the pg_log to 500, and started offline trimming it where we can, 
and also just waited. The pg_log size dropped to ~1,2 GB on at least some 
nodes, but we're still recovering, and have a lot of OSDs down and out still.
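
For reference, lowering the in-memory limit and trimming offline can be done 
roughly like this (a sketch only; NN and PGID are placeholders, and the offline 
step requires the OSD in question to be stopped first):

   ceph config set osd osd_max_pg_log_entries 500
   ceph config set osd osd_min_pg_log_entries 500
   # offline, per affected PG, with the OSD stopped:
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN --pgid PGID --op trim-pg-log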

We're unsure if version 14.2.13 triggered this, or if the osd restarts 
triggered this (or something unrelated we don't see).

This mail is mostly to figure out whether there are good guesses as to why the 
pg_log size per OSD process exploded. Any technical (and moral) support is 
appreciated. 
Also, currently we're not sure if 14.2.13 triggered this, so this is also to 
put a data point out there for other debuggers.

Cheers,
Kalle Happonen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-17 Thread Janek Bevendorff
I have run radosgw-admin gc list (without --include-all) a few times 
already, but the list was always empty. I will create a cron job running 
it every few minutes and writing out the results.
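
A minimal sketch of such a cron entry, e.g. in /etc/cron.d/rgw-gc-list (interval 
and output path are arbitrary choices):

   */10 * * * * root radosgw-admin gc list --include-all > /var/log/ceph/gc-list-$(date +\%Y\%m\%d\%H\%M).json 2>&1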


On 17/11/2020 02:22, Eric Ivancich wrote:
I’m wondering if anyone experiencing this bug would mind running 
`radosgw-admin gc list --include-all` on a schedule and saving the 
results. I’d like to know whether these tail objects are getting 
removed by the gc process. If we find that that’s the case then 
there’s the issue of how they got on the gc list.


Eric


On Nov 16, 2020, at 3:48 AM, Janek Bevendorff 
> wrote:


As noted in the bug report, the issue has affected only multipart 
objects at this time. I have added some more remarks there.


And yes, multipart objects tend to have 0 byte head objects in 
general. The affected objects are simply missing all shadow objects, 
leaving us with nothing but the empty head object and a few metadata.



On 13/11/2020 20:14, Eric Ivancich wrote:

Thank you for the answers to those questions, Janek.

And in case anyone hasn’t seen it, we do have a tracker for this issue:

https://tracker.ceph.com/issues/47866

We may want to move most of the conversation to the comments there, 
so everything’s together.


I do want to follow up on your answer to Question 4, Janek:

On Nov 13, 2020, at 12:22 PM, Janek Bevendorff 
> wrote:


4. Is anyone experiencing this issue willing to run their RGWs 
with 'debug_ms=1'? That would allow us to see a request from an 
RGW to either remove a tail object or decrement its reference 
counter (and when its counter reaches 0 it will be deleted).


I haven't had any new data loss in the last few days (at least I 
think so, I read 1 byte from all objects, but didn't compare 
checksums, so I cannot say if all objects are complete, but at 
least all are there).


With multipart uploads I believe this is a sufficient test, as the 
first bit of data is in the first tail object, and it’s tail objects 
that seem to be disappearing.


However if the object is not uploaded via multipart and if it does 
have tail (_shadow_) objects, then the initial data is stored in the 
head object. So this test would not be truly diagnostic. This could 
be done with a large object, for example, with `s3cmd put 
--disable-multipart …`.
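
For instance, such a non-multipart test object could be created and later verified 
roughly like this (bucket and object names are placeholders):

   dd if=/dev/urandom of=bigobj bs=1M count=64
   s3cmd put --disable-multipart bigobj s3://test-bucket/bigobj
   # later, fetch it back and compare checksums:
   s3cmd get s3://test-bucket/bigobj bigobj.check
   md5sum bigobj bigobj.check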


Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io