[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
> But there can be an on-chip disk controller on the motherboard, I'm not sure.

There is always some kind of controller. Could be on-board. Usually, the cache 
settings are accessible when booting into the BIOS set-up.

> If your worry is fsync persistence

No, what I worry about is the volatile write cache, which is usually enabled by 
default. This cache exists on the disk as well as on the controller. To avoid 
losing writes on power failure, the controller needs to be in write-through mode 
and the disk write cache disabled. The latter can be done with smartctl, the 
former in the BIOS set-up.
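
As a rough sketch of how this can be checked and changed (assuming /dev/sdX is 
an OSD data disk; adjust device names to your set-up, and note that the smartctl 
setting may not survive a power cycle on every drive):

# show the current volatile write cache setting
smartctl -g wcache /dev/sdX
# disable the volatile write cache
smartctl -s wcache,off /dev/sdX
# alternatively, with hdparm
hdparm -W 0 /dev/sdX

The controller side (write-through instead of write-back) still has to be set in 
the BIOS/controller firmware as described above.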

Did you test power failure? If so, how often? On how many hosts simultaneously? 
Pulling network cables will not trigger cache-related problems. The problem 
with write cache is that you rely on a lot of bells and whistles, some of which 
usually fail. With ceph, this will lead to exactly the problem you are 
observing now.

Your pool configuration looks OK. You need to find out where exactly the scrub 
errors are situated. It looks like metadata damage and you might lose some 
data. Be careful to do only read-only admin operations for now.
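
For locating the errors, read-only inspection commands like the following are 
safe (a sketch using the PG id from your mails; they only report state and do 
not change anything):

rados list-inconsistent-obj 3.b --format=json-pretty
ceph pg 3.b query
ceph health detail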

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 16:08:58
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

> Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd 
> pool ls detail".

File ceph-osd-pool-ls-detail.txt attached.


> Did you look at the disk/controller cache settings?

I don't have disk controllers on the Ceph machines. The hard disk is directly 
attached to the motherboard via a SATA cable. But there can be an on-chip disk 
controller on the motherboard, I'm not sure.

If your worry is fsync persistence, I have thoroughly tested database fsync 
reliability on Ceph RBD with hundreds of transactions per second while removing 
network cables, restarting the database machine, etc. while inserts were going 
on, and I did not lose a single transaction. I simulated this many times and 
persistence on my Ceph cluster was perfect (i.e. not a single loss).


> I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and 
> record the output of "ceph -w | grep '3\.b'" (note the single quotes).

> The error messages you included in one of your first e-mails are only on 1 
> out of 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.

I ran "ceph pg deep-scrub 3.b" again; here is the whole output of ceph -w:


2020-11-02 22:33:48.224392 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key


2020-11-02 22:33:48.224396 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info


2020-11-02 22:35:30.087042 osd.0 [ERR] 3.b deep-scrub 3 errors


Btw, I'm very grateful for your perseverance on this.


Best regards

Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd 
pool ls detail".

Did you look at the disk/controller cache settings?

I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and record 
the output of "ceph -w | grep '3\.b'" (note the single quotes).

The error messages you included in one of your first e-mails are only on 1 out 
of 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 14:25:08
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank


> the primary OSD is probably not listed as a peer. Can you post the complete 
> output of

> - ceph pg 3.b query
> - ceph pg dump
> - ceph osd df tree

> in a pastebin?

Yes, the Primary OSD is 0.

I have attached above as .txt files. Please let me know if you still cannot 
read them.

Regards

Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hi Sagara,

the primary OSD is probably not listed as a peer. Can you post the complete 
output of

- ceph pg 3.b query
- ceph pg dump
- ceph osd df tree

in a pastebin?

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 11:53:58
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank


> Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd.


I checked other PGs with "active+clean", there is a "peer": "0".


But "ceph pg pgid query" always shows only two peers, sometime peer 0 and 1, or 
1 and 2, 0 and 2, etc.


Regards


Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hi Sagara,

looks like you have one copy on a new version and 2 on an old version. Can you 
add the information about which OSD each version resides on?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 10:10:02
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank


> I'm not sure if my hypothesis can be correct. Ceph sends an acknowledgement of a 
> write only after all copies are on disk. In other words, if PGs end up on 
> different versions after a power outage, one always needs to roll back. Since 
> you have two healthy OSDs in the PG and the PG is active (successfully 
> peered), it might just be a broken disk and read/write errors. I would focus 
> on that.

I tried to revert the PG as follows:

# ceph pg 3.b query | grep version
"last_user_version": 2263481,
"version": "4825'2264303",

"last_user_version": 2263481,
"version": "4825'2264301",

"last_user_version": 2263481,
"version": "4825'2264301",


ceph pg 3.b list_unfound

{
"num_missing": 0,
"num_unfound": 0,
"objects": [],
"more": false
}


# ceph pg 3.b mark_unfound_lost revert
pg has no unfound objects


# ceph pg 3.b revert
Invalid command: revert not in query
pg  query :  show details of a specific pg
Error EINVAL: invalid command


How to revert/rollback a PG?


> Another question, do you have write caches enabled (disk cache and controller 
> cache)? This is known to cause problems on power outages and also degraded 
> performance with ceph. You should check and disable any caches if necessary.

No. HDD is directly connected to motherboard.

Thank you

Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hi Sagara,

I'm not sure if my hypothesis can be correct. Ceph sends an acknowledgement of a 
write only after all copies are on disk. In other words, if PGs end up on 
different versions after a power outage, one always needs to roll back. Since 
you have two healthy OSDs in the PG and the PG is active (successfully peered), 
it might just be a broken disk and read/write errors. I would focus on that.

Another question, do you have write caches enabled (disk cache and controller 
cache)? This is known to cause problems on power outages and also degraded 
performance with ceph. You should check and disable any caches if necessary.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 01 November 2020 14:37:41
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

sorry: *badblocks* can force remappings of broken sectors (non-destructive 
read-write check)

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 01 November 2020 14:35:35
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Sagara,

looks like your situation is more complex. Before doing anything potentially 
destructive, you need to investigate some more. A possible interpretation 
(numbering just for the example):

OSD 0 PG at version 1
OSD 1 PG at version 2
OSD 2 PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward 
(OSD 2 PG at version 2), or OSD 1 needs to roll back (OSD 2 PG at version 1). 
Part of the relevant information on OSD 2 seems to be unreadable, therefore pg 
repair bails out.

You need to find out if you are in this situation or some other case. If you 
are, you need to find out somehow if you need to roll back or forward. I'm 
afraid in your current situation, even taking the OSD with the scrub errors 
down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state 
(has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue
  * if ddrescue manages to copy everything, copy back to a new disk and add to 
ceph
  * if ddrescue fails to copy everything, you could try if badblocks manages to 
get the disk back; ddrescue can force remappings of broken sectors 
(non-destructive read-write check) and it can happen that data becomes readable 
again, exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD

The tool for low-level operations is bluestore-tool. I never used it, so you 
need to look at the documentation.

If everything fails, I guess your last option is to decide for one of the 
copies, export it from one OSD and inject it to another one (but not any of 
0,1,2!). This will establish 2 identical copies and the third one will be 
changed to this one automatically. Note that this may lead to data loss on 
objects that were in the undefined state. As far as I can see, it's only 1 
object and probably possible to recover from (backup, snapshot).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 01 November 2020 14:05:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which 
> one is correct. You might have hit a very rare case. You should start with 
> the scrub errors, check which PGs and which copies (OSDs) are affected. It 
> sounds almost like all 3 scrub errors are on the same PG.
Yes, all 3 errors are for the same PG and on the same OSD:
2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is 
> probably not covered by "single point of failure".
Yes it was a complex crash, all went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to 
> reconstruct the PG from the third with PG export/PG import commands.
I have not done a PG export/import before. Would you mind sending the 
instructions or a link for it?

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Frank Schilder
sorry: *badblocks* can force remappings of broken sectors (non-destructive 
read-write check)

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 01 November 2020 14:35:35
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Sagara,

looks like your situation is more complex. Before doing anything potentially 
destructive, you need to investigate some more. A possible interpretation 
(numbering just for the example):

OSD 0 PG at version 1
OSD 1 PG at version 2
OSD 2 PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward 
(OSD 2 PG at version 2), or OSD 1 needs to roll back (OSD 2 PG at version 1). 
Part of the relevant information on OSD 2 seems to be unreadable, therefore pg 
repair bails out.

You need to find out if you are in this situation or some other case. If you 
are, you need to find out somehow if you need to roll back or forward. I'm 
afraid in your current situation, even taking the OSD with the scrub errors 
down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state 
(has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue
  * if ddrescue manages to copy everything, copy back to a new disk and add to 
ceph
  * if ddrescue fails to copy everything, you could try if badblocks manages to 
get the disk back; ddrescue can force remappings of broken sectors 
(non-destructive read-write check) and it can happen that data becomes readable 
again, exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD

The tool for low-level operations is bluestore-tool. I never used it, so you 
need to look at the documentation.

If everything fails, I guess your last option is to decide for one of the 
copies, export it from one OSD and inject it to another one (but not any of 
0,1,2!). This will establish 2 identical copies and the third one will be 
changed to this one automatically. Note that this may lead to data loss on 
objects that were in the undefined state. As far as I can see, it's only 1 
object and probably possible to recover from (backup, snapshot).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 01 November 2020 14:05:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which 
> one is correct. You might have hit a very rare case. You should start with 
> the scrub errors, check which PGs and which copies (OSDs) are affected. It 
> sounds almost like all 3 scrub errors are on the same PG.
Yes, all 3 errors are for the same PG and on the same OSD:
2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is 
> probably not covered by "single point of failure".
Yes it was a complex crash, all went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to 
> reconstruct the PG from the third with PG export/PG import commands.
I have not done a PG export/import before. Would you mind sending the 
instructions or a link for it?

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Frank Schilder
Hi Sagara,

looks like your situation is more complex. Before doing anything potentially 
destructive, you need to investigate some more. A possible interpretation 
(numbering just for the example):

OSD 0 PG at version 1
OSD 1 PG at version 2
OSD 2 PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward 
(OSD 2 PG at version 2), or OSD 1 needs to roll back (OSD 2 PG at version 1). 
Part of the relevant information on OSD 2 seems to be unreadable, therefore pg 
repair bails out.

You need to find out if you are in this situation or some other case. If you 
are, you need to find out somehow if you need to roll back or forward. I'm 
afraid in your current situation, even taking the OSD with the scrub errors 
down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state 
(has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue
  * if ddrescue manages to copy everything, copy back to a new disk and add to 
ceph
  * if ddrescue fails to copy everything, you could try if badblocks manages to 
get the disk back; ddrescue can force remappings of broken sectors 
(non-destructive read-write check) and it can happen that data becomes readable 
again, exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD
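
As a rough command sketch for the disk-health part of the list above (assuming 
the suspect OSD's disk is /dev/sdX and /dev/sdY is a spare replacement disk; 
device names are placeholders and the OSD must be stopped first):

# look for remapped or pending sectors
smartctl -a /dev/sdX | grep -i -e reallocated -e pending
# clone the disk, skipping unreadable sectors, keeping a map file for retries
ddrescue -f /dev/sdX /dev/sdY /root/sdX.map
# non-destructive read-write test that can trigger remapping of bad sectors
badblocks -nsv /dev/sdX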

The tool for low-level operations is bluestore-tool. I never used it, so you 
need to look at the documentation.

If everything fails, I guess your last option is to decide for one of the 
copies, export it from one OSD and inject it to another one (but not any of 
0,1,2!). This will establish 2 identical copies and the third one will be 
changed to this one automatically. Note that this may lead to data loss on 
objects that were in the undefined state. As far as I can see, it's only 1 
object and probably possible to recover from (backup, snapshot).
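
If it comes to that, the export/import would look roughly like this (a sketch 
with ceph-objectstore-tool; OSD ids and paths are placeholders, and the OSDs 
involved must be stopped while the tool runs):

# on the host of the source OSD (stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 3.b \
  --op export --file /tmp/pg3.b.export
# copy the file over, then on the host of the target OSD (also stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
  --op import --file /tmp/pg3.b.export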

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 01 November 2020 14:05:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which 
> one is correct. You might have hit a very rare case. You should start with 
> the scrub errors, check which PGs and which copies (OSDs) are affected. It 
> sounds almost like all 3 scrub errors are on the same PG.
Yes, all 3 errors are for the same PG and on the same OSD:
2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is 
> probably not covered by "single point of failure".
Yes it was a complex crash, all went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to 
> reconstruct the PG from the third with PG export/PG import commands.
I have not done a PG export/import before. Would you mind sending the 
instructions or a link for it?

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Frank Schilder
I think this happens when a PG has 3 different copies and cannot decide which 
one is correct. You might have hit a very rare case. You should start with the 
scrub errors, check which PGs and which copies (OSDs) are affected. It sounds 
almost like all 3 scrub errors are on the same PG.

You might have had a combination of crash and OSD fail, your situation is 
probably not covered by "single point of failure".

In case you have a PG with scrub errors on 2 copies, you should be able to 
reconstruct the PG from the third with PG export/PG import commands.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 01 November 2020 13:16:08
To: ceph-users@ceph.io
Subject: [ceph-users] How to recover from 
active+clean+inconsistent+failed_repair?

Hi all

I have a Ceph cluster (Nautilus 14.2.11) with 3 Ceph nodes.
A crash happened and all 3 Ceph nodes went down.
One (1) PG turned "active+clean+inconsistent" and I tried to repair it. After the 
repair, the PG in question now shows "active+clean+inconsistent+failed_repair" 
and I cannot bring the cluster to "active+clean".
How do I rescue the cluster? Is this a false positive?
Here are the details:
All three Ceph nodes run ceph-mon, ceph-mgr, ceph-osd and ceph-mds.

1. ceph -s
   health: HEALTH_ERR
           3 scrub errors
           Possible data damage: 1 pg inconsistent
   pgs: 191 active+clean
        1   active+clean+inconsistent

2. ceph health detail
   HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
   OSD_SCRUB_ERRORS 3 scrub errors
   PG_DAMAGED Possible data damage: 1 pg inconsistent
       pg 3.b is active+clean+inconsistent, acting [0,1,2]

3. rados list-inconsistent-pg rbd
   []

4. ceph pg deep-scrub 3.b

5. ceph pg repair 3.b

6. ceph health detail
   HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
   OSD_SCRUB_ERRORS 3 scrub errors
   PG_DAMAGED Possible data damage: 1 pg inconsistent
       pg 3.b is active+clean+inconsistent+failed_repair, acting [0,1,2]

7. rados list-inconsistent-obj 3.b --format=json-pretty
   {
       "epoch": 4769,
       "inconsistents": []
   }

8. ceph pg 3.b list_unfound
   {
       "num_missing": 0,
       "num_unfound": 0,
       "objects": [],
       "more": false
   }

Appreciate your help.

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very high read IO during backfilling

2020-10-30 Thread Frank Schilder
Are you a victim of bluefs_buffered_io=false: 
https://www.mail-archive.com/ceph-users@ceph.io/msg05550.html ?
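
If that is the cause, a quick check/change could look like this (a sketch; on 
Octopus the central config store should work, and an OSD restart may be needed 
for the new value to take effect):

ceph config get osd bluefs_buffered_io
ceph config set osd bluefs_buffered_io true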

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Kamil Szczygieł 
Sent: 27 October 2020 21:39:22
To: ceph-users@ceph.io
Subject: [ceph-users] Very high read IO during backfilling

Hi,

We're running Octopus and we have 3 control plane nodes (12 core, 64 GB memory 
each) that are running mon, mds and mgr, and also 4 data nodes (12 core, 256 GB 
memory, 13x10TB HDDs each). We've increased the number of PGs inside our pool, 
which resulted in all OSDs going crazy and reading an average of 900 MB/s 
constantly (based on iotop).

This has resulted in slow ops and very low recovery speed. Any tips on how to 
handle this kind of situation? We've osd_recovery_sleep_hdd set to 0.2, 
osd_recovery_max_active set to 5 and osd_max_backfills set to 4. Some OSDs are 
reporting slow ops constantly and iowait on machines is at 70-80% constantly.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability release

2020-10-30 Thread Frank Schilder
umount + mount worked. Thanks!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 30 October 2020 10:22:38
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond 
to capability release

Hi,

You said you dropped caches -- can you try again echo 3 >
/proc/sys/vm/drop_caches ?

Otherwise, does umount then mount from one of the clients clear the warning?

(I don't believe this is due to a "busy client", but rather a kernel
client bug where it doesn't release caps in some cases -- we've seen
this in the past but not recently).

-- Dan

On Fri, Oct 30, 2020 at 10:13 AM Frank Schilder  wrote:
>
> Dear cephers,
>
> I have a somewhat strange situation. I have the health warning:
>
> # ceph health detail
> HEALTH_WARN 3 clients failing to respond to capability release
> MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
> mdsceph-12(mds.0): Client sn106.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 30716617
> mdsceph-12(mds.0): Client sn269.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 30717358
> mdsceph-12(mds.0): Client sn009.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 30749150
>
> However, these clients are not busy right now. Also, they hold almost 
> nothing; see snippets from "session ls" below. It is possible that a very IO 
> intensive application was running on these nodes and these release requests 
> got stuck. How do I resolve this issue? Can I just evict the client?
>
> Version is mimic 13.2.8. Note that we execute a drop cache command after a 
> job finishes on these clients. It's possible that the clients dropped the caps 
> already before the MDS request was handled/received.
>
> Best regards,
> Frank
>
> {
> "id": 30717358,
> "num_leases": 0,
> "num_caps": 44,
> "state": "open",
> "request_load_avg": 0,
> "uptime": 6632206.332307,
> "replay_requests": 0,
> "completed_requests": 0,
> "reconnecting": false,
> "inst": "client.30717358 192.168.57.140:0/3212676185",
> "client_metadata": {
> "features": "00ff",
> "entity_id": "con-fs2-hpc",
> "hostname": "sn269.hpc.ait.dtu.dk",
> "kernel_version": "3.10.0-957.12.2.el7.x86_64",
> "root": "/hpc/home"
> }
> },
> --
> {
> "id": 30716617,
> "num_leases": 0,
> "num_caps": 48,
> "state": "open",
> "request_load_avg": 1,
> "uptime": 6632206.336307,
> "replay_requests": 0,
> "completed_requests": 1,
> "reconnecting": false,
> "inst": "client.30716617 192.168.56.233:0/2770977433",
> "client_metadata": {
> "features": "00ff",
> "entity_id": "con-fs2-hpc",
> "hostname": "sn106.hpc.ait.dtu.dk",
> "kernel_version": "3.10.0-957.12.2.el7.x86_64",
> "root": "/hpc/home"
> }
> },
> --
> {
> "id": 30749150,
> "num_leases": 0,
> "num_caps": 44,
> "state": "open",
> "request_load_avg": 0,
> "uptime": 6632206.338307,
> "replay_requests": 0,
> "completed_requests": 0,
> "reconnecting": false,
> "inst": "client.30749150 192.168.56.136:0/2578719015",
> "client_metadata": {
> "features": "00ff",
> "entity_id": "con-fs2-hpc",
> "hostname": "sn009.hpc.ait.dtu.dk",
> "kernel_version": "3.10.0-957.12.2.el7.x86_64",
> "root": "/hpc/home"
> }
> },
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability release

2020-10-30 Thread Frank Schilder
Dear cephers,

I have a somewhat strange situation. I have the health warning:

# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
mdsceph-12(mds.0): Client sn106.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 30716617
mdsceph-12(mds.0): Client sn269.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 30717358
mdsceph-12(mds.0): Client sn009.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 30749150

However, these clients are not busy right now. Also, they hold almost nothing; 
see snippets from "session ls" below. It is possible that a very IO intensive 
application was running on these nodes and these release requests got stuck. 
How do I resolve this issue? Can I just evict the client?
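
In case eviction turns out to be the way to go, a sketch of what I would try 
(the client id is taken from the health output above, and mds.ceph-12 is my 
guess at the daemon name; an umount/mount on the client may be needed 
afterwards):

ceph tell mds.ceph-12 client evict id=30716617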

Version is mimic 13.2.8. Note that we execute a drop cache command after a job 
finishes on these clients. It's possible that the clients dropped the caps 
already before the MDS request was handled/received.

Best regards,
Frank

{
"id": 30717358,
"num_leases": 0,
"num_caps": 44,
"state": "open",
"request_load_avg": 0,
"uptime": 6632206.332307,
"replay_requests": 0,
"completed_requests": 0,
"reconnecting": false,
"inst": "client.30717358 192.168.57.140:0/3212676185",
"client_metadata": {
"features": "00ff",
"entity_id": "con-fs2-hpc",
"hostname": "sn269.hpc.ait.dtu.dk",
"kernel_version": "3.10.0-957.12.2.el7.x86_64",
"root": "/hpc/home"
}
},
--
{
"id": 30716617,
"num_leases": 0,
"num_caps": 48,
"state": "open",
"request_load_avg": 1,
"uptime": 6632206.336307,
"replay_requests": 0,
"completed_requests": 1,
"reconnecting": false,
"inst": "client.30716617 192.168.56.233:0/2770977433",
"client_metadata": {
"features": "00ff",
"entity_id": "con-fs2-hpc",
"hostname": "sn106.hpc.ait.dtu.dk",
"kernel_version": "3.10.0-957.12.2.el7.x86_64",
"root": "/hpc/home"
}
},
--
{
"id": 30749150,
"num_leases": 0,
"num_caps": 44,
"state": "open",
    "request_load_avg": 0,
"uptime": 6632206.338307,
"replay_requests": 0,
"completed_requests": 0,
"reconnecting": false,
"inst": "client.30749150 192.168.56.136:0/2578719015",
"client_metadata": {
"features": "00ff",
"entity_id": "con-fs2-hpc",
"hostname": "sn009.hpc.ait.dtu.dk",
"kernel_version": "3.10.0-957.12.2.el7.x86_64",
"root": "/hpc/home"
}
},

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: frequent Monitor down

2020-10-30 Thread Frank Schilder
I remember exactly this discussion some time ago, where one of the developers 
gave some more subtle reasons for not using even numbers. The maths sounds 
simple, with 4 mons you can tolerate the loss of 1, just like with 3 mons. The 
added benefit seems to be the extra copy of a mon.

However, the reality is not that simple. There is apparently some subtlety, 
having more to do with the physical set-up, that makes 4 mons worse than 3 
(more likely to lead to loss of service). I do not remember the thread, but it 
was within the last year.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Janne Johansson 
Sent: 29 October 2020 22:07:45
To: Tony Liu
Cc: Marc Roos; ceph-users
Subject: [ceph-users] Re: frequent Monitor down

Den tors 29 okt. 2020 kl 20:16 skrev Tony Liu :

> Typically, the number of nodes is 2n+1 to cover n failures.
> It's OK to have 4 nodes, from failure covering POV, it's the same
> as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes down, the
> cluster is down. It works, just not make much sense.
>
>
Well, you can see it the other way around, with 3 configured mons, and only
2 up, you know you have a majority and can go on with writes.
With 4 configured mons and only 2 up, it stops because you get the split
brain scenario. For a 2DC setup with 2 mons at each place, a split is still
fatal.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Huge HDD ceph monitor usage [EXT]

2020-10-29 Thread Frank Schilder
> ... i will now use only one site, but first need to stabilize the
> cluster to remove the EC erasure coding and use replication ...

If you change to one site only, there is no point in getting rid of the EC 
pool. Your main problem will be restoring the lost data. Do you have backup of 
everything? Do you still have the old OSDs? You never answered these questions.

To give you an idea why this is important: with ceph, losing 1% of data on an 
rbd pool does *not* mean you lose 1% of the disks. It means that, on average, 
every disk loses 1% of its blocks. In other words, getting everything up again 
will be a lot of work either way.

The best path to follow is what Eugen suggested: add mons to have at least 3 
and dig out the old disks to be able to export and import PGs. Look at Eugen's 
last 2 e-mails, it's a starting point. You might be able to recover more by 
temporarily reducing min_size to 1 on the replicated pools and to 4 on the EC 
pool. If possible, make sure there is no client access during that time. The 
missing rest needs to be scraped off the OSDs you deleted from the cluster.
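
A minimal sketch of the temporary min_size change (pool names other than 
data_storage are placeholders; set the values back to the originals once 
recovery has finished):

ceph osd pool set <replicated-pool> min_size 1
ceph osd pool set data_storage min_size 4
# after recovery, restore the previous values, e.g.
ceph osd pool set <replicated-pool> min_size 2
ceph osd pool set data_storage min_size 5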

If you have backup of everything, starting from scratch and populating the ceph 
cluster from backup might be the fastest option.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 28 October 2020 07:23:09
To: Ing. Luis Felipe Domínguez Vega
Cc: Ceph Users
Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT]

If you have that many spare hosts I would recommend to deploy two more
MONs on them, and probably also additional MGRs so they can failover.

What is the EC profile for the data_storage pool?

Can you also share

ceph pg dump pgs | grep -v "active+clean"

to see which PGs are affected.
The remaining issue with unfound objects and unkown PGs could be
because you removed OSDs. That could mean data loss, but maybe there's
a chance to recover anyway.


Zitat von "Ing. Luis Felipe Domínguez Vega" :

> Well recovering not working yet... i was started 6 servers more and
> the cluster not yet recovered.
> Ceph status not show any recover progress
>
> ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/
> ceph osd tree   : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
> ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/
> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
> crush rules : (ceph osd crush rule dump)
> https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>
> El 2020-10-27 09:59, Eugen Block escribió:
>> Your pool 'data_storage' has a size of 7 (or 7 chunks since it's
>> erasure-coded) and the rule requires each chunk on a different host
>> but you currently have only 5 hosts available, that's why the recovery
>> is not progressing. It's waiting for two more hosts. Unfortunately,
>> you can't change the EC profile or the rule of that pool. I'm not sure
>> if it would work in the current cluster state, but if you can't add
>> two more hosts (which would be your best option for recovery) it might
>> be possible to create a new replicated pool (you seem to have enough
>> free space) and copy the contents from that EC pool. But as I said,
>> I'm not sure if that would work in a degraded state, I've never tried
>> that.
>>
>> So your best bet is to get two more hosts somehow.
>>
>>
>>> pool 4 'data_storage' erasure profile desoft size 7 min_size 5
>>> crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32
>>> autoscale_mode off last_change 154384 lfor 0/121016/121014 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> application rbd
>>
>>
>> Zitat von "Ing. Luis Felipe Domínguez Vega" :
>>
>>> Needed data:
>>>
>>> ceph -s : https://pastebin.ubuntu.com/p/S9gKjyZtdK/
>>> ceph osd tree   : https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
>>> ceph osd df : (later, because i'm waiting since 10
>>> minutes and not output yet)
>>> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
>>> crush rules : (ceph osd crush rule dump)
>>> https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>>>
>>> El 2020-10-27 07:14, Eugen Block escribió:
>>>>> I understand, but i delete the OSDs from CRUSH map, so ceph
>>>>> don't   wait for these OSDs, i'm right?
>>>>
>>>> It depends on your actual crush tree and rules. Can you share (maybe
>>>> you already did)
>>>>
>>>> ceph osd tree
>>>> ceph osd df
>>>> ceph osd pool ls detail
>>>>
>>>> and a dump

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Frank Schilder
I think you really need to sit down and explain the full story. Dropping 
one-liners with new information will not work via e-mail.

I have never heard of the problem you are facing, so you did something that 
possibly no-one else has done before. Unless we know the full history from the 
last time the cluster was health_ok until now, it will almost certainly not be 
possible to figure out what is going on via e-mail.

Usually, setting "norebalance" and "norecovery" should stop any recovery IO and 
allow the PGs to peer. If they do not become active, something is wrong and the 
information we got so far does not give a clue what this could be.
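
For reference, a sketch of setting and clearing these flags (the actual flag 
name for recovery is "norecover"):

ceph osd set norebalance
ceph osd set norecover
# ... wait for the PGs to peer ...
ceph osd unset norecover
ceph osd unset norebalance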

Please post the output of "ceph health detail", "ceph osd pool stats" and "ceph 
osd pool ls detail" and a log of actions and results since last health_ok 
status here, maybe it gives a clue what is going on.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Zhenshi Zhou 
Sent: 29 October 2020 09:44:14
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] monitor sst files continue growing

I reset the pg_num after adding OSDs; it made some PGs inactive (in activating 
state).

Frank Schilder <fr...@dtu.dk> wrote on Thursday, 29 October 2020 at 15:56:
This does not explain incomplete and inactive PGs. Are you hitting 
https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not recover 
from OSD restart")? In that case, temporarily stopping and restarting all new 
OSDs might help.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Zhenshi Zhou <deader...@gmail.com>
Sent: 29 October 2020 08:30:25
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] monitor sst files continue growing

After adding OSDs into the cluster, the recovery and backfill progress has not 
finished yet.

Zhenshi Zhou <deader...@gmail.com> wrote on Thursday, 29 October 2020 at 15:29:
MGR is stopped by me cause it took too much memories.
For pg status, I added some OSDs in this cluster, and it

Frank Schilder <fr...@dtu.dk> wrote on Thursday, 29 October 2020 at 15:27:
Your problem is the overall cluster health. The MONs store cluster history 
information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs 
only makes things worse right now. The health status is a mess, no MGR, a bunch 
of PGs inactive, etc. This is what you need to resolve. How did your cluster 
end up like this?

It looks like all OSDs are up and in. You need to find out

- why there are inactive PGs
- why there are incomplete PGs

This usually happens when OSDs go missing.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Zhenshi Zhou <deader...@gmail.com>
Sent: 29 October 2020 07:37:19
To: ceph-users
Subject: [ceph-users] monitor sst files continue growing

Hi all,

My cluster is in a wrong state. SST files in /var/lib/ceph/mon/xxx/store.db
continue growing. It claims the mons are using a lot of disk space.

I set "mon compact on start = true" and restarted one of the monitors. But
it started compacting for a long time; it seems to have no end.
[image.png]
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
He he.

> It will prevent OSDs from being marked out if you shut them down or the .

... down or the MONs lose heartbeats due to high network load during peering.

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 29 October 2020 09:05:27
To: Mark Johnson; ceph-users@ceph.io
Subject: [ceph-users] Re: pgs stuck backfill_toofull

It will prevent OSDs from being marked out if you shut them down or the . 
Changing PG counts does not require a shutdown of OSDs, but sometimes OSDs get 
overloaded by peering traffic and the MONs can lose contact for a while. 
Setting noout will prevent flapping and also reduce the administrative traffic 
a bit. It's just a precaution.

If this is a production system, you need to rethink your size 2 min size 1 
config. This is the major problem for keeping the service available under 
maintenance.

Please take your time and read the docs on all the commands I sent you. The 
cluster status is not critical as far as I can see.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Johnson 
Sent: 29 October 2020 08:58:15
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: pgs stuck backfill_toofull

Thanks again Frank.  That gives me something to digest (and try to understand).

One question regarding maintenance mode, these are production systems that are 
required to be available all the time.  What, exactly, will happen if I issue 
this command for maintenance mode?

Thanks,
Mark


On Thu, 2020-10-29 at 07:51 +0000, Frank Schilder wrote:

Cephfs pools are uncritical, because ceph fs splits very large files into 
chunks of objectsize. The RGW pool is the problem, because RGW does not as far 
as I know. A few 1TB uploads and you have a problem.


The calculation is confusing, because the term PG is used in two different 
meanings, unfortunately. The pool PG count and OSD PG count are different 
things. A PG is a virtual raid set distributed over some OSDs. The number of 
PGs in a pool is the count of such raid sets. The PG count for an OSD is in 
fact the PG membership count - something completely different. It says in how 
many PGs an OSD is a member of. To create 100PGs with replication 3 you need 
3x100=300 PG memberships. If you have 3 OSDs, these will have 100 PG 
memberships each. This is shown as PGs in the utilisation columns. If these 
terms were used with a bit more precision, it would be less confusing.


If the data distribution will remain more or less the same in the near future, 
changing the PG count as follows should help:


Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG 
count for pool 20 from 64 to 512 will require 2x(512-64)=896 additional PG 
memberships. Distributed over 20 OSDs, this is on average 44.8 memberships per 
OSD. This will leave PG memberships available for the future and should sort 
out your distribution problem.


If you want to follow this route, you can do the following:


- ceph osd set noout # maintenance mode

- ceph osd set norebalance # prevent immediate start of rebalancing

- increase pg_num and pgp_num of pool 20 to 512

- increase the reweight of osd.3 to, say 0.8

- wait for peering to finish and any recovery to complete

- ceph osd unset noout # leave maintenance mode

- if everything OK (all PGs active, no degraded objects, no recovery) do ceph 
osd unset norebalance

- once the rebalancing is finished, reweight the OSDs manually, the built-in 
reweight commands are a bit limited


is that just a matter of "ceph osd reweight osd.3 1"

Yes, this will do. However, increase probably in less aggressive steps. You 
will need some rebalancing, because you run a bit low on available space.


As a final note, running with size 2 min size 1 is a serious data redundancy 
risk. You should get another server and upgrade to 3(2).


Best regards,

=====

Frank Schilder

AIT Risø Campus

Bygning 109, rum S14




From: Mark Johnson <ma...@iovox.com>
Sent: 29 October 2020 08:19:01
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: pgs stuck backfill_toofull


Thanks for your swift reply.  Below is the requested information.


I understand the bit about not being able to reduce the pg count as we've come 
across this issue once before.  This is the reason I've been hesitant to make 
any changes there without being 100% certain of getting it right and the impact 
of these changes.  That, and the more I read about how to calculate this, the 
more confused I get.  As for the reweight, is that just a matter of "ceph osd 
reweight osd.3 1" once the other issues are sorted out (or perhaps start with a 
less dramatic change and work up)?


Also, presuming I need to change the pg/pgp num, would you be

[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
It will prevent OSDs from being marked out if you shut them down or the . 
Changing PG counts does not require a shutdown of OSDs, but sometimes OSDs get 
overloaded by peering traffic and the MONs can lose contact for a while. 
Setting noout will prevent flapping and also reduce the administrative traffic 
a bit. It's just a precaution.

If this is a production system, you need to rethink your size 2 min size 1 
config. This is the major problem for keeping the service available under 
maintenance.

Please take your time and read the docs on all the commands I sent you. The 
cluster status is not critical as far as I can see.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Johnson 
Sent: 29 October 2020 08:58:15
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: pgs stuck backfill_toofull

Thanks again Frank.  That gives me something to digest (and try to understand).

One question regarding maintenance mode, these are production systems that are 
required to be available all the time.  What, exactly, will happen if I issue 
this command for maintenance mode?

Thanks,
Mark


On Thu, 2020-10-29 at 07:51 +, Frank Schilder wrote:

Cephfs pools are uncritical, because ceph fs splits very large files into 
chunks of objectsize. The RGW pool is the problem, because RGW does not as far 
as I know. A few 1TB uploads and you have a problem.


The calculation is confusing, because the term PG is used in two different 
meanings, unfortunately. The pool PG count and OSD PG count are different 
things. A PG is a virtual raid set distributed over some OSDs. The number of 
PGs in a pool is the count of such raid sets. The PG count for an OSD is in 
fact the PG membership count - something completely different. It says in how 
many PGs an OSD is a member of. To create 100PGs with replication 3 you need 
3x100=300 PG memberships. If you have 3 OSDs, these will have 100 PG 
memberships each. This is shown as PGs in the utilisation columns. If these 
terms were used with a bit more precision, it would be less confusing.


If the data distribution will remain more or less the same in the near future, 
changing the PG count as follows should help:


Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG 
count for pool 20 from 64 to 512 will require 2x(512-64)=896 additional PG 
memberships. Distributed over 20 OSDs, this is on average 44.8 memberships per 
OSD. This will leave PG memberships available for the future and should sort 
out your distribution problem.


If you want to follow this route, you can do the following:


- ceph osd set noout # maintenance mode

- ceph osd set norebalance # prevent immediate start of rebalancing

- increase pg_num and pgp_num of pool 20 to 512

- increase the reweight of osd.3 to, say 0.8

- wait for peering to finish and any recovery to complete

- ceph osd unset noout # leave maintenance mode

- if everything OK (all PGs active, no degraded objects, no recovery) do ceph 
osd unset norebalance

- once the rebalancing is finished, reweight the OSDs manually, the built-in 
reweight commands are a bit limited


is that just a matter of "ceph osd reweight osd.3 1"

Yes, this will do. However, increase probably in less aggressive steps. You 
will need some rebalancing, because you run a bit low on available space.


As a final note, running with size 2 min size 1 is a serious data redundancy 
risk. You should get another server and upgrade to 3(2).


Best regards,

=====

Frank Schilder

AIT Risø Campus

Bygning 109, rum S14




From: Mark Johnson <ma...@iovox.com>
Sent: 29 October 2020 08:19:01
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: pgs stuck backfill_toofull


Thanks for your swift reply.  Below is the requested information.


I understand the bit about not being able to reduce the pg count as we've come 
across this issue once before.  This is the reason I've been hesitant to make 
any changes there without being 100% certain of getting it right and the impact 
of these changes.  That, and the more I read about how to calculate this, the 
more confused I get.  As for the reweight, is that just a matter of "ceph osd 
reweight osd.3 1" once the other issues are sorted out (or perhaps start with a 
less dramatic change and work up)?


Also, presuming I need to change the pg/pgp num, would you be suggesting on 
pool 2 based on the below info (the pool with a few large files) or on pool 20 
(the pool with the most data but an average of about 250KB file size)?  I'm 
just completely confused as to what's caused this issue in the first place and 
how to go about fixing it.  On top of that, am I going to be able to increase 
the pg/pgp count with the cluster in a state of health_warn?  Just some posts 
I've read 

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Frank Schilder
This does not explain incomplete and inactive PGs. Are you hitting 
https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not recover 
from OSD restart")? In that case, temporarily stopping and restarting all new 
OSDs might help.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Zhenshi Zhou 
Sent: 29 October 2020 08:30:25
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] monitor sst files continue growing

After adding OSDs into the cluster, the recovery and backfill progress has not 
finished yet.

Zhenshi Zhou <deader...@gmail.com> wrote on Thursday, 29 October 2020 at 15:29:
The MGR was stopped by me because it took too much memory.
For pg status, I added some OSDs in this cluster, and it

Frank Schilder <fr...@dtu.dk> wrote on Thursday, 29 October 2020 at 15:27:
Your problem is the overall cluster health. The MONs store cluster history 
information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs 
only makes things worse right now. The health status is a mess, no MGR, a bunch 
of PGs inactive, etc. This is what you need to resolve. How did your cluster 
end up like this?

It looks like all OSDs are up and in. You need to find out

- why there are inactive PGs
- why there are incomplete PGs

This usually happens when OSDs go missing.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Zhenshi Zhou <deader...@gmail.com>
Sent: 29 October 2020 07:37:19
To: ceph-users
Subject: [ceph-users] monitor sst files continue growing

Hi all,

My cluster is in a wrong state. SST files in /var/lib/ceph/mon/xxx/store.db
continue growing. It claims the mons are using a lot of disk space.

I set "mon compact on start = true" and restarted one of the monitors. But
it started compacting for a long time; it seems to have no end.
[image.png]
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
Cephfs pools are uncritical, because ceph fs splits very large files into 
chunks of objectsize. The RGW pool is the problem, because RGW does not as far 
as I know. A few 1TB uploads and you have a problem.

The calculation is confusing, because the term PG is used in two different 
meanings, unfortunately. The pool PG count and OSD PG count are different 
things. A PG is a virtual raid set distributed over some OSDs. The number of 
PGs in a pool is the count of such raid sets. The PG count for an OSD is in 
fact the PG membership count - something completely different. It says in how 
many PGs an OSD is a member of. To create 100PGs with replication 3 you need 
3x100=300 PG memberships. If you have 3 OSDs, these will have 100 PG 
memberships each. This is shown as PGs in the utilisation columns. If these 
terms were used with a bit more precision, it would be less confusing.

If the data distribution will remain more or less the same in the near future, 
changing the PG count as follows should help:

Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG 
count for pool 20 from 64 to 512 will require 2x(512-64)=896 additional PG 
memberships. Distributed over 20 OSDs, this is on average 44.8 memberships per 
OSD. This will leave PG memberships available for the future and should sort 
out your distribution problem.

If you want to follow this route, you can do the following:

- ceph osd set noout # maintenance mode
- ceph osd set norebalance # prevent immediate start of rebalancing
- increase pg_num and pgp_num of pool 20 to 512
- increase the reweight of osd.3 to, say 0.8
- wait for peering to finish and any recovery to complete
- ceph osd unset noout # leave maintenance mode
- if everything OK (all PGs active, no degraded objects, no recovery) do ceph 
osd unset norebalance
- once the rebalancing is finished, reweight the OSDs manually, the built-in 
reweight commands are a bit limited
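
As a command sketch, the sequence above would look roughly like this 
(<pool-20-name> is a placeholder for the name of pool 20; run the steps one at a 
time and watch "ceph status" in between):

ceph osd set noout
ceph osd set norebalance
ceph osd pool set <pool-20-name> pg_num 512
ceph osd pool set <pool-20-name> pgp_num 512
ceph osd reweight osd.3 0.8
# wait for peering and recovery to finish, then
ceph osd unset noout
ceph osd unset norebalance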

> is that just a matter of "ceph osd reweight osd.3 1"
Yes, this will do. However, increase probably in less aggressive steps. You 
will need some rebalancing, because you run a bit low on available space.

As a final note, running with size 2 min size 1 is a serious data redundancy 
risk. You should get another server and upgrade to 3(2).
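
Once a third server is in place, the change itself is simple (a sketch with a 
placeholder pool name; expect a rebalance when the size is raised):

ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2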

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Johnson 
Sent: 29 October 2020 08:19:01
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: pgs stuck backfill_toofull

Thanks for your swift reply.  Below is the requested information.

I understand the bit about not being able to reduce the pg count as we've come 
across this issue once before.  This is the reason I've been hesitant to make 
any changes there without being 100% certain of getting it right and the impact 
of these changes.  That, and the more I read about how to calculate this, the 
more confused I get.  As for the reweight, is that just a matter of "ceph osd 
reweight osd.3 1" once the other issues are sorted out (or perhaps start with a 
less dramatic change and work up)?

Also, presuming I need to change the pg/pgp num, would you be suggesting on 
pool 2 based on the below info (the pool with a few large files) or on pool 20 
(the pool with the most data but an average of about 250KB file size)?  I'm 
just completely confused as to what's caused this issue in the first place and 
how to go about fixing it.  On top of that, am I going to be able to increase 
the pg/pgp count with the cluster in a state of health_warn?  Just some posts 
I've read seem to indicate that the health state needs to be ok before this 
sort of thing can be changed (but I could be misunnderstanding what I'm 
reading).

Anyway, here's the info:

# ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
28219G 11227G   15558G 55.13
POOLS:
    NAME                      ID USED   %USED MAX AVAIL OBJECTS
    rbd                       0       0     0      690G       0
    KUBERNETES                1    122G 15.11      690G   34188
    KUBERNETES_METADATA       2  49310k     0      690G    1426
    default.rgw.control       11      0     0      690G       8
    default.rgw.data.root     12 20076k     0      690G   54412
    default.rgw.gc            13      0     0      690G      32
    default.rgw.log           14      0     0      690G     127
    default.rgw.users.uid     15   4942     0      690G      15
    default.rgw.users.keys    16    126     0      690G       4
    default.rgw.users.swift   17    252     0      690G       8
    default.rgw.buckets.index 18      0     0      690G   27206
    .rgw.root 

[ceph-users] Re: monitor sst files continue growing

2020-10-29 Thread Frank Schilder
Your problem is the overall cluster health. The MONs store cluster history 
information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs 
only makes things worse right now. The health status is a mess, no MGR, a bunch 
of PGs inactive, etc. This is what you need to resolve. How did your cluster 
end up like this?

It looks like all OSDs are up and in. You need to find out

- why there are inactive PGs
- why there are incomplete PGs

This usually happens when OSDs go missing.
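
A few read-only commands that help with exactly that (a sketch; they only 
report state):

ceph health detail
ceph pg dump_stuck inactive
ceph pg ls incomplete
ceph pg <pgid> query   # for one of the affected PGs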

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Zhenshi Zhou 
Sent: 29 October 2020 07:37:19
To: ceph-users
Subject: [ceph-users] monitor sst files continue growing

Hi all,

My cluster is in wrong state. SST files in /var/lib/ceph/mon/xxx/store.db
continue growing. It claims mon are using a lot of disk space.

I set "mon compact on start = true" and restart one of the monitors. But
it started and campacting for a long time, seems it has no end.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pgs stuck backfill_toofull

2020-10-29 Thread Frank Schilder
Hi Mark,

it looks like you have some very large PGs. Also, you run with a quite low PG 
count, in particular, for the large pool. Please post the output of "ceph df" 
and "ceph osd pool ls detail" to see how much data is in each pool and some 
pool info. I guess you need to increase the PG count of the large pool to split 
PGs up and also reduce the impact of imbalance. When I look at this:

 3 1.37790  0.45013  1410G  1079G   259G 76.49 1.39  21
 4 1.37790  0.95001  1410G  1086G   253G 76.98 1.40  44

I would conclude that the PGs are too large, the reweight of 0.45 without much 
utilization effect indicates that. This weight will need to be rectified as 
well at some time.

You should be able to run with 100-200 PGs per OSD. Please be aware that PG 
planning requires caution as you cannot reduce the PG count of a pool in your 
version. You need to know how much data is in the pools right now and what the 
future plan is. A sketch of the commands follows below.
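On Jewel, the PG count is raised per pool; the pool name and target count below 
are only placeholders, and pgp_num has to follow pg_num (increase in modest steps 
and let the cluster settle in between):

  # ceph osd pool set <pool> pg_num 256
  # ceph osd pool set <pool> pgp_num 256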

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Johnson 
Sent: 29 October 2020 06:55:55
To: ceph-users@ceph.io
Subject: [ceph-users] pgs stuck backfill_toofull

I've been struggling with this one for a few days now.  We had an OSD report as 
near full a few days ago.  Had this happen a couple of times before and a 
reweight-by-utilization has sorted it out in the past.  Tried the same again 
but this time we ended up with a couple of pgs in a state of backfill_toofull 
and a handful of misplaced objects as a result.

Tried doing the reweight a few more times and it's been moving data around.  We 
did have another osd trigger the near full alert but running the reweight a 
couple more times seems to have moved some of that data around a bit better.  
However, the original near_full osd doesn't seem to have changed much and the 
backfill_toofull pgs are still there.  I'd keep doing the 
reweight-by-utilization but I'm not sure if I'm heading down the right path and 
if it will eventually sort it out.

We have 14 pools, but the vast majority of data resides in just one of those 
pools (pool 20).  The pgs in the backfill state are in pool 2 (as far as I can 
tell).  That particular pool is used for some cephfs stuff and has a handful of 
large files in there (not sure if this is significant to the problem).

All up, our utilization is showing as 55.13% but some of our OSDs are showing 
as 76% in use with this one problem sitting at 85.02%.  Right now, I'm just not 
sure what the proper corrective action is.  The last couple of reweights I've 
run have been a bit more targeted in that I've set it to only function on two 
OSDs at a time.  If I run a test-reweight targeting only one osd, it does say 
it will reweight OSD 9 (the one at 85.02%).  I gather this will move data away 
from this OSD and potentially get it below the threshold.  However, at one 
point in the past couple of days, it's shown as no OSDs in a near full state, 
yet the two pgs in backfill_toofull didn't change.  So, that's why I'm not sure 
continually reweighting is going to solve this issue.

I'm a long way from knowledgeable on Ceph so I'm not really sure what 
information is useful here.  Here's a bit of info on what I'm seeing.  Can 
provide anything else that might help.


Basically, we have a three node cluster but only two have OSDs.  The third is 
there simply to enable a quorum to be established.  The OSDs are evenly spread 
across these two nodes and the configuration of each is identical.  We are 
running Jewel and are not in a position to upgrade at this stage.




# ceph --version
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)


# ceph health detail
HEALTH_WARN 2 pgs backfill_toofull; 2 pgs stuck unclean; recovery 33/62099566 
objects misplaced (0.000%); 1 near full osd(s)
pg 2.52 is stuck unclean for 201822.031280, current state 
active+remapped+backfill_toofull, last acting [17,3]
pg 2.18 is stuck unclean for 202114.617682, current state 
active+remapped+backfill_toofull, last acting [18,2]
pg 2.18 is active+remapped+backfill_toofull, acting [18,2]
pg 2.52 is active+remapped+backfill_toofull, acting [17,3]
recovery 33/62099566 objects misplaced (0.000%)
osd.9 is near full at 85%


# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
 2 1.37790  1.0  1410G   842G   496G 59.75 1.08  33
 3 1.37790  0.45013  1410G  1079G   259G 76.49 1.39  21
 4 1.37790  0.95001  1410G  1086G   253G 76.98 1.40  44
 5 1.37790  1.0  1410G   617G   722G 43.74 0.79  43
 6 1.37790  0.65009  1410G   616G   722G 43.69 0.79  39
 7 1.37790  0.95001  1410G   495G   844G 35.10 0.64  40
 8 1.37790  1.0  1410G   732G   606G 51.93 0.94  52
 9 1.37790  0.70007  1410G  1199G   139G 85.02 1.54  37
10 1.37790  1.0  1410G   611G   727G 43.35 0.79  41
11 1.37790  0.75006  1410G   495G   843G 35.11 0.64  32
 0 1.37790  1.0  1410G   731G   608G 51.82 0.94  43
12 1.37790  1.0  1410G

[ceph-users] Re: Huge HDD ceph monitor usage [EXT]

2020-10-28 Thread Frank Schilder
Hi all, I need to go back to a small piece of information:

> I was 3 mons, but i have 2 physical datacenters, one of them breaks with
> not short term fix, so i remove all osds and ceph mon (2 of them) and
> now i have only the osds of 1 datacenter with the monitor.

When I look at the data about pools and crush map, I don't see anything that is 
multi-site. Maybe the physical location was 2-site, but the crush rules don't 
reflect that. Consequently, the ceph cluster was configured single-site and 
will act accordingly when you lose 50% of it.

Quick interlude: when people recommend to add servers, they do not necessarily 
mean *new* servers. They mean you have to go to ground zero, dig out as much 
hardware as you can, drive it to the working site and make it rejoin the 
cluster.

A hypothesis. Assume we want to build a 2-site cluster (sites A and B) that can 
sustain the total loss of any 1 site, capacity at each site is equal (mirrored).

Short answer: this is not exactly possible due to the fact that you always need 
a qualified majority of monitors for quorum and you cannot distribute both, N 
MONs and a qualified majority evenly and simultaneously over 2 sites. We have 
already an additional constraint: site A will have 2 and site B 1 monitor. The 
condition is, that in case site A goes down the monitors from the site A can be 
rescued and moved to site B to restore data access. Hence, a loss of site A 
will imply temporary loss of service (Note that 2+2=4 MONs will not help, 
because now 3 MONs are required for a qualified majority; again MONs need to be 
rescued from the down site). If this constraint is satisfied, then one can 
configure pools as follows:

replicated: size 4, min_size 2, crush rule places 2 copies at each site
erasure coded: k+m with min_size=k+1, m even and m>=k+2, for example, k=2, m=4, 
crush rule places 3 shards at each site
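As an illustration only, a replicated rule of this kind could look like the 
sketch below, assuming two crush buckets named siteA and siteB exist under the 
root (the rule id is arbitrary):

rule replicated_2site {
        id 10
        type replicated
        min_size 2
        max_size 4
        step take siteA
        step chooseleaf firstn 2 type host
        step emit
        step take siteB
        step chooseleaf firstn 2 type host
        step emit
}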

With such a configuration, it is possible to sustain the loss of any one site 
if the monitors can be recovered from site A. Note that such EC pools will be 
very compute intense and have high latency (use option fast_read to get at 
least reasonable read speeds). Essentially, EC is not really suitable for 
multi-site redundancy, but the above EC set up will require a bit less capacity 
than 4 copies.

This setup can sustain the total loss of 1 site (minus MONs on site A) and will 
rebuild all data once a large enough second site is brought up again.

When I look at the information you posted, I see replication 3(2) and EC 5+2 
pools, all having crush root default. I do not see any of these mandatory 
configurations, the sites are ignored in the crush rules. Hence, if you can't 
get material from the down site back up, you look at permanent data loss.

You may be able to recover some more data in the replicated pools by setting 
min_size=1 for some time. However, you will lose any writes that are on the 
other 2 but not on the 1 disk now used for recovery and it will certainly not 
recover PGs with all 3 copies on the down site. Therefore, I would not attempt 
this, also because for the EC pools, you will need to get hold of the hosts 
from the down site and re-integrate these into the cluster anyway. If you 
can't do this, the data is lost.

In the long run, given your crush map and rules, you either stop placing stuff 
at 2 sites, or you create a proper 2-site set-up and copy data over.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Ing. Luis Felipe Domínguez Vega 
Sent: 28 October 2020 05:14:27
To: Eugen Block
Cc: Ceph Users
Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT]

Well, recovery is not working yet... I started 6 more servers and the
cluster has still not recovered.
Ceph status does not show any recovery progress.

ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/
ceph osd tree   : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/
ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
crush rules : (ceph osd crush rule dump)
https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/

On 2020-10-27 09:59, Eugen Block wrote:
> Your pool 'data_storage' has a size of 7 (or 7 chunks since it's
> erasure-coded) and the rule requires each chunk on a different host
> but you currently have only 5 hosts available, that's why the recovery
>  is not progressing. It's waiting for two more hosts. Unfortunately,
> you can't change the EC profile or the rule of that pool. I'm not sure
>  if it would work in the current cluster state, but if you can't add
> two more hosts (which would be your best option for recovery) it might
>  be possible to create a new replicated pool (you seem to have enough
> free space) and copy the contents from that EC pool. But as I said,
> I'm not sure if that would work in a degraded state, I'v

[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-10-27 Thread Frank Schilder
Thanks for digging this out. I believed to remember exactly this method (don't 
know where from), but couldn't find it in the documentation and started 
doubting it. Yes, this would be very useful information to add to the 
documentation and it also confirms that your simpler setup with just a 
specialized crush rule will work exactly as intended and is long-term stable.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: 胡 玮文 
Sent: 26 October 2020 17:19
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated 
pool

> On 26 Oct 2020, at 15:43, Frank Schilder  wrote:
>
> 
>> I’ve never seen anything that implies that lead OSDs within an acting set 
>> are a function of CRUSH rule ordering.
>
> This is actually a good question. I believed that I had seen/heard that 
> somewhere, but I might be wrong.
>
> Looking at the definition of a PG, it states that a PG is an ordered set of 
> OSD (IDs) and the first up OSD will be the primary. In other words, it seems 
> that the lowest OSD ID is decisive. If the SSDs were deployed before the 
> HDDs, they have the smallest IDs and, hence, will be preferred as primary 
> OSDs.

I don’t think this is correct. From my experiments, using the previously mentioned 
CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are 
always SSD.

I also had a look at the code. If I understand it correctly:

* If the default primary affinity is not changed, the primary-affinity logic is 
skipped entirely and the primary is simply the first OSD returned by the CRUSH 
algorithm [1].

* The order of OSDs returned by CRUSH still matters if you have changed the 
primary affinity. The affinity is the probability that a test succeeds. The 
first OSD is tested first and therefore has the highest probability of becoming 
primary. [2]
  * If any OSD has primary affinity = 1.0, its test always succeeds and any OSD 
after it will never be primary.
  * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5. Then the 2nd 
OSD becomes primary with probability 0.25 and the 3rd with probability 0.125; 
otherwise the 1st is primary.
  * If no test succeeds (e.g. all OSDs have an affinity of 0), the 1st OSD 
becomes primary as a fallback.

[1]: 
https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
[2]: 
https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561

So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient for 
them to be the primaries in my case.
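For example (assuming the device class is called ssd; an untested sketch):

  # for id in $(ceph osd crush class ls-osd ssd); do ceph osd primary-affinity osd.$id 1.0; done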

Do you think I should contribute these to documentation?

> This, however, is not a sustainable situation. Any addition of OSDs will mess 
> this up and the distribution scheme will fail in the future. A way out seems 
> to be:
>
> - subdivide your HDD storage using device classes:
> * define a device class for HDDs with primary affinity=0, for example, pick 5 
> HDDs and change their device class to hdd_np (for no primary)
> * set the primary affinity of these HDD OSDs to 0
> * modify your crush rule to use "step take default class hdd_np"
> * this will create a pool with primaries on SSD and balanced storage 
> distribution between SSD and HDD
> * all-HDD pools deployed as usual on class hdd
> * when increasing capacity, one needs to take care of adding disks to hdd_np 
> class and set their primary affinity to 0
> * somewhat increased admin effort, but fully working solution
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about expansion existing Ceph cluster - adding OSDs

2020-10-26 Thread Frank Schilder
Hi Kristof,

I missed that: why do you need to do manual compaction?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Kristof Coucke 
Sent: 26 October 2020 11:33:52
To: Frank Schilder; a.jazdzew...@googlemail.com
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Question about expansion existing Ceph cluster - 
adding OSDs

Hi Ansgar, Frank, all,

Thanks for the feedback in the first place.

In the meantime, I've added all the disks and the cluster is rebalancing 
itself... which will take ages, as you've mentioned. Last week after this 
conversation it was at a little over 50%, today it's around 44.5%.
Every day I have to take the cluster down to run manual compaction on some 
disks :-(, but that's a known bug that Igor is working on. (Kudos to him when 
I get my sleep back at night for this one...)

Though, I'm still having an issue which I don't completely understand.
When I look into the Ceph dashboard - OSDs, I can see the #pgs for a specific 
OSD. Does someone know how this is calculated? Because it seems incorrect...
E.g. A specific disk shows in the dashboard 189 PGs...? However, examining the 
pg dump output I can see that for that particular disk there are 145 PGs where 
the disk is in the "up" list, and 168 disks where that particular disk is in 
the "acting" list...  Of those 2 lists, 135 are in common, meaning 10 PGs will 
need to be moved to that disk, while 33 PGs will need to be moved away...
I can't figure out how the dashboard is getting to the figure of 189... It's 
also on other disks (a delta between the PG dump output and the info in the 
Ceph dashboard).

Another example is one disk which I've set to weight 0 as it's predicted to fail 
in the future... So the number of PGs where it is "up" is 0 (which is 
correct), and the number of PGs where this disk is acting is 49. So, this seems 
correct as these 49 PGs need to be moved away. However... Looking into the Ceph 
dashboard the UI is saying that there are 71 PGs on that disk...

So:
- How does the Ceph dashboard get that number in the 1st place?
- Is there a possibility that there are "orphaned" PG-parts left behind on a 
particular OSD?
- If it is possible that there are orphaned parts of a PG left behind on a 
disk, how do I clean this up?

I've also tried examining the osdmap; however, the output seems to be 
limited(??). I only see the PGs for pools 1 and 2. (I don't know if the file is 
truncated by exporting the osd map, or by the osdmaptool --print.)

The cluster is running Nautilus v14.2.11, all on the same version.

I'll make some time to write documentation of everything I've faced in the 
journey of the last 2 weeks - Kristof in Ceph's wunderland...

Thanks for all your input so far!

Regards,

Kristof



On Wed, 21 Oct 2020 at 14:01, Frank Schilder  wrote:
There have been threads on exactly this. Might depend a bit on your ceph 
version. We are running mimic and have no issues doing:

- set noout, norebalance, nobackfill
- add all OSDs (with weight 1)
- wait for peering to complete
- unset all flags and let the rebalance loose

Starting with nautilus there seem to be issues with this procedure. Mainly the 
peering phase can cause a collapse of the cluster.  In your case, it sounds 
like you added the OSDs already. You should be able to do relatively safely:

- set noout, norebalance, nobackfill
- set weight of OSDs to 1 one by one and wait for peering to complete every time
- unset all flags and let the rebalance loose

I believe once the peering succeeded without crashes, the rebalancing will just 
work fine. You can easily control how much rebalancing is going on.

I noted that ceph seems to have a strange concept of priority though. I needed 
to gain capacity by adding OSDs and ceph was very consistent about moving PGs 
from the fullest OSDs last. The opposite of what should happen. Thus, it took 
ages for additional capacity to become available and also the backfill too full 
warnings stayed for all the time. You can influence this to some degree by 
using force_recovery commands on PGs on the fullest OSDs.

Best regards and good luck,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Kristof Coucke 
Sent: 21 October 2020 13:29:00
To: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
Subject: [ceph-users] Question about expansion existing Ceph cluster - adding 
OSDs

Hi,

I have a cluster with 182 OSDs, this has been expanded towards 282 OSDs.
Some disks were near full.
The new disks have been added with initial weight = 0.
The original plan was to increase this slowly towards their full weight
using the gentle reweight script. However, this is going way too slow and
I'm also having issues now with "backfill_toofull"

[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-10-26 Thread Frank Schilder
> I’ve never seen anything that implies that lead OSDs within an acting set are 
> a function of CRUSH rule ordering.

This is actually a good question. I believed that I had seen/heard that 
somewhere, but I might be wrong.

Looking at the definition of a PG, it states that a PG is an ordered set of OSD 
(IDs) and the first up OSD will be the primary. In other words, it seems that 
the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they 
have the smallest IDs and, hence, will be preferred as primary OSDs.

This, however, is not a sustainable situation. Any addition of OSDs will mess 
this up and the distribution scheme will fail in the future. A way out seems to 
be:

- subdivide your HDD storage using device classes:
  * define a device class for HDDs with primary affinity=0, for example, pick 5 
HDDs and change their device class to hdd_np (for no primary)
  * set the primary affinity of these HDD OSDs to 0
  * modify your crush rule to use "step take default class hdd_np"
  * this will create a pool with primaries on SSD and balanced storage 
distribution between SSD and HDD
  * all-HDD pools deployed as usual on class hdd
  * when increasing capacity, one needs to take care of adding disks to hdd_np 
class and set their primary affinity to 0
  * somewhat increased admin effort, but fully working solution
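
A sketch of the commands for one such OSD (osd.17 is just a placeholder):

  # ceph osd crush rm-device-class osd.17
  # ceph osd crush set-device-class hdd_np osd.17
  # ceph osd primary-affinity osd.17 0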

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 25 October 2020 17:07:15
To: ceph-users@ceph.io
Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

> I'm not entirely sure if primary on SSD will actually make the read happen on 
> SSD.

My understanding is that by default reads always happen from the lead OSD in 
the acting set.  Octopus seems to (finally) have an option to spread the reads 
around, which IIRC defaults to false.

I’ve never seen anything that implies that lead OSDs within an acting set are a 
function of CRUSH rule ordering. I’m not asserting that they aren’t though, but 
I’m … skeptical.

Setting primary affinity would do the job, and you’d want to have cron 
continually update it across the cluster to react to topology changes.  I was 
told of this strategy back in 2014, but haven’t personally seen it implemented.

That said, HDDs are more of a bottleneck for writes than reads and just might 
be fine for your application.  Tiny reads are going to limit you to some degree 
regardless of drive type, and you do mention throughput, not IOPS.

I must echo Frank’s notes about capacity too.  Ceph can do a lot of things, but 
that doesn’t mean something exotic is necessarily the best choice.  You’re 
concerned about 3R only yielding 1/3 of raw capacity if using an all-SSD 
cluster, but the architecture you propose limits you anyway because of drive size. 
Consider also chassis, CPU, RAM, RU, switch port costs as well, and the cost of 
you fussing over an exotic solution instead of the hundreds of other things in 
your backlog.

And your cluster as described is *tiny*.  Honestly I’d suggest considering one 
of these alternatives:

* Ditch the HDDs, use QLC flash.  The emerging EDSFF drives are really 
promising for replacing HDDs for density in this kind of application.  You 
might even consider ARM if IOPs aren’t a concern.
* An NVMeoF solution


Cache tiers are “deprecated”, but then so are custom cluster names.  Neither 
appears

> For EC pools there is an option "fast_read" 
> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
>  which states that a read will return as soon as the first k shards have 
> arrived. The default is to wait for all k+m shards (all replicas). This 
> option is not available for replicated pools.
>
> Now, not sure if this option is not available for replicated pools because 
> the read will always be served by the acting primary, or if it currently 
> waits for all replicas. In the latter case, reads will wait for the slowest 
> device.
>
> I'm not sure if I interpret this correctly. I think you should test the setup 
> with HDD only and SSD+HDD to see if read speed improves. Note that write 
> speed will always depend on the slowest device.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 25 October 2020 15:03:16
> To: 胡 玮文; Alexander E. Patrakov
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>
> A cache pool might be an alternative, heavily depending on how much data is 
> hot. However, then you will have much less SSD capacity available, because it 
> also requires replication.
>
> Looking at the setup that you have only 10*1T =10T SSD, but 20*6T =

[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-10-25 Thread Frank Schilder
I would like to add one comment.

I'm not entirely sure if primary on SSD will actually make the read happen on 
SSD. For EC pools there is an option "fast_read" 
(https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
 which states that a read will return as soon as the first k shards have 
arrived. The default is to wait for all k+m shards (all replicas). This option 
is not available for replicated pools.

Now, not sure if this option is not available for replicated pools because the 
read will always be served by the acting primary, or if it currently waits for 
all replicas. In the latter case, reads will wait for the slowest device.
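For reference, the option is set per pool, something like this (the pool name is 
a placeholder):

  # ceph osd pool set <ec-pool> fast_read 1
  # ceph osd pool get <ec-pool> fast_read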

I'm not sure if I interpret this correctly. I think you should test the setup 
with HDD only and SSD+HDD to see if read speed improves. Note that write speed 
will always depend on the slowest device.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 25 October 2020 15:03:16
To: 胡 玮文; Alexander E. Patrakov
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

A cache pool might be an alternative, heavily depending on how much data is 
hot. However, then you will have much less SSD capacity available, because it 
also requires replication.

Looking at the setup that you have only 10*1T =10T SSD, but 20*6T = 120T HDD 
you will probably run short of SSD capacity. Or, looking at it the other way 
around, with copies on 1 SSD+3HDD, you will only be able to use about 30T out 
of 120T HDD capacity.

With this replication, the usable storage will be 10T and raw used will be 10T 
SSD and 30T HDD. If you can't do anything else on the HDD space, you will need 
more SSDs. If your servers have more free disk slots, you can add SSDs over 
time until you have at least 40T SSD capacity to balance SSD and HDD capacity.

Personally, I think the 1SSD + 3HDD is a good option compared with a cache 
pool. You have the data security of 3-times replication and, if everything is 
up, need only 1 copy in the SSD cache, which means that you have 3 times the 
cache capacity.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: 胡 玮文 
Sent: 25 October 2020 13:40:55
To: Alexander E. Patrakov
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

Yes. This is the limitation of CRUSH algorithm, in my mind. In order to guard 
against 2 host failures, I’m going to use 4 replications, 1 on SSD and 3 on 
HDD. This will work as intended, right? Because at least I can ensure 3 HDDs 
are from different hosts.

> On 25 Oct 2020, at 20:04, Alexander E. Patrakov  wrote:
>
> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com  
> wrote:
>>
>> Hi all,
>>
>> We are planning for a new pool to store our dataset using CephFS. These data 
>> are almost read-only (but not guaranteed) and consist of a lot of small 
>> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will 
>> deploy about 10 such nodes. We aim at getting the highest read throughput.
>>
>> If we just use a replicated pool of size 3 on SSD, we should get the best 
>> performance, however, that only leave us 1/3 of usable SSD space. And EC 
>> pools are not friendly to such small object read workload, I think.
>>
>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 
>> 3 data replications, each on a different host (fail domain). 1 of them on 
>> SSD, the other 2 on HDD. And normally every read request is directed to SSD. 
>> So, if every SSD OSD is up, I’d expect the same read throughout as the all 
>> SSD deployment.
>>
>> I’ve read the documents and did some tests. Here is the crush rule I’m 
>> testing with:
>>
>> rule mixed_replicated_rule {
>>id 3
>>type replicated
>>min_size 1
>>max_size 10
>>step take default class ssd
>>step chooseleaf firstn 1 type host
>>step emit
>>step take default class hdd
>>step chooseleaf firstn -1 type host
>>step emit
>> }
>>
>> Now I have the following conclusions, but I’m not very sure:
>> * The first OSD produced by crush will be the primary OSD (at least if I 
>> don’t change the “primary affinity”). So, the above rule is guaranteed to 
>> map SSD OSD as primary in pg. And every read request will read from SSD if 
>> it is up.
>> * It is currently not possible to enforce SSD and HDD OSD to be chosen from 
>> different hosts. So, if I want to ensure data availability even if 2 hosts 
>> fail, I need to choose 1 SSD a

[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-10-25 Thread Frank Schilder
A cache pool might be an alternative, heavily depending on how much data is 
hot. However, then you will have much less SSD capacity available, because it 
also requires replication.

Looking at the setup that you have only 10*1T =10T SSD, but 20*6T = 120T HDD 
you will probably run short of SSD capacity. Or, looking at it the other way 
around, with copies on 1 SSD+3HDD, you will only be able to use about 30T out 
of 120T HDD capacity.

With this replication, the usable storage will be 10T and raw used will be 10T 
SSD and 30T HDD. If you can't do anything else on the HDD space, you will need 
more SSDs. If your servers have more free disk slots, you can add SSDs over 
time until you have at least 40T SSD capacity to balance SSD and HDD capacity.

Personally, I think the 1SSD + 3HDD is a good option compared with a cache 
pool. You have the data security of 3-times replication and, if everything is 
up, need only 1 copy in the SSD cache, which means that you have 3 times the 
cache capacity.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: 胡 玮文 
Sent: 25 October 2020 13:40:55
To: Alexander E. Patrakov
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

Yes. This is the limitation of CRUSH algorithm, in my mind. In order to guard 
against 2 host failures, I’m going to use 4 replications, 1 on SSD and 3 on 
HDD. This will work as intended, right? Because at least I can ensure 3 HDDs 
are from different hosts.

> On 25 Oct 2020, at 20:04, Alexander E. Patrakov  wrote:
>
> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com  
> wrote:
>>
>> Hi all,
>>
>> We are planning for a new pool to store our dataset using CephFS. These data 
>> are almost read-only (but not guaranteed) and consist of a lot of small 
>> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will 
>> deploy about 10 such nodes. We aim at getting the highest read throughput.
>>
>> If we just use a replicated pool of size 3 on SSD, we should get the best 
>> performance, however, that only leave us 1/3 of usable SSD space. And EC 
>> pools are not friendly to such small object read workload, I think.
>>
>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 
>> 3 data replications, each on a different host (fail domain). 1 of them on 
>> SSD, the other 2 on HDD. And normally every read request is directed to SSD. 
>> So, if every SSD OSD is up, I’d expect the same read throughout as the all 
>> SSD deployment.
>>
>> I’ve read the documents and did some tests. Here is the crush rule I’m 
>> testing with:
>>
>> rule mixed_replicated_rule {
>>id 3
>>type replicated
>>min_size 1
>>max_size 10
>>step take default class ssd
>>step chooseleaf firstn 1 type host
>>step emit
>>step take default class hdd
>>step chooseleaf firstn -1 type host
>>step emit
>> }
>>
>> Now I have the following conclusions, but I’m not very sure:
>> * The first OSD produced by crush will be the primary OSD (at least if I 
>> don’t change the “primary affinity”). So, the above rule is guaranteed to 
>> map SSD OSD as primary in pg. And every read request will read from SSD if 
>> it is up.
>> * It is currently not possible to enforce SSD and HDD OSD to be chosen from 
>> different hosts. So, if I want to ensure data availability even if 2 hosts 
>> fail, I need to choose 1 SSD and 3 HDD OSD. That means setting the 
>> replication size to 4, instead of the ideal value 3, on the pool using the 
>> above crush rule.
>>
>> Am I correct about the above statements? How would this work from your 
>> experience? Thanks.
>
> This works (i.e. guards against host failures) only if you have
> strictly separate sets of hosts that have SSDs and that have HDDs.
> I.e., there should be no host that has both, otherwise there is a
> chance that one hdd and one ssd from that host will be picked.
>
> --
> Alexander E. Patrakov
> CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-23 Thread Frank Schilder
Hi Michael.

> I still don't see any traffic to the pool, though I'm also unsure how much 
> traffic is to be expected.

Probably not much. If ceph df shows that the pool contains some objects, I 
guess that's sorted.

That osdmaptool crashes indicates that your cluster runs with corrupted 
internal data. I tested your crush map and you should get complete PGs for the 
fs data pool. That you don't and that osdmaptool crashes points at a corruption 
of internal data. I'm afraid this is the point where you need support from ceph 
developers and should file a tracker report 
(https://tracker.ceph.com/projects/ceph/issues). A short description of the 
origin of the situation with the osdmaptool output and a reference to this 
thread linked in should be sufficient. Please post a link to the ticket here.

In parallel, you should probably open a new thread focussed on the osd map 
corruption. Maybe there are low-level commands to repair it.

You should wait with trying to clean up the unfound objects until this is 
resolved. Not sure about adding further storage either. To me, this sounds 
quite serious.

Best regards and good luck!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Frank Schilder
The post was titled "mds behind on trimming - replay until memory exhausted".

> Load up with swap and try the up:replay route.
> Set the beacon to 10 until it finishes.

Good point! The MDS will not send beacons for a long time. Same was necessary 
in the other case.

Good luck!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Frank Schilder
If you can't add RAM, you could try provisioning SWAP on a reasonably fast 
drive. There is a thread from this year where someone had a similar problem, 
the MDS running out of memory during replay. He could quickly add sufficient 
swap and the MDS managed to come up. Took a long time though, but might be 
faster than getting more RAM and will not lose data.
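
A minimal sketch for adding a swap file on a fast local drive (size and path are 
only placeholders; a partition on a dedicated SSD would be preferable):

  # fallocate -l 64G /var/swapfile
  # chmod 600 /var/swapfile
  # mkswap /var/swapfile
  # swapon /var/swapfile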

Your clients will not be able to do much, if anything during recovery though.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 22 October 2020 18:11:57
To: David C
Cc: ceph-devel; ceph-users
Subject: [ceph-users] Re: Urgent help needed please - MDS offline

I assume you aren't able to quickly double the RAM on this MDS ? or
failover to a new MDS with more ram?

Failing that, you shouldn't reset the journal without recovering
dentries, otherwise the cephfs_data objects won't be consistent with
the metadata.
The full procedure to be used is here:
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts

 backup the journal, recover dentires, then reset the journal.
(the steps after might not be needed)

That said -- maybe there is a more elegant procedure than using
cephfs-journal-tool.  A cephfs dev might have better advice.

-- dan


On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
>
> I'm pretty sure it's replaying the same ops every time, the last
> "EMetaBlob.replay updated dir" before it dies is always referring to
> the same directory. Although interestingly that particular dir shows
> up in the log thousands of times - the dir appears to be where a
> desktop app is doing some analytics collecting - I don't know if
> that's likely to be a red herring or the reason why the journal
> appears to be so long. It's a dir I'd be quite happy to lose changes
> to or remove from the file system altogether.
>
> I'm loath to update during an outage although I have seen people
> update the MDS code independently to get out of a scrape - I suspect
> you wouldn't recommend that.
>
> I feel like this leaves me with having to manipulate the journal in
> some way, is there a nuclear option where I can choose to disregard
> the uncommitted events? I assume that would be a journal reset with
> the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> expect to lose any metadata changes that were made since my cluster
> filled up but are there further implications? I also wonder what's the
> riskier option, resetting the journal or attempting an update.
>
> I'm very grateful for your help so far
>
> Below is more of the debug 10 log with ops relating to the
> aforementioned dir (name changed but inode is accurate):
>
> 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
> 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
> 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [dentry
> #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> 0x5654f82794a0]
> 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
> /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
> 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805
> b17592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1
> 0x5654f8288a00]
> 2020-10-22 16:44:00.44 7f424659e700 10 mds.0.journal
> EMetaBlob.replay dir 0x10009e1ec8e
> 2020-10-22 16:44:00.45 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8e
> /path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
> state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
> rc2020-10-22 08:46:44.932805 b17592 89215=89215+0)
> hs=42926+1178,ss=0+0 dirty=2375 | child=1 0x5654f8289100]
> 2020-10-22 16:44:00.488898 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added (full) [dentry
> #0x1/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> [2,head] auth NULL (dversion lock) v=904149 inode=0
> state=1610612800|bottomlru | dirty=1 0x56586df52f00]
> 2020-10-22 16:44:00.488911 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> auth v904149 s=0 n(v0 1=1+0) (iversion lock) 0x566ce168ce00]
> 2020-10-22 16:44:00.488918 7f424659e700 10
> 

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Frank Schilder
Could you also execute (and post the output of)

  # osdmaptool osd.map --test-map-pgs-dump --pool 7

with the osd map you pulled out (pool 7 should be the fs data pool)? Please 
check what mapping is reported for PG 7.39d? Just checking if osd map and pg 
dump agree here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 22 October 2020 09:32:07
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Sounds good. Did you re-create the pool again? If not, please do to give the 
devicehealth manager module its storage. In case you can't see any IO, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:
> Hi Michael,
>
> some quick thoughts.
>
>
> That you can create a pool with 1 PG is a good sign, the crush rule is OK. 
> That pg query says it doesn't have PG 1.0 points in the right direction. 
> There is an inconsistency in the cluster. This is also indicated by the fact 
> that no upmaps seem to exist (the clean-up script was empty). With the osd 
> map you extracted, you could check what the osd map believes the mapping of 
> the PGs of pool 1 are:
>
># osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.

> or if it also claims the PG does not exist. It looks like something went 
> wrong during pool creation and you are not the only one having problems with 
> this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html 
> . Sounds a lot like a bug in cephadm.
>
> In principle, it looks like the idea to delete and recreate the health 
> metrics pool is a way forward. Please look at the procedure mentioned in the 
> thread quoted above. Deletion of the pool there lead to some crashes and some 
> surgery on some OSDs was necessary. However, in your case it might just work, 
> because you redeployed the OSDs in question already - if I remember correctly.

That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.

> In order to do so cleanly, however, you will probably want to shut down all 
> clients accessing this pool. Note that clients accessing the health metrics 
> pool are not FS clients, so the mds cannot tell you anything about them. The 
> only command that seems to list all clients is
>
># ceph daemon mon.MON-ID sessions
>
> that needs to be executed on all mon hosts. On the other hand, you could also 
> just go ahead and see if something crashes (an MGR module probably) or 
> disable all MGR modules during this recovery attempt. I found some info that 
> cephadm creates this pool and starts an MGR module.
>
> If you google "device_health_metric pool" you should find descriptions of 
> similar cases. It looks solvable.

Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.

> I will look at the incomplete PG issue. I hope this is just some PG tuning. 
> At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

> The stuck MDS request could be an attempt to access an unfound object. It 
> should be possible to locate the fs client and find out what it was trying to 
> do. I see this sometimes when people are too impatient. They manage to 
> trigger a race condition and an MDS operation gets stuck (there are MDS bugs 
> and in my case it was an ls command that got stuck). Usually, evicting the 
> client temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike

> From: Michael Thomas 
> Sent: 20 October 2020 23:48:36
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multipl

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Frank Schilder
Sounds good. Did you re-create the pool again? If not, please do to give the 
devicehealth manager module its storage. In case you can't see any IO, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.
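
Something along these lines should do (10 PGs; the mgr name is a placeholder):

  # ceph osd pool create device_health_metrics 10 10
  # ceph mgr fail <active-mgr-name>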

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:
> Hi Michael,
>
> some quick thoughts.
>
>
> That you can create a pool with 1 PG is a good sign, the crush rule is OK. 
> That pg query says it doesn't have PG 1.0 points in the right direction. 
> There is an inconsistency in the cluster. This is also indicated by the fact 
> that no upmaps seem to exist (the clean-up script was empty). With the osd 
> map you extracted, you could check what the osd map believes the mapping of 
> the PGs of pool 1 are:
>
># osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.

> or if it also claims the PG does not exist. It looks like something went 
> wrong during pool creation and you are not the only one having problems with 
> this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html 
> . Sounds a lot like a bug in cephadm.
>
> In principle, it looks like the idea to delete and recreate the health 
> metrics pool is a way forward. Please look at the procedure mentioned in the 
> thread quoted above. Deletion of the pool there lead to some crashes and some 
> surgery on some OSDs was necessary. However, in your case it might just work, 
> because you redeployed the OSDs in question already - if I remember correctly.

That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.

> In order to do so cleanly, however, you will probably want to shut down all 
> clients accessing this pool. Note that clients accessing the health metrics 
> pool are not FS clients, so the mds cannot tell you anything about them. The 
> only command that seems to list all clients is
>
># ceph daemon mon.MON-ID sessions
>
> that needs to be executed on all mon hosts. On the other hand, you could also 
> just go ahead and see if something crashes (an MGR module probably) or 
> disable all MGR modules during this recovery attempt. I found some info that 
> cephadm creates this pool and starts an MGR module.
>
> If you google "device_health_metric pool" you should find descriptions of 
> similar cases. It looks solvable.

Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.

> I will look at the incomplete PG issue. I hope this is just some PG tuning. 
> At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

> The stuck MDS request could be an attempt to access an unfound object. It 
> should be possible to locate the fs client and find out what it was trying to 
> do. I see this sometimes when people are too impatient. They manage to 
> trigger a race condition and an MDS operation gets stuck (there are MDS bugs 
> and in my case it was an ls command that got stuck). Usually, evicting the 
> client temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike

> From: Michael Thomas 
> Sent: 20 October 2020 23:48:36
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
>
> On 10/20/20 1:18 PM, Frank Schilder wrote:
>> Dear Michael,
>>
>>>> Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an 
>>>> OSD mapping?
>>
>> I meant here with crush rule replicated_host_nvme. Sorry, forgot.
>
> Seems to have worked fine:
>
> https://pastebin.com/PFgDE4J1
>
>>> Yes, the OSD was still out when the previous health report was created.
>>
>> Hmm, this is odd. If this is correct, the

[ceph-users] Re: Question about expansion existing Ceph cluster - adding OSDs

2020-10-21 Thread Frank Schilder
There have been threads on exactly this. Might depend a bit on your ceph 
version. We are running mimic and have no issues doing:

- set noout, norebalance, nobackfill
- add all OSDs (with weight 1)
- wait for peering to complete
- unset all flags and let the rebalance loose

Starting with nautilus there seem to be issues with this procedure. Mainly the 
peering phase can cause a collapse of the cluster.  In your case, it sounds 
like you added the OSDs already. You should be able to do relatively safely:

- set noout, norebalance, nobackfill
- set weight of OSDs to 1 one by one and wait for peering to complete every time
- unset all flags and let the rebalance loose

I believe once the peering succeeded without crashes, the rebalancing will just 
work fine. You can easily control how much rebalancing is going on.

I noted that ceph seems to have a strange concept of priority though. I needed 
to gain capacity by adding OSDs and ceph was very consistent about moving PGs 
from the fullest OSDs last. The opposite of what should happen. Thus, it took 
ages for additional capacity to become available and also the backfill too full 
warnings stayed for all the time. You can influence this to some degree by 
using force_recovery commands on PGs on the fullest OSDs.
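
A rough sketch of the whole sequence (weights and pg ids are placeholders):

  # ceph osd set noout; ceph osd set norebalance; ceph osd set nobackfill
  # ceph osd crush reweight osd.<id> <full-weight>    (one by one, wait for peering)
  # ceph osd unset nobackfill; ceph osd unset norebalance; ceph osd unset noout
  # ceph pg force-recovery <pgid>
  # ceph pg force-backfill <pgid>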

Best regards and good luck,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Kristof Coucke 
Sent: 21 October 2020 13:29:00
To: ceph-users@ceph.io
Subject: [ceph-users] Question about expansion existing Ceph cluster - adding 
OSDs

Hi,

I have a cluster with 182 OSDs, this has been expanded towards 282 OSDs.
Some disks were near full.
The new disks have been added with initial weight = 0.
The original plan was to increase this slowly towards their full weight
using the gentle reweight script. However, this is going way too slow and
I'm also having issues now with "backfill_toofull".
Can I just add all the OSDs with their full weight, or will I get a lot of
issues when I'm doing that?
I know that a lot of PGs will have to be replaced, but increasing the
weight slowly will take a year at the current speed. I'm already playing
with the max backfill to increase the speed, but every time I increase the
weight it will take a lot of time again...
I can face the fact that there will be a performance decrease.

Looking forward to your comments!

Regards,

Kristof
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-21 Thread Frank Schilder
Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign, the crush rule is OK. That 
pg query says it doesn't have PG 1.0 points in the right direction. There is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you could check what the osd map believes the mapping of the PGs of pool 1 are:

  # osdmaptool osd.map --test-map-pgs-dump --pool 1

or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there lead to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.

In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

  # ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.


I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)


The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:
> Dear Michael,
>
>>> Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an 
>>> OSD mapping?
>
> I meant here with crush rule replicated_host_nvme. Sorry, forgot.

Seems to have worked fine:

https://pastebin.com/PFgDE4J1

>> Yes, the OSD was still out when the previous health report was created.
>
> Hmm, this is odd. If this is correct, then it did report a slow op even 
> though it was out of the cluster:
>
>> from https://pastebin.com/3G3ij9ui:
>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
>> [osd.0,osd.41] have slow ops.
>
> Not sure what to make of that. It looks almost like you have a ghost osd.41.
>
>
> I think (some of) the slow ops you are seeing are directed to the 
> health_metrics pool and can be ignored. If it is too annoying, you could try 
> to find out who runs the client with IDs client.7524484 and disable it. Might 
> be an MGR module.

I'm also pretty certain that the slow ops are related to the health
metrics pool, which is why I've been ignoring them.

What I'm not sure about is whether re-creating the device_health_metrics
pool will cause any problems in the ceph cluster.

> Looking at the data you provided and also some older threads of yours 
> (https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I start 
> considering that we are looking at the fall-out of a past admin operation. A 
> possibility is, that an upmap for PG 1.0 exists that conflicts with the crush 
> rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 
> 1.0. For example, the upmap specifies HDDs, but the crush rule required 
> NVMEs. This result is an empty set.

So far I've been unable to locate the client with the ID 7524484.  It's
not showing up in the manager dashboard -> Filesystems page, nor in the
output of 'ceph tell mds.ceph1 client ls'.

I'm digging through the compressed logs for the past week to see if I can
find the culprit.

> I couldn't really find a simple command to list up-maps. The only 
> non-destructive way seems to be to extract the osdmap and create a clean-up 
> command file. The clea

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-20 Thread Frank Schilder
Dear Michael,

> > Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an 
> > OSD mapping?

I meant here with crush rule replicated_host_nvme. Sorry, forgot.


> Yes, the OSD was still out when the previous health report was created.

Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:

> from https://pastebin.com/3G3ij9ui:
> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
> [osd.0,osd.41] have slow ops.

Not sure what to make of that. It looks almost like you have a ghost osd.41.


I think (some of) the slow ops you are seeing are directed to the 
health_metrics pool and can be ignored. If it is too annoying, you could try to 
find out who runs the client with ID client.7524484 and disable it. Might be 
an MGR module.


Looking at the data you provided and also some older threads of yours 
(https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I am starting 
to think that we are looking at the fall-out of a past admin operation. A 
possibility is that an upmap for PG 1.0 exists that conflicts with the crush 
rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 
1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs. 
The result is an empty set.

I couldn't really find a simple command to list up-maps. The only 
non-destructive way seems to be to extract the osdmap and create a clean-up 
command file. The cleanup file should contain a command for every PG with an 
upmap. To check this, you can execute (see also 
https://docs.ceph.com/en/latest/man/8/osdmaptool/)

  # ceph osd getmap > osd.map
  # osdmaptool osd.map --upmap-cleanup cleanup.cmd

If you do this, could you please post as usual the contents of cleanup.cmd?

Also, with the OSD map of your cluster, you can simulate certain admin 
operations and check resulting PG mappings for pools and other things without 
having to touch the cluster; see 
https://docs.ceph.com/en/latest/man/8/osdmaptool/.
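
Two more read-only checks that may help here (a sketch; pool id 1 and PG 1.0 are the ones from your health output, and osd.map is the file extracted above):

  # list any pg_upmap / pg_upmap_items entries recorded in the osdmap:
  ceph osd dump | grep -i upmap

  # ask the offline map which OSDs PG 1.0 would be mapped to:
  osdmaptool osd.map --test-map-pg 1.0

  # or dump the mappings of all PGs of pool 1:
  osdmaptool osd.map --test-map-pgs-dump --pool 1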


To dig a little bit deeper, could you please post as usual the output of:

- ceph pg 1.0 query
- ceph pg 7.39d query

It would also be helpful if you could post the decoded crush map. You can get 
the map as a txt-file as follows:

  # ceph osd getcrushmap -o crush-orig.bin
  # crushtool -d crush-orig.bin -o crush.txt

and post the contents of file crush.txt.


Did the slow MDS request complete by now?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

Contents of previous messages removed.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-16 Thread Frank Schilder
Dear Michael,

this is a bit of a nut. I can't see anything obvious. I have two hypotheses 
that you might consider testing.

1) Problem with 1 incomplete PG.

In the shadow hierarchy for your cluster I can see quite a lot of nodes like

{
"id": -135,
"name": "node229~hdd",
"type_id": 1,
"type_name": "host",
"weight": 0,
"alg": "straw2",
"hash": "rjenkins1",
"items": []
},

I would have expected that hosts without a device of a certain device class are 
*excluded* completely from a tree instead of having weight 0. I'm wondering if 
this could lead to the crush algorithm failing in the way described here: 
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
 . This might be a long shot, but could you export your crush map and play with 
the tunables as described under this link to see if more tries lead to a valid 
mapping? Note that testing this is harmless and does not change anything on the 
cluster.

The hypothesis here is that buckets with weight 0 are not excluded from drawing 
a-priori, but a-posteriori. If there are too many draws of an empty bucket, a 
mapping fails. Allowing more tries should then lead to success. We should at 
least rule out this possibility.
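
A possible way to run this test offline (a sketch; rule id 1 and the replication count 3 are placeholders, --show-bad-mappings prints only the mappings that fail, and nothing is injected back into the cluster):

  ceph osd getcrushmap -o crush.bin
  crushtool -i crush.bin --test --rule 1 --num-rep 3 --show-bad-mappings

  # raise the number of tries and test again:
  crushtool -i crush.bin --set-choose-total-tries 100 -o crush-tries.bin
  crushtool -i crush-tries.bin --test --rule 1 --num-rep 3 --show-bad-mappings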

2) About the incomplete PG.

I'm wondering if the problem is that the pool has exactly 1 PG. I don't have a 
test pool with Nautilus and cannot try this out. Can you create a test pool 
with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? If not, can you 
then increase pg_num and pgp_num to, say, 10 and see if this has any effect?

I'm wondering here if there needs to be a minimum number >1 of PGs in a pool. 
Again, this is more about ruling out a possibility than expecting success. As 
an extension to this test, you could increase pg_num and pgp_num of the pool 
device_health_metrics to see if this has any effect.
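
A rough sketch of such a test (pool name and pg counts are arbitrary; deleting a pool requires mon_allow_pool_delete=true):

  ceph osd pool create test-nvme 1 1 replicated replicated_host_nvme
  # check whether the single PG gets acting OSDs (POOL-ID is the new pool's id):
  ceph pg dump pgs_brief | grep '^POOL-ID\.'

  ceph osd pool set test-nvme pg_num 10
  ceph osd pool set test-nvme pgp_num 10

  # clean up afterwards:
  ceph osd pool delete test-nvme test-nvme --yes-i-really-really-mean-it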


The crush rules and crush tree look OK to me. I can't really see why the 
missing OSDs are not assigned to the two PGs 1.0 and 7.39d.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 16 October 2020 15:41:29
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,

> Please mark OSD 41 as "in" again and wait for some slow ops to show up.

I forgot. "wait for some slow ops to show up" ... and then what?

Could you please go to the host of the affected OSD and look at the output of 
"ceph daemon osd.ID ops" or "ceph daemon osd.ID dump_historic_slow_ops" and 
check what type of operations get stuck? I'm wondering if it's administrative, 
like peering attempts.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 16 October 2020 15:09:20
To: Michael Thomas; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,

thanks for this initial work. I will need to look through the files you posted 
in more detail. In the meantime:

Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far 
as I can see, marking it "out" might have cleared hanging slow ops (there were 
1000 before), but they then started piling up again. From the OSD log it looks 
like an operation that is sent to/from PG 1.0, which doesn't respond because it 
is inactive. Hence, getting PG 1.0 active should resolve this issue (later).

It's a bit strange that I see slow ops for OSD 41 in the latest health detail 
(https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report 
was created?

I think we might have misunderstood my question 6. My question was whether or 
not each host bucket corresponds to a physical host and vice versa, that is, 
each physical host has exactly 1 host bucket. I'm asking because it is possible 
to have multiple host buckets assigned to a single physical host and this has 
implications on how to manage things.

Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I 
can see), the problem is that it has no OSDs assigned. I need to look a bit 
longer at the data you uploaded to find out why. I can't see anything obvious.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Michael Thomas 
Sent: 16 October 2020 02:08:01
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/14/20 3:49 PM, Frank Schilder wrote:
> Hi Michael,
>
> it doesn't look too bad. All degraded objects are due to the undersi

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-16 Thread Frank Schilder
Dear Michael,

> Please mark OSD 41 as "in" again and wait for some slow ops to show up.

I forgot. "wait for some slow ops to show up" ... and then what?

Could you please go to the host of the affected OSD and look at the output of 
"ceph daemon osd.ID ops" or "ceph daemon osd.ID dump_historic_slow_ops" and 
check what type of operations get stuck? I'm wondering if it's administrative, 
like peering attempts.
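
For example (a sketch; the exact field names in the admin-socket output differ slightly between releases):

  ceph osd in 41

  # once slow ops show up again, on the host of osd.41:
  ceph daemon osd.41 ops | grep description
  ceph daemon osd.41 dump_historic_slow_ops | grep description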

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder
Sent: 16 October 2020 15:09:20
To: Michael Thomas; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,

thanks for this initial work. I will need to look through the files you posted 
in more detail. In the meantime:

Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far 
as I can see, marking it "out" might have cleared hanging slow ops (there were 
1000 before), but they then started piling up again. From the OSD log it looks 
like an operation that is sent to/from PG 1.0, which doesn't respond because it 
is inactive. Hence, getting PG 1.0 active should resolve this issue (later).

It's a bit strange that I see slow ops for OSD 41 in the latest health detail 
(https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report 
was created?

I think we might have misunderstood my question 6. My question was whether or 
not each host bucket corresponds to a physical host and vice versa, that is, 
each physical host has exactly 1 host bucket. I'm asking because it is possible 
to have multiple host buckets assigned to a single physical host and this has 
implications on how to manage things.

Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I 
can see), the problem is that it has no OSDs assigned. I need to look a bit 
longer at the data you uploaded to find out why. I can't see anything obvious.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Michael Thomas 
Sent: 16 October 2020 02:08:01
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/14/20 3:49 PM, Frank Schilder wrote:
> Hi Michael,
>
> it doesn't look too bad. All degraded objects are due to the undersized PG. 
> If this is an EC pool with m>=2, data is currently not in danger.
>
> I see a few loose ends to pick up, let's hope this is something simple. For 
> any of the below, before attempting the next step, please wait until all 
> induced recovery IO has completed before continuing.
>
> 1) Could you please paste the output of the following commands to pastebin 
> (bash syntax):
>
>ceph osd pool get device_health_metrics all

https://pastebin.com/6D83mjsV

>ceph osd pool get fs.data.archive.frames all

https://pastebin.com/7XAaQcpC

>ceph pg dump |& grep -i -e PG_STAT -e "^7.39d"

https://pastebin.com/tBLaq63Q

>ceph osd crush rule ls

https://pastebin.com/6f5B778G

>ceph osd erasure-code-profile ls

https://pastebin.com/uhAaMH1c

>ceph osd crush dump # this is a big one, please be careful with copy-paste 
> (see point 3 below)

https://pastebin.com/u92D23jV

> 2) I don't see any IO reported (neither user nor recovery). Could you please 
> confirm that the command outputs were taken during a zero-IO period?

That's correct, there was no activity at this time.  Access to the
cephfs filesystem is very bursty, varying from completely idle to
multiple GB/s (read).

> 3) Something is wrong with osd.41. Can you check its health status with 
> smartctl? If it is reported healthy, give it one more clean restart. If the 
> slow ops do not disappear, it could be a disk fail that is not detected by 
> health monitoring. You could set it to "out" and see if the cluster recovers 
> to a healthy state (modulo the currently degraded objects) with no slow ops. 
> If so, I would replace the disk.

smartctl reports no problems.

osd.41 (and osd.0) was one of the original OSDs used for the
device_health_metrics pool.  Early on, before I knew better, I had
removed this OSD (and osd.0) from the cluster, and the OSD ids got
recycled when new disks were later added.  This is when the slow ops on
osd.0 and osd.41 started getting reported.  On advice from another user
on ceph-users, I updated my crush map to remap the device_health_metrics
pool to a different set of OSDs (and the slow ops persisted).

osd.0 usually also shows slow ops.  I was a little surprised that it
didn't when I took this snapshot, but now it does.

I have now run 'ceph osd out 41', and the recovery I/O has finished.
With the exception of one less OSD marked in, the output of 'ceph
status' looks the same.

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-16 Thread Frank Schilder
Dear Michael,

thanks for this initial work. I will need to look through the files you posted 
in more detail. In the meantime:

Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far 
as I can see, marking it "out" might have cleared hanging slow ops (there were 
1000 before), but they then started piling up again. From the OSD log it looks 
like an operation that is sent to/from PG 1.0, which doesn't respond because it 
is inactive. Hence, getting PG 1.0 active should resolve this issue (later).

It's a bit strange that I see slow ops for OSD 41 in the latest health detail 
(https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report 
was created?

I think we might have misunderstood my question 6. My question was whether or 
not each host bucket corresponds to a physical host and vice versa, that is, 
each physical host has exactly 1 host bucket. I'm asking because it is possible 
to have multiple host buckets assigned to a single physical host and this has 
implications on how to manage things.

Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I 
can see), the problem is that it has no OSDs assigned. I need to look a bit 
longer at the data you uploaded to find out why. I can't see anything obvious.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 16 October 2020 02:08:01
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/14/20 3:49 PM, Frank Schilder wrote:
> Hi Michael,
>
> it doesn't look too bad. All degraded objects are due to the undersized PG. 
> If this is an EC pool with m>=2, data is currently not in danger.
>
> I see a few loose ends to pick up, let's hope this is something simple. For 
> any of the below, before attempting the next step, please wait until all 
> induced recovery IO has completed before continuing.
>
> 1) Could you please paste the output of the following commands to pastebin 
> (bash syntax):
>
>ceph osd pool get device_health_metrics all

https://pastebin.com/6D83mjsV

>ceph osd pool get fs.data.archive.frames all

https://pastebin.com/7XAaQcpC

>ceph pg dump |& grep -i -e PG_STAT -e "^7.39d"

https://pastebin.com/tBLaq63Q

>ceph osd crush rule ls

https://pastebin.com/6f5B778G

>ceph osd erasure-code-profile ls

https://pastebin.com/uhAaMH1c

>ceph osd crush dump # this is a big one, please be careful with copy-paste 
> (see point 3 below)

https://pastebin.com/u92D23jV

> 2) I don't see any IO reported (neither user nor recovery). Could you please 
> confirm that the command outputs were taken during a zero-IO period?

That's correct, there was no activity at this time.  Access to the
cephfs filesystem is very bursty, varying from completely idle to
multiple GB/s (read).

> 3) Something is wrong with osd.41. Can you check its health status with 
> smartctl? If it is reported healthy, give it one more clean restart. If the 
> slow ops do not disappear, it could be a disk fail that is not detected by 
> health monitoring. You could set it to "out" and see if the cluster recovers 
> to a healthy state (modulo the currently degraded objects) with no slow ops. 
> If so, I would replace the disk.

smartctl reports no problems.

osd.41 (and osd.0) was one of the original OSDs used for the
device_health_metrics pool.  Early on, before I knew better, I had
removed this OSD (and osd.0) from the cluster, and the OSD ids got
recycled when new disks were later added.  This is when the slow ops on
osd.0 and osd.41 started getting reported.  On advice from another user
on ceph-users, I updated my crush map to remap the device_health_metrics
pool to a different set of OSDs (and the slow ops persisted).

osd.0 usually also shows slow ops.  I was a little surprised that it
didn't when I took this snapshot, but now it does.

I have now run 'ceph osd out 41', and the recovery I/O has finished.
With the exception of one less OSD marked in, the output of 'ceph
status' looks the same.

The last few lines of the osd.41 logfile are here:

https://pastebin.com/k06aArW4

How long does it take for ceph to clear the slow ops status?

> 4) In the output of "df tree" node141 shows up twice. Could you confirm that 
> this is a copy-paste error or is this node indeed twice in the output? This 
> is easiest to see in the pastebin when switching to "raw" view.

This was a copy/paste error.

> 5) The crush tree contains an empty host bucket (node308). Please delete this 
> host bucket (ceph osd crush rm node308) for now and let me know if this 
> caused any data movements (recovery IO).

This did not cause any data movement, according to 'ceph status'.

> 6) The crush tree looks a bit

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-14 Thread Frank Schilder
Hi Michael,

it doesn't look too bad. All degraded objects are due to the undersized PG. If 
this is an EC pool with m>=2, data is currently not in danger.

I see a few loose ends to pick up, let's hope this is something simple. For any 
of the below, before attempting the next step, please wait until all induced 
recovery IO has completed before continuing.

1) Could you please paste the output of the following commands to pastebin 
(bash syntax):

  ceph osd pool get device_health_metrics all
  ceph osd pool get fs.data.archive.frames all
  ceph pg dump |& grep -i -e PG_STAT -e "^7.39d"
  ceph osd crush rule ls
  ceph osd erasure-code-profile ls
  ceph osd crush dump # this is a big one, please be careful with copy-paste 
(see point 3 below)

2) I don't see any IO reported (neither user nor recovery). Could you please 
confirm that the command outputs were taken during a zero-IO period?

3) Something is wrong with osd.41. Can you check its health status with 
smartctl? If it is reported healthy, give it one more clean restart. If the 
slow ops do not disappear, it could be a disk fail that is not detected by 
health monitoring. You could set it to "out" and see if the cluster recovers to 
a healthy state (modulo the currently degraded objects) with no slow ops. If 
so, I would replace the disk.

4) In the output of "df tree" node141 shows up twice. Could you confirm that 
this is a copy-paste error or is this node indeed twice in the output? This is 
easiest to see in the pastebin when switching to "raw" view.

5) The crush tree contains an empty host bucket (node308). Please delete this 
host bucket (ceph osd crush rm node308) for now and let me know if this caused 
any data movements (recovery IO).

6) The crush tree looks a bit exotic. Do the nodes with a single OSD correspond 
to a physical host with 1 OSD disk? If not, could you please state how the host 
buckets are mapped onto physical hosts?

7) In case there was a change to the health status, could you please include an 
updated "ceph health detail"?

I don't expect to get the incomplete PG resolved with the above, but it will 
move some issues out of the way before proceeding.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 14 October 2020 20:52:10
To: Andreas John; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Hello,

The original cause of the OSD instability has already been fixed.  It
was due to user jobs (via condor) consuming too much memory and causing
the machine to swap.  The OSDs didn't actually crash, but weren't
responding in time and were being flagged as down.

In most cases, the problematic OSD servers were also not responding on
the console and had to be physically power cycled to recover.

Since adding additional memory limits to user jobs, we have only had 1
or 2 unstable OSDs that were fixed by killing the remaining rogue user jobs.

Regards,

--Mike

On 10/10/20 9:22 AM, Andreas John wrote:
> Hello Mike,
>
> do your OSDs go down from time to time? I once had an issue with
> unrecoverable objects, because I had only n+1 (size 2) redundancy and
> ceph wasn't able to decide what the correct copy of the object was. In my
> case there were half-deleted snapshots in one of the copies. I used
> ceph-objectstore-tool to remove the "wrong" part. Did you check your OSD
> logs? Do the OSDs go down with an obscure stacktrace (and maybe they are
> restarted by systemd ...)?
>
> rgds,
>
> j.
>
>
>
> On 09.10.20 22:33, Michael Thomas wrote:
>> Hi Frank,
>>
>> That was a good tip.  I was able to move the broken files out of the
>> way and restore them for users.  However, after 2 weeks I'm still left
>> with unfound objects.  Even more annoying, I now have 82k objects
>> degraded (up from 74), which hasn't changed in over a week.
>>
>> I'm ready to claim that the auto-repair capabilities of ceph are not
>> able to fix my particular issues, and will have to continue to
>> investigate alternate ways to clean this up, including a pg
>> export/import (as you suggested) and perhaps a mds backward scrub
>> (after testing in a junk pool first).
>>
>> I have other tasks I need to perform on the filesystem (removing OSDs,
>> adding new OSDs, increasing PG count), but I feel like I need to
>> address these degraded/lost objects before risking any more damage.
>>
>> One particular PG is in a curious state:
>>
>> 7.39d82163 82165 2467341  3440607778070
>>   0   2139  active+recovery_unfound+undersized+degraded+remapped 23m
>> 50755'112549   50766:960500   [116,72,122,48,45,131,73,81]p116
>>   [71,109,99,48,45,90,73,NONE]p7

[ceph-users] Long heartbeat ping times

2020-10-12 Thread Frank Schilder
Dear all,

occasionally, I find messages like

Health check update: Long heartbeat ping times on front interface seen, longest 
is 1043.153 msec (OSD_SLOW_PING_TIME_FRONT)

in the cluster log. Unfortunately, I seem to be unable to find out which OSDs 
were affected (a-posteriori). I cannot find related messages in any OSD log and 
the messages I find in /var/log/messages do not contain IP addresses or OSD IDs.

Is there a way to find out which OSDs/hosts were the problem after health 
status is back to healthy?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs tag not working

2020-10-01 Thread Frank Schilder
There used to be / is a bug in ceph fs commands when using data pools. If you 
enable the application cephfs on a pool explicitly before running "ceph fs 
add_data_pool", the fs tag is not applied. Maybe it's that? There is an older 
thread on the topic in the users-list and also a fix/workaround.
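
If it is that issue, the pool's application metadata is worth a look (a sketch; POOL and CEPH-FS-NAME are placeholders - as far as I remember, the tag check compares the pool's cephfs "data" key against the file system name):

  ceph osd pool application get POOL

  # the workaround from the older thread was along these lines:
  ceph osd pool application set POOL cephfs data CEPH-FS-NAME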

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 01 October 2020 15:33:53
To: ceph-users@ceph.io
Subject: [ceph-users] Re: cephfs tag not working

Hi,

I have a one-node-cluster (also 15.2.4) for testing purposes and just
created a cephfs with the tag, it works for me. But my node is also
its own client, so there's that. And it was installed with 15.2.4, no
upgrade.

> For the 2nd, mds works, files can be created or removed, but client
> read/write (native client, kernel version 5.7.4) fails with I/O
> error, so osd part does not seem to be working properly.

You mean it works if you mount it from a different host (within the
cluster maybe) with the new client's key but it doesn't work with the
designated clients? I'm not sure about the OSD part since the other
syntax seems to work, you say.

Can you share more details about the error? The mount on the clients
works but they can't read/write?

Regards,
Eugen


Zitat von Andrej Filipcic :

> Hi,
>
> on octopus 15.2.4 I have an issue with cephfs tag auth. The
> following works fine:
>
> client.f9desktop
> key: 
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rw  pool=cephfs_data, allow rw
> pool=ssd_data, allow rw pool=fast_data,  allow rw pool=arich_data,
> allow rw pool=ecfast_data
>
> but this one does not.
>
> client.f9desktopnew
> key: 
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rw tag cephfs data=cephfs
>
> For the 2nd, mds works, files can be created or removed, but client
> read/write (native client, kernel version 5.7.4) fails with I/O
> error, so osd part does not seem to be working properly.
>
> Any clues what can be wrong? the cephfs was created in jewel...
>
> Another issue is: if osd caps are updated (adding data pool), then
> some clients refresh the caps, but most of them do not, and the only
> way to refresh it is to remount the filesystem. working tag would
> solve it.
>
> Best regards,
> Andrej
>
> --
> _
>prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
>Department of Experimental High Energy Physics - F9
>Jozef Stefan Institute, Jamova 39, P.o.Box 3000
>SI-1001 Ljubljana, Slovenia
>Tel.: +386-1-477-3674Fax: +386-1-477-3166
> -
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: hdd pg's migrating when converting ssd class osd's

2020-10-01 Thread Frank Schilder
Dear Mark and Nico,

I think this might be the time to file a tracker report. As far as I can see, 
your set-up is as it should be, and OSD operations on your clusters should behave 
exactly as on ours. I don't know of any other configuration option that 
influences placement calculation.

The problems you (Nico in particular) describe seem serious enough. I have also 
heard other reports of admin operations killing a cluster starting with 
Nautilus, most notably this one: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ/#4KY3OW7PTOODLQVYKARZLGE5FZUNQOER
Maybe there are regressions with crush placement computations (and other 
operations)? I will add this to the list of tests before considering an upgrade 
from mimic.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc Roos 
Sent: 30 September 2020 22:26:11
To: eblock; Frank Schilder
Cc: ceph-users; nico.schottelius
Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

I am not sure, but it looks like this remapping of the hdd pg's is not being
done when adding back the same ssd osd.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: hdd pg's migrating when converting ssd class osd's

2020-09-30 Thread Frank Schilder
Hi Nico and Mark,

your crush trees do indeed look like they have been converted properly to using 
device classes already. Changing something within one device class should not 
influence placement in another. Maybe I'm overlooking something?

The only other place I know of where such a mix-up could occur are the crush 
rules. Do your rules look like this:

{
"rule_id": 5,
"rule_name": "sr-rbd-data-one",
"ruleset": 5,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 50
},
{
"op": "set_choose_tries",
"num": 1000
},
{
"op": "take",
"item": -185,
"item_name": "ServerRoom~rbd_data"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}

Notice the "~rbd_data" qualifier. It is important that the device class is 
specified at the root selection.
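
You can check this quickly without decompiling the crush map (a sketch; the grep is just a convenience to pair each rule with the root it takes from):

  ceph osd crush rule dump | grep -E '"rule_name"|"item_name"'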

I'm really surprised that with your crush tree you observe changes in SSD 
implying changes in HDD placements. I was really rough on our mimic cluster 
with moving disks in and out and between servers and I have never seen this 
problem. Could it be a regression in nautilus? Is the auto-balancer interfering?
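
To rule out the balancer (a sketch; switching it off is reversible):

  ceph balancer status
  ceph balancer off   # if you want to exclude it temporarily
  ceph balancer on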

> we recently also noticed that rebuilding one pool ("ssd")
> influenced speed on other pools, which was unexpected.

Could this be something else? Was PG/object placement influenced or performance 
only?

I'm asking, because during one of our service windows we observed something 
very strange. We have a multi-location cluster with pools with completely 
isolated storage devices in different locations. On one of these sub-clusters 
we run a ceph fs. During maintenance we needed to shut down the ceph-fs. When 
our admin issued the umount command (ca. 1500 clients), we noticed that RBD 
pools seemed to have problems even though there is absolutely no overlap in 
disks (disjoint crush trees), they are not even in the same physical location 
and sit on their own switches. The fs and RBD only share the MONs/MGRs. I'm not 
entirely sure if we observed something real or only a network blip. However, 
nagios went crazy on our VM environment for a few minutes.

Maybe there is another issue that causes unexpected cross-dependencies that 
affect performance?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc Roos 
Sent: 30 September 2020 14:59:50
To: eblock; Frank Schilder
Cc: ceph-users; nico.schottelius
Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Hi Frank, thanks, this 'root default' indeed looks different with these 0
weights there. I have also uploaded mine[1] because it looks very similar to
Nico's. I guess his hdd pg's can also start moving on some occasions.
Thanks for the 'crushtool reclassify' hint, I guess I missed this in
the release notes or so.

[1]
https://pastebin.com/PFx0V3S7



-Original Message-
To: Eugen Block
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd
class osd's

This is how my crush tree including shadow hierarchies looks like (a
mess :): https://pastebin.com/iCLbi4Up

Every device class has its own tree. Starting with mimic, this is
automatic when creating new device classes.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 30 September 2020 08:43:47
To: Frank Schilder
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd
class osd's

Interesting, I also did this test on an upgraded cluster (L to N).
I'll repeat the test on a native Nautilus to see it for myself.


Zitat von Frank Schilder

> Somebody on this list posted a script that can convert pre-mimic crush

> trees with buckets for different types of devices to crush trees with
> device classes with minimal data movement (trying to maintain IDs as
> much as possible). Don't have a thread name right now, but could try
> to find it tomorrow.
>
> I can check tomorrow how our crush tree unfolds. Basically, for every
> device class there is a full copy (shadow hierarchy) for each device
> class with its own weights etc.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Marc Roos
> Sent: 29 September 2020 22:19:33
> To: eblock; Frank Schilder
&

[ceph-users] Re: hdd pg's migrating when converting ssd class osd's

2020-09-30 Thread Frank Schilder
> To me it looks like the structure of both maps is pretty much the same -
> or am I mistaken?

Yes, but you are not Marc Roos. Do you work on the same cluster or do you 
observe the same problem?

In any case, here is a thread pointing to the crush tree/rule conversion I 
mentioned: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/675QZ2JXXX4RPRNPK2NL7FB5MVANKUB2/#675QZ2JXXX4RPRNPK2NL7FB5MVANKUB2

The tool is "crushtool reclassify" and is recommended to use when upgrading 
from luminous to newer to convert crush rules to use device classes.
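
For reference, the workflow looks roughly like this (a sketch; the root name "default" and the bucket-suffix pattern are assumptions that depend on how the legacy tree was laid out):

  ceph osd getcrushmap -o crush-orig.bin
  crushtool -i crush-orig.bin --reclassify \
      --reclassify-root default hdd \
      --reclassify-bucket %-ssd ssd default \
      -o crush-new.bin

  # compare how many PG mappings would change before injecting anything:
  crushtool -i crush-orig.bin --compare crush-new.bin
  # only after reviewing the comparison:
  ceph osd setcrushmap -i crush-new.bin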

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nico Schottelius 
Sent: 30 September 2020 09:12:49
To: Frank Schilder
Cc: Eugen Block; Marc Roos; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Hey Frank,

I uploaded our kraken created and nautilus upgraded crush map on [0].

To me it looks like the structure of both maps is pretty much the same -
or am I mistaken?

Best regards,

Nico

[0] https://www.nico.schottelius.org/temp/ceph-shadowtree20200930

Frank Schilder  writes:

> This is how my crush tree including shadow hierarchies looks like (a mess :): 
> https://pastebin.com/iCLbi4Up
>
> Every device class has its own tree. Starting with mimic, this is automatic 
> when creating new device classes.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: 30 September 2020 08:43:47
> To: Frank Schilder
> Cc: Marc Roos; ceph-users
> Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class 
> osd's
>
> Interesting, I also did this test on an upgraded cluster (L to N).
> I'll repeat the test on a native Nautilus to see it for myself.
>
>
> Zitat von Frank Schilder :
>
>> Somebody on this list posted a script that can convert pre-mimic
>> crush trees with buckets for different types of devices to crush
>> trees with device classes with minimal data movement (trying to
>> maintain IDs as much as possible). Don't have a thread name right
>> now, but could try to find it tomorrow.
>>
>> I can check tomorrow how our crush tree unfolds. Basically, for
>> every device class there is a full copy (shadow hierarchy) for each
>> device class with its own weights etc.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Marc Roos 
>> Sent: 29 September 2020 22:19:33
>> To: eblock; Frank Schilder
>> Cc: ceph-users
>> Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd
>> class osd's
>>
>> Yes correct this is coming from Luminous or maybe even Kraken. How does
>> a default crush tree look like in mimic or octopus? Or is there some
>> manual how to bring this to the new 'default'?
>>
>>
>> -Original Message-
>> Cc: ceph-users
>> Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd
>> class osd's
>>
>> Are these crush maps inherited from pre-mimic versions? I have
>> re-balanced SSD and HDD pools in mimic (mimic deployed) where one device
>> class never influenced the placement of the other. I have mixed hosts
>> and went as far as introducing rbd_meta, rbd_data and such classes to
>> sub-divide even further (all these devices have different perf specs).
>> This worked like a charm. When adding devices of one class, only pools
>> in this class were ever affected.
>>
>> As far as I understand, starting with mimic, every shadow class defines
>> a separate tree (not just leafs/OSDs). Thus, device classes are
>> independent of each other.
>>
>>
>>
>> 
>> Sent: 29 September 2020 20:54:48
>> To: eblock
>> Cc: ceph-users
>> Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class
>> osd's
>>
>> Yes correct, hosts have indeed both ssd's and hdd's combined. Is this
>> not more of a bug then? I would assume the goal of using device classes
>> is that you separate these and one does not affect the other, even the
>> host weight of the ssd and hdd class are already available. The
>> algorithm should just use that instead of the weight of the whole host.
>> Or is there some specific use case, where these classes combined is
>> required?
>>
>>
>> -Original Message-
>> Cc: ceph-users
>> Subject: *SPAM* Re: [ceph-users]

[ceph-users] Re: hdd pg's migrating when converting ssd class osd's

2020-09-30 Thread Frank Schilder
This is what my crush tree including shadow hierarchies looks like (a mess :): 
https://pastebin.com/iCLbi4Up

Every device class has its own tree. Starting with mimic, this is automatic 
when creating new device classes.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 30 September 2020 08:43:47
To: Frank Schilder
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Interesting, I also did this test on an upgraded cluster (L to N).
I'll repeat the test on a native Nautilus to see it for myself.


Zitat von Frank Schilder :

> Somebody on this list posted a script that can convert pre-mimic
> crush trees with buckets for different types of devices to crush
> trees with device classes with minimal data movement (trying to
> maintain IDs as much as possible). Don't have a thread name right
> now, but could try to find it tomorrow.
>
> I can check tomorrow how our crush tree unfolds. Basically, for
> every device class there is a full copy (shadow hierarchy) for each
> device class with its own weights etc.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Marc Roos 
> Sent: 29 September 2020 22:19:33
> To: eblock; Frank Schilder
> Cc: ceph-users
> Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd
> class osd's
>
> Yes correct this is coming from Luminous or maybe even Kraken. How does
> a default crush tree look like in mimic or octopus? Or is there some
> manual how to bring this to the new 'default'?
>
>
> -Original Message-
> Cc: ceph-users
> Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd
> class osd's
>
> Are these crush maps inherited from pre-mimic versions? I have
> re-balanced SSD and HDD pools in mimic (mimic deployed) where one device
> class never influenced the placement of the other. I have mixed hosts
> and went as far as introducing rbd_meta, rbd_data and such classes to
> sub-divide even further (all these devices have different perf specs).
> This worked like a charm. When adding devices of one class, only pools
> in this class were ever affected.
>
> As far as I understand, starting with mimic, every shadow class defines
> a separate tree (not just leafs/OSDs). Thus, device classes are
> independent of each other.
>
>
>
> 
> Sent: 29 September 2020 20:54:48
> To: eblock
> Cc: ceph-users
> Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class
> osd's
>
> Yes correct, hosts have indeed both ssd's and hdd's combined. Is this
> not more of a bug then? I would assume the goal of using device classes
> is that you separate these and one does not affect the other, even the
> host weight of the ssd and hdd class are already available. The
> algorithm should just use that instead of the weight of the whole host.
> Or is there some specific use case, where these classes combined is
> required?
>
>
> -Original Message-
> Cc: ceph-users
> Subject: *SPAM* Re: [ceph-users] Re: hdd pg's migrating when
> converting ssd class osd's
>
> They're still in the same root (default) and each host is member of both
> device-classes, I guess you have a mixed setup (hosts c01/c02 have both
> HDDs and SSDs)? I don't think this separation is enough to avoid
> remapping even if a different device-class is affected (your report
> confirms that).
>
> Dividing the crush tree into different subtrees might help here but I'm
> not sure if that's really something you need. You might also just deal
> with the remapping as long as it doesn't happen too often, I guess. On
> the other hand, if your setup won't change (except adding more OSDs) you
> might as well think about a different crush tree. It really depends on
> your actual requirements.
>
> We created two different subtrees when we got new hardware and it helped
> us a lot moving the data only once to the new hardware avoiding multiple
> remappings, now the older hardware is our EC environment except for some
> SSDs on those old hosts that had to stay in the main subtree. So our
> setup is also very individual but it works quite nice.
> :-)
>
>
> Zitat von :
>
>> I have practically a default setup. If I do a 'ceph osd crush tree
>> --show-shadow' I have a listing like this[1]. I would assume from the
>> hosts being listed within the default~ssd and default~hdd, they are
>> separate (enough)?
>>
>>
>> [1]
>> root default~ssd
>>  host c01~ssd
>> ..
>> ..

[ceph-users] Re: hdd pg's migrating when converting ssd class osd's

2020-09-29 Thread Frank Schilder
Somebody on this list posted a script that can convert pre-mimic crush trees 
with buckets for different types of devices to crush trees with device classes 
with minimal data movement (trying to maintain IDs as much as possible). Don't 
have a thread name right now, but could try to find it tomorrow.

I can check tomorrow how our crush tree unfolds. Basically, for every device 
class there is a full copy (shadow hierarchy) for each device class with its 
own weights etc.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc Roos 
Sent: 29 September 2020 22:19:33
To: eblock; Frank Schilder
Cc: ceph-users
Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Yes correct, this is coming from Luminous or maybe even Kraken. What does
a default crush tree look like in mimic or octopus? Or is there some
manual on how to bring this to the new 'default'?


-Original Message-
Cc: ceph-users
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd
class osd's

Are these crush maps inherited from pre-mimic versions? I have
re-balanced SSD and HDD pools in mimic (mimic deployed) where one device
class never influenced the placement of the other. I have mixed hosts
and went as far as introducing rbd_meta, rbd_data and such classes to
sub-divide even further (all these devices have different perf specs).
This worked like a charm. When adding devices of one class, only pools
in this class were ever affected.

As far as I understand, starting with mimic, every shadow class defines
a separate tree (not just leafs/OSDs). Thus, device classes are
independent of each other.




Sent: 29 September 2020 20:54:48
To: eblock
Cc: ceph-users
Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class
osd's

Yes correct, hosts have indeed both ssd's and hdd's combined. Is this
not more of a bug then? I would assume the goal of using device classes
is that you separate these and one does not affect the other, even the
host weight of the ssd and hdd class are already available. The
algorithm should just use that instead of the weight of the whole host.
Or is there some specific use case, where these classes combined is
required?


-Original Message-
Cc: ceph-users
Subject: *SPAM* Re: [ceph-users] Re: hdd pg's migrating when
converting ssd class osd's

They're still in the same root (default) and each host is member of both
device-classes, I guess you have a mixed setup (hosts c01/c02 have both
HDDs and SSDs)? I don't think this separation is enough to avoid
remapping even if a different device-class is affected (your report
confirms that).

Dividing the crush tree into different subtrees might help here but I'm
not sure if that's really something you need. You might also just deal
with the remapping as long as it doesn't happen too often, I guess. On
the other hand, if your setup won't change (except adding more OSDs) you
might as well think about a different crush tree. It really depends on
your actual requirements.

We created two different subtrees when we got new hardware and it helped
us a lot moving the data only once to the new hardware avoiding multiple
remappings, now the older hardware is our EC environment except for some
SSDs on those old hosts that had to stay in the main subtree. So our
setup is also very individual but it works quite nice.
:-)


Zitat von :

> I have practically a default setup. If I do a 'ceph osd crush tree
> --show-shadow' I have a listing like this[1]. I would assume from the
> hosts being listed within the default~ssd and default~hdd, they are
> separate (enough)?
>
>
> [1]
> root default~ssd
>  host c01~ssd
> ..
> ..
>  host c02~ssd
> ..
> root default~hdd
>  host c01~hdd
> ..
>  host c02~hdd
> ..
> root default
>
>
>
>
> -Original Message-
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class

> osd's
>
> Are all the OSDs in the same crush root? I would think that since the
> crush weight of hosts change as soon as OSDs are out it impacts the
> whole crush tree. If you separate the SSDs from the HDDs logically
(e.g.
> different bucket type in the crush tree) the ramapping wouldn't affect

> the HDDs.
>
>
>
>
>> I have been converting ssd's osd's to dmcrypt, and I have noticed
>> that
>
>> pg's of pools are migrated that should be (and are?) on hdd class.
>>
>> On a healthy ok cluster I am getting, when I set the crush reweight
>> to
>
>> 0.0 of a ssd osd this:
>>
>> 17.35 10415  00  9907   0
>> 36001743890   0  0 3045 3045
>> active+remapped+backfilling 2020-09-27 12:55:49.093054
>> active+remapp

[ceph-users] Re: hdd pg's migrating when converting ssd class osd's

2020-09-29 Thread Frank Schilder
Are these crush maps inherited from pre-mimic versions? I have re-balanced SSD 
and HDD pools in mimic (mimic deployed) where one device class never influenced 
the placement of the other. I have mixed hosts and went as far as introducing 
rbd_meta, rbd_data and such classes to sub-divide even further (all these 
devices have different perf specs). This worked like a charm. When adding 
devices of one class, only pools in this class were ever affected.

As far as I understand, starting with mimic, every shadow class defines a 
separate tree (not just leafs/OSDs). Thus, device classes are independent of 
each other.
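
For example, this is roughly how such an extra class is set up (a sketch; OSD id, class name, rule name and pool name are placeholders):

  # a device class can only be set after removing the existing one:
  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class rbd_meta osd.12

  # a replicated rule restricted to that class, then assign it to a pool:
  ceph osd crush rule create-replicated rbd-meta-rule default host rbd_meta
  ceph osd pool set POOL crush_rule rbd-meta-rule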

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc Roos 
Sent: 29 September 2020 20:54:48
To: eblock
Cc: ceph-users
Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Yes correct, hosts have indeed both ssd's and hdd's combined. Is this
not more of a bug then? I would assume the goal of using device classes
is that you separate these and one does not affect the other, even the
host weight of the ssd and hdd class are already available. The
algorithm should just use that instead of the weight of the whole host.
Or is there some specific use case, where these classes combined is
required?


-Original Message-
Cc: ceph-users
Subject: *SPAM* Re: [ceph-users] Re: hdd pg's migrating when
converting ssd class osd's

They're still in the same root (default) and each host is member of both
device-classes, I guess you have a mixed setup (hosts c01/c02 have both
HDDs and SSDs)? I don't think this separation is enough to avoid
remapping even if a different device-class is affected (your report
confirms that).

Dividing the crush tree into different subtrees might help here but I'm
not sure if that's really something you need. You might also just deal
with the remapping as long as it doesn't happen too often, I guess. On
the other hand, if your setup won't change (except adding more OSDs) you
might as well think about a different crush tree. It really depends on
your actual requirements.

We created two different subtrees when we got new hardware and it helped
us a lot moving the data only once to the new hardware avoiding multiple
remappings, now the older hardware is our EC environment except for some
SSDs on those old hosts that had to stay in the main subtree. So our
setup is also very individual but it works quite nice.
:-)


Zitat von :

> I have practically a default setup. If I do a 'ceph osd crush tree
> --show-shadow' I have a listing like this[1]. I would assume from the
> hosts being listed within the default~ssd and default~hdd, they are
> separate (enough)?
>
>
> [1]
> root default~ssd
>  host c01~ssd
> ..
> ..
>  host c02~ssd
> ..
> root default~hdd
>  host c01~hdd
> ..
>  host c02~hdd
> ..
> root default
>
>
>
>
> -Original Message-
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class

> osd's
>
> Are all the OSDs in the same crush root? I would think that since the
> crush weight of hosts change as soon as OSDs are out it impacts the
> whole crush tree. If you separate the SSDs from the HDDs logically
(e.g.
> different bucket type in the crush tree) the ramapping wouldn't affect

> the HDDs.
>
>
>
>
>> I have been converting ssd's osd's to dmcrypt, and I have noticed
>> that
>
>> pg's of pools are migrated that should be (and are?) on hdd class.
>>
>> On a healthy ok cluster I am getting, when I set the crush reweight
>> to
>
>> 0.0 of a ssd osd this:
>>
>> 17.35 10415  00  9907   0
>> 36001743890   0  0 3045 3045
>> active+remapped+backfilling 2020-09-27 12:55:49.093054
>> active+remapped+83758'20725398
>> 83758:100379720  [8,14,23]  8  [3,14,23]  3
>> 83636'20718129 2020-09-27 00:58:07.098096  83300'20689151 2020-09-24
>> 21:42:07.385360 0
>>
>> However osds 3,14,23,8 are all hdd osd's
>>
>> Since this is a cluster from Kraken/Luminous, I am not sure if the
>> device class of the replicated_ruleset[1] was set when the pool 17
>> was
>
>> created.
>> Weird thing is that all pg's of this pool seem to be on hdd osd[2]
>>
>> Q. How can I display the definition of 'crush_rule 0' at the time of
>> the pool creation? (To be sure it had already this device class hdd
>> configured)
>>
>>
>>
>> [1]
>> [@~]# ceph osd pool ls detail | grep 'pool 17'
>> pool 17 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 83712
>> f

[ceph-users] Re: samba vfs_ceph: client_mds_namespace not working?

2020-09-23 Thread Frank Schilder
Hi Stefan,

thanks for your answer. I think the deprecated option is still supported and I 
found something else - I will update to the new option though. On the ceph 
side, I see in the log now:

  client session with non-allowable root '/' denied (client.31382084 
192.168.48.135:0/2576875769)

It looks like the path option of vfs_ceph is not passed on correctly. I 
neither tried nor allowed mounting the root itself, only a sub-directory. It 
looks like one of the combinations I tested works, but the path into the ceph 
fs is not used. The option I use is:

  path = /shares/FOLDER-NAME

and this should show up in the client session as root '/shares/FOLDER-NAME'. 
This is starting to look like a bug in vfs_ceph.c.
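
For completeness, the share definition in question looks roughly like this (a sketch; share name, path and cephx user are placeholders - if I read the vfs_ceph manpage correctly, ceph:user_id takes the id without the "client." prefix):

  [FOLDER-NAME]
      path = /shares/FOLDER-NAME
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      ceph:user_id = samba
      kernel share modes = no

with cephx caps restricted to the sub-directory, e.g. created via

  ceph fs authorize CEPH-FS-NAME client.samba /shares/FOLDER-NAME rw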

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 23 September 2020 11:49:29
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] samba vfs_ceph: client_mds_namespace not working?

On 2020-09-23 11:00, Frank Schilder wrote:
> Dear all,
>
> maybe someone has experienced this before. We are setting up a SAMBA gateway 
> and would like to use the vfs_ceph module. In case of several file systems 
> one needs to choose an mds namespace. There is an option in ceph.conf:
>
>   client mds namespace = CEPH-FS-NAME
>
> Unfortunately, it seems not to work. I tried it in all possible versions, in 
> [global] and [client], with and without "client" at the beginning, to no 
> avail. I either get a time out or an error. I also found the libcephfs 
> function

In ceph/src/common/options.cc I found this:

Option("client_fs", Option::TYPE_STR, Option::LEVEL_ADVANCED)
.set_flag(Option::FLAG_STARTUP)
.set_default("")
.set_description("CephFS file system name to mount")
.set_long_description("Use this with ceph-fuse, or with any process "
"that uses libcephfs.  Programs using libcephfs may also pass "
"the filesystem name into mount(), which will override this
setting. "
"If no filesystem name is given in mount() or this setting, the
default "
"filesystem will be mounted (usually the first created)."),

/* Alias for client_fs. Deprecated */
Option("client_mds_namespace", Option::TYPE_STR, Option::LEVEL_DEV)
.set_flag(Option::FLAG_STARTUP)
.set_default(""),

So the client_mds_namespace is deprecated, and maybe even removed? Does
it work if you specify "client_fs"?
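
So presumably something like this in the client's ceph.conf (a sketch, untested here):

  [client]
      client_fs = CEPH-FS-NAME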

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Documentation broken

2020-09-23 Thread Frank Schilder
Hi Lenz,

thanks for that, this should do. Please retain the copy until all is migrated :)

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Lenz Grimmer 
Sent: 23 September 2020 10:55:13
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Documentation broken

Hi Frank,

On 9/22/20 4:30 PM, Frank Schilder wrote:

> during the migration of documentation, would it be possible to make
> the old documentation available somehow? A lot of pages are broken
> and I can't access the documentation for mimic at all any more.
>
> Is there an archive or something similar?

The wayback machine has an online copy from May this year:

https://web.archive.org/web/20191226012841/https://docs.ceph.com/docs/mimic/

Alternatively, all previous versions of the docs are of course stored in
the git repo (but admittedly not that easy to browse/read):

https://github.com/ceph/ceph/tree/mimic/doc

Hope that helps,

Lenz

--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: samba vfs_ceph: client_mds_namespace not working?

2020-09-23 Thread Frank Schilder
Update: setting "ceph fs set-default CEPH-FS-NAME" allows to do a kernel fs 
mount without providing the mds_namespace mount option, but the vfs_ceph module 
still fails with either

  cephwrap_connect: [CEPH] Error return: Operation not permitted

or

  cephwrap_connect: [CEPH] Error return: Operation not supported

depending on whether I use

  ceph:user_id = USER-NAME

or

  ceph:user_id = client.USER-NAME

I guess the second user spec is correct as the first error message indicates an 
auth problem. On the client side I see the same message in both cases:

  tree connect failed: NT_STATUS_UNSUCCESSFUL

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 23 September 2020 11:00:50
To: ceph-users
Subject: [ceph-users] samba vfs_ceph: client_mds_namespace not working?

Dear all,

maybe someone has experienced this before. We are setting up a SAMBA gateway 
and would like to use the vfs_ceph module. In case of several file systems one 
needs to choose an mds namespace. There is an option in ceph.conf:

  client mds namespace = CEPH-FS-NAME

Unfortunately, it seems not to work. I tried it in all possible versions, in 
[global] and [client], with and without "client" at the beginning, to no avail. 
I either get a time out or an error. I also found the libcephfs function

  ceph_select_filesystem(cmount, CEPH-FS-NAME)

added it to vfs_ceph.c just before the ceph_mount, with the same result: I get 
an error (operation not permitted). Does anyone know how to get this to work? 
And, yes, I tested an ordinary kernel fs mount with the credentials for the 
ceph client without problems.

I can't access any documentation on the libcephfs api, I always get a page not 
found error.

My last resort is now to

  ceph fs set-default CEPH-FS-NAME

to the fs to be used and live with the implied restrictions and ugliness.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] samba vfs_ceph: client_mds_namespace not working?

2020-09-23 Thread Frank Schilder
Dear all,

maybe someone has experienced this before. We are setting up a SAMBA gateway 
and would like to use the vfs_ceph module. In case of several file systems one 
needs to choose an mds namespace. There is an option in ceph.conf:

  client mds namespace = CEPH-FS-NAME

Unfortunately, it seems not to work. I tried it in all possible versions, in 
[global] and [client], with and without "client" at the beginning, to no avail. 
I either get a time out or an error. I also found the libcephfs function

  ceph_select_filesystem(cmount, CEPH-FS-NAME)

added it to vfs_ceph.c just before the ceph_mount, with the same result: I get 
an error (operation not permitted). Does anyone know how to get this to work? 
And, yes, I tested an ordinary kernel fs mount with the credentials for the 
ceph client without problems.

I can't access any documentation on the libcephfs api, I always get a page not 
found error.

My last resort is now to

  ceph fs set-default CEPH-FS-NAME

to the fs to be used and live with the implied restrictions and ugliness.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unknown PGs after osd move

2020-09-22 Thread Frank Schilder
No, the recipe I gave was for trying to recover healthy status of all PGs in 
the current situation.

I would avoid moving OSDs at all cost, because it will always imply 
rebalancing. Any change to the crush map changes how PGs are hashed onto OSDs, 
which in turn triggers a rebalancing.

If moving OSDs cannot be avoided, I usually do:

- evacuate OSDs that need to move
- move empty (!) OSDs to new location
- let data move back onto OSDs

There are other ways of doing it, with their own pro's and cons. For example, 
if your client load allows high-bandwidth rebuild operations, you can also

- shut down OSDs that need to move (make sure you don't shut down too many from 
different failure domains at the same time)
- let the remaining OSDs rebuild the missing data
- after health is back to OK, move OSDs and start up

The second way is usually faster, but has the drawback that new writes will go 
to less redundant storage for a while. The first method takes longer, but there 
is no redundancy degradation along the way.
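
As a rough sketch of the first method (OSD id and weight below are examples, not a 
tested recipe):

  ceph osd crush reweight osd.12 0      # evacuate; wait until all PGs are active+clean again
  systemctl stop ceph-osd@12            # stop and physically move the now empty OSD
  # ... after the OSD is up on the new host ...
  ceph osd crush reweight osd.12 1.819  # restore the original crush weight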

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nico Schottelius 
Sent: 22 September 2020 22:13:49
To: Frank Schilder
Cc: Nico Schottelius; Andreas John; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Unknown PGs after osd move

Hey Frank,

Frank Schilder  writes:

>> > Is the crush map aware about that?
>>
>> Yes, it correctly shows the osds at serve8 (previously server15).
>>
>> > I didn't ever try that, but don't you need to crush move it?
>>
>> I originally imagined this, too. But as soon as the osd starts on a new
>> server it is automatically put into the serve8 bucket.
>
> It does not work like this, unfortunately. If you physically move
> disks to a new server without "informing ceph" in advance, that is,
> crush moving the OSDs while they are up, ceph loses placement
> information. You can post-repair such a situation by temporarily
> "crush moving" (software move, not hardware move) the OSDs back to
> their previous host buckets, wait for peering to complete, and then
> "crush move" them to their new location again.

That is good to know. So in theory:

- crush move osd to a different server bucket
- shutdown osd
- move physically to another server
- no rebalancing needed

Should do the job?

It won't accept today's rebalance, but it would be good to have a sane
way for the future.

Cheers,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unknown PGs after osd move

2020-09-22 Thread Frank Schilder
> > Is the crush map aware about that?
>
> Yes, it correctly shows the osds at serve8 (previously server15).
>
> > I didn't ever try that, but don't you need to crush move it?
>
> I originally imagined this, too. But as soon as the osd starts on a new
> server it is automatically put into the serve8 bucket.

It does not work like this, unfortunately. If you physically move disks to a 
new server without "informing ceph" in advance, that is, crush moving the OSDs 
while they are up, ceph loses placement information. You can post-repair such 
a situation by temporarily "crush moving" (software move, not hardware move) 
the OSDs back to their previous host buckets, wait for peering to complete, and 
then "crush move" them to their new location again. Do not restart OSDs during 
this process or while rebalancing of misplaced objects is going on. There is a 
long-standing issue that causes placement information to be lost again and one 
would need to repeat the procedure.
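
A sketch of that repair, with made-up OSD id, weight and host names (check the exact 
syntax for your release):

  # software-move the OSD back to the bucket of its old host
  ceph osd crush create-or-move osd.7 1.82 root=default host=server15
  # wait for peering to complete (ceph -s shows no peering PGs), then move it again
  ceph osd crush create-or-move osd.7 1.82 root=default host=server8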

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nico Schottelius 
Sent: 22 September 2020 21:14:07
To: Andreas John
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Unknown PGs after osd move

Hey Andreas,

Andreas John  writes:

> Hello,
>
> On 22.09.20 20:45, Nico Schottelius wrote:
>> Hello,
>>
>> after having moved 4 ssds to another host (+ the ceph tell hanging issue
>> - see previous mail), we ran into 241 unknown pgs:
>
> You mean, that you re-seated the OSDs into another chassis/host?

That is correct.

> Is the crush map aware about that?

Yes, it correctly shows the osds at serve8 (previously server15).

> I didn't ever try that, but don't you need to crush move it?

I originally imagined this, too. But as soon as the osd starts on a new
server it is automatically put into the serve8 bucket.

Cheers,

Nico


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Documentation broken

2020-09-22 Thread Frank Schilder
Hi all,

during the migration of documentation, would it be possible to make the old 
documentation available somehow? A lot of pages are broken and I can't access 
the documentation for mimic at all any more.

Is there an archive or something similar?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Setting up a small experimental CEPH network

2020-09-21 Thread Frank Schilder
Hi all,

we use heavily bonded interfaces (6x10G) and also needed to look at this 
balancing question. We use LACP bonding and, while the host OS probably tries 
to balance outgoing traffic over all NICs, the real decision is made by the 
switches (incoming traffic). Our switches hash packets to a port by (source?) 
MAC address, meaning that it is not the number of TCP/IP connections that helps 
balancing, but only the number of MAC addresses. In an LACP bond, all NICs have 
the same MAC address and balancing happens by (physical) host. The more hosts, 
the better it will work.

In a way, for us this is a problem and not at the same time. We have about 550 
physical clients (an HPC cluster) and 12 OSD hosts, which means that we 
probably have a good load on every single NIC for client traffic.

On the other hand, rebalancing between 12 servers is unlikely to use all NICs 
effectively. So far, we don't have enough disks per host to notice that, but it 
could become visible at some point. Basically, the host with the worst 
switch-side hashing for incoming traffic will become the bottleneck.

On some switches the hashing method for LACP bonds can be configured, however, 
not with much detail. I have not seen a possibility to use IP:PORT for hashing 
to a switch port.
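
On the host side (outgoing traffic only), Linux bonding can at least be told to hash 
on IP and port. Assuming an 802.3ad bond named bond0, something like the following 
should work (adapt to your distribution):

  # e.g. BONDING_OPTS in ifcfg-bond0
  mode=802.3ad miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4

The switch still decides independently how it hashes the traffic it sends towards 
the hosts.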

I have no experience with bonding mode 6 (ALB) that might provide a 
per-connection hashing. Would be interested to hear how it performs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc Roos 
Sent: 21 September 2020 11:08:55
To: ceph-users; lindsay.mathieson
Subject: [ceph-users] Re: Setting up a small experimental CEPH network

I tested something in the past[1] where I could notice that an osd
saturated a bond link and did not use the available 2nd one. I think I
maybe made a mistake in writing down that it was a 1x replicated pool.
However, it has been written here multiple times that these osd processes
are single-threaded, so afaik they cannot use more than one link, and the
moment your osd has a saturated link, your clients will notice this.


[1]
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html



-Original Message-
From: Lindsay Mathieson [mailto:lindsay.mathie...@gmail.com]
Sent: Monday, 21 September 2020 2:42
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Setting up a small experimental CEPH network

On 21/09/2020 5:40 am, Stefan Kooman wrote:
> My experience with bonding and Ceph is pretty good (OpenvSwitch). Ceph

> uses lots of tcp connections, and those can get shifted (balanced)
> between interfaces depending on load.

Same here - I'm running 4*1GB (LACP, Balance-TCP) on a 5 node cluster
with 19 OSD's. 20 Active VM's and it idles at under 1 MiB/s, spikes up
to 100MiB/s no problem. When doing a heavy rebalance/repair data rates
on any one node can hit 400MiBs+


It scales out really well.

--
Lindsay
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-18 Thread Frank Schilder
Dear Michael,

maybe there is a way to restore access for users and solve the issues later. 
Someone else with a lost/unfound object was able to move the affected file (or 
directory containing the file) to a separate location and restore the now 
missing data from backup. This will "park" the problem of cluster health for 
later fixing.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 18 September 2020 15:38:51
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,

> I disagree with the statement that trying to recover health by deleting
> data is a contradiction.  In some cases (such as mine), the data in ceph
> is backed up in another location (eg tape library).  Restoring a few
> files from tape is a simple and cheap operation that takes a minute, at
> most.

I would agree with that if the data was deleted using the appropriate 
high-level operation. Deleting an unfound object is like marking a sector on a 
disk as bad with smartctl. How should the file system react to that? Purging an 
OSD is like removing a disk from a raid set. Such operations increase 
inconsistencies/degradation rather than resolving them. Cleaning this up also 
requires to execute other operations to remove all references to the object 
and, finally, the file inode itself.

The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. 
For example, when coloring is enabled, ls will stat every file in the dir to be 
able to choose the color according to permissions. If one then disables 
coloring, a plain "ls" will return all names while an "ls -l" will hang due to 
stat calls.

An "rm" or "rm -f" should succeed if the folder permissions allow that. It 
should not stat the file itself, so it sounds a bit odd that it's hanging. I 
guess in some situations it does, like "rm -i", which will ask before removing 
read-only files. How does "unlink FILE" behave?
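
For example (path and file name made up, behaviour as I would expect it, not 
verified on your system):

  ls --color=never /ceph/dir-with-bad-file     # plain readdir, should return the names
  ls -l /ceph/dir-with-bad-file                # stat()s every entry and may hang on the bad file
  unlink /ceph/dir-with-bad-file/BAD-FILE      # removes the name without stat'ing the file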

Most admin commands on ceph are asynchronous. A command like "pg repair" or 
"osd scrub" only schedules an operation. The command "ceph pg 7.1fb 
mark_unfound_lost delete" does probably just the same. Unfortunately, I don't 
know how to check that a scheduled operation has 
started/completed/succeeded/failed. I asked this in an earlier thread (about PG 
repair) and didn't get an answer. On our cluster, the actual repair happened 
ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that 
(some of) these operations have very low priority and will not start at least 
as long as there is recovery going on. One might want to consider the 
possibility that some of the scheduled commands have not been executed yet.
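
One crude way to see whether such a command has run at all is to watch the PG state 
and scrub stamps (the PG id below is an example):

  ceph pg 7.1fb query | grep -E 'last_(deep_)?scrub_stamp'  # stamps only advance after completion
  ceph pg dump pgs_brief | grep '^7\.1fb'                   # state shows scrubbing/repair while running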

The output of "pg query" contains the IDs of the missing objects (in mimic) and 
each of these objects is on one of the peer OSDs of the PG (I think object here 
refers to shard or copy). It should be possible to find the corresponding OSD 
(or at least obtain confirmation that the object is really gone) and move the 
object to a place where it is expected to be found. This can probably be 
achieved with "PG export" and "PG import". I don't know of any other way(s).

I guess, in the current situation, sitting it out a bit longer might be a good 
strategy. I don't know how many asynchronous commands you executed and giving 
the cluster time to complete these jobs might improve the situation.

Sorry that I can't be of more help here. However, if you figure out a solution 
(ideally non-destructive), please post it here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 18 September 2020 14:15:53
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] multiple OSD crash, unfound objects

Hi Frank,

On 9/18/20 2:50 AM, Frank Schilder wrote:
> Dear Michael,
>
> firstly, I'm a bit confused why you started deleting data. The objects were 
> unfound, but still there. That's a small issue. Now the data might be gone 
> and that's a real issue.
>
> 
> Interval:
>
> Anyone reading this: I have seen many threads where ceph admins started 
> deleting objects or PGs or even purging OSDs way too early from a cluster. 
> Trying to recover health by deleting data is a contradiction. Ceph has bugs 
> and sometimes it needs some help finding everything again. As far as I know, 
> for most of these bugs there are workarounds that allow full recovery with a 
> bit of work.

I disagree with the statement that trying to recover health by deleting
data is a contradiction.  In some cases (such as mine), the data in ceph
is backed up in another location (eg tape library

[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-18 Thread Frank Schilder
Dear Michael,

> I disagree with the statement that trying to recover health by deleting
> data is a contradiction.  In some cases (such as mine), the data in ceph
> is backed up in another location (eg tape library).  Restoring a few
> files from tape is a simple and cheap operation that takes a minute, at
> most.

I would agree with that if the data was deleted using the appropriate 
high-level operation. Deleting an unfound object is like marking a sector on a 
disk as bad with smartctl. How should the file system react to that? Purging an 
OSD is like removing a disk from a raid set. Such operations increase 
inconsistencies/degradation rather than resolving them. Cleaning this up also 
requires to execute other operations to remove all references to the object 
and, finally, the file inode itself.

The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. 
For example, when coloring is enabled, ls will stat every file in the dir to be 
able to choose the color according to permissions. If one then disables 
coloring, a plain "ls" will return all names while an "ls -l" will hang due to 
stat calls.

An "rm" or "rm -f" should succeed if the folder permissions allow that. It 
should not stat the file itself, so it sounds a bit odd that it's hanging. I 
guess in some situations it does, like "rm -i", which will ask before removing 
read-only files. How does "unlink FILE" behave?

Most admin commands on ceph are asynchronous. A command like "pg repair" or 
"osd scrub" only schedules an operation. The command "ceph pg 7.1fb 
mark_unfound_lost delete" does probably just the same. Unfortunately, I don't 
know how to check that a scheduled operation has 
started/completed/succeeded/failed. I asked this in an earlier thread (about PG 
repair) and didn't get an answer. On our cluster, the actual repair happened 
ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that 
(some of) these operations have very low priority and will not start at least 
as long as there is recovery going on. One might want to consider the 
possibility that some of the scheduled commands have not been executed yet.

The output of "pg query" contains the IDs of the missing objects (in mimic) and 
each of these objects is on one of the peer OSDs of the PG (I think object here 
refers to shard or copy). It should be possible to find the corresponding OSD 
(or at least obtain confirmation that the object is really gone) and move the 
object to a place where it is expected to be found. This can probably be 
achieved with "PG export" and "PG import". I don't know of any other way(s).

I guess, in the current situation, sitting it out a bit longer might be a good 
strategy. I don't know how many asynchronous commands you executed and giving 
the cluster time to complete these jobs might improve the situation.

Sorry that I can't be of more help here. However, if you figure out a solution 
(ideally non-destructive), please post it here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 18 September 2020 14:15:53
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] multiple OSD crash, unfound objects

Hi Frank,

On 9/18/20 2:50 AM, Frank Schilder wrote:
> Dear Michael,
>
> firstly, I'm a bit confused why you started deleting data. The objects were 
> unfound, but still there. That's a small issue. Now the data might be gone 
> and that's a real issue.
>
> 
> Interval:
>
> Anyone reading this: I have seen many threads where ceph admins started 
> deleting objects or PGs or even purging OSDs way too early from a cluster. 
> Trying to recover health by deleting data is a contradiction. Ceph has bugs 
> and sometimes it needs some help finding everything again. As far as I know, 
> for most of these bugs there are workarounds that allow full recovery with a 
> bit of work.

I disagree with the statement that trying to recover health by deleting
data is a contradiction.  In some cases (such as mine), the data in ceph
is backed up in another location (eg tape library).  Restoring a few
files from tape is a simple and cheap operation that takes a minute, at
most.  For the sake of expediency, sometimes it's quicker and easier to
simply delete the affected files and restore them from the backup system.

This procedure has worked fine with our previous distributed filesystem
(hdfs), so I (naively?) thought that it could be used with ceph as well.
  I was a bit surprised that cephs behavior was to indefinitely block
the 'rm' operation so that the affected file could not even be removed.

Since I have 25 unfound objects spread across 9 PGs, I used a PG with a
single unfound object

[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-18 Thread Frank Schilder
Dear Michael,

firstly, I'm a bit confused why you started deleting data. The objects were 
unfound, but still there. That's a small issue. Now the data might be gone and 
that's a real issue.


Interval:

Anyone reading this: I have seen many threads where ceph admins started 
deleting objects or PGs or even purging OSDs way too early from a cluster. 
Trying to recover health by deleting data is a contradiction. Ceph has bugs and 
sometimes it needs some help finding everything again. As far as I know, for 
most of these bugs there are workarounds that allow full recovery with a bit of 
work.


First question is, did you delete the entire object or just a shard on one 
disk? Are there OSDs that might still have a copy?

If the object is gone for good, the file references something that doesn't 
exist - it's like a bad sector. You probably need to delete the file. A bit 
strange that the operation does not err out with a read error. Maybe it doesn't 
because it waits for the unfound objects state to be resolved?

For all the other unfound objects, they are there somewhere - you didn't lose 
a disk or something. Try pushing ceph to scan the correct OSDs, for example, by 
restarting the newly added OSDs one by one or something similar. Sometimes 
exporting and importing a PG from one OSD to another forces a re-scan and 
subsequent discovery of unfound objects. It is also possible that ceph will 
find these objects along the way of recovery or when OSDs scrub or check for 
objects that can be deleted.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 17 September 2020 22:27:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] multiple OSD crash, unfound objects

Hi Frank,

Yes, it does sounds similar to your ticket.

I've tried a few things to restore the failed files:

* Locate a missing object with 'ceph pg $pgid list_unfound'

* Convert the hex oid to a decimal inode number

* Identify the affected file with 'find /ceph -inum $inode'
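
As a concrete sketch of those three steps (the oid and mount point are examples):

  ceph pg 7.1fb list_unfound                # oids look like 10005b1c0c6.00000000
  inode=$(printf '%d' 0x10005b1c0c6)        # the hex part before the dot is the inode number
  find /ceph -inum "$inode"                 # map the inode back to a path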

At this point, I know which file is affected by the missing object.  As
expected, attempts to read the file simply hang.  Unexpectedly, attempts
to 'ls' the file or its containing directory also hang.  I presume from
this that the stat() system call needs some information that is
contained in the missing object, and is waiting for the object to become
available.

Next I tried to remove the affected object with:

* ceph pg $pgid mark_unfound_lost delete

Now 'ceph status' shows one fewer missing objects, but attempts to 'ls'
or 'rm' the affected file continue to hang.

Finally, I ran a scrub over the part of the filesystem containing the
affected file:

ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive

Nothing seemed to come up during the scrub:

2020-09-17T14:56:15.208-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub status {prefix=scrub status} (starting...)
2020-09-17T14:58:58.013-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub start {path=/frames/postO3/hoft,prefix=scrub
start,scrubops=[recursive]} (starting...)
2020-09-17T14:58:58.013-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: active
2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub queued for path: /frames/postO3/hoft
2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: active [paths:/frames/postO3/hoft]
2020-09-17T14:59:02.535-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub status {prefix=scrub status} (starting...)
2020-09-17T15:00:12.520-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub status {prefix=scrub status} (starting...)
2020-09-17T15:02:32.944-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: idle
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a'
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub completed for path: /frames/postO3/hoft
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: idle


After the scrub completed, access to the file (ls or rm) continue to
hang.  The MDS reports slow reads:

2020-09-17T15:11:05.654-0500 7f39b9a1e700  0 log_channel(cluster) log
[WRN] : slow request 481.867381 seconds old, received at
2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309
getattr pAsLsXsFs #0x105b1c0 2020-09-17T15:03:03.787602-0500
caller_uid=0, caller_gid=0{}) currently dispatched

Does anyone have any suggestions on how else to clean up from a
permanently lost object?

--Mike

On 9/16/20 2:03 AM, Frank Schilder wrote:
> Sounds similar to this one: https://tracker.ceph.com/issues/46847
>
> If you have or can reconstruct the crush map from before adding the OSDs, you 
> might be able to discove

[ceph-users] vfs_ceph for CentOS 8

2020-09-17 Thread Frank Schilder
Hi all,

we are setting up a SAMBA share and would like to use the vfs_ceph module. 
Unfortunately, it seems not to be part of the common SAMBA packages on CentOS 
8. Does anyone know how to install vfs_ceph? The SAMBA version on CentOS 8 is 
samba-4.11.2-13 and the documentation says the module is part of it.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-16 Thread Frank Schilder
Sounds similar to this one: https://tracker.ceph.com/issues/46847

If you have or can reconstruct the crush map from before adding the OSDs, you 
might be able to discover everything with the temporary reversal of the crush 
map method.

Not sure if there is another method, i never got a reply to my question in the 
tracker.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 16 September 2020 01:27:19
To: ceph-users@ceph.io
Subject: [ceph-users] multiple OSD crash, unfound objects

Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time.  The OSDs are part of
an erasure coded pool.  At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster.  After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects.  There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.

The pool with the missing objects is a cephfs pool.  The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).

I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
  I found a number of OSDs that are still 'not queried'.  Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.
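
For reference, the probe status per peer OSD shows up under "might_have_unfound" in 
the pg query output, e.g.:

  ceph pg 7.1fb query | grep -B 2 '"not queried"'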

I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed.  I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).

The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations."  I'm not sure how long "some time" might
take, but it hasn't changed after several hours.

My questions are:

* Is there a way to force the cluster to query the possible locations
sooner?

* Is it possible to identify the files in cephfs that are affected, so
that I could delete only the affected files and restore them from backup
tapes?

--Mike

ceph -s:

   cluster:
 id: 066f558c-6789-4a93-aaf1-5af1ba01a3ad
 health: HEALTH_ERR
 1 clients failing to respond to capability release
 1 MDSs report slow requests
 25/78520351 objects unfound (0.000%)
 2 nearfull osd(s)
 Reduced data availability: 1 pg inactive
 Possible data damage: 9 pgs recovery_unfound
 Degraded data redundancy: 75/626645098 objects degraded
(0.000%), 9 pgs degraded
 1013 pgs not deep-scrubbed in time
 1013 pgs not scrubbed in time
 2 pool(s) nearfull
 1 daemons have recently crashed
 4 slow ops, oldest one blocked for 77939 sec, daemons
[osd.0,osd.41] have slow ops.

   services:
 mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
 mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
 mds: archive:1 {0=ceph4=up:active} 3 up:standby
 osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

   task status:
 scrub status:
 mds.ceph4: idle

   data:
 pools:   9 pools, 2433 pgs
 objects: 78.52M objects, 298 TiB
 usage:   412 TiB used, 545 TiB / 956 TiB avail
 pgs: 0.041% pgs unknown
  75/626645098 objects degraded (0.000%)
  135224/626645098 objects misplaced (0.022%)
  25/78520351 objects unfound (0.000%)
   2421 active+clean
      5 active+recovery_unfound+degraded
      3 active+recovery_unfound+degraded+remapped
      2 active+clean+scrubbing+deep
      1 unknown
      1 active+forced_recovery+recovery_unfound+degraded

   progress:
 PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
   []
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The confusing output of ceph df command

2020-09-10 Thread Frank Schilder
We might have the same problem. EC 6+2 on a pool for RBD images on spindles. 
Please see the earlier thread "mimic: much more raw used than reported". In our 
case, this seems to be a problem exclusively for RBD workloads and here, in 
particular, Windows VMs. I see no amplification at all on our ceph fs pool.
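
To put rough numbers on the output quoted below: for the 4+2 pool default-fs-data0, 
374 TiB stored times the expected EC overhead of 1.5 would be about 561 TiB, yet 
USED shows 939 TiB, i.e. roughly a factor of 1.7 more than expected.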

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: norman 
Sent: 10 September 2020 08:34:42
To: ceph-users@ceph.io
Subject: [ceph-users] Re: The confusing output of ceph df command

Has anyone else met the same problem? Using EC instead of replica is supposed to save
space, but now it's worse than replica...

On 9/9/2020 上午7:30, norman kern wrote:
> Hi,
>
> I have changed most of pools from 3-replica to ec 4+2 in my cluster, when I 
> use
> ceph df command to show
>
> the used capactiy of the cluster:
>
> RAW STORAGE:
>   CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
>   hdd        1.8 PiB  788 TiB  1.0 PiB  1.0 PiB       57.22
>   ssd        7.9 TiB  4.6 TiB  181 GiB  3.2 TiB       41.15
>   ssd-cache  5.2 TiB  5.2 TiB  67 GiB   73 GiB         1.36
>   TOTAL      1.8 PiB  798 TiB  1.0 PiB  1.0 PiB       56.99
>
> POOLS:
>   POOL                            ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
>   default-oss.rgw.control          1  0 B            8  0 B          0    1.3 TiB
>   default-oss.rgw.meta             2  22 KiB        97  3.9 MiB      0    1.3 TiB
>   default-oss.rgw.log              3  525 KiB      223  621 KiB      0    1.3 TiB
>   default-oss.rgw.buckets.index    4  33 MiB        34  33 MiB       0    1.3 TiB
>   default-oss.rgw.buckets.non-ec   5  1.6 MiB       48  3.8 MiB      0    1.3 TiB
>   .rgw.root                        6  3.8 KiB       16  720 KiB      0    1.3 TiB
>   default-oss.rgw.buckets.data     7  274 GiB  185.39k  450 GiB   0.14    212 TiB
>   default-fs-metadata              8  488 GiB  153.10M  490 GiB  10.65    1.3 TiB
>   default-fs-data0                 9  374 TiB    1.48G  939 TiB  74.71    212 TiB
>
>  ...
>
> The USED = 3 * STORED in 3-replica mode is completely right, but for EC 4+2 
> pool
> (for default-fs-data0 )
>
> the USED is not equal 1.5 * STORED, why...:(
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-08-31 Thread Frank Schilder
Looks like the image attachment got removed. Please find it here: 
https://imgur.com/a/3tabzCN

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 31 August 2020 14:42
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Dan and Mark,

sorry, took a bit longer. I uploaded a new archive containing files with the 
following format 
(https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - 
valid 60 days):

- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of conversion with --text 
against first dump as base
- osd.195.*.heap_stats - output of ceph daemon osd.195 heap stats, every hour
- osd.195.*.mempools - output of ceph daemon osd.195 dump_mempools, every hour
- osd.195.*.perf - output of ceph daemon osd.195 perf dump, every hour, 
counters are reset

Only for the last couple of days are converted files included, post-conversion 
of everything simply takes too long.
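
For reference, text conversions of this kind can be produced with something like 
(binary and file names are examples):

  pprof --text /usr/bin/ceph-osd osd.195.profile.0008.heap > osd.195.profile.0008.heap.txt
  pprof --text --base=osd.195.profile.0001.heap /usr/bin/ceph-osd osd.195.profile.0008.heap \
      > osd.195.profile.0008.heap-base0001.txt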

Please find also attached a recording of memory usage on one of the relevant 
OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What 
is worrying is the self-amplifying nature of the leak. It's not a linear process; 
it looks at least quadratic if not exponential. Given the comparably short uptime, 
what we are looking at is probably still in the lower percentages, but growing at 
an increasing rate. The OSDs have just started to overrun their limit:

top - 14:38:49 up 155 days, 19:17,  1 user,  load average: 5.99, 4.59, 4.59
Tasks: 684 total,   1 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.9 sy,  0.0 ni, 89.6 id,  7.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 65727628 total,  6937548 free, 41921260 used, 16868820 buff/cache
KiB Swap: 93532160 total, 90199040 free,  120 used.  6740136 avail Mem

PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
4099023 ceph  20   0 5918704   3.8g   9700 S   1.7  6.1 378:37.01 
/usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+
4097639 ceph  20   0 5340924   3.0g  11428 S  87.1  4.7  14636:30 
/usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+
4097974 ceph  20   0 3648188   2.3g   9628 S   8.3  3.6   1375:58 
/usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+
4098322 ceph  20   0 3478980   2.2g   9688 S   5.3  3.6   1426:05 
/usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+
4099374 ceph  20   0 3446784   2.2g   9252 S   4.6  3.5   1142:14 
/usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+
4098679 ceph  20   0 3832140   2.2g   9796 S   6.6  3.5   1248:26 
/usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+
4100782 ceph  20   0 3641608   2.2g   9652 S   7.9  3.5   1278:10 
/usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+
4095944 ceph  20   0 3375672   2.2g   8968 S   7.3  3.5   1250:02 
/usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+
4096956 ceph  20   0 3509376   2.2g   9456 S   7.9  3.5   1157:27 
/usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+
4099731 ceph  20   0 3563652   2.2g   8972 S   3.6  3.5   1421:48 
/usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+
4096262 ceph  20   0 3531988   2.2g   9040 S   9.9  3.5   1600:15 
/usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+
4100442 ceph  20   0 3359736   2.1g   9804 S   4.3  3.4   1185:53 
/usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+
4096617 ceph  20   0 3443060   2.1g   9432 S   5.0  3.4   1449:29 
/usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+
4097298 ceph  20   0 3483532   2.1g   9600 S   5.6  3.3   1265:28 
/usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+
4100093 ceph  20   0 3428348   2.0g   9568 S   3.3  3.2   1298:53 
/usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+
4095630 ceph  20   0 3440160   2.0g   8976 S   3.6  3.2   1451:35 
/usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+

Generally speaking, increasing the cache minimum seems to help with keeping 
important information in RAM. Unfortunately, it also means that swap usage 
starts much earlier.

Best regards and thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread Frank Schilder
I was talking about the on-disk cache, but, yes, the controller cache needs to be 
disabled too. The former can be done with smartctl or hdparm. Check the cache status 
with something like 'smartctl -g wcache /dev/sda' and disable it with something 
like 'smartctl -s wcache,off /dev/sda'.
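
A quick way to do this for all SATA disks at once could look like this (the device 
glob is an example, dry-run it first):

  for d in /dev/sd[a-z]; do smartctl -s wcache,off "$d"; done

On many drive models the setting does not survive a power cycle, which is why people 
usually wire something like this into a udev rule or systemd unit.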

Controller cache needs to be disabled in the BIOS. By the way, if you can't use 
pass-through, you should disable controller cache for every disk, including the 
HDDs. There are cases in the list demonstrating that controller cache enabled 
can lead to data loss on power outage.

As I recommended before, please search the ceph-user list, you will find 
detailed instructions and also links to explanations and typical benchmarks.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 14:44:07
To: Frank Schilder; 'ceph-users@ceph.io'
Subject: AW: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
journals)

We have older LSI RAID controllers with no HBA/JBOD option, so we expose the 
single disks as raid0 devices. Ceph should not be aware of the cache status? 
But digging deeper into it, it seems that 1 out of 4 servers is performing a lot 
better and has super low commit/apply latencies, while the others have a lot more 
(20+) on heavy writes. This only applies to the SSDs. For the HDDs I can't see a 
difference...

-Original Message-
From: Frank Schilder 
Sent: Monday, 31 August 2020 13:19
To: VELARTIS Philipp Dürhammer ; 'ceph-users@ceph.io' 

Subject: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
journals)

Yes, they can - if volatile write cache is not disabled. There are many threads 
on this, also recent. Search for "disable write cache" and/or "disable volatile 
write cache".

You will also find different methods of doing this automatically.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 13:02:45
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
extra journals)

I have a productive 60 osd's cluster. No extra journals. It's performing okay. 
Now I added an extra ssd pool with 16 Micron 5100 MAX, and the performance is a 
little slower than or equal to the 60 hdd pool, for 4K random as well as sequential 
reads. All on a dedicated 2 times 10G network. HDDs are still on filestore, SSDs on 
bluestore. Ceph Luminous.
What should be possible with 16 ssd's vs. 60 hdd's and no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


[ceph-users] How to query status of scheduled commands.

2020-08-31 Thread Frank Schilder
Hi all,

can anyone help me with this? In mimic, for any of these commands:

ceph osd [deep-]scrub ID
ceph pg [deep-]scrub ID
ceph pg repair ID

an operation is scheduled asynchronously. How can I check the following states:

1) Operation is pending (scheduled, not started).
2) Operation is running.
3) Operation has completed.
4) Exit code and error messages if applicable.

Many thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-08-31 Thread Frank Schilder
Yes, they can - if volatile write cache is not disabled. There are many threads 
on this, also recent. Search for "disable write cache" and/or "disable volatile 
write cache".

You will also find different methods of doing this automatically.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: VELARTIS Philipp Dürhammer 
Sent: 31 August 2020 13:02:45
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
extra journals)

I have a productive 60 osd's cluster. No extra journals. It's performing okay. 
Now I added an extra ssd pool with 16 Micron 5100 MAX, and the performance is a 
little slower than or equal to the 60 hdd pool, for 4K random as well as sequential 
reads. All on a dedicated 2 times 10G network. HDDs are still on filestore, SSDs on 
bluestore. Ceph Luminous.
What should be possible with 16 ssd's vs. 60 hdd's and no extra journals?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Frank Schilder
Hi Mark and Dan,

I can generate text files. Can you let me know what you would like to see? 
Without further instructions, I can do a simple conversion and a conversion 
against the first dump as a base. I will upload an archive with converted files 
added tomorrow afternoon.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 August 2020 21:52
To: Frank Schilder; Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,


  I downloaded but haven't had time to get the environment setup yet
either.  It might be better to just generate the txt files if you can.


Thanks!

Mark


On 8/20/20 2:33 AM, Frank Schilder wrote:
> Hi Dan and Mark,
>
> could you please let me know if you can read the files with the version info 
> I provided in my previous e-mail? I'm in the process of collecting data with 
> more FS activity and would like to send it in a format that is useful for 
> investigation.
>
> Right now I'm observing a daily growth of swap of ca. 100-200MB on servers 
> with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS 
> manages to keep enough RAM available. Also the mempool dump still shows onode 
> and data cached at a seemingly reasonable level. Users report a more stable 
> performance of the FS after I increased the cach min sizes on all OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Frank Schilder 
> Sent: 17 August 2020 09:37
> To: Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Hi Dan,
>
> I use the container 
> docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I 
> can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, 
> it's a CentOS 7 build. The version is:
>
> # ceph -v
> ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
>
> On CentOS, the profiler packages are named differently, without the "google-" 
> prefix. The version I have installed is
>
> # pprof --version
> pprof (part of gperftools 2.0)
>
> Copyright 1998-2007 Google Inc.
>
> This is BSD licensed software; see the source for copying conditions
> and license information.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
>
> It is possible to install pprof inside this container and analyse the 
> *.heap-files I provided.
>
> If this doesn't work for you and you want me to generate the text output for 
> heap-files, I can do that. Please let me know if I should do all files and 
> with what option (eg. against a base etc.).
>
> Best regards,
> =====
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 14 August 2020 10:38:57
> To: Frank Schilder
> Cc: Mark Nelson; ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> Hi Frank,
>
> I'm having trouble getting the exact version of ceph you used to
> create this heap profile.
> Could you run the google-pprof --text steps at [1] and share the output?
>
> Thanks, Dan
>
> [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
>
> On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
>> Hi Mark,
>>
>> here is a first collection of heap profiling data (valid 30 days):
>>
>> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
>>
>> This was collected with the following config settings:
>>
>>   osd  dev    osd_memory_cache_min  805306368
>>   osd  basic  osd_memory_target     2147483648
>>
>> Setting the cache_min value seems to help keeping cache space available. 
>> Unfortunately, the above collection is for 12 days only. I needed to restart 
>> the OSD and will need to restart it soon again. I hope I can then run a 
>> longer sample. The profiling does cause slow ops though.
>>
>> Maybe you can see something already? It seems to have collected some leaked 
>> memory. Unfortunately, it was a period of extremely low load. Basically, 
>> with the day of recording the utilization dropped to almost zero.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Frank Schilder 
>> Sent: 21 July 2020 12:57:32
>> To: Mark Nelson; Dan van

[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Frank Schilder
Hi Dan,

no worries. I checked and osd_map_dedup is set to true, the default value.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 20 August 2020 09:41
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

I didn't get time yet. On our side, I was planning to see if the issue
persists after upgrading to v14.2.11 -- it includes some updates to
how the osdmap is referenced across OSD.cc.

BTW, do you happen to have osd_map_dedup set to false? We do, and that
surely increases the osdmap memory usage somewhat.

-- Dan



-- Dan

On Thu, Aug 20, 2020 at 9:33 AM Frank Schilder  wrote:
>
> Hi Dan and Mark,
>
> could you please let me know if you can read the files with the version info 
> I provided in my previous e-mail? I'm in the process of collecting data with 
> more FS activity and would like to send it in a format that is useful for 
> investigation.
>
> Right now I'm observing a daily growth of swap of ca. 100-200MB on servers 
> with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS 
> manages to keep enough RAM available. Also the mempool dump still shows onode 
> and data cached at a seemingly reasonable level. Users report a more stable 
> performance of the FS after I increased the cach min sizes on all OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 17 August 2020 09:37
> To: Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Hi Dan,
>
> I use the container 
> docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I 
> can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, 
> it's a CentOS 7 build. The version is:
>
> # ceph -v
> ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
>
> On CentOS, the profiler packages are named differently, without the "google-" 
> prefix. The version I have installed is
>
> # pprof --version
> pprof (part of gperftools 2.0)
>
> Copyright 1998-2007 Google Inc.
>
> This is BSD licensed software; see the source for copying conditions
> and license information.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
>
> It is possible to install pprof inside this container and analyse the 
> *.heap-files I provided.
>
> If this doesn't work for you and you want me to generate the text output for 
> heap-files, I can do that. Please let me know if I should do all files and 
> with what option (eg. against a base etc.).
>
> Best regards,
> =====
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 14 August 2020 10:38:57
> To: Frank Schilder
> Cc: Mark Nelson; ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> Hi Frank,
>
> I'm having trouble getting the exact version of ceph you used to
> create this heap profile.
> Could you run the google-pprof --text steps at [1] and share the output?
>
> Thanks, Dan
>
> [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
>
> On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
> >
> > Hi Mark,
> >
> > here is a first collection of heap profiling data (valid 30 days):
> >
> > https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
> >
> > This was collected with the following config settings:
> >
>   osd  dev    osd_memory_cache_min  805306368
>   osd  basic  osd_memory_target     2147483648
> >
> > Setting the cache_min value seems to help keeping cache space available. 
> > Unfortunately, the above collection is for 12 days only. I needed to 
> > restart the OSD and will need to restart it soon again. I hope I can then 
> > run a longer sample. The profiling does cause slow ops though.
> >
> > Maybe you can see something already? It seems to have collected some leaked 
> > memory. Unfortunately, it was a period of extremely low load. Basically, 
> > with the day of recording the utilization dropped to almost zero.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Frank Schilder 
> > Sent: 21 July 

[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Frank Schilder
Hi Dan and Mark,

could you please let me know if you can read the files with the version info I 
provided in my previous e-mail? I'm in the process of collecting data with more 
FS activity and would like to send it in a format that is useful for 
investigation.

Right now I'm observing a daily growth of swap of ca. 100-200MB on servers with 
16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS manages to 
keep enough RAM available. Also the mempool dump still shows onode and data 
cached at a seemingly reasonable level. Users report a more stable performance 
of the FS after I increased the cache min sizes on all OSDs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 17 August 2020 09:37
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Dan,

I use the container 
docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can 
see, it uses the packages from http://download.ceph.com/rpm-mimic/el7; it's a 
CentOS 7 build. The version is:

# ceph -v
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)

On CentOS, the profiler packages are named differently, without the "google-" 
prefix. The version I have installed is

# pprof --version
pprof (part of gperftools 2.0)

Copyright 1998-2007 Google Inc.

This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

It is possible to install pprof inside this container and analyse the 
*.heap-files I provided.

If this doesn't work for you and you want me to generate the text output for 
heap-files, I can do that. Please let me know if I should do all files and with 
what option (eg. against a base etc.).

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 14 August 2020 10:38:57
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

I'm having trouble getting the exact version of ceph you used to
create this heap profile.
Could you run the google-pprof --text steps at [1] and share the output?

Thanks, Dan

[1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/


On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
>
> Hi Mark,
>
> here is a first collection of heap profiling data (valid 30 days):
>
> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
>
> This was collected with the following config settings:
>
>   osd  dev    osd_memory_cache_min  805306368
>   osd  basic  osd_memory_target     2147483648
>
> Setting the cache_min value seems to help keeping cache space available. 
> Unfortunately, the above collection is for 12 days only. I needed to restart 
> the OSD and will need to restart it soon again. I hope I can then run a 
> longer sample. The profiling does cause slow ops though.
>
> Maybe you can see something already? It seems to have collected some leaked 
> memory. Unfortunately, it was a period of extremely low load. Basically, with 
> the day of recording the utilization dropped to almost zero.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 21 July 2020 12:57:32
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Quick question: Is there a way to change the frequency of heap dumps? On this 
> page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function 
> HeapProfilerSetAllocationInterval() is mentioned, but no other way of 
> configuring this. Is there a config parameter or a ceph daemon call to adjust 
> this?
>
> If not, can I change the dump path?
>
> It's likely to overrun my log partition quickly if I cannot adjust either of 
> the two.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 20 July 2020 15:19:05
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Dear Mark,
>
> thank you very much for the very helpful answers. I will raise 
> osd_memory_cache_min, leave everything else alone and watch what happens. I 
> will report back here.
>
> Thanks also for raising this as an issue.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, ru

[ceph-users] Re: OSD memory leak?

2020-08-17 Thread Frank Schilder
Hi Dan,

I use the container 
docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can 
see, it uses the packages from http://download.ceph.com/rpm-mimic/el7; it's a 
CentOS 7 build. The version is:

# ceph -v
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)

On CentOS, the profiler packages are named differently, without the "google-" 
prefix. The version I have installed is

# pprof --version
pprof (part of gperftools 2.0)

Copyright 1998-2007 Google Inc.

This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

It is possible to install pprof inside this container and analyse the 
*.heap-files I provided.

If this doesn't work for you and you want me to generate the text output for 
heap-files, I can do that. Please let me know if I should do all files and with 
what option (eg. against a base etc.).

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 14 August 2020 10:38:57
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

I'm having trouble getting the exact version of ceph you used to
create this heap profile.
Could you run the google-pprof --text steps at [1] and share the output?

Thanks, Dan

[1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/


On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
>
> Hi Mark,
>
> here is a first collection of heap profiling data (valid 30 days):
>
> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
>
> This was collected with the following config settings:
>
>   osd  dev    osd_memory_cache_min  805306368
>   osd  basic  osd_memory_target     2147483648
>
> Setting the cache_min value seems to help keeping cache space available. 
> Unfortunately, the above collection is for 12 days only. I needed to restart 
> the OSD and will need to restart it soon again. I hope I can then run a 
> longer sample. The profiling does cause slow ops though.
>
> Maybe you can see something already? It seems to have collected some leaked 
> memory. Unfortunately, it was a period of extremely low load. Basically, with 
> the day of recording the utilization dropped to almost zero.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 21 July 2020 12:57:32
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Quick question: Is there a way to change the frequency of heap dumps? On this 
> page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function 
> HeapProfilerSetAllocationInterval() is mentioned, but no other way of 
> configuring this. Is there a config parameter or a ceph daemon call to adjust 
> this?
>
> If not, can I change the dump path?
>
> It's likely to overrun my log partition quickly if I cannot adjust either of 
> the two.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 20 July 2020 15:19:05
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Dear Mark,
>
> thank you very much for the very helpful answers. I will raise 
> osd_memory_cache_min, leave everything else alone and watch what happens. I 
> will report back here.
>
> Thanks also for raising this as an issue.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Mark Nelson 
> Sent: 20 July 2020 15:08:11
> To: Frank Schilder; Dan van der Ster
> Cc: ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> On 7/20/20 3:23 AM, Frank Schilder wrote:
> > Dear Mark and Dan,
> >
> > I'm in the process of restarting all OSDs and could use some quick advice 
> > on bluestore cache settings. My plan is to set higher minimum values and 
> > deal with accumulated excess usage via regular restarts. Looking at the 
> > documentation 
> > (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/),
> >  I find the following relevant options (with defaults):
> >
> > # Automatic Cache Sizing
> > osd_memory_target {4294967296} # 4GB
> > osd_memory_base {805306368} # 768MB
> > 

[ceph-users] Re: OSD memory leak?

2020-08-11 Thread Frank Schilder
Hi Mark,

here is a first collection of heap profiling data (valid 30 days):

https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l

This was collected with the following config settings:

  osd  dev  osd_memory_cache_min  805306368
  osd  basic    osd_memory_target     2147483648

Setting the cache_min value seems to help keep cache space available. 
Unfortunately, the above collection is for 12 days only. I needed to restart 
the OSD and will need to restart it soon again. I hope I can then run a longer 
sample. The profiling does cause slow ops though.

Maybe you can see something already? It seems to have collected some leaked 
memory. Unfortunately, it was a period of extremely low load. Basically, with 
the day of recording the utilization dropped to almost zero.
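
For anyone wanting to reproduce this setup, the two values above can be applied
via the config database (a sketch, assuming Mimic or later where "ceph config
set" is available):

  ceph config set osd osd_memory_cache_min 805306368
  ceph config set osd osd_memory_target   2147483648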

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 21 July 2020 12:57:32
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Quick question: Is there a way to change the frequency of heap dumps? On this 
page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function 
HeapProfilerSetAllocationInterval() is mentioned, but no other way of 
configuring this. Is there a config parameter or a ceph daemon call to adjust 
this?

If not, can I change the dump path?

It's likely to overrun my log partition quickly if I cannot adjust either of the 
two.
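
Not verified, but gperftools itself exposes environment variables for both:
HEAP_PROFILE_ALLOCATION_INTERVAL controls the dump frequency (bytes allocated
between dumps) and HEAPPROFILE sets the dump path prefix. Whether a profiler
started via "ceph tell ... heap start_profiler" honours them would need testing.
A sketch for a systemd-managed OSD (id and path are examples):

  systemctl edit ceph-osd@145
      # in the drop-in:
      # [Service]
      # Environment=HEAP_PROFILE_ALLOCATION_INTERVAL=10737418240
      # Environment=HEAPPROFILE=/srv/heapdumps/osd.145
  systemctl restart ceph-osd@145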

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 20 July 2020 15:19:05
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Dear Mark,

thank you very much for the very helpful answers. I will raise 
osd_memory_cache_min, leave everything else alone and watch what happens. I 
will report back here.

Thanks also for raising this as an issue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 July 2020 15:08:11
To: Frank Schilder; Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

On 7/20/20 3:23 AM, Frank Schilder wrote:
> Dear Mark and Dan,
>
> I'm in the process of restarting all OSDs and could use some quick advice on 
> bluestore cache settings. My plan is to set higher minimum values and deal 
> with accumulated excess usage via regular restarts. Looking at the 
> documentation 
> (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), 
> I find the following relevant options (with defaults):
>
> # Automatic Cache Sizing
> osd_memory_target {4294967296} # 4GB
> osd_memory_base {805306368} # 768MB
> osd_memory_cache_min {134217728} # 128MB
>
> # Manual Cache Sizing
> bluestore_cache_meta_ratio {.4} # 40% ?
> bluestore_cache_kv_ratio {.4} # 40% ?
> bluestore_cache_kv_max {512 * 1024*1024} # 512MB
>
> Q1) If I increase osd_memory_cache_min, should I also increase 
> osd_memory_base by the same or some other amount?


osd_memory_base is a hint at how much memory the OSD could consume
outside the cache once it's reached steady state.  It basically sets a
hard cap on how much memory the cache will use to avoid over-committing
memory and thrashing when we exceed the memory limit. It's not necessary
to get it right, it just helps smooth things out by making the automatic
memory tuning less aggressive.  IE if you have a 2 GB memory target and
a 512MB base, you'll never assign more than 1.5GB to the cache on the
assumption that the rest of the OSD will eventually need 512MB to
operate even if it's not using that much right now.  I think you can
probably just leave it alone.  What you and Dan appear to be seeing is
that this number isn't static in your case but increases over time any
way.  Eventually I'm hoping that we can automatically account for more
and more of that memory by reading the data from the mempools.
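
(The mempool accounting mentioned above can already be inspected on a running
OSD via the admin socket, e.g.

  ceph daemon osd.145 dump_mempools

with osd.145 being an example id; the output is JSON with per-pool item and
byte counts.)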

> Q2) The cache ratio options are shown under the section "Manual Cache 
> Sizing". Do they also apply when cache auto tuning is enabled? If so, is it 
> worth changing these defaults for higher values of osd_memory_cache_min?


They actually do have an effect on the automatic cache sizing and
probably shouldn't only be under the manual section.  When you have the
automatic cache sizing enabled, those options will affect the "fair
share" values of the different caches at each cache priority level.  IE
at priority level 0, if both caches want more memory than is available,
those ratios will determine how much each cache gets.  If there is more
memory available than requested, each cache gets as much as they want
and we move on to the next priority level and do the s

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-04 Thread Frank Schilder
If with monitor log you mean the cluster log /var/log/ceph/ceph.log, I should 
have all of it. Please find a tgz-file here: 
https://files.dtu.dk/u/tFCEZJzQhH2mUIRk/logs.tgz?l (valid 100 days).

Contents:

logs/ceph-2020-08-03.log  -  cluster log for the day of restart
logs/ceph-osd.145.2020-08-03.log  -  log of "old" OSD trimmed to day of restart
logs/ceph-osd.288.log  -  entire log of "new" OSD

Hope this helps.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 14:15:11
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Do you have any monitor / OSD logs from the maintenance when the issues 
occurred?


 Original message ----
From: Frank Schilder 
Date: 8/4/20 8:07 AM (GMT-05:00)
To: Eric Smith , ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for the clarification, I did misunderstand you.

> You should not have to move OSDs in and out of the CRUSH tree however
> in order to solve any data placement problems (This is the baffling part).

Exactly. Should I create a tracker issue? I think this is not hard to reproduce 
with a standard crush map where host-bucket=physical host and I would, in fact, 
expect that this scenario is part of the integration test.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 13:58:47
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order in terms of your CRUSH layout. You can speed up the 
rebalancing / scale-out operations by increasing the osd_max_backfills on each 
OSD (Especially during off hours). The unnecessary degradation is not expected 
behavior with a cluster in HEALTH_OK status, but with backfill / rebalancing 
ongoing it's not unexpected. You should not have to move OSDs in and out of the 
CRUSH tree however in order to solve any data placement problems (This is the 
baffling part).
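
For reference, two common ways to raise that setting at runtime (the value is
only an example; increase it gradually and watch client latency):

  ceph tell osd.* injectargs '--osd_max_backfills 4'
  ceph config set osd osd_max_backfills 4        # persistent, Mimic and later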

-Original Message-
From: Frank Schilder 
Sent: Tuesday, August 4, 2020 7:45 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Erik,

I added the disks and started the rebalancing. When I ran into the issue, ca. 3 
days after start of rebalancing, it was about 25% done. The cluster does not go 
to HEALTH_OK before the rebalancing is finished, it shows the "xxx objects 
misplaced" warning. The OSD crush locations for the logical hosts are in 
ceph.conf, the OSDs come up in the proper crush bucket.
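
For illustration only (the exact option name and syntax should be checked
against the docs for your release), such a pinned location in ceph.conf looks
roughly like this:

  [osd.288]
      crush location = datacenter=ServerRoom host=ceph-04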

> All seems in order then

In what sense?

The rebalancing is still ongoing and usually a very long operation. This time I 
added only 9 disks, but we will almost triple the number of disks of a larger 
pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this 
expansion to take months. Due to a memory leak, I need to restart OSDs 
regularly. Also, a host may restart or we might have a power outage during this 
window. In these situations, it will be a real pain if I have to play the crush 
move game with 300+ OSDs.

This unnecessary redundancy degradation on OSD restart cannot possibly be 
expected behaviour, or do I misunderstand something here?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order then - when you ran into your maintenance issue, how long 
was it after you added the new OSDs and did Ceph ever get to HEALTH_OK so it 
could trim PG history? Also did the OSDs just start back up in the wrong place 
in the CRUSH tree?

-Original Message-
From: Frank Schilder 
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

> Have you adjusted the min_size for pool sr-rbd-data-one-hdd

Yes. For all EC pools located in datacenter ServerRoom, we currently set 
min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are 
logical but not physical, disks in these buckets are co-located such that no 
more than 2 host buckets share the same physical host. With failure domain = 
host, we can ensure that no more than 2 EC shards are on the same physical 
host. With m=2 and min_size=k we have continued service with any 1 physical 
host down for maintenance and also recovery will happen if a physical host 
fails. Some objects will have no redundancy for a while then. We will increase 
min_size to k+1 as soon as we have 2 additional hosts and simply move the OSDs 
from buckets ceph-21/22 to these without rebalancing.
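
For reference, raising it later should be a one-liner per pool, e.g.:

  ceph osd pool set sr-rbd-data-one-hdd min_size 7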

The distribution of disks and buckets is listed below as well (longer listing).

Thanks and best regar

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-04 Thread Frank Schilder
Hi Eric,

thanks for the clarification, I did misunderstand you.

> You should not have to move OSDs in and out of the CRUSH tree however
> in order to solve any data placement problems (This is the baffling part).

Exactly. Should I create a tracker issue? I think this is not hard to reproduce 
with a standard crush map where host-bucket=physical host and I would, in fact, 
expect that this scenario is part of the integration test.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 13:58:47
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order in terms of your CRUSH layout. You can speed up the 
rebalancing / scale-out operations by increasing the osd_max_backfills on each 
OSD (Especially during off hours). The unnecessary degradation is not expected 
behavior with a cluster in HEALTH_OK status, but with backfill / rebalancing 
ongoing it's not unexpected. You should not have to move OSDs in and out of the 
CRUSH tree however in order to solve any data placement problems (This is the 
baffling part).

-Original Message-
From: Frank Schilder 
Sent: Tuesday, August 4, 2020 7:45 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Erik,

I added the disks and started the rebalancing. When I ran into the issue, ca. 3 
days after start of rebalancing, it was about 25% done. The cluster does not go 
to HEALTH_OK before the rebalancing is finished, it shows the "xxx objects 
misplaced" warning. The OSD crush locations for the logical hosts are in 
ceph.conf, the OSDs come up in the proper crush bucket.

> All seems in order then

In what sense?

The rebalancing is still ongoing and usually a very long operation. This time I 
added only 9 disks, but we will almost triple the number of disks of a larger 
pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this 
expansion to take months. Due to a memory leak, I need to restart OSDs 
regularly. Also, a host may restart or we might have a power outage during this 
window. In these situations, it will be a real pain if I have to play the crush 
move game with 300+ OSDs.

This unnecessary redundancy degradation on OSD restart cannot possibly be 
expected behaviour, or do I misunderstand something here?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order then - when you ran into your maintenance issue, how long 
was it after you added the new OSDs and did Ceph ever get to HEALTH_OK so it 
could trim PG history? Also did the OSDs just start back up in the wrong place 
in the CRUSH tree?

-Original Message-
From: Frank Schilder 
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

> Have you adjusted the min_size for pool sr-rbd-data-one-hdd

Yes. For all EC pools located in datacenter ServerRoom, we currently set 
min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are 
logical but not physical, disks in these buckets are co-located such that no 
more than 2 host buckets share the same physical host. With failure domain = 
host, we can ensure that no more than 2 EC shards are on the same physical 
host. With m=2 and min_size=k we have continued service with any 1 physical 
host down for maintenance and also recovery will happen if a physical host 
fails. Some objects will have no redundancy for a while then. We will increase 
min_size to k+1 as soon as we have 2 additional hosts and simply move the OSDs 
from buckets ceph-21/22 to these without rebalancing.

The distribution of disks and buckets is listed below as well (longer listing).

Thanks and best regards,
Frank

# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd

This is the relevant one:

# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Note that the pool sr-rbd-data-one (id 2) was created with this profile and 
later moved to SSD. Therefore, the crush rule does not match the profile's 
device class any more.

These two are under different roots:

# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-04 Thread Frank Schilder
Hi Erik,

I added the disks and started the rebalancing. When I ran into the issue, ca. 3 
days after start of rebalancing, it was about 25% done. The cluster does not go 
to HEALTH_OK before the rebalancing is finished, it shows the "xxx objects 
misplaced" warning. The OSD crush locations for the logical hosts are in 
ceph.conf, the OSDs come up in the proper crush bucket.

> All seems in order then

In what sense?

The rebalancing is still ongoing and usually a very long operation. This time I 
added only 9 disks, but we will almost triple the number of disks of a larger 
pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this 
expansion to take months. Due to a memory leak, I need to restart OSDs 
regularly. Also, a host may restart or we might have a power outage during this 
window. In these situations, it will be a real pain if I have to play the crush 
move game with 300+ OSDs.

This unnecessary redundancy degradation on OSD restart cannot possibly be 
expected behaviour, or do I misunderstand something here?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order then - when you ran into your maintenance issue, how long 
was it after you added the new OSDs and did Ceph ever get to HEALTH_OK so it 
could trim PG history? Also did the OSDs just start back up in the wrong place 
in the CRUSH tree?

-Original Message-
From: Frank Schilder 
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

> Have you adjusted the min_size for pool sr-rbd-data-one-hdd

Yes. For all EC pools located in datacenter ServerRoom, we currently set 
min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are 
logical but not physical, disks in these buckets are co-located such that no 
more than 2 host buckets share the same physical host. With failure domain = 
host, we can ensure that no more than 2 EC shards are on the same physical 
host. With m=2 and min_size=k we have continued service with any 1 physical 
host down for maintenance and also recovery will happen if a physical host 
fails. Some objects will have no redundancy for a while then. We will increase 
min_size to k+1 as soon as we have 2 additional hosts and simply move the OSDs 
from buckets ceph-21/22 to these without rebalancing.

The distribution of disks and buckets is listed below as well (longer listing).

Thanks and best regards,
Frank

# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd

This is the relevant one:

# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Note that the pool sr-rbd-data-one (id 2) was created with this profile and 
later moved to SSD. Therefore, the crush rule does not match the profile's 
device class any more.

These two are under different roots:

# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8


Full physical placement information for OSDs under tree "datacenter ServerRoom":


ceph-04

CONT        ID  BUCKET     SIZE  TYP
osd-phy0   243  ceph-04  1.8T  SSD
osd-phy1   247  ceph-21  1.8T  SSD
osd-phy2   254  ceph-04  1.8T  SSD
osd-phy3   256  ceph-04  1.8T  SSD
osd-phy4   286  ceph-04  1.8T  SSD
osd-phy5   287  ceph-04  1.8T  SSD
osd-phy6   288  ceph-04 10.7T  HDD
osd-phy7    48  ceph-04  372.6G  SSD
osd-phy8   264  ceph-21  1.8T  SSD
osd-phy9    84  ceph-04    8.9T  HDD
osd-phy10   72  ceph-21  8.9T  HDD
osd-phy11  145  ceph-04  8.9T  HDD
osd-phy14  156  ceph-04  8.9T  HDD
osd-phy15  168  ceph-04  8.9T  HDD
osd-phy16  181  ceph-04  8.9T  HDD
osd-phy17    0  ceph-21    8.9T  HDD

ceph-05

CONT        ID  BUCKET     SIZE  TYP
osd-phy0   240  ceph-05  1.8T  SSD
osd-phy1   249  ceph-22  1.8T  SSD
osd-phy2   251  ceph-05  1.8T  SSD
osd-phy3   255  ceph-05  1.8T  SSD
osd-phy4   284  ceph-05  1.8T  SSD
osd-phy5   285  ceph-05  1.8T  SSD
osd-phy6   289  ceph-05 10.7T  HDD
osd-phy7    49  ceph-05  372.6G  SSD
osd-phy8   265  ceph-22  1.

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-04 Thread Frank Schilder
8.9T  HDD
osd-phy16  183  ceph-07  8.9T  HDD
osd-phy17    3  ceph-22    8.9T  HDD

ceph-18

CONT        ID  BUCKET     SIZE  TYP
osd-phy0   241  ceph-18  1.8T  SSD
osd-phy1   248  ceph-18  1.8T  SSD
osd-phy2    41  ceph-18  372.6G  SSD
osd-phy3    31  ceph-18  372.6G  SSD
osd-phy4   277  ceph-18  1.8T  SSD
osd-phy5   278  ceph-21  1.8T  SSD
osd-phy6    53  ceph-21  372.6G  SSD
osd-phy7   267  ceph-18  1.8T  SSD
osd-phy8   266  ceph-18  1.8T  SSD
osd-phy9   293  ceph-18 10.7T  HDD
osd-phy10   86  ceph-21  8.9T  HDD
osd-phy11  259  ceph-18 10.9T  HDD
osd-phy14  229  ceph-18  8.9T  HDD
osd-phy15  232  ceph-18  8.9T  HDD
osd-phy16  235  ceph-18  8.9T  HDD
osd-phy17  238  ceph-18  8.9T  HDD

ceph-19

CONT        ID  BUCKET     SIZE  TYP
osd-phy0   261  ceph-19  1.8T  SSD
osd-phy1   262  ceph-19  1.8T  SSD
osd-phy2   295  ceph-19 10.7T  HDD
osd-phy3    43  ceph-19  372.6G  SSD
osd-phy4   275  ceph-19  1.8T  SSD
osd-phy5   294  ceph-22 10.7T  HDD
osd-phy6    51  ceph-22  372.6G  SSD
osd-phy7   269  ceph-19  1.8T  SSD
osd-phy8   268  ceph-19  1.8T  SSD
osd-phy9   276  ceph-22  1.8T  SSD
osd-phy10   73  ceph-22  8.9T  HDD
osd-phy11  263  ceph-19 10.9T  HDD
osd-phy14  231  ceph-19  8.9T  HDD
osd-phy15  233  ceph-19  8.9T  HDD
osd-phy16  236  ceph-19  8.9T  HDD
osd-phy17  239  ceph-19  8.9T  HDD

ceph-20

CONT        ID  BUCKET     SIZE  TYP
osd-phy0   245  ceph-20  1.8T  SSD
osd-phy1    28  ceph-20  372.6G  SSD
osd-phy2    44  ceph-20  372.6G  SSD
osd-phy3   271  ceph-20  1.8T  SSD
osd-phy4   272  ceph-20  1.8T  SSD
osd-phy5   273  ceph-20  1.8T  SSD
osd-phy6   274  ceph-21  1.8T  SSD
osd-phy7   296  ceph-20 10.7T  HDD
osd-phy8    76  ceph-21    8.9T  HDD
osd-phy9    39  ceph-21  372.6G  SSD
osd-phy10  270  ceph-20  1.8T  SSD
osd-phy11  260  ceph-20 10.9T  HDD
osd-phy14  228  ceph-20  8.9T  HDD
osd-phy15  230  ceph-20  8.9T  HDD
osd-phy16  234  ceph-20  8.9T  HDD
osd-phy17  237  ceph-20  8.9T  HDD

CONT is the container name and encodes the physical slot on the host where the 
OSD is located.

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 04 August 2020 12:47:12
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Have you adjusted the min_size for pool sr-rbd-data-one-hdd at all? Also can 
you send the output of "ceph osd erasure-code-profile ls" and for each EC 
profile, "ceph osd erasure-code-profile get "?

-Original Message-
From: Frank Schilder 
Sent: Monday, August 3, 2020 11:05 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Sorry for the many small e-mails: requested IDs in the commands, 288-296. One 
new OSD per host.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart

Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance

ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was
# only little I/O, no more than 50-100 objects had writes missing and recovery
# was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking
# space and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as 
expected.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs ou

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-03 Thread Frank Schilder
Sorry for the many small e-mails: requested IDs in the commands, 288-296. One 
new OSD per host.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart

Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance

ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was 
only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking 
space
# and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as 
expected.
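
For completeness, the flag mentioned in the comments above is set and cleared
with the usual commands:

  ceph osd set norebalance
  ceph osd unset norebalance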

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs out and back in for Ceph to go back to 
normal (The OSDs you added). Which OSDs were added?

-Original Message-
From: Frank Schilder 
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below the output, shortened a bit as indicated. 
Disks have been added to pool 11 'sr-rbd-data-one-hdd' only, this is the only 
pool with remapped PGs and is also the only pool experiencing the "loss of 
track" to objects. Every other pool recovers from restart by itself.

Best regards,
Frank


# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr

pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr

pool sr-rbd-one-stretch id 3
  nothing is going on

pool con-rbd-meta-hpc-one id 7
  nothing is going on

pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr

pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr

pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr

pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr

pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr

pool con-fs2-data-ec-ssd id 17
  nothing is going on

pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr


# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash 
rjenkins pg_num 80 pgp_num 80 last_change 122597 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
application rbd
removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags 
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 
stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash 
rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 
compression_mode aggressive application rbd
removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash 
rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 
application rbd
removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 
object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 
flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 
5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive 
application rbd
removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' eras

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-03 Thread Frank Schilder
Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance

ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was 
only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking 
space
# and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as 
expected.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs out and back in for Ceph to go back to 
normal (The OSDs you added). Which OSDs were added?

-Original Message-
From: Frank Schilder 
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith ; ceph-users 
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below the output, shortened a bit as indicated. 
Disks have been added to pool 11 'sr-rbd-data-one-hdd' only, this is the only 
pool with remapped PGs and is also the only pool experiencing the "loss of 
track" to objects. Every other pool recovers from restart by itself.

Best regards,
Frank


# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr

pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr

pool sr-rbd-one-stretch id 3
  nothing is going on

pool con-rbd-meta-hpc-one id 7
  nothing is going on

pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr

pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr

pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr

pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr

pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr

pool con-fs2-data-ec-ssd id 17
  nothing is going on

pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr


# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash 
rjenkins pg_num 80 pgp_num 80 last_change 122597 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
application rbd
removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags 
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 
stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash 
rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 
compression_mode aggressive application rbd
removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash 
rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 
application rbd
removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 
object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 
flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 
5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive 
application rbd
removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 
object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 
flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 
21990232200 stripe_width 24576 fast_read 1 compression_mode aggressive 
application rbd
removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1]
removed_snaps_q

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-03 Thread Frank Schilder
As a side effect of the restart, the mon leader also sees blocked ops that never 
get cleared. I need to restart the mon daemon:

  cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
noout,norebalance flag(s) set
53242005/1492479251 objects misplaced (3.567%)
Long heartbeat ping times on back interface seen, longest is 
13854.181 msec
Long heartbeat ping times on front interface seen, longest is 
13737.799 msec
1 pools nearfull
129 slow ops, oldest one blocked for 1699 sec, mon.ceph-01 has slow 
ops

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
 flags noout,norebalance

  data:
pools:   11 pools, 3215 pgs
objects: 177.4 M objects, 489 TiB
usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53242005/1492479251 objects misplaced (3.567%)
 2904 active+clean
 294  active+remapped+backfill_wait
 13   active+remapped+backfilling
 4active+clean+scrubbing+deep

  io:
client:   120 MiB/s rd, 50 MiB/s wr, 1.15 kop/s rd, 745 op/s wr

Sample OPS:

# ceph daemon mon.ceph-01 ops | grep -e description -e num_ops
"description": "osd_failure(failed timeout osd.241 
192.168.32.82:6814/2178578 for 38sec e186626 v186626)",
"description": "osd_failure(failed timeout osd.243 
192.168.32.68:6814/3358340 for 37sec e186626 v186626)",
[...]
"description": "osd_failure(failed timeout osd.286 
192.168.32.68:6806/3354298 for 37sec e186764 v186764)",
"description": "osd_failure(failed timeout osd.287 
192.168.32.68:6804/3353324 for 37sec e186764 v186764)",
"num_ops": 129

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 03 August 2020 15:54:48
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below the output, shortened a bit as indicated. 
Disks have been added to pool 11 'sr-rbd-data-one-hdd' only, this is the only 
pool with remapped PGs and is also the only pool experiencing the "loss of 
track" to objects. Every other pool recovers from restart by itself.

Best regards,
Frank


# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr

pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr

pool sr-rbd-one-stretch id 3
  nothing is going on

pool con-rbd-meta-hpc-one id 7
  nothing is going on

pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr

pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr

pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr

pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr

pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr

pool con-fs2-data-ec-ssd id 17
  nothing is going on

pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr


# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash 
rjenkins pg_num 80 pgp_num 80 last_change 122597 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
application rbd
removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags 
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 
stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 
object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 
flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 
stripe_width 0 compression_mode aggressive application rbd
removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 
object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags 
hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 
application rbd
removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 
object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 
flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 
5497558138880 stripe_width 32768 fast

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-03 Thread Frank Schilder
 
 294  hdd        10.69229  osd.294  up  1.0  1.0
 249  rbd_data    1.74599  osd.249  up  1.0  1.0
 250  rbd_data    1.74599  osd.250  up  1.0  1.0
 265  rbd_data    1.74599  osd.265  up  1.0  1.0
 276  rbd_data    1.74599  osd.276  up  1.0  1.0
 281  rbd_data    1.74599  osd.281  up  1.0  1.0
  51  rbd_meta    0.36400  osd.51   up  1.0  1.0


# ceph osd crush rule dump # crush rules outside tree under "datacenter 
ServerRoom" removed for brevity
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 5,
"rule_name": "sr-rbd-data-one",
"ruleset": 5,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 50
},
{
"op": "set_choose_tries",
"num": 1000
},
{
"op": "take",
"item": -185,
"item_name": "ServerRoom~rbd_data"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 9,
"rule_name": "sr-rbd-data-one-hdd",
"ruleset": 9,
"type": 3,
"min_size": 3,
"max_size": 8,
"steps": [
    {
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -53,
    "item_name": "ServerRoom~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]


=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eric Smith 
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Can you post the output of these commands:

ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump


-Original Message-
From: Frank Schilder 
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users 
Subject: [ceph-users] Re: Ceph does not recover from OSD restart

After moving the newly added OSDs out of the crush tree and back in again, I 
get to exactly what I want to see:

  cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
norebalance,norecover flag(s) set
53030026/1492404361 objects misplaced (3.553%)
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
 flags norebalance,norecover

  data:
pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53030026/1492404361 objects misplaced (3.553%)
 2902 active+clean
 299  active+remapped+backfill_wait
 8active+remapped+backfilling
 5active+clean+scrubbing+deep
 1active+clean+snaptrim

  io:
client:   69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster wi

[ceph-users] Re: Ceph does not recover from OSD restart

2020-08-03 Thread Frank Schilder
After moving the newly added OSDs out of the crush tree and back in again, I 
get to exactly what I want to see:

  cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
norebalance,norecover flag(s) set
53030026/1492404361 objects misplaced (3.553%)
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs
 flags norebalance,norecover

  data:
pools:   11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 53030026/1492404361 objects misplaced (3.553%)
 2902 active+clean
 299  active+remapped+backfill_wait
 8active+remapped+backfilling
 5active+clean+scrubbing+deep
 1active+clean+snaptrim

  io:
client:   69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster with remapped PGs not survive OSD restarts without losing 
track of objects?
Why is it not finding the objects by itself?

A power outage of 3 hosts will halt everything for no reason until manual 
intervention. How can I avoid this problem?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart

Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster 
was in a state of re-balancing after adding disks to each host. Before restart 
I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted 
all OSDs of one host and the cluster does not recover from that:

  cluster:
id: xxx
health: HEALTH_ERR
45813194/1492348700 objects misplaced (3.070%)
Degraded data redundancy: 6798138/1492348700 objects degraded 
(0.456%), 85 pgs degraded, 86 pgs undersized
Degraded data redundancy (low space): 17 pgs backfill_toofull
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
pools:   11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 6798138/1492348700 objects degraded (0.456%)
 45813194/1492348700 objects misplaced (3.070%)
 2903 active+clean
 209  active+remapped+backfill_wait
 73   active+undersized+degraded+remapped+backfill_wait
 9active+remapped+backfill_wait+backfill_toofull
 8
active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 4active+undersized+degraded+remapped+backfilling
 3active+remapped+backfilling
 3active+clean+scrubbing+deep
 1active+clean+scrubbing
 1active+undersized+remapped+backfilling
 1active+clean+snaptrim

  io:
client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, the 
ones that received writes during OSD restart. What I see, however, is that the 
cluster seems to have lost track of a huge amount of objects, the 0.456% 
degraded are 1-2 days worth of I/O. I did reboots before and saw only a few 
thousand objects degraded at most. The output of ceph health detail shows a lot 
of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data 
redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 
pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 
1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded 
(0.455%), 85 pgs degraded, 86 pgs undersized
pg 11.9 is stuck undersized for 815.188981, current state 
active+undersized+degraded+remapped+backfill_wait, last acting 
[60,148,2147483647,263,76,230,87,169]
8...9
pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting 
[159,60,180,263,237,3,2147483647,72]
pg 11.4a is stuck undersized for 851.162862, current state 
active+undersized+degraded+remapped+backfill_wait, last acting 
[182,233,87,228,2,180,63,2147483647]
[...]
pg 11.22e is stuck undersized for 851.162402, current state 
active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting 
[234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Deg

[ceph-users] Ceph does not recover from OSD restart

2020-08-03 Thread Frank Schilder
Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster 
was in a state of re-balancing after adding disks to each host. Before restart 
I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted 
all OSDs of one host and the cluster does not recover from that:

  cluster:
id: xxx
health: HEALTH_ERR
45813194/1492348700 objects misplaced (3.070%)
Degraded data redundancy: 6798138/1492348700 objects degraded 
(0.456%), 85 pgs degraded, 86 pgs undersized
Degraded data redundancy (low space): 17 pgs backfill_toofull
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
pools:   11 pools, 3215 pgs
objects: 177.3 M objects, 489 TiB
usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
pgs: 6798138/1492348700 objects degraded (0.456%)
 45813194/1492348700 objects misplaced (3.070%)
 2903 active+clean
 209  active+remapped+backfill_wait
 73   active+undersized+degraded+remapped+backfill_wait
 9active+remapped+backfill_wait+backfill_toofull
 8
active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 4active+undersized+degraded+remapped+backfilling
 3active+remapped+backfilling
 3active+clean+scrubbing+deep
 1active+clean+scrubbing
 1active+undersized+remapped+backfilling
 1active+clean+snaptrim

  io:
client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, the 
ones that received writes during OSD restart. What I see, however, is that the 
cluster seems to have lost track of a huge amount of objects, the 0.456% 
degraded are 1-2 days worth of I/O. I did reboots before and saw only a few 
thousand objects degraded at most. The output of ceph health detail shows a lot 
of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data 
redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 
pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 
1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded 
(0.455%), 85 pgs degraded, 86 pgs undersized
pg 11.9 is stuck undersized for 815.188981, current state 
active+undersized+degraded+remapped+backfill_wait, last acting 
[60,148,2147483647,263,76,230,87,169]
8...9
pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting 
[159,60,180,263,237,3,2147483647,72]
pg 11.4a is stuck undersized for 851.162862, current state 
active+undersized+degraded+remapped+backfill_wait, last acting 
[182,233,87,228,2,180,63,2147483647]
[...]
pg 11.22e is stuck undersized for 851.162402, current state 
active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting 
[234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
pg 11.24 is 
active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting 
[230,259,2147483647,1,144,159,233,146]
[...]
pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting 
[84,259,183,170,85,234,233,2]
pg 11.225 is 
active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting 
[236,183,1,2147483647,2147483647,169,229,230]
pg 11.22e is 
active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting 
[234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving their complete crush map 
placement, as if the peering is incomplete. This is a serious issue, it looks 
like the cluster will see a total storage loss if just 2 more hosts reboot - 
without actually having lost any storage. The pool in question is a 6+2 EC pool.

What is going on here? Why are the PG-maps not restored to their values from 
before the OSD reboot? The degraded PGs should receive the missing OSD IDs, 
everything is up exactly as it was before the reboot.

Thanks for your help and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mimic: much more raw used than reported

2020-08-03 Thread Frank Schilder
Hi all,

quick update: looks like copying OSDs does indeed deflate the objects with 
partial overwrites in an EC pool again:

         osd df tree      blue stats
  ID    SIZE    USE      alloc  store
  87     8.9    6.6        6.6    4.6   <-- old disk with inflated objects
 294      11    1.9        1.9    2.0   <-- new disk (still backfilling)

Even the small effect of compression is visible.
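
For anyone wanting to pull the same numbers, the allocated/stored/compressed
figures come from the OSD perf counters (counter names as of Mimic, OSD id is
an example):

  ceph daemon osd.87 perf dump | jq '.bluestore | {bluestore_allocated, bluestore_stored, bluestore_compressed}'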

I need to migrate everything to LVM at some point anyway. It seems like all 
static data will get cleaned up along the way. It was probably the copy process 
with too small a write size that caused the trouble. Unfortunately, the tool we 
are using does not have an option to change that.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 01 August 2020 10:53:29
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

On 7/31/2020 10:31 AM, Frank Schilder wrote:
> Hi Igor,
>
> thanks. I guess the problem with finding the corresponding images is that it 
> happens at the bluestore level and not at the object level. Even if I listed all rados 
> objects and added their sizes I would not see the excess storage.
>
> Thinking about working around this issue, would re-writing the objects 
> deflate the excess usage? For example, evacuating an OSD and adding it back to 
> the pool after it was empty, would this re-write the objects on this OSD 
> without the overhead?
May be but I can't say for sure..
>
> Or simply copying an entire RBD image, would the copy be deflated?
>
> Although the latter options sound a bit crazy, one could do this without 
> (much) downtime of VMs and it might get us through this migration.

Also you might want to try pg export/import using ceph-objectstore-tool.
See https://ceph.io/geen-categorie/incomplete-pgs-oh-my/ for some hints
how to do that.

But again I'm not certain if it's helpful. Preferably to try with some
non-production cluster first...

>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Igor Fedotov 
> Sent: 30 July 2020 15:40
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Hi Frank,
>
> On 7/30/2020 11:19 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> thanks for looking at this. Here a few thoughts:
>>
>> The copy goes to NTFS. I would expect between 2-4 meta data operations per 
>> write, which would go to few existing objects. I guess the difference 
>> bluestore_write_small-bluestore_write_small_new are mostly such writes and 
>> are susceptible to the partial overwrite amplification. A first question is, 
>> how many objects are actually affected? 3 small writes does not mean 
>> 3 objects have partial overwrites.
>>
>> The large number of small_new is indeed strange, although these would not 
>> lead to excess allocations. It is possible that the write size of the copy 
>> tool is not ideal, was wondering about this too. I will investigate.
> small_new might relate to small tailing chunks that presumably appear
> when doing unaligned appends. Each such append triggers small_new write...
>
>
>> To know more, I would need to find out which images these small writes come 
>> from, we have more than one active. Is there a low-level way to find out 
>> which objects are affected by partial overwrites and which image they belong 
>> to? In your post you were describing some properties like being 
>> shared/cloned etc. Can one search for such objects?
> IMO raising debug bluestore to 10 (or even 20) and subsequent OSD log
> inspection is likely to be the only mean to learn which objects OSD is
> processing... Be careful - this produces significant amount of data and
> negatively impact the performance.
>> On a more fundamental level, I'm wondering why RBD images issue sub-object 
>> size writes at all. I naively assumed that every I/O operation to RBD always 
>> implies full object writes, even just changing a single byte (thinking of an 
>> object as the equivalent of a sector on a disk, the smallest atomic unit). 
>> If this is not the case, what is the meaning of object size then? How does 
>> it influence I/O patterns? My benchmarks show that object size matters a 
>> lot, but it becomes a bit unclear now why.
> Not sure I can provide good enough answer on the above. But I doubt that
> RBD unconditionally operates on full objects.
>
>
>> Thanks and best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 

[ceph-users] Re: mimic: much more raw used than reported

2020-07-31 Thread Frank Schilder
Hi Igor,

thanks. I guess the problem with finding the corresponding images is that it 
happens at the bluestore level and not at the object level. Even if I listed all rados 
objects and added their sizes I would not see the excess storage.

Thinking about working around this issue, would re-writing the objects deflate 
the excess usage? For example, evacuating an OSD and adding it back to the pool 
after it was empty, would this re-write the objects on this OSD without the 
overhead?

Or simply copying an entire RBD image, would the copy be deflated?

Although the latter options sound a bit crazy, one could do this without (much) 
downtime of VMs and it might get us through this migration.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 30 July 2020 15:40
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

On 7/30/2020 11:19 AM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for looking at this. Here a few thoughts:
>
> The copy goes to NTFS. I would expect between 2-4 meta data operations per 
> write, which would go to few existing objects. I guess the difference 
> bluestore_write_small-bluestore_write_small_new are mostly such writes and 
> are susceptible to the partial overwrite amplification. A first question is, 
> how many objects are actually affected? 3 small writes does not mean 
> 3 objects have partial overwrites.
>
> The large number of small_new is indeed strange, although these would not 
> lead to excess allocations. It is possible that the write size of the copy 
> tool is not ideal, was wondering about this too. I will investigate.

small_new might relate to small tailing chunks that presumably appear
when doing unaligned appends. Each such append triggers small_new write...


> To know more, I would need to find out which images these small writes come 
> from, we have more than one active. Is there a low-level way to find out 
> which objects are affected by partial overwrites and which image they belong 
> to? In your post you were describing some properties like being shared/cloned 
> etc. Can one search for such objects?
IMO raising debug bluestore to 10 (or even 20) and subsequent OSD log
inspection is likely to be the only mean to learn which objects OSD is
processing... Be careful - this produces significant amount of data and
negatively impact the performance.
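
For reference, that can be done temporarily on a live OSD with injectargs
(example id; expect a lot of log volume, and remember to turn it back down):

  ceph tell osd.87 injectargs '--debug_bluestore 10/10'
  ceph tell osd.87 injectargs '--debug_bluestore 1/5'     # back to the default
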
>
> On a more fundamental level, I'm wondering why RBD images issue sub-object 
> size writes at all. I naively assumed that every I/O operation to RBD always 
> implies full object writes, even just changing a single byte (thinking of an 
> object as the equivalent of a sector on a disk, the smallest atomic unit). If 
> this is not the case, what is the meaning of object size then? How does it 
> influence on I/O patterns? My benchmarks show that object size matters a lot, 
> but it becomes a bit unclear now why.

Not sure I can provide a good enough answer to the above, but I doubt that
RBD unconditionally operates on full objects.


>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 29 July 2020 16:25:36
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Frank,
>
> so you have pretty high amount of small writes indeed. More than a half
> of the written volume (in bytes) is done via small writes.
>
> And 6x times more small requests.
>
>
> This looks pretty odd for sequential write pattern and is likely to be
> the root cause for that space overhead.
>
> I can see approx 1.4GB additionally lost per each of these 3 OSDs since
> perf dump reset  ( = allocated_new - stored_new - (allocated_old -
> stored_old))
>
> Below are some speculations on what might be happening by for sure I
> could be wrong/missing something. So please do not consider this as a
> 100% valid analysis.
>
> Client does writes in 1MB chunks. This is split into 6 EC chunks (+2
> added) which results in approx 170K writing block to object store ( =
> 1MB / 6). Which corresponds to 1x128K big write and 1x42K small tailing
> one. Resulting in 3x64K allocations.
>
> The next client adjacent write results in another 128K blob, one more
> "small" tailing blob and heading blob which partially overlaps with the
> previous tailing 42K chunk. Overlapped chunks are expected to be merged.
> But presumably this doesn't happen due to that "partial EC overwrites"
> issue. So instead additional 64K blob is allocated for overlapped range.
>
> I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blob and 2x

[ceph-users] Re: mimic: much more raw used than reported

2020-07-30 Thread Frank Schilder
Hi Igor,

thanks for looking at this. Here a few thoughts:

The copy goes to NTFS. I would expect between 2 and 4 metadata operations per 
write, which would go to a few existing objects. I guess the difference 
bluestore_write_small - bluestore_write_small_new is mostly such writes, which are 
susceptible to the partial-overwrite amplification. A first question is how 
many objects are actually affected; 3 small writes does not mean 3 
objects have partial overwrites.

The large number of small_new is indeed strange, although these would not lead 
to excess allocations. It is possible that the write size of the copy tool is 
not ideal; I was wondering about this too and will investigate.

To know more, I would need to find out which images these small writes come 
from; we have more than one active. Is there a low-level way to find out which 
objects are affected by partial overwrites and which image they belong to? In 
your post you describe some properties like being shared/cloned etc. Can 
one search for such objects?
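
As far as mapping objects back to images goes, my understanding (a sketch; the pool 
name is a placeholder) is that RBD data objects are named rbd_data.<image_id>.<offset> 
and the image id shows up as block_name_prefix in "rbd info", so one could build a 
lookup table like this:

for img in $(rbd ls <pool>); do
    echo -n "$img: "; rbd info <pool>/$img | grep block_name_prefix
done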

On a more fundamental level, I'm wondering why RBD images issue sub-object-size 
writes at all. I naively assumed that every I/O operation to RBD implies a full 
object write, even when changing just a single byte (thinking of an object as the 
equivalent of a sector on a disk, the smallest atomic unit). If this is not the 
case, what is the meaning of object size then? How does it influence I/O patterns? 
My benchmarks show that object size matters a lot, but it now becomes a bit 
unclear why.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 29 July 2020 16:25:36
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Frank,

so you have a pretty high amount of small writes indeed. More than half
of the written volume (in bytes) is done via small writes.

And 6x more small requests.


This looks pretty odd for sequential write pattern and is likely to be
the root cause for that space overhead.

I can see approx 1.4GB additionally lost per each of these 3 OSDs since
perf dump reset  ( = allocated_new - stored_new - (allocated_old -
stored_old))

Below are some speculations on what might be happening by for sure I
could be wrong/missing something. So please do not consider this as a
100% valid analysis.

The client does writes in 1MB chunks. This is split into 6 EC data chunks (+2
coding chunks), which results in an approx. 170K write block per shard at the
object store (= 1MB / 6). That corresponds to 1x128K big write plus 1x42K small
tailing write, resulting in 3x64K allocations.

The next adjacent client write results in another 128K blob, one more
"small" tailing blob, and a heading blob which partially overlaps with the
previous tailing 42K chunk. Overlapping chunks are expected to be merged,
but presumably this doesn't happen due to that "partial EC overwrites"
issue. So instead an additional 64K blob is allocated for the overlapped range.

I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blob and 2x64K
blobs for the range where the two writes adjoin: 64K wasted!

And similarly +64K of space overhead for each additional append to this object.
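
In numbers, assuming the 64K alloc unit and no merging of the adjoining ranges,
that would be roughly:

  2 x ~170K shard writes = ~340K of data, but allocated:
      2 x 128K big blobs                        = 256K
    + 1 x  64K tailing blob                     =  64K
    + 2 x  64K blobs for the un-merged
           adjoining range                      = 128K
    ---------------------------------------------------
                                                  448K  (vs. 384K if merged, i.e. ~64K lost per boundary)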


Again, I'm not completely sure the above analysis is 100% valid, and it
doesn't explain that large number of small requests. But you might want
to check/tune/experiment with the client write size, e.g. increase it to 4M
if it is less, or make it divisible by 6.
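
One way to experiment could be a quick fio run with different block sizes against a
scratch image (sketch; fio's rbd engine and the pool/image names are assumptions):

# fio --name=seqwrite --ioengine=rbd --clientname=admin --pool=<pool> --rbdname=<scratch-image> --rw=write --bs=4M --iodepth=16 --direct=1 --size=20G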

Hope this helps.

Thanks,

Igor

On 7/29/2020 4:06 PM, Frank Schilder wrote:

> Hi Igor,
>
> thanks! Here a sample extract for one OSD, time stamp (+%F-%H%M%S) in file 
> name. For the second collection I let it run for about 10 minutes after reset:
>
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_big": 10216689,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_big_bytes": 
> 992602882048,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_big_blobs": 
> 10758603,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_small": 63863813,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_small_bytes": 
> 1481631167388,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_small_unused": 
> 17279108,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_small_deferred": 
> 13629951,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_small_pre_read": 
> 13629951,
> perf_dump_2020-07-29-142739.osd181:"bluestore_write_small_new": 
> 32954754,
> perf_dump_2020-07-29-142739.osd181:"compress_success_count": 1167212,
> perf_dump_2020-07-29-142739.osd181:"compress_rejected_count": 1493508,
perf_dump_2020-07-29-142739.osd181:"bluestore_compressed"

[ceph-users] Re: mimic: much more raw used than reported

2020-07-29 Thread Frank Schilder
*$//g" | awk 'BEGIN {printf("%18s\n", "osd df tree")}
  /root default/ {o=0}
  /datacenter ServerRoom/ {o=1}
  (o==1 && $2=="hdd") {s+=$5;u+=$7;printf("%4s  %5s  %5s\n", $1, $5, $7)}
  f==0 {printf("%4s  %5s  %5s\n", $1, $5, $6);f=1}
  END {printf("%4s  %5.1f  %5.1f\n", "SUM", s, u)}')"

OSDS=( $(echo "$df_tree_data" | tail -n +3 | awk '/SUM/ {next} {print $1}') )

bs_data="$(blue_stats "${OSDS[@]}")"

paste -d " " <(echo "$df_tree_data") <(echo "$bs_data")

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 27 July 2020 13:31
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Frank,

suggest to start with perf counter analysis as per the second part of my
previous email...


Thanks,

Igor

On 7/27/2020 2:30 PM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for your answer. I was thinking about that, but as far as I 
> understood, to hit this bug actually requires a partial rewrite to happen. 
> However, these are disk images in storage servers with basically static 
> files, many of which very large (15GB). Therefore, I believe, the vast 
> majority of objects is written to only once and should not be affected by the 
> amplification bug.
>
> Is there any way to  confirm/rule out that/check how much  amplification is 
> happening?
>
> I'm wondering if I might be observing something else. Since "ceph osd df 
> tree" does report the actual utilization and I have only one pool on these 
> OSDs, there is no problem with accounting allocated storage to a pool. I know 
> its all used by this one pool. I'm more wondering if its not the known 
> amplification but something else (at least partly) that plays a role here.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 27 July 2020 12:54:02
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Hi Frank,
>
> you might be being hit by https://tracker.ceph.com/issues/44213
>
> In short the root causes are  significant space overhead due to high
> bluestore allocation unit (64K) and EC overwrite design.
>
> This is fixed for upcoming Pacific release by using 4K alloc unit but it
> is unlikely to be backported to earlier releases due to its complexity.
> To say nothing about the need for OSD redeployment. Hence please expect
> no fix for mimic.
>
>
> And your raw usage reports might still be not that good since mimic
> lacks per-pool stats collection https://github.com/ceph/ceph/pull/19454.
> I.e. your actual raw space usage is higher than reported. To estimate
> proper raw usage one can use bluestore perf counters (namely
> bluestore_stored and bluestore_allocated). Summing bluestore_allocated
> over all involved OSDs will give actual RAW usage. Summing
> bluestore_stored will provide actual data volume after EC processing,
> i.e. presumably it should be around 158TiB.
>
>
> Thanks,
>
> Igor
>
> On 7/26/2020 8:43 PM, Frank Schilder wrote:
>> Dear fellow cephers,
>>
>> I observe a weird problem on our mimic-13.2.8 cluster. We have an EC RBD 
>> pool backed by HDDs. These disks are not in any other pool. I noticed that 
>> the total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has shrunk 
>> recently from 300TiB to 200TiB. Part but by no means all of this can be 
>> explained by imbalance of the data distribution.
>>
>> When I compare the output of "ceph df detail" and "ceph osd df tree", I find 
>> 69TiB raw capacity used but not accounted for; see calculations below. These 
>> 69TiB raw are equivalent to 20% usable capacity and I really need it back. 
>> Together with the imbalance, we lose about 30% capacity.
>>
>> What is using these extra 69TiB and how can I get it back?
>>
>>
>> Some findings:
>>
>> These are the 5 largest images in the pool, accounting for a total of 97TiB 
>> out of 119TiB usage:
>>
>> # rbd du :
>> NAMEPROVISIONED   USED
>> one-133  25 TiB 14 TiB
>> NAMEPROVISIONEDUSED
>> one-153@222  40 TiB  14 TiB
>> one-153@228  40 TiB 357 GiB
>> one-153@235  40 TiB 797 GiB
>> one-153@241  40 TiB 509 GiB
>> one-153@242  40 TiB  43 GiB
>> one-153@243  40 TiB  16 MiB
>> one-153@244  40 TiB  16 MiB
>> one-153@245  40 TiB 324 MiB

[ceph-users] Re: mimic: much more raw used than reported

2020-07-27 Thread Frank Schilder
Hi Igor,

thanks for your answer. I was thinking about that, but as far as I understand, 
hitting this bug actually requires a partial rewrite to happen. However, these 
are disk images in storage servers with basically static files, many of which are 
very large (15GB). Therefore, I believe, the vast majority of objects are written 
only once and should not be affected by the amplification bug.

Is there any way to confirm or rule that out, or to check how much amplification 
is happening?

I'm wondering if I might be observing something else. Since "ceph osd df tree" 
does report the actual utilization and I have only one pool on these OSDs, 
there is no problem with accounting allocated storage to a pool; I know it's all 
used by this one pool. I'm more wondering whether it's not the known amplification 
but something else (at least partly) that plays a role here.

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 27 July 2020 12:54:02
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

you might be being hit by https://tracker.ceph.com/issues/44213

In short, the root causes are significant space overhead due to the high
bluestore allocation unit (64K) and the EC overwrite design.

This is fixed for the upcoming Pacific release by using a 4K alloc unit, but it
is unlikely to be backported to earlier releases due to its complexity, to say
nothing of the need for OSD redeployment. Hence please expect no fix for mimic.


And your raw usage reports might still not be accurate, since mimic
lacks per-pool stats collection (https://github.com/ceph/ceph/pull/19454),
i.e. your actual raw space usage is higher than reported. To estimate the
proper raw usage one can use the bluestore perf counters (namely
bluestore_stored and bluestore_allocated). Summing bluestore_allocated
over all involved OSDs gives the actual RAW usage; summing
bluestore_stored gives the actual data volume after EC processing,
i.e. presumably it should be around 158TiB.
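
A rough way to sum these up, run on each OSD host (sketch; assumes jq and the
local admin sockets):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon "$sock" perf dump | jq '.bluestore.bluestore_allocated'
done | awk '{a+=$1} END {printf("allocated: %.2f TiB\n", a/2^40)}'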


Thanks,

Igor

On 7/26/2020 8:43 PM, Frank Schilder wrote:
> Dear fellow cephers,
>
> I observe a weird problem on our mimic-13.2.8 cluster. We have an EC RBD pool 
> backed by HDDs. These disks are not in any other pool. I noticed that the 
> total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has shrunk 
> recently from 300TiB to 200TiB. Part but by no means all of this can be 
> explained by imbalance of the data distribution.
>
> When I compare the output of "ceph df detail" and "ceph osd df tree", I find 
> 69TiB raw capacity used but not accounted for; see calculations below. These 
> 69TiB raw are equivalent to 20% usable capacity and I really need it back. 
> Together with the imbalance, we lose about 30% capacity.
>
> What is using these extra 69TiB and how can I get it back?
>
>
> Some findings:
>
> These are the 5 largest images in the pool, accounting for a total of 97TiB 
> out of 119TiB usage:
>
> # rbd du :
> NAMEPROVISIONED   USED
> one-133  25 TiB 14 TiB
> NAMEPROVISIONEDUSED
> one-153@222  40 TiB  14 TiB
> one-153@228  40 TiB 357 GiB
> one-153@235  40 TiB 797 GiB
> one-153@241  40 TiB 509 GiB
> one-153@242  40 TiB  43 GiB
> one-153@243  40 TiB  16 MiB
> one-153@244  40 TiB  16 MiB
> one-153@245  40 TiB 324 MiB
> one-153@246  40 TiB 276 MiB
> one-153@247  40 TiB  96 MiB
> one-153@248  40 TiB 138 GiB
> one-153@249  40 TiB 1.8 GiB
> one-153@250  40 TiB 0 B
> one-153  40 TiB 204 MiB
>   40 TiB  16 TiB
> NAME   PROVISIONEDUSED
> one-391@3   40 TiB 432 MiB
> one-391@9   40 TiB  26 GiB
> one-391@15  40 TiB  90 GiB
> one-391@16  40 TiB 0 B
> one-391@17  40 TiB 0 B
> one-391@18  40 TiB 0 B
> one-391@19  40 TiB 0 B
> one-391@20  40 TiB 3.5 TiB
> one-391@21  40 TiB 5.4 TiB
> one-391@22  40 TiB 5.8 TiB
> one-391@23  40 TiB 8.4 TiB
> one-391@24  40 TiB 1.4 TiB
> one-391 40 TiB 2.2 TiB
>  40 TiB  27 TiB
> NAME   PROVISIONEDUSED
> one-394@3   70 TiB 1.4 TiB
> one-394@9   70 TiB 2.5 TiB
> one-394@15  70 TiB  20 GiB
> one-394@16  70 TiB 0 B
> one-394@17  70 TiB 0 B
> one-394@18  70 TiB 0 B
> one-394@19  70 TiB 383 GiB
> one-394@20  70 TiB 3.3 TiB
> one-394@21  70 TiB 5.0 TiB
> one-394@22  70 TiB 5.0 TiB
> one-394@23  70 TiB 9.0 TiB
> one-394@24  70 TiB 1.6 TiB
> one-394 70 TiB 2.5 TiB
>  70 TiB  31 TiB
> NAMEPROVISIONEDUSED
> one-434  25 TiB 9.1 TiB
>
> The large 70TiB images one-391 and one-394 are 

[ceph-users] mimic: much more raw used than reported

2020-07-26 Thread Frank Schilder
46
 158  hdd   8.90999  1.0   8.9 TiB  5.6 TiB  5.5 TiB  183 MiB  17 GiB  3.4 TiB  62.30  1.90  109  osd.158
 170  hdd   8.90999  1.0   8.9 TiB  5.7 TiB  5.6 TiB  205 MiB  18 GiB  3.2 TiB  63.53  1.94  112  osd.170
 182  hdd   8.90999  1.0   8.9 TiB  4.7 TiB  4.6 TiB  105 MiB  14 GiB  4.3 TiB  52.27  1.60   92  osd.182
  63  hdd   8.90999  1.0   8.9 TiB  4.7 TiB  4.7 TiB  156 MiB  15 GiB  4.2 TiB  52.74  1.61   98  osd.63
 148  hdd   8.90999  1.0   8.9 TiB  5.2 TiB  5.1 TiB  119 MiB  16 GiB  3.8 TiB  57.82  1.77  100  osd.148
 159  hdd   8.90999  1.0   8.9 TiB  4.0 TiB  4.0 TiB   89 MiB  12 GiB  4.9 TiB  44.61  1.36   79  osd.159
 172  hdd   8.90999  1.0   8.9 TiB  5.1 TiB  5.1 TiB  173 MiB  16 GiB  3.8 TiB  57.22  1.75   98  osd.172
 183  hdd   8.90999  1.0   8.9 TiB  6.0 TiB  6.0 TiB  135 MiB  19 GiB  2.9 TiB  67.35  2.06  118  osd.183
 229  hdd   8.90999  1.0   8.9 TiB  4.6 TiB  4.6 TiB  127 MiB  15 GiB  4.3 TiB  52.05  1.59   93  osd.229
 232  hdd   8.90999  1.0   8.9 TiB  5.2 TiB  5.2 TiB  158 MiB  17 GiB  3.7 TiB  58.22  1.78  101  osd.232
 235  hdd   8.90999  1.0   8.9 TiB  4.1 TiB  4.1 TiB  103 MiB  13 GiB  4.8 TiB  45.96  1.40   79  osd.235
 238  hdd   8.90999  1.0   8.9 TiB  5.4 TiB  5.4 TiB  120 MiB  17 GiB  3.5 TiB  60.47  1.85  104  osd.238
 259  hdd  10.91399  1.0    11 TiB  6.2 TiB  6.2 TiB  140 MiB  19 GiB  4.7 TiB  56.54  1.73  120  osd.259
 231  hdd   8.90999  1.0   8.9 TiB  5.1 TiB  5.1 TiB  114 MiB  16 GiB  3.8 TiB  56.90  1.74  101  osd.231
 233  hdd   8.90999  1.0   8.9 TiB  5.5 TiB  5.5 TiB  123 MiB  17 GiB  3.4 TiB  61.78  1.89  106  osd.233
 236  hdd   8.90999  1.0   8.9 TiB  5.1 TiB  5.1 TiB  114 MiB  16 GiB  3.8 TiB  57.53  1.76  101  osd.236
 239  hdd   8.90999  1.0   8.9 TiB  4.2 TiB  4.2 TiB   95 MiB  13 GiB  4.7 TiB  47.41  1.45   86  osd.239
 263  hdd  10.91399  1.0    11 TiB  5.3 TiB  5.3 TiB  178 MiB  17 GiB  5.6 TiB  48.73  1.49  102  osd.263
 228  hdd   8.90999  1.0   8.9 TiB  5.1 TiB  5.1 TiB  113 MiB  16 GiB  3.8 TiB  57.10  1.74   96  osd.228
 230  hdd   8.90999  1.0   8.9 TiB  4.9 TiB  4.9 TiB  144 MiB  16 GiB  4.0 TiB  55.20  1.69   99  osd.230
 234  hdd   8.90999  1.0   8.9 TiB  5.6 TiB  5.6 TiB  164 MiB  18 GiB  3.3 TiB  63.29  1.93  109  osd.234
 237  hdd   8.90999  1.0   8.9 TiB  4.8 TiB  4.8 TiB  110 MiB  15 GiB  4.1 TiB  54.33  1.66   97  osd.237
 260  hdd  10.91399  1.0    11 TiB  5.4 TiB  5.4 TiB  152 MiB  17 GiB  5.5 TiB  49.35  1.51  104  osd.260
   0  hdd   8.90999  1.0   8.9 TiB  5.2 TiB  5.2 TiB  157 MiB  16 GiB  3.7 TiB  58.28  1.78  102  osd.0
   2  hdd   8.90999  1.0   8.9 TiB  5.3 TiB  5.2 TiB  122 MiB  16 GiB  3.6 TiB  59.05  1.80  106  osd.2
  72  hdd   8.90999  1.0   8.9 TiB  4.4 TiB  4.4 TiB  145 MiB  14 GiB  4.5 TiB  49.89  1.52   89  osd.72
  76  hdd   8.90999  1.0   8.9 TiB  5.1 TiB  5.1 TiB  178 MiB  16 GiB  3.8 TiB  56.89  1.74  102  osd.76
  86  hdd   8.90999  1.0   8.9 TiB  4.6 TiB  4.5 TiB  155 MiB  14 GiB  4.3 TiB  51.18  1.56   94  osd.86
   1  hdd   8.90999  1.0   8.9 TiB  4.9 TiB  4.9 TiB  141 MiB  15 GiB  4.0 TiB  54.73  1.67   95  osd.1
   3  hdd   8.90999  1.0   8.9 TiB  4.7 TiB  4.7 TiB  156 MiB  15 GiB  4.2 TiB  52.40  1.60   94  osd.3
  73  hdd   8.90999  1.0   8.9 TiB  5.0 TiB  4.9 TiB  146 MiB  16 GiB  3.9 TiB  55.68  1.70  102  osd.73
  85  hdd   8.90999  1.0   8.9 TiB  5.6 TiB  5.5 TiB  192 MiB  18 GiB  3.3 TiB  62.46  1.91  109  osd.85
  87  hdd   8.90999  1.0   8.9 TiB  5.0 TiB  5.0 TiB  189 MiB  16 GiB  3.9 TiB  55.91  1.71  102  osd.87

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-07-21 Thread Frank Schilder
Quick question: Is there a way to change the frequency of heap dumps? On this 
page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function 
HeapProfilerSetAllocationInterval() is mentioned, but no other way of 
configuring this. Is there a config parameter or a ceph daemon call to adjust 
this?

If not, can I change the dump path?

It's likely to overrun my log partition quickly if I cannot adjust either of the 
two.
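
From the gperftools page it looks like these are controlled by environment
variables that would have to be set for the OSD process before the profiler is
started (untested, just my reading of the docs; values are examples):

HEAPPROFILE=/some/big/partition/osd.181.heap        # dump path prefix for env-based profiling
HEAP_PROFILE_ALLOCATION_INTERVAL=4294967296         # dump every 4 GB allocated instead of 1 GB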

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 20 July 2020 15:19:05
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Dear Mark,

thank you very much for the very helpful answers. I will raise 
osd_memory_cache_min, leave everything else alone and watch what happens. I 
will report back here.

Thanks also for raising this as an issue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 July 2020 15:08:11
To: Frank Schilder; Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

On 7/20/20 3:23 AM, Frank Schilder wrote:
> Dear Mark and Dan,
>
> I'm in the process of restarting all OSDs and could use some quick advice on 
> bluestore cache settings. My plan is to set higher minimum values and deal 
> with accumulated excess usage via regular restarts. Looking at the 
> documentation 
> (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), 
> I find the following relevant options (with defaults):
>
> # Automatic Cache Sizing
> osd_memory_target {4294967296} # 4GB
> osd_memory_base {805306368} # 768MB
> osd_memory_cache_min {134217728} # 128MB
>
> # Manual Cache Sizing
> bluestore_cache_meta_ratio {.4} # 40% ?
> bluestore_cache_kv_ratio {.4} # 40% ?
> bluestore_cache_kv_max {512 * 1024*1024} # 512MB
>
> Q1) If I increase osd_memory_cache_min, should I also increase 
> osd_memory_base by the same or some other amount?


osd_memory_base is a hint at how much memory the OSD could consume
outside the cache once it's reached steady state.  It basically sets a
hard cap on how much memory the cache will use to avoid over-committing
memory and thrashing when we exceed the memory limit. It's not necessary
to get it right, it just helps smooth things out by making the automatic
memory tuning less aggressive.  IE if you have a 2 GB memory target and
a 512MB base, you'll never assign more than 1.5GB to the cache on the
assumption that the rest of the OSD will eventually need 512MB to
operate even if it's not using that much right now.  I think you can
probably just leave it alone.  What you and Dan appear to be seeing is
that this number isn't static in your case but increases over time any
way.  Eventually I'm hoping that we can automatically account for more
and more of that memory by reading the data from the mempools.

> Q2) The cache ratio options are shown under the section "Manual Cache 
> Sizing". Do they also apply when cache auto tuning is enabled? If so, is it 
> worth changing these defaults for higher values of osd_memory_cache_min?


They actually do have an effect on the automatic cache sizing and
probably shouldn't only be under the manual section.  When you have the
automatic cache sizing enabled, those options will affect the "fair
share" values of the different caches at each cache priority level.  IE
at priority level 0, if both caches want more memory than is available,
those ratios will determine how much each cache gets.  If there is more
memory available than requested, each cache gets as much as they want
and we move on to the next priority level and do the same thing again.
So in this case the ratios end up being sort of more like fallback
settings for when you don't have enough memory to fulfill all cache
requests at a given priority level, but otherwise are not utilized until
we hit that limit.  The goal with this scheme is to make sure that "high
priority" items in each cache get first dibs at the memory even if it
might skew the ratios.  This might be things like rocksdb bloom filters
and indexes, or potentially very recent hot items in one cache vs very
old items in another cache.  The ratios become more like guidelines than
hard limits.


When you change to manual mode, you set an overall bluestore cache size
and each cache gets a flat percentage of it based on the ratios.  With
0.4/0.4 you will always have 40% for onode, 40% for omap, and 20% for
data even if one of those caches does not use all of it's memory.


>
> Many thanks for your help with this. I can't find answers to these questions 
> in the docs.
>
> There might be two reasons for high osd_map memory usage. One is, that our 
> OSDs seem to hold a large number of OSD

[ceph-users] Re: OSD memory leak?

2020-07-20 Thread Frank Schilder
Dear Mark,

thank you very much for the very helpful answers. I will raise 
osd_memory_cache_min, leave everything else alone and watch what happens. I 
will report back here.
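
For the record, I plan to do this via the config database, something like the
following (the value is just an example):

# ceph config set osd osd_memory_cache_min 1073741824     # 1 GiB
# ceph config get osd.0 osd_memory_cache_min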

Thanks also for raising this as an issue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 July 2020 15:08:11
To: Frank Schilder; Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

On 7/20/20 3:23 AM, Frank Schilder wrote:
> Dear Mark and Dan,
>
> I'm in the process of restarting all OSDs and could use some quick advice on 
> bluestore cache settings. My plan is to set higher minimum values and deal 
> with accumulated excess usage via regular restarts. Looking at the 
> documentation 
> (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), 
> I find the following relevant options (with defaults):
>
> # Automatic Cache Sizing
> osd_memory_target {4294967296} # 4GB
> osd_memory_base {805306368} # 768MB
> osd_memory_cache_min {134217728} # 128MB
>
> # Manual Cache Sizing
> bluestore_cache_meta_ratio {.4} # 40% ?
> bluestore_cache_kv_ratio {.4} # 40% ?
> bluestore_cache_kv_max {512 * 1024*1024} # 512MB
>
> Q1) If I increase osd_memory_cache_min, should I also increase 
> osd_memory_base by the same or some other amount?


osd_memory_base is a hint at how much memory the OSD could consume
outside the cache once it's reached steady state.  It basically sets a
hard cap on how much memory the cache will use to avoid over-committing
memory and thrashing when we exceed the memory limit. It's not necessary
to get it right, it just helps smooth things out by making the automatic
memory tuning less aggressive.  IE if you have a 2 GB memory target and
a 512MB base, you'll never assign more than 1.5GB to the cache on the
assumption that the rest of the OSD will eventually need 512MB to
operate even if it's not using that much right now.  I think you can
probably just leave it alone.  What you and Dan appear to be seeing is
that this number isn't static in your case but increases over time any
way.  Eventually I'm hoping that we can automatically account for more
and more of that memory by reading the data from the mempools.

> Q2) The cache ratio options are shown under the section "Manual Cache 
> Sizing". Do they also apply when cache auto tuning is enabled? If so, is it 
> worth changing these defaults for higher values of osd_memory_cache_min?


They actually do have an effect on the automatic cache sizing and
probably shouldn't only be under the manual section.  When you have the
automatic cache sizing enabled, those options will affect the "fair
share" values of the different caches at each cache priority level.  IE
at priority level 0, if both caches want more memory than is available,
those ratios will determine how much each cache gets.  If there is more
memory available than requested, each cache gets as much as they want
and we move on to the next priority level and do the same thing again.
So in this case the ratios end up being sort of more like fallback
settings for when you don't have enough memory to fulfill all cache
requests at a given priority level, but otherwise are not utilized until
we hit that limit.  The goal with this scheme is to make sure that "high
priority" items in each cache get first dibs at the memory even if it
might skew the ratios.  This might be things like rocksdb bloom filters
and indexes, or potentially very recent hot items in one cache vs very
old items in another cache.  The ratios become more like guidelines than
hard limits.


When you change to manual mode, you set an overall bluestore cache size
and each cache gets a flat percentage of it based on the ratios.  With
0.4/0.4 you will always have 40% for onode, 40% for omap, and 20% for
data even if one of those caches does not use all of its memory.


>
> Many thanks for your help with this. I can't find answers to these questions 
> in the docs.
>
> There might be two reasons for high osd_map memory usage. One is, that our 
> OSDs seem to hold a large number of OSD maps:


I brought this up in our core team standup last week.  Not sure if
anyone has had time to look at it yet though.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-07-20 Thread Frank Schilder
  "osd": {
"items": 96,
"bytes": 1115904
},
"osd_mapbl": {
"items": 80,
"bytes": 8501746
},
"osd_pglog": {
"items": 328703,
"bytes": 117673864
},
"osdmap": {
"items": 12101478,
"bytes": 210941392
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 23145696,
"bytes": 526245301
}
}
}

# ceph daemon osd.211 heap stats
osd.211 tcmalloc heap stats:
MALLOC: 1727399344 ( 1647.4 MiB) Bytes in use by application
MALLOC: +   532480 (0.5 MiB) Bytes in page heap freelist
MALLOC: +262860912 (  250.7 MiB) Bytes in central cache freelist
MALLOC: + 11693568 (   11.2 MiB) Bytes in transfer cache freelist
MALLOC: + 29694944 (   28.3 MiB) Bytes in thread cache freelists
MALLOC: + 14024704 (   13.4 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =   2046205952 ( 1951.4 MiB) Actual memory used (physical + swap)
MALLOC: +229212160 (  218.6 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =   2275418112 ( 2170.0 MiB) Virtual address space used
MALLOC:
MALLOC: 145115  Spans in use
MALLOC: 32  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


# ceph daemon osd.211 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 4691828,
"bytes": 37534624
},
"bluestore_cache_data": {
"items": 894,
"bytes": 163053568
},
"bluestore_cache_onode": {
"items": 165536,
"bytes": 94024448
},
"bluestore_cache_other": {
"items": 33936718,
"bytes": 233428234
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 110,
"bytes": 75680
},
"bluestore_writing_deferred": {
"items": 38,
"bytes": 6061245
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 9956,
"bytes": 189640
},
"buffer_anon": {
"items": 293298,
        "bytes": 59950954
},
"buffer_meta": {
    "items": 1005,
"bytes": 64320
},
"osd": {
"items": 98,
"bytes": 1139152
},
"osd_mapbl": {
"items": 80,
"bytes": 8501690
},
"osd_pglog": {
"items": 350517,
"bytes": 132253139
},
"osdmap": {
"items": 633498,
"bytes": 10866360
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
   

[ceph-users] Re: OSD memory leak?

2020-07-16 Thread Frank Schilder
Dear Dan, cc Mark,

this sounds exactly like the scenario I'm looking at. We have rolling snapshots 
on RBD images on currently ca. 200 VMs and increasing. Snapshots are daily with 
different retention periods.

We have two pools with separate hardware backing RBD and cephfs. The mem stats 
I sent are from an OSD backing cephfs, which does not have any snaps currently. 
So the snaps on other OSDs influence the memory usage of OSDs that have nothing 
to do with the RBDs. I also noticed a significant drop of memory usage across 
the cluster after restarting the OSDs on just one host. Not sure if this is 
expected either.

Looks like the OSDs do collect dead baggage quite fast, and the memory_target 
reduces the caches in an attempt to accommodate that. The fact that the kernel 
swaps this out in favour of disk buffers on a system with low swappiness, where 
the only disk access is local syslog, indicates that this memory is allocated but 
never used: quite a massive leak. It currently looks like the leakage exceeds the 
memory target after only a couple of days.

I don't want to leave the occasional OOM kill to my operations team. For now I 
will probably adopt the reverse strategy: give up on memory_target doing 
something useful, increase the minimum cache limits to ensure at least some 
caching, have swap take care of the leak, and restart OSDs regularly (every 2-3 
months).

Would be good if this could be looked at. Please let me know if there is some 
data I can provide.
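
For instance, I could record the osdmap mempool per OSD over time with something
like this (sketch; assumes jq, OSD id is an example):

# ceph daemon osd.181 dump_mempools | jq '.mempool.by_pool.osdmap'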

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 15 July 2020 18:36:06
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

On 7/15/20 9:58 AM, Dan van der Ster wrote:
> Hi Mark,
>
> On Mon, Jul 13, 2020 at 3:42 PM Mark Nelson  wrote:
>> Hi Frank,
>>
>>
>> So the osd_memory_target code will basically shrink the size of the
>> bluestore and rocksdb caches to attempt to keep the overall mapped (not
>> rss!) memory of the process below the target.  It's sort of "best
>> effort" in that it can't guarantee the process will fit within a given
>> target, it will just (assuming we are over target) shrink the caches up
>> to some minimum value and that's it. 2GB per OSD is a pretty ambitious
>> target.  It's the lowest osd_memory_target we recommend setting.  I'm a
>> little surprised the OSD is consuming this much memory with a 2GB target
>> though.
>>
>> Looking at your mempool dump I see very little memory allocated to the
>> caches.  In fact the majority is taken up by osdmap (looks like you have
>> a decent number of OSDs) and pglog.  That indicates that the memory
> Do you know if this high osdmap usage is known already?
> Our big block storage cluster generates a new osdmap every few seconds
> (due to rbd snap trimming) and we see the osdmap mempool usage growing
> over a few months until osds start getting OOM killed.
>
> Today we proactively restarted them because the osdmap_mempool was
> using close to 700MB.
> So it seems that whatever is supposed to be trimming is not working.
> (This is observed with nautilus 14.2.8 but iirc it has been the same
> even when we were running luminous and mimic too)
>
> Cheers, Dan


Hrm, it hasn't been on my radar, though looking back through the mailing
list there appear to be various reports over the years of high usage 
(some of which theoretically have been fixed).  Maybe submit a tracker
issue?  700MB seems quite high for osdmap, but I don't really know the
retention rules so someone else who knows that code better will have to
chime in.


>
>> autotuning is probably working but simply can't do anything more to
>> help.  Something else is taking up the memory. Figure you've got a
>> little shy of 500MB for the mempools.  RocksDB will take up more (and
>> potentially quite a bit more if you have memtables backing up waiting to
>> be flushed to L0) and potentially some other things in the OSD itself
>> that could take up memory.  If you feel comfortable experimenting, you
>> could try changing the rocksdb WAL/memtable settings.  By default we
>> have up to 4 256MB WAL buffers.  Instead you could try something like 2
>> 64MB buffers, but be aware this could cause slow performance or even
>> temporary write stalls if you have fast storage.  Still, this would only
>> give you up to ~0.9GB back.  Since you are on mimic, you might also want
>> to check what your kernel's transparent huge pages configuration is.  I
>> don't remember if we backported Patrick's fix to always avoid THP for
>> ceph processes.  If your kernel is set to "always", you might consider
>> trying it with "madvise".
>>
>> Alte

[ceph-users] Re: mon_osd_down_out_subtree_limit not working?

2020-07-15 Thread Frank Schilder
Hi Dan,

I now added it to ceph.conf and restarted all MONs. The running config now 
shows as:

# ceph config show mon.ceph-01 | grep -e NAME -e mon_osd_down_out_subtree_limit
NAME   VALUE  SOURCE   
OVERRIDESIGNORES
mon_osd_down_out_subtree_limit host   file 
(mon[host])

The config DB entry moved from the IGNORES column to OVERRIDES, that is, it is 
still not used. Looks like a priority bug to me. On startup, the config DB 
setting should have higher priority than source "default" (and lower than 
"file", as is the case). Should I open a tracker ticket?

I tested a shutdown of all OSDs on a host and it works now as expected and 
desired.
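
For the archives, the ceph.conf entry is simply along these lines (sketch; whether
it sits under [global] or [mon] should not matter for the monitors):

[mon]
        mon_osd_down_out_subtree_limit = host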

Thanks!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Frank Schilder 
Sent: 15 July 2020 10:15:12
To: Dan van der Ster
Cc: Anthony D'Atri; ceph-users
Subject: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Setting it in ceph.conf is exactly what I wanted to avoid :). I will give it a 
try though. I guess this should become an issue in the tracker?

Is it, by any chance, required to restart *all* daemons or should MONs be 
enough?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 15 July 2020 10:10:44
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Hrmm that is strange.

We set it via /etc/ceph/ceph.conf, not the config framework. Maybe try that?

-- dan

On Wed, Jul 15, 2020 at 9:59 AM Frank Schilder  wrote:
>
> Hi Dan,
>
> it still does not work. When I execute
>
> # ceph config set global mon_osd_down_out_subtree_limit host
> 2020-07-15 09:17:11.890 7f36cf7fe700 -1 set_mon_vals failed to set 
> mon_osd_down_out_subtree_limit = host: Configuration option 
> 'mon_osd_down_out_subtree_limit' may not be modified at runtime
>
> I get now a warning that one cannot change the value at run time. However, a 
> restart of all monitors still does not apply the value:
>
> # ceph config show mon.ceph-01 | grep -e NAME -e 
> mon_osd_down_out_subtree_limit | sed -e "s/  */\t/g"
> NAMEVALUE   SOURCE  OVERRIDES   IGNORES
> mon_osd_down_out_subtree_limit  rackdefault mon
>
> so the setting in the config data base is still ignored. Any ideas? I cannot 
> shut down the entire cluster for something that simple.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Dan van der Ster 
> Sent: 14 July 2020 17:38:27
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users
> Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
>
> Seems that
>
> ceph config set mon mon_osd_down_out_subtree_limit
>
> isn't working. (I've seen this sort of config namespace issue in the past).
>
> I'd try `ceph config set global mon_osd_down_out_subtree_limit host`
> then restart the mon and check `ceph daemon mon.ceph-01 config get
> mon_osd_down_out_subtree_limit` again.
>
> -- dan
>
>
> On Tue, Jul 14, 2020 at 1:35 PM Frank Schilder  wrote:
> >
> > Hi Dan,
> >
> > thanks for your reply. There is still a problem.
> >
> > Firstly, I did indeed forget to restart the mon even though I looked at the 
> > help for mon_osd_down_out_subtree_limit and it says it requires a restart. 
> > Stupid me. Well, now I did a restart and it still doesn't work. Here is the 
> > situation:
> >
> > # ceph config dump | grep subtree
> >   mon  advanced mon_osd_down_out_subtree_limithost  
> >*
> >   mon  advanced mon_osd_reporter_subtree_level
> > datacenter
> >
> > # ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
> > host
> >
> > # ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
> > {
> > "mon_osd_down_out_subtree_limit": "rack"
> > }
> >
> > # ceph config show mon.ceph-01 | grep subtree
> > mon_osd_down_out_subtree_limit rack   default   
> > mon
> > mon_osd_reporter_subtree_level datacenter mon
> >
> > The default overrides the mon config database setting. What is going on 
> > here? I restarted all 3 monitors.
> >
> > Best regards and thanks for your help,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >

[ceph-users] Re: mon_osd_down_out_subtree_limit not working?

2020-07-15 Thread Frank Schilder
Setting it in ceph.conf is exactly what I wanted to avoid :). I will give it a 
try though. I guess this should become an issue in the tracker?

Is it, by any chance, required to restart *all* daemons or should MONs be 
enough?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 15 July 2020 10:10:44
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Hrmm that is strange.

We set it via /etc/ceph/ceph.conf, not the config framework. Maybe try that?

-- dan

On Wed, Jul 15, 2020 at 9:59 AM Frank Schilder  wrote:
>
> Hi Dan,
>
> it still does not work. When I execute
>
> # ceph config set global mon_osd_down_out_subtree_limit host
> 2020-07-15 09:17:11.890 7f36cf7fe700 -1 set_mon_vals failed to set 
> mon_osd_down_out_subtree_limit = host: Configuration option 
> 'mon_osd_down_out_subtree_limit' may not be modified at runtime
>
> I get now a warning that one cannot change the value at run time. However, a 
> restart of all monitors still does not apply the value:
>
> # ceph config show mon.ceph-01 | grep -e NAME -e 
> mon_osd_down_out_subtree_limit | sed -e "s/  */\t/g"
> NAMEVALUE   SOURCE  OVERRIDES   IGNORES
> mon_osd_down_out_subtree_limit  rackdefault mon
>
> so the setting in the config data base is still ignored. Any ideas? I cannot 
> shut down the entire cluster for something that simple.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Dan van der Ster 
> Sent: 14 July 2020 17:38:27
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users
> Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
>
> Seems that
>
> ceph config set mon mon_osd_down_out_subtree_limit
>
> isn't working. (I've seen this sort of config namespace issue in the past).
>
> I'd try `ceph config set global mon_osd_down_out_subtree_limit host`
> then restart the mon and check `ceph daemon mon.ceph-01 config get
> mon_osd_down_out_subtree_limit` again.
>
> -- dan
>
>
> On Tue, Jul 14, 2020 at 1:35 PM Frank Schilder  wrote:
> >
> > Hi Dan,
> >
> > thanks for your reply. There is still a problem.
> >
> > Firstly, I did indeed forget to restart the mon even though I looked at the 
> > help for mon_osd_down_out_subtree_limit and it says it requires a restart. 
> > Stupid me. Well, now I did a restart and it still doesn't work. Here is the 
> > situation:
> >
> > # ceph config dump | grep subtree
> >   mon  advanced mon_osd_down_out_subtree_limithost  
> >*
> >   mon  advanced mon_osd_reporter_subtree_level
> > datacenter
> >
> > # ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
> > host
> >
> > # ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
> > {
> > "mon_osd_down_out_subtree_limit": "rack"
> > }
> >
> > # ceph config show mon.ceph-01 | grep subtree
> > mon_osd_down_out_subtree_limit rack   default   
> > mon
> > mon_osd_reporter_subtree_level datacenter mon
> >
> > The default overrides the mon config database setting. What is going on 
> > here? I restarted all 3 monitors.
> >
> > Best regards and thanks for your help,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Dan van der Ster 
> > Sent: 14 July 2020 10:53:13
> > To: Frank Schilder
> > Cc: Anthony D'Atri; ceph-users
> > Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
> >
> > mon_osd_down_out_subtree_limit has been working well here. Did you
> > restart the mon's after making that config change?
> > Can you do this just to make sure it took effect?
> >
> >ceph daemon mon.`hostname -s` config get mon_osd_down_out_subtree_limit
> >
> > -- dan
> >
> > On Tue, Jul 14, 2020 at 8:57 AM Frank Schilder  wrote:
> > >
> > > Yes. After the time-out of 600 secs the OSDs got marked down, all PGs got 
> > > remapped and recovery/rebalancing started as usual. In the past, I did 
> > > service on servers with the flag noout set and would expect that 
> > > mon_osd_down_out_subtree_limit=host has the same effect when shutting 
>

[ceph-users] Re: mon_osd_down_out_subtree_limit not working?

2020-07-15 Thread Frank Schilder
Hi Dan,

it still does not work. When I execute

# ceph config set global mon_osd_down_out_subtree_limit host
2020-07-15 09:17:11.890 7f36cf7fe700 -1 set_mon_vals failed to set 
mon_osd_down_out_subtree_limit = host: Configuration option 
'mon_osd_down_out_subtree_limit' may not be modified at runtime

I get now a warning that one cannot change the value at run time. However, a 
restart of all monitors still does not apply the value:

# ceph config show mon.ceph-01 | grep -e NAME -e mon_osd_down_out_subtree_limit 
| sed -e "s/  */\t/g"
NAMEVALUE   SOURCE  OVERRIDES   IGNORES
mon_osd_down_out_subtree_limit  rackdefault mon

so the setting in the config data base is still ignored. Any ideas? I cannot 
shut down the entire cluster for something that simple.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 14 July 2020 17:38:27
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Seems that

ceph config set mon mon_osd_down_out_subtree_limit

isn't working. (I've seen this sort of config namespace issue in the past).

I'd try `ceph config set global mon_osd_down_out_subtree_limit host`
then restart the mon and check `ceph daemon mon.ceph-01 config get
mon_osd_down_out_subtree_limit` again.

-- dan


On Tue, Jul 14, 2020 at 1:35 PM Frank Schilder  wrote:
>
> Hi Dan,
>
> thanks for your reply. There is still a problem.
>
> Firstly, I did indeed forget to restart the mon even though I looked at the 
> help for mon_osd_down_out_subtree_limit and it says it requires a restart. 
> Stupid me. Well, now I did a restart and it still doesn't work. Here is the 
> situation:
>
> # ceph config dump | grep subtree
>   mon  advanced mon_osd_down_out_subtree_limithost
>  *
>   mon  advanced mon_osd_reporter_subtree_level
> datacenter
>
> # ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
> host
>
> # ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
> {
> "mon_osd_down_out_subtree_limit": "rack"
> }
>
> # ceph config show mon.ceph-01 | grep subtree
> mon_osd_down_out_subtree_limit rack   default 
>   mon
> mon_osd_reporter_subtree_level datacenter mon
>
> The default overrides the mon config database setting. What is going on here? 
> I restarted all 3 monitors.
>
> Best regards and thanks for your help,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 14 July 2020 10:53:13
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users
> Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
>
> mon_osd_down_out_subtree_limit has been working well here. Did you
> restart the mon's after making that config change?
> Can you do this just to make sure it took effect?
>
>ceph daemon mon.`hostname -s` config get mon_osd_down_out_subtree_limit
>
> -- dan
>
> On Tue, Jul 14, 2020 at 8:57 AM Frank Schilder  wrote:
> >
> > Yes. After the time-out of 600 secs the OSDs got marked down, all PGs got 
> > remapped and recovery/rebalancing started as usual. In the past, I did 
> > service on servers with the flag noout set and would expect that 
> > mon_osd_down_out_subtree_limit=host has the same effect when shutting down 
> > an entire host. Unfortunately, in my case these two settings behave 
> > differently.
> >
> > If I understand the documentation correctly, the OSDs should not get marked 
> > out automatically.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Anthony D'Atri 
> > Sent: 14 July 2020 04:32:05
> > To: Frank Schilder
> > Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working?
> >
> > Did it start rebalancing?
> >
> > > On Jul 13, 2020, at 4:29 AM, Frank Schilder  wrote:
> > >
> > > if I shut down all OSDs on this host, these OSDs should not be marked out 
> > > automatically after mon_osd_down_out_interval(=600) seconds. I did a test 
> > > today and, unfortunately, the OSDs do get marked as out. Ceph status was 
> > > showing 1 host down as expected.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Poor Windows performance on ceph RBD.

2020-07-15 Thread Frank Schilder
Dear all,

a few more results regarding virtio-version, RAM size and ceph RBD caching.

I got some wrong information from our operators. We are using 
virtio-win-0.1.171 and found that this version might have a regression that 
affects performance: 
https://forum.proxmox.com/threads/big-discovery-on-virtio-performance.62728/. 
We are considering downgrading all machines to virtio-win-0.1.164-2 until 
virtio-win-0.1.185-1 is marked stable. Our tests show that with both of these 
versions, Windows Server 2016 and 2019 perform equally well.

We also experimented with the memory size for these machines. They used to have 
4GB only. With 4GB, both versions eventually run into stalled I/O. After 
increasing this to 8GB we don't see stalls any more.

Ceph RBD caching should have been set to writeback. Not sure why caching was 
disabled by default. It does not have much, if any, effect on write performance, 
although transfer rates seem steadier. I mainly want to enable caching to 
reduce read operations, which compete with writes at the OSD level. This should 
give a much better overall experience. We will change this setting during the 
forthcoming service windows.
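
Concretely, the change we have in mind is roughly the following (sketch; the exact
ONE syntax and the client-side options still need to be confirmed on our side):

  DISK = [ driver = "raw" , cache = "writeback" ]

and, on the hypervisors, something like:

[client]
        rbd cache = true
        rbd cache writethrough until flush = true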

Looks like we more or less got it sorted. Hints in this thread helped 
pinpoint the issues.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 13 July 2020 15:38:58
To: André Gemünd; ceph-users
Subject: [ceph-users] Re: Poor Windows performance on ceph RBD.

> If I may ask, which version of the virtio drivers do you use?

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/latest-virtio/virtio-win.iso

Looks like virtio-win-0.1.185.*

> And do you use caching on libvirt driver level?

In the ONE interface, we use

  DISK = [ driver = "raw" , cache = "none"]

which translates to

  <driver name='qemu' type='raw' cache='none'/>

in the XML. We have no qemu settings in the ceph.conf. Looks like caching is 
disabled. Not sure if this is the recommended way though, or why caching is 
disabled by default.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: André Gemünd 
Sent: 13 July 2020 11:18
To: Frank Schilder
Subject: Re: [ceph-users] Re: Poor Windows performance on ceph RBD.

If I may ask, which version of the virtio drivers do you use?

And do you use caching on libvirt driver level?

Greetings
André

- Am 13. Jul 2020 um 10:43 schrieb Frank Schilder fr...@dtu.dk:

>> > To anyone who is following this thread, we found a possible explanation for
>> > (some of) our observations.
>
>> If someone is following this, they probably want the possible
>> explanation and not the knowledge of you having the possible
>> explanation.
>
>> So you are saying if you do eg. a core installation (without gui) of
>> 2016/2019 disable all services. The fio test results are significantly
>> different to eg. a centos 7 vm doing the same fio test? Are you sure
>> this is not related to other processes writing to disk?
>
> Right, it's not an explanation but rather a further observation. We don't 
> really
> have an explanation yet.
>
> Its an identical installation of both server versions, same services 
> configured.
> Our operators are not really into debugging Windows, that's why we were asking
> here. Their hypothesis is, that the VD driver for accessing RBD images has
> problems with Windows servers newer than 2016. I'm not a Windows guy, so can't
> really comment on this.
>
> The test we do is a simple copy-test of a single 10g file and we monitor the
> transfer speed. This info was cut out of this e-mail, the original report for
> reference is:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/ANHJQZLJT474B457VVM4ZZZ6HBXW4OPO/
> .
>
> We are very sure that it is not related to other processes writing to disk, we
> monitor that too. There is also no competition on the RBD pool at the time of
> testing.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Marc Roos 
> Sent: 13 July 2020 10:24
> To: ceph-users; Frank Schilder
> Subject: RE: [ceph-users] Re: Poor Windows performance on ceph RBD.
>
>>> To anyone who is following this thread, we found a possible
> explanation for
>>> (some of) our observations.
>
> If someone is following this, they probably want the possible
> explanation and not the knowledge of you having the possible
> explanation.
>
> So you are saying if you do eg. a core installation (without gui) of
> 2016/2019 disable all services. The fio test results are significantly
> different to eg. a centos 7 vm doing the same fio tes

[ceph-users] Re: OSD memory leak?

2020-07-14 Thread Frank Schilder
Hi Anthony and Mark,

thanks for your answers.

I have seen recommendations derived from test clusters with bluestore OSDs that 
read 16GB base line + 1GB per HDD + 4GB per SSD OSD, probably from the times 
when bluestore memory use had a base line plus a stress-dependent component. I 
would actually consider this already quite something. I understand that for 
high-performance requirements one adds RAM etc. to speed things up.

For a mostly cold data store with a thin layer of warm/hot data, however, this 
is quite a lot compared with what standard disk controllers can do with a cheap 
CPU, 4GB of RAM and 16 drives connected. Essentially, ceph is turning a server 
into a disk controller and it should be possible to run a configuration that 
does not require much more than an ordinary hardware controller per disk 
delivering reasonable performance. I'm thinking along the lines of 25MB/s 
throughput and maybe 10 IOPS per NL-SAS HDD OSD to the user side (simple 
collocated deployment, EC pool). This ought to be possible in a way similar to 
a RAID controller with comparably moderate hardware requirements.

Good aggregated performance then comes from scale and because the layer of hot 
data per disk is only a few GB per drive (a full re-write of just the hot data 
is only a few minutes). I thought this was the idea of ceph. Instead of trying 
to accommodate high-performance wishes for ridiculously small ceph clusters (I 
do see these "I have 3 servers with 3 disks each, why is it so slow" kind of 
complaints, which I would simply ignore), one talks about scale-out systems 
with thousands of OSDs. Something like 20 hosts serving 200 disks each would 
count as a small cluster. If the warm/hot data is only 1% or even less, such a 
system will be quite satisfying.

For low-cost scale-out we have ceph. For performance, we have technologies like 
Lustre (which by the way has much more moderate minimum hardware requirements).

For anything that requires higher performance one can then start using tiering, 
WAL/DB devices, SSD only pools, lots of RAM, whatever. However, there should be 
a stable, well-tested and low-demanding base line config for a cold store use 
case with hardware requirements similar to a NAS box per storage unit (one 
server+JBODs). I am starting to miss support for the latter. 2 or even 4GB of RAM 
and 1 core-GHz per HDD is really a lot compared with such systems.

Please don't take this as the start of a long discussion. It's just a wish from my 
side to have low-demanding configs available that scale easily and are easy to 
administrate at an overall low cost.

I will look into memory profiling of some OSDs. It doesn't look like a 
performance killer.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 14 July 2020 17:29
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSD memory leak?

>>  In the past, the minimum recommendation was 1GB RAM per HDD blue store OSD.

There was a rule of thumb of 1GB RAM *per TB* of HDD Filestore OSD, perhaps you 
were influenced by that?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-07-14 Thread Frank Schilder
Dear Mark,

thanks for the info. I forgot a few answers:

THPs are disabled (set to "never"). The kernel almost certainly doesn't 
reclaim because there is not enough pressure yet.
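
For reference, checked via the usual sysfs knob (the bracketed value is the active
one):

# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]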

We have 268 OSDs. I would not consider this much; we plan to triple that 
soonish. In the past, the minimum recommendation was 1GB RAM per HDD bluestore 
OSD. I'm actually not really happy that this has been quadrupled for not very 
convincing reasons. Compared with other storage systems, the increase in 
minimum requirements really starts making ceph expensive.

We have set the OSDs to use the bitmap allocator. Is the fact that we get 
tcmalloc stats a contradiction to this?

I did not consider upgrading from mimic, because a lot of people report 
stability issues that might be caused by a regression in the message queueing. 
There was a longer e-mail about clusters from nautilus and higher collapsing 
under trivial amounts of rebalancing, pool deletion and other admin tasks. 
Before I consider upgrading, I want to test this on a lab cluster we plan to 
set up soon.

I will look at the memory profiling. If one can use this on a production 
system, I will give it a go.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 14 July 2020 14:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,


These might help:


https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/

https://gperftools.github.io/gperftools/heapprofile.html

https://gperftools.github.io/gperftools/heap_checker.html
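
In practice the profiling workflow from the first link boils down to roughly 
the following; a sketch only, osd.256 and the paths are examples, and profiling 
adds some CPU overhead on a busy OSD:

# start the tcmalloc heap profiler inside a running OSD
ceph tell osd.256 heap start_profiler
# let the workload run for a while, then write a dump and stop profiling
ceph tell osd.256 heap dump
ceph tell osd.256 heap stop_profiler
# dumps typically land in the OSD log directory as osd.256.profile.NNNN.heap
# and can be analysed with google-pprof (or pprof, depending on the distro)
# against the ceph-osd binary:
google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.256.profile.0001.heap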


Regarding the mempools, they don't track all of the memory usage in
Ceph, only things that were allocated using mempools.  There are many
other things (rocksdb block cache for example) that don't use them.
It's only giving you a partial picture of memory usage.  In your example
below, that byte value from the thread cache freelist looks very wrong.
Ignoring that for a moment though, there's a ton of memory that's been
unmapped and released to the OS, but hasn't been reclaimed by the
kernel.  That's either because the kernel doesn't have enough memory
pressure to bother reclaiming it, or because it's all fragmented chunks
of a huge page that the kernel can't fully reclaim.  That tells me you
should definitely be looking at the transparent huge page (THP)
configuration on your nodes.  Looking back at batrick's PR that disables
THP for Ceph, it looks like we only backported it to nautilus but not
mimic.  On that topic, have you considered upgrading to Nautilus?
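
For reference, checking and flipping THP on a node at runtime is just the 
following (a sketch; make it persistent via the kernel command line or a 
tuned/systemd mechanism of your choice):

# the bracketed entry is the active mode, e.g. "always madvise [never]"
cat /sys/kernel/mm/transparent_hugepage/enabled
# switch THP off at runtime (prevents new THP allocations)
echo never > /sys/kernel/mm/transparent_hugepage/enabled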


Mark


On 7/14/20 2:56 AM, Frank Schilder wrote:
> Dear Mark,
>
> thanks for the quick answer. I would try the memory profiler if I could find 
> any documentation on it. In fact, I just guessed the "heap stats" command and 
> have a hard time finding anything on the OSD daemon commands. Could you 
> possibly point me to something? Also how to interpret the mempools? Is it 
> correct to say that out of the memory_target only the mempools total is 
> actually used and the remaining memory is lost due to leaks?
>
> For example, for OSD 256 I get the stats below after just 2 months uptime. Am 
> I looking at a 5.5GB memory leak here?
>
> # ceph config get osd.256 osd_memory_target
> 8589934592
>
> # ceph daemon osd.256 heap stats
> osd.256 tcmalloc heap stats:
> MALLOC: 7216067616 ( 6881.8 MiB) Bytes in use by application
> MALLOC: +   229376 (0.2 MiB) Bytes in page heap freelist
> MALLOC: +   1222913888 ( 1166.3 MiB) Bytes in central cache freelist
> MALLOC: +   278016 (0.3 MiB) Bytes in transfer cache freelist
> MALLOC: + 18446744073692937856 (17592186044400.2 MiB) Bytes in thread cache 
> freelists
> MALLOC: + 52166656 (   49.8 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =   8475041792 ( 8082.4 MiB) Actual memory used (physical + swap)
> MALLOC: +   2010464256 ( 1917.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =  10485506048 ( 9999.8 MiB) Virtual address space used
> MALLOC:
> MALLOC: 765182  Spans in use
> MALLOC: 48  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
>
> # ceph daemon osd.256 dump_mempools
> {
>  "mempool": {
>  "by_pool": {
>  "bloom_filter": {
>  "items": 0,
>  "bytes": 0
>  },
>  "bluestore_alloc": {
>  "items": 2300682,
>  "bytes": 18405456
>  },
>  

[ceph-users] Re: mon_osd_down_out_subtree_limit not working?

2020-07-14 Thread Frank Schilder
Hi Dan,

thanks for your reply. There is still a problem.

Firstly, I did indeed forget to restart the mon even though I looked at the 
help for mon_osd_down_out_subtree_limit and it says it requires a restart. 
Stupid me. Well, now I did a restart and it still doesn't work. Here is the 
situation:

# ceph config dump | grep subtree
  mon  advanced  mon_osd_down_out_subtree_limit  host        *
  mon  advanced  mon_osd_reporter_subtree_level  datacenter

# ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
host

# ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
{
"mon_osd_down_out_subtree_limit": "rack"
}

# ceph config show mon.ceph-01 | grep subtree
mon_osd_down_out_subtree_limit rack       default    mon
mon_osd_reporter_subtree_level datacenter mon

The default overrides the mon config database setting. What is going on here? I 
restarted all 3 monitors.
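
A possible fallback (untested here) while this is unresolved: pin the option in 
ceph.conf on the mon hosts instead of the config database and restart the mons. 
A sketch, assuming the default config path and systemd mon units named after 
the short hostname:

# on each mon host
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
    mon_osd_down_out_subtree_limit = host
EOF
systemctl restart ceph-mon@$(hostname -s)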

Best regards and thanks for your help,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 14 July 2020 10:53:13
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

mon_osd_down_out_subtree_limit has been working well here. Did you
restart the mon's after making that config change?
Can you do this just to make sure it took effect?

   ceph daemon mon.`hostname -s` config get mon_osd_down_out_subtree_limit

-- dan

On Tue, Jul 14, 2020 at 8:57 AM Frank Schilder  wrote:
>
> Yes. After the time-out of 600 secs the OSDs got marked down, all PGs got 
> remapped and recovery/rebalancing started as usual. In the past, I did 
> service on servers with the flag noout set and would expect that 
> mon_osd_down_out_subtree_limit=host has the same effect when shutting down an 
> entire host. Unfortunately, in my case these two settings behave differently.
>
> If I understand the documentation correctly, the OSDs should not get marked 
> out automatically.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Anthony D'Atri 
> Sent: 14 July 2020 04:32:05
> To: Frank Schilder
> Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working?
>
> Did it start rebalancing?
>
> > On Jul 13, 2020, at 4:29 AM, Frank Schilder  wrote:
> >
> > if I shut down all OSDs on this host, these OSDs should not be marked out 
> > automatically after mon_osd_down_out_interval(=600) seconds. I did a test 
> > today and, unfortunately, the OSDs do get marked as out. Ceph status was 
> > showing 1 host down as expected.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-07-14 Thread Frank Schilder
Dear Mark,

thanks for the quick answer. I would try the memory profiler if I could find 
any documentation on it. In fact, I just guessed the "heap stats" command and 
have a hard time finding anything on the OSD daemon commands. Could you 
possibly point me to something? Also how to interpret the mempools? Is it 
correct to say that out of the memory_target only the mempools total is 
actually used and the remaining memory is lost due to leaks?

For example, for OSD 256 I get the stats below after just 2 months uptime. Am I 
looking at a 5.5GB memory leak here?

# ceph config get osd.256 osd_memory_target
8589934592

# ceph daemon osd.256 heap stats
osd.256 tcmalloc heap stats:
MALLOC: 7216067616 ( 6881.8 MiB) Bytes in use by application
MALLOC: +   229376 (0.2 MiB) Bytes in page heap freelist
MALLOC: +   1222913888 ( 1166.3 MiB) Bytes in central cache freelist
MALLOC: +   278016 (0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 18446744073692937856 (17592186044400.2 MiB) Bytes in thread cache 
freelists
MALLOC: + 52166656 (   49.8 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =   8475041792 ( 8082.4 MiB) Actual memory used (physical + swap)
MALLOC: +   2010464256 ( 1917.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  10485506048 ( 9999.8 MiB) Virtual address space used
MALLOC:
MALLOC: 765182  Spans in use
MALLOC: 48  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


# ceph daemon osd.256 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 2300682,
"bytes": 18405456
},
"bluestore_cache_data": {
"items": 52390,
"bytes": 306843648
},
"bluestore_cache_onode": {
"items": 256153,
"bytes": 145494904
},
"bluestore_cache_other": {
"items": 92199353,
"bytes": 656620069
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 4,
"bytes": 2752
},
"bluestore_writing_deferred": {
"items": 122,
"bytes": 1864924
},
"bluestore_writing": {
"items": 3673,
"bytes": 18440192
},
"bluefs": {
"items": 11867,
"bytes": 220504
},
"buffer_anon": {
"items": 353734,
"bytes": 1180837372
},
"buffer_meta": {
"items": 91646,
"bytes": 5865344
},
"osd": {
"items": 134,
"bytes": 1557616
},
"osd_mapbl": {
"items": 84,
"bytes": 8479562
},
"osd_pglog": {
"items": 487004,
"bytes": 166094788
},
"osdmap": {
"items": 117697,
"bytes": 2080280
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
    },
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 95874543,
"bytes": 2512807411
}
}
}

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 13 July 2020 15:39:50
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSD memory leak?

[ceph-users] Re: mon_osd_down_out_subtree_limit not working?

2020-07-14 Thread Frank Schilder
Yes. After the time-out of 600 secs the OSDs got marked down, all PGs got 
remapped and recovery/rebalancing started as usual. In the past, I did service 
on servers with the flag noout set and would expect that 
mon_osd_down_out_subtree_limit=host has the same effect when shutting down an 
entire host. Unfortunately, in my case these two settings behave differently.

If I understand the documentation correctly, the OSDs should not get marked out 
automatically.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 14 July 2020 04:32:05
To: Frank Schilder
Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working?

Did it start rebalancing?

> On Jul 13, 2020, at 4:29 AM, Frank Schilder  wrote:
>
> if I shut down all OSDs on this host, these OSDs should not be marked out 
> automatically after mon_osd_down_out_interval(=600) seconds. I did a test 
> today and, unfortunately, the OSDs do get marked as out. Ceph status was 
> showing 1 host down as expected.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Poor Windows performance on ceph RBD.

2020-07-13 Thread Frank Schilder
> If I may ask, which version of the virtio drivers do you use?

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/latest-virtio/virtio-win.iso

Looks like virtio-win-0.1.185.*

> And do you use caching on libvirt driver level?

In the ONE interface, we use

  DISK = [ driver = "raw" , cache = "none"]

which translates to a libvirt <driver> element with cache='none' in the domain 
XML (the snippet itself was stripped by the list archive). We have no qemu 
settings in the ceph.conf. Looks like caching is disabled. Not sure if this is 
the recommended way though, or why caching is disabled by default.
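
For reference, the cache mode a running guest actually got can be checked on 
the hypervisor; a sketch, the domain name one-1234 is made up:

# dump the live domain XML and look at the RBD disk's <driver> element;
# with the setting above it should contain cache='none'
virsh dumpxml one-1234 | grep -A3 '<disk'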

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: André Gemünd 
Sent: 13 July 2020 11:18
To: Frank Schilder
Subject: Re: [ceph-users] Re: Poor Windows performance on ceph RBD.

If I may ask, which version of the virtio drivers do you use?

And do you use caching on libvirt driver level?

Greetings
André

- Am 13. Jul 2020 um 10:43 schrieb Frank Schilder fr...@dtu.dk:

>> > To anyone who is following this thread, we found a possible explanation for
>> > (some of) our observations.
>
>> If someone is following this, they probably want the possible
>> explanation and not the knowledge of you having the possible
>> explanation.
>
>> So you are saying if you do eg. a core installation (without gui) of
>> 2016/2019 disable all services. The fio test results are significantly
>> different to eg. a centos 7 vm doing the same fio test? Are you sure
>> this is not related to other processes writing to disk?
>
> Right, it's not an explanation but rather a further observation. We don't 
> really have an explanation yet.
>
> It's an identical installation of both server versions, same services 
> configured. Our operators are not really into debugging Windows, that's why 
> we were asking here. Their hypothesis is that the VD driver for accessing RBD 
> images has problems with Windows servers newer than 2016. I'm not a Windows 
> guy, so can't really comment on this.
>
> The test we do is a simple copy-test of a single 10g file and we monitor the
> transfer speed. This info was cut out of this e-mail, the original report for
> reference is:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/ANHJQZLJT474B457VVM4ZZZ6HBXW4OPO/
> .
>
> We are very sure that it is not related to other processes writing to disk, we
> monitor that too. There is also no competition on the RBD pool at the time of
> testing.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Marc Roos 
> Sent: 13 July 2020 10:24
> To: ceph-users; Frank Schilder
> Subject: RE: [ceph-users] Re: Poor Windows performance on ceph RBD.
>
>>> To anyone who is following this thread, we found a possible
> explanation for
>>> (some of) our observations.
>
> If someone is following this, they probably want the possible
> explanation and not the knowledge of you having the possible
> explanation.
>
> So you are saying if you do eg. a core installation (without gui) of
> 2016/2019 disable all services. The fio test results are significantly
> different to eg. a centos 7 vm doing the same fio test? Are you sure
> this is not related to other processes writing to disk?
>
>
>
> -Original Message-
> From: Frank Schilder [mailto:fr...@dtu.dk]
> Sent: maandag 13 juli 2020 9:28
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Poor Windows performance on ceph RBD.
>
> To anyone who is following this thread, we found a possible explanation
> for (some of) our observations.
>
> We are running Windows servers version 2016 and 2019 as storage servers
> exporting data on an rbd image/disk. We recently found that Windows
> server 2016 runs fine. It is still not as fast as Linux + SAMBA share on
> an rbd image (ca. 50%), but runs with a reasonable sustained bandwidth.
> With Windows server 2019, however, we observe near-complete stall of
> file transfers and time-outs using standard copy tools (robocopy). We
> don't have an explanation yet and are downgrading Windows servers where
> possible.
>
> If anyone has a hint what we can do, please let us know.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Dipl.-Inf. André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemu...@scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD memory leak?

2020-07-13 Thread Frank Schilder
MALLOC:   1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: +  3727360 (3.6 MiB) Bytes in page heap freelist
MALLOC: + 25493688 (   24.3 MiB) Bytes in central cache freelist
MALLOC: + 17101824 (   16.3 MiB) Bytes in transfer cache freelist
MALLOC: + 20301904 (   19.4 MiB) Bytes in thread cache freelists
MALLOC: +  5242880 (5.0 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =   1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: + 20488192 (   19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =   1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC:  54160  Spans in use
MALLOC: 33  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


Am I looking at a memory leak here or are these heap stats expected?

I don't mind the swap usage, it doesn't have an impact. I'm just wondering if I 
need to restart OSDs regularly. The "leakage" above occurred within only 2 
months.
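
A side note in case the growth is mostly memory tcmalloc has freed but not yet 
returned to the kernel: it can be handed back without restarting the OSD. A 
sketch, osd.256 is just an example id:

# ask tcmalloc inside the OSD to release free pages back to the OS,
# then compare the heap stats before and after
ceph tell osd.256 heap release
ceph tell osd.256 heap stats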

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

